Pandas python: How to analyze data with pandas in Python

Pandas is a Python library for working with tabular data, such as the data found in Excel or CSV spreadsheets. It is one of the most popular and widely used Python libraries, and is especially useful for anyone working in data analysis. With pandas, you can manage, manipulate and analyze data quickly and easily, which makes the data analysis process more efficient.

In this article, you’ll learn what you need to get started with pandas, from loading data to creating visualizations. Let’s start!

Syntax

The pandas syntax is based on DataFrames, which are Python objects that allow you to manage, manipulate and analyze data. DataFrames are similar to Excel spreadsheets, with the difference that the data is stored in Python and can be manipulated more flexibly.

To create a DataFrame in pandas, you first need to load the data into a Python object. Here’s an example of how to load a CSV file into a DataFrame:

import pandas as pd

# Load a CSV file into a DataFrame
df = pd.read_csv("data.csv")

Once the DataFrame is created, you can use a number of tools and methods to manage and analyze the data. For example, you can use the head() method to display the first 5 rows of the DataFrame (pass a number, as in head(10), to change how many rows are shown), or the describe() method to display summary statistics for its numeric columns.

# Display the first 5 rows of the DataFrame
df.head()

# Display DataFrame statistics
df.describe()

In addition to these basic methods, pandas offers many other tools for managing and analyzing data, such as filters, aggregations and visualizations. Here is an example of how to apply a filter to the DataFrame:

# Keep only the rows where the value in the "age" column is 30 or more
df[df["age"] >= 30]

These are just a few examples of pandas syntax. To learn more about the available tools and methods, keep reading and consult the official pandas documentation.
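
To try these methods without a CSV file on disk, a small DataFrame can be built in memory. The sketch below is only an illustration: the column names `name` and `age` and all the values are made up for this example.

```python
import pandas as pd

# Hypothetical sample data, invented for this illustration
df = pd.DataFrame({
    "name": ["Ana", "Bruno", "Carla", "Davi"],
    "age": [25, 32, 41, 29],
})

# head(n) returns the first n rows (5 when n is omitted)
first_two = df.head(2)

# describe() summarizes the numeric columns (count, mean, std, min, ...)
stats = df.describe()

# A boolean expression inside [] keeps only the rows where it is True
age_30_plus = df[df["age"] >= 30]

print(first_two)
print(stats)
print(age_30_plus)
```

Each of these calls returns a new object and leaves `df` itself unchanged.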

Main features of Pandas in Python

Data management is one of the main features of the pandas library. It lets you work with data in a table format, which makes the data easier to analyze and manipulate. Some of pandas' main functions for data management include:

  • Creating DataFrames: It is possible to create a DataFrame from an array or a list of lists. The DataFrame can contain multiple columns and rows, and each cell can contain a numeric or categorical value.
  • Reading and writing files: Pandas supports reading and writing several file formats, including CSV, Excel, JSON, SQL and more. This makes it possible to import data from, and export data to, other data sources.
  • Filters: It is possible to filter the data in a DataFrame using a boolean expression. This allows you to select only the rows or columns that interest you.
  • Aggregation: Pandas offers several functions to aggregate data in a DataFrame, such as mean, standard deviation, count and more. These help you understand the data better and identify trends.
  • Resampling: Pandas supports resampling time series to a different frequency with resample(). Separately, the sample() method draws random rows, which can be used to split data into samples for training and testing machine learning models.
  • Data manipulation: Pandas provides several functions to manipulate data in a DataFrame, such as removing rows or columns, renaming columns, sorting and more.

These are just some of pandas’ features for data management. The library is quite powerful and offers many other features to work with data efficiently and intuitively.
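
Several of the features listed above can be sketched together in a few lines. This is a minimal illustration, assuming made-up column names (`product`, `price`) and using an in-memory text buffer in place of a real file.

```python
import io
import pandas as pd

# Create a DataFrame from a list of lists (hypothetical data)
rows = [["pen", 1.5], ["book", 12.0], ["lamp", 30.0]]
df = pd.DataFrame(rows, columns=["product", "price"])

# Write to CSV text and read it back (a StringIO buffer stands in for a file)
buffer = io.StringIO()
df.to_csv(buffer, index=False)
buffer.seek(0)
df_loaded = pd.read_csv(buffer)

# Filter with a boolean expression
expensive = df_loaded[df_loaded["price"] > 10]

# Aggregate a column
mean_price = df_loaded["price"].mean()

# Manipulate: rename a column, sort by it, then drop it
renamed = df_loaded.rename(columns={"price": "unit_price"})
sorted_desc = renamed.sort_values("unit_price", ascending=False)
names_only = sorted_desc.drop(columns=["unit_price"])

print(expensive)
print(mean_price)
print(names_only)
```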

Data Management in Pandas

Pandas lets you work with data in a table format. It is popular for its ease of use and its ability to handle various types of data, including numerical and categorical data. To work with pandas, it is important to understand how to address the cells, rows and columns of a DataFrame.

For example, if you have a DataFrame with 3 rows and 4 columns, each cell is identified by a pair of positions, the row first and the column second, both counted from zero. The first cell (in the upper left corner) is at position (0, 0), the cell to its right is at (0, 1), and so on. The cell in row 2, column 3 is therefore accessed with positions (1, 2).
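
In pandas, a cell is addressed by a row position and a column position, both counted from zero. Here is a short sketch of that, assuming a made-up DataFrame with 3 rows and 4 columns (the column names `A` to `D` are placeholders):

```python
import pandas as pd

# Hypothetical 3x4 DataFrame filled with the numbers 0..11, row by row
df = pd.DataFrame(
    [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9, 10, 11]],
    columns=["A", "B", "C", "D"],
)

# iloc takes [row position, column position], both starting at 0
top_left = df.iloc[0, 0]     # first row, first column
row2_col3 = df.iloc[1, 2]    # second row, third column

print(df.shape)              # (3, 4)
print(top_left, row2_col3)   # 0 6
```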

Working with Cells in Pandas in Python

Cells are the individual elements of a DataFrame. Each cell contains a single value and can be accessed by its position in the table. That position is specified by two indexes, a row index and a column index, which are zero or positive integers when you access the cell with iloc.

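A single cell cannot be deleted from a DataFrame, because every row must have a value in every column. What the drop() method removes is an entire row or column: you pass the label of the row (axis=0) or of the column (axis=1) that you want to remove. For example:

import pandas as pd

# create a new DataFrame with 3 rows and 3 columns
df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6], 'C': [7, 8, 9]})

# delete the row with label 0 (drop returns a new DataFrame; df is unchanged)
df_dropped = df.drop(0, axis=0)

# print the resulting DataFrame
print(df_dropped)

Output:

   A  B  C
1  2  5  8
2  3  6  9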

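To change the value of a cell, access the cell by its row and column position with iloc and assign a new value to it. For example:

# start again from the full 3x3 DataFrame
df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6], 'C': [7, 8, 9]})

# change the cell value in row 2, column 1 (positions 1 and 0) to 10
df.iloc[1, 0] = 10

# print the resulting DataFrame
print(df)

Output:

    A  B  C
0   1  4  7
1  10  5  8
2   3  6  9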

To get the value of a cell, just access the cell by its index. For example:

# get the cell value in row 2, column 1
value = df.iloc[1, 0]

# print the value
print(value)

Output:

10
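
Besides position-based access with iloc, pandas also offers label-based access. The sketch below is an illustration only, on a fresh copy of the same three-column DataFrame. It reads a cell with `loc` and `at`, and shows one way to blank a single cell by assigning the missing value NaN, since a table cannot have a hole in it.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6], 'C': [7, 8, 9]})

# loc and at address a cell by row label and column label
value_loc = df.loc[1, 'B']   # row labelled 1, column 'B'
value_at = df.at[1, 'B']     # at is meant for reading or writing a single cell

# "Emptying" one cell: cast the column to float first so it can hold NaN
df['C'] = df['C'].astype(float)
df.loc[0, 'C'] = np.nan

print(value_loc, value_at)   # 5 5
print(df)
```

With the default row labels 0, 1, 2, labels and positions coincide; they differ as soon as rows are dropped or the index is changed.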

Working with Columns in Pandas in Python

Columns are one of the basic elements of a table in pandas. Each column represents a variable or attribute and holds the values of that variable for every row; the values do not have to be unique. Columns are accessed through the DataFrame object, which is a table of data consisting of rows and columns.

To work with columns in pandas, you can use several resources. Some of the most common features include:

  • Get the column names: you can use the columns attribute.
    Example:
print(df.columns)
  • Get the number of columns: you can use the shape attribute and access its second element.
    Example:
print(df.shape[1])
  • Get the first value of a column: you can use the iloc indexer with square brackets, giving the row position first and the column position second.
    Example:
print(df.iloc[0, 0])
  • Get all values from a column: you can select the column by name with square brackets.
    Example:
print(df["col1"])
  • Get a list of all column names: you can call tolist() on the columns attribute.
    Example:
print(df.columns.tolist())
  • Get a list with the unique values of a column: you can use the unique() method.
    Example:
print(df["col1"].unique())
  • Get the number of rows and columns: you can use the shape attribute.
    Example:
print(df.shape)
  • Use a column as the index: you can use the set_index() method.
    Example:
df.set_index("col1", inplace=True)
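
The column operations above can be tried on a small DataFrame. This is a sketch with made-up data; `col1` and `col2` are placeholder column names.

```python
import pandas as pd

# Hypothetical data with a repeated value in col1
df = pd.DataFrame({"col1": ["x", "y", "x"], "col2": [10, 20, 30]})

column_names = df.columns.tolist()      # list of column names
n_columns = df.shape[1]                 # number of columns
first_value = df.iloc[0, 0]             # first row of the first column
col1_values = df["col1"]                # all values of one column (a Series)
unique_values = df["col1"].unique()     # distinct values, in order of appearance
position = df.columns.get_loc("col2")   # integer position of a column

# set_index returns a new DataFrame unless inplace=True is passed
indexed = df.set_index("col1")

print(column_names, n_columns, first_value, position)
print(list(unique_values))
print(indexed)
```

Note that `get_loc` is what returns the position of a column, while `set_index` turns a column into the row index.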

Working with Rows in Pandas in Python

Rows are the other basic element of a table in pandas. Each row represents an individual or observation and contains one value for each variable. Like columns, rows are accessed through the DataFrame object.

To work with rows in pandas, you can use several resources. Some of the most common include:

  • Get the number of rows: you can use the shape attribute and access its first element.
    Example:
print(df.shape[0])
  • Get a row: you can use the iloc indexer with the row position.
    Example:
print(df.iloc[0])
  • Get the values of every row in one column: you can select the column by name with square brackets (here "linha1" is a column name).
    Example:
print(df["linha1"])
  • Get a list of all row indexes: you can call tolist() on the index attribute.
    Example:
print(df.index.tolist())
  • Get a list with the unique values of a row: you can select the row with iloc and call the unique() method.
    Example:
print(df.iloc[0].unique())
  • Set the row index from a column: you can use the set_index() method.
    Example:
df.set_index("linha1", inplace=True)
  • Get the first x rows: you can use the head() method.
    Example:
print(df.head(x))
  • Get the last x rows: you can use the tail() method.
    Example:
print(df.tail(x))
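
The row operations above can be tried on a small DataFrame as well. This is a sketch with made-up data; the row labels `r1` to `r4` and the column names are placeholders chosen for the illustration.

```python
import pandas as pd

# Hypothetical data with custom row labels
df = pd.DataFrame(
    {"col1": [1, 2, 3, 4], "col2": [5, 6, 7, 8]},
    index=["r1", "r2", "r3", "r4"],
)

n_rows = df.shape[0]            # number of rows
first_row = df.iloc[0]          # row by position (a Series)
row_r3 = df.loc["r3"]           # row by label
row_labels = df.index.tolist()  # list of row labels
first_two = df.head(2)          # first 2 rows
last_two = df.tail(2)           # last 2 rows

print(n_rows, row_labels)
print(first_row.tolist(), row_r3.tolist())
print(first_two)
print(last_two)
```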

Applying Classifications with Pandas in Python

Classifying is a way to group the data in a DataFrame based on specific criteria. Pandas offers several tools for this, such as:

  • groupby(): The groupby() method allows you to group the rows of a DataFrame based on one or more columns.
    Example:
group = df.groupby("idade")
  • agg(): The agg() method allows you to calculate aggregated values for each group created by the groupby() method.
    Example:
group.agg({"salario": "mean"})
  • pivot_table(): The pd.pivot_table() function creates a pivot table, which summarizes one column (by its mean, unless told otherwise) for each combination of the values in two other columns.
    Example:
pd.pivot_table(df, values="salario", index="idade", columns="gênero")
  • crosstab(): The pd.crosstab() function creates a cross-tabulation, which counts how many rows fall into each combination of the values in two columns.
    Example:
pd.crosstab(df["idade"], df["gênero"])
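
Here is a runnable sketch of these four tools, assuming a small made-up DataFrame. The column names follow the snippets above (`idade` is age, `salario` is salary, `gênero` is gender); the values are invented for the illustration.

```python
import pandas as pd

# Hypothetical data: idade = age, salario = salary, gênero = gender
df = pd.DataFrame({
    "idade": [25, 25, 30, 30],
    "gênero": ["F", "M", "F", "F"],
    "salario": [3000, 4000, 5000, 7000],
})

# Group rows by age, then compute the mean salary of each group
group = df.groupby("idade")
mean_by_age = group.agg({"salario": "mean"})

# Pivot table: mean salary for each age (rows) and gender (columns)
pivot = pd.pivot_table(df, values="salario", index="idade", columns="gênero")

# Crosstab: how many rows fall into each age/gender combination
counts = pd.crosstab(df["idade"], df["gênero"])

print(mean_by_age)
print(pivot)
print(counts)
```

In this data there is no 30-year-old with gender "M", so the pivot table holds NaN for that combination while the crosstab holds 0.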

Data aggregation in Pandas

Data aggregation is an important process in data analysis, which consists of combining many values, such as all the values in a column, into a single summary value. In Python, we can use the Pandas library to perform data aggregation operations.

The Pandas library offers several functions for aggregating data. Some of these functions are:

  • sum(): Sums the values of all cells in a column.
  • mean(): Calculates the average of the values of all cells in a column.
  • median(): Returns the middle value of a column once its values are sorted; when the number of values is even, it is the average of the two middle values.
  • min(): Returns the smallest value in a column.
  • max(): Returns the largest value in a column.
  • count(): Returns the number of non-missing values in a column.

In addition to these functions, the Pandas library also offers groupby(), which applies an aggregation to each group of rows, and merge() and join(), which combine DataFrames rather than aggregate them.
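
The aggregation functions listed above can be checked on a single column. The sketch below assumes a made-up `score` column that contains one missing value, to show how missing data affects the results.

```python
import numpy as np
import pandas as pd

# Hypothetical column with one missing value
df = pd.DataFrame({"score": [4.0, 1.0, 7.0, np.nan, 8.0]})
col = df["score"]

print(col.sum())     # 20.0 (missing values are skipped by default)
print(col.mean())    # 5.0
print(col.median())  # 5.5, the average of the two middle values 4.0 and 7.0
print(col.min())     # 1.0
print(col.max())     # 8.0
print(col.count())   # 4, only non-missing values are counted
print(len(col))      # 5, the number of rows including the missing one
```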

Examples using Pandas in conjunction with other functions in python

Let’s give some examples of how to use other functions with the Pandas package in Python:

  1. Using  append():

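The append() method was used to add rows to an existing DataFrame. It was deprecated in pandas 1.4 and removed in pandas 2.0, so the example below uses pd.concat(), which does the same job in current versions.

import pandas as pd

# Creating a DataFrame
df = pd.DataFrame({'A': [1, 2, 3], 'B': ['a', 'b', 'c']})

# Adding a row of data to the DataFrame
# (before pandas 2.0: df = df.append({'A': 4, 'B': 'd'}, ignore_index=True))
new_row = pd.DataFrame([{'A': 4, 'B': 'd'}])
df = pd.concat([df, new_row], ignore_index=True)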

print(df)

Output:

   A  B
0  1  a
1  2  b
2  3  c
3  4  d
  2. Using null:

The value NaN (null value) is used to indicate that data is not available.

import numpy as np
import pandas as pd

# Creating a DataFrame
df = pd.DataFrame({'A': [1, 2, 3], 'B': ['a', 'b', 'c']})

# Adding a row in which every value is NaN
df.loc[3] = np.nan

print(df)

Output:

     A    B
0  1.0    a
1  2.0    b
2  3.0    c
3  NaN  NaN

Column A is now shown as 1.0, 2.0 and 3.0 because an integer column is converted to floating point when it has to hold NaN.
  3. Using two functions together:

The apply() method is used to apply a function to each value in a column. Inside that function, an if/else statement defines the value to be returned based on a condition.

import pandas as pd
import numpy as np

# Creating a DataFrame
df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})

# Defining a function to check whether a number is even or odd
def is_even_or_odd(x):
    if x % 2 == 0:
        return 1
    else:
        return 0

# Applying the function to a column and returning the result
df['C'] = df['A'].apply(is_even_or_odd)

print(df)

Output:

   A  B  C
0  1  4  0
1  2  5  1
2  3  6  0
  4. Using range():

The built-in range() function is used to generate a sequence of numbers.

import pandas as pd

# Creating a DataFrame
df = pd.DataFrame({'A': [1, 2, 3]})

# Generating a sequence of numbers with the range() function
seq = range(len(df))

# Creating a new column with the sequence values
df['B'] = seq

print(df)

Output:

   A  B
0  1  0
1  2  1
2  3  2

In this example, the sequence of numbers is generated with the range() function, based on the number of rows (len(df)) in the DataFrame. We then add the sequence as a new column of the DataFrame.
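
For a simple condition like the even/odd check in the apply() example above, a vectorized alternative is NumPy's `np.where`. This is a sketch of that alternative rather than part of the original example; it computes the same column `C` on the same data.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})

# np.where(condition, value_if_true, value_if_false) evaluates the whole column at once
df['C'] = np.where(df['A'] % 2 == 0, 1, 0)

print(df)
```

On large DataFrames this form is usually faster than apply(), because the condition is evaluated on the whole column instead of one value at a time in Python.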

Working with tables and charts in Pandas

Pandas is a powerful Python library for analyzing and manipulating data. As such, it provides several tools for creating tables and graphs to visualize your data.

To create a table with Pandas, just load a CSV file with read_csv(), or pass a Python list to pd.DataFrame() to convert it into a DataFrame (the to_frame() method does the same for a pandas Series). The DataFrame is a data structure similar to an Excel spreadsheet, which we can manipulate with several operations.

Here’s an example of how to load a CSV file and create a table:

import pandas as pd

# Loading the CSV file
df = pd.read_csv('file.csv')

# Display the table
print(df)

Pandas also provides several options for cleaning and formatting the values in a DataFrame, such as fillna() for filling in missing values, round() for rounding values and str.upper() for converting a text column to uppercase, among others.
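
A short sketch of fillna(), round() and str.upper() follows, assuming a made-up DataFrame whose column names (`region`, `price`) and values are invented for the illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing price
df = pd.DataFrame({
    "region": ["north", "south", "east"],
    "price": [10.456, np.nan, 7.891],
})

df["price"] = df["price"].fillna(0)        # replace missing values with 0
df["price"] = df["price"].round(1)         # round to 1 decimal place
df["region"] = df["region"].str.upper()    # convert text to uppercase

print(df)
```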

To create graphs, Pandas offers the plot() method, which relies on the Matplotlib library and can generate different types of graphs, such as line charts, bar charts and scatter plots, among others. Here’s an example of how to create a bar chart:

import matplotlib.pyplot as plt
import pandas as pd

# Loading the CSV file
df = pd.read_csv('file.csv')

# Creating a bar chart
df.plot(kind='bar')

# displaying the graph
plt.show()

Pandas also allows you to customize the charts, such as changing the title, the axes labels, the size of the lines, among other options.
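
As a sketch of such customization, the example below sets a title and axis labels on the Matplotlib Axes object that plot() returns. The data, labels and figure size are made up for the illustration, and the non-interactive "Agg" backend is an assumption chosen so that the code runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so the sketch runs without a display
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical data standing in for the CSV file
df = pd.DataFrame({"sales": [3, 5, 2]}, index=["Jan", "Feb", "Mar"])

# plot() returns a Matplotlib Axes object that can be customized further
ax = df.plot(kind="bar", figsize=(6, 4), legend=False)
ax.set_title("Monthly sales")
ax.set_xlabel("Month")
ax.set_ylabel("Units sold")

print(ax.get_title(), ax.get_xlabel(), ax.get_ylabel())

# Release the figure once it is no longer needed
plt.close(ax.figure)
```

In an interactive session, the matplotlib.use("Agg") line can be dropped and plt.show() called before closing the figure.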

In addition, the graphs can be saved as images in several formats, such as PNG, JPG and PDF, using Matplotlib's savefig() method. Here is an example of how to save a chart as a PNG image:

import matplotlib.pyplot as plt
import pandas as pd

# Loading the CSV file
df = pd.read_csv('file.csv')

# Creating a bar chart (plot() returns a Matplotlib Axes object)
ax = df.plot(kind='bar')

# Saving the chart as a PNG image
ax.figure.savefig('bar_chart.png')

# Closing the chart to free memory
plt.close(ax.figure)

To present your results, visualizations are fundamental. They help convey the results in a clear and concise way, making the data easier to understand.

Schenia T

Data scientist, passionate about technology tools and games. Undergraduate student in Statistics at UFPB. Her hobby is binge-watching series, enjoying good music working or cooking, going to the movies and learning new things!
