How to Use Pandas for Exploratory Data Analysis (EDA)

Exploratory Data Analysis (EDA) is a crucial step in the data science workflow. It involves summarizing, visualizing, and understanding the main characteristics of a dataset. Pandas, a powerful Python library, is one of the most popular tools for EDA due to its easy-to-use data structures and a wide range of built-in functions. In this blog, we will explore how to use Pandas for EDA, covering fundamental concepts, usage methods, common practices, and best practices.

Table of Contents

  1. Fundamental Concepts
  2. Loading Data
  3. Data Inspection
  4. Data Cleaning
  5. Summarizing Data
  6. Grouping and Aggregation
  7. Visualization with Pandas
  8. Best Practices
  9. Conclusion

1. Fundamental Concepts

Pandas Data Structures

  • Series: A one-dimensional labeled array capable of holding any data type. It can be thought of as a single column in a table.
  • DataFrame: A two-dimensional labeled data structure with columns of potentially different types. It is similar to a spreadsheet or a SQL table.
import numpy as np
import pandas as pd

# Create a Series
s = pd.Series([1, 3, 5, np.nan, 6, 8])
print(s)

# Create a DataFrame
data = {
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [25, 30, 35]
}
df = pd.DataFrame(data)
print(df)

2. Loading Data

Pandas can load data from various sources such as CSV, Excel, SQL databases, etc.

# Load a CSV file
csv_data = pd.read_csv('example.csv')
print(csv_data.head())

# Load an Excel file
excel_data = pd.read_excel('example.xlsx')
print(excel_data.head())
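Pandas can also read directly from a SQL database with `read_sql`. The sketch below builds a small in-memory SQLite database just so the example is self-contained; the table name `sales` and its columns are made up for this illustration.

```python
import sqlite3
import pandas as pd

# Build a tiny in-memory SQLite database for the demonstration
# (the 'sales' table and its columns are hypothetical).
conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE sales (category TEXT, value REAL)')
conn.executemany('INSERT INTO sales VALUES (?, ?)',
                 [('A', 10.0), ('B', 20.0), ('A', 5.0)])
conn.commit()

# read_sql runs the query and returns the result as a DataFrame
sql_data = pd.read_sql('SELECT * FROM sales', conn)
print(sql_data.head())
conn.close()
```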

3. Data Inspection

Once the data is loaded, we need to inspect it to understand its structure and content.

# Check the shape of the DataFrame
print('Shape:', csv_data.shape)

# Get the column names
print('Columns:', csv_data.columns)

# View basic information about the DataFrame
csv_data.info()

# Get the summary statistics of numerical columns
print(csv_data.describe())

4. Data Cleaning

Data cleaning is an essential part of EDA. We need to handle missing values, duplicates, and incorrect data.

# Check for missing values
print(csv_data.isnull().sum())

# Drop rows with missing values
cleaned_data = csv_data.dropna()

# Drop duplicate rows
cleaned_data = cleaned_data.drop_duplicates()
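Dropping rows is not the only option, and it does not address the "incorrect data" mentioned above. One common alternative, sketched here on a hypothetical `Age` column, is to fill missing values with the median and to treat out-of-range values as missing as well:

```python
import pandas as pd
import numpy as np

# Hypothetical column with missing and implausible values
df = pd.DataFrame({'Age': [25, np.nan, -3, 200, 40]})

# Fill missing values with the median instead of dropping the row
df['Age'] = df['Age'].fillna(df['Age'].median())

# Mark out-of-range values as missing, then fill those the same way
df.loc[(df['Age'] < 0) | (df['Age'] > 120), 'Age'] = np.nan
df['Age'] = df['Age'].fillna(df['Age'].median())
print(df)
```

The right strategy depends on the dataset; filling with a constant, interpolating, or dropping rows are all reasonable choices in different situations.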

5. Summarizing Data

We can calculate various summary statistics for different columns.

# Calculate the mean of a column
mean_age = df['Age'].mean()
print('Mean Age:', mean_age)

# Calculate the median of a column
median_age = df['Age'].median()
print('Median Age:', median_age)
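Numerical statistics do not help with categorical columns. For those, `value_counts` is the usual summary; the `City` column below is a made-up example:

```python
import pandas as pd

# Hypothetical categorical column
df = pd.DataFrame({'City': ['NY', 'LA', 'NY', 'SF', 'NY']})

# Count the occurrences of each category
counts = df['City'].value_counts()
print(counts)

# Relative frequencies instead of raw counts
print(df['City'].value_counts(normalize=True))
```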

6. Grouping and Aggregation

Grouping data allows us to analyze subsets of the data based on certain criteria.

# Group by a column and calculate the sum
grouped = csv_data.groupby('Category')['Value'].sum()
print(grouped)

# Group by multiple columns and calculate multiple aggregations
agg_data = csv_data.groupby(['Category', 'Sub-category']).agg({
    'Value': ['sum', 'mean']
})
print(agg_data)
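Passing a dict of lists to `agg`, as above, produces a MultiIndex on the columns. If flat column names are easier to work with, named aggregation is an alternative; the small stand-in dataset below uses the same `Category` and `Value` column names as the example above:

```python
import pandas as pd

# Small stand-in dataset (column names assumed for illustration)
df = pd.DataFrame({
    'Category': ['A', 'A', 'B', 'B'],
    'Value': [10, 20, 30, 40],
})

# Named aggregation gives flat, readable column names
summary = df.groupby('Category').agg(
    total=('Value', 'sum'),
    average=('Value', 'mean'),
)
print(summary)
```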

7. Visualization with Pandas

Pandas has built-in plotting functions that can be used for quick visualizations.

import matplotlib.pyplot as plt

# Plot a histogram of a numerical column
csv_data['Value'].plot(kind='hist')
plt.show()

# Plot a bar chart of grouped data
grouped.plot(kind='bar')
plt.show()
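Histograms and bar charts are only a starting point. A box plot, for example, highlights spread and potential outliers; the sketch below uses a small made-up `Value` column and a non-interactive backend so it also runs in scripts without a display:

```python
import matplotlib
matplotlib.use('Agg')  # non-interactive backend; works without a display
import matplotlib.pyplot as plt
import pandas as pd

# Small stand-in dataset for the plot
df = pd.DataFrame({'Value': [1, 2, 2, 3, 3, 3, 4, 10]})

# A box plot highlights the spread and potential outliers
ax = df['Value'].plot(kind='box')
ax.set_title('Distribution of Value')
plt.savefig('value_boxplot.png')
```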

8. Best Practices

  • Use meaningful variable names: This makes the code more readable and maintainable.
  • Keep the data cleaning process well-documented: It helps in understanding the changes made to the data.
  • Validate assumptions: Check if the data meets the assumptions made during the analysis.
  • Explore different visualizations: Different plots can reveal different aspects of the data.
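One way to keep the cleaning process documented, as suggested above, is to wrap each step in a small named function and chain them with `.pipe`. This is only one possible convention; the function names and the sample data below are made up:

```python
import pandas as pd

def drop_missing_rows(df):
    """Remove rows with any missing values."""
    return df.dropna()

def remove_duplicates(df):
    """Remove exact duplicate rows."""
    return df.drop_duplicates()

# Hypothetical raw data with one missing value and one duplicate row
raw = pd.DataFrame({'A': [1, 1, None, 3], 'B': ['x', 'x', 'y', 'z']})

# Each transformation is named, documented, and easy to audit
clean = raw.pipe(drop_missing_rows).pipe(remove_duplicates)
print(clean)
```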

9. Conclusion

Pandas is a versatile and powerful library for Exploratory Data Analysis. It provides a wide range of functions for data loading, inspection, cleaning, summarization, and visualization. By following the concepts, usage methods, common practices, and best practices discussed in this blog, you can efficiently use Pandas to gain insights from your data.
