import numpy as np
import pandas as pd
# Create a Series
s = pd.Series([1, 3, 5, np.nan, 6, 8])
print(s)
# Create a DataFrame
data = {
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [25, 30, 35]
}
df = pd.DataFrame(data)
print(df)
Pandas can load data from many sources, including CSV files, Excel workbooks, and SQL databases.
# Load a CSV file
csv_data = pd.read_csv('example.csv')
print(csv_data.head())
# Load an Excel file (reading .xlsx files requires the optional openpyxl package)
excel_data = pd.read_excel('example.xlsx')
print(excel_data.head())
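The snippets above assume that `example.csv` and `example.xlsx` exist on disk. As a minimal sketch that runs without any files, the example below reads CSV text from an in-memory buffer and then round-trips it through an in-memory SQLite database to illustrate the SQL case. The table name `sales`, the column names, and the values are made up for illustration.

```python
import io
import sqlite3

import pandas as pd

# Hypothetical CSV content standing in for a file on disk
csv_text = "Category,Value\nA,10\nA,20\nB,5\n"
from_csv = pd.read_csv(io.StringIO(csv_text))
print(from_csv.head())

# Write the rows to an in-memory SQLite table (the name 'sales' is an assumption),
# then read a filtered subset back with a SQL query
conn = sqlite3.connect(":memory:")
from_csv.to_sql("sales", conn, index=False)
from_sql = pd.read_sql(
    "SELECT Category, Value FROM sales WHERE Value >= 10 ORDER BY Value", conn
)
conn.close()
print(from_sql)
```

For a real database you would pass your own connection and query; the loading call itself stays the same.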
Once the data is loaded, we need to inspect it to understand its structure and content.
# Check the shape of the DataFrame
print('Shape:', csv_data.shape)
# Get the column names
print('Columns:', csv_data.columns)
# View basic information about the DataFrame
csv_data.info()
# Get the summary statistics of numerical columns
print(csv_data.describe())
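Beyond shape and summary statistics, a few more calls are often useful at this stage. The sketch below uses a small made-up DataFrame (the column names and values are assumptions, not taken from `example.csv`) to show column types, category counts, and the number of distinct values per column.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing value
inspect_df = pd.DataFrame({
    "Category": ["A", "A", "B", "B"],
    "Value": [10.0, np.nan, 5.0, 15.0],
})

# Data type of each column
print(inspect_df.dtypes)

# Last two rows, the counterpart of head()
print(inspect_df.tail(2))

# How often each category occurs
category_counts = inspect_df["Category"].value_counts()
print(category_counts)

# Number of distinct non-missing values per column
n_unique = inspect_df.nunique()
print(n_unique)

# Summary that also covers non-numerical columns
print(inspect_df.describe(include="all"))
```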
Data cleaning is an essential part of EDA. We need to handle missing values, duplicates, and incorrect data.
# Check for missing values
print(csv_data.isnull().sum())
# Drop rows that contain at least one missing value
cleaned_data = csv_data.dropna()
# Drop duplicate rows
cleaned_data = cleaned_data.drop_duplicates()
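Dropping rows is not the only option, and the snippet above does not cover incorrect data. As one possible approach, the sketch below removes duplicates, filters out an implausible value, and fills the remaining gap with the column median. The data and the rule that ages must be non-negative are assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical raw data: one duplicate row, one missing age, one invalid age
raw = pd.DataFrame({
    "Name": ["Alice", "Bob", "Bob", "Charlie", "Dana"],
    "Age": [25, 30, 30, np.nan, -4],
})

# Remove exact duplicate rows
deduped = raw.drop_duplicates()

# Keep rows whose age is missing or non-negative (assumed validity rule)
valid = deduped[deduped["Age"].isna() | (deduped["Age"] >= 0)]

# Fill the remaining missing ages with the median of the valid ages
filled = valid.assign(Age=valid["Age"].fillna(valid["Age"].median()))
print(filled)
```

Whether to drop or fill depends on the data: filling keeps more rows but changes the distribution of the column, so it is worth checking the summary statistics again afterwards.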
We can calculate various summary statistics for different columns.
# Calculate the mean of a column
mean_age = df['Age'].mean()
print('Mean Age:', mean_age)
# Calculate the median of a column
median_age = df['Age'].median()
print('Median Age:', median_age)
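When we need several statistics at once, `agg` can compute them in a single call instead of one method call per statistic. The sketch below rebuilds the small name and age table from earlier under the hypothetical name `people` so that it runs on its own.

```python
import pandas as pd

# Same values as the earlier example DataFrame
people = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie"],
    "Age": [25, 30, 35],
})

# Several statistics for one column in a single call
age_summary = people["Age"].agg(["min", "max", "mean", "median", "std"])
print(age_summary)

# A quantile, here the 75th percentile
age_q75 = people["Age"].quantile(0.75)
print("75th percentile:", age_q75)
```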
Grouping data allows us to analyze subsets of the data based on certain criteria.
# Group by a column and calculate the sum
grouped = csv_data.groupby('Category')['Value'].sum()
print(grouped)
# Group by multiple columns and calculate multiple aggregations
agg_data = csv_data.groupby(['Category', 'Sub-category']).agg({
    'Value': ['sum', 'mean']
})
print(agg_data)
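Passing a dictionary of lists to `agg` produces multi-level column names, which can be awkward to work with later. Named aggregation is one way to get flat, descriptive column names instead. The sketch below uses made-up data; the output names `total` and `average` are arbitrary choices.

```python
import pandas as pd

# Hypothetical data with the same column names as the grouping example
sales = pd.DataFrame({
    "Category": ["A", "A", "B", "B"],
    "Sub-category": ["x", "y", "x", "x"],
    "Value": [10, 20, 5, 15],
})

# Named aggregation: output_name=(input_column, function)
group_summary = (
    sales.groupby(["Category", "Sub-category"])
    .agg(total=("Value", "sum"), average=("Value", "mean"))
    .reset_index()  # turn the group keys back into ordinary columns
)
print(group_summary)
```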
Pandas has built-in plotting functions that can be used for quick visualizations.
import matplotlib.pyplot as plt
# Plot a histogram of a numerical column
csv_data['Value'].plot(kind='hist')
plt.show()
# Plot a bar chart of grouped data
grouped.plot(kind='bar')
plt.show()
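The `plot` method returns a Matplotlib `Axes` object, so titles and axis labels can be added before the figure is shown or saved. The sketch below uses made-up values and selects the non-interactive `Agg` backend so that it runs without a display; the output file name is an assumption.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend; remove this line to use plt.show()
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical numerical column
values = pd.Series([10, 20, 5, 15, 12, 18], name="Value")

# plot() returns the Axes, which we can label
ax = values.plot(kind="hist", bins=5, title="Distribution of Value")
ax.set_xlabel("Value")
ax.set_ylabel("Frequency")

# Save the figure to a file (hypothetical file name)
ax.figure.savefig("value_hist.png")
plt.close(ax.figure)
```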
Pandas is a versatile library for Exploratory Data Analysis (EDA). It covers data loading, inspection, cleaning, summarization, and quick visualization in a single toolkit. By applying the concepts and practices discussed in this blog, you can use Pandas to gain insights from your data efficiently.