Missing values are a common problem in datasets. They can occur for various reasons, such as data entry errors, sensor malfunctions, or incomplete surveys. Pandas represents missing values as NaN (Not a Number) in numeric columns and as None (or NaN) in object columns.
Duplicate records are rows in the dataset that have identical values across all columns. They can skew the analysis results and waste computational resources.
Inconsistent data types can cause issues when performing calculations or comparisons. For example, a column that is supposed to contain numerical values may have some string values mixed in.
Outliers are data points that are significantly different from the other data points in the dataset. They can have a large impact on statistical analysis and machine learning models.
import pandas as pd
import numpy as np
# Create a sample DataFrame with missing values
data = {
    'A': [1, 2, np.nan, 4],
    'B': [5, np.nan, 7, 8],
    'C': [9, 10, 11, 12]
}
df = pd.DataFrame(data)
# Check for missing values
print("Missing values in each column:")
print(df.isnull().sum())
# Drop rows with missing values
df_dropped = df.dropna()
print("\nDataFrame after dropping rows with missing values:")
print(df_dropped)
# Fill missing values with a specific value
df_filled = df.fillna(0)
print("\nDataFrame after filling missing values with 0:")
print(df_filled)
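Filling every gap with a constant like 0 can distort numeric columns. A common alternative is to fill each column with its own mean or median; a short sketch using the same sample data as above:

```python
import pandas as pd
import numpy as np

# Same sample DataFrame as above
data = {
    'A': [1, 2, np.nan, 4],
    'B': [5, np.nan, 7, 8],
    'C': [9, 10, 11, 12]
}
df = pd.DataFrame(data)

# Passing a Series to fillna() fills each column with its own statistic:
# here, the column mean (df.median() would work the same way)
df_mean_filled = df.fillna(df.mean())
print(df_mean_filled)
```

This keeps each column's average unchanged, which is often less disruptive to downstream statistics than a fixed fill value.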
# Create a sample DataFrame with duplicate records
data = {
    'A': [1, 2, 2, 3],
    'B': [4, 5, 5, 6],
    'C': [7, 8, 8, 9]
}
df = pd.DataFrame(data)
# Check for duplicate records
print("Duplicate records:")
print(df.duplicated())
# Drop duplicate records
df_dropped = df.drop_duplicates()
print("\nDataFrame after dropping duplicate records:")
print(df_dropped)
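By default, drop_duplicates() compares all columns and keeps the first occurrence. Its subset and keep parameters change both behaviors; a sketch with the same sample data:

```python
import pandas as pd

df = pd.DataFrame({
    'A': [1, 2, 2, 3],
    'B': [4, 5, 5, 6],
    'C': [7, 8, 8, 9]
})

# Compare only column 'A' when deciding what counts as a duplicate,
# and keep the last occurrence of each duplicate instead of the first
df_subset = df.drop_duplicates(subset=['A'], keep='last')
print(df_subset)
```

Restricting the comparison to a key column (or columns) is useful when rows can differ in incidental fields but still represent the same record.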
# Create a sample DataFrame with inconsistent data types
data = {
    'A': [1, 2, '3', 4],
    'B': [5, 6, 7, 8]
}
df = pd.DataFrame(data)
# Convert column 'A' to numeric data type
df['A'] = pd.to_numeric(df['A'], errors='coerce')
print("\nDataFrame after converting column 'A' to numeric data type:")
print(df)
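It helps to inspect dtypes before and after the conversion, and to remember that errors='coerce' silently turns any unparseable value into NaN. A short sketch:

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, '3', 4], 'B': [5, 6, 7, 8]})

print(df.dtypes)  # 'A' starts as object because of the string '3'
df['A'] = pd.to_numeric(df['A'], errors='coerce')
print(df.dtypes)  # 'A' is now a numeric dtype

# A value that cannot be parsed becomes NaN under errors='coerce'
s = pd.to_numeric(pd.Series(['1', 'two', '3']), errors='coerce')
print(s)
```

Checking how many NaNs the coercion introduced (for example with s.isna().sum()) tells you how many values were unparseable rather than simply missing.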
# Create a sample DataFrame with outliers
data = {
    'A': [1, 2, 3, 4, 100]
}
df = pd.DataFrame(data)
# Calculate the interquartile range (IQR)
Q1 = df['A'].quantile(0.25)
Q3 = df['A'].quantile(0.75)
IQR = Q3 - Q1
# Define the lower and upper bounds
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR
# Remove outliers
df_no_outliers = df[(df['A'] >= lower_bound) & (df['A'] <= upper_bound)]
print("\nDataFrame after removing outliers:")
print(df_no_outliers)
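Dropping rows is not the only option: when you want to keep every record, you can cap extreme values at the IQR bounds instead, using pandas' clip() method. A sketch with the same sample data:

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3, 4, 100]})

# Same IQR bounds as above
Q1 = df['A'].quantile(0.25)
Q3 = df['A'].quantile(0.75)
IQR = Q3 - Q1
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR

# Cap values at the bounds instead of dropping the rows
df['A_capped'] = df['A'].clip(lower=lower_bound, upper=upper_bound)
print(df)
```

Capping (also called winsorizing) preserves the row count, which matters when each row carries other columns you still need.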
# Load a real-world dataset
url = 'https://raw.githubusercontent.com/mwaskom/seaborn-data/master/iris.csv'
df = pd.read_csv(url)
# Check for missing values
print("Missing values in each column:")
print(df.isnull().sum())
# Check for duplicate records
print("\nDuplicate records:")
print(df.duplicated().sum())
# Drop duplicate records
df = df.drop_duplicates()
# Print the cleaned dataset
print("\nCleaned dataset:")
print(df.head())
Data cleaning is an essential step in the data analysis pipeline. Pandas provides a rich set of tools to handle various data cleaning tasks, such as handling missing values, duplicate records, inconsistent data types, and outliers. By following the best practices and using the examples provided in this blog post, you can effectively clean your datasets and prepare them for analysis.
What is the difference between dropna() and fillna()? dropna() removes rows or columns that contain missing values, while fillna() fills the missing values with a specific value or a calculated value, such as the column mean.
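Both methods take parameters that control their behavior: dropna() accepts axis and how, and fillna() accepts a per-column dictionary of fill values. A brief sketch:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({
    'A': [1, np.nan, 3],
    'B': [np.nan, np.nan, np.nan],  # entirely missing
    'C': [7, 8, 9]
})

# axis=1 targets columns; how='all' drops only fully-missing ones
df_cols = df.dropna(axis=1, how='all')
print(df_cols)

# A dict applies a different fill value per column ('B' is left untouched)
df_filled = df.fillna({'A': 0})
print(df_filled)
```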
How can you handle outliers? You can remove them, replace them with a more appropriate value (for example, capping them at the IQR bounds), or transform the data using techniques such as a logarithmic transformation.
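As one illustration of the transformation option, a log transform compresses large values while preserving their order. A sketch using np.log1p, which computes log(1 + x) and so also handles zeros (assuming the data is non-negative):

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({'A': [1, 2, 3, 4, 100]})

# log1p compresses the outlier (100) while keeping the ordering of values
df['A_log'] = np.log1p(df['A'])
print(df)
```

After the transform, the gap between 100 and the other values shrinks dramatically, so the point no longer dominates means, variances, or model fits.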
What if cleaning a large dataset is slow? You can try techniques such as chunked or parallel processing, or sampling, to reduce the computational time.
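Pandas supports both of these directly: read_csv() can stream a file in chunks via its chunksize parameter, and sample() draws a random subset. A sketch that simulates a large CSV in memory (the file contents here are illustrative; in practice you would pass a file path):

```python
import io
import pandas as pd

# Simulate a large CSV in memory; in practice this would be a file path
csv_data = "A,B\n" + "\n".join(f"{i},{i * 2}" for i in range(1000))

# Process the file in chunks of 250 rows instead of loading it all at once
total = 0
for chunk in pd.read_csv(io.StringIO(csv_data), chunksize=250):
    total += chunk['A'].sum()
print(total)  # same result as summing the full column at once

# Alternatively, clean and explore a random 10% sample of the data
df = pd.read_csv(io.StringIO(csv_data))
sample = df.sample(frac=0.1, random_state=0)
print(len(sample))
```

Chunked processing keeps memory usage bounded; sampling trades completeness for speed and is best for exploratory cleaning before a full pass.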