Data cleaning encompasses several key concepts:
To follow this tutorial, you need to have Python installed on your system along with the Pandas library. You can install Pandas using pip
:
pip install pandas
We will use a sample CSV file for demonstration purposes. First, import the Pandas library and load the data:
import pandas as pd
# Load a CSV file
data = pd.read_csv('sample_data.csv')
print(data.head())
In the above code, pd.read_csv()
is used to read a CSV file. The head()
method is then used to display the first few rows of the dataset.
Pandas provides several methods to handle missing values.
# Check for missing values
missing_values = data.isnull()
print(missing_values.sum())
The isnull()
method returns a DataFrame of boolean values indicating whether each value is missing or not. The sum()
method is then used to count the number of missing values in each column.
# Remove rows with missing values
data_without_missing_rows = data.dropna()
# Remove columns with missing values
data_without_missing_columns = data.dropna(axis=1)
The dropna()
method is used to remove rows or columns with missing values. By default, it removes rows (axis = 0
). Setting axis = 1
will remove columns.
# Fill missing values with a specific value
filled_data = data.fillna(value=0)
# Fill missing values with the mean of the column
mean_value = data['column_name'].mean()
data['column_name'] = data['column_name'].fillna(mean_value)
The fillna()
method is used to fill missing values. You can fill them with a specific value or a calculated statistic like the mean.
To remove duplicate rows, use the drop_duplicates()
method:
# Remove duplicate rows
data_without_duplicates = data.drop_duplicates()
Sometimes, the data types of columns may be incorrect. You can change the data type using the astype()
method:
# Convert a column to integer type
data['column_name'] = data['column_name'].astype(int)
We can use the inter - quartile range (IQR) method to detect and remove outliers.
Q1 = data['column_name'].quantile(0.25)
Q3 = data['column_name'].quantile(0.75)
IQR = Q3 - Q1
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR
# Filter out outliers
filtered_data = data[(data['column_name'] >= lower_bound) & (data['column_name'] <= upper_bound)]
Data cleaning is a crucial step in data analysis. Pandas provides a rich set of tools to handle various data cleaning tasks such as handling missing values, removing duplicates, correcting data types, and filtering outliers. By following the concepts and methods discussed in this tutorial, you can effectively clean your datasets and prepare them for further analysis.