In Pandas, missing data is represented by NaN (Not a Number) for numerical data and None or NaN for object data types. Pandas provides several functions to handle these missing values. The main data structures in Pandas, Series and DataFrame, have built-in methods to deal with missing data.
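To make the two representations concrete, here is a minimal sketch. The variable names `numeric` and `labels` are made up for illustration, and the object dtype is set explicitly so the example does not depend on how a given Pandas version infers string columns.

```python
import pandas as pd

# In a numeric Series, None is converted to NaN and the dtype becomes float64.
numeric = pd.Series([1, None, 3])
print(numeric.dtype)  # float64

# In an object Series (dtype requested explicitly), None is stored as-is.
labels = pd.Series(["a", None, "c"], dtype=object)
print(labels.iloc[1] is None)

# Both markers are reported as missing by the same detection method.
print(numeric.isna().tolist())
print(labels.isna().tolist())
```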
Pandas provides two main methods to detect missing data: isnull() and notnull(). They are also available under the aliases isna() and notna().
import pandas as pd
import numpy as np
# Create a sample DataFrame with missing values
data = {'A': [1, np.nan, 3], 'B': [np.nan, 5, 6], 'C': [7, 8, np.nan]}
df = pd.DataFrame(data)
# Detect missing values
print(df.isnull())
# Detect non-missing values
print(df.notnull())
The isnull() method returns a DataFrame (or Series) of the same shape as the original object, where each element is True if the corresponding element in the original object is missing and False otherwise. The notnull() method does the opposite.
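Because the result is a Boolean mask, it can be aggregated to summarize where values are missing. The following is a small sketch of that idea on the same sample data; the names `missing_per_column`, `rows_with_missing`, and `total_missing` are illustrative only.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1, np.nan, 3], 'B': [np.nan, 5, 6], 'C': [7, 8, np.nan]})

# True counts as 1 when summed, so this gives the number of missing values per column.
missing_per_column = df.isnull().sum()
print(missing_per_column)

# Flag each row that contains at least one missing value.
rows_with_missing = df.isnull().any(axis=1)
print(rows_with_missing)

# Total number of missing cells in the whole DataFrame.
total_missing = int(df.isnull().sum().sum())
print(total_missing)
```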
You can remove missing data using the dropna() method.
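# Drop rows with any missing values
# (every row in this sample has one, so the result is empty)
df_dropped_rows = df.dropna()
print(df_dropped_rows)
# Drop columns with any missing values
# (every column in this sample has one, so no columns remain)
df_dropped_columns = df.dropna(axis=1)
print(df_dropped_columns)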
The axis parameter can be set to 0 (default) to drop rows or 1 to drop columns. By default, dropna() drops any row or column that contains at least one missing value. You can also use the thresh parameter to specify the minimum number of non-missing values required for a row or column to be kept.
# Keep rows with at least 2 non-missing values
# (every row in this sample has exactly 2, so all rows are kept)
df_thresh = df.dropna(thresh=2)
print(df_thresh)
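Beyond axis and thresh, dropna() also accepts the subset and how parameters, which the text above does not cover. The sketch below shows one possible use of each on the same sample data; the variable names are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1, np.nan, 3], 'B': [np.nan, 5, 6], 'C': [7, 8, np.nan]})

# Only consider column 'A' when deciding which rows to drop.
df_subset = df.dropna(subset=['A'])
print(df_subset)

# Drop a row only if all of its values are missing.
# No row in this sample is entirely missing, so nothing is removed.
df_how_all = df.dropna(how='all')
print(df_how_all)
```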
You can fill missing values with a single value using the fillna() method.
# Fill missing values with 0
df_filled_0 = df.fillna(0)
print(df_filled_0)
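If different columns need different replacement values, fillna() also accepts a dictionary that maps column names to fill values. A brief sketch follows; the fill values 0 and -1 are arbitrary choices for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1, np.nan, 3], 'B': [np.nan, 5, 6], 'C': [7, 8, np.nan]})

# Fill column 'A' with 0 and column 'B' with -1.
# Column 'C' is not listed, so its missing value is left untouched.
df_filled_by_column = df.fillna({'A': 0, 'B': -1})
print(df_filled_by_column)
```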
Pandas allows you to fill missing values with the previous or next non-missing value. This is known as forward filling (ffill) and backward filling (bfill).
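# Forward fill: propagate the last valid value forward
# (fillna(method='ffill') is deprecated in recent pandas releases)
df_ffill = df.ffill()
print(df_ffill)
# Backward fill: propagate the next valid value backward
df_bfill = df.bfill()
print(df_bfill)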
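One detail worth checking is what happens at the edges: a forward fill cannot fill a missing value in the first row, because there is no earlier value to copy. The sketch below illustrates this on the sample data and shows one possible workaround, chaining a backward fill afterwards.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1, np.nan, 3], 'B': [np.nan, 5, 6], 'C': [7, 8, np.nan]})

# Forward fill: 'B' in the first row stays missing, since nothing precedes it.
forward = df.ffill()
print(forward)

# Forward fill, then backward fill to cover any leading gaps that remain.
forward_then_backward = df.ffill().bfill()
print(forward_then_backward)
```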
You can also fill missing values with statistical measures such as the mean, median, or mode.
# Fill missing values in column 'A' with the mean of column 'A'
df['A'] = df['A'].fillna(df['A'].mean())
print(df)
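The same approach extends to the median and the mode. As a sketch, passing a Series of per-column medians to fillna() fills every column in one step, and mode() returns the most frequent values in sorted order, so the first entry can serve as a fill value. The variable names here are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1, np.nan, 3], 'B': [np.nan, 5, 6], 'C': [7, 8, np.nan]})

# Fill each column with its own median (missing values are skipped when computing it).
medians = df.median(numeric_only=True)
df_median = df.fillna(medians)
print(df_median)

# mode() can return several values when there is a tie; take the first one.
mode_a = df['A'].mode().iloc[0]
a_mode_filled = df['A'].fillna(mode_a)
print(a_mode_filled)
```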
Handling missing data is an important step in data analysis. Pandas provides a rich set of tools to detect, remove, and fill missing data. The right choice depends on the data: dropping rows or columns discards information, while filling introduces values that were never observed. Use domain knowledge to decide which trade-off is acceptable so that your analysis stays accurate and reliable.