Describing a pandas DataFrame with describe()

pandas is an indispensable Python library for data analysis and manipulation. A pandas DataFrame is a two-dimensional labeled data structure whose columns can hold different types, much like a spreadsheet or a SQL table. An important part of working with a DataFrame is understanding its characteristics, and that is where the describe() method comes in. By default it reports summary statistics for the numerical columns, giving insight into the distribution, central tendency, and variability of the data.

Table of Contents

  1. Core Concepts
  2. Typical Usage Methods
  3. Common Practices
  4. Best Practices
  5. Code Examples
  6. Conclusion
  7. FAQ
  8. References

Core Concepts

What is a DataFrame Description?

Calling the describe() method on a pandas DataFrame computes basic summary statistics for its numerical columns: the count of non-null values, the mean, the standard deviation, the minimum, the 25th, 50th (median), and 75th percentiles, and the maximum.

Key Statistical Measures

  • Count: The number of non-null values in each column.
  • Mean: The arithmetic average of the values in a column.
  • Standard Deviation: A measure of how widely the values are spread around the mean. pandas reports the sample standard deviation (ddof=1) by default.
  • Minimum and Maximum: The smallest and largest values in each column.
  • Percentiles: The p-th percentile is the value below which roughly p percent of the data falls. The 25th, 50th (median), and 75th percentiles are commonly used to summarize the distribution.
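To make these measures concrete, the following sketch recomputes each row of the describe() output by hand with the corresponding Series methods. The sample values are made up for illustration only.

```python
import pandas as pd

# Hypothetical sample column, used only for illustration
s = pd.Series([1, 2, 3, 4, 5], name="col1")

# Recompute each statistic individually
manual = pd.Series(
    {
        "count": s.count(),        # number of non-null values
        "mean": s.mean(),          # arithmetic average
        "std": s.std(),            # sample standard deviation (ddof=1)
        "min": s.min(),
        "25%": s.quantile(0.25),   # first quartile
        "50%": s.quantile(0.50),   # median
        "75%": s.quantile(0.75),   # third quartile
        "max": s.max(),
    },
    name="col1",
)

summary = s.describe()

# Show both side by side; the two columns should agree
print(pd.DataFrame({"describe": summary, "manual": manual}))
```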

Typical Usage Methods

The most basic way to use the describe() method is to simply call it on a DataFrame object:

import pandas as pd
 
# Create a sample DataFrame
data = {
    'col1': [1, 2, 3, 4, 5],
    'col2': [5, 4, 3, 2, 1],
    'col3': ['a', 'b', 'c', 'd', 'e']
}
df = pd.DataFrame(data)
 
# Get the description of the DataFrame
description = df.describe()
print(description)

In this example, the describe() method will only compute statistics for the numerical columns (col1 and col2), ignoring the non-numerical column (col3).

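We can also customize the describe() method. For example, passing include='all' adds the non-numerical columns to the output. For those columns pandas reports count, unique, top (the most frequent value), and freq (how often it occurs), and it fills the statistics that do not apply to a column with NaN.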

description_all = df.describe(include='all')
print(description_all)
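Beyond include='all', describe() also accepts percentiles, include, and exclude arguments. The sketch below reuses the sample data from above and shows two such variations; the variable names custom and text_only are chosen here for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "col1": [1, 2, 3, 4, 5],
    "col2": [5, 4, 3, 2, 1],
    "col3": ["a", "b", "c", "d", "e"],
})

# Request the 10th, 50th, and 90th percentiles instead of the default quartiles
custom = df.describe(percentiles=[0.1, 0.5, 0.9])
print(custom)

# Exclude numeric columns so that only the text column is summarized
text_only = df.describe(exclude=[np.number])
print(text_only)
```

Percentiles are given as fractions between 0 and 1, and the row labels in the output are derived from them (for example "10%").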

Common Practices

Handling Missing Values

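The describe() method skips missing values on its own, column by column, so cleaning is not required for it to run. It is still good practice to handle missing values deliberately before summarizing: drop the affected rows or columns with dropna(), or fill them with appropriate values using fillna(). Keep in mind that dropna() removes a whole row even when only one column is missing, which changes the statistics of the other columns as well.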

# Create a DataFrame with missing values
data_with_nan = {
    'col1': [1, 2, None, 4, 5],
    'col2': [5, None, 3, 2, 1]
}
df_with_nan = pd.DataFrame(data_with_nan)
 
# Drop rows with missing values
df_clean = df_with_nan.dropna()
description_clean = df_clean.describe()
print(description_clean)
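How missing values are handled changes what describe() reports. As a rough comparison, the sketch below summarizes the same data three ways: untouched, after dropna(), and after filling with the column mean. Filling with the mean is just one possible strategy, assumed here for illustration.

```python
import pandas as pd

df_with_nan = pd.DataFrame({
    "col1": [1, 2, None, 4, 5],
    "col2": [5, None, 3, 2, 1],
})

# describe() skips NaN per column, so each column keeps its 4 valid values
raw = df_with_nan.describe()

# dropna() removes every row that contains a NaN, leaving 3 complete rows
dropped = df_with_nan.dropna().describe()

# fillna() keeps all 5 rows but shrinks the spread toward the fill value
filled = df_with_nan.fillna(df_with_nan.mean()).describe()

# Compare the counts reported by each approach
print(pd.DataFrame({
    "raw": raw.loc["count"],
    "dropped": dropped.loc["count"],
    "filled": filled.loc["count"],
}))
```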

Analyzing Different Subsets of Data

We can filter the DataFrame based on certain conditions and then use the describe() method to analyze specific subsets of data.

# Filter the DataFrame
filtered_df = df[df['col1'] > 2]
description_filtered = filtered_df.describe()
print(description_filtered)
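When the subsets are defined by the values of a column rather than by a condition, grouping can replace repeated filtering. The following is a minimal sketch that assumes a small, made-up employee table with department and salary columns.

```python
import pandas as pd

# Hypothetical employee data, used only for illustration
df = pd.DataFrame({
    "department": ["HR", "IT", "IT", "Finance", "HR"],
    "salary": [50000, 60000, 70000, 80000, 90000],
})

# One row of summary statistics per department
per_department = df.groupby("department")["salary"].describe()
print(per_department)
```

Groups with a single row, such as Finance here, get NaN for the standard deviation because a sample standard deviation needs at least two values.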

Best Practices

Understanding the Data Type

Make sure you understand the data types of the columns in your DataFrame. The describe() method will only compute numerical statistics for numerical columns by default. If you have columns that should be numerical but are stored as strings, you need to convert them first.

# Create a DataFrame with a column that should be numerical
data_str_num = {
    'col1': ['1', '2', '3', '4', '5']
}
df_str_num = pd.DataFrame(data_str_num)
 
# Convert the column to numerical type
df_str_num['col1'] = pd.to_numeric(df_str_num['col1'])
description_str_num = df_str_num.describe()
print(description_str_num)
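Real data often contains strings that cannot be parsed as numbers, in which case pd.to_numeric() raises an error by default. One option, sketched below with a made-up "n/a" placeholder, is to pass errors="coerce" so that such entries become NaN and are then skipped by describe().

```python
import pandas as pd

# Hypothetical column with one unparseable entry
df_dirty = pd.DataFrame({"col1": ["1", "2", "n/a", "4", "5"]})

# errors="coerce" turns unparseable strings into NaN instead of raising
df_dirty["col1"] = pd.to_numeric(df_dirty["col1"], errors="coerce")

print(df_dirty.dtypes)      # col1 is now a float column
print(df_dirty.describe())  # statistics are based on the 4 valid values
```

Coercion silently discards information, so it is worth checking how many values became NaN before relying on the summary.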

Using the Output for Further Analysis

The output of the describe() method is itself a DataFrame, so it can be used for further analysis: individual statistics can be selected by row label with .loc. For example, you can compare the means and standard deviations of different columns to understand their relative variability.
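As one way to compare variability across columns on different scales, the sketch below derives two extra measures from the describe() output: the coefficient of variation (standard deviation divided by mean) and the interquartile range. Both measures and the sample data are choices made here for illustration.

```python
import pandas as pd

# Hypothetical data, used only for illustration
df = pd.DataFrame({
    "age": [25, 30, 35, 40, 45],
    "salary": [50000, 60000, 70000, 80000, 90000],
})

desc = df.describe()

# Coefficient of variation: spread relative to the mean,
# which makes columns with different units comparable
cv = desc.loc["std"] / desc.loc["mean"]

# Interquartile range: spread of the middle 50% of the values
iqr = desc.loc["75%"] - desc.loc["25%"]

print(pd.DataFrame({"cv": cv, "iqr": iqr}))
```

Although the salary column has a far larger standard deviation in absolute terms, both columns in this example turn out to have the same coefficient of variation.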

Code Examples

import pandas as pd
 
# Create a sample DataFrame
data = {
    'age': [25, 30, 35, 40, 45],
    'salary': [50000, 60000, 70000, 80000, 90000],
    'department': ['HR', 'IT', 'IT', 'Finance', 'HR']
}
df = pd.DataFrame(data)
 
# Basic description
basic_description = df.describe()
print("Basic Description:")
print(basic_description)
 
# Include all columns
all_description = df.describe(include='all')
print("\nDescription with all columns:")
print(all_description)
 
# Handling missing values
data_with_nan = {
    'age': [25, None, 35, 40, 45],
    'salary': [50000, 60000, None, 80000, 90000]
}
df_with_nan = pd.DataFrame(data_with_nan)
df_clean = df_with_nan.dropna()
clean_description = df_clean.describe()
print("\nDescription after handling missing values:")
print(clean_description)
 
# Analyzing subsets
filtered_df = df[df['age'] > 30]
filtered_description = filtered_df.describe()
print("\nDescription of filtered subset:")
print(filtered_description)

Conclusion

The describe() method in pandas is a quick way to get an overview of the numerical columns in a DataFrame. It provides essential statistics that help us understand the data distribution, spot potential outliers, and make informed decisions during analysis. Following the common and best practices above makes the method more effective in real-world scenarios.

FAQ

Q1: Why are some columns missing from the description output?

The describe() method only computes numerical statistics for numerical columns by default. If you want to include non-numerical columns, you can pass the include='all' parameter.

Q2: Can I get the description for a single column?

Yes, you can select a single column from the DataFrame and then call the describe() method on it. For example: df['col1'].describe().
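A short sketch of the single-column case, using made-up data: selecting with single brackets yields a Series, and the statistics reported depend on the column's type.

```python
import pandas as pd

df = pd.DataFrame({
    "col1": [1, 2, 3, 4, 5],
    "col3": ["a", "b", "c", "d", "e"],
})

# Numeric column: count, mean, std, min, quartiles, max
numeric_summary = df["col1"].describe()
print(numeric_summary)

# Text column: count, unique, top, freq
text_summary = df["col3"].describe()
print(text_summary)

# Double brackets keep the result as a one-column DataFrame
frame_summary = df[["col1"]].describe()
print(frame_summary)
```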

Q3: How does the describe() method handle missing values?

The describe() method computes its statistics from non-null values only. Missing values are therefore not counted and are ignored when calculating the mean, standard deviation, percentiles, and so on.

References