In Pandas, the categorical data type represents a variable that takes a fixed, limited set of possible values. It is more memory-efficient than the object data type for such variables, especially when there are many repeated values. A categorical variable has two main components: categories and codes. The categories are the unique values the variable can take, and the codes are the integer positions that map each element to its category, with -1 marking a missing value.
We can create a categorical series in several ways. One common way is to convert an existing series or column to a categorical type using the astype() method.
import pandas as pd
# Create a regular series
data = pd.Series(['apple', 'banana', 'apple', 'cherry'])
# Convert to categorical
categorical_data = data.astype('category')
print(categorical_data)
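We can also build categorical data directly from integer codes and an explicit list of categories using pd.Categorical.from_codes(). It returns a Categorical object, which we wrap in pd.Series to get a categorical series.
import pandas as pd
categories = ['apple', 'banana', 'cherry']
# Each code is a position in the categories list
codes = [0, 1, 0, 2]
categorical_series = pd.Series(pd.Categorical.from_codes(codes, categories=categories))
print(categorical_series)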
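Fixing the category set up front also controls what happens to unexpected values. The following is a minimal sketch, assuming a hypothetical input that contains a label ('durian') outside the declared categories; the names fruit_dtype, raw, and typed are illustrative only.

```python
import pandas as pd

# Declare the allowed categories; 'durian' is deliberately left out.
fruit_dtype = pd.CategoricalDtype(categories=['apple', 'banana', 'cherry'])

raw = pd.Series(['apple', 'durian', 'cherry'])
typed = raw.astype(fruit_dtype)

# Values outside the declared categories become missing (NaN),
# and missing entries carry the code -1.
print(typed.isna().tolist())
print(typed.cat.codes.tolist())
```

Because out-of-set labels silently turn into missing values, it may be worth checking isna() after such a conversion.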
We can access the categories and codes of a categorical series through the categories and codes attributes of the .cat accessor, respectively.
import pandas as pd
data = pd.Series(['apple', 'banana', 'apple', 'cherry']).astype('category')
print("Categories:", data.cat.categories)
print("Codes:", data.cat.codes)
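To make the relationship between the two attributes concrete, the sketch below rebuilds the original labels by looking up each code in the categories. It assumes the series has no missing values, since a code of -1 would need separate handling; the variable name rebuilt is illustrative.

```python
import pandas as pd

data = pd.Series(['apple', 'banana', 'apple', 'cherry']).astype('category')

categories = data.cat.categories   # the unique labels, stored once
codes = data.cat.codes             # one small integer per element

# Look up each code in the categories to recover the original labels.
rebuilt = categories.take(codes.to_numpy())
print(list(rebuilt))
```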
Categorical data can be sorted by the order of its categories rather than alphabetically. We can define an ordered categorical variable by setting the ordered parameter to True when creating the categorical data type.
import pandas as pd
categories = ['low', 'medium', 'high']
data = pd.Series(['low', 'high', 'medium', 'low']).astype(pd.CategoricalDtype(categories=categories, ordered=True))
sorted_data = data.sort_values()
print(sorted_data)
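An ordered categorical supports more than sorting. As a small sketch under the same low/medium/high assumption, the example below compares each element against a category and takes the minimum and maximum; the names levels and above_low are illustrative.

```python
import pandas as pd

levels = pd.CategoricalDtype(categories=['low', 'medium', 'high'], ordered=True)
data = pd.Series(['low', 'high', 'medium', 'low']).astype(levels)

# Comparisons follow the category order, not alphabetical order.
above_low = data > 'low'
print(above_low.tolist())

# min() and max() are defined because the categories are ordered.
print(data.min(), data.max())
```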
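We can rename the categories of a categorical variable using the rename_categories() method. The new names can be given as a list in category order or as a dict mapping old names to new ones; in either case each new name must be unique.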
import pandas as pd
data = pd.Series(['apple', 'banana', 'apple', 'cherry']).astype('category')
# One unique new name per existing category (apple, banana, cherry)
new_categories = {'apple': 'red fruit', 'banana': 'yellow fruit', 'cherry': 'red-black fruit'}
renamed_data = data.cat.rename_categories(new_categories)
print(renamed_data)
Most machine learning algorithms cannot handle categorical data directly, so we need to convert it into numerical form. One common encoding method is one-hot encoding, which can be easily done using the get_dummies() function in Pandas.
import pandas as pd
data = pd.DataFrame({'fruit': ['apple', 'banana', 'apple', 'cherry']})
one_hot_encoded = pd.get_dummies(data, columns=['fruit'])
print(one_hot_encoded)
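One-hot encoding interacts with the categorical data type: when a column has declared categories, get_dummies() produces a column for every category, including ones that do not appear in the data. The sketch below assumes a hypothetical batch that contains only apples; the names fruit_dtype, batch, and encoded are illustrative.

```python
import pandas as pd

# Declare the full category set, even though this batch only has apples.
fruit_dtype = pd.CategoricalDtype(categories=['apple', 'banana', 'cherry'])
batch = pd.DataFrame({'fruit': ['apple', 'apple']})
batch['fruit'] = batch['fruit'].astype(fruit_dtype)

encoded = pd.get_dummies(batch, columns=['fruit'])

# One column per declared category, so the layout stays stable across batches.
print(list(encoded.columns))
```

This can help keep training and prediction data aligned when some categories are absent from a given batch.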
We can group data by categorical variables and perform aggregation operations.
import pandas as pd
data = pd.DataFrame({
'category': ['A', 'B', 'A', 'B'],
'value': [10, 20, 30, 40]
})
grouped = data.groupby('category').sum()
print(grouped)
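When the grouping column actually has the categorical data type, the observed parameter of groupby() decides whether categories with no rows appear in the result. The following is a sketch that assumes a hypothetical extra category 'C' with no data; passing observed explicitly avoids depending on the default, which differs between Pandas versions.

```python
import pandas as pd

data = pd.DataFrame({
    # 'C' is a declared category that never occurs in the rows.
    'category': pd.Categorical(['A', 'B', 'A', 'B'], categories=['A', 'B', 'C']),
    'value': [10, 20, 30, 40],
})

# observed=False keeps every declared category; empty groups sum to 0.
all_groups = data.groupby('category', observed=False)['value'].sum()

# observed=True keeps only the categories present in the data.
seen_groups = data.groupby('category', observed=True)['value'].sum()

print(all_groups)
print(seen_groups)
```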
Using the categorical data type in Pandas can significantly reduce memory usage, especially when dealing with large datasets with many repeated categorical values.
import pandas as pd
import numpy as np
# Create a large series with repeated values
large_series = pd.Series(np.random.choice(['apple', 'banana', 'cherry'], size=100000))
# Calculate memory usage before conversion
memory_before = large_series.memory_usage(deep=True)
# Convert to categorical
categorical_series = large_series.astype('category')
# Calculate memory usage after conversion
memory_after = categorical_series.memory_usage(deep=True)
print(f"Memory before: {memory_before} bytes")
print(f"Memory after: {memory_after} bytes")
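The saving comes from storing each distinct label once and keeping only a small integer code per row. The sketch below, which uses a seeded random generator as an assumption for reproducibility, inspects the code dtype and reports the ratio rather than a fixed number, since exact sizes depend on the Pandas version and string storage.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # seeded only so the example is repeatable
labels = pd.Series(rng.choice(['apple', 'banana', 'cherry'], size=100_000))
as_category = labels.astype('category')

# With only three categories, each row needs just a one-byte code.
print(as_category.cat.codes.dtype)

before = labels.memory_usage(deep=True)
after = as_category.memory_usage(deep=True)
print(f"Categorical uses {after / before:.1%} of the original memory")
```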
We can handle missing values in categorical data by filling them with a dedicated category, which must first be added to the categories, or by imputing a statistic such as the most frequent category.
import numpy as np
import pandas as pd
data = pd.Series(['apple', 'banana', np.nan, 'cherry']).astype('category')
filled_data = data.cat.add_categories(['unknown']).fillna('unknown')
print(filled_data)
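As one example of a statistical fill, the sketch below replaces missing entries with the most frequent category (the mode). The data here is a hypothetical variant in which 'apple' appears twice so that the mode is unambiguous; because the mode is already one of the categories, no new category has to be added.

```python
import numpy as np
import pandas as pd

# Hypothetical data: 'apple' is the most frequent label.
data = pd.Series(['apple', 'banana', np.nan, 'apple']).astype('category')

# mode() ignores missing values; take the first entry as the fill value.
most_frequent = data.mode()[0]
filled = data.fillna(most_frequent)

print(filled.tolist())
```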
Handling categorical data in Pandas is an essential skill for data analysts and machine learning practitioners. Pandas provides a rich set of tools for creating, manipulating, encoding, and analyzing categorical data. By understanding how categories and codes work and applying the practices covered here, such as ordering, encoding, grouping, reducing memory usage, and filling missing values, we can handle categorical data efficiently in our data analysis and machine learning projects.