Pandas provides two primary data structures: Series and DataFrame. A Series is a one-dimensional labeled array capable of holding any data type (integers, strings, floating-point numbers, Python objects, etc.). A DataFrame is a two-dimensional labeled data structure with columns of potentially different types. It can be thought of as a spreadsheet or a SQL table.
Indexing in Pandas allows you to access and manipulate specific rows and columns in a Series or DataFrame. Pandas provides multiple ways to index, including label-based indexing (loc), integer-based indexing (iloc), and boolean indexing.
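As a quick sketch of the three styles, using a small hypothetical DataFrame (the names and values here are illustrative only):

```python
import pandas as pd

# A small hypothetical DataFrame for illustration
df = pd.DataFrame(
    {"name": ["Alice", "Bob", "Carol"], "age": [34, 28, 45]},
    index=["a", "b", "c"],
)

# Label-based indexing with loc: row label "b", column "age"
print(df.loc["b", "age"])   # 28

# Integer-based indexing with iloc: first row, second column
print(df.iloc[0, 1])        # 34

# Boolean indexing: keep only rows where age is greater than 30
print(df[df["age"] > 30])
```

Note that loc selects by the labels you assigned, while iloc selects by position regardless of labels.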
Pandas can load data from various sources such as CSV, Excel, SQL databases, and JSON. Here is an example of loading a CSV file:
import pandas as pd
# Load a CSV file
data = pd.read_csv('example.csv')
Once the data is loaded, you can inspect it to understand its structure and content.
# View the first few rows
print(data.head())
# Get basic information about the data
print(data.info())
# Check the shape of the data
rows, columns = data.shape
if rows > 0 and columns > 0:
    print("DataFrame contains data.")
else:
    print("DataFrame is empty.")
# View descriptive statistics
print(data.describe())
Data cleaning involves handling missing values, outliers, and incorrect data.
# Drop rows with missing values
cleaned_data = data.dropna()
# Fill missing values with a specific value
filled_data = data.fillna(0)
Data transformation can include operations like sorting, grouping, and aggregating data.
# Sort the data by a column
sorted_data = data.sort_values(by='column_name')
# Group the data by a column and calculate the mean of the numeric columns
grouped_data = data.groupby('column_name').mean(numeric_only=True)
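To make the grouping step concrete, here is a sketch with a small hypothetical sales table (the column names are illustrative, not from any real dataset):

```python
import pandas as pd

# Hypothetical sales data for illustration
sales = pd.DataFrame({
    "region": ["east", "west", "east", "west"],
    "revenue": [100.0, 200.0, 300.0, 400.0],
})

# Group by region and average the numeric columns
avg_by_region = sales.groupby("region").mean(numeric_only=True)
print(avg_by_region)
# revenue: east -> 200.0, west -> 300.0
```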
Missing values are a common issue in real-world data. You can use different strategies to handle them, such as removing rows or columns with missing values, or filling them with the mean, the median, or a specific value.
# Fill missing values with the mean of the column
data['column_name'] = data['column_name'].fillna(data['column_name'].mean())
Duplicate rows can skew your analysis. You can easily remove them using the drop_duplicates method.
# Remove duplicate rows
unique_data = data.drop_duplicates()
Filtering data allows you to select specific rows based on certain conditions.
# Filter rows where a column meets a certain condition
filtered_data = data[data['column_name'] > 10]
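Conditions can also be combined. As a sketch with hypothetical columns: each condition needs its own parentheses, and you join them with & (and) or | (or) rather than Python's `and`/`or`:

```python
import pandas as pd

# Hypothetical data for illustration
data = pd.DataFrame({"price": [5, 15, 25], "qty": [3, 8, 1]})

# Combine conditions with &; each condition must be parenthesized
filtered = data[(data["price"] > 10) & (data["qty"] > 5)]
print(filtered)   # the single row with price 15, qty 8
```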
Pandas is designed to perform operations on entire arrays at once, which is known as vectorization. Vectorized operations are much faster than traditional Python loops.
# Vectorized addition
data['new_column'] = data['column1'] + data['column2']
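To see why vectorization matters, the sketch below computes the same sum both ways; the results match, but on large datasets the vectorized form is typically orders of magnitude faster because the loop runs in optimized native code rather than in the Python interpreter:

```python
import pandas as pd

df = pd.DataFrame({"column1": [1, 2, 3], "column2": [10, 20, 30]})

# Vectorized: one operation over the whole columns at once
df["vectorized"] = df["column1"] + df["column2"]

# Equivalent explicit Python loop, element by element (much slower at scale)
looped = [a + b for a, b in zip(df["column1"], df["column2"])]

print(df["vectorized"].tolist())   # [11, 22, 33]
print(looped)                      # [11, 22, 33]
```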
When working with large datasets, memory usage can be a concern. You can optimize memory by using appropriate data types and downcasting numerical columns.
# Downcast a numerical column to a smaller data type
data['column_name'] = pd.to_numeric(data['column_name'], downcast='float')
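As a sketch of the savings, the example below (with made-up data) measures memory before and after downcasting; a float64 column downcast to float32 uses roughly half the memory:

```python
import pandas as pd

# Hypothetical float column, stored as 64-bit by default
s = pd.Series([0.5] * 1000)
before = s.memory_usage(deep=True)

# Downcast to the smallest float type that can hold the values (float32 here)
small = pd.to_numeric(s, downcast="float")
after = small.memory_usage(deep=True)

print(small.dtype)      # float32
print(after < before)   # True
```

Be aware that downcasting floats reduces precision, so it is best reserved for columns where float32 precision is sufficient.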
Data wrangling is a crucial step in the data analysis pipeline, and Python Pandas provides a comprehensive set of tools to handle this task effectively. By understanding the fundamental concepts, usage methods, common practices, and best practices, you can efficiently clean, transform, and analyze your data. With Pandas, you can save time and effort in data preprocessing, allowing you to focus on extracting valuable insights from your data.