Pandas automatically aligns data based on labels when performing operations between Series or DataFrames. This ensures that data is combined correctly even if the indices or columns are not in the same order.
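As a minimal sketch of label alignment, the snippet below adds two Series whose indices are in different orders, then adds a Series that shares only one label. The Series names and values are made up for illustration.

```python
import pandas as pd

# Two Series with the same labels in a different order
a = pd.Series([1, 2, 3], index=["x", "y", "z"])
b = pd.Series([10, 20, 30], index=["z", "y", "x"])

# Values are matched by label, not by position
total = a + b

# Labels present in only one operand produce NaN in the result
c = pd.Series([100, 200], index=["x", "w"])
partial = a + c

print(total)
print(partial)
```

Labels missing from one side yield NaN rather than raising an error, so it is worth checking for missing values after arithmetic on differently indexed objects.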
Pandas can read data from CSV files (read_csv), Excel files (read_excel), and SQL databases (read_sql).
It can write data to CSV (to_csv), Excel (to_excel), and SQL databases (to_sql).
Select a single column using df['column_name'] or multiple columns using df[['col1', 'col2']].
Use loc for label-based indexing (df.loc[row_label]) and iloc for position-based indexing (df.iloc[row_index]).
df[df['column'] > value] selects rows where the value in the specified column is greater than a given value.
Use groupby and then apply aggregation functions like sum, mean, or count to the grouped data. For example, df.groupby('column').sum() computes per-group sums.
Use isnull() or isna() to detect missing values in a DataFrame or Series.
Use dropna() to remove rows or columns containing missing values.
Use fillna() to fill missing values with a specified value, or propagate neighboring values with forward fill (ffill) or backward fill (bfill).
Use apply() to apply a function to each element in a Series or each row/column in a DataFrame.
Use smaller data types such as int8 or float32 instead of int64 or float64 if the data range allows.
Downcast numeric columns using pd.to_numeric() with the downcast parameter.
Use query() for complex filtering operations, as it can be more efficient than boolean indexing in some cases.

import pandas as pd
import numpy as np
# Create a sample DataFrame
data = {
    'Name': ['Alice', 'Bob', 'Charlie', 'David'],
    'Age': [25, 30, np.nan, 35],
    'Score': [85, 90, 78, 92]
}
df = pd.DataFrame(data)
# Reading and Writing Data
# Write to CSV
df.to_csv('sample.csv', index=False)
# Read from CSV
new_df = pd.read_csv('sample.csv')
# Data Selection
# Select a single column
ages = df['Age']
# Select multiple columns
name_score = df[['Name', 'Score']]
# Select a row by label (assuming the default index)
first_row = df.loc[0]
# Select a row by position
second_row = df.iloc[1]
# Data Filtering
# Filter rows where Score > 80
high_score = df[df['Score'] > 80]
# Data Aggregation
# Group by Age and calculate the mean score
age_grouped = df.groupby('Age')['Score'].mean()
# Handling Missing Data
# Detect missing values
missing = df.isnull()
# Remove rows with missing values
df_clean = df.dropna()
# Fill missing values with the mean age
mean_age = df['Age'].mean()
df_filled = df.fillna({'Age': mean_age})
# Data Transformation
# Apply a function to a column
def add_five(x):
    return x + 5
df['Score_plus_five'] = df['Score'].apply(add_five)
# Memory Optimization
# Downcast the Score column to int8
df['Score'] = pd.to_numeric(df['Score'], downcast='integer')
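The cheat sheet also mentions query(), forward and backward fill, and multiple aggregations on grouped data, which the example above does not exercise. The following is a minimal sketch of those three, not a definitive recipe. The Team column is a hypothetical addition for grouping, and the fills use the ffill() and bfill() methods directly.

```python
import numpy as np
import pandas as pd

# Same sample data as above, plus a hypothetical 'Team' column for grouping
df = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie", "David"],
    "Team": ["A", "B", "A", "B"],
    "Age": [25, 30, np.nan, 35],
    "Score": [85, 90, 78, 92],
})

# Boolean indexing and query() express the same filter
mask_result = df[(df["Score"] > 80) & (df["Age"] < 35)]
query_result = df.query("Score > 80 and Age < 35")

# Forward fill copies the previous valid value; backward fill copies the next one
age_ffill = df["Age"].ffill()
age_bfill = df["Age"].bfill()

# Several aggregations on one grouped column in a single call
team_stats = df.groupby("Team")["Score"].agg(["sum", "mean", "count"])

print(query_result)
print(team_stats)
```

Whether query() is actually faster than boolean indexing depends on the data size and the installed engine, so it is worth timing both on your own data.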
A Pandas data manipulation cheat sheet is a handy tool for Python developers working with structured data. By understanding the core concepts, typical usage methods, common practices, and best practices, developers can efficiently perform data manipulation tasks. With the provided code examples, developers can quickly implement these concepts in real-world scenarios.
A: Yes, but you may need memory optimization techniques such as choosing appropriate data types and downcasting. You can also process large files in chunks, for example by passing chunksize to read_csv.
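A minimal sketch of chunked reading, assuming the data is a CSV: passing chunksize to read_csv returns an iterator of DataFrames instead of one large DataFrame. The in-memory CSV and its id and value columns are made up so the snippet is self-contained. With a real file you would pass its path instead.

```python
import io

import pandas as pd

# Build a small CSV in memory to stand in for a large file (hypothetical data)
csv_text = "id,value\n" + "\n".join(f"{i},{i * 2}" for i in range(10))

total = 0
rows = 0
# Each iteration yields a DataFrame with at most 4 rows
for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=4):
    total += int(chunk["value"].sum())
    rows += len(chunk)

print(rows, total)
```

Only one chunk is held in memory at a time, so this pattern suits aggregations that can be accumulated incrementally.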
A: You can use the Categorical data type in Pandas. Convert a column to categorical using pd.Categorical(). This can save memory and improve performance when working with categorical variables.
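As a rough illustration of the memory effect, the sketch below converts a repetitive text column and compares memory usage. The color values are made up, and astype("category") is shown alongside pd.Categorical() as an equivalent way to convert.

```python
import pandas as pd

# A column with only three distinct values repeated many times (hypothetical data)
colors = pd.Series(["red", "green", "blue", "green"] * 1000)

# Two ways to get a categorical column
as_category = colors.astype("category")
via_constructor = pd.Series(pd.Categorical(colors))

# deep=True counts the bytes of the underlying strings as well
original_bytes = colors.memory_usage(deep=True)
category_bytes = as_category.memory_usage(deep=True)

print(original_bytes, category_bytes)
print(list(as_category.cat.categories))
```

The saving comes from storing each distinct value once and keeping small integer codes per row, so it shrinks as the number of distinct values approaches the number of rows.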
A: Yes, you can perform operations like merging, joining, and concatenating multiple DataFrames using functions like merge(), join(), and concat().
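A minimal sketch of the three operations, using two small made-up DataFrames (people and scores) that share an id column:

```python
import pandas as pd

# Hypothetical tables that share an 'id' key
people = pd.DataFrame({"id": [1, 2, 3], "name": ["Alice", "Bob", "Charlie"]})
scores = pd.DataFrame({"id": [2, 3, 4], "score": [90, 78, 92]})

# merge(): SQL-style join on a column; the default is an inner join
inner = people.merge(scores, on="id")

# how="left" keeps every row of the left table; unmatched scores become NaN
left = people.merge(scores, on="id", how="left")

# join(): combines on the index, left join by default
joined = people.set_index("id").join(scores.set_index("id"))

# concat(): stacks DataFrames; ignore_index=True renumbers the rows
stacked = pd.concat([people, people], ignore_index=True)

print(inner)
print(left)
```

merge() is usually the most flexible choice for key-based joins, while concat() suits appending tables that already have the same columns.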