Pandas provides two primary data structures: Series, a one-dimensional labeled array, and DataFrame, a two-dimensional labeled table.
import numpy as np
import pandas as pd

# Create a Series (np.nan marks a missing value)
s = pd.Series([1, 3, 5, np.nan, 6, 8])
print(s)

# Create a DataFrame from a dict that maps column names to values
data = {'Name': ['Alice', 'Bob', 'Charlie'],
        'Age': [25, 30, 35]}
df = pd.DataFrame(data)
print(df)
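To make the relationship between the two structures concrete, the sketch below shows that selecting a single DataFrame column yields a Series, and that a Series can carry custom labels. The names `temps` and `people` are illustrative and not part of the original example.

```python
import numpy as np
import pandas as pd

# A Series is a one-dimensional labeled array; here the labels are strings.
temps = pd.Series([21.5, np.nan, 19.0], index=["mon", "tue", "wed"], name="temp_c")

# A DataFrame is a table whose columns are Series that share one index.
people = pd.DataFrame({"Name": ["Alice", "Bob", "Charlie"], "Age": [25, 30, 35]})
age_column = people["Age"]  # selecting one column returns a Series

print(type(age_column).__name__)  # Series
print(temps.isna().sum())         # 1 missing value
print(people.dtypes)              # one dtype per column
```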
Indexing in Pandas lets you access and modify specific elements or subsets of data. The two main indexers are loc, which selects by label, and iloc, which selects by integer position.
# Using loc for label-based indexing (the default index labels are 0, 1, 2)
print(df.loc[0, 'Name'])
# Using iloc for integer-position-based indexing
print(df.iloc[0, 0])
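With the default index, labels and positions coincide, which hides the difference between the two indexers. The following sketch assumes a hypothetical frame `df_idx` with the non-default index labels 10, 20, and 30 so that the two behave visibly differently, including in how they slice.

```python
import pandas as pd

df_idx = pd.DataFrame(
    {"Name": ["Alice", "Bob", "Charlie"], "Age": [25, 30, 35]},
    index=[10, 20, 30],  # labels no longer match positions
)

by_label = df_idx.loc[10, "Name"]    # row labeled 10
by_position = df_idx.iloc[0, 0]      # first row, first column

# Label slices include both endpoints; positional slices exclude the end.
label_slice = df_idx.loc[10:20, "Age"]   # rows labeled 10 and 20
pos_slice = df_idx.iloc[0:1, 1]          # only the first row

# loc can also be used to modify a single cell.
df_idx.loc[20, "Age"] = 31
```

Here `df_idx.loc[0, "Name"]` would raise a `KeyError`, because no row carries the label 0.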
Pandas can read data from various file formats such as CSV, Excel, SQL databases, etc., and also export data to these formats.
# Reading a CSV file
csv_df = pd.read_csv('data.csv')
# Writing to a CSV file (index=False leaves out the row index column)
csv_df.to_csv('output.csv', index=False)
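A round trip through CSV can be tried without touching the file system. The sketch below uses an in-memory `io.StringIO` buffer as a stand-in for a file; the frame `original` is made up for illustration. It also shows what happens when the row index is written out and read back.

```python
import io

import pandas as pd

original = pd.DataFrame({"Name": ["Alice", "Bob"], "Age": [25, 30]})

# Write without the row index, then read the same text back.
buffer = io.StringIO()
original.to_csv(buffer, index=False)
buffer.seek(0)
restored = pd.read_csv(buffer)

# By default to_csv also writes the index, which comes back as an extra column.
buffer_with_index = io.StringIO()
original.to_csv(buffer_with_index)
buffer_with_index.seek(0)
with_index = pd.read_csv(buffer_with_index)

print(list(restored.columns))    # ['Name', 'Age']
print(list(with_index.columns))  # ['Unnamed: 0', 'Name', 'Age']
```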
Data cleaning is an essential step in data science. Pandas provides methods to handle missing values, duplicate rows, and incorrect data types.
# Handling missing values
df_with_nan = pd.DataFrame({'A': [1, np.nan, 3], 'B': [4, 5, np.nan]})
df_filled = df_with_nan.fillna(0)
# Removing duplicate rows
duplicated_df = pd.DataFrame({'A': [1, 1, 2], 'B': [3, 3, 4]})
df_no_duplicates = duplicated_df.drop_duplicates()
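The examples above cover missing values and duplicates but not incorrect data types. One possible approach, sketched here on a hypothetical frame `raw` whose numeric columns arrived as strings, is to convert with `pd.to_numeric` and turn unparseable entries into NaN before filling them.

```python
import pandas as pd

# Hypothetical raw data: numbers stored as text, with one unparseable entry.
raw = pd.DataFrame({"price": ["10.5", "n/a", "7"], "qty": ["1", "2", "3"]})

# errors="coerce" converts values that cannot be parsed into NaN.
raw["price"] = pd.to_numeric(raw["price"], errors="coerce")
raw["qty"] = pd.to_numeric(raw["qty"])

# Fill the resulting gap with the median of the valid prices.
raw["price"] = raw["price"].fillna(raw["price"].median())

print(raw.dtypes)
```

Filling with the median is only one choice; dropping the row or flagging it for review may suit other datasets better.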
Pandas offers a wide range of methods for data manipulation, such as filtering, sorting, and aggregating data.
# Filtering data
filtered_df = df[df['Age'] > 25]
# Sorting data
sorted_df = df.sort_values(by='Age')
# Aggregating data
grouped = df.groupby('Name').sum()
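Grouping is easier to see when a key repeats. The sketch below assumes a made-up `sales` frame with `Region` and `Units` columns and combines filtering, named aggregation, and sorting.

```python
import pandas as pd

sales = pd.DataFrame({
    "Region": ["North", "South", "North", "South", "North"],
    "Units": [10, 4, 7, 12, 3],
})

# Filtering: keep rows with at least 5 units.
big_orders = sales[sales["Units"] >= 5]

# Aggregating: several statistics per group, with explicit output names.
summary = (
    sales.groupby("Region")
    .agg(total_units=("Units", "sum"), mean_units=("Units", "mean"))
    .sort_values("total_units", ascending=False)  # Sorting the result
)

print(summary)
```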
Chaining multiple Pandas operations together can make the code more concise and readable.
result = (pd.read_csv('data.csv')
          .dropna()
          .sort_values(by='column_name')
          .groupby('another_column')
          .sum())
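The chain above depends on a file and on placeholder column names. A self-contained variant, assuming invented `team` and `score` columns read from an in-memory string, might look like this; `assign` with a lambda keeps a type conversion inside the chain.

```python
import io

import pandas as pd

# Illustrative CSV text standing in for 'data.csv'; one score is missing.
csv_text = """team,score
a,3
b,
a,5
b,2
"""

result = (
    pd.read_csv(io.StringIO(csv_text))
    .dropna(subset=["score"])                        # drop the incomplete row
    .assign(score=lambda d: d["score"].astype(int))  # float -> int after dropna
    .groupby("team", as_index=False)["score"]
    .sum()
    .sort_values("score", ascending=False)
)

print(result)
```

Each step returns a new object, so no intermediate variables are needed and the steps read top to bottom.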
Pandas is optimized for vectorized operations, which are much faster than traditional Python loops.
# Vectorized addition
df['New_Age'] = df['Age'] + 5
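For comparison, the sketch below computes the same result with an explicit Python loop and with a vectorized expression, and adds a conditional column using `np.where`. The frame `ages` and the `Group` column are illustrative.

```python
import numpy as np
import pandas as pd

ages = pd.DataFrame({"Age": [25, 30, 35]})

# Loop version: processes one Python object at a time.
looped = [age + 5 for age in ages["Age"]]

# Vectorized version: a single operation over the whole column.
ages["New_Age"] = ages["Age"] + 5

# Vectorized conditional: avoids an if/else inside a loop.
ages["Group"] = np.where(ages["Age"] >= 30, "30+", "under 30")

print(ages)
```

Both versions give the same numbers. The difference is speed, which becomes noticeable as the number of rows grows.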
When dealing with large datasets, memory usage can be a bottleneck. You can optimize memory by using appropriate data types.
# Convert data types to save memory (int8 holds -128 to 127, enough for ages)
df['Age'] = df['Age'].astype('int8')
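One way to check whether a conversion pays off is to measure memory before and after with `memory_usage(deep=True)`. The sketch below builds a synthetic frame (`big`, with made-up `Age` and `City` columns), downcasts the integers, and stores the repeated strings as a categorical.

```python
import numpy as np
import pandas as pd

n = 10_000
big = pd.DataFrame({
    "Age": np.arange(n, dtype="int64") % 100,  # values 0..99
    "City": ["Paris", "Tokyo"] * (n // 2),     # only two distinct strings
})
before = big.memory_usage(deep=True).sum()

slim = big.copy()
# downcast picks the smallest integer type that can hold every value.
slim["Age"] = pd.to_numeric(slim["Age"], downcast="integer")
# A categorical stores each distinct string once plus small integer codes.
slim["City"] = slim["City"].astype("category")
after = slim.memory_usage(deep=True).sum()

print(slim.dtypes)
print(f"before: {before} bytes, after: {after} bytes")
```

Using `downcast` rather than a hard-coded `astype('int8')` avoids silently wrapping values that fall outside the range of the target type.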
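Some Pandas operations create copies of the data, which can be memory-intensive. Passing inplace=True modifies the existing object instead of returning a new one, but many methods still build a copy internally, so treat it as a convenience rather than a guaranteed memory saving.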
# Drop a column in place
df.drop('column_to_drop', axis=1, inplace=True)
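The difference between the two calling styles can be seen in a small sketch. The frame `frame` and its column names are hypothetical; the point is only what each call returns and which object changes.

```python
import pandas as pd

frame = pd.DataFrame({"keep": [1, 2], "column_to_drop": [3, 4]})

# Without inplace: a new DataFrame is returned and the original is untouched.
trimmed = frame.drop(columns="column_to_drop")
original_columns = list(frame.columns)

# With inplace=True: the original is modified and the call returns None.
returned = frame.drop(columns="column_to_drop", inplace=True)

print(original_columns)     # ['keep', 'column_to_drop']
print(list(frame.columns))  # ['keep']
print(returned)             # None
```

Because the in-place call returns `None`, it cannot be used inside a method chain, which is one reason the reassignment style is often preferred.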
It’s important to test your Pandas code to ensure its correctness. You can use a library such as pytest to write unit tests for your data manipulation functions.
import pandas as pd
import pytest

def test_dataframe_shape():
    # pytest collects functions whose names start with test_
    df = pd.DataFrame({'A': [1, 2], 'B': [3, 4]})
    assert df.shape == (2, 2)
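Checking the shape alone does not confirm the contents. A fuller test can compare whole frames with `pandas.testing.assert_frame_equal`. In the sketch below, `add_age_in_five_years` is a hypothetical function under test, written only for this illustration.

```python
import pandas as pd
from pandas.testing import assert_frame_equal


def add_age_in_five_years(df):
    """Hypothetical function under test: return a copy with a New_Age column."""
    return df.assign(New_Age=df["Age"] + 5)


def test_add_age_in_five_years():
    source = pd.DataFrame({"Name": ["Alice"], "Age": [25]})
    expected = pd.DataFrame({"Name": ["Alice"], "Age": [25], "New_Age": [30]})

    # Compares values, column names, index, and dtypes in one call.
    assert_frame_equal(add_age_in_five_years(source), expected)

    # The input should not have been modified.
    assert "New_Age" not in source.columns


# pytest would discover the test by name; it is called directly here
# so the snippet also runs as a plain script.
test_add_age_in_five_years()
```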
Pandas is a powerful tool for data science, but using it effectively requires following best practices. By understanding its core data structures, learning its main methods, and adopting the practices described here, data scientists can handle data more efficiently, reduce memory usage, and write more reliable code. Whether you are a beginner or an experienced data scientist, these practices will help you get the most out of Pandas in your data science projects.