import numpy as np
import pandas as pd

# Create a Series; np.nan marks a missing value
s = pd.Series([1, 3, 5, np.nan, 6, 8])
print(s)
# Create a DataFrame from a dictionary of columns
data = {'Name': ['Alice', 'Bob', 'Charlie'],
        'Age': [25, 30, 35]}
df = pd.DataFrame(data)
print(df)
A data pipeline typically consists of three main stages: extraction, transformation, and loading (ETL). In the context of Pandas, extraction means reading raw data into a DataFrame, transformation covers cleaning and reshaping that DataFrame, and loading means writing the result back out to a file or database.
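To make these stages concrete, here is a minimal sketch of a complete ETL flow; the file names raw_data.csv and clean_data.csv and the columns Column1 and Column2 are hypothetical placeholders:
import pandas as pd

# Extract: read the raw data (hypothetical file name)
df = pd.read_csv('raw_data.csv')

# Transform: drop incomplete rows and derive a new column
df = df.dropna()
df['Total'] = df['Column1'] + df['Column2']

# Load: write the cleaned data out (hypothetical file name)
df.to_csv('clean_data.csv', index=False)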
Pandas can read data from multiple file formats. For example, to read a CSV file:
import pandas as pd
df = pd.read_csv('data.csv')
print(df.head())
To read an Excel file (this requires an engine such as openpyxl to be installed):
df = pd.read_excel('data.xlsx')
print(df.head())
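read_excel can also target a specific worksheet through its sheet_name parameter; 'Sheet1' below is just an assumed sheet name:
# Read one worksheet by name ('Sheet1' is an assumed name)
df = pd.read_excel('data.xlsx', sheet_name='Sheet1')
print(df.head())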
Data cleaning involves removing or correcting incorrect, corrupted, or incomplete data. For example, to remove rows with missing values:
cleaned_df = df.dropna()
print(cleaned_df)
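dropna also takes parameters that control which rows are removed. A small sketch, assuming the DataFrame has an 'Age' column:
# Drop rows only when 'Age' is missing (assumed column name)
cleaned_df = df.dropna(subset=['Age'])
# Drop rows only when every value in the row is missing
cleaned_df = df.dropna(how='all')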
We can perform various transformations on the data. For example, adding a new column based on existing columns:
df['NewColumn'] = df['Column1'] + df['Column2']
print(df)
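For transformations that plain arithmetic cannot express, apply runs a function over each element of a column; the 'Age' column and the labels below are assumptions for illustration:
# Derive a categorical column from a numeric one ('Age' is assumed)
df['AgeGroup'] = df['Age'].apply(lambda a: 'senior' if a >= 30 else 'junior')
print(df)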
We can write the transformed data to a file. For example, writing to a CSV file:
df.to_csv('transformed_data.csv', index=False)
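Pandas provides similar writers for other formats; for example, to_json writes the same DataFrame as JSON (the output file name here is hypothetical):
# Write the DataFrame as JSON, one record per row
df.to_json('transformed_data.json', orient='records')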
We can fill missing values with a specific value, such as the mean of the column:
mean_value = df['Column'].mean()
df['Column'] = df['Column'].fillna(mean_value)
print(df)
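fillna also accepts a dictionary mapping column names to fill values, which helps when different columns need different defaults; the column names below are assumed:
# Fill each column with its own default (column names are assumed)
df = df.fillna({'Column': 0, 'Name': 'Unknown'})
print(df)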
To remove duplicate rows:
df = df.drop_duplicates()
print(df)
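drop_duplicates can also consider only a subset of columns and control which occurrence survives; 'Name' is an assumed column:
# Treat rows with the same 'Name' as duplicates and keep the first one
df = df.drop_duplicates(subset=['Name'], keep='first')
print(df)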
We can combine two or more DataFrames. For example, an inner join:
df1 = pd.DataFrame({'key': ['A', 'B', 'C'], 'value1': [1, 2, 3]})
df2 = pd.DataFrame({'key': ['B', 'C', 'D'], 'value2': [4, 5, 6]})
merged_df = pd.merge(df1, df2, on='key', how='inner')
print(merged_df)
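The how parameter selects the join type. With the same df1 and df2, an outer join keeps keys that appear in either frame and fills the gaps with NaN:
# Outer join: keys 'A' and 'D' survive, with NaN in the missing columns
outer_df = pd.merge(df1, df2, on='key', how='outer')
print(outer_df)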
Vectorized column operations are far faster than looping over rows with iterrows. Prefer the vectorized form whenever possible:
# Slow way: iterate row by row
for index, row in df.iterrows():
    df.at[index, 'NewColumn'] = row['Column1'] + row['Column2']
# Fast way: one vectorized column operation
df['NewColumn'] = df['Column1'] + df['Column2']
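To see the difference yourself, both approaches can be timed with the standard library; this sketch builds a synthetic DataFrame, so the size and column names are arbitrary:
import time
import numpy as np
import pandas as pd

# Synthetic data; size and column names are arbitrary
df = pd.DataFrame({'Column1': np.arange(10_000), 'Column2': np.arange(10_000)})

start = time.perf_counter()
for index, row in df.iterrows():
    df.at[index, 'NewColumn'] = row['Column1'] + row['Column2']
print(f"iterrows loop: {time.perf_counter() - start:.4f}s")

start = time.perf_counter()
df['NewColumn'] = df['Column1'] + df['Column2']
print(f"vectorized: {time.perf_counter() - start:.4f}s")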
When reading data, errors can occur. We can use try/except blocks to handle these errors gracefully.
try:
    df = pd.read_csv('data.csv')
except FileNotFoundError:
    print("The file was not found.")
Add comments to your code to explain what each section does. This makes the code more understandable for other developers and for future reference.
# Read the CSV file
df = pd.read_csv('data.csv')
Pandas is an essential library for building resilient data pipelines in Python. By understanding its fundamental concepts, usage methods, and best practices, you can create efficient and robust data pipelines. Whether you are working on small-scale data analysis projects or large-scale data processing, Pandas provides the tools you need to handle data effectively.