DataFrame, which is a two-dimensional labeled data structure with columns of potentially different types. It is similar to a spreadsheet or a SQL table. Working through Pandas DataFrame exercises is an effective way for intermediate-to-advanced Python developers to strengthen their skills in data handling, cleaning, and analysis. These exercises build fluency with Pandas syntax and functionality and prepare developers for real-world, data-centric projects.
A DataFrame consists of rows and columns. Each column can be thought of as a Series object, which is a one-dimensional labeled array. Both rows and columns are labeled, and these labels can be used to access and manipulate the data.
Data can be selected with integer-position-based indexing (iloc) or label-based indexing (loc).
A DataFrame can hold a different data type in each column, such as integers, floating-point numbers, strings, and boolean values.
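The difference between loc and iloc is easiest to see on a frame whose row labels are not plain integers. The following is a minimal sketch; the frame name people and the row labels 'a', 'b', 'c' are made up for illustration.

```python
import pandas as pd

# Small illustrative frame; the row labels 'a', 'b', 'c' are hypothetical
people = pd.DataFrame(
    {"Name": ["Alice", "Bob", "Charlie"], "Age": [25, 30, 35]},
    index=["a", "b", "c"],
)

# Label-based: the row labelled 'b', the column labelled 'Age'
age_by_label = people.loc["b", "Age"]

# Position-based: second row, second column (positions start at 0)
age_by_position = people.iloc[1, 1]

# Label slices include both endpoints; position slices exclude the stop
rows_by_label = people.loc["a":"b"]
rows_by_position = people.iloc[0:2]

print(age_by_label, age_by_position)
print(people.dtypes)  # each column carries its own data type
```

Both lookups return the same cell here, and both slices return the first two rows, but they get there by different rules, which matters once rows are reordered or filtered.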
import pandas as pd
# Create a DataFrame from a dictionary
data = {
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [25, 30, 35],
    'City': ['New York', 'Los Angeles', 'Chicago']
}
df = pd.DataFrame(data)
print("Created DataFrame:")
print(df)
In this code, we first import the pandas library. Then we create a dictionary whose keys are the column names and whose values are lists holding each column's data. Finally, we pass this dictionary to the pd.DataFrame() constructor to create a DataFrame.
# Read a CSV file into a DataFrame
csv_df = pd.read_csv('data.csv')
# Write a DataFrame to a CSV file
df.to_csv('output.csv', index=False)
The read_csv() function reads data from a CSV file into a DataFrame. The to_csv() method writes the contents of a DataFrame to a CSV file. Passing index=False prevents the row index from being written to the file.
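To see what index=False changes, a round trip through a temporary directory can help. This is a sketch under assumptions: the file names people.csv and people_with_index.csv are hypothetical, and a temporary directory is used only so the example does not depend on any existing file.

```python
import os
import tempfile

import pandas as pd

df_out = pd.DataFrame({"Name": ["Alice", "Bob"], "Age": [25, 30]})

with tempfile.TemporaryDirectory() as tmp_dir:
    # Hypothetical file names, created only for this demonstration
    plain_path = os.path.join(tmp_dir, "people.csv")
    indexed_path = os.path.join(tmp_dir, "people_with_index.csv")

    # Without the row index: the file holds only the two data columns
    df_out.to_csv(plain_path, index=False)
    df_plain = pd.read_csv(plain_path)

    # With the default index=True: the index is written as an unnamed first column
    df_out.to_csv(indexed_path)
    df_indexed = pd.read_csv(indexed_path)

print(list(df_plain.columns))
print(list(df_indexed.columns))
```

When the index is written and then read back without further options, it shows up as an extra column, which is the usual reason for passing index=False.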
# Select a single column
ages = df['Age']
print("\nSelected 'Age' column:")
print(ages)
# Filter rows based on a condition
filtered_df = df[df['Age'] > 30]
print("\nFiltered DataFrame (Age > 30):")
print(filtered_df)
To select a single column, we use the column name as a key, which returns a Series. To filter rows, we place a boolean expression inside the square brackets; only the rows where it evaluates to True are kept.
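Boolean filters can also be combined. The sketch below rebuilds the same small frame so that it runs on its own; the variable names is_over_25 and in_chicago are illustrative. Note that conditions are joined with & and |, not with the Python keywords and and or.

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie"],
    "Age": [25, 30, 35],
    "City": ["New York", "Los Angeles", "Chicago"],
})

# Each comparison produces a boolean Series with one entry per row
is_over_25 = df["Age"] > 25
in_chicago = df["City"] == "Chicago"

# Combine masks element-wise: & means "and", | means "or"
both = df[is_over_25 & in_chicago]
either = df[is_over_25 | in_chicago]

# loc accepts a mask for rows and a list of labels for columns
name_and_age = df.loc[is_over_25, ["Name", "Age"]]

print(both)
print(either)
print(name_and_age)
```

When the comparisons are written inline rather than stored in variables, each one needs its own parentheses, for example df[(df["Age"] > 25) & (df["City"] == "Chicago")].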
import numpy as np
# Create a DataFrame with missing values
data_with_nan = {
    'Name': ['Alice', 'Bob', np.nan, 'David'],
    'Age': [25, np.nan, 35, 40],
    'City': ['New York', 'Los Angeles', 'Chicago', np.nan]
}
nan_df = pd.DataFrame(data_with_nan)
# Drop rows with missing values
cleaned_df = nan_df.dropna()
print("\nCleaned DataFrame (dropped rows with missing values):")
print(cleaned_df)
Here, we create a DataFrame with missing values represented by np.nan. The dropna() method removes every row that contains at least one missing value, so in this example only the first row (Alice) remains.
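Dropping every incomplete row can discard more data than intended. As a hedged sketch of a gentler approach, the example below first counts the missing values per column and then drops only the rows that lack a value in one chosen column; choosing 'Age' as that column is an assumption made for illustration.

```python
import numpy as np
import pandas as pd

nan_df = pd.DataFrame({
    "Name": ["Alice", "Bob", np.nan, "David"],
    "Age": [25, np.nan, 35, 40],
    "City": ["New York", "Los Angeles", "Chicago", np.nan],
})

# Count missing values in each column before deciding what to drop
missing_per_column = nan_df.isna().sum()

# Drop only the rows where 'Age' is missing; gaps in other columns are kept
has_age = nan_df.dropna(subset=["Age"])

# For comparison: the default drops any row with at least one missing value
fully_complete = nan_df.dropna()

print(missing_per_column)
print(len(has_age), len(fully_complete))
```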
# Create a DataFrame for aggregation
sales_data = {
    'Product': ['A', 'B', 'A', 'B'],
    'Sales': [100, 200, 150, 250]
}
sales_df = pd.DataFrame(sales_data)
# Calculate the total sales per product
total_sales = sales_df.groupby('Product')['Sales'].sum()
print("\nTotal sales per product:")
print(total_sales)
The groupby() method groups the data by the 'Product' column. We then select the 'Sales' column and apply the sum() function to calculate the total sales for each product.
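The same grouping can produce several statistics at once. The following sketch uses agg() with a list of aggregation names on the same sales data; the result name summary is illustrative.

```python
import pandas as pd

sales_df = pd.DataFrame({
    "Product": ["A", "B", "A", "B"],
    "Sales": [100, 200, 150, 250],
})

# One row per product, one column per aggregation
summary = sales_df.groupby("Product")["Sales"].agg(["sum", "mean", "count"])

print(summary)
```

The result is a DataFrame indexed by product, so a single figure can be read with, for example, summary.loc["A", "sum"].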
Use an int8 or int16 data type instead of the default int64 to save memory when a column's values fit in the smaller range.
Use descriptive names for DataFrames and columns.
Pandas DataFrame exercises are a great way to improve your data analysis skills in Python. By understanding the core concepts, typical usage, and common practices, you can effectively handle, clean, and analyze data using DataFrames. Following best practices helps keep your code efficient and maintainable.
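The memory advice about smaller integer types can be checked directly. This is a minimal sketch, assuming a column of 100 values between 0 and 99, all of which fit in int8 (whose range is -128 to 127); the frame name ages is illustrative.

```python
import numpy as np
import pandas as pd

# 100 small integers stored with 8 bytes each
ages = pd.DataFrame({"Age": np.arange(0, 100, dtype="int64")})
bytes_before = ages["Age"].memory_usage(index=False)

# Convert to a 1-byte integer type; safe here because every value fits
small_ages = ages["Age"].astype("int8")
bytes_after = small_ages.memory_usage(index=False)

print(bytes_before, bytes_after)
```

astype() does not warn when a value falls outside the target range, so it is worth checking the column's minimum and maximum before converting.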
Yes, you can perform arithmetic operations on columns in a DataFrame. For example, if you have two columns 'A' and 'B', you can create a new column 'C' with df['C'] = df['A'] + df['B'].
You can fill missing values with a specific value using the fillna() method. For example, df.fillna(0) returns a copy of the DataFrame in which every missing value is replaced with 0.
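A single fill value is not always appropriate for every column. As a sketch, fillna() also accepts a dictionary that maps column names to fill values; the frame scores, the placeholder 'Unknown', and the choice of the column mean are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

scores = pd.DataFrame({
    "Name": ["Alice", None, "Charlie"],
    "Score": [90.0, np.nan, 70.0],
})

# Fill a numeric column with a constant
score_zero_filled = scores["Score"].fillna(0)

# Fill each column with its own value: a placeholder name, the mean score
per_column = scores.fillna({
    "Name": "Unknown",
    "Score": scores["Score"].mean(),
})

print(score_zero_filled)
print(per_column)
```

fillna() returns a new object by default, so the original scores frame still contains its missing values after these calls.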
Yes, you can use vectorized operations to work on multiple columns at once. For example, to add two columns 'A' and 'B' and store the result in a new column 'C', use df['C'] = df['A'] + df['B'].
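A short sketch of what this looks like in practice follows. The column names 'A', 'B', and 'C' come from the text; the values and the extra names doubled and row_total are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 2, 3], "B": [10, 20, 30]})

# Element-wise addition of two columns, with no explicit loop
df["C"] = df["A"] + df["B"]

# The same operation applied to several columns at once
doubled = df[["A", "B"]] * 2

# Row-wise total across a chosen set of columns
row_total = df[["A", "B"]].sum(axis=1)

print(df)
print(doubled)
print(row_total)
```

Each expression operates on whole columns, which is generally faster and easier to read than looping over rows in Python.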