DataFrame, which can be thought of as a two-dimensional labeled data structure with columns of potentially different types. This blog will guide you through the process of efficiently manipulating DataFrames in Pandas, covering fundamental concepts, usage methods, common practices, and best practices.

A Pandas DataFrame is a 2D table-like structure with rows and columns. Each column can have a different data type (e.g., integers, strings, floats). It can be considered a collection of Series objects, where each Series represents a column. Rows and columns are labeled, which makes it easy to access and manipulate data.
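To make the "collection of Series" idea concrete, here is a minimal sketch. The sample data, including the `Score` column, is made up for illustration. It shows that selecting one column yields a Series with its own dtype, and that rows and columns both carry labels.

```python
import pandas as pd

# Hypothetical sample data, chosen only for illustration
df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [25, 30, 35],
    'Score': [88.5, 92.0, 79.5],
})

# Selecting a single column returns a Series
age_column = df['Age']
print(type(age_column).__name__)

# Each column keeps its own data type
print(df.dtypes)

# Row labels (the index) and column labels
row_labels = list(df.index)
column_labels = list(df.columns)
print(row_labels, column_labels)

# Every column shares the DataFrame's row index
shares_index = age_column.index.equals(df.index)
print(shares_index)
```

Because every column shares the same row index, operations between columns line up by label rather than by position.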
We can create a DataFrame in several ways. Here are some common methods:
import pandas as pd

# Create a DataFrame from a dictionary of lists (keys become column names)
data = {
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [25, 30, 35],
    'City': ['New York', 'Los Angeles', 'Chicago']
}
df = pd.DataFrame(data)
print(df)
# Create a DataFrame from a list of lists (one inner list per row)
data = [
    ['Alice', 25, 'New York'],
    ['Bob', 30, 'Los Angeles'],
    ['Charlie', 35, 'Chicago']
]
columns = ['Name', 'Age', 'City']
df = pd.DataFrame(data, columns=columns)
print(df)
# Select a single column
ages = df['Age']
print(ages)
# Select multiple columns
name_age = df[['Name', 'Age']]
print(name_age)
# Select a single row by index
first_row = df.loc[0]
print(first_row)
# Select rows based on a condition
adults = df[df['Age'] >= 30]
print(adults)
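The selection examples above use the default integer index, where labels and positions happen to coincide. The sketch below uses hypothetical string labels (`'a'`, `'b'`, `'c'`) to show the difference between label-based `.loc` and position-based `.iloc`, and how several conditions can be combined.

```python
import pandas as pd

df = pd.DataFrame(
    {
        'Name': ['Alice', 'Bob', 'Charlie'],
        'Age': [25, 30, 35],
        'City': ['New York', 'Los Angeles', 'Chicago'],
    },
    index=['a', 'b', 'c'],  # hypothetical row labels, for illustration only
)

# .loc selects by label, .iloc selects by integer position
by_label = df.loc['b']
by_position = df.iloc[1]

# Combine conditions with & (and) or | (or); each condition needs parentheses
selected = df[(df['Age'] >= 30) & (df['City'] != 'Chicago')]
print(selected)

# .loc can filter rows and pick a column in one step
names_of_adults = df.loc[df['Age'] >= 30, 'Name']
print(names_of_adults)
```

Python's `and` and `or` do not work element-wise on Series, which is why the sketch uses `&` and `|` with parentheses around each comparison.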
# Increase all ages by 1
df['Age'] = df['Age'] + 1
print(df)
# Add a new column
df['Country'] = ['USA', 'USA', 'USA']
print(df)
# Remove the 'Country' column (axis=1 refers to columns)
df = df.drop('Country', axis=1)
print(df)
# Group by a column and calculate the mean of another column
grouped = df.groupby('City')['Age'].mean()
print(grouped)
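In the example data every city appears only once, so each group's mean is just that person's age. The following sketch uses a hypothetical dataset with repeated cities and shows how several aggregations might be computed at once with `agg`.

```python
import pandas as pd

# Hypothetical data with repeated cities, for illustration only
df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'Dana'],
    'Age': [25, 30, 35, 45],
    'City': ['New York', 'Chicago', 'Chicago', 'New York'],
})

# Several aggregations of one column at once
summary = df.groupby('City')['Age'].agg(['mean', 'min', 'max', 'count'])
print(summary)

# Named aggregation: choose the output column names explicitly
named = df.groupby('City').agg(
    avg_age=('Age', 'mean'),
    people=('Name', 'count'),
)
print(named)
```

By default `groupby` sorts the group keys, so the result is indexed by city in alphabetical order.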
# Create another DataFrame
data2 = {
    'City': ['New York', 'Los Angeles', 'Chicago'],
    'Population': [8500000, 4000000, 2700000]
}
df2 = pd.DataFrame(data2)
# Merge the two DataFrames on the 'City' column
merged_df = pd.merge(df, df2, on='City')
print(merged_df)
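The merge above uses the default inner join, which works cleanly because every city appears in both DataFrames. The sketch below uses hypothetical frames named `people` and `cities`, whose keys only partly overlap, to show how the `how` parameter changes which rows are kept.

```python
import pandas as pd

# Hypothetical frames with partly overlapping 'City' values
people = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Dana'],
    'City': ['New York', 'Chicago', 'Boston'],
})
cities = pd.DataFrame({
    'City': ['New York', 'Chicago', 'Los Angeles'],
    'Population': [8500000, 2700000, 4000000],
})

# Inner join (the default): only cities present in both frames
inner = pd.merge(people, cities, on='City')

# Left join: keep every row of `people`; missing populations become NaN
left = pd.merge(people, cities, on='City', how='left')

# Outer join: keep all rows from both sides; indicator=True adds a
# '_merge' column recording where each row came from
outer = pd.merge(people, cities, on='City', how='outer', indicator=True)

print(inner)
print(left)
print(outer)
```

Checking the `_merge` column after an outer join is a quick way to spot keys that failed to match before relying on the merged result.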
Use vectorized operations: write df['column'] + constant instead of a for loop.
Drop columns in place with df.drop('column', axis=1, inplace=True) instead of re-assigning the result to a new variable. Be aware that inplace=True often still copies data internally, so the benefit is mainly stylistic rather than a guaranteed saving in time or memory.
Use appropriate data types: when the values fit, use np.int8 or np.int16 instead of np.int64 to save memory.

Manipulating DataFrames in Pandas is a crucial skill for data analysts and scientists. By understanding the fundamental concepts, usage methods, and best practices, you can efficiently perform data manipulation tasks such as selection, filtering, modification, aggregation, and merging. With practice, you will be able to handle large and complex datasets with ease.
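The vectorization and data-type tips from the best practices can be sketched as follows. The data is synthetic and the column name `Age` is reused only for illustration. Downcasting is safe here only because every value fits in the range of `np.int8` (-128 to 127).

```python
import numpy as np
import pandas as pd

# Synthetic data: 10,000 ages between 20 and 69, stored as int64
ages = np.arange(20, 70, dtype=np.int64).repeat(200)
df = pd.DataFrame({'Age': ages})

# Loop version: processes one element at a time in Python
looped = [age + 1 for age in df['Age']]

# Vectorized version: one operation over the whole column
vectorized = df['Age'] + 1

# Downcast to a smaller integer type (assumes all values fit in int8)
small = df['Age'].astype(np.int8)

# Compare the memory used by the column data itself
bytes_before = df['Age'].memory_usage(index=False)
bytes_after = small.memory_usage(index=False)
print(bytes_before, bytes_after)
```

Both versions of the increment produce the same values. The vectorized form is typically much faster on large columns, and the int8 column uses one byte per value instead of eight.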