A Series in Pandas is a one-dimensional labeled array capable of holding any data type (integers, strings, floating-point numbers, Python objects, etc.). It can be thought of as a single column of a table.
import pandas as pd
# Create a Series from a list
data = [10, 20, 30, 40]
s = pd.Series(data)
print(s)
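The "labeled" part of the definition is easier to see with an explicit index. The following is a minimal sketch; the labels 'a' to 'd' are made up for illustration and are not part of the example above.

```python
import pandas as pd

# Index labels 'a'..'d' are illustrative assumptions
s = pd.Series([10, 20, 30, 40], index=['a', 'b', 'c', 'd'])

by_label = s['b']          # look up a value by its label
by_position = s.iloc[0]    # look up a value by its integer position
subset = s[['a', 'd']]     # select several labels at once

print(by_label, by_position)
print(subset)
```

When no index is passed, as in the example above, Pandas assigns the default integer labels 0, 1, 2, and so on.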
A DataFrame is a two-dimensional labeled data structure with columns of potentially different types. It is similar to a spreadsheet or a SQL table.
# Create a DataFrame from a dictionary
data = {
'Name': ['Alice', 'Bob', 'Charlie'],
'Age': [25, 30, 35]
}
df = pd.DataFrame(data)
print(df)
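To check that each column really does carry its own type, you can inspect the DataFrame after creating it. This short sketch reuses the same Name and Age data and only relies on the standard `shape` and `dtypes` attributes.

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [25, 30, 35],
})

print(df.shape)          # (number of rows, number of columns)
print(df.dtypes)         # one data type per column
print(type(df['Age']))   # a single column is a Series
```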
Pandas can load data from a variety of sources, including CSV files, Excel workbooks, and SQL databases.
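# Load data from a CSV file
# (assumes data.csv exists and contains the 'Name' and 'Age' columns used in the examples that follow)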
df = pd.read_csv('data.csv')
print(df.head())
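If you do not have a CSV file at hand, `read_csv` also accepts a file-like object, which makes it possible to try the call without touching the disk. The CSV text below is invented for illustration.

```python
import io
import pandas as pd

# Illustrative CSV content held in memory instead of a file on disk
csv_text = "Name,Age\nAlice,25\nBob,30\nCharlie,35\n"

df = pd.read_csv(io.StringIO(csv_text))
print(df.head(2))   # show only the first two rows
```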
You can select specific rows, columns, or cells from a DataFrame.
# Select a single column
ages = df['Age']
print(ages)
# Select rows based on a condition
young_people = df[df['Age'] < 30]
print(young_people)
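The examples above cover columns and rows; single cells and row ranges are usually reached with `loc` (label-based) and `iloc` (position-based). Here is a small sketch, using the Name and Age data from earlier as a stand-in for whatever your file contains.

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [25, 30, 35],
})

cell = df.loc[1, 'Name']                          # row label 1, column 'Name'
first_two = df.iloc[:2]                           # first two rows by position
names_30_plus = df.loc[df['Age'] >= 30, 'Name']   # condition and column together

print(cell)
print(first_two)
print(names_30_plus)
```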
Pandas allows you to perform various operations on data, such as adding new columns and modifying existing values.
# Add a new column
df['IsAdult'] = df['Age'] >= 18
print(df)
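Modifying existing values follows a similar pattern. The sketch below, again on the small Name and Age data, updates one cell through a condition and then changes a whole column at once; the specific values are chosen only for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [25, 30, 35],
})

# Change a single value: set Bob's age to 31
df.loc[df['Name'] == 'Bob', 'Age'] = 31

# Change every value in a column: add one year to each age
df['Age'] = df['Age'] + 1

print(df)
```

Assigning through `df.loc[condition, column]` updates the DataFrame itself, rather than a temporary copy.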
Missing data is a common issue in real-world datasets. Pandas provides methods to handle missing values.
# Create a DataFrame with missing values
data = {
'Name': ['Alice', 'Bob', None],
'Age': [25, None, 35]
}
df = pd.DataFrame(data)
# Drop rows with missing values
df_clean = df.dropna()
print(df_clean)
# Fill missing values with a specific value
df_filled = df.fillna({'Name': 'Unknown', 'Age': 0})
print(df_filled)
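Before dropping or filling anything, it often helps to see how much data is actually missing. A minimal sketch on the same DataFrame:

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', None],
    'Age': [25, None, 35],
})

# Count missing values in each column
missing_per_column = df.isna().sum()

# Keep only the rows that contain at least one missing value
rows_with_missing = df[df.isna().any(axis=1)]

print(missing_per_column)
print(rows_with_missing)
```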
Grouping data by one or more columns and performing aggregations is a powerful feature in Pandas.
# Group by a column and calculate the mean
grouped = df.groupby('Name')['Age'].mean()
print(grouped)
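Grouping is more informative when a key repeats across rows. The following sketch uses a hypothetical 'City' column, which does not appear in the examples above, and computes several aggregations in one call with `agg`.

```python
import pandas as pd

# 'City' is a hypothetical column added for illustration
df = pd.DataFrame({
    'City': ['Paris', 'Paris', 'Rome', 'Rome', 'Rome'],
    'Age': [20, 30, 40, 50, 60],
})

# One row per city, one column per aggregation
summary = df.groupby('City')['Age'].agg(['mean', 'count', 'max'])
print(summary)
```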
You can combine multiple DataFrames using different types of joins.
df1 = pd.DataFrame({'key': ['A', 'B', 'C'], 'value1': [1, 2, 3]})
df2 = pd.DataFrame({'key': ['B', 'C', 'D'], 'value2': [4, 5, 6]})
# Inner join
merged = pd.merge(df1, df2, on='key', how='inner')
print(merged)
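The inner join keeps only keys present in both DataFrames. For comparison, here is a sketch of a left join and an outer join on the same two frames; `indicator=True` adds a `_merge` column recording where each row came from.

```python
import pandas as pd

df1 = pd.DataFrame({'key': ['A', 'B', 'C'], 'value1': [1, 2, 3]})
df2 = pd.DataFrame({'key': ['B', 'C', 'D'], 'value2': [4, 5, 6]})

# Left join: keep every row of df1; unmatched value2 entries become NaN
left = pd.merge(df1, df2, on='key', how='left')

# Outer join: keep keys from both sides and record each row's origin
outer = pd.merge(df1, df2, on='key', how='outer', indicator=True)

print(left)
print(outer)
```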
When working with large datasets, memory management is crucial. You can optimize memory usage by choosing appropriate data types.
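# Convert a column to a more memory-efficient data type
# 'Int8' (capital I) is the nullable integer type; plain 'int8' raises an error
# when the column contains missing values, as the 'Age' column above does
df['Age'] = df['Age'].astype('Int8')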
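To see what a smaller data type actually saves, you can measure a column before and after conversion. This sketch uses synthetic ages (an assumption, not data from the examples above) and also shows `pd.to_numeric` with `downcast`, which picks the smallest integer type that fits.

```python
import numpy as np
import pandas as pd

# Synthetic data: 1000 ages between 0 and 99, stored as 64-bit integers
df = pd.DataFrame({'Age': np.arange(1000, dtype='int64') % 100})

before = df['Age'].memory_usage(index=False)   # bytes used by the values

small = df['Age'].astype('int8')               # explicit conversion
after = small.memory_usage(index=False)

# Let Pandas choose the smallest integer type that holds every value
auto = pd.to_numeric(df['Age'], downcast='integer')

print(before, after, auto.dtype)
```

Note that `int8` only holds values from -128 to 127, so check the range of your data before converting.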
Use vectorized operations instead of explicit Python loops whenever possible. Vectorized operations run in optimized compiled code rather than in the Python interpreter, so they are typically much faster.
# Vectorized operation
df['DoubleAge'] = df['Age'] * 2
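For contrast, here is a sketch of the loop-based version next to the vectorized one. Both produce the same values; the difference is that the loop handles one element at a time in Python, while the vectorized form works on the whole column in one operation.

```python
import pandas as pd

df = pd.DataFrame({'Age': [25, 30, 35]})

# Loop-based: processes one element at a time in Python
looped = [age * 2 for age in df['Age']]

# Vectorized: a single operation on the whole column
vectorized = df['Age'] * 2

print(looped)
print(vectorized.tolist())
```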
Python Pandas is an indispensable toolkit for data scientists. Its rich set of data structures and functions makes it easy to handle, analyze, and manipulate data. By understanding the fundamental concepts, usage methods, common practices, and best practices of Pandas, data scientists can significantly improve their productivity and the quality of their data analysis. Whether you are dealing with small or large datasets, Pandas provides the necessary tools to get the job done efficiently.