Pandas Data Manipulation Examples

pandas is a powerful, widely used Python library for data analysis and manipulation. It provides the data structures and functions needed to work efficiently with structured data, which makes it an essential tool for data scientists, analysts, and researchers. This blog post walks through data manipulation examples using pandas, to help intermediate-to-advanced Python developers understand these techniques and apply them in real-world scenarios.

Table of Contents#

  1. Core Concepts
  2. Typical Usage Methods
  3. Common Practices
  4. Best Practices
  5. Code Examples
  6. Conclusion
  7. FAQ
  8. References

Core Concepts#

DataFrames and Series#

  • Series: A one-dimensional labeled array capable of holding any data type (integers, strings, floating-point numbers, Python objects, etc.). It is similar to a column in a spreadsheet.
  • DataFrame: A two-dimensional labeled data structure with columns of potentially different types. It can be thought of as a spreadsheet or a SQL table.
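
As a minimal sketch of the two structures described above, the snippet below builds one of each. The names `ages` and `people` and the sample values are made up for illustration.

```python
import pandas as pd

# A Series: one-dimensional values, each paired with an index label.
ages = pd.Series([25, 30, 35], index=["alice", "bob", "charlie"], name="age")

# A DataFrame: a table whose columns may hold different data types.
people = pd.DataFrame(
    {"name": ["Alice", "Bob", "Charlie"], "age": [25, 30, 35]}
)

# Selecting a single column of a DataFrame gives back a Series.
age_column = people["age"]

print(ages["bob"])                # 30
print(type(age_column).__name__)  # Series
print(people.shape)               # (3, 2)
```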

Indexing#

  • Label-based indexing: Uses row and column labels to access data, for example column names or row index labels.
  • Position-based indexing: Uses integer positions to access data, similar to traditional Python list indexing.
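
The difference between the two indexing styles is easiest to see when the row labels are not integers. The sketch below uses a hypothetical `scores` table with string row labels.

```python
import pandas as pd

# Hypothetical data: row labels are strings, so labels and positions differ.
scores = pd.DataFrame(
    {"math": [90, 80, 70], "art": [60, 75, 85]},
    index=["x", "y", "z"],
)

# Label-based: row "y", column "art".
by_label = scores.loc["y", "art"]

# Position-based: second row, second column (the same cell).
by_position = scores.iloc[1, 1]

# Label slices include the end label; position slices exclude the stop.
label_slice = scores.loc["x":"y"]   # rows "x" and "y"
position_slice = scores.iloc[0:1]   # row "x" only

print(by_label, by_position)                  # 75 75
print(len(label_slice), len(position_slice))  # 2 1
```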

Data Alignment#

pandas automatically aligns data on the index when performing operations between different Series or DataFrames. Values are matched by label rather than by position, and labels that appear in only one operand produce NaN in the result.
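
To illustrate alignment, the sketch below adds two Series whose indexes only partly overlap. The names `q1` and `q2` and their values are made up for the example.

```python
import pandas as pd

q1 = pd.Series([10, 20, 30], index=["a", "b", "c"])
q2 = pd.Series([1, 2, 3], index=["b", "c", "d"])

# Values are matched by label, not by position.
# "a" and "d" exist in only one Series, so the result is NaN there.
total = q1 + q2

# Series.add with fill_value treats a missing side as 0 instead.
filled = q1.add(q2, fill_value=0)

print(total)
print(filled)
```

Here `total` holds 21 for "b" and 32 for "c", while `filled` also keeps 10 for "a" and 3 for "d".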

Typical Usage Methods#

Reading Data#

  • pandas.read_csv(): Reads data from a CSV file into a DataFrame.
  • pandas.read_excel(): Reads data from an Excel file into a DataFrame.
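
A minimal sketch of `read_csv()`: to keep the example self-contained it reads from an in-memory string rather than a file on disk, and the column names and values are invented. With a real file you would pass its path instead.

```python
import io

import pandas as pd

# Stand-in for the contents of a CSV file (hypothetical data).
csv_text = "name,age,salary\nAlice,25,50000\nBob,30,60000\n"

# read_csv accepts a path or any file-like object.
df = pd.read_csv(io.StringIO(csv_text))

print(df)
print(df.dtypes)  # age and salary are parsed as integers
```

`read_excel()` is called the same way with a path to a workbook, but it needs an Excel engine such as openpyxl to be installed.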

Data Selection#

  • By label: df.loc[] is used for label-based indexing. For example, df.loc[row_label, column_label].
  • By position: df.iloc[] is used for position-based indexing. For example, df.iloc[row_index, column_index].

Data Modification#

  • Adding columns: You can add a new column to a DataFrame by simply assigning a Series or a scalar value to a new column name. For example, df['new_column'] = new_series.
  • Updating values: You can update values in a DataFrame using indexing and assignment. For example, df.loc[row_label, column_label] = new_value.

Common Practices#

Data Cleaning#

  • Handling missing values: Use methods like dropna() to remove rows or columns with missing values, or fillna() to fill missing values with a specified value.
  • Removing duplicates: Use drop_duplicates() to remove duplicate rows from a DataFrame.
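
The cleaning methods above can be combined as in the following sketch. The `raw` table is hypothetical, and filling with the column mean is only one possible choice of fill value.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one repeated row and one missing age.
raw = pd.DataFrame(
    {
        "name": ["Alice", "Bob", "Bob", "Dana"],
        "age": [25.0, 30.0, 30.0, np.nan],
    }
)

# Remove the exact duplicate of the "Bob" row.
deduped = raw.drop_duplicates()

# Fill the missing age with the mean of the remaining ages (27.5 here).
filled = deduped.fillna({"age": deduped["age"].mean()})

print(filled)
```

Both methods return a new DataFrame, so `raw` itself is left unchanged.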

Aggregation#

  • Group data using groupby() and then apply aggregation functions such as sum(), mean(), count() on the grouped data. For example, df.groupby('column_name').sum().
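
As a sketch of grouping followed by aggregation, the snippet below uses a made-up `sales` table in which each region appears more than once, so the groups actually combine rows.

```python
import pandas as pd

# Hypothetical sales records.
sales = pd.DataFrame(
    {
        "region": ["north", "south", "north", "south"],
        "revenue": [100, 200, 150, 75],
    }
)

# One aggregate per group.
totals = sales.groupby("region")["revenue"].sum()

# Several aggregates at once with agg().
summary = sales.groupby("region")["revenue"].agg(["sum", "mean", "count"])

print(totals)
print(summary)
```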

Sorting#

  • Use sort_values() to sort a DataFrame by one or more columns. For example, df.sort_values(by='column_name').
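
Sorting by more than one column might look like the sketch below. The `staff` table is hypothetical; `ascending` takes one flag per column.

```python
import pandas as pd

# Hypothetical staff table.
staff = pd.DataFrame(
    {
        "dept": ["ops", "dev", "dev", "ops"],
        "salary": [50000, 70000, 65000, 55000],
    }
)

# Sort by department A-Z, then by salary from highest to lowest.
ordered = staff.sort_values(by=["dept", "salary"], ascending=[True, False])

print(ordered)
```

`sort_values()` returns a new DataFrame and keeps the original index labels, so the rows of `ordered` carry the labels 1, 2, 3, 0.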

Best Practices#

Use Vectorized Operations#

pandas is optimized for vectorized operations, which are much faster than traditional Python loops. For example, instead of using a for loop to add two columns, you can simply do df['new_column'] = df['col1'] + df['col2'].
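
The sketch below puts the loop and the vectorized form side by side on a tiny made-up table. Both give the same values; the speed difference only becomes noticeable on large data.

```python
import pandas as pd

df = pd.DataFrame({"col1": [1, 2, 3], "col2": [10, 20, 30]})

# Loop version: adds the columns one pair of values at a time.
looped = []
for a, b in zip(df["col1"], df["col2"]):
    looped.append(a + b)

# Vectorized version: adds the whole columns in one expression.
df["new_column"] = df["col1"] + df["col2"]

print(df["new_column"].tolist() == looped)  # True
```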

Chaining Operations#

Chain multiple operations together to make your code more concise and readable. For example, df.groupby('column_name').sum().sort_values(by='sum_column').
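
A longer chain is often easier to read when wrapped in parentheses with one step per line. The following sketch assumes a hypothetical `orders` table.

```python
import pandas as pd

# Hypothetical order data.
orders = pd.DataFrame(
    {
        "customer": ["a", "b", "a", "c"],
        "amount": [10, 5, 20, 40],
    }
)

# Group, aggregate, sort, and renumber the rows in a single expression.
top_customers = (
    orders
    .groupby("customer", as_index=False)["amount"]
    .sum()
    .sort_values(by="amount", ascending=False)
    .reset_index(drop=True)
)

print(top_customers)
```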

Memory Management#

When working with large datasets, use data types that consume less memory. For example, use astype() to convert columns to more memory-efficient data types, such as a smaller integer type for small numbers or category for a column with few distinct strings.
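
The effect of a dtype change can be checked with `memory_usage()`. The sketch below uses made-up data and assumes every value in `count` fits in an 8-bit integer (-128 to 127), which would not be safe for arbitrary data.

```python
import pandas as pd

# Hypothetical table: small integers and a few repeated strings.
df = pd.DataFrame(
    {
        "count": [1, 2, 3] * 1000,
        "city": ["Paris", "Lima", "Oslo"] * 1000,
    }
)

before = df.memory_usage(deep=True).sum()

# Assumption: all counts fit in int8; "city" has few distinct values.
small = df.astype({"count": "int8", "city": "category"})

after = small.memory_usage(deep=True).sum()

print(before, after)  # exact byte counts depend on the pandas version
```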

Code Examples#

import pandas as pd

# Build a small DataFrame from a dictionary of columns
# (with a real file you would call pd.read_csv() instead)
data = {
    'Name': ['Alice', 'Bob', 'Charlie', 'David'],
    'Age': [25, 30, 35, 40],
    'Salary': [50000, 60000, 70000, 80000]
}
df = pd.DataFrame(data)
 
# Data Selection
# Select a single column
ages = df['Age']
print("Ages column:")
print(ages)
 
# Select a single row by label (using index)
first_row = df.loc[0]
print("\nFirst row:")
print(first_row)
 
# Select a subset of rows and columns
subset = df.loc[1:2, ['Name', 'Salary']]
print("\nSubset of rows and columns:")
print(subset)
 
# Data Modification
# Add a new column
df['Bonus'] = [5000, 6000, 7000, 8000]
print("\nDataFrame after adding a new column:")
print(df)
 
# Update a value
df.loc[2, 'Salary'] = 75000
print("\nDataFrame after updating a value:")
print(df)
 
# Data Cleaning
# Create a DataFrame with missing values
data_with_nan = {
    'Name': ['Alice', 'Bob', None, 'David'],
    'Age': [25, None, 35, 40],
    'Salary': [50000, 60000, 70000, None]
}
df_nan = pd.DataFrame(data_with_nan)
 
# Drop rows with missing values
df_cleaned = df_nan.dropna()
print("\nDataFrame after dropping rows with missing values:")
print(df_cleaned)
 
# Aggregation
# Group by a column and sum the numeric columns
# (numeric_only=True skips the non-numeric Name column)
grouped = df.groupby('Age').sum(numeric_only=True)
print("\nGrouped data by Age and summed numeric columns:")
print(grouped)
 
# Sorting
sorted_df = df.sort_values(by='Salary')
print("\nDataFrame sorted by Salary:")
print(sorted_df)

Conclusion#

pandas is a versatile and powerful library for data manipulation in Python. By understanding the core concepts, typical usage methods, common practices, and best practices, intermediate-to-advanced Python developers can effectively handle and analyze structured data. The code examples provided in this blog demonstrate how to perform various data manipulation tasks, from data selection and modification to cleaning, aggregation, and sorting.

FAQ#

Q1: What is the difference between loc and iloc?#

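loc is used for label-based indexing, where you use row and column labels to access data. iloc is used for position-based indexing, where you use integer positions to access data. Slices also behave differently: a loc slice includes its end label, while an iloc slice excludes its stop position, as in ordinary Python slicing.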

Q2: How can I handle missing values in a more sophisticated way?#

Apart from dropping or filling with a single value, you can use interpolation methods like interpolate() to estimate missing values based on neighboring values.
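
A minimal sketch of interpolation on a made-up Series of readings. With no arguments, `interpolate()` fills gaps linearly, treating the values as equally spaced.

```python
import numpy as np
import pandas as pd

# Hypothetical sensor readings with gaps.
readings = pd.Series([1.0, np.nan, 3.0, np.nan, np.nan, 6.0])

# Linear interpolation between the known neighbors.
estimated = readings.interpolate()

print(estimated.tolist())  # [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
```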

Q3: Can I perform operations on multiple DataFrames at once?#

Yes, you can perform operations like merging, joining, and concatenating multiple DataFrames using methods such as merge(), join(), and concat().
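
As a rough sketch of two of these methods, the snippet below merges two hypothetical tables on a shared key and then stacks two tables with `concat()`. The table and column names are invented for the example.

```python
import pandas as pd

# Hypothetical tables sharing an "emp_id" key.
employees = pd.DataFrame({"emp_id": [1, 2, 3], "name": ["Alice", "Bob", "Cara"]})
salaries = pd.DataFrame({"emp_id": [1, 2, 4], "salary": [50000, 60000, 70000]})

# Inner merge: keeps only the keys present in both tables (1 and 2).
merged = employees.merge(salaries, on="emp_id", how="inner")

# concat: stacks rows; ignore_index=True renumbers them from 0.
new_hires = pd.DataFrame({"emp_id": [5], "name": ["Eve"]})
stacked = pd.concat([employees, new_hires], ignore_index=True)

print(merged)
print(stacked)
```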

References#