Pandas Data Manipulation Interview Questions

Pandas is a powerful open-source data analysis and manipulation library for Python. In technical interviews, especially those for data science, data analysis, and machine learning roles, pandas data manipulation questions are quite common. This blog covers the core concepts, typical usage methods, common practices, and best practices associated with pandas data manipulation interview questions. By the end of this article, intermediate-to-advanced Python developers will have a comprehensive understanding of these topics and be able to apply them in real-world scenarios.

Table of Contents

  1. Core Concepts
  2. Typical Usage Methods
  3. Common Practices
  4. Best Practices
  5. Code Examples
  6. Conclusion
  7. FAQ

Core Concepts

Data Structures

  • Series: A one-dimensional labeled array capable of holding any data type. It is similar to a column in a spreadsheet or a single variable in a statistical analysis.
  • DataFrame: A two-dimensional labeled data structure with columns of potentially different types. It is similar to a spreadsheet or a SQL table.
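A minimal sketch of both structures (the sample values and labels are illustrative, not from the article):

```python
import pandas as pd

# A Series: one-dimensional labeled array
s = pd.Series([10, 20, 30], index=["a", "b", "c"], name="score")

# A DataFrame: two-dimensional table with labeled rows and columns
df = pd.DataFrame({"name": ["Alice", "Bob"], "age": [25, 30]})

print(s["b"])     # label-based access into the Series -> 20
print(df.shape)   # (rows, columns) -> (2, 2)
```

A DataFrame column accessed with `df["age"]` is itself a Series, which is why the two structures share most of their API.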

Indexing and Selection

  • Label-based Indexing: Using labels (column names or index labels) to select data. For example, using df.loc[] to select rows and columns by label.
  • Position-based Indexing: Using integer positions to select data. For example, using df.iloc[] to select rows and columns by integer position.
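A short sketch of the difference, using made-up row labels; note that `loc` slices are inclusive of the end label while `iloc` slices exclude the end position:

```python
import pandas as pd

df = pd.DataFrame(
    {"city": ["Oslo", "Lima", "Kyoto"], "pop": [0.7, 9.7, 1.5]},
    index=["a", "b", "c"],
)

# loc: select by label (row label "b", column "city")
print(df.loc["b", "city"])     # Lima

# iloc: select by integer position (second row, first column)
print(df.iloc[1, 0])           # Lima

# Slicing: loc includes the end label, iloc excludes the end position
print(df.loc["a":"b"].shape)   # (2, 2)
print(df.iloc[0:1].shape)      # (1, 2)
```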

Data Cleaning

  • Handling Missing Values: Removing or filling missing values using methods like dropna() and fillna().
  • Duplicate Removal: Identifying and removing duplicate rows using drop_duplicates().
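The cleaning methods above can be sketched on a small synthetic frame (column names are illustrative):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": [1.0, np.nan, 3.0], "b": ["x", "y", "y"]})

# dropna: remove rows containing any missing value
dropped = df.dropna()
print(dropped.shape)                          # (2, 2)

# fillna: replace missing values, here with the column mean
df["a"] = df["a"].fillna(df["a"].mean())
print(df["a"].tolist())                       # [1.0, 2.0, 3.0]

# drop_duplicates: keep the first row for each distinct value of 'b'
print(df.drop_duplicates(subset="b").shape)   # (2, 2)
```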

Typical Usage Methods

Reading and Writing Data

  • Reading: Use functions like read_csv(), read_excel(), and read_sql() to read data from various sources into a DataFrame.
  • Writing: Use methods like to_csv(), to_excel(), and to_sql() to write a DataFrame to different file formats or databases.
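A self-contained round-trip sketch, using an in-memory `io.StringIO` buffer in place of a real file so no path is needed:

```python
import io
import pandas as pd

df = pd.DataFrame({"id": [1, 2], "name": ["Alice", "Bob"]})

buf = io.StringIO()
df.to_csv(buf, index=False)   # write; index=False skips the row index

buf.seek(0)
restored = pd.read_csv(buf)   # read the CSV text back into a DataFrame
print(restored.equals(df))    # True
```

With a real file you would pass a path such as `"data.csv"` instead of the buffer; `read_excel()` and `read_sql()` follow the same pattern with their own connection or path arguments.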

Data Selection and Filtering

  • Single Column Selection: Use the column name in square brackets, e.g., df['column_name'] to select a single column.
  • Multiple Column Selection: Pass a list of column names, e.g., df[['col1', 'col2']].
  • Row Filtering: Use boolean indexing, e.g., df[df['column'] > value] to filter rows based on a condition.

Aggregation and Grouping

  • Grouping: Use the groupby() method to group data based on one or more columns.
  • Aggregation: Apply aggregation functions like sum(), mean(), count() to the grouped data.
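A sketch of the split-apply-combine pattern on hypothetical department data, using `agg()` to apply several aggregations at once:

```python
import pandas as pd

df = pd.DataFrame({
    "dept": ["eng", "eng", "sales", "sales"],
    "salary": [70000, 80000, 50000, 60000],
})

# Group rows by department, then aggregate the salary column
by_dept = df.groupby("dept")["salary"].agg(["mean", "count"])
print(by_dept.loc["eng", "mean"])     # 75000.0
print(by_dept.loc["sales", "count"])  # 2
```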

Common Practices

Data Preprocessing

  • Scaling: Normalize numerical data using methods like min-max scaling or standardization.
  • Encoding Categorical Variables: Convert categorical variables into numerical values using techniques like one-hot encoding.
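Both steps can be done in plain pandas (scikit-learn offers equivalents); the column names here are illustrative:

```python
import pandas as pd

df = pd.DataFrame({"age": [20, 30, 40], "color": ["red", "blue", "red"]})

# Min-max scaling: map the numeric column onto [0, 1]
rng = df["age"].max() - df["age"].min()
df["age_scaled"] = (df["age"] - df["age"].min()) / rng
print(df["age_scaled"].tolist())   # [0.0, 0.5, 1.0]

# One-hot encoding: pd.get_dummies creates one indicator column per category
encoded = pd.get_dummies(df, columns=["color"])
print(sorted(c for c in encoded.columns if c.startswith("color_")))
# ['color_blue', 'color_red']
```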

Feature Engineering

  • Creating New Features: Combine existing columns to create new features. For example, calculate the ratio of two columns.
  • Binning: Divide numerical data into bins or intervals.
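Both ideas in a short sketch (the revenue/cost columns and bin edges are invented for illustration); `pd.cut` handles the binning:

```python
import pandas as pd

df = pd.DataFrame({"revenue": [100.0, 300.0, 900.0],
                   "cost": [50.0, 100.0, 450.0]})

# New feature from existing columns: profit margin as a ratio
df["margin"] = (df["revenue"] - df["cost"]) / df["revenue"]

# Binning: bucket revenue into labeled intervals
df["tier"] = pd.cut(df["revenue"], bins=[0, 200, 500, 1000],
                    labels=["low", "mid", "high"])
print(df["tier"].tolist())   # ['low', 'mid', 'high']
```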

Best Practices

Code Readability

  • Use Descriptive Variable Names: Name DataFrames, Series, and other variables in a way that clearly indicates their purpose.
  • Add Comments: Explain complex operations or the purpose of a block of code.

Performance Optimization

  • Avoid Unnecessary Copies: Chain operations rather than creating intermediate DataFrames you don't need. Note that inplace=True usually does not save memory in practice, since pandas often copies data internally anyway.
  • Use Vectorized Operations: Pandas is optimized for vectorized operations, so avoid Python loops for element-wise work.
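A quick contrast: the vectorized expression below runs as a single optimized operation over the whole column, where the (commented-out) loop version would invoke Python for every element:

```python
import numpy as np
import pandas as pd

s = pd.Series(np.arange(1_000_000))

# Vectorized: one call operates on the entire column at once
doubled = s * 2

# Equivalent Python loop -- orders of magnitude slower, shown for contrast:
# doubled = pd.Series([x * 2 for x in s])

print(doubled.iloc[-1])   # 1999998
```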

Code Examples

import pandas as pd
import numpy as np

# Create a sample DataFrame
data = {
    'Name': ['Alice', 'Bob', 'Charlie', 'David'],
    'Age': [25, 30, 35, 40],
    'Salary': [50000, 60000, 70000, 80000]
}
df = pd.DataFrame(data)

# 1. Select a single column
age_column = df['Age']
print("Single column selection:")
print(age_column)

# 2. Select multiple columns
name_salary = df[['Name', 'Salary']]
print("\nMultiple column selection:")
print(name_salary)

# 3. Filter rows based on a condition
filtered_df = df[df['Age'] > 30]
print("\nRow filtering:")
print(filtered_df)

# 4. Group by a column and calculate the mean of the numeric columns
# (numeric_only=True excludes the non-numeric 'Name' column, which
# would otherwise raise a TypeError in recent pandas versions)
grouped = df.groupby('Age').mean(numeric_only=True)
print("\nGrouping and aggregation:")
print(grouped)

# 5. Handling missing values
df_with_nan = pd.DataFrame({'col1': [1, np.nan, 3], 'col2': [4, 5, np.nan]})
filled_df = df_with_nan.fillna(0)
print("\nHandling missing values:")
print(filled_df)

# 6. Removing duplicates
df_with_duplicates = pd.DataFrame({'col': [1, 1, 2, 2]})
unique_df = df_with_duplicates.drop_duplicates()
print("\nRemoving duplicates:")
print(unique_df)

Conclusion

Pandas data manipulation is a crucial skill for data-related roles. By understanding the core concepts, typical usage methods, common practices, and best practices, developers can effectively answer pandas data manipulation interview questions and apply these skills in real-world data analysis and machine learning projects.

FAQ

Q1: What is the difference between loc and iloc?

A1: loc is label-based indexing, which means you use row and column labels to select data. iloc is position-based indexing, where you use integer positions to select data. A practical difference: slicing with loc includes the end label, while slicing with iloc excludes the end position.

Q2: How can I handle missing values in a DataFrame?

A2: You can use methods like dropna() to remove rows or columns with missing values, or fillna() to fill the missing values with a specific value (e.g., mean, median, or a constant).

Q3: What is the advantage of using vectorized operations in pandas?

A3: Vectorized operations are faster and more memory-efficient than using loops for element-wise operations because they are implemented in optimized C code under the hood.
