A Pandas DataFrame is a two-dimensional labeled data structure with columns of potentially different types. Each row in a DataFrame represents an observation or a record. When you load a CSV file using pandas.read_csv(), the data is stored in a DataFrame, and rows can be accessed and manipulated using various indexing and slicing techniques.
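To make the "rows are records, columns have types" idea concrete, here is a minimal sketch. The CSV content and the column names (name, age, height_m) are made up for illustration, and io.StringIO stands in for a real file path.

```python
import io

import pandas as pd

# Hypothetical CSV content; pd.read_csv accepts a file path in the same way.
csv_text = "name,age,height_m\nalice,34,1.70\nbob,28,1.82\n"

df = pd.read_csv(io.StringIO(csv_text))

print(df.shape)   # (number of rows, number of columns)
print(df.dtypes)  # one dtype per column, inferred from the CSV text
print(df.head())  # first rows, one record per row
```

Each CSV line after the header becomes one row, and pandas infers a separate type for every column.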
Pandas provides two main ways to access rows:
Label-based indexing (loc): Allows you to access rows by their index labels.
Integer-based indexing (iloc): Allows you to access rows by their integer positions.

import pandas as pd
# Load a CSV file into a DataFrame
file_path = 'example.csv'
df = pd.read_csv(file_path)
Accessing rows with loc
# Access a single row by its index label
single_row = df.loc['row_label']
# Access multiple rows by their index labels
multiple_rows = df.loc[['label1', 'label2']]
# Access a range of rows by their index labels
range_rows = df.loc['start_label':'end_label']
Accessing rows with iloc
# Access a single row by its integer position
single_row_iloc = df.iloc[0]
# Access multiple rows by their integer positions
multiple_rows_iloc = df.iloc[[0, 1, 2]]
# Access a range of rows by their integer positions
range_rows_iloc = df.iloc[0:3]
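The label-based snippets above only work if the DataFrame actually has those labels in its index. By default read_csv() assigns the integer labels 0, 1, 2, and so on. The sketch below assumes a made-up CSV whose name column is promoted to the index with index_col, and it shows one difference that often surprises people: a loc slice includes its end label, while an iloc slice excludes its end position.

```python
import io

import pandas as pd

# Hypothetical CSV content used only for this illustration.
csv_text = """name,score
alice,90
bob,72
carol,85
dave,64
"""

# index_col turns the 'name' column into the row labels that loc looks up.
df = pd.read_csv(io.StringIO(csv_text), index_col="name")

by_label = df.loc["bob":"carol"]  # label slice: both endpoints are included
by_position = df.iloc[1:3]        # position slice: the end position is excluded

print(by_label.equals(by_position))  # True: both select bob and carol
print(len(df.loc["alice":"carol"]), len(df.iloc[0:2]))  # 3 2
```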
# Iterate over rows using iterrows()
for index, row in df.iterrows():
    print(f"Index: {index}, Row: {row}")
# Filter rows where a column value meets a certain condition
filtered_df = df[df['column_name'] > 10]
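Filters can also combine several conditions. The following is a small sketch with invented data. Note that each condition needs its own parentheses, and that conditions are combined with & (and), | (or), and ~ (not) rather than the Python keywords.

```python
import pandas as pd

# Made-up data for illustration.
df = pd.DataFrame(
    {
        "city": ["Oslo", "Lima", "Pune", "Kyiv"],
        "temp": [4, 19, 31, 12],
        "rain": [True, False, False, True],
    }
)

# Build a boolean mask: warm AND not rainy.
mask = (df["temp"] > 10) & (~df["rain"])
filtered = df[mask]

# loc accepts the same mask and can select columns at the same time.
warm_cities = df.loc[mask, "city"]

print(filtered)
print(list(warm_cities))
```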
# Create a new row as a dictionary
new_row = {'col1': 1, 'col2': 2, 'col3': 3}
# Append the new row to the DataFrame
# (DataFrame.append was removed in pandas 2.0; pd.concat is the replacement)
df = pd.concat([df, pd.DataFrame([new_row])], ignore_index=True)
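When several rows need to be added, it is usually cheaper to collect them first and add them in a single step, because every addition creates a new DataFrame. Here is a minimal sketch that reuses the column names from the example above. The values are invented.

```python
import pandas as pd

df = pd.DataFrame({"col1": [1], "col2": [2], "col3": [3]})

# Collect the new rows as a list of dictionaries first...
new_rows = [
    {"col1": 10, "col2": 20, "col3": 30},
    {"col1": 11, "col2": 21, "col3": 31},
]

# ...then concatenate once. ignore_index=True renumbers the rows 0..n-1.
df = pd.concat([df, pd.DataFrame(new_rows)], ignore_index=True)

print(df)
```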
# Delete rows by index labels
df = df.drop(['label1', 'label2'])
# Delete rows by integer positions
df = df.drop(df.index[[0, 1]])
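A few details of drop() are easy to miss, so here is a short sketch with made-up labels. drop() returns a new DataFrame and leaves the original untouched unless you reassign the result, errors="ignore" skips labels that do not exist, and "deleting by condition" is usually written as a filter that keeps the remaining rows.

```python
import pandas as pd

# Invented data with string index labels.
df = pd.DataFrame({"value": [5, 15, 25]}, index=["a", "b", "c"])

without_a = df.drop("a")                     # new DataFrame; df still has 3 rows
safe = df.drop(["a", "zzz"], errors="ignore")  # the missing label 'zzz' is skipped
kept = df[df["value"] <= 20]                 # drop rows where value > 20 by keeping the rest

print(len(df), list(without_a.index), list(safe.index), list(kept.index))
```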
When working with large CSV files, consider loading only the necessary columns using the usecols parameter in read_csv().
df = pd.read_csv(file_path, usecols=['col1', 'col2'])
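If the file is still too large to load at once, read_csv() can also return the data in pieces through its chunksize parameter. This sketch uses a tiny in-memory CSV (an assumption made so the example is self-contained) and sums one column chunk by chunk.

```python
import io

import pandas as pd

# Tiny stand-in for a large file; a real path works the same way.
csv_text = "col1,col2,col3\n1,2,3\n4,5,6\n7,8,9\n"

# Load only two of the three columns.
subset = pd.read_csv(io.StringIO(csv_text), usecols=["col1", "col2"])

# Process the file two rows at a time instead of all at once.
total = 0
for chunk in pd.read_csv(io.StringIO(csv_text), usecols=["col1"], chunksize=2):
    total += chunk["col1"].sum()

print(list(subset.columns), total)
```

Each chunk is an ordinary DataFrame, so the row operations described in this article apply to it unchanged.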
Avoid using iterrows() for large datasets as it can be slow. Instead, use vectorized operations whenever possible. For example, to perform an operation on a column for all rows:
df['new_col'] = df['old_col'] * 2
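To show that the two approaches give the same result, the sketch below computes the doubled column once with iterrows() and once as a vectorized expression. It also shows itertuples(), a lighter-weight option for cases where a loop is genuinely needed. The data and the size labels are invented.

```python
import pandas as pd

df = pd.DataFrame({"old_col": [1, 2, 3]})

# Row-by-row version: works, but pays Python overhead for every row.
doubled_loop = []
for _, row in df.iterrows():
    doubled_loop.append(row["old_col"] * 2)

# Vectorized version: one expression over the whole column.
df["new_col"] = df["old_col"] * 2

# If a loop is unavoidable, itertuples() yields lightweight namedtuples.
labels = []
for row in df.itertuples(index=False):
    labels.append("big" if row.old_col > 1 else "small")

print(doubled_loop, list(df["new_col"]), labels)
```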
When accessing rows, always check if the index or position exists to avoid KeyError or IndexError.
if 'label' in df.index:
    row = df.loc['label']
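The same idea applies to integer positions. The two helper functions below, row_by_label and row_by_position, are hypothetical names invented for this sketch and are not part of pandas. They return None instead of raising when the row does not exist.

```python
import pandas as pd

df = pd.DataFrame({"value": [10, 20]}, index=["a", "b"])


def row_by_label(frame, label):
    """Return the row for `label`, or None if the label is not in the index."""
    return frame.loc[label] if label in frame.index else None


def row_by_position(frame, pos):
    """Return the row at integer position `pos`, or None if it is out of range."""
    # Negative positions count from the end, as with Python lists.
    return frame.iloc[pos] if -len(frame) <= pos < len(frame) else None


print(row_by_label(df, "a"))
print(row_by_label(df, "z"))     # None: no such label
print(row_by_position(df, 5))    # None: only two rows exist
```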
Working with rows in a Pandas DataFrame loaded from a CSV file is a fundamental skill in data analysis. Once you understand label-based and position-based indexing, along with the common patterns for filtering, adding, and deleting rows, you can manipulate and analyze your data efficiently.
What is the difference between loc and iloc?
loc is used for label-based indexing, meaning you access rows using their index labels. iloc is used for integer-based indexing, where you access rows using their integer positions.
Why is iterrows() slow for large datasets?
iterrows() is slow because it builds a new Series object for every row, so Python pays a fixed overhead on each iteration. Vectorized operations are faster because they process whole columns at once in optimized compiled code.
How can I handle rows with missing values?
You can use methods like dropna() to remove rows with missing values or fillna() to fill the missing values with a specified value.
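As a rough sketch of both options, the example below reads a made-up CSV in which one score is empty. read_csv() turns the empty field into NaN, and the two methods then handle it in different ways.

```python
import io

import pandas as pd

# Hypothetical CSV: bob has no score, so pandas reads it as NaN.
csv_text = "name,score\nalice,90\nbob,\ncarol,85\n"
df = pd.read_csv(io.StringIO(csv_text))

missing_count = df["score"].isna().sum()      # how many scores are missing

dropped = df.dropna(subset=["score"])         # remove rows where score is missing
filled = df.fillna({"score": 0})              # or replace missing scores with 0

print(missing_count, len(dropped), list(filled["score"]))
```

Both methods return new DataFrames, so df itself still contains the missing value afterwards.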