A Pandas DataFrame is a two-dimensional labeled data structure with columns of potentially different types. It is similar to a spreadsheet or a SQL table, where data is organized in rows and columns.
Regular expressions are sequences of characters that form a search pattern. They can be used to match, search, and manipulate text. In Python, the re module provides support for regular expressions. Pandas leverages this functionality to perform pattern-based filtering on DataFrame columns.
Pandas provides several methods to filter DataFrames using regular expressions. The most commonly used methods are str.contains(), str.match(), and str.fullmatch().
- str.contains(): Checks whether the string in each row of a Series contains the specified pattern anywhere.
- str.match(): Checks whether the string in each row of a Series matches the pattern at its beginning.
- str.fullmatch(): Checks whether the string in each row of a Series matches the pattern in its entirety.
str.contains()
import pandas as pd
# Create a sample DataFrame
data = {
'Name': ['John Doe', 'Jane Smith', 'Bob Johnson'],
'Email': ['john@example.com', 'jane@example.net', 'bob@example.org']
}
df = pd.DataFrame(data)
# Filter rows where the Email column contains '.org' (the dot is escaped to match it literally)
filtered_df = df[df['Email'].str.contains(r'\.org')]
print(filtered_df)
In this example, the str.contains() method is used to filter rows where the Email column contains the string .org. Note that str.contains() treats the pattern as a regular expression by default, so the dot should be escaped to match it literally rather than as a wildcard.
str.match()
# Filter rows where the Name column starts with 'Jane'
filtered_df = df[df['Name'].str.match('Jane')]
print(filtered_df)
Here, the str.match() method is used to filter rows where the Name column starts with the string Jane.
str.fullmatch()
# Filter rows where the Name column exactly matches 'John Doe'
filtered_df = df[df['Name'].str.fullmatch('John Doe')]
print(filtered_df)
The str.fullmatch() method is used to filter rows where the Name column exactly matches the string John Doe.
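To make the difference between the three methods concrete, the following sketch applies each of them to the same Series (the data and variable names here are illustrative, not from the examples above):

```python
import pandas as pd

s = pd.Series(['John Doe', 'Doe John', 'John'])

# contains: the pattern may appear anywhere in the string
print(s.str.contains('John').tolist())    # [True, True, True]

# match: the pattern must match at the start of the string
print(s.str.match('John').tolist())       # [True, False, True]

# fullmatch: the pattern must match the entire string
print(s.str.fullmatch('John').tolist())   # [False, False, True]
```

In short, fullmatch is the strictest, match anchors only the start, and contains anchors nothing.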
By default, regex matching in Pandas is case-sensitive. To perform case-insensitive matching, you can set the case parameter to False in the str.contains(), str.match(), or str.fullmatch() methods.
# Case-insensitive matching
filtered_df = df[df['Name'].str.contains('john', case=False)]
print(filtered_df)
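An equivalent approach is to pass re.IGNORECASE through the flags parameter, which the Pandas string methods forward to the re module. A minimal sketch, assuming the same Name data as the earlier examples:

```python
import re
import pandas as pd

df = pd.DataFrame({'Name': ['John Doe', 'Jane Smith', 'Bob Johnson']})

# flags=re.IGNORECASE has the same effect as case=False
filtered_df = df[df['Name'].str.contains('john', flags=re.IGNORECASE)]
print(filtered_df)
```

Both 'John Doe' and 'Bob Johnson' are kept, since each contains 'john' when case is ignored.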
When working with real-world data, there may be missing values in the columns. By default, str.contains(), str.match(), and str.fullmatch() propagate NaN for missing values, which cannot be used directly in a boolean mask. You can use the na parameter to specify how these missing values should be treated.
# Create a DataFrame with missing values
data_with_nan = {
'Name': ['John Doe', None, 'Bob Johnson'],
'Email': ['john@example.com', None, 'bob@example.net']
}
df_with_nan = pd.DataFrame(data_with_nan)
# Filter rows where the Email column contains '.net', treating missing values as non-matches
filtered_df = df_with_nan[df_with_nan['Email'].str.contains(r'\.net', na=False)]
print(filtered_df)
If you need to use the same regular expression multiple times, it is good practice to compile it using the re.compile() function. This can improve performance, especially when working with large datasets.
import re
# Compile a regular expression (the dot is escaped to match it literally)
pattern = re.compile(r'\.com')
filtered_df = df[df['Email'].str.contains(pattern)]
print(filtered_df)
When using regular expressions, you can use anchors like ^ (start of the string) and $ (end of the string) to perform more precise matching. For example, to match emails that end with .com, you can use the pattern \.com$.
# Filter rows where the Email column ends with '.com'
filtered_df = df[df['Email'].str.contains(r'\.com$')]
print(filtered_df)
import pandas as pd
# Create a sample DataFrame
data = {
'Phone': ['123-456-7890', 'abc-def-ghi', '234-567-8901']
}
df = pd.DataFrame(data)
# Filter rows where the Phone column matches a valid phone number pattern
pattern = r'^\d{3}-\d{3}-\d{4}$'
filtered_df = df[df['Phone'].str.match(pattern)]
print(filtered_df)
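The same validation can be expressed with str.fullmatch(), which requires the whole string to match and therefore needs no explicit anchors. A sketch, assuming the same Phone data:

```python
import pandas as pd

df = pd.DataFrame({'Phone': ['123-456-7890', 'abc-def-ghi', '234-567-8901']})

# fullmatch anchors the pattern at both ends implicitly
filtered_df = df[df['Phone'].str.fullmatch(r'\d{3}-\d{3}-\d{4}')]
print(filtered_df)
```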
import pandas as pd
# Create a sample DataFrame
data = {
'Name': ['John Doe', 'Jane Smith', 'Bob Johnson'],
'Email': ['john@example.com', 'jane@example.net', 'bob@example.org']
}
df = pd.DataFrame(data)
# Filter rows where the Name column starts with 'J' and the Email column ends with '.com'
name_pattern = r'^J'
email_pattern = r'\.com$'
filtered_df = df[df['Name'].str.match(name_pattern) & df['Email'].str.contains(email_pattern)]
print(filtered_df)
Filtering a Pandas DataFrame using regular expressions is a powerful technique that allows you to perform complex pattern-based filtering on your data. By understanding the core concepts, typical usage methods, common practices, and best practices, you can effectively use regex to manipulate and analyze your data. Whether you are working with text data, numerical data represented as strings, or any other structured data, regex filtering in Pandas can help you extract the information you need.
Yes, you can use logical operators like & (and) and | (or) to combine regex-based filtering conditions on multiple columns. See the code example above for filtering with multiple regex conditions.
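Since the earlier example only demonstrates &, here is a short sketch of combining conditions with | instead (the data is illustrative):

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['John Doe', 'Jane Smith', 'Bob Johnson'],
    'Email': ['john@example.com', 'jane@example.net', 'bob@example.org'],
})

# Keep rows where the Name starts with 'B' OR the Email ends with '.net'
mask = df['Name'].str.match(r'^B') | df['Email'].str.contains(r'\.net$')
filtered_df = df[mask]
print(filtered_df)
```

Note that each condition must be wrapped in parentheses if written inline, because & and | bind more tightly than comparison operators in Python.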
You can use the re.escape() function to escape special characters in your pattern. For example:
import re
import pandas as pd
data = {
'Text': ['Hello! World', 'Goodbye? World']
}
df = pd.DataFrame(data)
special_char = '!'
escaped_char = re.escape(special_char)
pattern = fr'.*{escaped_char}.*'
filtered_df = df[df['Text'].str.contains(pattern)]
print(filtered_df)
Yes, if the numerical data is stored as strings in the DataFrame. However, if the data is in numerical format, you may need to convert it to strings first using the astype(str) method.
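A minimal sketch of this conversion, using illustrative data:

```python
import pandas as pd

df = pd.DataFrame({'Code': [1001, 2002, 1003]})

# Convert the numeric column to strings, then filter codes that start with '10'
filtered_df = df[df['Code'].astype(str).str.match(r'^10')]
print(filtered_df)
```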
re module documentation: https://docs.python.org/3/library/re.html