In R, the data.frame is a fundamental data structure for storing tabular data. It is a two-dimensional structure in which each column can have a different data type.
# Create a data.frame in R
df_r <- data.frame(
  name = c("Alice", "Bob", "Charlie"),
  age = c(25, 30, 35),
  salary = c(50000, 60000, 70000)
)
print(df_r)
In Python, Pandas provides the DataFrame object, which serves a similar purpose.
import pandas as pd
# Create a DataFrame in Pandas
data = {
    'name': ['Alice', 'Bob', 'Charlie'],
    'age': [25, 30, 35],
    'salary': [50000, 60000, 70000]
}
df_pd = pd.DataFrame(data)
print(df_pd)
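Because each column can hold a different type, it can help to check what Pandas inferred after building the DataFrame. The following is a minimal sketch that rebuilds the same example data; the comparisons to R's dim() and str() are rough analogies, not exact equivalents.

```python
import pandas as pd

df_pd = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Charlie'],
    'age': [25, 30, 35],
    'salary': [50000, 60000, 70000],
})

# shape is (rows, columns), roughly what dim() reports in R
print(df_pd.shape)

# dtypes lists the type inferred for each column
print(df_pd.dtypes)
```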
In R, you can select rows and columns using square brackets, with indexing starting at 1. For example, to select the first row and the 'name' column:
print(df_r[1, 'name'])
In Pandas, you have multiple ways to index and select data. The loc and iloc indexers are commonly used: loc selects by label, while iloc selects by integer position. Unlike R, positions in Pandas start at 0.
# Select the first row and the 'name' column using loc
print(df_pd.loc[0, 'name'])
# Select the first row and the first column using iloc
print(df_pd.iloc[0, 0])
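With the default index, labels and positions coincide, so loc and iloc can look interchangeable. The sketch below uses a hypothetical index of 10, 20, 30 (not part of the original example) to show where the two differ.

```python
import pandas as pd

df_pd = pd.DataFrame(
    {'name': ['Alice', 'Bob', 'Charlie'], 'age': [25, 30, 35]},
    index=[10, 20, 30],  # hypothetical labels, chosen to differ from positions
)

# loc looks up the row whose label is 10
by_label = df_pd.loc[10, 'name']

# iloc looks up the row at position 0, whatever its label is
by_position = df_pd.iloc[0, 0]

print(by_label, by_position)

# Label 0 does not exist in this index, so loc raises KeyError
try:
    df_pd.loc[0, 'name']
except KeyError:
    print("no row labelled 0")
```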
In R, you can use functions like read.csv and write.csv to read and write CSV files.
# Write the data.frame to a CSV file
write.csv(df_r, "data_r.csv", row.names = FALSE)
# Read the CSV file back
df_r_new <- read.csv("data_r.csv")
print(df_r_new)
In Pandas, you can use the read_csv function and the to_csv method.
# Write the DataFrame to a CSV file
df_pd.to_csv("data_pd.csv", index=False)
# Read the CSV file back
df_pd_new = pd.read_csv("data_pd.csv")
print(df_pd_new)
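The index=False argument plays the same role as row.names = FALSE in R. As a minimal sketch of what it changes, the round trip below goes through an in-memory string instead of a file, so it leaves nothing on disk; the two-column frame is a trimmed copy of the example data.

```python
import io
import pandas as pd

df_pd = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Charlie'],
    'age': [25, 30, 35],
})

# With no path argument, to_csv returns the CSV text as a string
csv_text = df_pd.to_csv(index=False)
print(csv_text)

# Without index=False, the row labels are written as an extra unnamed column
csv_with_index = df_pd.to_csv()
print(csv_with_index)

# read_csv accepts a file-like object, so wrap the string in StringIO
df_back = pd.read_csv(io.StringIO(csv_text))
print(df_back)
```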
In R, you can use functions like mutate from the dplyr package to create new columns.
library(dplyr)
df_r_mutated <- df_r %>% mutate(bonus = salary * 0.1)
print(df_r_mutated)
In Pandas, you can simply assign a new column to the DataFrame. Unlike mutate, which returns a new data frame, this modifies df_pd in place.
df_pd['bonus'] = df_pd['salary'] * 0.1
print(df_pd)
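If you prefer the dplyr style, where the original data is left untouched, the assign method is a close counterpart to mutate. A minimal sketch, using a trimmed copy of the example data:

```python
import pandas as pd

df_pd = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Charlie'],
    'salary': [50000, 60000, 70000],
})

# assign returns a new DataFrame; df_pd itself is left unchanged
df_pd_mutated = df_pd.assign(bonus=df_pd['salary'] * 0.1)

print(df_pd_mutated)

# Check that the original still has no 'bonus' column
print('bonus' in df_pd.columns)
```

Because assign returns a DataFrame, calls can be chained in much the same way as a %>% pipeline.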
In R, missing values are represented by NA. You can use functions like is.na to check for missing values and na.omit to remove rows with missing values.
df_r_with_na <- data.frame(
  value = c(1, NA, 3)
)
print(is.na(df_r_with_na))
df_r_cleaned <- na.omit(df_r_with_na)
print(df_r_cleaned)
In Pandas, missing values are represented by NaN. You can use isna to check for missing values and dropna to remove rows with missing values.
df_pd_with_na = pd.DataFrame({'value': [1, float('nan'), 3]})
print(df_pd_with_na.isna())
df_pd_cleaned = df_pd_with_na.dropna()
print(df_pd_cleaned)
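With more than one column, it is often useful to count the missing values first and to control which columns dropna considers. The sketch below adds a hypothetical label column to the example for that purpose.

```python
import pandas as pd

df_pd_with_na = pd.DataFrame({
    'value': [1.0, float('nan'), 3.0],
    'label': ['a', 'b', None],  # hypothetical column; None also counts as missing
})

# Count missing values per column
na_counts = df_pd_with_na.isna().sum()
print(na_counts)

# Default: drop a row if any of its columns is missing
dropped_any = df_pd_with_na.dropna()
print(dropped_any)

# subset: only look for missing values in the listed columns
dropped_value = df_pd_with_na.dropna(subset=['value'])
print(dropped_value)
```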
In R, you can use the group_by and summarize functions from the dplyr package to perform grouping and aggregation.
df_r_grouped <- df_r %>%
  group_by(age) %>%
  summarize(avg_salary = mean(salary))
print(df_r_grouped)
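In Pandas, you can use the groupby method followed by an aggregation function. Selecting a single column before aggregating, as here, returns a Series indexed by the grouping key rather than a DataFrame.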
df_pd_grouped = df_pd.groupby('age')['salary'].mean()
print(df_pd_grouped)
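To get a result shaped like the output of summarize, with the grouping key as a regular column and a named result column, one option is named aggregation with as_index=False. The sketch below assumes a hypothetical dept column so that a group contains more than one row; it is not part of the original example.

```python
import pandas as pd

df_pd = pd.DataFrame({
    'dept': ['eng', 'eng', 'ops'],  # hypothetical column for illustration
    'salary': [50000, 60000, 70000],
})

# as_index=False keeps 'dept' as a regular column, and the named
# aggregation sets the output column name, much as summarize does
df_pd_grouped = df_pd.groupby('dept', as_index=False).agg(
    avg_salary=('salary', 'mean')
)
print(df_pd_grouped)
```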
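Instead of a generic name like df, use a descriptive one like employee_data if the DataFrame contains employee-related information.
Where the values allow it, use a smaller data type such as int8 instead of int64.
df_pd['age'] = df_pd['age'].astype('int8')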
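A smaller integer type only helps if every value fits in it. The sketch below checks the range before converting and then compares the memory used; the ages Series is a stand-in for the age column of the example.

```python
import numpy as np
import pandas as pd

ages = pd.Series([25, 30, 35], dtype='int64')

# int8 can only hold values from -128 to 127, so check before converting
info = np.iinfo(np.int8)
fits = ages.min() >= info.min and ages.max() <= info.max
print(fits)

ages_small = ages.astype('int8') if fits else ages

# memory_usage(index=False) reports the bytes used by the values only
bytes_int64 = ages.memory_usage(index=False)       # 3 values * 8 bytes
bytes_int8 = ages_small.memory_usage(index=False)  # 3 values * 1 byte
print(bytes_int64, bytes_int8)
```

On three rows the saving is trivial, but the same ratio applies to columns with millions of rows.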
Transitioning from R to Python and using Pandas for data manipulation can be a smooth process if you understand the fundamental concepts, usage methods, common practices, and best practices. R and Python differ in places, such as 1-based versus 0-based indexing, but many operations map closely from one to the other, especially for data analysis tasks. By following the guidelines in this blog, you can start using Pandas effectively and take advantage of the rich ecosystem of Python for data science.