A Pandas Series is a one-dimensional labeled array capable of holding any data type (integers, strings, floating-point numbers, Python objects, etc.). It can be thought of as a single column in a table. You can create a Series from a list, dictionary, or NumPy array.
import pandas as pd
# Create a Series from a list
data = [10, 20, 30, 40]
s = pd.Series(data)
print(s)
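The text also mentions dictionaries and NumPy arrays as sources. A minimal sketch of both follows; the variable names and values (`prices`, `readings`) are made up for illustration.

```python
import numpy as np
import pandas as pd

# From a dictionary: the keys become the index labels
prices = pd.Series({'apple': 1.5, 'banana': 0.5, 'cherry': 3.0})

# From a NumPy array, with explicit index labels
readings = pd.Series(np.array([10, 20, 30]), index=['a', 'b', 'c'])

# Values can then be looked up by label
print(prices['banana'])
print(readings['c'])
```

When no index is given, as in the list example, Pandas assigns integer labels starting at 0.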
A DataFrame is a two-dimensional labeled data structure with columns of potentially different types. It is similar to a spreadsheet or a SQL table. You can create a DataFrame from a dictionary of Series or lists, or by reading data from external sources like CSV files.
# Create a DataFrame from a dictionary
data = {'Name': ['Alice', 'Bob', 'Charlie'],
        'Age': [25, 30, 35]}
df = pd.DataFrame(data)
print(df)
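Two other ways to construct a DataFrame are sketched here: from a dictionary of Series, and from a list of rows. The names `df_from_series` and `df_from_rows` and the index labels are illustrative assumptions, not part of the original example.

```python
import pandas as pd

# From a dictionary of Series: the Series are aligned on their index labels
names = pd.Series(['Alice', 'Bob'], index=['a', 'b'])
ages = pd.Series([25, 30], index=['a', 'b'])
df_from_series = pd.DataFrame({'Name': names, 'Age': ages})

# From a list of rows, with the column names supplied separately
rows = [['Alice', 25], ['Bob', 30], ['Charlie', 35]]
df_from_rows = pd.DataFrame(rows, columns=['Name', 'Age'])

print(df_from_series)
print(df_from_rows)
```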
To start using Pandas in a Jupyter Notebook, you first need to install it if it’s not already installed. You can use pip or conda for installation.
# Install Pandas using pip (the %pip magic installs into the notebook's active kernel)
%pip install pandas
# Import Pandas in Jupyter Notebook
import pandas as pd
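To confirm that the import works and to see which version the notebook is using, a quick check along these lines may help (the variable name `installed_version` is just for illustration):

```python
import pandas as pd

# __version__ holds the installed Pandas version as a string
installed_version = pd.__version__
print(installed_version)
```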
Pandas provides various functions to import and export data in different formats. The most common format is CSV (Comma-Separated Values).
# Import data from a CSV file
df = pd.read_csv('data.csv')
# Export data to a CSV file
df.to_csv('new_data.csv', index=False)
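If no `data.csv` file is at hand, the same round trip can be tried in memory. This sketch assumes a small made-up DataFrame; it relies on `to_csv` returning a string when no path is given, and on `read_csv` accepting file-like objects.

```python
import io

import pandas as pd

df = pd.DataFrame({'Name': ['Alice', 'Bob'], 'Age': [25, 30]})

# Without a path, to_csv returns the CSV text; index=False omits the row labels
csv_text = df.to_csv(index=False)
print(csv_text)

# read_csv accepts file-like objects as well as file paths
restored = pd.read_csv(io.StringIO(csv_text))
print(restored)
```

Leaving out `index=False` would write the row labels as an extra unnamed column, which then appears as a column when the file is read back.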
You can select specific columns or rows from a DataFrame using different methods.
# Select a single column
ages = df['Age']
# Select rows based on a condition
adults = df[df['Age'] >= 18]
print(adults)
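Beyond bracket selection, `.loc` (label-based) and `.iloc` (position-based) are commonly used. The sketch below uses a small sample DataFrame, and the threshold of 28 is an arbitrary choice for illustration.

```python
import pandas as pd

df = pd.DataFrame({'Name': ['Alice', 'Bob', 'Charlie'], 'Age': [25, 30, 35]})

# Several columns at once: pass a list of column names
subset = df[['Name', 'Age']]

# .loc selects by label: a boolean mask for the rows, a name for the column
older_names = df.loc[df['Age'] > 28, 'Name']

# .iloc selects by integer position: the first two rows, all columns
first_two = df.iloc[:2]

print(older_names)
print(first_two)
```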
Pandas allows you to perform various operations on data, such as adding columns, calculating statistics, and sorting.
# Add a new column
df['IsAdult'] = df['Age'] >= 18
# Calculate the mean age
mean_age = df['Age'].mean()
print(mean_age)
# Sort the DataFrame by age
sorted_df = df.sort_values(by='Age')
print(sorted_df)
Exploratory Data Analysis (EDA) is an important step in understanding your data. Pandas provides functions to get basic information about the data, such as the number of rows and columns, data types, and summary statistics.
# Get basic information about the DataFrame
# (info() prints its report directly and returns None, so it needs no print())
df.info()
# Get summary statistics
print(df.describe())
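The number of rows and columns and the per-column data types can also be read directly from attributes. A short sketch on a sample DataFrame:

```python
import pandas as pd

df = pd.DataFrame({'Name': ['Alice', 'Bob', 'Charlie'], 'Age': [25, 30, 35]})

# shape is a (rows, columns) tuple
n_rows, n_cols = df.shape
print(n_rows, n_cols)

# dtypes lists the data type of each column
print(df.dtypes)

# head(n) returns the first n rows for a quick look at the data
preview = df.head(2)
print(preview)
```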
Real-world data often contains missing values, duplicates, or incorrect data. Pandas can help you clean your data.
# Check for missing values
print(df.isnull().sum())
# Drop rows with missing values
df = df.dropna()
# Remove duplicate rows
df = df.drop_duplicates()
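Dropping rows is not the only option. As a sketch under assumed rules (negative ages count as incorrect, and missing ages are filled with the median), the cleaning steps can be combined as follows. The sample data and the name `cleaned` are invented for illustration.

```python
import numpy as np
import pandas as pd

raw = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Bob', 'Dana', 'Eve'],
    'Age': [25, 30, 30, np.nan, -4],
})

# Drop rows with an impossible (negative) age, but keep missing ones for now
valid = raw['Age'].isna() | (raw['Age'] >= 0)
cleaned = raw[valid].copy()

# Fill the remaining missing ages with the median of the known ages
cleaned['Age'] = cleaned['Age'].fillna(cleaned['Age'].median())

# Remove exact duplicate rows and renumber the index
cleaned = cleaned.drop_duplicates().reset_index(drop=True)
print(cleaned)
```

Whether to fill or drop missing values depends on the analysis; filling keeps the row but introduces an estimated value.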
When working with Pandas in a Jupyter Notebook, it’s important to organize your code into logical sections. Use markdown cells to explain your thought process and separate different parts of your analysis.
For large datasets, performance can become an issue. Techniques that help include choosing appropriate data types, avoiding unnecessary copies, and using vectorized operations.
# Use the appropriate data type
# (int8 stores whole numbers from -128 to 127, which is enough for ages)
df['Age'] = df['Age'].astype('int8')
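To make the data type and vectorization points concrete, here is a rough sketch on synthetic data. The column of 100,000 generated ages is an assumption for demonstration only, and actual speed differences will vary by machine.

```python
import numpy as np
import pandas as pd

# Synthetic ages between 0 and 89, stored as 64-bit integers
df = pd.DataFrame({'Age': np.arange(100_000, dtype='int64') % 90})

# Vectorized comparison: evaluated on the whole column at once
is_adult_vectorized = df['Age'] >= 18

# Element-by-element equivalent: same result, but one Python call per value
is_adult_apply = df['Age'].apply(lambda age: age >= 18)

# Memory used by the column before and after downcasting to int8
bytes_before = df['Age'].memory_usage(index=False)
bytes_after = df['Age'].astype('int8').memory_usage(index=False)
print(bytes_before, bytes_after)
```

Both approaches give the same boolean column; the vectorized form avoids the per-value function call, which is typically where the time goes on large columns.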
Pandas in combination with Jupyter Notebook offers a powerful and flexible environment for data analysis. By understanding the fundamental concepts, usage methods, common practices, and best practices, you can enhance your workflow and perform efficient data analysis. Whether you are a beginner or an experienced data analyst, these tools can help you gain insights from your data more effectively.