A Pandas DataFrame is a two-dimensional labeled data structure with columns of potentially different types. It can be thought of as a spreadsheet or a SQL table. DataFrames have both row and column labels, and they support a wide range of data types, including numerical, categorical, and datetime data.
Here is a simple example of creating a Pandas DataFrame:
import pandas as pd
data = {
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [25, 30, 35],
    'City': ['New York', 'Los Angeles', 'Chicago']
}
df = pd.DataFrame(data)
print(df)
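To see the "columns of potentially different types" point concretely, we can inspect the dtypes, index, and columns attributes. A short sketch, rebuilding the same DataFrame so it runs on its own:

```python
import pandas as pd

# Rebuild the example DataFrame to inspect its structure
df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [25, 30, 35],
    'City': ['New York', 'Los Angeles', 'Chicago']
})

# Each column has its own dtype: object for strings, int64 for integers
print(df.dtypes)
print(df.index)    # row labels (a default RangeIndex 0..2)
print(df.columns)  # column labels
```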
SQL Tables are used to store data in a relational database. A table consists of rows (also called records) and columns (also called fields). Each column has a specific data type, such as integer, varchar, or date. Tables are related to each other through keys, which allow for complex queries and data manipulation.
Here is an example of creating a simple SQL table:
CREATE TABLE employees (
    id INT PRIMARY KEY,
    name VARCHAR(50),
    age INT,
    city VARCHAR(50)
);
We can create a DataFrame from various data sources, such as dictionaries, lists, or CSV files.
import pandas as pd
# From a dictionary
data = {
    'Product': ['Apple', 'Banana', 'Cherry'],
    'Price': [1.5, 0.5, 2.0]
}
df = pd.DataFrame(data)
# From a list of lists
data_list = [['Apple', 1.5], ['Banana', 0.5], ['Cherry', 2.0]]
columns = ['Product', 'Price']
df_list = pd.DataFrame(data_list, columns=columns)
We use the CREATE TABLE statement to create a table and the INSERT INTO statement to add data to the table.
-- Create a table
CREATE TABLE products (
    id INT PRIMARY KEY,
    product_name VARCHAR(50),
    price DECIMAL(5, 2)
);
-- Insert data
INSERT INTO products (id, product_name, price)
VALUES (1, 'Apple', 1.50), (2, 'Banana', 0.50), (3, 'Cherry', 2.00);
Pandas provides functions to read data from various file formats, such as CSV, Excel, and JSON.
import pandas as pd
# Read a CSV file
df_csv = pd.read_csv('data.csv')
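The other readers work the same way: read_excel() and read_json() return a DataFrame just like read_csv(). A self-contained sketch, using an in-memory JSON string via io.StringIO in place of a file on disk:

```python
import io
import pandas as pd

# io.StringIO stands in for a JSON file on disk
json_data = io.StringIO('[{"Product": "Apple", "Price": 1.5}, {"Product": "Banana", "Price": 0.5}]')

# pd.read_json parses the JSON array into a DataFrame
df_json = pd.read_json(json_data)
print(df_json)
```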
We use the SELECT statement to read data from a SQL table.
SELECT * FROM products;
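The two worlds also connect directly: pandas can run a SQL query and hand back the result as a DataFrame. A minimal sketch using Python's built-in sqlite3 module with an in-memory database (the table and its rows here are illustrative, mirroring the products example above):

```python
import sqlite3
import pandas as pd

# Build an in-memory SQLite database with a small products table
conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE products (id INT PRIMARY KEY, product_name TEXT, price REAL)')
conn.executemany('INSERT INTO products VALUES (?, ?, ?)',
                 [(1, 'Apple', 1.5), (2, 'Banana', 0.5), (3, 'Cherry', 2.0)])

# Run the SELECT and get a DataFrame back
df_products = pd.read_sql_query('SELECT * FROM products', conn)
print(df_products)
conn.close()
```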
We can select columns and rows using different indexing methods.
import pandas as pd
data = {
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [25, 30, 35]
}
df = pd.DataFrame(data)
# Select a single column
ages = df['Age']
# Select multiple columns
name_age = df[['Name', 'Age']]
# Select rows by index
first_row = df.loc[0]
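The distinction between the indexing methods is that loc selects rows by label while iloc selects by integer position. With the default RangeIndex the two look identical, so this sketch uses custom row labels to make the difference visible:

```python
import pandas as pd

df = pd.DataFrame(
    {'Name': ['Alice', 'Bob', 'Charlie'], 'Age': [25, 30, 35]},
    index=['a', 'b', 'c']  # non-default row labels
)

by_label = df.loc['b']    # the row whose label is 'b'
by_position = df.iloc[1]  # the second row, regardless of its label
print(by_label['Name'], by_position['Name'])  # both print Bob here
```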
We use the SELECT statement to select columns from a table.
-- Select a single column
SELECT age FROM employees;
-- Select multiple columns
SELECT name, age FROM employees;
-- Select a specific row
SELECT * FROM employees WHERE id = 1;
We can filter rows based on conditions.
import pandas as pd
data = {
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [25, 30, 35]
}
df = pd.DataFrame(data)
# Filter rows where age is greater than 25
filtered_df = df[df['Age'] > 25]
print(filtered_df)
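Multiple conditions combine with & and |, with parentheses around each comparison; this is a common sticking point, because Python's plain and/or do not work element-wise on a column. A short sketch:

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [25, 30, 35]
})

# Parentheses around each condition are required
both = df[(df['Age'] > 25) & (df['Name'] != 'Charlie')]
print(both)  # only Bob satisfies both conditions
```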
We use the WHERE clause to filter rows in a SQL query.
SELECT * FROM employees WHERE age > 25;
Pandas provides functions like sum(), mean(), and count() for aggregating data.
import pandas as pd
data = {
    'Department': ['Sales', 'Sales', 'Marketing'],
    'Revenue': [1000, 2000, 1500]
}
df = pd.DataFrame(data)
# Calculate the total revenue per department
total_revenue = df.groupby('Department')['Revenue'].sum()
print(total_revenue)
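Several aggregates can be computed in one pass with .agg(), which mirrors a SELECT that lists multiple aggregate functions. A sketch reusing the revenue data above:

```python
import pandas as pd

df = pd.DataFrame({
    'Department': ['Sales', 'Sales', 'Marketing'],
    'Revenue': [1000, 2000, 1500]
})

# sum, mean, and count per department in a single groupby
summary = df.groupby('Department')['Revenue'].agg(['sum', 'mean', 'count'])
print(summary)
```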
We use the GROUP BY clause along with aggregate functions like SUM, AVG, and COUNT in SQL.
SELECT department, SUM(revenue)
FROM sales_data
GROUP BY department;
Pandas provides many functions for data cleaning, such as dropna() to remove missing values and fillna() to fill them.
import pandas as pd
import numpy as np
data = {
    'Name': ['Alice', np.nan, 'Charlie'],
    'Age': [25, 30, np.nan]
}
df = pd.DataFrame(data)
# Drop rows with missing values
clean_df = df.dropna()
# Fill missing values with a specific value
filled_df = df.fillna(0)
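fillna() also accepts a per-column mapping, so each column can get a sensible default instead of one value everywhere. A sketch reusing the data above, filling the name with a placeholder and the age with the column mean:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', np.nan, 'Charlie'],
    'Age': [25, 30, np.nan]
})

# Different fill value per column: a placeholder name, the mean age
filled = df.fillna({'Name': 'Unknown', 'Age': df['Age'].mean()})
print(filled)
```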
In SQL, we can use the IS NULL condition together with DELETE or UPDATE statements to handle missing values.
-- Delete rows with missing values
DELETE FROM employees WHERE age IS NULL;
-- Update missing values
UPDATE employees
SET age = 0
WHERE age IS NULL;
We can join two or more DataFrames using functions like merge(), join(), or concat().
import pandas as pd
df1 = pd.DataFrame({
    'id': [1, 2, 3],
    'name': ['Alice', 'Bob', 'Charlie']
})
df2 = pd.DataFrame({
    'id': [2, 3, 4],
    'city': ['Los Angeles', 'Chicago', 'New York']
})
merged_df = pd.merge(df1, df2, on='id', how='inner')
print(merged_df)
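Where merge() performs a key-based join, concat() instead stacks DataFrames along an axis, which is the closest analogue of SQL's UNION ALL. A short sketch with two made-up monthly batches:

```python
import pandas as pd

jan = pd.DataFrame({'id': [1, 2], 'name': ['Alice', 'Bob']})
feb = pd.DataFrame({'id': [3, 4], 'name': ['Charlie', 'Dana']})

# Stack rows; ignore_index renumbers 0..3 instead of repeating 0, 1, 0, 1
stacked = pd.concat([jan, feb], ignore_index=True)
print(stacked)
```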
In SQL, we use the JOIN keyword to combine rows from two or more tables based on a related column.
SELECT employees.name, departments.name AS department
FROM employees
JOIN departments ON employees.department_id = departments.id;
For datasets that are too large to fit in memory, Pandas can read a file in chunks.
import pandas as pd
# Read a large CSV file in chunks
chunk_size = 1000
for chunk in pd.read_csv('large_file.csv', chunksize=chunk_size):
    # Process each chunk
    print(chunk.head())
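Chunked reading pairs naturally with incremental aggregation: process each chunk and keep a running result. A self-contained sketch, with io.StringIO standing in for the large file:

```python
import io
import pandas as pd

# An in-memory CSV standing in for a large file on disk
csv_data = io.StringIO('value\n1\n2\n3\n4\n5\n6')

# Accumulate a running total without loading everything at once
total = 0
for chunk in pd.read_csv(csv_data, chunksize=2):
    total += chunk['value'].sum()
print(total)  # 21
```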
In SQL, we can create an index to speed up queries that filter, join, or sort on a column, such as those using WHERE, JOIN, or ORDER BY clauses.
CREATE INDEX idx_age ON employees (age);
Pandas DataFrames and SQL Tables are both powerful tools for data manipulation, but they have different strengths and use cases. Pandas is great for in-memory data analysis in Python, offering flexibility and ease of use. SQL Tables, on the other hand, are designed for large-scale data storage and efficient querying in a relational database. By understanding their differences and similarities, data analysts and scientists can choose the right tool for the job and make the most of their data.