Data Analysis with Pandas
Master data manipulation and analysis using Python's Pandas library
Getting Started with Pandas
Learn how to manipulate, analyze, and visualize data efficiently with Pandas, from loading and cleaning data to grouping, pivoting, and plotting it.
Prerequisites
- Basic Python programming
- Understanding of data structures
- Basic statistics knowledge
- Jupyter Notebook environment
1. Data Loading and Inspection
Learn how to load and inspect different types of data.
import pandas as pd
import numpy as np
# Load different file types
df_csv = pd.read_csv('data.csv')
df_excel = pd.read_excel('data.xlsx')  # needs an Excel engine such as openpyxl installed
df_json = pd.read_json('data.json')
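# Basic inspection (the CSV data is used as the working DataFrame from here on)
df = df_csv
print(df.head())
df.info()  # prints its summary directly and returns None
print(df.describe())
# Check for missing values
print(df.isnull().sum())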
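The inspection steps above can be tried without any files on disk. The following is a minimal sketch that assumes a small, made-up CSV payload (the column names and values are hypothetical, not taken from data.csv) and relies on read_csv accepting a file-like object:

```python
import io

import pandas as pd

# Hypothetical sample data standing in for the contents of data.csv
csv_text = """date,category,amount
2024-01-05,books,12.5
2024-01-06,games,
2024-01-07,books,30.0
"""

# read_csv accepts any file-like object, so StringIO works in place of a path
df = pd.read_csv(io.StringIO(csv_text))

print(df.head())             # first rows of the frame
df.info()                    # column dtypes and non-null counts
missing = df.isnull().sum()  # number of missing values per column
print(missing)
```

The empty field in the second row is read as NaN, so the missing-value count for amount is 1 while the other columns report 0.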
2. Data Cleaning
Handle missing values, duplicates, and data type conversions.
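# Handle missing values (choose the strategy that suits each column)
df.ffill(inplace=True) # Forward fill; fillna(method='ffill') is deprecated since pandas 2.1
df.fillna(df.mean(numeric_only=True), inplace=True) # Fill remaining numeric gaps with column means
df.dropna(subset=['important_column'], inplace=True)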
# Remove duplicates
df.drop_duplicates(inplace=True)
# Convert data types
df['date'] = pd.to_datetime(df['date'])
df['category'] = df['category'].astype('category')
df['amount'] = pd.to_numeric(df['amount'], errors='coerce')
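To see how these cleaning steps interact, here is a small sketch on a made-up frame. The data and the names raw and clean are hypothetical. The order shown (deduplicate, convert types, then fill) is one reasonable choice rather than the only one. Converting amount to numeric first matters because the mean can only be computed on a numeric column.

```python
import pandas as pd

# Hypothetical raw data with a duplicated row and an unparseable amount
raw = pd.DataFrame({
    "date": ["2024-01-05", "2024-01-06", "2024-01-06", "2024-01-07"],
    "category": ["books", "games", "games", "books"],
    "amount": ["10", "oops", "oops", "30"],
})

# Drop the exact duplicate row and work on an independent copy
clean = raw.drop_duplicates().copy()

# Convert each column to a more useful dtype
clean["date"] = pd.to_datetime(clean["date"])
clean["category"] = clean["category"].astype("category")
# errors="coerce" turns unparseable strings into NaN instead of raising
clean["amount"] = pd.to_numeric(clean["amount"], errors="coerce")

# Fill the remaining gap with the column mean (the mean of 10 and 30)
clean["amount"] = clean["amount"].fillna(clean["amount"].mean())

print(clean)
```

Assigning the result back to the column, instead of using inplace=True, avoids surprises when the frame is a view or an alias of another frame.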
3. Data Manipulation
Transform and reshape your data.
# Filtering and sorting
high_value = df[df['amount'] > 1000]
sorted_df = df.sort_values('date', ascending=False)
# Grouping and aggregation
grouped = df.groupby('category').agg({
    'amount': ['sum', 'mean', 'count'],
    'date': 'max'
}).reset_index()
# Pivot tables (derive the 'year' column from the datetime 'date' column first)
df['year'] = df['date'].dt.year
pivot_table = pd.pivot_table(
    df,
    values='amount',
    index='category',
    columns='year',
    aggfunc='sum',
    fill_value=0
)
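The dictionary form of agg shown above produces two-level column names. As an alternative sketch, named aggregation gives flat, readable column names. The sample data and the output names (total, average, orders, latest) are hypothetical and chosen only for illustration.

```python
import pandas as pd

# Hypothetical sales data
df = pd.DataFrame({
    "date": pd.to_datetime(
        ["2023-03-01", "2023-07-15", "2024-02-10", "2024-05-20"]
    ),
    "category": ["books", "games", "books", "books"],
    "amount": [100.0, 250.0, 40.0, 60.0],
})
df["year"] = df["date"].dt.year

# Named aggregation: output_name=(input_column, aggregation)
summary = df.groupby("category").agg(
    total=("amount", "sum"),
    average=("amount", "mean"),
    orders=("amount", "count"),
    latest=("date", "max"),
).reset_index()

# One row per category, one column per year, summed amounts in the cells
pivot = pd.pivot_table(
    df,
    values="amount",
    index="category",
    columns="year",
    aggfunc="sum",
    fill_value=0,
)

print(summary)
print(pivot)
```

In this sample, games has no 2024 rows, so fill_value=0 places a 0 in that cell instead of NaN.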
4. Data Analysis and Visualization
Analyze and visualize your data using Pandas with Matplotlib and Seaborn.
import matplotlib.pyplot as plt
import seaborn as sns
# Basic statistics
print(df['amount'].describe())
print(df['category'].value_counts())
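print(df.corr(numeric_only=True))  # restrict to numeric columns; pandas 2.0+ raises on text or date columns otherwise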
# Visualizations
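# One subplot per chart so the three plots do not draw over each other
fig, axes = plt.subplots(1, 3, figsize=(18, 6))
# Time series plot
df.plot(x='date', y='amount', kind='line', ax=axes[0])
# Distribution plot
sns.histplot(data=df, x='amount', hue='category', ax=axes[1])
# Box plot
sns.boxplot(data=df, x='category', y='amount', ax=axes[2])
plt.tight_layout()
plt.show()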
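The summary statistics used in this section can be checked on a small frame before plotting real data. This is a minimal sketch with made-up values. The quantity column is a hypothetical addition so that there are two numeric columns to correlate.

```python
import pandas as pd

# Hypothetical data: one text column and two numeric columns
df = pd.DataFrame({
    "category": ["books", "games", "books", "books"],
    "amount": [10.0, 20.0, 30.0, 40.0],
    "quantity": [1, 2, 3, 4],
})

# Frequency of each category, most common first
counts = df["category"].value_counts()

# Pairwise Pearson correlation, restricted to numeric columns
corr = df.corr(numeric_only=True)

print(counts)
print(corr)
```

Because amount is an exact multiple of quantity here, their correlation comes out as 1 up to floating-point rounding, and the category column is left out of the correlation matrix.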