Data Analysis with Pandas

Master data manipulation and analysis using Python's Pandas library

Getting Started with Pandas

Learn how to manipulate, analyze, and visualize data efficiently with Pandas, from basic operations to more advanced analysis techniques.

Prerequisites

  • Basic Python programming
  • Understanding of data structures
  • Basic statistics knowledge
  • Jupyter Notebook environment

1. Data Loading and Inspection

Load data from CSV, Excel, and JSON files, then inspect its first rows, column types, summary statistics, and missing values.

import pandas as pd
import numpy as np

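# Load different file types
df_csv = pd.read_csv('data.csv')
df_excel = pd.read_excel('data.xlsx')  # needs an Excel engine such as openpyxl installed
df_json = pd.read_json('data.json')

# Basic inspection (the rest of this tutorial works on the CSV data as `df`)
df = df_csv
print(df.head())      # first five rows
df.info()             # prints its summary directly and returns None
print(df.describe())  # summary statistics for numeric columns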

# Check for missing values
print(df.isnull().sum())
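
The inspection steps above can be tried without any files on disk. The following is a minimal sketch that reads a small in-memory CSV instead of data.csv; the sample rows and the column names (date, category, amount) are made up for illustration.

```python
import io

import pandas as pd

# Hypothetical sample data standing in for data.csv
csv_text = """date,category,amount
2024-01-05,books,120.50
2024-01-06,games,
2024-01-07,books,80.00
"""

# read_csv accepts any file-like object, so StringIO works in place of a path
df = pd.read_csv(io.StringIO(csv_text), parse_dates=["date"])

print(df.shape)   # (rows, columns)
print(df.dtypes)  # one dtype per column

# Count missing values per column; the empty amount field is read as NaN
missing_per_column = df.isnull().sum()
print(missing_per_column)
```

Passing parse_dates at load time is one option; converting later with pd.to_datetime, as in the next section, works equally well.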

2. Data Cleaning

Handle missing values, duplicates, and data type conversions.

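# Handle missing values (these are alternatives; choose what suits each column)
df.ffill(inplace=True)                                # Forward fill; fillna(method='ffill') is deprecated
df.fillna(df.mean(numeric_only=True), inplace=True)   # Fill numeric columns with their mean
df.dropna(subset=['important_column'], inplace=True)  # Drop rows missing a required column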

# Remove duplicates
df.drop_duplicates(inplace=True)

# Convert data types
df['date'] = pd.to_datetime(df['date'])
df['category'] = df['category'].astype('category')
df['amount'] = pd.to_numeric(df['amount'], errors='coerce')
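
To see how these steps interact, here is a small sketch on made-up data. It assumes one reasonable ordering: remove duplicates, convert types, and only then fill missing amounts with the column mean, so that the mean is computed on numeric values. The frame and its contents are hypothetical.

```python
import pandas as pd

# Hypothetical raw data: amounts arrive as strings, one row is duplicated
raw = pd.DataFrame({
    "date": ["2024-01-05", "2024-01-06", "2024-01-06", "2024-01-08"],
    "category": ["books", "games", "games", "books"],
    "amount": ["100", "n/a", "n/a", "300"],
})

# Drop exact duplicate rows; copy() avoids modifying a view of `raw`
clean = raw.drop_duplicates().copy()

# Convert each column to a more useful dtype
clean["date"] = pd.to_datetime(clean["date"])
clean["category"] = clean["category"].astype("category")
clean["amount"] = pd.to_numeric(clean["amount"], errors="coerce")  # "n/a" becomes NaN

# Record how many values could not be parsed, then fill them with the mean
n_missing = int(clean["amount"].isna().sum())
clean["amount"] = clean["amount"].fillna(clean["amount"].mean())

print(clean)
print("values filled:", n_missing)
```

Whether a mean fill is appropriate depends on the data; dropping the rows or forward filling may be better choices for other columns.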

3. Data Manipulation

Transform and reshape your data.

# Filtering and sorting
high_value = df[df['amount'] > 1000]
sorted_df = df.sort_values('date', ascending=False)

# Grouping and aggregation (several functions per column produce MultiIndex columns)
grouped = df.groupby('category').agg({
    'amount': ['sum', 'mean', 'count'],
    'date': 'max'
}).reset_index()

# Pivot tables (derive the 'year' column from 'date' first)
df['year'] = df['date'].dt.year
pivot_table = pd.pivot_table(
    df,
    values='amount',
    index='category',
    columns='year',
    aggfunc='sum',
    fill_value=0
)
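
As a complement to the grouping and pivot code above, the sketch below uses named aggregation, which gives flat, readable column names instead of a MultiIndex. The data and the output column names (total, average, orders, latest) are invented for illustration.

```python
import pandas as pd

# Hypothetical transactions
df = pd.DataFrame({
    "date": pd.to_datetime(["2023-03-01", "2023-07-15", "2024-01-10", "2024-02-20"]),
    "category": ["books", "games", "books", "books"],
    "amount": [100.0, 250.0, 50.0, 150.0],
})

# Named aggregation: output_name=(input_column, function)
summary = df.groupby("category", as_index=False).agg(
    total=("amount", "sum"),
    average=("amount", "mean"),
    orders=("amount", "count"),
    latest=("date", "max"),
)
print(summary)

# The pivot needs a 'year' column, derived here from the datetime column
df["year"] = df["date"].dt.year
pivot = pd.pivot_table(
    df,
    values="amount",
    index="category",
    columns="year",
    aggfunc="sum",
    fill_value=0,  # categories with no sales in a year show 0 instead of NaN
)
print(pivot)
```

Named aggregation is a stylistic choice; the dictionary form shown earlier computes the same statistics.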

4. Data Analysis and Visualization

Analyze and visualize your data using Pandas with Matplotlib and Seaborn.

import matplotlib.pyplot as plt
import seaborn as sns

# Basic statistics
print(df['amount'].describe())
print(df['category'].value_counts())
print(df.corr(numeric_only=True))  # restrict to numeric columns; text columns raise an error otherwise

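# Visualizations: give each plot its own axes so they do not draw over each other
fig, axes = plt.subplots(3, 1, figsize=(12, 18))

# Time series plot (sort by date so the line runs in chronological order)
df.sort_values('date').plot(x='date', y='amount', kind='line', ax=axes[0])

# Distribution plot
sns.histplot(data=df, x='amount', hue='category', ax=axes[1])

# Box plot
sns.boxplot(data=df, x='category', y='amount', ax=axes[2])

plt.tight_layout()
plt.show()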
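
Before plotting, it can help to check the basic statistics on a frame that mixes text and numeric columns. The sketch below is one way to do that, assuming a made-up frame with a hypothetical quantity column alongside amount; it selects the numeric columns explicitly before computing correlations.

```python
import pandas as pd

# Hypothetical data mixing a text column with two numeric columns
df = pd.DataFrame({
    "category": ["books", "games", "books", "games"],
    "amount": [100.0, 200.0, 300.0, 400.0],
    "quantity": [1, 2, 3, 4],  # assumed column, not part of the tutorial data
})

# Keep only numeric columns so corr() has nothing it cannot handle
numeric = df.select_dtypes(include="number")
corr = numeric.corr()
print(corr)

# Frequency of each category
counts = df["category"].value_counts()
print(counts)
```

Selecting columns with select_dtypes has the same effect here as passing numeric_only=True to corr(), and it makes the set of analyzed columns explicit.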