Machine Learning with scikit-learn

Build and evaluate machine learning models using Python's scikit-learn library

Getting Started with scikit-learn

Learn how to implement various machine learning algorithms using scikit-learn, from data preprocessing to model evaluation and hyperparameter tuning.

Prerequisites

Basic Python programming knowledge
Working knowledge of NumPy and pandas
Basic statistics and mathematics
Jupyter Notebook environment

1. Data Preprocessing

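Separate the features from the target, hold out a test set, and standardize the features. Fit the scaler on the training split only, so that no statistics from the test set leak into training.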

from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.model_selection import train_test_split

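# Load and preprocess data
# `df` is assumed to be a pandas DataFrame with numeric feature columns
# and a 'target' column; LabelEncoder is only needed if the target is stored as strings
X = df.drop('target', axis=1)
y = df['target']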

# Split the data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Scale features: fit on the training split only, then reuse the same transform on the test split
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
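
The steps above can be tried end to end on toy data. The following is a minimal sketch, not a definitive recipe: the DataFrame, its column names (`height_cm`, `weight_kg`), and the string labels are invented stand-ins for `df`, and `stratify=y` is an optional addition that keeps class proportions the same in both splits.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, StandardScaler

# Hypothetical toy data standing in for `df` (values are synthetic)
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "height_cm": rng.normal(170, 10, size=100),
    "weight_kg": rng.normal(70, 15, size=100),
    "target": np.tile(["no", "yes"], 50),
})

X = df.drop("target", axis=1)
# Encode string labels as integers; classes are sorted, so "no" -> 0 and "yes" -> 1
y = LabelEncoder().fit_transform(df["target"])

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Fit the scaler on the training split only, then apply it to both splits
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Training features now have (approximately) zero mean and unit variance
print(X_train_scaled.mean(axis=0))
print(X_train_scaled.std(axis=0))
```

The test split is transformed with the training mean and standard deviation, so its own mean and variance will generally be close to, but not exactly, 0 and 1.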

2. Model Training

Fit several classifiers on the scaled training data and compare their accuracy on the held-out test set.

from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC

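# Define the models to compare
# max_iter is raised to avoid convergence warnings; random_state makes the forest reproducible
models = {
    'logistic': LogisticRegression(max_iter=1000),
    'random_forest': RandomForestClassifier(random_state=42),
    'svm': SVC()
}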

# Train each model and report its test-set accuracy (score() returns mean accuracy for classifiers)
for name, model in models.items():
    model.fit(X_train_scaled, y_train)
    print(f"{name} score: {model.score(X_test_scaled, y_test):.4f}")
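
A single train/test split can give a noisy comparison. As a hedged sketch, the same three models can also be compared with 5-fold cross-validation, wrapping each one in a pipeline so that the scaler is refit inside every fold. The Iris dataset bundled with scikit-learn, `max_iter=1000`, and the name `cv_results` are illustrative choices, not part of the walkthrough above.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Small built-in dataset used purely for illustration
X, y = load_iris(return_X_y=True)

models = {
    "logistic": LogisticRegression(max_iter=1000),
    "random_forest": RandomForestClassifier(random_state=42),
    "svm": SVC(),
}

cv_results = {}
for name, model in models.items():
    # The pipeline refits StandardScaler on the training part of each fold
    pipeline = make_pipeline(StandardScaler(), model)
    # For classifiers, the default scoring is accuracy
    scores = cross_val_score(pipeline, X, y, cv=5)
    cv_results[name] = scores
    print(f"{name}: {scores.mean():.4f} +/- {scores.std():.4f}")
```

Reporting the spread across folds alongside the mean helps judge whether a difference between two models is larger than the fold-to-fold variation.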

3. Model Evaluation

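Accuracy alone can hide per-class errors. The classification report lists precision, recall, and F1 score for each class, and the confusion matrix shows which classes are mistaken for which.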

from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
import seaborn as sns
import matplotlib.pyplot as plt

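# Make predictions with one trained model, selected explicitly
# (after the training loop, `model` would otherwise be the last entry in `models`, the SVM)
model = models['random_forest']
y_pred = model.predict(X_test_scaled)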

# Print classification report
print(classification_report(y_test, y_pred))

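# Plot confusion matrix (rows: true labels, columns: predicted labels)
cm = confusion_matrix(y_test, y_pred)
sns.heatmap(cm, annot=True, fmt='d')
plt.xlabel('Predicted label')
plt.ylabel('True label')
plt.show()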
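
To make the link between the confusion matrix and the report concrete, the sketch below derives per-class precision and recall directly from the matrix and checks them against scikit-learn's own functions. The label lists `y_true` and `y_pred` are made-up values chosen only to keep the arithmetic easy to follow.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Hypothetical labels for a 3-class problem (illustrative values only)
y_true = [0, 0, 0, 1, 1, 1, 1, 2, 2, 2]
y_pred = [0, 0, 1, 1, 1, 1, 2, 2, 2, 0]

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(y_true, y_pred)

true_positives = np.diag(cm)
precision = true_positives / cm.sum(axis=0)  # correct / all predicted as that class
recall = true_positives / cm.sum(axis=1)     # correct / all truly in that class
accuracy = true_positives.sum() / cm.sum()   # correct / all samples

print(cm)
print("precision:", precision)
print("recall:   ", recall)
print("accuracy: ", accuracy)

# The manual values agree with scikit-learn's per-class metrics
assert np.allclose(precision, precision_score(y_true, y_pred, average=None))
assert np.allclose(recall, recall_score(y_true, y_pred, average=None))
```

Column sums count how often each class was predicted, which is why they appear in the precision denominator. Row sums count how often each class actually occurs, which is why they appear in the recall denominator.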

4. Hyperparameter Tuning

Search a grid of hyperparameter values with cross-validation and keep the combination with the best mean validation score.

from sklearn.model_selection import GridSearchCV

# Define parameter grid: 3 x 4 x 3 = 36 combinations, each fitted once per cross-validation fold
param_grid = {
    'n_estimators': [100, 200, 300],
    'max_depth': [10, 20, 30, None],
    'min_samples_split': [2, 5, 10]
}

# Perform grid search
grid_search = GridSearchCV(
    RandomForestClassifier(random_state=42),  # fixed seed so repeated searches are comparable
    param_grid,
    cv=5,
    scoring='accuracy',
    n_jobs=-1
)

grid_search.fit(X_train_scaled, y_train)
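print("Best parameters:", grid_search.best_params_)
# best_score_ is the mean cross-validated accuracy on the training data, not a test-set score
print("Best cross-validated accuracy:", grid_search.best_score_)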
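
Because `best_score_` is computed on the training folds, a final check on held-out data is usually worthwhile. The sketch below is one way to do that, under stated assumptions: it uses the Iris dataset bundled with scikit-learn and a deliberately small grid so that it runs quickly, and the names `best_model` and `test_accuracy` are illustrative.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Small built-in dataset used purely for illustration
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# A reduced grid (2 x 2 = 4 combinations) to keep the example fast
param_grid = {"n_estimators": [50, 100], "max_depth": [3, None]}

grid_search = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid,
    cv=5,
    scoring="accuracy",
)
grid_search.fit(X_train, y_train)

# With the default refit=True, best_estimator_ is retrained on the full training set
best_model = grid_search.best_estimator_
test_accuracy = best_model.score(X_test, y_test)

print("Best parameters:", grid_search.best_params_)
print(f"Cross-validated accuracy: {grid_search.best_score_:.4f}")
print(f"Test accuracy: {test_accuracy:.4f}")
```

The test set is touched only once, after the search has finished, so the reported test accuracy is not biased by the hyperparameter selection.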
