Machine Learning with scikit-learn

Build and evaluate machine learning models using Python's scikit-learn library

Getting Started with scikit-learn

Learn how to implement various machine learning algorithms using scikit-learn, from data preprocessing to model evaluation and hyperparameter tuning.

Prerequisites

  • Basic Python programming knowledge
  • Familiarity with NumPy and pandas
  • Basic statistics and mathematics
  • Jupyter Notebook environment

1. Data Preprocessing

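Separate the features from the target, hold out a test set, and standardize the features. The scaler is fitted on the training data only and then applied to the test data, so that no test-set statistics leak into training.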

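# LabelEncoder is only needed if the target column holds string labels
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.model_selection import train_test_split

# Separate features and target
# (df is assumed to be an already-loaded pandas DataFrame with a 'target' column)
X = df.drop('target', axis=1)
y = df['target']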

# Split the data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Scale features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
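The steps above assume that `df` already exists. The following is a minimal, self-contained sketch of the same workflow on synthetic data. The generated DataFrame, the `feature_*` column names, and the "no"/"yes" labels are all assumptions made for illustration; they are not part of the original dataset. It also shows one way `LabelEncoder` might be used when the target is stored as strings.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, StandardScaler

# Hypothetical stand-in for `df`: 200 rows, 4 numeric features,
# and a string-valued 'target' column.
features, labels = make_classification(
    n_samples=200, n_features=4, n_informative=3, n_redundant=1, random_state=42
)
df = pd.DataFrame(features, columns=[f"feature_{i}" for i in range(4)])
df["target"] = pd.Series(labels).map({0: "no", 1: "yes"})

X = df.drop("target", axis=1)

# LabelEncoder assigns integers in sorted label order: "no" -> 0, "yes" -> 1
encoder = LabelEncoder()
y = encoder.fit_transform(df["target"])

# stratify=y keeps the class proportions similar in both splits
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Fit the scaler on the training split only, then reuse it on the test split
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

print(X_train_scaled.shape, X_test_scaled.shape)
```

After scaling, each training column has mean 0 and unit variance. The test columns will be close to that but not exact, because they are transformed with the training statistics.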

2. Model Training

Fit several classifiers on the scaled training data and compare them on the held-out test set. For classifiers, score() returns the mean accuracy.

from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC

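# Candidate models; random_state makes the random forest reproducible,
# and a higher max_iter gives logistic regression room to converge
models = {
    'logistic': LogisticRegression(max_iter=1000),
    'random_forest': RandomForestClassifier(random_state=42),
    'svm': SVC()
}

# Fit each model and report its accuracy on the test set
for name, model in models.items():
    model.fit(X_train_scaled, y_train)
    print(f"{name} score: {model.score(X_test_scaled, y_test):.4f}")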
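A single train/test split can give a noisy comparison between models. As a hedged sketch of one alternative, the example below compares the same three model types with 5-fold cross-validation, wrapping each model in a pipeline so the scaler is refitted inside every fold. The breast cancer dataset and the specific settings (`max_iter=1000`, `n_estimators=50`) are assumptions chosen to keep the example self-contained and quick.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Stand-in dataset (an assumption; the tutorial's own data is not shown)
X, y = load_breast_cancer(return_X_y=True)

models = {
    "logistic": LogisticRegression(max_iter=1000),
    "random_forest": RandomForestClassifier(n_estimators=50, random_state=42),
    "svm": SVC(),
}

cv_means = {}
for name, model in models.items():
    # The pipeline refits the scaler on the training part of each fold,
    # so validation rows never influence the scaling statistics.
    pipeline = make_pipeline(StandardScaler(), model)
    scores = cross_val_score(pipeline, X, y, cv=5, scoring="accuracy")
    cv_means[name] = scores.mean()
    print(f"{name}: {scores.mean():.4f} +/- {scores.std():.4f}")
```

Reporting the standard deviation next to the mean makes it easier to judge whether a difference between two models is larger than the fold-to-fold variation.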

3. Model Evaluation

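Look beyond a single accuracy number: the classification report lists precision, recall, and F1-score per class, and the confusion matrix shows which classes are mistaken for each other.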

from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
import seaborn as sns
import matplotlib.pyplot as plt

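# Make predictions with one of the fitted models
# (after the training loop, the bare name `model` would refer to the last one, the SVM)
model = models['random_forest']
y_pred = model.predict(X_test_scaled)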

# Print classification report
print(classification_report(y_test, y_pred))

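# Plot confusion matrix (rows are true labels, columns are predicted labels)
cm = confusion_matrix(y_test, y_pred)
sns.heatmap(cm, annot=True, fmt='d')
plt.xlabel('Predicted label')
plt.ylabel('True label')
plt.show()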
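To make the link between the confusion matrix and the reported metrics concrete, here is a small worked sketch. The ten labels and predictions are made up for illustration and do not come from any model in this tutorial.

```python
from sklearn.metrics import (
    accuracy_score,
    confusion_matrix,
    precision_score,
    recall_score,
)

# Hypothetical ground truth and predictions for a binary problem
y_true = [0, 0, 0, 0, 1, 1, 1, 1, 1, 1]
y_pred = [0, 0, 0, 1, 1, 1, 1, 1, 0, 0]

# For binary labels, ravel() yields the counts in the order tn, fp, fn, tp
cm = confusion_matrix(y_true, y_pred)
tn, fp, fn, tp = cm.ravel()

precision = precision_score(y_true, y_pred)  # tp / (tp + fp)
recall = recall_score(y_true, y_pred)        # tp / (tp + fn)
accuracy = accuracy_score(y_true, y_pred)    # (tp + tn) / total

print(cm)
print(f"precision={precision:.3f} recall={recall:.3f} accuracy={accuracy:.3f}")
# precision=0.800 recall=0.667 accuracy=0.700
```

Here the matrix is [[3, 1], [2, 4]]: one false positive lowers precision to 4/5, and two false negatives lower recall to 4/6.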

4. Hyperparameter Tuning

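Search over a grid of hyperparameter values using cross-validation. The grid below contains 3 × 4 × 3 = 36 combinations, so with cv=5 the search fits 180 models, plus one final refit of the best combination on the full training set.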

from sklearn.model_selection import GridSearchCV

# Define parameter grid
param_grid = {
    'n_estimators': [100, 200, 300],
    'max_depth': [10, 20, 30, None],
    'min_samples_split': [2, 5, 10]
}

# Perform grid search
grid_search = GridSearchCV(
    RandomForestClassifier(random_state=42),  # fixed seed so the search is reproducible
    param_grid,
    cv=5,
    scoring='accuracy',
    n_jobs=-1
)

grid_search.fit(X_train_scaled, y_train)
print("Best parameters:", grid_search.best_params_)
print("Best cross-validation accuracy:", grid_search.best_score_)
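The best cross-validation score is computed on the training data only, so it is worth checking the tuned model on the held-out test set as well. The sketch below does that with `best_estimator_`, which is available because `GridSearchCV` refits the best combination by default. The iris dataset and the deliberately small grid are assumptions made so the example runs quickly; they are not the tutorial's data or grid.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Stand-in dataset (an assumption). Features are left unscaled here
# because tree-based models do not depend on feature scale.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Small illustrative grid: 2 x 2 = 4 combinations
small_grid = {
    "n_estimators": [25, 50],
    "max_depth": [3, None],
}

grid_search = GridSearchCV(
    RandomForestClassifier(random_state=42),
    small_grid,
    cv=5,
    scoring="accuracy",
)
grid_search.fit(X_train, y_train)

# best_estimator_ is the model refitted on all of X_train with the best parameters
best_model = grid_search.best_estimator_
test_accuracy = best_model.score(X_test, y_test)

print("Best parameters:", grid_search.best_params_)
print(f"Cross-validation accuracy: {grid_search.best_score_:.4f}")
print(f"Test accuracy: {test_accuracy:.4f}")
```

A test accuracy well below the cross-validation accuracy can be a sign that the search has overfit to the training folds.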