Diabetes Prediction

A healthcare-focused binary classification project using Logistic Regression.


Project Overview

In this project, we use the Pima Indians Diabetes Database to predict whether a patient has diabetes from diagnostic measurements such as glucose level, blood pressure, and BMI. It is a classic example of binary classification in the medical field.

Learning Objectives

  • Handle missing values and preprocess tabular medical data
  • Apply feature scaling for linear models
  • Train a Logistic Regression model
  • Evaluate precision, recall, and ROC curves

1. Preprocessing the Data

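Diagnostic measurements sit on very different numeric ranges, so we standardize them (zero mean, unit variance) before training a linear model like Logistic Regression. scikit-learn's LogisticRegression is regularized by default, and regularization penalizes coefficients unevenly when features are on different scales. The scaler is fitted on the training split only and then applied to the test split, so no information from the test data leaks into training.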

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Assume 'df' is a pandas DataFrame holding the diabetes dataset,
# with the 0/1 label stored in the 'Outcome' column
X = df.drop('Outcome', axis=1)
y = df['Outcome']

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
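
The learning objectives mention missing values, but the snippet above does not handle them. In the commonly distributed CSV of this dataset, missing measurements are recorded as 0 in columns where a zero is not physiologically plausible. The following is a minimal sketch of one way to deal with that, not the project's confirmed method. It assumes the column names Glucose, BloodPressure, and BMI, and uses a small made-up frame (`toy`) in place of the real `df`. It marks the zeros as NaN and fills them with the training-set median.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy stand-in for the real dataset (values are made up for illustration).
toy = pd.DataFrame({
    "Glucose": [148, 85, 0, 89, 137, 116],
    "BloodPressure": [72, 66, 64, 0, 40, 74],
    "BMI": [33.6, 26.6, 23.3, 28.1, 0.0, 25.6],
    "Age": [50, 31, 32, 21, 33, 30],
})

# Columns where 0 is assumed to mean "not measured" (an assumption).
zero_as_missing = ["Glucose", "BloodPressure", "BMI"]
toy[zero_as_missing] = toy[zero_as_missing].replace(0, np.nan)

# Fit the imputer on the training rows only, then reuse it on the test rows,
# mirroring how the scaler is fitted.
train, test = toy.iloc[:4], toy.iloc[4:]
imputer = SimpleImputer(strategy="median")
train_filled = pd.DataFrame(imputer.fit_transform(train), columns=toy.columns)
test_filled = pd.DataFrame(imputer.transform(test), columns=toy.columns)

print(train_filled)
print(test_filled)
```

In the project itself, this step would run after the train/test split and before the scaler, so that both the imputation medians and the scaling statistics come from the training data alone.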

2. Logistic Regression Training

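Logistic Regression is highly interpretable, which makes it a common choice for medical datasets: each coefficient gives the change in the log-odds of the positive class for a one-unit increase in that feature. Because the features were standardized, one unit here means one standard deviation, so the coefficient magnitudes can be compared across features.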

from sklearn.linear_model import LogisticRegression

model = LogisticRegression(max_iter=1000, random_state=42)
model.fit(X_train_scaled, y_train)

# View the learned coefficient for each medical feature
# (log-odds change per one standard deviation, since the inputs were scaled)
for feature, coef in zip(X.columns, model.coef_[0]):
    print(f"{feature}: {coef:.4f}")
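
Log-odds coefficients are often easier to read as odds ratios, obtained by exponentiating them. The sketch below illustrates the idea on synthetic data from `make_classification`. The names `X_demo`, `demo_model`, and the `feat_*` labels are placeholders, not part of the project.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data; the feature names are placeholders.
X_demo, y_demo = make_classification(
    n_samples=300, n_features=4, n_informative=3, n_redundant=1, random_state=0
)
feature_names = ["feat_a", "feat_b", "feat_c", "feat_d"]

X_demo_scaled = StandardScaler().fit_transform(X_demo)
demo_model = LogisticRegression(max_iter=1000).fit(X_demo_scaled, y_demo)

# exp(coefficient) = multiplicative change in the odds of the positive class
# for a one-standard-deviation increase in that feature, others held fixed.
odds_ratios = np.exp(demo_model.coef_[0])
for name, ratio in sorted(zip(feature_names, odds_ratios), key=lambda p: -p[1]):
    print(f"{name}: odds ratio {ratio:.2f}")
```

A ratio above 1 means the feature raises the odds of the positive class, and a ratio below 1 means it lowers them. The same `np.exp(model.coef_[0])` call would apply to the model trained above.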

3. Evaluation

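The classes are imbalanced (in the standard 768-row version of the dataset, about 35% of the records are positive), so accuracy alone can look better than the model really is. We therefore report precision, recall, and the F1-score for each class, along with ROC-AUC computed from the predicted probabilities.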

from sklearn.metrics import classification_report, roc_auc_score

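# Hard class labels at the default 0.5 probability threshold
y_pred = model.predict(X_test_scaled)
# Predicted probability of the positive class (diabetes), needed for ROC-AUC
y_prob = model.predict_proba(X_test_scaled)[:, 1]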

print(classification_report(y_test, y_pred))
print(f"ROC-AUC Score: {roc_auc_score(y_test, y_prob):.4f}")
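
The classification report reflects a single decision threshold, while ROC-AUC summarizes every threshold at once. As a rough sketch of how the curve behind that score can be obtained, the example below calls `roc_curve` on synthetic, mildly imbalanced data. In the project, `y_test` and `y_prob` would take the place of the hypothetical `y_te` and `prob_te`.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import auc, roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split

# Synthetic, mildly imbalanced stand-in data (about 65% negatives).
X_demo, y_demo = make_classification(
    n_samples=600, n_features=8, weights=[0.65, 0.35], random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0, stratify=y_demo
)
clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
prob_te = clf.predict_proba(X_te)[:, 1]

# One (false positive rate, true positive rate) pair per candidate threshold.
fpr, tpr, thresholds = roc_curve(y_te, prob_te)
curve_auc = auc(fpr, tpr)  # area under the curve via the trapezoidal rule
print(f"AUC from curve: {curve_auc:.4f}")
print(f"roc_auc_score:  {roc_auc_score(y_te, prob_te):.4f}")

# Example: the largest threshold that reaches at least 80% recall.
idx = int(np.argmax(tpr >= 0.8))
print(f"threshold={thresholds[idx]:.3f}, recall={tpr[idx]:.2f}, FPR={fpr[idx]:.2f}")
```

Plotting `tpr` against `fpr` gives the ROC curve itself. Picking a threshold from the curve, as in the last step, is one way to trade precision for recall when missing a positive case is costlier than a false alarm. The 80% target here is only an illustrative choice.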
