Diabetes Prediction
A healthcare-focused binary classification project using Logistic Regression.
← Back to scikit-learn Tutorial
Project Overview
In this project, we utilize the Pima Indians Diabetes Database to predict whether a patient has diabetes based on diagnostic measurements. This is a classic example of binary classification in the medical field.
Learning Objectives
- Handle missing values and preprocess tabular medical data
- Apply feature scaling for linear models
- Train a Logistic Regression model
- Evaluate precision, recall, and ROC curves
1. Preprocessing the Data
Diagnostic data often requires scaling (standardization) before training linear models like Logistic Regression.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
# Assume 'df' is our loaded Diabetes dataset
X = df.drop('Outcome', axis=1)
y = df['Outcome']
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=42, stratify=y
)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)2. Logistic Regression Training
Logistic Regression is highly interpretable, making it a favorite for medical datasets.
from sklearn.linear_model import LogisticRegression
model = LogisticRegression(max_iter=1000, random_state=42)
model.fit(X_train_scaled, y_train)
# View the learned coefficients for each medical feature
for feature, coef in zip(X.columns, model.coef_[0]):
print(f"{feature}: {coef:.4f}")3. Evaluation
Because the classes might be imbalanced, we rely heavily on Precision, Recall, and the F1-Score rather than just pure accuracy.
from sklearn.metrics import classification_report, roc_auc_score
y_pred = model.predict(X_test_scaled)
y_prob = model.predict_proba(X_test_scaled)[:, 1]
print(classification_report(y_test, y_pred))
print(f"ROC-AUC Score: {roc_auc_score(y_test, y_prob):.4f}")Want to try it yourself?
Download the full interactive Jupyter Notebook for this project to run the code and visualize the ROC curves.