DataSci Blog - Data Science & Machine Learning

Project Overview

The Iris dataset is often called the "Hello World" of machine learning. In this project, we'll build a model that classifies iris flowers into three species (Setosa, Versicolor, and Virginica) from four measurements: the length and width of their sepals and petals.

Learning Objectives

Load a built-in dataset from scikit-learn
Split data into training and testing sets
Train a Random Forest Classifier
Evaluate model accuracy and read a confusion matrix

1. Loading the Data

We start by importing the necessary libraries and loading the Iris dataset directly from scikit-learn.

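import pandas as pd
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split  # used in the next section

# Load the dataset and wrap the features in a DataFrame with named columns
iris = load_iris()
X = pd.DataFrame(iris.data, columns=iris.feature_names)
y = iris.target  # integer labels: 0, 1, 2

# 150 samples with 4 features each, plus the three species names
print(f"Dataset shape: {X.shape}")
print(f"Target classes: {iris.target_names}")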
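Before splitting, it can help to confirm how many samples each species contributes. The following is a minimal sketch, not part of the original notebook; the variable name class_counts is introduced here purely for illustration.

```python
import pandas as pd
from sklearn.datasets import load_iris

iris = load_iris()
X = pd.DataFrame(iris.data, columns=iris.feature_names)
y = iris.target

# Map the integer labels to species names and count samples per species.
# class_counts is an illustrative name, not something the post defines.
class_counts = pd.Series(iris.target_names[y]).value_counts()
print(class_counts)

# Per-feature summary statistics (mean, std, min, max, quartiles)
print(X.describe())
```

The dataset is balanced, with 50 samples per species, which is one reason plain accuracy is a reasonable metric here.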

2. Splitting and Training

Next, we'll hold out 20% of the data (30 of the 150 samples) to test the model's performance and train a Random Forest on the remaining 80%. Setting random_state makes both the split and the model reproducible.

from sklearn.ensemble import RandomForestClassifier

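# Hold out 20% of the samples for testing; random_state fixes the shuffle
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Build an ensemble of 100 decision trees and fit it on the training set
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)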

print("Model trained successfully!")
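To see what the split actually produced, the sketch below checks the sizes of the two sets and the class counts in the test set. It also shows one possible variation that the original post does not use: passing stratify=y to train_test_split so that each species keeps the same proportion in both sets. The names ending in _s are illustrative.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

iris = load_iris()
X, y = iris.data, iris.target

# Same split as in the post: 20% test, fixed seed
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
print(f"Train samples: {len(X_train)}, test samples: {len(X_test)}")
print(f"Test class counts (plain split): {np.bincount(y_test, minlength=3)}")

# Variation (an assumption, not in the original post): stratify on the labels
# so each species is represented in the same proportion in both sets.
X_train_s, X_test_s, y_train_s, y_test_s = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
print(f"Test class counts (stratified): {np.bincount(y_test_s, minlength=3)}")
```

On a dataset this small, a plain random split can leave the test set slightly unbalanced, so stratifying is a common precaution rather than a requirement.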

3. Evaluation

Finally, we'll measure how accurately the model predicts the species of the held-out test set. The classification report adds per-class precision, recall, and F1-score.

from sklearn.metrics import accuracy_score, classification_report

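# Predict species labels for the samples the model has never seen
y_pred = model.predict(X_test)

# Overall fraction of correct predictions, then a per-class breakdown
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy * 100:.2f}%\n")
print(classification_report(y_test, y_pred, target_names=iris.target_names))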
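The learning objectives mention reading a confusion matrix, so here is a minimal sketch of how one could be produced for the same model. It repeats the earlier steps so that it runs on its own; the names cm and cm_df, and the "true"/"pred" labels, are illustrative choices rather than part of the original notebook.

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

# Recreate the data, split, and model from the earlier sections
iris = load_iris()
X = pd.DataFrame(iris.data, columns=iris.feature_names)
y = iris.target
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

# Rows are true species, columns are predicted species.
# labels=[0, 1, 2] guarantees a 3x3 matrix even if a class is missing.
cm = confusion_matrix(y_test, y_pred, labels=[0, 1, 2])
cm_df = pd.DataFrame(
    cm,
    index=[f"true {name}" for name in iris.target_names],
    columns=[f"pred {name}" for name in iris.target_names],
)
print(cm_df)
```

Each diagonal cell counts correct predictions for one species, and any off-diagonal cell shows which species was mistaken for which. The cells always sum to the number of test samples.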

Want to try it yourself?

Download the full interactive Jupyter Notebook for this project to run the code and visualize the results.
