The Art of the Data
Feature engineering is often described as the heart of applied machine learning. It is the process of using domain knowledge to extract informative features from raw data, and well-chosen features can substantially improve the performance of machine learning algorithms.
Normalization & Scaling
Algorithms like SVM and K-Means are highly sensitive to the scale of the data. Normalization (min-max scaling) rescales features such as 'Age' (0-100) and 'Income' ($0-$1M) to a common range, typically 0 to 1, so neither dominates the distance calculations.
One-Hot Encoding
Most machine learning models require numeric input, not text. One-hot encoding converts categorical variables (like 'Color: Red/Blue/Green') into separate binary columns indicating the presence or absence of each category.
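A short sketch of one-hot encoding using the pandas get_dummies helper; the 'Color' column is a hypothetical example:

```python
import pandas as pd

# Hypothetical categorical column
df = pd.DataFrame({'Color': ['Red', 'Blue', 'Green', 'Red']})

# get_dummies creates one binary (0/1) column per category
encoded = pd.get_dummies(df['Color'], prefix='Color')
print(encoded)
```

Each row now has exactly one column set per original value, e.g. 'Red' becomes Color_Red=1 with the other columns 0.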
Handling Missing Data
Real-world data is messy, and dropping rows with missing values isn't always the answer. Imputation techniques fill in the blanks with statistical means, medians, or even model-predicted values.
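A minimal median-imputation sketch using scikit-learn's SimpleImputer; the 'Age' values and missing entries are fabricated for illustration:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical column with missing entries
df = pd.DataFrame({'Age': [22.0, np.nan, 35.0, np.nan, 41.0]})

# Fill blanks with the column median (strategy='mean' is another option)
imputer = SimpleImputer(strategy='median')
df['Age'] = imputer.fit_transform(df[['Age']])
print(df)
```

The median of the observed values (22, 35, 41) is 35, so both missing cells become 35.0.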
Feature Selection
More features aren't always better: too many of them cause the 'curse of dimensionality'. Dimensionality-reduction methods like PCA, or inspecting a correlation matrix, let us drop irrelevant or redundant columns.
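A sketch of both ideas on synthetic data: a correlation matrix exposes a redundant feature pair, and PCA compresses the columns. The feature names f1/f2/f3 and the random data are assumptions for the example:

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
x = rng.normal(size=100)
# 'f2' is nearly a copy of 'f1'; 'f3' is independent noise
df = pd.DataFrame({'f1': x,
                   'f2': x + rng.normal(scale=0.01, size=100),
                   'f3': rng.normal(size=100)})

# The correlation matrix reveals the redundant pair (f1 vs f2 near 1.0)
corr = df.corr()
print(corr.loc['f1', 'f2'])

# PCA compresses the three columns into two components
pca = PCA(n_components=2)
reduced = pca.fit_transform(df)
print(reduced.shape)
```

In practice, features should usually be standardized before PCA so that large-scale columns don't dominate the components.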
import pandas as pd
from sklearn.preprocessing import StandardScaler, OneHotEncoder
# 1. Load Data
df = pd.read_csv('data.csv')
# 2. Fill missing values with median
df['Age'] = df['Age'].fillna(df['Age'].median())
# 3. Standardize numerical feature (zero mean, unit variance)
scaler = StandardScaler()
df['Salary_Scaled'] = scaler.fit_transform(df[['Salary']])
# 4. One-Hot Encode categorical variable
encoded_ports = pd.get_dummies(df['Embarked'], prefix='Port')
df = pd.concat([df, encoded_ports], axis=1)