The Art of the Data
Feature engineering is often described as the heart of applied machine learning. It is the process of using domain knowledge to extract informative features from raw data, and well-chosen features can substantially improve the performance of machine learning algorithms.
Normalization & Scaling
Algorithms like SVM and K-Means are highly sensitive to the scale of the data. Normalization (min-max scaling) rescales features such as 'Age' (0-100) and 'Income' ($0-$1M) to a common range, typically 0 to 1, so neither dominates the distance calculations.
One-Hot Encoding
Most machine learning models require numeric input, not text. One-hot encoding converts categorical variables (like 'Color: Red/Blue/Green') into separate binary columns indicating the presence or absence of each category.
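A short sketch of one-hot encoding using the pandas get_dummies helper; the 'Color' column is a hypothetical example:

```python
import pandas as pd

# Hypothetical categorical column
df = pd.DataFrame({'Color': ['Red', 'Blue', 'Green', 'Red']})

# get_dummies creates one binary (0/1) column per category
encoded = pd.get_dummies(df['Color'], prefix='Color')
print(encoded)
```

Each row now has exactly one column set per original value, e.g. 'Red' becomes Color_Red=1 with the other columns 0.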
Handling Missing Data
Real-world data is messy, and dropping rows with missing values isn't always the answer. Imputation techniques fill in the blanks with statistical means, medians, or even model-predicted values.
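A minimal median-imputation sketch using scikit-learn's SimpleImputer; the 'Age' values and missing entries are fabricated for illustration:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical column with missing entries
df = pd.DataFrame({'Age': [22.0, np.nan, 35.0, np.nan, 41.0]})

# Fill blanks with the column median (strategy='mean' is another option)
imputer = SimpleImputer(strategy='median')
df['Age'] = imputer.fit_transform(df[['Age']])
print(df)
```

The median of the observed values (22, 35, 41) is 35, so both missing cells become 35.0.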
Feature Selection
More features aren't always better: too many of them cause the 'curse of dimensionality'. Dimensionality-reduction methods like PCA, or inspecting a correlation matrix, let us drop irrelevant or redundant columns.
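A sketch of both ideas on synthetic data: a correlation matrix exposes a redundant feature pair, and PCA compresses the columns. The feature names f1/f2/f3 and the random data are assumptions for the example:

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
x = rng.normal(size=100)
# 'f2' is nearly a copy of 'f1'; 'f3' is independent noise
df = pd.DataFrame({'f1': x,
                   'f2': x + rng.normal(scale=0.01, size=100),
                   'f3': rng.normal(size=100)})

# The correlation matrix reveals the redundant pair (f1 vs f2 near 1.0)
corr = df.corr()
print(corr.loc['f1', 'f2'])

# PCA compresses the three columns into two components
pca = PCA(n_components=2)
reduced = pca.fit_transform(df)
print(reduced.shape)
```

In practice, features should usually be standardized before PCA so that large-scale columns don't dominate the components.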
import pandas as pd
from sklearn.preprocessing import StandardScaler, OneHotEncoder
# 1. Load Data
df = pd.read_csv('data.csv')
# 2. Fill missing values with median
df['Age'] = df['Age'].fillna(df['Age'].median())
# 3. Standardize numerical feature (zero mean, unit variance)
scaler = StandardScaler()
df['Salary_Scaled'] = scaler.fit_transform(df[['Salary']])
# 4. One-Hot Encode categorical variable
encoded_ports = pd.get_dummies(df['Embarked'], prefix='Port')
df = pd.concat([df, encoded_ports], axis=1)