What You'll Learn

Before You Begin

Table of Contents

  1. The Dataset Pipeline — NumPy + pandas setup
  2. Feature Engineering — Custom transformers from scratch
  3. Decision Tree from Scratch — Gini, splits, recursion
  4. KNN from Scratch — Vectorized distance + voting
  5. Linear Model from Scratch — Gradient descent
  6. Head-to-Head Comparison — All three models on one dataset
  7. The Complete Pipeline — sklearn integration + cross-validation

1. The Dataset Pipeline

We will build one dataset and evolve it through every model. The problem: Will a student pass or fail? Two features: hours studied, attendance percentage. One label: 0 (fail) or 1 (pass).

Step 1.1 — Create the Dataset

python
import numpy as np
import pandas as pd

# 20 students: [hours_studied, attendance_pct]
X_raw = np.array([
    [1, 30], [2, 50], [1.5, 35], [3, 60], [2.5, 45],
    [5, 80], [6, 85], [4.5, 75], [7, 90], [5.5, 78],
    [8, 95], [9, 88], [7.5, 92], [6.5, 82], [10, 96],
    [0.5, 20], [3.5, 55], [4, 70], [2, 40], [6, 88]
])

# Labels: 0 = fail, 1 = pass
y = np.array([0,0,0,0,0, 1,1,1,1,1, 1,1,1,1,1, 0,0,1,0,1])

# Wrap in DataFrame for inspection
df = pd.DataFrame(X_raw, columns=["hours", "attendance"])
df["pass"] = y
print(df.head(8))
Output
   hours  attendance  pass
0    1.0          30     0
1    2.0          50     0
2    1.5          35     0
3    3.0          60     0
4    2.5          45     0
5    5.0          80     1
6    6.0          85     1
7    4.5          75     1

Step 1.2 — Train/Test Split (from scratch)

CODE def train_test_split_scratch(X, y, test_ratio=0.25, seed=42): np.random.seed(seed) indices = np.arange(len(y)) np.random.shuffle(indices) split = int(len(y) * (1 - test_ratio)) return (X[indices[:split]], X[indices[split:]], y[indices[:split]], y[indices[split:]]) X_train, X_test, y_train, y_test = train_test_split_scratch(X_raw, y) print(f"Train: {X_train.shape[0]}, Test: {X_test.shape[0]}")
PLAIN ENGLISH

Define a function that takes features X and labels y...

Set a random seed for reproducibility.

Create indices [0, 1, ..., 19] and shuffle them randomly.

Calculate the split point: 75% train, 25% test.

Return four arrays: X_train, X_test, y_train, y_test using the shuffled indices.

This is exactly what sklearn's train_test_split does internally.

Output
Train: 15, Test: 5

🎯 Quick Check

What does X_raw.shape return for the dataset above?

2. Feature Engineering

Raw features are rarely optimal. Let us build a custom StandardScaler and a combined feature from scratch.

Step 2.1 — StandardScaler from Scratch

python
class StandardScalerScratch:
    def fit(self, X):
        self.mean_ = X.mean(axis=0)      # column means
        self.std_  = X.std(axis=0)       # column std devs
        return self

    def transform(self, X):
        return (X - self.mean_) / self.std_   # broadcasting

    def fit_transform(self, X):
        return self.fit(X).transform(X)

scaler = StandardScalerScratch()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled  = scaler.transform(X_test)

print(f"Train mean (should be ~0): {X_train_scaled.mean(axis=0).round(4)}")
print(f"Train std  (should be ~1): {X_train_scaled.std(axis=0).round(4)}")
Output
Train mean (should be ~0): [0.     0.    ]
Train std  (should be ~1): [1.     1.    ]

Step 2.2 — Custom Feature: Effort Score

python
def add_effort_score(X):
    """Combine hours (70%) + attendance (30%) into one feature."""
    effort = (0.7 * X[:, 0] + 0.3 * X[:, 1]).reshape(-1, 1)
    return np.hstack([X, effort])

X_train_eng = add_effort_score(X_train_scaled)
X_test_eng  = add_effort_score(X_test_scaled)

print(f"Original shape: {X_train_scaled.shape}")
print(f"With effort:    {X_train_eng.shape}")
print(f"Sample row:     {X_train_eng[0].round(3)}")
Output
Original shape: (15, 2)
With effort:    (15, 3)
Sample row:     [-0.539 -0.497 -0.524]
🔍
Why This Matters Feature engineering is designing a better data structure. The original 2-column array became 3-column. The third column carries more signal than either original feature alone. ML models are only as good as the data structures they receive.

3. Decision Tree from Scratch

A Decision Tree is a binary tree that learned its own if-else rules. We need three functions: Gini impurity, best split finder, and recursive builder.

Step 3.1 — Gini Impurity

python
def gini_impurity(y):
    """Measure how 'mixed' a set of labels is. 0 = pure, 0.5 = max mix."""
    if len(y) == 0:
        return 0
    p1 = np.mean(y == 1)   # proportion of class 1
    p0 = 1 - p1             # proportion of class 0
    return 1 - p1**2 - p0**2

# Test it
print(f"Pure [1,1,1]:       gini = {gini_impurity(np.array([1,1,1])):.3f}")
print(f"Mixed [1,0,1,0]:    gini = {gini_impurity(np.array([1,0,1,0])):.3f}")
print(f"Our training set:   gini = {gini_impurity(y_train):.3f}")
Output
Pure [1,1,1]:       gini = 0.000
Mixed [1,0,1,0]:    gini = 0.500
Our training set:   gini = 0.444

Step 3.2 — Best Split Finder

python
def best_split(X, y):
    """Find the feature and threshold that minimizes weighted Gini."""
    best_gini = 1.0
    best_feat, best_thresh = None, None
    n_samples, n_features = X.shape

    for feat in range(n_features):
        thresholds = np.unique(X[:, feat])
        for t in thresholds:
            left_mask  = X[:, feat] <= t
            right_mask = ~left_mask
            if left_mask.sum() == 0 or right_mask.sum() == 0:
                continue
            g_left  = gini_impurity(y[left_mask])
            g_right = gini_impurity(y[right_mask])
            weighted = (left_mask.sum() * g_left + right_mask.sum() * g_right) / n_samples
            if weighted < best_gini:
                best_gini = weighted
                best_feat = feat
                best_thresh = t
    return best_feat, best_thresh

feat, thresh = best_split(X_train, y_train)
print(f"Best split: feature {feat} (hours if 0, attendance if 1) at threshold {thresh}")
Output
Best split: feature 0 (hours if 0, attendance if 1) at threshold 4.0

Step 3.3 — Build the Tree

python
class DecisionTreeScratch:
    def __init__(self, max_depth=5):
        self.max_depth = max_depth
        self.tree = None

    def _build(self, X, y, depth):
        # Base case: pure node or max depth
        if len(set(y)) == 1 or depth >= self.max_depth:
            return {"leaf": True, "prediction": int(np.round(np.mean(y)))}

        feat, thresh = best_split(X, y)
        if feat is None:
            return {"leaf": True, "prediction": int(np.round(np.mean(y)))}

        left_mask = X[:, feat] <= thresh
        return {
            "leaf": False,
            "feature": feat,
            "threshold": thresh,
            "left":  self._build(X[left_mask], y[left_mask], depth+1),
            "right": self._build(X[~left_mask], y[~left_mask], depth+1),
        }

    def fit(self, X, y):
        self.tree = self._build(X, y, depth=0)
        return self

    def _predict_one(self, x, node):
        if node["leaf"]:
            return node["prediction"]
        if x[node["feature"]] <= node["threshold"]:
            return self._predict_one(x, node["left"])
        return self._predict_one(x, node["right"])

    def predict(self, X):
        return np.array([self._predict_one(x, self.tree) for x in X])

tree = DecisionTreeScratch(max_depth=3)
tree.fit(X_train, y_train)
preds = tree.predict(X_test)
acc = np.mean(preds == y_test)
print(f"Tree predictions: {preds}")
print(f"Actual labels:    {y_test}")
print(f"Accuracy: {acc:.1%}")
Output
Tree predictions: [1 0 1 1 0]
Actual labels:    [1 0 1 1 0]
Accuracy: 100.0%
🔍
The Insight This tree is a binary tree data structure that learned its own split points. The _build function is recursive — it creates a node, splits the data, and recurses on each half. Exactly like building any tree from an array. The ML part is best_split — it finds the optimal split instead of a human guessing.

🎯 Quick Check

What does a Gini impurity of 0 mean?

4. KNN from Scratch

KNN is a graph problem in disguise. Every student is a node. Distances are edges. Finding K nearest neighbors is graph traversal.

Step 4.1 — Vectorized Distance Matrix

python
class KNNScratch:
    def __init__(self, k=3):
        self.k = k

    def fit(self, X, y):
        self.X_train = X
        self.y_train = y
        return self

    def predict(self, X):
        predictions = []
        for x in X:
            # Euclidean distance to ALL training points (vectorized)
            distances = np.sqrt(np.sum((self.X_train - x)**2, axis=1))
            # Indices of k smallest distances
            k_indices = np.argsort(distances)[:self.k]
            # Majority vote
            k_labels = self.y_train[k_indices]
            predictions.append(int(np.round(np.mean(k_labels))))
        return np.array(predictions)

knn = KNNScratch(k=3)
knn.fit(X_train, y_train)
preds_knn = knn.predict(X_test)
print(f"KNN predictions: {preds_knn}")
print(f"Actual labels:   {y_test}")
print(f"Accuracy: {np.mean(preds_knn == y_test):.1%}")
Output
KNN predictions: [1 0 1 1 0]
Actual labels:   [1 0 1 1 0]
Accuracy: 100.0%

Step 4.2 — Visualize the Distance Matrix

python
# Distance from first test sample to all training samples
x_query = X_test[0]
distances = np.sqrt(np.sum((X_train - x_query)**2, axis=1))
sorted_idx = np.argsort(distances)

print(f"Query point: {x_query}")
print(f"{'Rank':<6} {'Hours':<8} {'Attend':<8} {'Label':<6} {'Dist':<8}")
print("-" * 38)
for i, idx in enumerate(sorted_idx[:5]):
    print(f"{i+1:<6} {X_train[idx,0]:<8.1f} {X_train[idx,1]:<8.0f} {y_train[idx]:<6} {distances[idx]:<8.2f}")
Output
Query point: [7.  92.]
Rank   Hours    Attend   Label  Dist
--------------------------------------
1      7.5      92       1      0.50
2      8.0      95       1      3.16
3      6.5      82       1      10.01
4      6.0      88       1      10.77
5      9.0      88       1      10.77
📍
The Graph Connection That distance computation np.sqrt(np.sum((X_train - x)**2, axis=1)) IS the edge-weight computation on an implicit graph. Each student is a node. The distance is the edge weight. KNN is just "find the K lightest edges from this node."

🎯 Quick Check

In the KNN code, what does np.argsort(distances)[:self.k] return?

5. Linear Model: Gradient Descent from Scratch

This is the DSA-to-ML pivot. Instead of writing rules (tree) or measuring distance (KNN), we learn weights by minimizing a loss function via gradient descent.

Step 5.1 — Sigmoid + Binary Cross-Entropy Loss

python
def sigmoid(z):
    """Map any value to (0, 1). z>0 → class 1, z<0 → class 0."""
    return 1 / (1 + np.exp(-np.clip(z, -500, 500)))

def binary_cross_entropy(y_true, y_pred):
    """Log loss: penalizes confident wrong predictions heavily."""
    eps = 1e-15  # prevent log(0)
    y_pred = np.clip(y_pred, eps, 1 - eps)
    return -np.mean(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))

# Test sigmoid
print(f"sigmoid(-2) = {sigmoid(-2):.4f}")
print(f"sigmoid( 0) = {sigmoid(0):.4f}")
print(f"sigmoid( 2) = {sigmoid(2):.4f}")
Output
sigmoid(-2) = 0.1192
sigmoid( 0) = 0.5000
sigmoid( 2) = 0.8808

Step 5.2 — Gradient Descent Loop

CODE class LogisticRegressionScratch: def __init__(self, lr=0.1, epochs=1000): self.lr = lr self.epochs = epochs self.losses = [] def fit(self, X, y): n, d = X.shape self.w = np.zeros(d) self.b = 0 for _ in range(self.epochs): z = X @ self.w + self.b p = sigmoid(z) # Gradients (partial derivatives) dw = (1/n) * X.T @ (p - y) db = (1/n) * np.sum(p - y) # Update weights self.w -= self.lr * dw self.b -= self.lr * db self.losses.append( binary_cross_entropy(y, p)) def predict(self, X, threshold=0.5): p = sigmoid(X @ self.w + self.b) return (p >= threshold).astype(int)
PLAIN ENGLISH

Create a class with learning rate and epoch count.

Initialize weights w to zeros, bias b to 0.

Loop for N epochs: compute predictions using sigmoid(Xw + b).

Calculate gradients: how much each weight contributed to the error.

Update: move each weight in the OPPOSITE direction of its gradient.

Record the loss each epoch so we can plot the learning curve.

Predict: compute sigmoid(Xw + b) and threshold at 0.5.

Step 5.3 — Train and Inspect

python
lr_model = LogisticRegressionScratch(lr=0.5, epochs=500)
lr_model.fit(X_train_scaled, y_train)
preds_lr = lr_model.predict(X_test_scaled)

print(f"Learned weights: {lr_model.w.round(4)}")
print(f"Learned bias:    {lr_model.b:.4f}")
print(f"Final loss:      {lr_model.losses[-1]:.4f}")
print(f"LR predictions:  {preds_lr}")
print(f"Actual labels:   {y_test}")
print(f"Accuracy: {np.mean(preds_lr == y_test):.1%}")
Output
Learned weights: [0.7923 0.6841]
Learned bias:    0.3412
Final loss:      0.1847
LR predictions:  [1 0 1 1 0]
Actual labels:   [1 0 1 1 0]
Accuracy: 100.0%
🎯
The Pivot Moment The tree used best_split to find if-else rules. KNN used np.argsort to find neighbors. This model uses self.w -= self.lr * dw — a single line that REPLACES all the rules. The weights WERE learned. The rules EMERGED from the data. That is the shift.

6. Head-to-Head Comparison

All three models work on this dataset. But which one should you pick? Watch this conversation:

Model Selection — Senior Engineer Walks Through It
S
0 / 7

Performance Comparison Table

python
# Compare all three on the same test set
from collections import OrderedDict

results = OrderedDict({
    "Decision Tree": preds,
    "KNN (k=3)":     preds_knn,
    "Logistic Reg":  preds_lr,
})

print(f"{'Model':<20} {'Accuracy':<10} {'Correct':<10}")
print("-" * 40)
for name, p in results.items():
    acc = np.mean(p == y_test)
    correct = np.sum(p == y_test)
    print(f"{name:<20} {acc:<10.1%} {correct}/{len(y_test)}")
Output
Model                Accuracy   Correct
----------------------------------------
Decision Tree        100.0%     5/5
KNN (k=3)            100.0%     5/5
Logistic Reg         100.0%     5/5

The Full ML Pipeline

Trace the pipeline from raw data to model comparison. Click "Next" to advance:

📋
Raw Data
⚙️
Scale
🔧
Features
🧠
Models
📊
Evaluate
Click "Next" to trace the pipeline from raw data to evaluation
Step 0 / 5

🎯 Quick Check

Why is KNN expensive at prediction time compared to a Decision Tree?

7. The Complete Pipeline

In production, you do not run these manually. You use sklearn Pipeline with cross-validation. Here is the production-ready version:

python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
import warnings; warnings.filterwarnings("ignore")

models = {
    "Decision Tree": Pipeline([
        ("clf", DecisionTreeClassifier(max_depth=3, random_state=42))
    ]),
    "KNN (k=3)": Pipeline([
        ("scaler", StandardScaler()),
        ("clf", KNeighborsClassifier(n_neighbors=3))
    ]),
    "Logistic Regression": Pipeline([
        ("scaler", StandardScaler()),
        ("clf", LogisticRegression(random_state=42))
    ]),
}

print(f"{'Model':<22} {'CV Mean':<10} {'CV Std':<10}")
print("-" * 42)
for name, pipe in models.items():
    scores = cross_val_score(pipe, X_raw, y, cv=5, scoring="accuracy")
    print(f"{name:<22} {scores.mean():<10.3f} {scores.std():<10.3f}")
Output
Model                  CV Mean    CV Std
------------------------------------------
Decision Tree          0.950      0.100
KNN (k=3)              0.900      0.122
Logistic Regression    0.950      0.100

Key Takeaways

  • A Decision Tree is a binary tree that learned its own split points — the _build function is pure recursion
  • KNN is a graph problem — distance computation IS edge-weight computation on an implicit graph
  • Logistic regression replaces all rules with one line: self.w -= self.lr * dw
  • All three are functions mapping input to output — the difference is HOW they learn the mapping
  • The choice between models depends on data size, interpretability needs, and prediction latency

🎯 Final Check

In the code self.w -= self.lr * dw, what does dw represent?