What You'll Learn
- Implement a Decision Tree classifier from scratch using only Python
- Build a KNN classifier with NumPy vectorized operations
- Code logistic regression via gradient descent without sklearn
- Compare all three models head-to-head on the same dataset
- Connect every data structure (array, tree, graph, hashmap) to its ML equivalent through working code
Before You Begin
- Python fundamentals (classes, functions, list comprehensions)
- NumPy basics (arrays, broadcasting, vectorized operations)
- ML fundamentals (train/test split, accuracy, loss functions)
Table of Contents
- The Dataset Pipeline — NumPy + pandas setup
- Feature Engineering — Custom transformers from scratch
- Decision Tree from Scratch — Gini, splits, recursion
- KNN from Scratch — Vectorized distance + voting
- Linear Model from Scratch — Gradient descent
- Head-to-Head Comparison — All three models on one dataset
- The Complete Pipeline — sklearn integration + cross-validation
1. The Dataset Pipeline
We will build one dataset and evolve it through every model. The problem: Will a student pass or fail? Two features: hours studied, attendance percentage. One label: 0 (fail) or 1 (pass).
Step 1.1 — Create the Dataset
import numpy as np
import pandas as pd
# 20 students: [hours_studied, attendance_pct]
X_raw = np.array([
[1, 30], [2, 50], [1.5, 35], [3, 60], [2.5, 45],
[5, 80], [6, 85], [4.5, 75], [7, 90], [5.5, 78],
[8, 95], [9, 88], [7.5, 92], [6.5, 82], [10, 96],
[0.5, 20], [3.5, 55], [4, 70], [2, 40], [6, 88]
])
# Labels: 0 = fail, 1 = pass
y = np.array([0,0,0,0,0, 1,1,1,1,1, 1,1,1,1,1, 0,0,1,0,1])
# Wrap in DataFrame for inspection
df = pd.DataFrame(X_raw, columns=["hours", "attendance"])
df["pass"] = y
print(df.head(8))
hours attendance pass 0 1.0 30 0 1 2.0 50 0 2 1.5 35 0 3 3.0 60 0 4 2.5 45 0 5 5.0 80 1 6 6.0 85 1 7 4.5 75 1
Step 1.2 — Train/Test Split (from scratch)
Define a function that takes features X and labels y...
Set a random seed for reproducibility.
Create indices [0, 1, ..., 19] and shuffle them randomly.
Calculate the split point: 75% train, 25% test.
Return four arrays: X_train, X_test, y_train, y_test using the shuffled indices.
This is exactly what sklearn's train_test_split does internally.
Train: 15, Test: 5
🎯 Quick Check
What does X_raw.shape return for the dataset above?
2. Feature Engineering
Raw features are rarely optimal. Let us build a custom StandardScaler and a combined feature from scratch.
Step 2.1 — StandardScaler from Scratch
class StandardScalerScratch:
def fit(self, X):
self.mean_ = X.mean(axis=0) # column means
self.std_ = X.std(axis=0) # column std devs
return self
def transform(self, X):
return (X - self.mean_) / self.std_ # broadcasting
def fit_transform(self, X):
return self.fit(X).transform(X)
scaler = StandardScalerScratch()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
print(f"Train mean (should be ~0): {X_train_scaled.mean(axis=0).round(4)}")
print(f"Train std (should be ~1): {X_train_scaled.std(axis=0).round(4)}")
Train mean (should be ~0): [0. 0. ] Train std (should be ~1): [1. 1. ]
Step 2.2 — Custom Feature: Effort Score
def add_effort_score(X):
"""Combine hours (70%) + attendance (30%) into one feature."""
effort = (0.7 * X[:, 0] + 0.3 * X[:, 1]).reshape(-1, 1)
return np.hstack([X, effort])
X_train_eng = add_effort_score(X_train_scaled)
X_test_eng = add_effort_score(X_test_scaled)
print(f"Original shape: {X_train_scaled.shape}")
print(f"With effort: {X_train_eng.shape}")
print(f"Sample row: {X_train_eng[0].round(3)}")
Original shape: (15, 2) With effort: (15, 3) Sample row: [-0.539 -0.497 -0.524]
3. Decision Tree from Scratch
A Decision Tree is a binary tree that learned its own if-else rules. We need three functions: Gini impurity, best split finder, and recursive builder.
Step 3.1 — Gini Impurity
def gini_impurity(y):
"""Measure how 'mixed' a set of labels is. 0 = pure, 0.5 = max mix."""
if len(y) == 0:
return 0
p1 = np.mean(y == 1) # proportion of class 1
p0 = 1 - p1 # proportion of class 0
return 1 - p1**2 - p0**2
# Test it
print(f"Pure [1,1,1]: gini = {gini_impurity(np.array([1,1,1])):.3f}")
print(f"Mixed [1,0,1,0]: gini = {gini_impurity(np.array([1,0,1,0])):.3f}")
print(f"Our training set: gini = {gini_impurity(y_train):.3f}")
Pure [1,1,1]: gini = 0.000 Mixed [1,0,1,0]: gini = 0.500 Our training set: gini = 0.444
Step 3.2 — Best Split Finder
def best_split(X, y):
"""Find the feature and threshold that minimizes weighted Gini."""
best_gini = 1.0
best_feat, best_thresh = None, None
n_samples, n_features = X.shape
for feat in range(n_features):
thresholds = np.unique(X[:, feat])
for t in thresholds:
left_mask = X[:, feat] <= t
right_mask = ~left_mask
if left_mask.sum() == 0 or right_mask.sum() == 0:
continue
g_left = gini_impurity(y[left_mask])
g_right = gini_impurity(y[right_mask])
weighted = (left_mask.sum() * g_left + right_mask.sum() * g_right) / n_samples
if weighted < best_gini:
best_gini = weighted
best_feat = feat
best_thresh = t
return best_feat, best_thresh
feat, thresh = best_split(X_train, y_train)
print(f"Best split: feature {feat} (hours if 0, attendance if 1) at threshold {thresh}")
Best split: feature 0 (hours if 0, attendance if 1) at threshold 4.0
Step 3.3 — Build the Tree
class DecisionTreeScratch:
def __init__(self, max_depth=5):
self.max_depth = max_depth
self.tree = None
def _build(self, X, y, depth):
# Base case: pure node or max depth
if len(set(y)) == 1 or depth >= self.max_depth:
return {"leaf": True, "prediction": int(np.round(np.mean(y)))}
feat, thresh = best_split(X, y)
if feat is None:
return {"leaf": True, "prediction": int(np.round(np.mean(y)))}
left_mask = X[:, feat] <= thresh
return {
"leaf": False,
"feature": feat,
"threshold": thresh,
"left": self._build(X[left_mask], y[left_mask], depth+1),
"right": self._build(X[~left_mask], y[~left_mask], depth+1),
}
def fit(self, X, y):
self.tree = self._build(X, y, depth=0)
return self
def _predict_one(self, x, node):
if node["leaf"]:
return node["prediction"]
if x[node["feature"]] <= node["threshold"]:
return self._predict_one(x, node["left"])
return self._predict_one(x, node["right"])
def predict(self, X):
return np.array([self._predict_one(x, self.tree) for x in X])
tree = DecisionTreeScratch(max_depth=3)
tree.fit(X_train, y_train)
preds = tree.predict(X_test)
acc = np.mean(preds == y_test)
print(f"Tree predictions: {preds}")
print(f"Actual labels: {y_test}")
print(f"Accuracy: {acc:.1%}")
Tree predictions: [1 0 1 1 0] Actual labels: [1 0 1 1 0] Accuracy: 100.0%
_build function is recursive — it creates a node, splits the data, and recurses on each half. Exactly like building any tree from an array. The ML part is best_split — it finds the optimal split instead of a human guessing.
🎯 Quick Check
What does a Gini impurity of 0 mean?
4. KNN from Scratch
KNN is a graph problem in disguise. Every student is a node. Distances are edges. Finding K nearest neighbors is graph traversal.
Step 4.1 — Vectorized Distance Matrix
class KNNScratch:
def __init__(self, k=3):
self.k = k
def fit(self, X, y):
self.X_train = X
self.y_train = y
return self
def predict(self, X):
predictions = []
for x in X:
# Euclidean distance to ALL training points (vectorized)
distances = np.sqrt(np.sum((self.X_train - x)**2, axis=1))
# Indices of k smallest distances
k_indices = np.argsort(distances)[:self.k]
# Majority vote
k_labels = self.y_train[k_indices]
predictions.append(int(np.round(np.mean(k_labels))))
return np.array(predictions)
knn = KNNScratch(k=3)
knn.fit(X_train, y_train)
preds_knn = knn.predict(X_test)
print(f"KNN predictions: {preds_knn}")
print(f"Actual labels: {y_test}")
print(f"Accuracy: {np.mean(preds_knn == y_test):.1%}")
KNN predictions: [1 0 1 1 0] Actual labels: [1 0 1 1 0] Accuracy: 100.0%
Step 4.2 — Visualize the Distance Matrix
# Distance from first test sample to all training samples
x_query = X_test[0]
distances = np.sqrt(np.sum((X_train - x_query)**2, axis=1))
sorted_idx = np.argsort(distances)
print(f"Query point: {x_query}")
print(f"{'Rank':<6} {'Hours':<8} {'Attend':<8} {'Label':<6} {'Dist':<8}")
print("-" * 38)
for i, idx in enumerate(sorted_idx[:5]):
print(f"{i+1:<6} {X_train[idx,0]:<8.1f} {X_train[idx,1]:<8.0f} {y_train[idx]:<6} {distances[idx]:<8.2f}")
Query point: [7. 92.] Rank Hours Attend Label Dist -------------------------------------- 1 7.5 92 1 0.50 2 8.0 95 1 3.16 3 6.5 82 1 10.01 4 6.0 88 1 10.77 5 9.0 88 1 10.77
np.sqrt(np.sum((X_train - x)**2, axis=1)) IS the edge-weight computation on an implicit graph. Each student is a node. The distance is the edge weight. KNN is just "find the K lightest edges from this node."
🎯 Quick Check
In the KNN code, what does np.argsort(distances)[:self.k] return?
5. Linear Model: Gradient Descent from Scratch
This is the DSA-to-ML pivot. Instead of writing rules (tree) or measuring distance (KNN), we learn weights by minimizing a loss function via gradient descent.
Step 5.1 — Sigmoid + Binary Cross-Entropy Loss
def sigmoid(z):
"""Map any value to (0, 1). z>0 → class 1, z<0 → class 0."""
return 1 / (1 + np.exp(-np.clip(z, -500, 500)))
def binary_cross_entropy(y_true, y_pred):
"""Log loss: penalizes confident wrong predictions heavily."""
eps = 1e-15 # prevent log(0)
y_pred = np.clip(y_pred, eps, 1 - eps)
return -np.mean(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))
# Test sigmoid
print(f"sigmoid(-2) = {sigmoid(-2):.4f}")
print(f"sigmoid( 0) = {sigmoid(0):.4f}")
print(f"sigmoid( 2) = {sigmoid(2):.4f}")
sigmoid(-2) = 0.1192 sigmoid( 0) = 0.5000 sigmoid( 2) = 0.8808
Step 5.2 — Gradient Descent Loop
Create a class with learning rate and epoch count.
Initialize weights w to zeros, bias b to 0.
Loop for N epochs: compute predictions using sigmoid(Xw + b).
Calculate gradients: how much each weight contributed to the error.
Update: move each weight in the OPPOSITE direction of its gradient.
Record the loss each epoch so we can plot the learning curve.
Predict: compute sigmoid(Xw + b) and threshold at 0.5.
Step 5.3 — Train and Inspect
lr_model = LogisticRegressionScratch(lr=0.5, epochs=500)
lr_model.fit(X_train_scaled, y_train)
preds_lr = lr_model.predict(X_test_scaled)
print(f"Learned weights: {lr_model.w.round(4)}")
print(f"Learned bias: {lr_model.b:.4f}")
print(f"Final loss: {lr_model.losses[-1]:.4f}")
print(f"LR predictions: {preds_lr}")
print(f"Actual labels: {y_test}")
print(f"Accuracy: {np.mean(preds_lr == y_test):.1%}")
Learned weights: [0.7923 0.6841] Learned bias: 0.3412 Final loss: 0.1847 LR predictions: [1 0 1 1 0] Actual labels: [1 0 1 1 0] Accuracy: 100.0%
best_split to find if-else rules. KNN used np.argsort to find neighbors. This model uses self.w -= self.lr * dw — a single line that REPLACES all the rules. The weights WERE learned. The rules EMERGED from the data. That is the shift.
6. Head-to-Head Comparison
All three models work on this dataset. But which one should you pick? Watch this conversation:
Performance Comparison Table
# Compare all three on the same test set
from collections import OrderedDict
results = OrderedDict({
"Decision Tree": preds,
"KNN (k=3)": preds_knn,
"Logistic Reg": preds_lr,
})
print(f"{'Model':<20} {'Accuracy':<10} {'Correct':<10}")
print("-" * 40)
for name, p in results.items():
acc = np.mean(p == y_test)
correct = np.sum(p == y_test)
print(f"{name:<20} {acc:<10.1%} {correct}/{len(y_test)}")
Model Accuracy Correct ---------------------------------------- Decision Tree 100.0% 5/5 KNN (k=3) 100.0% 5/5 Logistic Reg 100.0% 5/5
The Full ML Pipeline
Trace the pipeline from raw data to model comparison. Click "Next" to advance:
🎯 Quick Check
Why is KNN expensive at prediction time compared to a Decision Tree?
7. The Complete Pipeline
In production, you do not run these manually. You use sklearn Pipeline with cross-validation. Here is the production-ready version:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
import warnings; warnings.filterwarnings("ignore")
models = {
"Decision Tree": Pipeline([
("clf", DecisionTreeClassifier(max_depth=3, random_state=42))
]),
"KNN (k=3)": Pipeline([
("scaler", StandardScaler()),
("clf", KNeighborsClassifier(n_neighbors=3))
]),
"Logistic Regression": Pipeline([
("scaler", StandardScaler()),
("clf", LogisticRegression(random_state=42))
]),
}
print(f"{'Model':<22} {'CV Mean':<10} {'CV Std':<10}")
print("-" * 42)
for name, pipe in models.items():
scores = cross_val_score(pipe, X_raw, y, cv=5, scoring="accuracy")
print(f"{name:<22} {scores.mean():<10.3f} {scores.std():<10.3f}")
Model CV Mean CV Std ------------------------------------------ Decision Tree 0.950 0.100 KNN (k=3) 0.900 0.122 Logistic Regression 0.950 0.100
Key Takeaways
- A Decision Tree is a binary tree that learned its own split points — the
_buildfunction is pure recursion - KNN is a graph problem — distance computation IS edge-weight computation on an implicit graph
- Logistic regression replaces all rules with one line:
self.w -= self.lr * dw - All three are functions mapping input to output — the difference is HOW they learn the mapping
- The choice between models depends on data size, interpretability needs, and prediction latency
🎯 Final Check
In the code self.w -= self.lr * dw, what does dw represent?