Deep Tutorial · Recommendation Systems

Building a Recommendation Engine
in Plain PHP from Scratch

Target platform: aplly.in / exam-prep sites
Stack: Plain PHP · MySQL · No frameworks
Depth: Algorithm-first
Reading time: ~45 min
Table of Contents
  1. What is a Recommendation Engine?
  2. Database Schema Design
  3. Signal Collection in PHP
  4. Content-Based Filtering (Deep Dive)
  5. Collaborative Filtering (Deep Dive)
  6. The Hybrid Combiner
  7. Re-Ranking & Domain Rules
  8. Cold Start Problem
  9. The PHP API Endpoint
  10. Caching & Performance
  11. Evaluation Metrics
  12. Putting It All Together
Section 01

What Is a Recommendation Engine?

A recommendation engine is a system that predicts which items a user is most likely to find useful, given everything you know about the user, the items, and how similar users have behaved. On an exam-prep platform like aplly.in, this means surfacing the right mock test, case study, or topic at the right time — not just the most popular one.

There are three core paradigms, and every real-world system is a blend of them:

| Paradigm | Core question | Data needed | Works when | Fails when |
| Content-Based | "What other items are similar to things this user liked?" | Item metadata (tags, category, difficulty) | User has some history | Items are under-tagged or all similar |
| Collaborative Filtering | "What did similar users like?" | User-item interaction matrix | Many users, dense interactions | New users/items (cold start) |
| Popularity-Based | "What do most people find useful right now?" | Aggregate interaction counts | No user history at all | User has specific niche needs |
Architecture insight
The system we build uses all three. Popularity handles new/anonymous users, content-based kicks in after a user's first 2–3 interactions, and collaborative filtering becomes the dominant signal once the user base grows past ~200 active users. The key is a hybrid combiner that blends them with tunable weights.

The Full Pipeline

Before writing a single line of code, understand the data flow:

1. 📡 Signals: raw user events (clicks, scores, bookmarks, time)
2. 🧮 Models: CB + CF algorithms produce candidate lists
3. ⚖️ Combine: weighted merge of all candidate scores
4. 🎛️ Re-rank: apply business rules, diversity, difficulty
5. 🔌 API: serve top-N as JSON, cache in Redis/file
Section 02

Database Schema Design

Everything starts with the schema. The recommendation engine is only as good as the signals it can query. Design these tables from day one — retrofitting is painful.

SQL — MySQL schema.sql
-- ─────────────────────────────────────────────
-- CONTENT: every item that can be recommended
-- ─────────────────────────────────────────────
CREATE TABLE content (
  id            INT AUTO_INCREMENT PRIMARY KEY,
  title         VARCHAR(255) NOT NULL,
  content_type  ENUM('mock_test', 'case_study') NOT NULL,
  category      VARCHAR(80),   -- 'ssc', 'banking', 'rto' …
  difficulty    TINYINT,        -- 1 = easy, 5 = very hard
  question_count SMALLINT,
  duration_min  SMALLINT,
  popularity    INT DEFAULT 0,   -- denormalised, updated by cron
  created_at    TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
  INDEX idx_category (category),
  INDEX idx_difficulty (difficulty)
);

-- ─────────────────────────────────────────────
-- CONTENT TAGS: many-to-many tag system
-- Each content item carries multiple semantic tags
-- ─────────────────────────────────────────────
CREATE TABLE content_tags (
  content_id  INT NOT NULL,
  tag         VARCHAR(80) NOT NULL,
  weight      FLOAT DEFAULT 1.0,  -- primary topic = 1.0, secondary = 0.5
  PRIMARY KEY (content_id, tag),
  INDEX idx_tag (tag)
);

-- Example rows for aplly.in:
-- (12, 'quantitative-aptitude', 1.0)
-- (12, 'ssc-cgl', 1.0)
-- (12, 'number-system', 0.7)
-- (12, 'difficulty-medium', 0.5)

-- ─────────────────────────────────────────────
-- USER EVENTS: the raw interaction log
-- This is the single most important table
-- ─────────────────────────────────────────────
CREATE TABLE user_events (
  id            BIGINT AUTO_INCREMENT PRIMARY KEY,
  user_id       INT NOT NULL,
  content_id    INT NOT NULL,
  event_type    ENUM('view','start','complete','bookmark','score') NOT NULL,
  score_pct     FLOAT NULL,       -- 0–100, set on 'score' and scored 'complete' events
  duration_sec  INT NULL,          -- seconds spent
  session_id    VARCHAR(64),       -- group events in same session
  created_at    TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
  INDEX idx_user     (user_id),
  INDEX idx_content  (content_id),
  INDEX idx_user_con (user_id, content_id),
  INDEX idx_created  (created_at)
);

-- ─────────────────────────────────────────────
-- USER_ITEM_SCORES: pre-computed interaction scores (the user-item matrix)
-- Rebuilt every night by a cron job; also upserted live by EventLogger
-- ─────────────────────────────────────────────
CREATE TABLE user_item_scores (
  user_id    INT NOT NULL,
  content_id INT NOT NULL,
  score      FLOAT NOT NULL,   -- 0.0–1.0 normalised
  updated_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
  PRIMARY KEY (user_id, content_id),
  INDEX idx_score (user_id, score DESC)
);

-- ─────────────────────────────────────────────
-- ITEM_SIMILARITY: pre-computed CB similarity pairs
-- Also rebuilt nightly — avoids real-time cosine math
-- ─────────────────────────────────────────────
CREATE TABLE item_similarity (
  item_a     INT NOT NULL,
  item_b     INT NOT NULL,
  similarity FLOAT NOT NULL,
  PRIMARY KEY (item_a, item_b),
  INDEX idx_sim (item_a, similarity DESC)
);
Design choice
The user_item_scores and item_similarity tables are materialised views built by offline cron jobs. Never compute cosine similarity on every API request — that doesn't scale past a few hundred content items. Pre-compute, store, serve.
Section 03

Signal Collection in PHP

A recommendation engine is only as good as its training signal. Before any algorithm runs, you need to reliably capture what users do. On aplly.in, every meaningful interaction needs to be logged as a weighted signal.

Signal Weights

Not all interactions are equal. Define implicit weights upfront:

| Event | Weight | Reasoning |
| bookmark | 1.0 | Explicit intent signal; the strongest |
| complete + score ≥ 80% | 0.9 | Finished and did well: deep engagement |
| complete + score 40–79% | 0.7 | Struggled but persisted: high relevance |
| complete + score < 40% | 0.5 | May be too hard: weak positive signal |
| start (spent ≥ 5 min) | 0.4 | Engaged but didn't finish |
| start (spent 1–5 min) | 0.2 | Brief engagement |
| start (spent < 1 min) | 0.1 | Likely a bounce: weak signal |
| view (page load only) | 0.05 | Curiosity, not intent |
PHP signals/EventLogger.php
<?php

class EventLogger {

  /** Implicit weight table — tune these from A/B tests over time */
  const WEIGHTS = [
    'bookmark'  => 1.0,
    'complete'  => 0.7,   // base; overridden by score below
    'start'     => 0.4,   // overridden by duration below
    'view'      => 0.05,
  ];

  private $db;

  public function __construct(PDO $db) {
    $this->db = $db;
  }

  /**
   * Log a single user event and immediately upsert
   * the pre-computed user_item_scores row.
   *
   * @param int    $userId
   * @param int    $contentId
   * @param string $eventType  view | start | complete | bookmark | score
   * @param array  $meta       ['score_pct' => 74, 'duration_sec' => 320, ...]
   */
  public function log(
    int $userId,
    int $contentId,
    string $eventType,
    array $meta = []
  ): void {

    // 1. Insert raw event (never update — append-only log)
    $stmt = $this->db->prepare("
      INSERT INTO user_events
        (user_id, content_id, event_type, score_pct, duration_sec, session_id)
      VALUES (?, ?, ?, ?, ?, ?)
    ");
    $stmt->execute([
      $userId,
      $contentId,
      $eventType,
      $meta['score_pct']     ?? null,
      $meta['duration_sec']  ?? null,
      $meta['session_id']    ?? null,
    ]);

    // 2. Calculate implicit weight for this event
    $weight = $this->computeWeight($eventType, $meta);

    // 3. Upsert into pre-computed scores table
    //    We take max() — never let a bad bounce overwrite a bookmark
    $upsert = $this->db->prepare("
      INSERT INTO user_item_scores (user_id, content_id, score)
      VALUES (?, ?, ?)
      ON DUPLICATE KEY UPDATE
        score = GREATEST(score, VALUES(score))
    ");
    $upsert->execute([$userId, $contentId, $weight]);
  }

  private function computeWeight(string $event, array $meta): float {
    switch ($event) {
      case 'complete':
        $score = $meta['score_pct'] ?? 50;
        if ($score >= 80) return 0.9;
        if ($score >= 40) return 0.7;
        return 0.5;  // completed but scored badly

      case 'start':
        $dur = $meta['duration_sec'] ?? 0;
        if ($dur >= 300) return 0.4;  // 5+ min
        if ($dur >= 60)  return 0.2;  // 1–5 min
        return 0.1;                      // bounce

      default:
        return self::WEIGHTS[$event] ?? 0.05;
    }
  }
}
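
To see how the weights and the GREATEST upsert interact, here is a minimal in-memory sketch. The helper names `implicitWeight()` and `foldEvents()` are hypothetical: the first mirrors `computeWeight()` above, and the second stands in for the `ON DUPLICATE KEY UPDATE` clause with a plain array, so no database is needed to try it.

```php
<?php
// Hypothetical stand-alone mirror of EventLogger::computeWeight().
function implicitWeight(string $event, array $meta = []): float
{
    switch ($event) {
        case 'bookmark':
            return 1.0;
        case 'complete':
            $score = $meta['score_pct'] ?? 50;
            if ($score >= 80) return 0.9;
            if ($score >= 40) return 0.7;
            return 0.5;
        case 'start':
            $dur = $meta['duration_sec'] ?? 0;
            if ($dur >= 300) return 0.4;
            if ($dur >= 60)  return 0.2;
            return 0.1;
        default:
            return 0.05;  // 'view' and anything unrecognised
    }
}

// In-memory stand-in for GREATEST(score, VALUES(score)):
// each (user, content) pair keeps the strongest signal seen so far.
function foldEvents(array $events): array
{
    $scores = [];
    foreach ($events as [$userId, $contentId, $event, $meta]) {
        $key = "$userId:$contentId";
        $scores[$key] = max($scores[$key] ?? 0.0, implicitWeight($event, $meta));
    }
    return $scores;
}

$scores = foldEvents([
    [7, 12, 'view',     []],
    [7, 12, 'complete', ['score_pct' => 85]],
    [7, 12, 'start',    ['duration_sec' => 20]],   // a later bounce does not lower the score
    [7, 15, 'start',    ['duration_sec' => 120]],
]);

print_r($scores);  // ['7:12' => 0.9, '7:15' => 0.2]
```

The bounce on item 12 arrives after the strong completion, yet the stored score stays at 0.9. That is the behaviour the max rule is meant to guarantee.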
Section 04

Content-Based Filtering — Deep Dive

Content-based filtering answers: "Given items this user has interacted with, find other items with similar properties." It works from day one because it only needs item metadata — no other users required.

The Core Algorithm: Weighted Tag Vectors + Cosine Similarity

Each content item is represented as a tag vector. The similarity between two items is the cosine of the angle between their vectors. Identical items: cosine = 1.0. Completely different: cosine = 0.0.

Algorithm: Cosine Similarity
Two items A and B are represented as vectors over a shared tag space. Each dimension is the tag weight (0.0–1.0). Cosine similarity measures the orientation of the vectors — it ignores magnitude (item length) and focuses on composition.
$$\mathrm{sim}(A, B) = \frac{A \cdot B}{\lVert A \rVert \, \lVert B \rVert} = \frac{\sum_i A_i B_i}{\sqrt{\sum_i A_i^2} \, \sqrt{\sum_i B_i^2}}$$

Worked Example

Imagine our tag universe has 5 dimensions. Two SSC CGL tests and one IBPS PO test:

| Tag | Test A: SSC CGL Mock 1 | Test B: SSC CGL Mock 2 | Test C: IBPS PO Mock 1 |
| ssc-cgl | 1.0 | 1.0 | 0.0 |
| quantitative-aptitude | 0.7 | 0.8 | 0.6 |
| reasoning | 0.5 | 0.5 | 0.7 |
| banking | 0.0 | 0.0 | 1.0 |
| difficulty-medium | 0.5 | 0.4 | 0.5 |

sim(A, B): dot = (1×1)+(0.7×0.8)+(0.5×0.5)+(0×0)+(0.5×0.4) = 1+0.56+0.25+0+0.2 = 2.01
‖A‖ = √(1+0.49+0.25+0+0.25) = √1.99 ≈ 1.41   ‖B‖ = √(1+0.64+0.25+0+0.16) = √2.05 ≈ 1.43
sim = 2.01 / √(1.99 × 2.05) ≈ 0.995, very similar, as expected.

sim(A, C): dot = (1×0)+(0.7×0.6)+(0.5×0.7)+(0×1)+(0.5×0.5) = 0+0.42+0.35+0+0.25 = 1.02
‖C‖ = √(0+0.36+0.49+1+0.25) = √2.10 ≈ 1.45 → sim = 1.02 / √(1.99 × 2.10) ≈ 0.499, moderate: the tests share quantitative aptitude, reasoning and the difficulty tag, but not the exam category.
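
If you want to check the arithmetic yourself, the following sketch computes cosine similarity over sparse tag arrays. The function name `cosineSimilarity()` and the variable names are assumptions made for this example; zero-weight tags are simply left out of each array.

```php
<?php
// Cosine similarity between two sparse tag vectors ([tag => weight]).
function cosineSimilarity(array $a, array $b): float
{
    // Only tags present in both vectors contribute to the dot product.
    $dot = 0.0;
    foreach (array_intersect_key($a, $b) as $tag => $wA) {
        $dot += $wA * $b[$tag];
    }

    $normA = sqrt(array_sum(array_map(fn($w) => $w * $w, $a)));
    $normB = sqrt(array_sum(array_map(fn($w) => $w * $w, $b)));

    if ($normA == 0.0 || $normB == 0.0) {
        return 0.0;  // untagged item: treat as dissimilar to everything
    }
    return $dot / ($normA * $normB);
}

$testA = ['ssc-cgl' => 1.0, 'quantitative-aptitude' => 0.7, 'reasoning' => 0.5, 'difficulty-medium' => 0.5];
$testB = ['ssc-cgl' => 1.0, 'quantitative-aptitude' => 0.8, 'reasoning' => 0.5, 'difficulty-medium' => 0.4];
$testC = ['quantitative-aptitude' => 0.6, 'reasoning' => 0.7, 'banking' => 1.0, 'difficulty-medium' => 0.5];

printf("sim(A, B) = %.3f\n", cosineSimilarity($testA, $testB));
printf("sim(A, C) = %.3f\n", cosineSimilarity($testA, $testC));
```

Storing only the non-zero tags keeps the dot product cheap: `array_intersect_key()` skips every dimension where either item has no weight.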

Pre-Computing Item Similarities (Cron Job)

Run this nightly. The pairwise loop is O(N²), but N is small (hundreds of tests on aplly.in): 500 items means about 125,000 pairs, which plain PHP should get through in roughly a second.

PHP cron/build_item_similarity.php
<?php
/**
 * Cron: Rebuild item-item cosine similarity matrix
 * Schedule: nightly at 2 AM
 * crontab: 0 2 * * * /usr/bin/php /var/www/cron/build_item_similarity.php
 */

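// __DIR__ keeps the path valid when cron runs the script from another working directory
require_once __DIR__ . '/../config/db.php';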

// ── Step 1: Load all content tag vectors ─────────────────────────────
$rows = $db->query("
  SELECT ct.content_id, ct.tag, ct.weight
  FROM content_tags ct
  JOIN content c ON c.id = ct.content_id
  ORDER BY ct.content_id
")->fetchAll(PDO::FETCH_ASSOC);

// Build: $vectors[content_id][tag] = weight
$vectors = [];
foreach ($rows as $row) {
  $vectors[$row['content_id']][$row['tag']] = (float) $row['weight'];
}

$itemIds = array_keys($vectors);

// ── Step 2: Pre-compute L2 norms (denominator of cosine) ─────────────
$norms = [];
foreach ($vectors as $id => $vec) {
  $norms[$id] = sqrt(array_sum(array_map(fn($w) => $w * $w, $vec)));
}

// ── Step 3: Compute pairwise cosine similarity ───────────────────────
$similarities = [];

for ($i = 0; $i < count($itemIds); $i++) {
  for ($j = $i + 1; $j < count($itemIds); $j++) {
    $a = $itemIds[$i];
    $b = $itemIds[$j];

    // Dot product: only iterate tags that BOTH items share
    $shared   = array_intersect_key($vectors[$a], $vectors[$b]);
    $dotProd  = 0.0;
    foreach ($shared as $tag => $wA) {
      $dotProd += $wA * $vectors[$b][$tag];
    }

    // Avoid division by zero for untagged items
    if ($norms[$a] === 0.0 || $norms[$b] === 0.0) continue;

    $sim = $dotProd / ($norms[$a] * $norms[$b]);

    // Only store pairs with sim > 0.1 — prune irrelevant pairs
    if ($sim > 0.1) {
      $similarities[] = [$a, $b, $sim];
      $similarities[] = [$b, $a, $sim];  // symmetric
    }
  }
}

// ── Step 4: Bulk-write to item_similarity table ───────────────────────
$db->exec("TRUNCATE TABLE item_similarity");

$stmt = $db->prepare("
  INSERT INTO item_similarity (item_a, item_b, similarity)
  VALUES (?, ?, ?)
");

$db->beginTransaction();
foreach ($similarities as [$a, $b, $sim]) {
  $stmt->execute([$a, $b, $sim]);
}
$db->commit();

echo "Done. Wrote " . count($similarities) . " similarity pairs.\n";

Content-Based Recommendation Query

Once similarities are pre-computed, the query is fast and simple:

PHP recommenders/ContentBasedRecommender.php
<?php

class ContentBasedRecommender {

  public function __construct(private readonly PDO $db) {}

  /**
   * Get content-based recommendations for a user.
   *
   * Strategy:
   *   1. Find items the user has interacted with (their "profile")
   *   2. Look up their most similar neighbours in item_similarity
   *   3. Filter out items the user already knows
   *   4. Aggregate similarity scores weighted by interaction strength
   */
  public function recommend(int $userId, int $limit = 20): array {

    // Items user has interacted with and their implicit scores
    $interacted = $this->db->prepare("
      SELECT content_id, score
      FROM user_item_scores
      WHERE user_id = ?
      ORDER BY score DESC
      LIMIT 30
    ");
    $interacted->execute([$userId]);
    $profile = $interacted->fetchAll(PDO::FETCH_KEY_PAIR);
    // $profile = [content_id => score, ...]

    if (empty($profile)) return [];

    $seenIds   = array_keys($profile);
    $placeholders = implode(',', array_fill(0, count($seenIds), '?'));

    // For each item in the user's profile, find their similar neighbours
    $simStmt = $this->db->prepare("
      SELECT item_a, item_b, similarity
      FROM item_similarity
      WHERE item_a IN ($placeholders)
        AND item_b NOT IN ($placeholders)  -- exclude already-seen items
      ORDER BY similarity DESC
    ");
    $simStmt->execute([...$seenIds, ...$seenIds]);
    $simRows = $simStmt->fetchAll(PDO::FETCH_ASSOC);

    /*
     * Score aggregation:
     * For each candidate item B, accumulate:
     *   score(B) += profile_score(A) × similarity(A, B)
     *
     * This means: if the user loved a high-scoring item A,
     * and B is very similar to A, B gets a strong score.
     */
    $scores = [];
    foreach ($simRows as $row) {
      $sourceScore = $profile[$row['item_a']] ?? 0;
      $b            = $row['item_b'];
      $scores[$b]  = ($scores[$b] ?? 0) + $sourceScore * $row['similarity'];
    }

    // Normalise to 0–1 range
    if (!empty($scores)) {
      $max = max($scores);
      foreach ($scores as &$s) $s /= $max;
      unset($s);  // break the reference so later code cannot modify the last element
    }

    arsort($scores);
    return array_slice($scores, 0, $limit, true);
    // Returns: [content_id => score, ...]
  }
}
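
The aggregation step is easier to follow on a toy example. The sketch below repeats the same arithmetic on hard-coded rows instead of query results; `aggregateCandidates()`, the content ids, and the similarity values are all made up for illustration.

```php
<?php
// $profile: [content_id => implicit score], as read from user_item_scores.
// $simRows: [item_a, item_b, similarity] triples where item_a is in the
//           profile and item_b is not (what the item_similarity query returns).
function aggregateCandidates(array $profile, array $simRows): array
{
    $scores = [];
    foreach ($simRows as [$itemA, $itemB, $similarity]) {
        // score(B) += profile_score(A) × similarity(A, B)
        $scores[$itemB] = ($scores[$itemB] ?? 0.0) + ($profile[$itemA] ?? 0.0) * $similarity;
    }

    // Normalise so the best candidate scores exactly 1.0
    if ($scores !== []) {
        $max = max($scores);
        if ($max > 0) {
            foreach ($scores as $id => $s) {
                $scores[$id] = $s / $max;
            }
        }
    }

    arsort($scores);
    return $scores;
}

$profile = [12 => 0.9, 15 => 0.4];   // user did well on 12, bounced-ish on 15
$simRows = [
    [12, 20, 0.8],
    [15, 20, 0.5],
    [12, 31, 0.6],
    [15, 42, 0.9],
];

print_r(aggregateCandidates($profile, $simRows));
```

Item 20 wins because two profile items point at it (0.9 × 0.8 + 0.4 × 0.5 = 0.92 before normalisation). Item 42 is the closest neighbour of item 15, but item 15 carries little weight in the profile, so it ranks last.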
Section 05

Collaborative Filtering — Deep Dive

Collaborative filtering says: "Find users who behaved like this user, and recommend what they liked." It can discover items the content-based engine will never find — because it doesn't care about item properties, only who liked what.

There are two flavours. We implement both:

User-User CF

  • Question: Who else behaves like this user?
  • Scales: With item count (good for growing content)
  • Fails: When user base is small or sparse
  • Best for: Early stage with few items

Item-Item CF

  • Question: What items co-occur in user histories?
  • Scales: With user count (good for growing audience)
  • Fails: Needs many users first
  • Best for: Stable item catalogs, many users

The User-Item Matrix

The foundation of collaborative filtering is a matrix where rows are users and columns are content items. Each cell is the implicit interaction score (0 = no interaction):

| | SSC CGL 1 | SSC CGL 2 | IBPS PO 1 | RTO Test | Banking Mock |
| User A | 0.9 | 0.7 | 0.0 | 0.0 | 0.4 |
| User B | 1.0 | 0.9 | 0.0 | 0.5 | 0.0 |
| User C ← | 0.7 | 0.0 | 0.0 | 0.0 | 0.0 |
| User D | 0.0 | 0.0 | 1.0 | 0.0 | 0.9 |
User C only took SSC CGL 1. CF finds Users A & B as similar → recommends SSC CGL 2, RTO Test

User-User Similarity (Cosine on Interaction Vectors)

The same cosine formula applies — but now the vectors are rows of the user-item matrix instead of item tag vectors.
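
As a quick illustration, this sketch applies the formula to the four rows of the example matrix and ranks the other users by similarity to User C. `userCosine()` is a hypothetical helper name, and only non-zero cells are stored.

```php
<?php
// Sparse rows of the example matrix: only non-zero cells are stored.
$matrix = [
    'A' => ['SSC CGL 1' => 0.9, 'SSC CGL 2' => 0.7, 'Banking Mock' => 0.4],
    'B' => ['SSC CGL 1' => 1.0, 'SSC CGL 2' => 0.9, 'RTO Test' => 0.5],
    'C' => ['SSC CGL 1' => 0.7],
    'D' => ['IBPS PO 1' => 1.0, 'Banking Mock' => 0.9],
];

// Cosine similarity between two users' interaction vectors.
function userCosine(array $u, array $v): float
{
    $dot = 0.0;
    foreach (array_intersect_key($u, $v) as $item => $score) {
        $dot += $score * $v[$item];
    }
    $normU = sqrt(array_sum(array_map(fn($s) => $s ** 2, $u)));
    $normV = sqrt(array_sum(array_map(fn($s) => $s ** 2, $v)));
    $denom = $normU * $normV;

    return $denom == 0.0 ? 0.0 : $dot / $denom;
}

$simsToC = [];
foreach ($matrix as $name => $row) {
    if ($name === 'C') continue;
    $simsToC[$name] = round(userCosine($matrix['C'], $row), 3);
}
arsort($simsToC);

print_r($simsToC);  // A => 0.745, B => 0.697, D => 0
```

Note that the cron script in this section skips pairs with fewer than two shared items. With that threshold in place, User C (one item) would end up with no stored neighbours; the sketch leaves the threshold out so it matches the illustration.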

PHP cron/build_user_similarity.php
<?php
/**
 * Cron: Build user-user similarity (K-Nearest Neighbours approach)
 * We only need the TOP-K similar users per user, not all pairs.
 * K = 20 is usually enough for exam platforms.
 */

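// __DIR__ keeps the path valid when cron runs the script from another working directory
require_once __DIR__ . '/../config/db.php';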

// Load the full user-item interaction matrix
$rows = $db->query("
  SELECT user_id, content_id, score
  FROM user_item_scores
  WHERE score > 0.05  -- ignore pure page views
")->fetchAll(PDO::FETCH_ASSOC);

// Build interaction map: $matrix[user_id][content_id] = score
$matrix = [];
foreach ($rows as $row) {
  $matrix[$row['user_id']][$row['content_id']] = (float) $row['score'];
}

// Pre-compute L2 norms for all users
$norms = [];
foreach ($matrix as $uid => $vec) {
  $norms[$uid] = sqrt(array_sum(array_map(fn($s) => $s**2, $vec)));
}

$userIds = array_keys($matrix);
$K       = 20;  // keep top-20 neighbours per user

$db->exec("TRUNCATE TABLE user_similarity");
$stmt = $db->prepare("
  INSERT INTO user_similarity (user_a, user_b, similarity)
  VALUES (?, ?, ?)
");

foreach ($userIds as $uid) {
  $sims = [];

  foreach ($userIds as $other) {
    if ($other === $uid) continue;

    // Only compare over shared items — sparse shortcut
    $shared  = array_intersect_key($matrix[$uid], $matrix[$other]);
    if (count($shared) < 2) continue; // need overlap of ≥2 items

    $dot = 0.0;
    foreach ($shared as $cid => $score) {
      $dot += $score * $matrix[$other][$cid];
    }

    $denom = $norms[$uid] * $norms[$other];
    if ($denom == 0) continue;

    $sims[$other] = $dot / $denom;
  }

  // Keep only top-K similar users
  arsort($sims);
  $topK = array_slice($sims, 0, $K, true);

  $db->beginTransaction();
  foreach ($topK as $otherUid => $sim) {
    $stmt->execute([$uid, $otherUid, $sim]);
  }
  $db->commit();
}
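
The script writes to a `user_similarity` table, which the schema in Section 02 does not define. A minimal sketch of a matching table follows. The column names come from the INSERT statement above; the types, the primary key, the index, and the helper name `userSimilarityDdl()` are assumptions modelled on `item_similarity`.

```php
<?php
// Hypothetical helper: returns DDL for the table that
// build_user_similarity.php truncates and refills.
// Types and indexes are assumed by analogy with item_similarity.
function userSimilarityDdl(): string
{
    return <<<SQL
CREATE TABLE IF NOT EXISTS user_similarity (
  user_a     INT NOT NULL,
  user_b     INT NOT NULL,
  similarity FLOAT NOT NULL,
  PRIMARY KEY (user_a, user_b),
  INDEX idx_user_sim (user_a, similarity DESC)
)
SQL;
}

echo userSimilarityDdl(), "\n";
```

With the PDO handle from `config/db.php`, running `$db->exec(userSimilarityDdl());` once before the first cron run would create the table.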

Generating CF Recommendations

Given a user's K nearest neighbours, aggregate their item scores weighted by how similar each neighbour is to our user:

$$\mathrm{predicted}(u, i) = \frac{\sum_{v \in N(u)} \mathrm{sim}(u, v) \times r(v, i)}{\sum_{v \in N(u)} \lvert \mathrm{sim}(u, v) \rvert}$$

Where N(u) = neighbour set, sim(u,v) = user similarity, r(v,i) = neighbour v's rating of item i.
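
Here is the formula applied to User C from the example matrix, as a small sketch. The function name `predictScores()` is hypothetical, and the two similarity values are the cosines of User C's row against Users A and B, rounded to three decimals. As in the recommender that follows, the sums only run over neighbours that actually have a score for the item.

```php
<?php
// predicted(u, i) = Σ sim(u,v)·r(v,i) / Σ |sim(u,v)|
function predictScores(array $neighbourSims, array $neighbourRatings, array $seen): array
{
    $num = [];
    $den = [];
    foreach ($neighbourRatings as $v => $ratings) {
        $sim = $neighbourSims[$v] ?? 0.0;
        foreach ($ratings as $item => $r) {
            if (in_array($item, $seen, true)) continue;  // user already knows this item
            $num[$item] = ($num[$item] ?? 0.0) + $sim * $r;
            $den[$item] = ($den[$item] ?? 0.0) + abs($sim);
        }
    }

    $pred = [];
    foreach ($num as $item => $n) {
        if ($den[$item] > 0) {
            $pred[$item] = $n / $den[$item];
        }
    }
    arsort($pred);
    return $pred;
}

$neighbourSims = ['A' => 0.745, 'B' => 0.697];  // cosine of User C vs A and B
$neighbourRatings = [
    'A' => ['SSC CGL 1' => 0.9, 'SSC CGL 2' => 0.7, 'Banking Mock' => 0.4],
    'B' => ['SSC CGL 1' => 1.0, 'SSC CGL 2' => 0.9, 'RTO Test' => 0.5],
];
$seen = ['SSC CGL 1'];  // the only test User C has taken

foreach (predictScores($neighbourSims, $neighbourRatings, $seen) as $item => $score) {
    printf("%-13s %.3f\n", $item, $score);
}
```

SSC CGL 2 comes out on top because both neighbours rated it highly. RTO Test and Banking Mock each have a single supporter, so their prediction is just that neighbour's score.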

PHP recommenders/CollaborativeRecommender.php
<?php

class CollaborativeRecommender {

  public function __construct(private readonly PDO $db) {}

  public function recommend(int $userId, int $limit = 20): array {

    // Items this user already knows — exclude from recommendations
    $seenStmt = $this->db->prepare("
      SELECT content_id FROM user_item_scores WHERE user_id = ?
    ");
    $seenStmt->execute([$userId]);
    $seen = $seenStmt->fetchAll(PDO::FETCH_COLUMN);

    // Fetch top-K neighbours from pre-computed similarity table
    $nbStmt = $this->db->prepare("
      SELECT user_b AS neighbour_id, similarity
      FROM user_similarity
      WHERE user_a = ?
      ORDER BY similarity DESC
      LIMIT 20
    ");
    $nbStmt->execute([$userId]);
    $neighbours = $nbStmt->fetchAll(PDO::FETCH_ASSOC);

    if (empty($neighbours)) return [];

    $neighbourIds = array_column($neighbours, 'neighbour_id');
    $simMap       = array_column($neighbours, 'similarity', 'neighbour_id');

    $seenPh  = implode(',', array_fill(0, count($seen)       ?: 1, '?'));
    $nbPh    = implode(',', array_fill(0, count($neighbourIds), '?'));

    // Get all items that neighbours have interacted with (not seen by our user)
    $itemStmt = $this->db->prepare("
      SELECT user_id, content_id, score
      FROM user_item_scores
      WHERE user_id IN ($nbPh)
        AND content_id NOT IN ($seenPh)
    ");
    $params = array_merge($neighbourIds, empty($seen) ? [-1] : $seen);
    $itemStmt->execute($params);
    $items = $itemStmt->fetchAll(PDO::FETCH_ASSOC);

    /*
     * Weighted sum formula:
     *   predicted(u, i) = Σ sim(u,v) * r(v,i)  / Σ |sim(u,v)|
     *
     * numerator[i]   = running sum of sim-weighted ratings
     * denominator[i] = running sum of similarities
     */
    $numerator   = [];
    $denominator = [];

    foreach ($items as $row) {
      $vid = $row['user_id'];
      $cid = $row['content_id'];
      $sim = $simMap[$vid] ?? 0;

      $numerator[$cid]   = ($numerator[$cid]   ?? 0) + $sim * (float)$row['score'];
      $denominator[$cid] = ($denominator[$cid] ?? 0) + abs($sim);
    }

    $predictions = [];
    foreach ($numerator as $cid => $num) {
      if ($denominator[$cid] > 0) {
        $predictions[$cid] = $num / $denominator[$cid];
      }
    }

    arsort($predictions);
    return array_slice($predictions, 0, $limit, true);
  }
}
Section 06

The Hybrid Combiner

Neither content-based nor collaborative filtering is best on its own. The hybrid combiner merges their outputs with tunable weights. Both engines already emit scores on a 0–1 scale (the CB engine normalises by its maximum, and CF predictions are weighted averages of 0–1 scores), so the weighted sum compares like with like.

Hybrid Strategy
Switched hybrid: The weight given to each engine changes based on the user's interaction count. New users lean on content-based; experienced users get more collaborative signal. This is more intelligent than a fixed 50/50 split.
PHP recommenders/HybridRecommender.php
<?php

class HybridRecommender {

  public function __construct(
    private readonly PDO                    $db,
    private readonly ContentBasedRecommender $cbEngine,
    private readonly CollaborativeRecommender $cfEngine,
    private readonly PopularityRecommender   $popEngine,
  ) {}

  public function recommend(int $userId, int $limit = 10): array {

    // Determine how many interactions this user has
    $countStmt = $this->db->prepare("
      SELECT COUNT(*) FROM user_item_scores WHERE user_id = ?
    ");
    $countStmt->execute([$userId]);
    $interactionCount = (int) $countStmt->fetchColumn();

    // ── Adaptive weights (switched hybrid) ───────────────────────────
    //
    //  interactions:  0–2        → cold start (popularity only)
    //  interactions:  3–9        → mostly content-based
    //  interactions:  10–29      → balanced CB + CF
    //  interactions:  30+        → CF leads, CB supports
    //
    //  These thresholds should be tuned from your data over time.

    if ($interactionCount < 3) {
      return $this->popEngine->recommend($userId, $limit);
    }

    $wCB  = 0.7; $wCF  = 0.3;   // 3–9 interactions
    if ($interactionCount >= 10) { $wCB = 0.5; $wCF = 0.5; }
    if ($interactionCount >= 30) { $wCB = 0.3; $wCF = 0.7; }

    // Get candidate lists from each engine (more than needed, then trim)
    $cbCandidates  = $this->cbEngine->recommend($userId, 30);
    $cfCandidates  = $this->cfEngine->recommend($userId, 30);

    // Merge: aggregate weighted scores across all candidates
    $merged = [];

    foreach ($cbCandidates as $cid => $score) {
      $merged[$cid] = ($merged[$cid] ?? 0) + $wCB * $score;
    }
    foreach ($cfCandidates as $cid => $score) {
      $merged[$cid] = ($merged[$cid] ?? 0) + $wCF * $score;
    }

    arsort($merged);

    // Enrich with content metadata before returning
    return $this->enrichWithMetadata($merged, $limit);
  }

  private function enrichWithMetadata(array $scores, int $limit): array {
    if (empty($scores)) return [];

    $topN  = array_slice($scores, 0, $limit * 3, true); // get 3× for re-ranking
    $ids   = array_keys($topN);
    $ph    = implode(',', array_fill(0, count($ids), '?'));

    $stmt = $this->db->prepare("
      SELECT id, title, content_type, category, difficulty, duration_min
      FROM content WHERE id IN ($ph)
    ");
    $stmt->execute($ids);
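    // FETCH_UNIQUE keys rows by the first column (id) and drops it from each row
    $meta = $stmt->fetchAll(PDO::FETCH_UNIQUE | PDO::FETCH_ASSOC);

    $result = [];
    foreach ($ids as $id) {
      if (!isset($meta[$id])) continue;
      // Re-add 'id' so the re-ranker can read $item['id']
      $result[] = array_merge(['id' => $id], $meta[$id], ['rec_score' => $topN[$id]]);
    }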
    return $result;
  }
}
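
To see what the adaptive weights do to a ranking, the sketch below pulls the weight schedule and the merge step out into two pure functions. The names `hybridWeights()` and `mergeCandidates()` and the candidate scores are invented for this example; the thresholds and weights are the ones from `recommend()` above. It assumes the caller has already handled the cold-start case (fewer than 3 interactions).

```php
<?php
// Weight schedule from HybridRecommender::recommend(), for users
// with at least 3 interactions.
function hybridWeights(int $interactionCount): array
{
    if ($interactionCount >= 30) return ['cb' => 0.3, 'cf' => 0.7];
    if ($interactionCount >= 10) return ['cb' => 0.5, 'cf' => 0.5];
    return ['cb' => 0.7, 'cf' => 0.3];
}

// Weighted sum of two [content_id => score] candidate lists.
function mergeCandidates(array $cb, array $cf, array $w): array
{
    $merged = [];
    foreach ($cb as $id => $s) {
        $merged[$id] = ($merged[$id] ?? 0.0) + $w['cb'] * $s;
    }
    foreach ($cf as $id => $s) {
        $merged[$id] = ($merged[$id] ?? 0.0) + $w['cf'] * $s;
    }
    arsort($merged);
    return $merged;
}

$cb = [20 => 1.0, 31 => 0.6];   // content-based candidates
$cf = [31 => 0.8, 42 => 0.9];   // collaborative candidates

foreach ([5, 15, 40] as $count) {
    $ranking = array_keys(mergeCandidates($cb, $cf, hybridWeights($count)));
    printf("%2d interactions: %s\n", $count, implode(', ', $ranking));
}
//  5 interactions: 20, 31, 42
// 15 interactions: 31, 20, 42
// 40 interactions: 31, 42, 20
```

The same two candidate lists produce three different rankings. Item 20, which only the content-based engine knows about, drops from first to last as the user's history grows and the collaborative signal gains weight.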
Section 07

Re-Ranking & Domain-Specific Rules

The hybrid combiner gives you a list ranked purely by predicted relevance. But a pure relevance ranking ignores things that matter on an exam prep platform: the user should face a difficulty progression, they shouldn't see the same exam category five times in a row, and items they struggled with should resurface (spaced repetition).

Re-ranking applies these rules to the combiner's output. It runs after scoring, not during — this keeps your algorithms clean and your business logic separate.

PHP recommenders/ReRanker.php
<?php

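class ReRanker {

  // The rules below query user_events, so the re-ranker needs a DB handle
  public function __construct(private readonly PDO $db) {}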

  /**
   * Apply all business rules to re-rank a candidate list.
   *
   * @param array $candidates  [['id'=>, 'rec_score'=>, 'difficulty'=>, 'category'=>, ...], ...]
   * @param int   $userId
   * @param int   $limit
   * @return array  Top $limit items after re-ranking
   */
  public function rerank(
    array $candidates,
    int   $userId,
    int   $limit = 10
  ): array {

    $candidates = $this->applySpacedRepetitionBoost($candidates, $userId);
    $candidates = $this->applyDifficultyProgression($candidates, $userId);
    $candidates = $this->applyRecencyDecay($candidates, $userId);

    // Sort by final adjusted score
    usort($candidates, fn($a, $b) => $b['rec_score'] <=> $a['rec_score']);

    // Apply diversity: cap each category at 2 items in final list
    return $this->applyDiversityFilter($candidates, $limit);
  }

  // ─────────────────────────────────────────────────────────────────────
  // RULE 1: Spaced Repetition Boost
  //
  // Ebbinghaus forgetting curve: material that was hard for the user
  // and hasn't been revisited should resurface after 1–7 days.
  //
  // Boost = 0.3 × (1 - last_score) × time_factor
  //   - (1 - last_score): the worse they did, the bigger the boost
  //   - time_factor: 0 at day 0, peaks at 1.0 on day 3, fades to about 0.12 by day 14
  // ─────────────────────────────────────────────────────────────────────
  private function applySpacedRepetitionBoost(array $items, int $userId): array {
    if (empty($items)) return $items;  // avoid building an empty IN () clause
    $ids = array_column($items, 'id');
    $ph  = implode(',', array_fill(0, count($ids), '?'));

    // Get most recent score for this user on each candidate item
    $stmt = $this->db->prepare("
      SELECT content_id, score_pct, created_at
      FROM user_events
      WHERE user_id = ? AND event_type = 'score'
        AND content_id IN ($ph)
      ORDER BY created_at DESC
    ");
    $stmt->execute([$userId, ...$ids]);
    $scores = [];
    foreach ($stmt->fetchAll(PDO::FETCH_ASSOC) as $row) {
      if (!isset($scores[$row['content_id']])) {
        $scores[$row['content_id']] = $row;
      }
    }

    foreach ($items as &$item) {
      if (!isset($scores[$item['id']])) continue;
      $data     = $scores[$item['id']];
      $daysSince = (time() - strtotime($data['created_at'])) / 86400;
      $lastScore = (float) $data['score_pct'] / 100;

      // Time factor (d/3)·e^(1 − d/3): 0 at day 0, 1.0 at day 3, about 0.12 at day 14
      $timeFactor = max(0, ($daysSince / 3) * exp(1 - $daysSince / 3));
      $boost      = 0.3 * (1 - $lastScore) * min(1, $timeFactor);

      $item['rec_score'] += $boost;
    }
    return $items;
  }

  // ─────────────────────────────────────────────────────────────────────
  // RULE 2: Difficulty Progression
  //
  // If a user averages ≥ 70% on difficulty-2 items, the target becomes
  // difficulty 3: items at the target keep their score, others are penalised.
  // ─────────────────────────────────────────────────────────────────────
  private function applyDifficultyProgression(array $items, int $userId): array {
    // Find user's average score per difficulty level
    $stmt = $this->db->prepare("
      SELECT c.difficulty, AVG(e.score_pct) AS avg_score
      FROM user_events e
      JOIN content c ON c.id = e.content_id
      WHERE e.user_id = ? AND e.event_type = 'score'
      GROUP BY c.difficulty
      HAVING COUNT(*) >= 2
    ");
    $stmt->execute([$userId]);
    // FETCH_KEY_PAIR needs exactly two columns: [difficulty => avg_score]
    $avgByDiff = $stmt->fetchAll(PDO::FETCH_KEY_PAIR);

    // Find the user's "current" difficulty: highest diff where avg >= 70%
    $currentDiff = 1;
    foreach ($avgByDiff as $diff => $avg) {
      if ($avg >= 70 && $diff > $currentDiff) $currentDiff = $diff;
    }

    $targetDiff = min(5, $currentDiff + 1);  // challenge the user slightly

    foreach ($items as &$item) {
      $diff = (int) ($item['difficulty'] ?? 2);
      $dist = abs($diff - $targetDiff);

      // Gaussian penalty: items far from target difficulty are penalised
      //   e^(-dist²/2) = 1.0 at dist=0, 0.61 at dist=1, 0.14 at dist=2
      $diffFactor = exp(-($dist ** 2) / 2);
      $item['rec_score'] *= (0.6 + 0.4 * $diffFactor);
    }
    return $items;
  }

  // ─────────────────────────────────────────────────────────────────────
  // RULE 3: Recency Decay
  //
  // Penalise items that were shown to this user recently but not engaged.
  // Prevents the same items recycling endlessly in recommendations.
  // ─────────────────────────────────────────────────────────────────────
  private function applyRecencyDecay(array $items, int $userId): array {
    // (In production you'd log recommendation impressions too)
    // Here we use view events without follow-up action as a proxy
    $stmt = $this->db->prepare("
      SELECT content_id, MAX(created_at) as last_seen
      FROM user_events
      WHERE user_id = ? AND event_type = 'view'
      GROUP BY content_id
    ");
    $stmt->execute([$userId]);
    $viewLog = $stmt->fetchAll(PDO::FETCH_KEY_PAIR);

    foreach ($items as &$item) {
      if (!isset($viewLog[$item['id']])) continue;
      $daysSince = (time() - strtotime($viewLog[$item['id']])) / 86400;
      // Decay: full penalty if seen today, no penalty if seen 7+ days ago
      $decay = min(1, $daysSince / 7);
      $item['rec_score'] *= (0.5 + 0.5 * $decay);
    }
    return $items;
  }

  // ─────────────────────────────────────────────────────────────────────
  // RULE 4: Diversity Filter
  // Maximum 2 items from same category in top-N results.
  // Prevents "all SSC CGL, all the time" problem.
  // ─────────────────────────────────────────────────────────────────────
  private function applyDiversityFilter(array $items, int $limit): array {
    $result         = [];
    $catCounts      = [];
    $maxPerCategory = 2;  // a class constant cannot be declared inside a method

    foreach ($items as $item) {
      if (count($result) >= $limit) break;

      $cat = $item['category'] ?? 'other';
      if (($catCounts[$cat] ?? 0) >= $maxPerCategory) continue;

      $result[]         = $item;
      $catCounts[$cat]  = ($catCounts[$cat] ?? 0) + 1;
    }

    // If diversity filter left us short of $limit, fill from the leftovers
    if (count($result) < $limit) {
      foreach ($items as $item) {
        if (in_array($item, $result)) continue;
        $result[] = $item;
        if (count($result) >= $limit) break;
      }
    }

    return $result;
  }
}
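
The two numeric rules are easier to tune once you can see the values they produce. The sketch below lifts the formulas out of Rule 1 and Rule 2 into pure functions; the function names are hypothetical, and the formulas are copied from the class above.

```php
<?php
// Rule 1 time factor: (d/3)·e^(1 − d/3). Zero at day 0, 1.0 at day 3, then decays.
function timeFactor(float $daysSince): float
{
    return max(0.0, ($daysSince / 3) * exp(1 - $daysSince / 3));
}

// Rule 1 boost: larger for low scores, scaled by the time factor.
function spacedRepetitionBoost(float $lastScorePct, float $daysSince): float
{
    return 0.3 * (1 - $lastScorePct / 100) * min(1.0, timeFactor($daysSince));
}

// Rule 2 multiplier: 1.0 at the target difficulty, approaching 0.6 far from it.
function difficultyMultiplier(int $difficulty, int $targetDifficulty): float
{
    $dist = abs($difficulty - $targetDifficulty);
    return 0.6 + 0.4 * exp(-($dist ** 2) / 2);
}

echo "days  time_factor  boost (last score 40%)\n";
foreach ([0, 1, 3, 7, 14] as $d) {
    printf("%4d  %11.3f  %.3f\n", $d, timeFactor($d), spacedRepetitionBoost(40, $d));
}

echo "\ndifficulty  multiplier (target = 3)\n";
foreach ([1, 2, 3, 4, 5] as $diff) {
    printf("%10d  %.3f\n", $diff, difficultyMultiplier($diff, 3));
}
```

A test scored at 40% three days ago receives the largest boost, 0.3 × 0.6 × 1.0 = 0.18. Because the difficulty multiplier never falls below 0.6, an off-target item is demoted rather than removed.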
Section 08

Cold Start Problem

The cold start problem occurs in two forms: new user cold start (no interaction history) and new item cold start (no one has interacted with a new test yet). Each needs a different solution.

New User Cold Start: Onboarding Quiz

The fastest way to solve cold start is to ask the user directly during registration. Show a 3-question onboarding quiz and instantly populate their interaction profile:

PHP onboarding/OnboardingSeeder.php
<?php

class OnboardingSeeder {

  public function __construct(private readonly PDO $db) {}

  /**
   * After onboarding quiz: seed the user's implicit profile
   * from their declared preferences so day-1 recs are meaningful.
   *
   * Quiz question 1: "Which exam are you preparing for?"
   *   → seeds category preference at weight 1.0
   *
   * Quiz question 2: "What is your current level?"
   *   → seeds difficulty preference
   *
   * Quiz question 3: "How much time per day can you practice?"
   *   → seeds duration preference (affects which tests are suitable)
   */
  public function seed(
    int    $userId,
    string $targetExam,   // e.g. 'ssc', 'banking', 'railways'
    int    $selfLevel,     // 1 = beginner, 3 = intermediate, 5 = advanced
    int    $minutesPerDay  // 15, 30, 60
  ): void {

    // Find 2-3 starter items matching their declared exam + difficulty
    $stmt = $this->db->prepare("
      SELECT id FROM content
      WHERE category = ?
        AND difficulty BETWEEN ? AND ?
        AND duration_min <= ?
      ORDER BY popularity DESC
      LIMIT 3
    ");
    $stmt->execute([
      $targetExam,
      max(1, $selfLevel - 1),   // ± 1 difficulty band
      min(5, $selfLevel + 1),
      $minutesPerDay
    ]);
    $starterItems = $stmt->fetchAll(PDO::FETCH_COLUMN);

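    // Inject synthetic scores at medium weight (0.3): enough to bootstrap
    // the CB engine without outweighing real interactions. Note that this
    // INSERT does not label the rows as onboarding data; to exclude them
    // from later score analysis, record their origin in a separate column or table.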
    $insert = $this->db->prepare("
      INSERT INTO user_item_scores (user_id, content_id, score)
      VALUES (?, ?, 0.3)
      ON DUPLICATE KEY UPDATE score = GREATEST(score, 0.3)
    ");
    foreach ($starterItems as $itemId) {
      $insert->execute([$userId, $itemId]);
    }
  }
}
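
The ±1 difficulty band is easy to get wrong at the edges, so it can help to isolate the parameter building and test it on its own. The sketch below is an assumption-laden illustration: starterQueryParams is a hypothetical helper, and it additionally clamps the declared level into 1..5, which seed() above does not do (an out-of-range level such as 9 would otherwise produce an empty BETWEEN range).

```php
<?php
// Hypothetical helper: builds the four bound parameters for the starter-item
// query in the same order as the placeholders (category, min difficulty,
// max difficulty, max duration).
function starterQueryParams(string $targetExam, int $selfLevel, int $minutesPerDay): array {
  // Assumption: difficulty is on a 1..5 scale, as in the seed() signature
  $level = max(1, min(5, $selfLevel));

  return [
    $targetExam,
    max(1, $level - 1),   // lower edge of the ±1 band, never below 1
    min(5, $level + 1),   // upper edge of the ±1 band, never above 5
    $minutesPerDay,
  ];
}

// A beginner preparing for SSC with 30 minutes a day
$params = starterQueryParams('ssc', 1, 30); // ['ssc', 1, 2, 30]
```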

New Item Cold Start: Warm-Up Strategy

Strategy for new content
When a new mock test is added to aplly.in, it has zero interactions. Inject it into the popularity pool for the first 7 days regardless of its actual views — this forces some users to encounter it. Once it accumulates ~10 interactions, the regular CB and CF engines take over naturally.
SQL Popularity query with new-item injection
-- Popularity recommender that promotes new items
SELECT
  id,
  title,
  category,
  difficulty,
  -- real popularity score + artificial boost for items < 7 days old
  popularity + CASE
    WHEN DATEDIFF(NOW(), created_at) < 7
    THEN 50   -- inject as if it had 50 extra views
    ELSE 0
  END AS effective_popularity
FROM content
WHERE category = :category
ORDER BY effective_popularity DESC
LIMIT 10;
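
If the boost ever needs to be applied or unit-tested in PHP rather than in SQL, the same rule can be mirrored as a small function. This is a sketch under assumptions: effectivePopularity is a hypothetical name, both timestamps are expected in the same timezone, and the time of day is stripped because MySQL's DATEDIFF compares calendar dates only.

```php
<?php
// Hypothetical PHP mirror of the CASE expression in the query above.
function effectivePopularity(
  int $popularity,
  DateTimeImmutable $createdAt,
  DateTimeImmutable $now,
  int $boostDays = 7,
  int $boost = 50
): int {
  // DATEDIFF ignores the time part, so compare midnight to midnight
  $interval = $createdAt->setTime(0, 0)->diff($now->setTime(0, 0));

  // ->days is always non-negative; ->invert tells us if created_at is in the future
  $ageDays = $interval->invert ? -$interval->days : $interval->days;

  return $popularity + ($ageDays < $boostDays ? $boost : 0);
}

$utc   = new DateTimeZone('UTC');
$today = new DateTimeImmutable('2026-03-22 01:00:00', $utc);

// Created two calendar days ago: still inside the 7-day window
$fresh = effectivePopularity(10, new DateTimeImmutable('2026-03-20 23:00:00', $utc), $today); // 60

// Created exactly seven calendar days ago: the boost has expired
$stale = effectivePopularity(10, new DateTimeImmutable('2026-03-15 12:00:00', $utc), $today); // 10
```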
Section 09

The PHP API Endpoint

The API endpoint wires all the pieces together: it receives a request, calls the hybrid recommender + re-ranker, and returns a clean JSON response.

PHP api/recommend.php
<?php
/**
 * GET /api/recommend.php?limit=10
 *
 * Requires: Authorization: Bearer {token}  header
 * Returns: JSON recommendation payload
 *
 * Response format:
 * {
 *   "status": "ok",
 *   "user_id": 42,
 *   "engine": "hybrid",
 *   "from_cache": false,
 *   "generated_at": "2026-03-22T10:00:00Z",
 *   "recommendations": [
 *     {
 *       "id": 7,
 *       "title": "SSC CGL Mock Test 3",
 *       "content_type": "mock_test",
 *       "category": "ssc",
 *       "difficulty": 2,
 *       "duration_min": 30,
 *       "rec_score": 0.847,
 *       "reason": "Similar to tests you scored well on"
 *     }, ...
 *   ]
 * }
 */

require_once '../config/db.php';
require_once '../auth/Auth.php';
require_once '../recommenders/ContentBasedRecommender.php';
require_once '../recommenders/CollaborativeRecommender.php';
require_once '../recommenders/PopularityRecommender.php';
require_once '../recommenders/HybridRecommender.php';
require_once '../recommenders/ReRanker.php';
require_once '../cache/FileCache.php';

header('Content-Type: application/json');
header('Access-Control-Allow-Origin: *');

// ── Auth: validate Bearer token ───────────────────────────────────────
$auth   = new Auth($db);
$userId = $auth->getUserFromToken($_SERVER['HTTP_AUTHORIZATION'] ?? '');

if (!$userId) {
  http_response_code(401);
  echo json_encode(['error' => 'Unauthorized']);
  exit;
}

$limit = max(1, min(20, (int) ($_GET['limit'] ?? 10)));

// ── Cache check: serve from 30-min cache if available ─────────────────
$cache    = new FileCache('/tmp/rec_cache', 1800); // 30 min TTL
$cacheKey = "rec_{$userId}_{$limit}";

if ($cached = $cache->get($cacheKey)) {
  $cached['from_cache'] = true;
  echo json_encode($cached);
  exit;
}

// ── Run the recommendation pipeline ───────────────────────────────────
$cb      = new ContentBasedRecommender($db);
$cf      = new CollaborativeRecommender($db);
$pop     = new PopularityRecommender($db);
$hybrid  = new HybridRecommender($db, $cb, $cf, $pop);
$reranker= new ReRanker($db);

$candidates = $hybrid->recommend($userId, $limit);
$final      = $reranker->rerank($candidates, $userId, $limit);

// Attach human-readable reason strings (transparency layer)
foreach ($final as &$item) {
  $item['reason'] = $item['rec_score'] > 0.75
    ? 'Highly matched to your exam goal'
    : 'Based on users with similar practice patterns';
}
unset($item); // break the reference so a later use of $item cannot overwrite the last element

$response = [
  'status'          => 'ok',
  'user_id'         => $userId,
  'engine'          => 'hybrid',
  'from_cache'      => false,
  'generated_at'    => gmdate('Y-m-d\TH:i:s\Z'),
  'recommendations' => $final,
];

$cache->set($cacheKey, $response);
echo json_encode($response, JSON_PRETTY_PRINT);
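
The endpoint passes the raw Authorization header to Auth::getUserFromToken(), whose internals are not shown in this section. As an illustration of what that parsing step might look like, here is a minimal sketch; extractBearerToken is a hypothetical helper, not a method of the tutorial's Auth class.

```php
<?php
// Hypothetical helper: pulls the token out of an "Authorization: Bearer <token>"
// header value. Returns null for a missing, empty, or non-Bearer header.
function extractBearerToken(string $header): ?string {
  // Scheme is matched case-insensitively; the token must be one run of non-space characters
  if (preg_match('/^Bearer\s+(\S+)$/i', trim($header), $m) === 1) {
    return $m[1];
  }
  return null;
}

$token   = extractBearerToken('Bearer abc.def.ghi'); // 'abc.def.ghi'
$missing = extractBearerToken('');                   // null
```

Returning null for anything malformed lets the caller fall straight through to the existing 401 branch.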
Section 10

Caching & Performance

Every recommendation query touches multiple tables. Without caching, a busy exam platform would hammer the DB on every page load. Use a two-layer caching strategy:

Layer | Mechanism | TTL | What it caches
Layer 1 | PHP file cache (/tmp) | 30 min | Final recommendation JSON per user
Layer 2 | MySQL pre-computed tables | Daily cron | item_similarity, user_similarity, user_item_scores
Busted by | New user event logged | Immediate | Invalidate user's cache key on any new interaction
PHP cache/FileCache.php — simple zero-dependency file cache
<?php

class FileCache {

  public function __construct(
    private readonly string $dir,
    private readonly int    $ttl = 1800
  ) {
    if (!is_dir($this->dir)) mkdir($this->dir, 0755, true);
  }

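  public function get(string $key): ?array {
    $path = $this->path($key);
    if (!file_exists($path)) return null;
    if (filemtime($path) < time() - $this->ttl) {
      unlink($path);
      return null;
    }
    // A corrupt or partially written file does not decode to an array: treat it as a miss
    $data = json_decode(file_get_contents($path), true);
    return is_array($data) ? $data : null;
  }

  public function set(string $key, array $data): void {
    // LOCK_EX stops concurrent requests from interleaving their writes
    file_put_contents($this->path($key), json_encode($data), LOCK_EX);
  }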

  public function delete(string $key): void {
    $path = $this->path($key);
    if (file_exists($path)) unlink($path);
  }

  private function path(string $key): string {
    return $this->dir . '/' . md5($key) . '.json';
  }
}
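
Because FileCache::path() hashes each key with md5, an entry can only be deleted by its exact key; a pattern such as "rec_42_*" hashes to a different filename and matches nothing. One way to clear everything cached for a user is to enumerate the keys the endpoint can produce. The sketch below assumes the "rec_{userId}_{limit}" key format and the 1 to 20 limit range from api/recommend.php; recCacheKeys is a hypothetical helper.

```php
<?php
// Hypothetical helper: lists every cache key the endpoint can create for a user,
// one per accepted ?limit= value.
function recCacheKeys(int $userId, int $minLimit = 1, int $maxLimit = 20): array {
  $keys = [];
  for ($limit = $minLimit; $limit <= $maxLimit; $limit++) {
    $keys[] = "rec_{$userId}_{$limit}";
  }
  return $keys;
}

$keys = recCacheKeys(42); // 'rec_42_1' ... 'rec_42_20'

// Usage inside the event logger (assuming a FileCache instance in $cache):
//   foreach (recCacheKeys($userId) as $key) {
//     $cache->delete($key);
//   }
```

A simpler alternative is to cache one list per user at the maximum limit and slice it on read, which leaves a single key to delete.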
Section 11

Evaluation Metrics — Is Your Engine Actually Good?

You cannot improve what you don't measure. These three metrics are the standard toolkit for evaluating a recommendation system offline (no live A/B test needed to start).

Precision@K

Of the K items you recommended, what fraction did the user actually engage with?

$$\text{Precision@}K = \frac{\lvert \text{Recommended}_K \cap \text{Relevant} \rvert}{K}$$

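A Precision@10 of 0.4 means 4 out of every 10 recommended items led to engagement. For an exam-prep platform, aim for P@10 > 0.30 as a starting benchmark. Keep in mind that precision is capped by the number of relevant items: with 5 held-out interactions per user, as in the offline evaluation later in this section, P@10 cannot exceed 0.5.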
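
As a small worked example with made-up item IDs (K = 5, three relevant items in total):

```latex
\begin{align*}
\text{Recommended}_5 &= \{7, 3, 9, 12, 5\}, \qquad \text{Relevant} = \{3, 5, 20\} \\
\text{Precision@}5 &= \frac{\lvert \{3, 5\} \rvert}{5} = \frac{2}{5} = 0.4
\end{align*}
```

Item 20 is relevant but was never recommended, so it does not count as a hit; it only limits how high the score could have gone.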

NDCG@K (Normalised Discounted Cumulative Gain)

Unlike Precision, NDCG cares about where in the list the good items appear. A relevant item at position 1 is worth more than one at position 10.

$$\text{DCG@}K = \sum_{i=1}^{K} \frac{\mathrm{rel}_i}{\log_2(i + 1)} \qquad \text{NDCG@}K = \frac{\text{DCG@}K}{\text{IDCG@}K}$$

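IDCG is the DCG of a perfect ranking, in which every relevant item is placed ahead of every irrelevant one. NDCG = 1.0 means your ranking is perfect; 0.0 means no relevant item appears in the top K. Target NDCG@10 > 0.50 as a starting goal.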
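
A worked example with binary relevance and made-up data: suppose K = 5, the user has three relevant items, and two of them appear in the list at positions 2 and 5.

```latex
\begin{align*}
\text{DCG@}5  &= \frac{1}{\log_2 3} + \frac{1}{\log_2 6} \approx 0.631 + 0.387 = 1.018 \\
\text{IDCG@}5 &= \frac{1}{\log_2 2} + \frac{1}{\log_2 3} + \frac{1}{\log_2 4} \approx 1 + 0.631 + 0.5 = 2.131 \\
\text{NDCG@}5 &= \frac{1.018}{2.131} \approx 0.478
\end{align*}
```

Moving the hit at position 5 up to position 1 would raise DCG@5 to about 1.631 and NDCG@5 to about 0.765, even though Precision@5 stays at 0.4.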

Coverage

What percentage of the content catalog is ever recommended to at least one user? Low coverage (<30%) means you're effectively ignoring most of your content.
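
Coverage is straightforward to compute once you have the recommendation lists for a sample of users. The function below is a minimal sketch; catalogCoverage and its input shapes (a map of user ID to recommended content IDs, plus a flat list of catalog IDs) are assumptions made for illustration.

```php
<?php
// Hypothetical helper: fraction of catalog items that appear in at least one
// user's recommendation list. Recommended IDs outside the catalog are ignored.
function catalogCoverage(array $recListsByUser, array $catalogIds): float {
  $catalog = array_unique($catalogIds);
  if (count($catalog) === 0) {
    return 0.0; // empty catalog: nothing to cover
  }

  // Collect every distinct item that was recommended to anyone
  $recommended = [];
  foreach ($recListsByUser as $recIds) {
    foreach ($recIds as $id) {
      $recommended[$id] = true;
    }
  }

  $covered = count(array_intersect($catalog, array_keys($recommended)));
  return $covered / count($catalog);
}

// Two users, ten catalog items, four distinct items ever recommended
$coverage = catalogCoverage(
  [101 => [1, 2, 3], 102 => [2, 3, 4]],
  range(1, 10)
); // 0.4
```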

PHP eval/MetricsCalculator.php — offline evaluation using holdout set
<?php
/**
 * Offline evaluation:
 * For each user, hold out their last 5 interactions as ground truth.
 * Pretend they don't exist, run the recommender, measure precision.
 */

class MetricsCalculator {

  public function precisionAtK(array $recommended, array $relevant, int $k): float {
    $topK   = array_slice($recommended, 0, $k);
    $hits   = count(array_intersect($topK, $relevant));
    return $k > 0 ? $hits / $k : 0.0;
  }

  public function ndcgAtK(array $recommended, array $relevant, int $k): float {
    $dcg  = 0.0;
    $idcg = 0.0;

    for ($i = 0; $i < min($k, count($recommended)); $i++) {
      $rel  = in_array($recommended[$i], $relevant) ? 1 : 0;
      $dcg += $rel / log($i + 2, 2);
    }

    // IDCG: perfect ordering (all relevant items first)
    for ($i = 0; $i < min($k, count($relevant)); $i++) {
      $idcg += 1.0 / log($i + 2, 2);
    }

    return $idcg > 0 ? $dcg / $idcg : 0.0;
  }

  public function evaluateEngine(HybridRecommender $engine, array $testUsers): array {
    $p10s   = [];
    $ndcg10s= [];

    foreach ($testUsers as ['user_id' => $uid, 'held_out' => $heldOut]) {
      $recs     = $engine->recommend($uid, 10);
      $recIds   = array_column($recs, 'id');
      $p10s[]   = $this->precisionAtK($recIds, $heldOut, 10);
      $ndcg10s[]= $this->ndcgAtK($recIds, $heldOut, 10);
    }

    // Guard against an empty test set to avoid division by zero
    $n = count($testUsers);
    return [
      'precision@10' => $n > 0 ? round(array_sum($p10s) / $n, 4) : 0.0,
      'ndcg@10'      => $n > 0 ? round(array_sum($ndcg10s) / $n, 4) : 0.0,
    ];
  }
}
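
evaluateEngine() expects each test user to arrive with a 'held_out' list, but building that list is left to the reader. A minimal sketch of the split follows. It assumes one row per (user, item) with a Unix timestamp in 'ts'; splitHoldout and the row shape are hypothetical. The 'train' part is what the recommender should be allowed to see when it generates the lists being scored.

```php
<?php
// Hypothetical helper: splits one user's interactions into a training part and
// the last $holdOut items, ordered by time. Returns null if the user has too
// little history to evaluate.
function splitHoldout(array $interactions, int $holdOut = 5, int $minTrain = 1): ?array {
  if ($holdOut < 1) {
    throw new InvalidArgumentException('holdOut must be at least 1');
  }
  if (count($interactions) < $holdOut + $minTrain) {
    return null;
  }

  // Oldest first, so the most recent interactions end up at the tail
  usort($interactions, fn(array $a, array $b) => $a['ts'] <=> $b['ts']);
  $ids = array_column($interactions, 'content_id');

  return [
    'train'    => array_slice($ids, 0, -$holdOut),
    'held_out' => array_slice($ids, -$holdOut),
  ];
}

// Seven interactions, deliberately out of order
$rows = [
  ['content_id' => 14, 'ts' => 400], ['content_id' => 11, 'ts' => 100],
  ['content_id' => 17, 'ts' => 700], ['content_id' => 12, 'ts' => 200],
  ['content_id' => 15, 'ts' => 500], ['content_id' => 13, 'ts' => 300],
  ['content_id' => 16, 'ts' => 600],
];
$split = splitHoldout($rows); // train: 11, 12; held_out: 13 to 17
```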
Section 12

Putting It All Together — Deployment Checklist

  1. Deploy the schema — Run schema.sql. Tag all existing content items in content_tags. Add difficulty and category metadata to all mock tests.
  2. Instrument event logging — Call EventLogger::log() from every test start, test completion, score submission, and bookmark action.
  3. Run the onboarding seeder — Show new registrants a 3-question quiz. Call OnboardingSeeder::seed() immediately after registration.
  4. Schedule cron jobs — Add to crontab: item similarity (nightly 2 AM), user similarity (nightly 3 AM), popularity counts (every hour). These three jobs are the offline backbone of the engine.
  5. Deploy the API endpoint: put api/recommend.php behind your auth middleware. Test with curl -H "Authorization: Bearer TOKEN" "https://<your-host>/api/recommend.php?limit=10".
  6. Build the frontend widget — A "Recommended for you" section on the homepage and category pages. Hit the API, render cards. Lazy-load it so it doesn't block page render.
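  7. Instrument the cache invalidation: in EventLogger::log(), after writing the event, delete the user's cached recommendations. FileCache::delete() only removes an exact key (the key is md5-hashed into the filename), so a wildcard such as "rec_{$userId}_*" matches nothing. Either call delete() once for each limit the endpoint accepts (1 to 20), or drop $limit from the cache key and slice the cached list on read.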
  8. Run offline evaluation after 2 weeks — Use MetricsCalculator::evaluateEngine() with a holdout set. Record your Precision@10 and NDCG@10. This is your baseline.
  9. Tune the weights — Adjust $wCB/$wCF ratios, spaced repetition boost (0.3), and diversity cap (2 per category) based on real metrics. Increment slowly — measure every change.
  10. Upgrade to Redis when ready: FileCache works fine up to ~10,000 users. Beyond that, swap it for a RedisCache implementation behind the same interface.
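
For step 4, the schedule could be expressed as the crontab fragment below. The script paths are placeholders, since the tutorial does not name the cron scripts; only the timings come from the checklist.

```crontab
# m  h  dom mon dow  command   (script paths are hypothetical placeholders)
0    2  *   *   *    php /path/to/cron/item_similarity.php     # nightly, 2 AM
0    3  *   *   *    php /path/to/cron/user_similarity.php     # nightly, 3 AM
0    *  *   *   *    php /path/to/cron/popularity_counts.php   # every hour, on the hour
```

Running user similarity an hour after item similarity keeps the two heaviest jobs from competing for the database at the same time.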
Tuning cadence
Do not tune algorithm weights more than once per week. You need at least 3–5 days of interaction data for any weight change to show statistical signal. Tune too fast and you'll be chasing noise, not signal.

What to Build Next

Feature | Value | When to build
A/B testing framework | Measure impact of algorithm changes on real CTR | Month 2
Real-time score updates | Update user_item_scores instantly on test submission (vs cron lag) | Month 1
Sequence-aware recs | "You just finished SSC CGL Mock 1, try Mock 2 next" (session context) | Month 3
Matrix factorisation (SVD) | Better CF at scale: replace user-user with latent factor model | Month 4+
Redis cache | Sub-millisecond cache for high-traffic periods (exam seasons) | Month 3+
Notification trigger | "You haven't practiced in 3 days, here's what to review" | Month 5+
End of tutorial