A recommendation engine is a system that predicts which items a user is most likely to find useful, given everything you know about the user, the items, and how similar users have behaved. On an exam-prep platform like aplly.in, this means surfacing the right mock test, case study, or topic at the right time — not just the most popular one.
There are three core paradigms, and most production systems blend them:
| Paradigm | Core question | Data needed | Works when | Fails when |
|---|---|---|---|---|
| Content-Based | "What other items are similar to things this user liked?" | Item metadata (tags, category, difficulty) | User has some history | Items are under-tagged or all similar |
| Collaborative Filtering | "What did similar users like?" | User-item interaction matrix | Many users, dense interactions | New users/items (cold start) |
| Popularity-Based | "What do most people find useful right now?" | Aggregate interaction counts | No user history at all | User has specific niche needs |
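Before writing a single line of code, understand the data flow: user interactions are logged as events, cron jobs turn those events and the content tags into pre-computed score and similarity tables, the recommenders read only from those tables at request time, and the hybrid combiner, re-ranker, and cache sit between the recommenders and the API response.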
Everything starts with the schema. The recommendation engine is only as good as the signals it can query. Design these tables from day one — retrofitting is painful.
-- ─────────────────────────────────────────────
-- CONTENT: every item that can be recommended
-- ─────────────────────────────────────────────
CREATE TABLE content (
id INT AUTO_INCREMENT PRIMARY KEY,
title VARCHAR(255) NOT NULL,
content_type ENUM('mock_test', 'case_study') NOT NULL,
category VARCHAR(80), -- 'ssc', 'banking', 'rto' …
difficulty TINYINT, -- 1 = easy, 5 = very hard
question_count SMALLINT,
duration_min SMALLINT,
popularity INT DEFAULT 0, -- denormalised, updated by cron
created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
INDEX idx_category (category),
INDEX idx_difficulty (difficulty)
);
-- ─────────────────────────────────────────────
-- CONTENT TAGS: many-to-many tag system
-- Each content item carries multiple semantic tags
-- ─────────────────────────────────────────────
CREATE TABLE content_tags (
content_id INT NOT NULL,
tag VARCHAR(80) NOT NULL,
weight FLOAT DEFAULT 1.0, -- primary topic = 1.0, secondary = 0.5
PRIMARY KEY (content_id, tag),
INDEX idx_tag (tag)
);
-- Example rows for aplly.in:
-- (12, 'quantitative-aptitude', 1.0)
-- (12, 'ssc-cgl', 1.0)
-- (12, 'number-system', 0.7)
-- (12, 'difficulty-medium', 0.5)
-- ─────────────────────────────────────────────
-- USER EVENTS: the raw interaction log
-- This is the single most important table
-- ─────────────────────────────────────────────
CREATE TABLE user_events (
id BIGINT AUTO_INCREMENT PRIMARY KEY,
user_id INT NOT NULL,
content_id INT NOT NULL,
event_type ENUM('view','start','complete','bookmark','score') NOT NULL,
score_pct FLOAT NULL, -- 0–100, set on 'score' and 'complete' events
duration_sec INT NULL, -- seconds spent
session_id VARCHAR(64), -- group events in same session
created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
INDEX idx_user (user_id),
INDEX idx_content (content_id),
INDEX idx_user_con (user_id, content_id),
INDEX idx_created (created_at)
);
-- ─────────────────────────────────────────────
-- USER_ITEM_SCORES: pre-computed interaction scores (the user-item matrix)
-- Upserted by EventLogger on every event; can also be rebuilt nightly by cron
-- ─────────────────────────────────────────────
CREATE TABLE user_item_scores (
user_id INT NOT NULL,
content_id INT NOT NULL,
score FLOAT NOT NULL, -- 0.0–1.0 normalised
updated_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
PRIMARY KEY (user_id, content_id),
INDEX idx_score (user_id, score DESC)
);
-- ─────────────────────────────────────────────
-- ITEM_SIMILARITY: pre-computed CB similarity pairs
-- Also rebuilt nightly — avoids real-time cosine math
-- ─────────────────────────────────────────────
CREATE TABLE item_similarity (
item_a INT NOT NULL,
item_b INT NOT NULL,
similarity FLOAT NOT NULL,
PRIMARY KEY (item_a, item_b),
INDEX idx_sim (item_a, similarity DESC)
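);
-- ─────────────────────────────────────────────
-- USER_SIMILARITY: pre-computed top-K neighbours per user
-- Written by the user-similarity cron, read by the CF engine
-- (column types assumed to mirror item_similarity)
-- ─────────────────────────────────────────────
CREATE TABLE user_similarity (
user_a INT NOT NULL,
user_b INT NOT NULL,
similarity FLOAT NOT NULL,
PRIMARY KEY (user_a, user_b),
INDEX idx_user_sim (user_a, similarity DESC)
);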
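The user_item_scores, item_similarity, and user_similarity tables act as materialised views: they are filled ahead of time (mostly by offline cron jobs) and the API only reads them. Never compute cosine similarity on every API request; the pairwise work grows quadratically with catalogue size and doesn't scale past a few hundred content items. Pre-compute, store, serve.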
A recommendation engine is only as good as its training signal. Before any algorithm runs, you need to reliably capture what users do. On aplly.in, every meaningful interaction needs to be logged as a weighted signal.
Not all interactions are equal. Define implicit weights upfront:
| Event | Weight | Reasoning |
|---|---|---|
| bookmark | 1.0 | Explicit intent signal, the strongest |
| complete + score ≥ 80% | 0.9 | Finished and did well: deep engagement |
| complete + score 40–79% | 0.7 | Struggled but persisted: high relevance |
| complete + score < 40% | 0.5 | May be too hard: weak positive signal |
| start (spent ≥ 5 min) | 0.4 | Engaged but didn't finish |
| start (spent 1–5 min) | 0.2 | Some engagement, abandoned early |
| start (spent < 1 min) | 0.1 | Likely a bounce: weak signal |
| view (page load only) | 0.05 | Curiosity, not intent |
<?php
class EventLogger {
/** Implicit weight table — tune these from A/B tests over time */
const WEIGHTS = [
'bookmark' => 1.0,
'complete' => 0.7, // base; overridden by score below
'start' => 0.4, // overridden by duration below
'view' => 0.05,
];
private $db;
public function __construct(PDO $db) {
$this->db = $db;
}
/**
* Log a single user event and immediately upsert
* the pre-computed user_item_scores row.
*
* @param int $userId
* @param int $contentId
* @param string $eventType view | start | complete | bookmark | score
* @param array $meta ['score_pct' => 74, 'duration_sec' => 320, ...]
*/
public function log(
int $userId,
int $contentId,
string $eventType,
array $meta = []
): void {
// 1. Insert raw event (never update — append-only log)
$stmt = $this->db->prepare("
INSERT INTO user_events
(user_id, content_id, event_type, score_pct, duration_sec, session_id)
VALUES (?, ?, ?, ?, ?, ?)
");
$stmt->execute([
$userId,
$contentId,
$eventType,
$meta['score_pct'] ?? null,
$meta['duration_sec'] ?? null,
$meta['session_id'] ?? null,
]);
// 2. Calculate implicit weight for this event
$weight = $this->computeWeight($eventType, $meta);
// 3. Upsert into pre-computed scores table
// We take max() — never let a bad bounce overwrite a bookmark
$upsert = $this->db->prepare("
INSERT INTO user_item_scores (user_id, content_id, score)
VALUES (?, ?, ?)
ON DUPLICATE KEY UPDATE
score = GREATEST(score, VALUES(score))
");
$upsert->execute([$userId, $contentId, $weight]);
}
private function computeWeight(string $event, array $meta): float {
switch ($event) {
case 'complete':
$score = $meta['score_pct'] ?? 50;
if ($score >= 80) return 0.9;
if ($score >= 40) return 0.7;
return 0.5; // completed but scored badly
case 'start':
$dur = $meta['duration_sec'] ?? 0;
if ($dur >= 300) return 0.4; // 5+ min
if ($dur >= 60) return 0.2; // 1–5 min
return 0.1; // bounce
default:
return self::WEIGHTS[$event] ?? 0.05;
}
}
}
Content-based filtering answers: "Given items this user has interacted with, find other items with similar properties." It works from day one because it only needs item metadata — no other users required.
Each content item is represented as a tag vector. The similarity between two items is the cosine of the angle between their vectors. Identical items: cosine = 1.0. Completely different: cosine = 0.0.
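For reference, the standard cosine similarity between two tag vectors can be written as below. The symbols $a_t$ and $b_t$ (the weight of tag $t$ in items A and B) are notation introduced here for illustration; a tag that an item does not carry counts as weight 0.

```latex
\mathrm{sim}(A, B) = \cos\theta
  = \frac{A \cdot B}{\lVert A \rVert \, \lVert B \rVert}
  = \frac{\sum_{t} a_t \, b_t}{\sqrt{\sum_{t} a_t^{2}} \; \sqrt{\sum_{t} b_t^{2}}}
```

Because every tag weight is non-negative, the result always falls between 0 and 1.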
Imagine our tag universe has 5 dimensions. Take two SSC CGL tests and one IBPS PO test:
| Tag | Test A — SSC CGL Mock 1 | Test B — SSC CGL Mock 2 | Test C — IBPS PO Mock 1 |
|---|---|---|---|
| ssc-cgl | 1.0 | 1.0 | 0.0 |
| quantitative-aptitude | 0.7 | 0.8 | 0.6 |
| reasoning | 0.5 | 0.5 | 0.7 |
| banking | 0.0 | 0.0 | 1.0 |
| difficulty-medium | 0.5 | 0.4 | 0.5 |
sim(A, B): dot = (1×1)+(0.7×0.8)+(0.5×0.5)+(0×0)+(0.5×0.4) = 1+0.56+0.25+0+0.2 = 2.01
‖A‖ = √(1+0.49+0.25+0+0.25) = √1.99 ≈ 1.41 ‖B‖ = √(1+0.64+0.25+0+0.16) = √2.05 ≈ 1.43
sim = 2.01 / √(1.99 × 2.05) ≈ 0.995, which is very similar, as expected.
sim(A, C): dot = (1×0)+(0.7×0.6)+(0.5×0.7)+(0×1)+(0.5×0.5) = 0+0.42+0.35+0+0.25 = 1.02
‖C‖ = √(0+0.36+0.49+1+0.25) = √2.10 ≈ 1.45 → sim = 1.02 / √(1.99 × 2.10) ≈ 0.499, which is moderate: the tests share QA and reasoning but belong to different categories.
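The worked example can be checked with a few lines of PHP. This is a minimal sketch: the function name `cosineSimilarity` and the variables `$testA`, `$testB`, `$testC` are hypothetical, and the vectors are copied from the table above with zero-weight tags omitted.

```php
<?php
declare(strict_types=1);

/**
 * Cosine similarity between two sparse tag vectors (tag => weight).
 * Tags missing from a vector are treated as weight 0.
 */
function cosineSimilarity(array $a, array $b): float
{
    $dot = 0.0;
    // Only tags present in both vectors contribute to the dot product.
    foreach (array_intersect_key($a, $b) as $tag => $weightA) {
        $dot += $weightA * $b[$tag];
    }
    $normA = sqrt(array_sum(array_map(fn($w) => $w * $w, $a)));
    $normB = sqrt(array_sum(array_map(fn($w) => $w * $w, $b)));

    // An empty or all-zero vector has no direction; treat it as dissimilar.
    if ($normA == 0.0 || $normB == 0.0) {
        return 0.0;
    }
    return $dot / ($normA * $normB);
}

// Tag vectors from the table (hypothetical variable names).
$testA = ['ssc-cgl' => 1.0, 'quantitative-aptitude' => 0.7, 'reasoning' => 0.5, 'difficulty-medium' => 0.5];
$testB = ['ssc-cgl' => 1.0, 'quantitative-aptitude' => 0.8, 'reasoning' => 0.5, 'difficulty-medium' => 0.4];
$testC = ['quantitative-aptitude' => 0.6, 'reasoning' => 0.7, 'banking' => 1.0, 'difficulty-medium' => 0.5];

printf("sim(A, B) = %.3f\n", cosineSimilarity($testA, $testB)); // 0.995
printf("sim(A, C) = %.3f\n", cosineSimilarity($testA, $testC)); // 0.499
```

Storing only non-zero tags keeps the vectors sparse, which is the same shortcut the nightly job relies on when it intersects tag keys.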
Run this nightly. For N items, this is O(N²), but N is small (hundreds of tests on aplly.in). For 500 items that is roughly 125,000 pairs, which should finish within a second or so on typical hardware.
<?php
/**
* Cron: Rebuild item-item cosine similarity matrix
* Schedule: nightly at 2 AM
* crontab: 0 2 * * * /usr/bin/php /var/www/cron/build_item_similarity.php
*/
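// __DIR__ keeps the path valid under cron, where the working directory is not the script's folder
require_once __DIR__ . '/../config/db.php';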
// ── Step 1: Load all content tag vectors ─────────────────────────────
$rows = $db->query("
SELECT ct.content_id, ct.tag, ct.weight
FROM content_tags ct
JOIN content c ON c.id = ct.content_id
ORDER BY ct.content_id
")->fetchAll(PDO::FETCH_ASSOC);
// Build: $vectors[content_id][tag] = weight
$vectors = [];
foreach ($rows as $row) {
$vectors[$row['content_id']][$row['tag']] = (float) $row['weight'];
}
$itemIds = array_keys($vectors);
// ── Step 2: Pre-compute L2 norms (denominator of cosine) ─────────────
$norms = [];
foreach ($vectors as $id => $vec) {
$norms[$id] = sqrt(array_sum(array_map(fn($w) => $w * $w, $vec)));
}
// ── Step 3: Compute pairwise cosine similarity ───────────────────────
$similarities = [];
for ($i = 0; $i < count($itemIds); $i++) {
for ($j = $i + 1; $j < count($itemIds); $j++) {
$a = $itemIds[$i];
$b = $itemIds[$j];
// Dot product: only iterate tags that BOTH items share
$shared = array_intersect_key($vectors[$a], $vectors[$b]);
$dotProd = 0.0;
foreach ($shared as $tag => $wA) {
$dotProd += $wA * $vectors[$b][$tag];
}
// Avoid division by zero for untagged items
if ($norms[$a] === 0.0 || $norms[$b] === 0.0) continue;
$sim = $dotProd / ($norms[$a] * $norms[$b]);
// Only store pairs with sim > 0.1 — prune irrelevant pairs
if ($sim > 0.1) {
$similarities[] = [$a, $b, $sim];
$similarities[] = [$b, $a, $sim]; // symmetric
}
}
}
// ── Step 4: Bulk-write to item_similarity table ───────────────────────
$db->exec("TRUNCATE TABLE item_similarity");
$stmt = $db->prepare("
INSERT INTO item_similarity (item_a, item_b, similarity)
VALUES (?, ?, ?)
");
$db->beginTransaction();
foreach ($similarities as [$a, $b, $sim]) {
$stmt->execute([$a, $b, $sim]);
}
$db->commit();
echo "Done. Wrote " . count($similarities) . " similarity pairs.\n";
Once similarities are pre-computed, the query is fast and simple:
<?php
class ContentBasedRecommender {
public function __construct(private readonly PDO $db) {}
/**
* Get content-based recommendations for a user.
*
* Strategy:
* 1. Find items the user has interacted with (their "profile")
* 2. Look up their most similar neighbours in item_similarity
* 3. Filter out items the user already knows
* 4. Aggregate similarity scores weighted by interaction strength
*/
public function recommend(int $userId, int $limit = 20): array {
// Items user has interacted with and their implicit scores
$interacted = $this->db->prepare("
SELECT content_id, score
FROM user_item_scores
WHERE user_id = ?
ORDER BY score DESC
LIMIT 30
");
$interacted->execute([$userId]);
$profile = $interacted->fetchAll(PDO::FETCH_KEY_PAIR);
// $profile = [content_id => score, ...]
if (empty($profile)) return [];
$seenIds = array_keys($profile);
$placeholders = implode(',', array_fill(0, count($seenIds), '?'));
// For each item in the user's profile, find their similar neighbours
$simStmt = $this->db->prepare("
SELECT item_a, item_b, similarity
FROM item_similarity
WHERE item_a IN ($placeholders)
AND item_b NOT IN ($placeholders) -- exclude already-seen items
ORDER BY similarity DESC
");
$simStmt->execute([...$seenIds, ...$seenIds]);
$simRows = $simStmt->fetchAll(PDO::FETCH_ASSOC);
/*
* Score aggregation:
* For each candidate item B, accumulate:
* score(B) += profile_score(A) × similarity(A, B)
*
* This means: if the user loved a high-scoring item A,
* and B is very similar to A, B gets a strong score.
*/
$scores = [];
foreach ($simRows as $row) {
$sourceScore = $profile[$row['item_a']] ?? 0;
$b = $row['item_b'];
$scores[$b] = ($scores[$b] ?? 0) + $sourceScore * $row['similarity'];
}
// Normalise to 0–1 range
if (!empty($scores)) {
$max = max($scores);
foreach ($scores as &$s) $s /= $max;
}
arsort($scores);
return array_slice($scores, 0, $limit, true);
// Returns: [content_id => score, ...]
}
}
Collaborative filtering says: "Find users who behaved like this user, and recommend what they liked." It can discover items the content-based engine will never find — because it doesn't care about item properties, only who liked what.
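There are two flavours: user-based (find users with similar histories) and item-based (find items that the same users tend to like). The code in this section implements the user-based variant.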
The foundation of collaborative filtering is a matrix where rows are users and columns are content items. Each cell is the implicit interaction score (0 = no interaction).
The same cosine formula applies — but now the vectors are rows of the user-item matrix instead of item tag vectors.
<?php
/**
* Cron: Build user-user similarity (K-Nearest Neighbours approach)
* We only need the TOP-K similar users per user, not all pairs.
* K = 20 is usually enough for exam platforms.
*/
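// __DIR__ keeps the path valid under cron, where the working directory is not the script's folder
require_once __DIR__ . '/../config/db.php';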
// Load the full user-item interaction matrix
$rows = $db->query("
SELECT user_id, content_id, score
FROM user_item_scores
WHERE score > 0.05 -- ignore pure page views
")->fetchAll(PDO::FETCH_ASSOC);
// Build interaction map: $matrix[user_id][content_id] = score
$matrix = [];
foreach ($rows as $row) {
$matrix[$row['user_id']][$row['content_id']] = (float) $row['score'];
}
// Pre-compute L2 norms for all users
$norms = [];
foreach ($matrix as $uid => $vec) {
$norms[$uid] = sqrt(array_sum(array_map(fn($s) => $s**2, $vec)));
}
$userIds = array_keys($matrix);
$K = 20; // keep top-20 neighbours per user
$db->exec("TRUNCATE TABLE user_similarity");
$stmt = $db->prepare("
INSERT INTO user_similarity (user_a, user_b, similarity)
VALUES (?, ?, ?)
");
foreach ($userIds as $uid) {
$sims = [];
foreach ($userIds as $other) {
if ($other === $uid) continue;
// Only compare over shared items — sparse shortcut
$shared = array_intersect_key($matrix[$uid], $matrix[$other]);
if (count($shared) < 2) continue; // need overlap of ≥2 items
$dot = 0.0;
foreach ($shared as $cid => $score) {
$dot += $score * $matrix[$other][$cid];
}
$denom = $norms[$uid] * $norms[$other];
if ($denom == 0) continue;
$sims[$other] = $dot / $denom;
}
// Keep only top-K similar users
arsort($sims);
$topK = array_slice($sims, 0, $K, true);
$db->beginTransaction();
foreach ($topK as $otherUid => $sim) {
$stmt->execute([$uid, $otherUid, $sim]);
}
$db->commit();
}
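Given a user's K nearest neighbours, aggregate their item scores, weighting each neighbour by how similar they are to our user:

$$\hat{r}(u, i) = \frac{\sum_{v \in N(u)} \mathrm{sim}(u, v)\, r(v, i)}{\sum_{v \in N(u)} \lvert \mathrm{sim}(u, v) \rvert}$$

where N(u) is the neighbour set of user u, sim(u,v) is the user similarity, and r(v,i) is neighbour v's score for item i. Both sums run only over neighbours who have a score for item i.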
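To make the weighted sum concrete, here is a small sketch with made-up numbers. The function name `predictFromNeighbours`, the neighbour IDs, and all scores are hypothetical; the arithmetic follows the formula described above.

```php
<?php
declare(strict_types=1);

/**
 * Predict scores for unseen items from neighbour ratings.
 *
 * @param array $similarities neighbour_id => sim(u, v)
 * @param array $ratings      neighbour_id => [content_id => r(v, i)]
 * @return array              content_id => predicted score, best first
 */
function predictFromNeighbours(array $similarities, array $ratings): array
{
    $numerator = [];
    $denominator = [];
    foreach ($ratings as $neighbourId => $items) {
        $sim = $similarities[$neighbourId] ?? 0.0;
        foreach ($items as $contentId => $rating) {
            $numerator[$contentId]   = ($numerator[$contentId] ?? 0.0) + $sim * $rating;
            $denominator[$contentId] = ($denominator[$contentId] ?? 0.0) + abs($sim);
        }
    }
    $predictions = [];
    foreach ($numerator as $contentId => $sum) {
        if ($denominator[$contentId] > 0) {
            $predictions[$contentId] = $sum / $denominator[$contentId];
        }
    }
    arsort($predictions);
    return $predictions;
}

// Hypothetical data: two neighbours, three candidate items.
$similarities = [101 => 0.8, 102 => 0.4];
$ratings = [
    101 => [7 => 0.9, 8 => 0.5],
    102 => [7 => 0.3, 9 => 1.0],
];
$predicted = predictFromNeighbours($similarities, $ratings);
foreach ($predicted as $contentId => $score) {
    printf("item %d: %.2f\n", $contentId, $score);
}
// item 9: 1.00
// item 7: 0.70
// item 8: 0.50
```

Item 9 ranks first even though only one weakly similar neighbour rated it, because the denominator normalises by the similarities of the raters alone. If that behaviour is undesirable, one option is to require a minimum number of raters per item before trusting the prediction.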
<?php
class CollaborativeRecommender {
public function __construct(private readonly PDO $db) {}
public function recommend(int $userId, int $limit = 20): array {
// Items this user already knows — exclude from recommendations
$seenStmt = $this->db->prepare("
SELECT content_id FROM user_item_scores WHERE user_id = ?
");
$seenStmt->execute([$userId]);
$seen = $seenStmt->fetchAll(PDO::FETCH_COLUMN);
// Fetch top-K neighbours from pre-computed similarity table
$nbStmt = $this->db->prepare("
SELECT user_b AS neighbour_id, similarity
FROM user_similarity
WHERE user_a = ?
ORDER BY similarity DESC
LIMIT 20
");
$nbStmt->execute([$userId]);
$neighbours = $nbStmt->fetchAll(PDO::FETCH_ASSOC);
if (empty($neighbours)) return [];
$neighbourIds = array_column($neighbours, 'neighbour_id');
$simMap = array_column($neighbours, 'similarity', 'neighbour_id');
$seenPh = implode(',', array_fill(0, count($seen) ?: 1, '?'));
$nbPh = implode(',', array_fill(0, count($neighbourIds), '?'));
// Get all items that neighbours have interacted with (not seen by our user)
$itemStmt = $this->db->prepare("
SELECT user_id, content_id, score
FROM user_item_scores
WHERE user_id IN ($nbPh)
AND content_id NOT IN ($seenPh)
");
$params = array_merge($neighbourIds, empty($seen) ? [-1] : $seen);
$itemStmt->execute($params);
$items = $itemStmt->fetchAll(PDO::FETCH_ASSOC);
/*
* Weighted sum formula:
* predicted(u, i) = Σ sim(u,v) * r(v,i) / Σ |sim(u,v)|
*
* numerator[i] = running sum of sim-weighted ratings
* denominator[i] = running sum of similarities
*/
$numerator = [];
$denominator = [];
foreach ($items as $row) {
$vid = $row['user_id'];
$cid = $row['content_id'];
$sim = $simMap[$vid] ?? 0;
$numerator[$cid] = ($numerator[$cid] ?? 0) + $sim * (float)$row['score'];
$denominator[$cid] = ($denominator[$cid] ?? 0) + abs($sim);
}
$predictions = [];
foreach ($numerator as $cid => $num) {
if ($denominator[$cid] > 0) {
$predictions[$cid] = $num / $denominator[$cid];
}
}
arsort($predictions);
return array_slice($predictions, 0, $limit, true);
}
}
Neither content-based nor collaborative filtering is best on its own. The hybrid combiner merges their outputs with tunable weights and applies score normalisation so that scores from both engines are on the same 0–1 scale before combining.
<?php
class HybridRecommender {
public function __construct(
private readonly PDO $db,
private readonly ContentBasedRecommender $cbEngine,
private readonly CollaborativeRecommender $cfEngine,
private readonly PopularityRecommender $popEngine,
) {}
public function recommend(int $userId, int $limit = 10): array {
// Determine how many interactions this user has
$countStmt = $this->db->prepare("
SELECT COUNT(*) FROM user_item_scores WHERE user_id = ?
");
$countStmt->execute([$userId]);
$interactionCount = (int) $countStmt->fetchColumn();
// ── Adaptive weights (switched hybrid) ───────────────────────────
//
// interactions: 0–2 → cold start (popularity only)
// interactions: 3–9 → mostly content-based
// interactions: 10–29 → balanced CB + CF
// interactions: 30+ → CF leads, CB supports
//
// These thresholds should be tuned from your data over time.
if ($interactionCount < 3) {
return $this->popEngine->recommend($userId, $limit);
}
$wCB = 0.7; $wCF = 0.3; // 3–9 interactions
if ($interactionCount >= 10) { $wCB = 0.5; $wCF = 0.5; }
if ($interactionCount >= 30) { $wCB = 0.3; $wCF = 0.7; }
// Get candidate lists from each engine (more than needed, then trim)
$cbCandidates = $this->cbEngine->recommend($userId, 30);
$cfCandidates = $this->cfEngine->recommend($userId, 30);
// Merge: aggregate weighted scores across all candidates
$merged = [];
foreach ($cbCandidates as $cid => $score) {
$merged[$cid] = ($merged[$cid] ?? 0) + $wCB * $score;
}
foreach ($cfCandidates as $cid => $score) {
$merged[$cid] = ($merged[$cid] ?? 0) + $wCF * $score;
}
arsort($merged);
// Enrich with content metadata before returning
return $this->enrichWithMetadata($merged, $limit);
}
private function enrichWithMetadata(array $scores, int $limit): array {
if (empty($scores)) return [];
$topN = array_slice($scores, 0, $limit * 3, true); // get 3× for re-ranking
$ids = array_keys($topN);
$ph = implode(',', array_fill(0, count($ids), '?'));
$stmt = $this->db->prepare("
SELECT id, title, content_type, category, difficulty, duration_min
FROM content WHERE id IN ($ph)
");
$stmt->execute($ids);
$meta = $stmt->fetchAll(PDO::FETCH_UNIQUE | PDO::FETCH_ASSOC);
$result = [];
foreach ($ids as $id) {
if (!isset($meta[$id])) continue;
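// FETCH_UNIQUE uses the first column (id) as the array key and drops it from the row, so add it back
$result[] = array_merge(['id' => (int) $id], $meta[$id], ['rec_score' => $topN[$id]]);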
}
return $result;
}
}
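The weighted merge inside `HybridRecommender::recommend()` can be traced by hand with a tiny example. This is only a sketch: `blendScores` is a hypothetical helper and the candidate scores are invented, but the weights are the 0.7 / 0.3 split used for users with 3 to 9 interactions.

```php
<?php
declare(strict_types=1);

/**
 * Blend two candidate lists (content_id => score in 0..1) with fixed weights.
 * An item found by only one engine simply misses the other engine's term.
 */
function blendScores(array $cb, array $cf, float $wCB, float $wCF): array
{
    $merged = [];
    foreach ($cb as $id => $score) {
        $merged[$id] = ($merged[$id] ?? 0.0) + $wCB * $score;
    }
    foreach ($cf as $id => $score) {
        $merged[$id] = ($merged[$id] ?? 0.0) + $wCF * $score;
    }
    arsort($merged);
    return $merged;
}

// Hypothetical candidates from each engine.
$cb = [7 => 1.0, 9 => 0.5];
$cf = [9 => 0.8, 11 => 0.6];

$blended = blendScores($cb, $cf, 0.7, 0.3);
foreach ($blended as $id => $score) {
    printf("item %d: %.2f\n", $id, $score);
}
// item 7: 0.70
// item 9: 0.59
// item 11: 0.18
```

Item 9 is the only one both engines agree on, yet item 7 still wins because content-based scores dominate at this weight split.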
The hybrid combiner gives you a list ranked purely by predicted relevance. That ranking ignores things that matter on an exam-prep platform: the user should face a difficulty progression, they shouldn't see the same exam category five times in a row, and items they struggled with should resurface (spaced repetition).
Re-ranking applies these rules to the combiner's output. It runs after scoring, not during — this keeps your algorithms clean and your business logic separate.
<?php
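class ReRanker {
// The rules below query user_events and content, so the class needs a PDO handle
public function __construct(private readonly PDO $db) {}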
/**
* Apply all business rules to re-rank a candidate list.
*
* @param array $candidates [['id'=>, 'rec_score'=>, 'difficulty'=>, 'category'=>, ...], ...]
* @param int $userId
* @param int $limit
* @return array Top $limit items after re-ranking
*/
public function rerank(
array $candidates,
int $userId,
int $limit = 10
): array {
$candidates = $this->applySpacedRepetitionBoost($candidates, $userId);
$candidates = $this->applyDifficultyProgression($candidates, $userId);
$candidates = $this->applyRecencyDecay($candidates, $userId);
// Sort by final adjusted score
usort($candidates, fn($a, $b) => $b['rec_score'] <=> $a['rec_score']);
// Apply diversity: cap each category at 2 items in final list
return $this->applyDiversityFilter($candidates, $limit);
}
// ─────────────────────────────────────────────────────────────────────
// RULE 1: Spaced Repetition Boost
//
// Ebbinghaus forgetting curve: material that was hard for the user
// and hasn't been revisited should resurface after 1–7 days.
//
// Boost = 0.3 × (1 - last_score) × time_factor
// - (1 - last_score): the worse they did, the bigger the boost
// - time_factor: peaks at 3 days, zero at 0 days, fades after 14 days
// ─────────────────────────────────────────────────────────────────────
private function applySpacedRepetitionBoost(array $items, int $userId): array {
if (empty($items)) return $items; // an empty IN () list is a SQL syntax error
$ids = array_column($items, 'id');
$ph = implode(',', array_fill(0, count($ids), '?'));
// Get most recent score for this user on each candidate item
$stmt = $this->db->prepare("
SELECT content_id, score_pct, created_at
FROM user_events
WHERE user_id = ? AND event_type = 'score'
AND content_id IN ($ph)
ORDER BY created_at DESC
");
$stmt->execute([$userId, ...$ids]);
$scores = [];
foreach ($stmt->fetchAll(PDO::FETCH_ASSOC) as $row) {
if (!isset($scores[$row['content_id']])) {
$scores[$row['content_id']] = $row;
}
}
foreach ($items as &$item) {
if (!isset($scores[$item['id']])) continue;
$data = $scores[$item['id']];
$daysSince = (time() - strtotime($data['created_at'])) / 86400;
$lastScore = (float) $data['score_pct'] / 100;
// Time factor: 0 at day 0, peaks at 1.0 on day 3, then decays (≈0.12 by day 14)
$timeFactor = max(0, ($daysSince / 3) * exp(1 - $daysSince / 3));
$boost = 0.3 * (1 - $lastScore) * min(1, $timeFactor);
$item['rec_score'] += $boost;
}
return $items;
}
// ─────────────────────────────────────────────────────────────────────
// RULE 2: Difficulty Progression
//
// If a user averages ≥ 70% on difficulty-2 items, target difficulty 3:
// items at that level keep their score, items further away are scaled down.
// ─────────────────────────────────────────────────────────────────────
private function applyDifficultyProgression(array $items, int $userId): array {
// Find user's average score per difficulty level
$stmt = $this->db->prepare("
SELECT c.difficulty, AVG(e.score_pct) AS avg_score
FROM user_events e
JOIN content c ON c.id = e.content_id
WHERE e.user_id = ? AND e.event_type = 'score'
GROUP BY c.difficulty
HAVING COUNT(*) >= 2 -- FETCH_KEY_PAIR below needs exactly two selected columns
");
$stmt->execute([$userId]);
$avgByDiff = $stmt->fetchAll(PDO::FETCH_KEY_PAIR);
// Find the user's "current" difficulty: highest diff where avg >= 70%
$currentDiff = 1;
foreach ($avgByDiff as $diff => $avg) {
if ($avg >= 70 && $diff > $currentDiff) $currentDiff = $diff;
}
$targetDiff = min(5, $currentDiff + 1); // challenge the user slightly
foreach ($items as &$item) {
$diff = (int) ($item['difficulty'] ?? 2);
$dist = abs($diff - $targetDiff);
// Gaussian penalty: items far from target difficulty are penalised
// e^(-dist²/2) = 1.0 at dist=0, 0.61 at dist=1, 0.14 at dist=2
$diffFactor = exp(-($dist ** 2) / 2);
$item['rec_score'] *= (0.6 + 0.4 * $diffFactor);
}
return $items;
}
// ─────────────────────────────────────────────────────────────────────
// RULE 3: Recency Decay
//
// Penalise items that were shown to this user recently but not engaged.
// Prevents the same items recycling endlessly in recommendations.
// ─────────────────────────────────────────────────────────────────────
private function applyRecencyDecay(array $items, int $userId): array {
// (In production you'd log recommendation impressions too)
// Here we use view events without follow-up action as a proxy
$stmt = $this->db->prepare("
SELECT content_id, MAX(created_at) as last_seen
FROM user_events
WHERE user_id = ? AND event_type = 'view'
GROUP BY content_id
");
$stmt->execute([$userId]);
$viewLog = $stmt->fetchAll(PDO::FETCH_KEY_PAIR);
foreach ($items as &$item) {
if (!isset($viewLog[$item['id']])) continue;
$daysSince = (time() - strtotime($viewLog[$item['id']])) / 86400;
// Decay: full penalty if seen today, no penalty if seen 7+ days ago
$decay = min(1, $daysSince / 7);
$item['rec_score'] *= (0.5 + 0.5 * $decay);
}
return $items;
}
// ─────────────────────────────────────────────────────────────────────
// RULE 4: Diversity Filter
// Maximum 2 items from same category in top-N results.
// Prevents "all SSC CGL, all the time" problem.
// ─────────────────────────────────────────────────────────────────────
private function applyDiversityFilter(array $items, int $limit): array {
$result = [];
$catCounts = [];
$maxPerCategory = 2; // class constants cannot be declared inside a method
foreach ($items as $item) {
if (count($result) >= $limit) break;
$cat = $item['category'] ?? 'other';
if (($catCounts[$cat] ?? 0) >= $maxPerCategory) continue;
$result[] = $item;
$catCounts[$cat] = ($catCounts[$cat] ?? 0) + 1;
}
// If diversity filter left us short of $limit, fill from the leftovers
if (count($result) < $limit) {
foreach ($items as $item) {
if (in_array($item, $result)) continue;
$result[] = $item;
if (count($result) >= $limit) break;
}
}
return $result;
}
}
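It helps to see how the two curves in the re-ranker actually behave before tuning them. The sketch below tabulates the spaced-repetition time factor and the difficulty multiplier using the same expressions as the class above; the function names `timeFactor` and `difficultyMultiplier` are hypothetical.

```php
<?php
declare(strict_types=1);

/** Spaced-repetition time factor: 0 at day 0, 1.0 on day 3, decaying afterwards. */
function timeFactor(float $daysSince): float
{
    return max(0.0, ($daysSince / 3) * exp(1 - $daysSince / 3));
}

/** Multiplier applied to rec_score for an item $dist difficulty levels from the target. */
function difficultyMultiplier(int $dist): float
{
    return 0.6 + 0.4 * exp(-($dist ** 2) / 2);
}

foreach ([0, 1, 3, 7, 14] as $days) {
    printf("day %2d: time factor %.2f\n", $days, timeFactor((float) $days));
}
foreach ([0, 1, 2, 3] as $dist) {
    printf("dist %d: multiplier %.2f\n", $dist, difficultyMultiplier($dist));
}
```

Two things stand out: the time factor never reaches exactly zero after day 0, and the difficulty multiplier never drops below 0.6, so a badly matched difficulty dampens an item but cannot remove it from the list.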
The cold start problem occurs in two forms: new user cold start (no interaction history) and new item cold start (no one has interacted with a new test yet). Each needs a different solution.
The fastest way to solve cold start is to ask the user directly during registration. Show a 3-question onboarding quiz and instantly populate their interaction profile:
<?php
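class OnboardingSeeder {
// seed() queries content and writes user_item_scores, so the class needs a PDO handle
public function __construct(private readonly PDO $db) {}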
/**
* After onboarding quiz: seed the user's implicit profile
* from their declared preferences so day-1 recs are meaningful.
*
* Quiz question 1: "Which exam are you preparing for?"
* → seeds category preference at weight 1.0
*
* Quiz question 2: "What is your current level?"
* → seeds difficulty preference
*
* Quiz question 3: "How much time per day can you practice?"
* → seeds duration preference (affects which tests are suitable)
*/
public function seed(
int $userId,
string $targetExam, // e.g. 'ssc', 'banking', 'railways'
int $selfLevel, // 1 = beginner, 3 = intermediate, 5 = advanced
int $minutesPerDay // 15, 30, 60
): void {
// Find 2-3 starter items matching their declared exam + difficulty
$stmt = $this->db->prepare("
SELECT id FROM content
WHERE category = ?
AND difficulty BETWEEN ? AND ?
AND duration_min <= ?
ORDER BY popularity DESC
LIMIT 3
");
$stmt->execute([
$targetExam,
max(1, $selfLevel - 1), // ± 1 difficulty band
min(5, $selfLevel + 1),
$minutesPerDay
]);
$starterItems = $stmt->fetchAll(PDO::FETCH_COLUMN);
// Seed user_item_scores directly at a modest weight (0.3): enough to
// bootstrap the CB engine. Nothing is written to user_events, so the
// seeded preferences stay out of later score analysis.
$insert = $this->db->prepare("
INSERT INTO user_item_scores (user_id, content_id, score)
VALUES (?, ?, 0.3)
ON DUPLICATE KEY UPDATE score = GREATEST(score, 0.3)
");
foreach ($starterItems as $itemId) {
$insert->execute([$userId, $itemId]);
}
}
}
-- Popularity recommender that promotes new items
SELECT
id,
title,
category,
difficulty,
-- real popularity score + artificial boost for items < 7 days old
popularity + CASE
WHEN DATEDIFF(NOW(), created_at) < 7
THEN 50 -- inject as if it had 50 extra views
ELSE 0
END AS effective_popularity
FROM content
WHERE category = :category
ORDER BY effective_popularity DESC
LIMIT 10;
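The same new-item boost can be expressed in PHP, which is handy for unit-testing the rule without a database. This is a sketch under assumptions: the function name `effectivePopularity` and its parameters are hypothetical, and `created_at` is assumed never to lie in the future.

```php
<?php
declare(strict_types=1);

/**
 * Mirror of the SQL CASE expression: items younger than $windowDays
 * receive a flat popularity boost.
 */
function effectivePopularity(
    int $popularity,
    DateTimeImmutable $createdAt,
    DateTimeImmutable $now,
    int $windowDays = 7,
    int $boost = 50
): int {
    // MySQL's DATEDIFF() compares calendar dates only, so strip the time part.
    $ageDays = (int) $createdAt->setTime(0, 0)->diff($now->setTime(0, 0))->days;
    return $popularity + ($ageDays < $windowDays ? $boost : 0);
}

$now = new DateTimeImmutable('2026-03-22 10:00:00');
echo effectivePopularity(12, new DateTimeImmutable('2026-03-20 09:00:00'), $now), "\n"; // 62
echo effectivePopularity(40, new DateTimeImmutable('2026-03-01 09:00:00'), $now), "\n"; // 40
```

A flat boost of 50 means something very different on a catalogue where popular items have 100 interactions than on one where they have 10,000, so the boost is worth revisiting as traffic grows.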
The API endpoint wires all the pieces together: it receives a request, calls the hybrid recommender + re-ranker, and returns a clean JSON response.
<?php
/**
* GET /api/recommend.php?limit=10
*
* Requires: Authorization: Bearer {token} header
* Returns: JSON recommendation payload
*
* Response format:
* {
* "status": "ok",
* "user_id": 42,
* "engine": "hybrid-cb70-cf30",
* "from_cache": false,
* "generated_at": "2026-03-22T10:00:00Z",
* "recommendations": [
* {
* "id": 7,
* "title": "SSC CGL Mock Test 3",
* "content_type": "mock_test",
* "category": "ssc",
* "difficulty": 2,
* "duration_min": 30,
* "rec_score": 0.847,
* "reason": "Similar to tests you scored well on"
* }, ...
* ]
* }
*/
require_once '../config/db.php';
require_once '../auth/Auth.php';
require_once '../recommenders/ContentBasedRecommender.php';
require_once '../recommenders/CollaborativeRecommender.php';
require_once '../recommenders/PopularityRecommender.php';
require_once '../recommenders/HybridRecommender.php';
require_once '../recommenders/ReRanker.php';
require_once '../cache/FileCache.php';
header('Content-Type: application/json');
header('Access-Control-Allow-Origin: *');
// ── Auth: validate Bearer token ───────────────────────────────────────
$auth = new Auth($db);
$userId = $auth->getUserFromToken($_SERVER['HTTP_AUTHORIZATION'] ?? '');
if (!$userId) {
http_response_code(401);
echo json_encode(['error' => 'Unauthorized']);
exit;
}
$limit = max(1, min(20, (int) ($_GET['limit'] ?? 10)));
// ── Cache check: serve from 30-min cache if available ─────────────────
$cache = new FileCache('/tmp/rec_cache', 1800); // 30 min TTL
$cacheKey = "rec_{$userId}_{$limit}";
if ($cached = $cache->get($cacheKey)) {
$cached['from_cache'] = true;
echo json_encode($cached);
exit;
}
// ── Run the recommendation pipeline ───────────────────────────────────
$cb = new ContentBasedRecommender($db);
$cf = new CollaborativeRecommender($db);
$pop = new PopularityRecommender($db);
$hybrid = new HybridRecommender($db, $cb, $cf, $pop);
$reranker= new ReRanker($db);
$candidates = $hybrid->recommend($userId, $limit);
$final = $reranker->rerank($candidates, $userId, $limit);
// Attach human-readable reason strings (transparency layer)
foreach ($final as &$item) {
$item['reason'] = $item['rec_score'] > 0.75
? 'Highly matched to your exam goal'
: 'Based on users with similar practice patterns';
}
$response = [
'status' => 'ok',
'user_id' => $userId,
'engine' => 'hybrid',
'from_cache' => false,
'generated_at' => gmdate('Y-m-d\TH:i:s\Z'),
'recommendations' => $final,
];
$cache->set($cacheKey, $response);
echo json_encode($response, JSON_PRETTY_PRINT);
Every recommendation query touches multiple tables. Without caching, a busy exam platform would hammer the DB on every page load. Use a two-layer caching strategy:
| Layer | Mechanism | TTL | What it caches |
|---|---|---|---|
| Layer 1 | PHP file cache (/tmp) | 30 min | Final recommendation JSON per user |
| Layer 2 | MySQL pre-computed tables | Daily cron | item_similarity, user_similarity, user_item_scores |
| Busted by | New user event logged | Immediate | Invalidate user's cache key on any new interaction |
<?php
class FileCache {
public function __construct(
private readonly string $dir,
private readonly int $ttl = 1800
) {
if (!is_dir($this->dir)) mkdir($this->dir, 0755, true);
}
public function get(string $key): ?array {
$path = $this->path($key);
if (!file_exists($path)) return null;
if (filemtime($path) < time() - $this->ttl) {
unlink($path); return null;
}
return json_decode(file_get_contents($path), true);
}
public function set(string $key, array $data): void {
file_put_contents($this->path($key), json_encode($data));
}
public function delete(string $key): void {
$path = $this->path($key);
if (file_exists($path)) unlink($path);
}
private function path(string $key): string {
return $this->dir . '/' . md5($key) . '.json';
}
}
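Because `FileCache` hashes each key with `md5()`, a cache entry can only be deleted by its exact key; a wildcard pattern will not match anything. One way to invalidate a user's recommendations is to enumerate every key the endpoint can produce. The sketch below assumes the `rec_{userId}_{limit}` key format and the 1 to 20 limit clamp from the API endpoint; the helper name `recommendationCacheKeys` is hypothetical.

```php
<?php
declare(strict_types=1);

/**
 * List every recommendation cache key a user can have.
 * The endpoint clamps ?limit to 1..20 and builds keys as "rec_{userId}_{limit}".
 */
function recommendationCacheKeys(int $userId, int $maxLimit = 20): array
{
    $keys = [];
    for ($limit = 1; $limit <= $maxLimit; $limit++) {
        $keys[] = "rec_{$userId}_{$limit}";
    }
    return $keys;
}

// Usage inside an event handler (assumes a FileCache-like object in $cache):
// foreach (recommendationCacheKeys($userId) as $key) { $cache->delete($key); }

$keys = recommendationCacheKeys(42);
echo count($keys), " keys, first: ", $keys[0], ", last: ", end($keys), "\n";
// 20 keys, first: rec_42_1, last: rec_42_20
```

An alternative design is to drop `$limit` from the cache key, always cache the maximum-length list, and slice it per request, which leaves a single key to delete.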
You cannot improve what you don't measure. These three metrics are the standard toolkit for evaluating a recommendation system offline (no live A/B test needed to start).
Of the K items you recommended, what fraction did the user actually engage with?
A Precision@10 of 0.4 means 4 out of every 10 recommended items led to engagement. For an exam-prep platform, aim for P@10 > 0.30 as a starting benchmark.
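Written out, in the form that `precisionAtK()` computes it (the set notation is introduced here for illustration):

```latex
\mathrm{Precision@}K
  = \frac{\lvert \{\text{top-}K\ \text{recommended items}\} \cap \{\text{relevant items}\} \rvert}{K}
```

Here "relevant" means the held-out items the user actually engaged with.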
Unlike Precision, NDCG cares about where in the list the good items appear. A relevant item at position 1 is worth more than one at position 10.
IDCG is the DCG of a perfect ranking. NDCG = 1.0 means your ranking is perfect; 0.0 is terrible. Target NDCG@10 > 0.50 as a starting goal.
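For reference, with binary relevance as used in `ndcgAtK()`, the quantities can be written as follows. The symbols $rel_i$ (1 if the item at rank $i$ is relevant, else 0) and $R$ (the set of relevant items) are notation introduced here.

```latex
\mathrm{DCG@}K = \sum_{i=1}^{K} \frac{rel_i}{\log_2(i + 1)}
\qquad
\mathrm{IDCG@}K = \sum_{i=1}^{\min(K,\,\lvert R \rvert)} \frac{1}{\log_2(i + 1)}
\qquad
\mathrm{NDCG@}K = \frac{\mathrm{DCG@}K}{\mathrm{IDCG@}K}
```

Ranks are 1-indexed in this notation, which is why the zero-indexed PHP loop uses `log($i + 2, 2)`.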
What percentage of the content catalog is ever recommended to at least one user? Low coverage (<30%) means you're effectively ignoring most of your content.
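Coverage has no method in the metrics class, so here is one possible sketch. The function name `catalogueCoverage` and the sample data are hypothetical; it assumes you have already collected each test user's recommended IDs.

```php
<?php
declare(strict_types=1);

/**
 * Catalogue coverage: share of catalogue items that appear in at least
 * one user's recommendation list.
 *
 * @param array $recsPerUser  user_id => list of recommended content IDs
 * @param array $catalogueIds every recommendable content ID
 */
function catalogueCoverage(array $recsPerUser, array $catalogueIds): float
{
    if (count($catalogueIds) === 0) {
        return 0.0;
    }
    $recommended = [];
    foreach ($recsPerUser as $recIds) {
        foreach ($recIds as $id) {
            $recommended[$id] = true; // set semantics: count each item once
        }
    }
    // Ignore IDs that are not part of the catalogue.
    $covered = array_intersect(array_keys($recommended), $catalogueIds);
    return count($covered) / count($catalogueIds);
}

// Hypothetical data: 3 users, 10-item catalogue.
$recs = [1 => [1, 2, 3], 2 => [2, 3, 4], 3 => [1, 4]];
printf("coverage = %.0f%%\n", 100 * catalogueCoverage($recs, range(1, 10))); // coverage = 40%
```

Coverage depends heavily on how many users are sampled, so compare runs only when they use the same test-user set.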
<?php
/**
* Offline evaluation:
* For each user, hold out their last 5 interactions as ground truth.
* Pretend they don't exist, run the recommender, measure precision.
*/
class MetricsCalculator {
public function precisionAtK(array $recommended, array $relevant, int $k): float {
$topK = array_slice($recommended, 0, $k);
$hits = count(array_intersect($topK, $relevant));
return $k > 0 ? $hits / $k : 0.0;
}
public function ndcgAtK(array $recommended, array $relevant, int $k): float {
$dcg = 0.0;
$idcg = 0.0;
for ($i = 0; $i < min($k, count($recommended)); $i++) {
$rel = in_array($recommended[$i], $relevant) ? 1 : 0;
$dcg += $rel / log($i + 2, 2);
}
// IDCG: perfect ordering (all relevant items first)
for ($i = 0; $i < min($k, count($relevant)); $i++) {
$idcg += 1.0 / log($i + 2, 2);
}
return $idcg > 0 ? $dcg / $idcg : 0.0;
}
public function evaluateEngine(HybridRecommender $engine, array $testUsers): array {
$p10s = [];
$ndcg10s= [];
foreach ($testUsers as ['user_id' => $uid, 'held_out' => $heldOut]) {
$recs = $engine->recommend($uid, 10);
$recIds = array_column($recs, 'id');
$p10s[] = $this->precisionAtK($recIds, $heldOut, 10);
$ndcg10s[]= $this->ndcgAtK($recIds, $heldOut, 10);
}
return [
'precision@10' => round(array_sum($p10s) / count($p10s), 4),
'ndcg@10' => round(array_sum($ndcg10s) / count($ndcg10s), 4),
];
}
}
1. Run schema.sql. Tag all existing content items in content_tags. Add difficulty and category metadata to all mock tests.
2. Call EventLogger::log() from every test start, test completion, score submission, and bookmark action.
3. Call OnboardingSeeder::seed() immediately after registration.
4. Deploy api/recommend.php behind your auth middleware. Test with curl -H "Authorization: Bearer TOKEN" /api/recommend.php?limit=10.
5. In EventLogger::log(), after writing the event, bust the user's recommendation cache. FileCache::delete() hashes the exact key, so a wildcard such as "rec_{$userId}_*" matches nothing; delete each "rec_{$userId}_{$limit}" key (limit 1 to 20) instead.
6. Run MetricsCalculator::evaluateEngine() with a holdout set. Record your Precision@10 and NDCG@10. This is your baseline.
7. Tune the $wCB/$wCF ratios, spaced repetition boost (0.3), and diversity cap (2 per category) based on real metrics. Increment slowly and measure every change.
8. FileCache works fine up to ~10,000 users. Beyond that, swap it for a RedisCache implementation behind the same interface.

| Feature | Value | When to build |
|---|---|---|
| A/B testing framework | Measure impact of algorithm changes on real CTR | Month 2 |
| Real-time score updates | Update user_item_scores instantly on test submission (vs cron lag) | Month 1 |
| Sequence-aware recs | "You just finished SSC CGL Mock 1 — try Mock 2 next" (session context) | Month 3 |
| Matrix factorisation (SVD) | Better CF at scale — replace user-user with latent factor model | Month 4+ |
| Redis cache | Sub-millisecond cache for high-traffic periods (exam seasons) | Month 3+ |
| Notification trigger | "You haven't practiced in 3 days — here's what to review" | Month 5+ |