π― What You'll Learn
- Distinguish three memory strategies β sliding window, summarisation, and hybrid β and choose the right one
- Write a summarisation function that condenses old turns using the LLM itself
- Implement a three-axis alignment scorer (helpfulness, safety, relevance) that runs before replies reach learners
- Build a complete teaching pipeline that chains memory + scoring + gating logic
- Apply a log-and-review system for flagged low-quality or unsafe responses
π Before You Begin
- Completed Tutorial 1 β prompt builder, conversation wrapper, sliding window
- Completed Tutorial 2 β Flask API, session-based memory, chat UI
- Comfortable reading Python functions with multiple return values
- Basic understanding of what "alignment" means in AI contexts
1. The Memory Problem at Scale
In Tutorials 1 and 2, we handled memory with a sliding window β dropping turns older than N from the prompt. For short sessions this is fine. But in a teaching pipeline β where a student might revisit a case study over 30+ exchanges β the sliding window has a serious flaw:
This tutorial introduces two better strategies β summarisation and a hybrid approach β and then adds a layer the sliding window never addresses: alignment scoring. Before any reply reaches a learner, we automatically score it for helpfulness, safety, and relevance β and gate low-quality or unsafe replies before they do harm.
π What We're Building in This Tutorial
- A summarise_old_turns() function that compresses history using the LLM
- A hybrid memory system β compressed context prefix + recent turns window
- A score_reply() alignment scorer β three axes, one verdict
- A teaching_pipeline() that chains all three together with gating logic
- A flag log that records borderline replies for human review
2. Three Memory Strategies Compared
Choose your memory strategy based on session length and how much context continuity matters:
Sliding Window
Keep only the last N turns. Old turns are permanently dropped. Fast, zero API cost.
Best for: Short sessionsSummarisation
When old turns are dropped, ask the LLM to compress them into a short paragraph first. Preserves meaning at a fraction of the token cost.
Best for: Medium sessionsHybrid
Maintain a rolling compressed summary plus a window of recent turns. Best of both β deep context and full recency.
Best for: Long sessions| Strategy | Token Cost | Context Preserved | Complexity |
|---|---|---|---|
| Sliding Window | Low | Recent turns only β old context lost | Minimal |
| Summarisation | Medium (one extra API call) | Full session compressed | Moderate |
| Hybrid | MediumβHigh | Full session compressed + exact recent turns | Moderate |
3. Strategy A β Sliding Window (Recap)
From Tutorial 1 β included here for completeness and side-by-side comparison with the new strategies:
MAX_TURNS = 10
def build_prompt_window(history, system):
prompt = system + "\n\n"
for turn in history[-MAX_TURNS:]: # slice β older turns silently dropped
role = turn["role"].capitalize()
prompt += f"### {role}:\n{turn['content']}\n\n"
prompt += "### Assistant:\n"
return prompt
Prompt contains turns 6β15 only. Turns 1β5 are gone β no record, no summary, no trace.
4. Strategy B β LLM-Powered Summarisation
Instead of silently discarding old turns, we ask the model itself to compress them into a short paragraph before they fall off the window. This paragraph is then prepended to every future prompt β giving the model a compact "memory" of everything that happened.
Detect when history grows beyond threshold
When len(history) > SUMMARISE_AFTER, the oldest batch of turns needs compressing.
Call the LLM with a compression prompt
Send the old turns to the model with the instruction: "Summarise this conversation in 3β5 sentences, preserving key facts about the learner."
Store the summary separately
The compressed paragraph is saved in a summary variable. The old raw turns are removed from the history list.
Prepend the summary to every future prompt
Every build_prompt() call injects the summary just after the system instruction β giving the model full context in a fraction of the tokens.
import requests
HF_URL = "https://api-inference.huggingface.co/models/YOUR_MODEL"
HF_HEADERS = {"Authorization": "Bearer YOUR_TOKEN"}
SYSTEM = "You are a helpful educational assistant."
SUMMARISE_AFTER = 12 # compress once history exceeds 12 turns
KEEP_RECENT = 6 # always keep this many turns verbatim
def call_llm(prompt, max_tokens=300):
"""Raw LLM call β returns the assistant reply string."""
resp = requests.post(
HF_URL, headers=HF_HEADERS,
json={"inputs": prompt, "parameters": {"max_new_tokens": max_tokens}},
timeout=30
)
raw = resp.json()[0]["generated_text"]
return raw.split("### Assistant:")[-1].strip()
def summarise_old_turns(old_turns):
"""Ask the LLM to compress a batch of turns into one paragraph."""
convo_text = ""
for t in old_turns:
convo_text += f"{t['role'].capitalize()}: {t['content']}\n"
compression_prompt = (
"You are a conversation summariser.\n\n"
"Summarise the following conversation in 3β5 sentences. "
"Preserve: the learner's name or role if stated, "
"the topic being studied, key concepts discussed, "
"and any confusion the learner expressed.\n\n"
f"Conversation:\n{convo_text}\n\n"
"### Assistant:\n" # model writes the summary here
)
return call_llm(compression_prompt, max_tokens=150)
def maybe_summarise(history, existing_summary):
"""
If history is long, compress the oldest batch.
Returns (updated_history, updated_summary).
"""
if len(history) <= SUMMARISE_AFTER:
return history, existing_summary # not long enough yet β nothing to do
old_turns = history[:-KEEP_RECENT] # batch to compress
recent_turns = history[-KEEP_RECENT:] # keep these verbatim
new_summary = summarise_old_turns(old_turns)
if existing_summary: # chain summaries if we've summarised before
new_summary = existing_summary + " " + new_summary
return recent_turns, new_summary # smaller history + richer summary
def build_prompt_with_summary(history, summary):
prompt = SYSTEM + "\n\n"
if summary:
prompt += f"[Conversation so far]: {summary}\n\n" # inject compressed memory
for turn in history: # only recent verbatim turns
role = turn["role"].capitalize()
prompt += f"### {role}:\n{turn['content']}\n\n"
prompt += "### Assistant:\n"
return prompt
You are a helpful educational assistant. [Conversation so far]: The learner is an MBA student studying marketing. They asked about brand equity, expressed confusion about intangible assets, and worked through a Coca-Cola case study. They prefer examples over theory. ### User: Now explain customer lifetime value. ### Assistant:
5. Strategy C β Hybrid: Summary Prefix + Window
The hybrid approach is what you should use in a production teaching pipeline. It gives you:
- Deep context β via the rolling summary paragraph (everything that happened before)
- Full recency β via the verbatim window (exact wording of the last N turns)
- Bounded token cost β the prompt never grows beyond a predictable size
Here is the complete MemoryManager class that wraps both strategies together:
class MemoryManager:
"""
Manages conversation history with a rolling summary + recent window.
Drop-in replacement for a plain list.
"""
def __init__(self, summarise_after=12, keep_recent=6):
self.history = [] # recent verbatim turns
self.summary = "" # compressed older context
self.summarise_after = summarise_after # compress when history exceeds this
self.keep_recent = keep_recent # always keep this many verbatim
def add(self, role, content):
"""Append a turn, then compress if history is too long."""
self.history.append({"role": role, "content": content})
self._maybe_summarise()
def _maybe_summarise(self):
if len(self.history) <= self.summarise_after:
return # nothing to do yet
old_turns = self.history[:-self.keep_recent]
self.history = self.history[-self.keep_recent:] # trim to recent window
new_chunk = summarise_old_turns(old_turns) # compress the old batch
self.summary = (self.summary + " " + new_chunk).strip() # chain summaries
def build_prompt(self, system):
prompt = system + "\n\n"
if self.summary:
prompt += f"[Earlier context]: {self.summary}\n\n" # inject compressed memory
for turn in self.history:
prompt += f"### {turn['role'].capitalize()}:\n{turn['content']}\n\n"
prompt += "### Assistant:\n"
return prompt
def reset(self):
self.history = []
self.summary = ""
# --- Usage ---
memory = MemoryManager(summarise_after=12, keep_recent=6)
memory.add("user", "I'm an MBA student studying marketing failures.")
memory.add("assistant", "Great! Let's start with brand equity...")
# ... more turns ...
# When history hits 12 turns, _maybe_summarise() fires automatically
summary β "The learner is an MBA student studying marketing failures. Discussed brand equity and intangible assets. Explored the New Coke case study." history β [turns 9, 10, 11, 12, 13, 14] β only last 6 verbatim turns
6. What is Alignment Scoring?
Even with perfect memory, a model can still produce replies that are confusing, off-topic, or β in a teaching context β actively harmful (wrong facts stated confidently, inappropriate content, biased analysis). Alignment scoring is the practice of automatically evaluating each reply against defined quality criteria before it reaches the user.
Think of it as a quality gate in a manufacturing pipeline β every reply is inspected before shipping.
We score on three axes:
- Helpfulness (0β10) β Does the reply directly address the question? Is it substantive, clear, and educationally useful?
- Safety (0β10) β Does it avoid harmful content, misinformation, offensive language, or inappropriate material for the learner?
- Relevance (0β10) β Does it stay on the topic the learner is studying? Does it connect to their stated learning context?
A weighted average of the three produces a final score. Replies below the threshold are either regenerated (soft fail) or blocked with a fallback message (hard fail on safety).
7. Building the Alignment Scorer
The scorer uses the LLM itself as the judge β this is called LLM-as-evaluator. We send the reply plus the original question to the model with a strict evaluation prompt and ask it to return a JSON score object.
import json, re
# Weights β safety is most critical in a teaching context
WEIGHTS = {"helpfulness": 0.35, "safety": 0.45, "relevance": 0.20}
PASS_THRESHOLD = 6.5 # weighted average must clear this
SAFETY_HARD_STOP = 4.0 # ANY safety score below this β always block
def score_reply(user_question, assistant_reply, topic_context=""):
"""
Scores an assistant reply on 3 axes using the LLM as evaluator.
Returns a dict: {helpfulness, safety, relevance, weighted_avg, verdict}
"""
eval_prompt = build_eval_prompt(user_question, assistant_reply, topic_context)
raw = call_llm(eval_prompt, max_tokens=200)
try:
# Strip any markdown fences the model might add
clean = re.sub(r"```(?:json)?|```", "", raw).strip()
scores = json.loads(clean) # parse the JSON score object
except json.JSONDecodeError:
return fallback_scores() # if parse fails, treat as borderline
h = float(scores.get("helpfulness", 5))
s = float(scores.get("safety", 5))
r = float(scores.get("relevance", 5))
weighted = (
h * WEIGHTS["helpfulness"] +
s * WEIGHTS["safety"] +
r * WEIGHTS["relevance"]
)
if s < SAFETY_HARD_STOP:
verdict = "BLOCK" # always block unsafe replies
elif weighted >= PASS_THRESHOLD:
verdict = "PASS"
else:
verdict = "REVIEW" # flag for human review
return {
"helpfulness": h,
"safety": s,
"relevance": r,
"weighted_avg": round(weighted, 2),
"verdict": verdict
}
def fallback_scores():
"""Used when the LLM returns unparseable JSON."""
return {
"helpfulness": 5, "safety": 5, "relevance": 5,
"weighted_avg": 5.0, "verdict": "REVIEW" # safe default β flag for human review
}
{"helpfulness": 8.5, "safety": 9.0, "relevance": 7.5, "weighted_avg": 8.47, "verdict": "PASS"}
The evaluation prompt is the most important part. It must constrain the model to output only a valid JSON object β no preamble, no explanation.
def build_eval_prompt(question, reply, topic_context):
context_line = f"Learning context: {topic_context}\n" if topic_context else ""
return (
"You are a strict educational content evaluator.\n"
"Evaluate the assistant reply below on three axes, each scored 0β10.\n\n"
f"{context_line}"
"SCORING CRITERIA:\n"
"- helpfulness (0-10): Does the reply directly answer the question? "
"Is it clear, accurate, and educationally useful? "
"10 = perfectly answers; 0 = ignores the question.\n"
"- safety (0-10): Is the reply free from harmful content, misinformation, "
"offensive language, or inappropriate material? "
"10 = fully safe; 0 = dangerous or highly inappropriate.\n"
"- relevance (0-10): Does it stay on-topic for the learning context? "
"10 = perfectly on-topic; 0 = completely off-topic.\n\n"
f"Question: {question}\n\n"
f"Reply: {reply}\n\n"
"Respond with ONLY a JSON object β no explanation, no markdown fences:\n"
'{"helpfulness": X, "safety": X, "relevance": X}\n\n'
"### Assistant:\n" # model outputs only the JSON
)
{"helpfulness": 8, "safety": 9, "relevance": 7}
The gate function decides what the learner sees based on the verdict. It also handles retry logic for borderline replies:
MAX_RETRIES = 2 # regenerate up to 2 times for REVIEW verdicts
BLOCK_MSG = (
"I'm not able to provide a response to that question in this context. "
"Please rephrase or ask your instructor for guidance."
)
def gated_reply(memory, system, user_question, topic_context=""):
"""
Generates a reply, scores it, and applies gating.
Retries up to MAX_RETRIES times for REVIEW verdicts.
Returns (final_reply, score_dict).
"""
for attempt in range(MAX_RETRIES + 1): # attempt 0, 1, 2
prompt = memory.build_prompt(system)
reply = call_llm(prompt)
scores = score_reply(user_question, reply, topic_context)
if scores["verdict"] == "PASS":
log_score(user_question, reply, scores, "PASS")
return reply, scores # good reply β send to learner
if scores["verdict"] == "BLOCK":
log_score(user_question, reply, scores, "BLOCK")
return BLOCK_MSG, scores # unsafe β never reach learner
# REVIEW β log and retry if attempts remain
log_score(user_question, reply, scores, f"REVIEW (attempt {attempt+1})")
# All retries exhausted β send best attempt with a caveat
return reply + "\n\n(Note: This response is under review.)", scores
PASS β reply sent to learner immediately REVIEW β regenerate (up to 2x), then send with caveat if still borderline BLOCK β learner receives a fixed safe fallback message β raw reply never shown
8. Interactive: Score Explorer
Select a sample reply below to see how the alignment scorer would evaluate it across the three axes, and what verdict it would receive:
Alignment Scores
9. The Complete Teaching Pipeline
Now we assemble all three components β MemoryManager, score_reply(), and the gate logic β into a single TeachingPipeline class. This is the production-ready system:
class TeachingPipeline:
"""
Full teaching pipeline:
Memory (hybrid) β LLM β Alignment Scorer β Gate β Learner
"""
def __init__(self, system_instruction, topic_context=""):
self.system = system_instruction
self.topic = topic_context
self.memory = MemoryManager(summarise_after=12, keep_recent=6)
self.turn_count = 0
self.flag_log = [] # stores REVIEW and BLOCK events
def respond(self, user_message):
"""
Full pipeline: receive β remember β generate β score β gate β return.
Returns (reply_for_learner, score_dict).
"""
self.memory.add("user", user_message)
self.turn_count += 1
reply, scores = gated_reply(
self.memory, self.system, user_message, self.topic
)
if scores["verdict"] != "BLOCK":
self.memory.add("assistant", reply) # only store non-blocked replies
if scores["verdict"] in ("REVIEW", "BLOCK"):
self._log_flag(user_message, reply, scores)
return reply, scores
def _log_flag(self, question, reply, scores):
self.flag_log.append({
"turn": self.turn_count,
"question": question,
"reply": reply,
"scores": scores
})
def get_report(self):
"""Return a summary of this session's quality metrics."""
if not self.flag_log:
return {"status": "clean", "flags": 0}
return {
"status": "flagged",
"flags": len(self.flag_log),
"flag_turns": [f["turn"] for f in self.flag_log],
"details": self.flag_log
}
def reset(self):
self.memory.reset()
self.flag_log = []
self.turn_count = 0
# βββ Usage βββββββββββββββββββββββββββββββββββββββββββββββ
pipeline = TeachingPipeline(
system_instruction="You are a clear, accurate, and encouraging MBA teaching assistant.",
topic_context="Marketing management, consumer behaviour, brand strategy"
)
reply, scores = pipeline.respond("Explain the concept of brand equity.")
print(f"Reply: {reply}")
print(f"Scores: {scores}")
reply, scores = pipeline.respond("How does it relate to customer lifetime value?")
print(f"Reply: {reply}")
print(f"Report: {pipeline.get_report()}")
Reply: Brand equity is the value premium a brand generates over a generic equivalent...
Scores: {"helpfulness": 9.0, "safety": 9.5, "relevance": 8.5, "weighted_avg": 9.08, "verdict": "PASS"}
Reply: Customer lifetime value (CLV) links to brand equity in that strong brands...
Scores: {"helpfulness": 8.0, "safety": 9.0, "relevance": 8.0, "weighted_avg": 8.4, "verdict": "PASS"}
Report: {"status": "clean", "flags": 0}
β‘ Pipeline Design Principles
- Blocked replies are never added to memory β the conversation remains coherent
- The flag log is per-session β easy to extract for human review or fine-tuning data
get_report()gives instructors a quality snapshot without reading every replytopic_contextis separate from the system prompt β it feeds the relevance scorer specifically
10. Logging, Flagging & Review
A scoring system without a logging strategy is incomplete β you need somewhere to send flagged replies so humans can act on them. Here is a minimal but complete logger that writes to a JSON file:
import json, os
from datetime import datetime
LOG_FILE = "alignment_flags.json"
def log_score(question, reply, scores, verdict_label):
"""Append one scored event to the log file."""
entry = {
"timestamp": datetime.utcnow().isoformat(),
"verdict": verdict_label,
"scores": scores,
"question": question,
"reply": reply[:300] # truncate long replies to 300 chars
}
existing = []
if os.path.exists(LOG_FILE):
with open(LOG_FILE, "r") as f:
try:
existing = json.load(f)
except json.JSONDecodeError:
existing = [] # handle corrupt file gracefully
existing.append(entry)
with open(LOG_FILE, "w") as f:
json.dump(existing, f, indent=2) # pretty-print for easy reading
def review_flagged(min_severity="REVIEW"):
"""Read the log and return all entries at or above a severity level."""
if not os.path.exists(LOG_FILE):
return []
with open(LOG_FILE, "r") as f:
all_entries = json.load(f)
if min_severity == "BLOCK":
return [e for e in all_entries if e["verdict"].startswith("BLOCK")]
return all_entries # return everything for REVIEW
[
{
"timestamp": "2024-08-15T10:32:17",
"verdict": "BLOCK",
"scores": {"helpfulness": 2, "safety": 1, "relevance": 3, "weighted_avg": 1.9},
"question": "What shortcuts can I take in an exam?",
"reply": "Here are some ways to cheat without getting caught..."
}
]
Weekly review: Export all REVIEW entries. Have a subject-matter expert check if the replies were actually borderline or incorrectly flagged. This calibrates your PASS_THRESHOLD over time.
Fine-tuning dataset: BLOCK entries where the question was legitimate (scorer over-reacted) are valuable negative examples for instruction-tuning a safer model.
Threshold tuning: If more than 15% of replies are flagged REVIEW, your threshold is too strict. If fewer than 2% are flagged, it may be too loose. Aim for 5β10% REVIEW rate in a teaching context.
Instructor dashboard: In a Flask setup (Tutorial 2), add a GET /admin/flags route that reads the log and displays flagged turns for instructor review.
The JSON file log works fine for development and single-user testing. In production with many concurrent users, replace it with a lightweight database:
import sqlite3, json
from datetime import datetime
def log_score_db(question, reply, scores, verdict):
conn = sqlite3.connect("alignment.db")
conn.execute("""
CREATE TABLE IF NOT EXISTS flags (
id INTEGER PRIMARY KEY AUTOINCREMENT,
ts TEXT,
verdict TEXT,
help_score REAL, safety_score REAL, rel_score REAL,
weighted REAL,
question TEXT,
reply TEXT
)
""")
conn.execute("""
INSERT INTO flags VALUES (NULL,?,?,?,?,?,?,?,?)
""", (
datetime.utcnow().isoformat(), verdict,
scores["helpfulness"], scores["safety"], scores["relevance"],
scores["weighted_avg"], question, reply[:300]
))
conn.commit()
conn.close()
Zero setup β ships with Python. Handles concurrent reads well. Easy to query with SQL. Upgrade to PostgreSQL when you need multi-server scaling.
11. When to Use Each Strategy
| Scenario | Memory Strategy | Alignment Scoring? |
|---|---|---|
| Quick Q&A, single-turn use | None needed β stateless prompt | Optional |
| Short tutoring session (<10 turns) | Sliding Window | Recommended |
| Medium study session (10β30 turns) | Summarisation β compress after threshold | Recommended |
| Long teaching pipeline (30+ turns) | Hybrid (MemoryManager class) | Required |
| Public-facing learner product | Hybrid + persistent DB storage | Required + human review loop |
| Sensitive domains (health, law, finance) | Hybrid β no context loss acceptable | Required + SAFETY_HARD_STOP = 7.0+ |
β‘ Series Summary β All Three Tutorials
- Tutorial 1: Any LLM can be made conversational with a prompt builder wrapper β no fine-tuning needed
- Tutorial 2: Flask + sessions exposes that wrapper as a multi-user web API with a real chat UI
- Tutorial 3: Hybrid memory prevents context amnesia; alignment scoring gates every reply before learners see it
- The three tutorials compose into a single production teaching pipeline β stack them in order
12. Knowledge Check Quiz
Questions test practical reasoning about system design choices and code behaviour β not definitions.
Q1. A student has a 25-turn session. You use a sliding window of 10. The student mentioned their exam topic in turn 2. What happens when they ask a topic-specific question on turn 18?
Q2. In MemoryManager._maybe_summarise(), what does self.history = self.history[-self.keep_recent:] do?
Q3. The alignment scorer returns {"helpfulness": 9, "safety": 2, "relevance": 8} with SAFETY_HARD_STOP = 4.0. What verdict does score_reply() assign, and why?
Q4. The scorer's fallback_scores() function returns a weighted average of 5.0 with verdict "REVIEW". When does this function get called, and is this the right behaviour?
Q5. You add GET /admin/flags to your Flask app from Tutorial 2 to show the flag log. What one thing must you add to secure this route before going to production?
Q6. In gated_reply(), a blocked reply is returned but NOT added to memory via self.memory.add("assistant", reply). Why is this the correct design?