🎯 What You'll Learn
- Structure a Flask project that wraps a HuggingFace model with a conversational API
- Write a /chat POST endpoint that manages session-based memory
- Build a browser chat UI using vanilla HTML, CSS, and fetch(), with no frameworks
- Implement session isolation so multiple users don't share conversation history
- Identify and fix the three most common Flask + LLM integration bugs
Before You Begin
- Completed Tutorial 1: you understand the prompt builder and conversation wrapper
- Python 3.9+ installed with pip
- Basic Flask knowledge (routes, request, jsonify)
- A HuggingFace API token
1. What We're Building
In Tutorial 1 we built the conversation wrapper in pure Python: useful for scripts, but not usable by real users. Now we add the two missing layers: a Flask REST API that exposes the wrapper over HTTP, and a browser UI that talks to that API.
Figure: The chat UI you will have built by the end of this tutorial.
The complete architecture looks like this:
Full System Architecture
- Browser UI: HTML/CSS/JS chat window, sends fetch() POST requests
- Flask API: receives messages, manages session memory, calls HF
- Prompt Builder: from Tutorial 1, serialises history into one string
- HuggingFace Inference API: the actual LLM endpoint
- Flask Session: stores per-user conversation in a signed cookie
2. Project Structure
Keep everything in one folder. Flask's built-in template and static file serving makes this self-contained without any bundler or build step:
├── app.py               # Flask server + LLM wrapper
├── .env                 # secrets, never commit this
├── requirements.txt     # pip dependencies
└── templates/
    └── index.html       # the chat UI (Flask serves this)
Flask looks for HTML templates in a folder named templates/. When you call render_template("index.html"), Flask finds and renders it; no manual file path is needed.
3. Installing Dependencies
Create requirements.txt with exactly these packages; nothing more is needed:
flask==3.0.3 # web framework
requests==2.32.3 # HTTP calls to HuggingFace API
python-dotenv==1.0.1 # loads .env file into os.environ
pip install -r requirements.txt
Now create your .env file. This keeps your token out of source code:
HF_TOKEN=hf_your_actual_token_here # HuggingFace API token
HF_MODEL_URL=https://api-inference.huggingface.co/models/YOUR_MODEL
FLASK_SECRET_KEY=some-long-random-string # used to sign session cookies
Add .env to your .gitignore, and never push API tokens to GitHub.
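A missing variable tends to surface later as a confusing runtime error (Flask, for instance, refuses to use sessions when no secret key is set). The sketch below checks the three names from the .env file up front. `missing_settings` and `REQUIRED_SETTINGS` are hypothetical names introduced for illustration; they are not part of Flask or python-dotenv.

```python
import os

# The three variable names used in the .env file above.
REQUIRED_SETTINGS = ("HF_TOKEN", "HF_MODEL_URL", "FLASK_SECRET_KEY")


def missing_settings(env=None):
    """Return the required setting names that are absent or blank in `env`."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_SETTINGS if not str(env.get(name, "")).strip()]


# Example with a hand-built mapping instead of the real environment.
example_env = {"HF_TOKEN": "hf_example", "HF_MODEL_URL": ""}
print(missing_settings(example_env))  # ['HF_MODEL_URL', 'FLASK_SECRET_KEY']
```

One way to use it would be to call `missing_settings()` right after `load_dotenv()` and stop with a clear message if the list is not empty.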
4. The Flask Backend โ app.py
This is the core of the system. The file has four responsibilities: load config, define the conversation logic, expose three routes (/, /chat, and /reset), and handle errors gracefully.
import os
import requests
from flask import Flask, request, jsonify, session, render_template
from dotenv import load_dotenv
load_dotenv() # reads .env into os.environ
app = Flask(__name__)
app.secret_key = os.getenv("FLASK_SECRET_KEY") # required to use Flask sessions
HF_URL = os.getenv("HF_MODEL_URL")
HF_HEADERS = {"Authorization": f"Bearer {os.getenv('HF_TOKEN')}"}
MAX_TURNS = 10  # sliding window: keep only recent history
SYSTEM = (
    "You are a helpful, clear, and concise educational assistant. "
    "Answer accurately. Never provide harmful information."
)
# --- PROMPT BUILDER ---
def build_prompt(history):
    """Serialise conversation history into a single prompt string."""
    prompt = SYSTEM + "\n\n"
    for turn in history[-MAX_TURNS:]:  # sliding window applied here
        role = turn["role"].capitalize()
        prompt += f"### {role}:\n{turn['content']}\n\n"
    prompt += "### Assistant:\n"  # open for model to complete
    return prompt
# --- ROUTES ---
@app.route("/")
def index():
    session.setdefault("history", [])  # init history if new session
    return render_template("index.html")
@app.route("/chat", methods=["POST"])
def chat():
    data = request.get_json(silent=True) or {}  # tolerate a missing or non-JSON body
    user_input = data.get("message", "").strip()
    if not user_input:
        return jsonify({"error": "Empty message"}), 400

    history = session.get("history", [])
    history.append({"role": "user", "content": user_input})  # add user turn
    prompt = build_prompt(history)

    try:
        resp = requests.post(
            HF_URL, headers=HF_HEADERS,
            json={"inputs": prompt, "parameters": {"max_new_tokens": 200}},
            timeout=30,  # don't hang forever on slow models
        )
        resp.raise_for_status()
        raw = resp.json()[0]["generated_text"]
        reply = raw.split("### Assistant:")[-1].strip()  # extract only assistant reply
        reply = reply.split("### User:")[0].strip()  # stop if model hallucinates next turn
    except Exception as e:  # network failure, HTTP error, or unexpected response shape
        return jsonify({"error": str(e)}), 500

    history.append({"role": "assistant", "content": reply})  # add assistant turn
    session["history"] = history  # persist back to session
    session.modified = True  # tell Flask the session changed
    return jsonify({"reply": reply})
@app.route("/reset", methods=["POST"])
def reset():
    session["history"] = []  # wipe conversation for this user
    return jsonify({"status": "reset"})


if __name__ == "__main__":
    app.run(debug=True, port=5000)
Loads config → builds prompts → exposes /chat POST endpoint → stores history in session cookie → returns JSON reply
| Method | Route | What It Does | Returns |
|---|---|---|---|
| GET | / | Serves the chat UI HTML page. Initialises an empty history in the session if not present. | HTML page |
| POST | /chat | Receives {"message": "..."}, builds prompt, calls HF, parses reply, updates session, returns reply. | {"reply": "..."} |
| POST | /reset | Clears this user's conversation history from the session. The user starts fresh. | {"status": "reset"} |
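The /chat contract in the table can be enforced before any model call is made. The following is a minimal sketch, assuming the decoded JSON body is passed in as `data` (for example, the result of `request.get_json(silent=True)`); `parse_chat_payload` is a hypothetical helper, not something Flask provides.

```python
def parse_chat_payload(data):
    """Validate the decoded JSON body of POST /chat.

    Returns (message, None) on success or (None, error_text) on failure.
    """
    if not isinstance(data, dict):
        return None, "Body must be a JSON object"
    message = data.get("message", "")
    if not isinstance(message, str) or not message.strip():
        return None, "Empty message"
    return message.strip(), None


print(parse_chat_payload({"message": "  What is inflation?  "}))  # ('What is inflation?', None)
print(parse_chat_payload({"message": "   "}))                     # (None, 'Empty message')
print(parse_chat_payload(None))                                   # (None, 'Body must be a JSON object')
```

Checking the type first means a JSON array or a numeric "message" produces a clean 400-style error instead of an unhandled exception.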
The prompt builder in this Flask version uses the ### delimiter format from Tutorial 1's production template, which reduces role confusion in the model's output:
def build_prompt(history):
    prompt = SYSTEM + "\n\n"  # always prepend system instruction
    for turn in history[-MAX_TURNS:]:  # sliding window: last 10 turns only
        role = turn["role"].capitalize()
        prompt += f"### {role}:\n{turn['content']}\n\n"  # strong delimiters per role
    prompt += "### Assistant:\n"  # cue for model to continue
    return prompt
You are a helpful, clear, and concise educational assistant...

### User:
What is supply chain disruption?

### Assistant:
Supply chain disruption refers to...

### User:
Give me a real-world example.

### Assistant:
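To see what the sliding window actually keeps, the sketch below repeats the prompt builder (with a shortened SYSTEM string so it runs on its own) and simulates the state at the 15th user message. The question and answer texts are made-up placeholders.

```python
SYSTEM = "You are a helpful, clear, and concise educational assistant."  # shortened for the demo
MAX_TURNS = 10  # counts individual messages, not user/assistant pairs


def build_prompt(history):
    """Same logic as the tutorial's prompt builder, repeated so this runs alone."""
    prompt = SYSTEM + "\n\n"
    for turn in history[-MAX_TURNS:]:
        prompt += f"### {turn['role'].capitalize()}:\n{turn['content']}\n\n"
    return prompt + "### Assistant:\n"


# State at the 15th user message: 14 full exchanges plus one new question.
history = []
for n in range(1, 15):
    history.append({"role": "user", "content": f"question {n}"})
    history.append({"role": "assistant", "content": f"answer {n}"})
history.append({"role": "user", "content": "question 15"})

window = history[-MAX_TURNS:]
prompt = build_prompt(history)
print(len(history), len(window))                         # 29 10
print(window[0])                                         # {'role': 'assistant', 'content': 'answer 10'}
print("question 10" in prompt, "answer 10" in prompt)    # False True
```

Because MAX_TURNS counts messages rather than exchanges, the window here opens with an assistant turn whose question has already been dropped. If that matters for your model, one option is to trim to an even number of messages before the newest user turn.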
5. Session-Based Memory
In Tutorial 1 we stored the conversation list in a plain Python variable. That works for a single-user script, but a Flask server handles many users at once. A shared Python variable would mix up everyone's conversations.
Flask's session solves this. It stores data in a signed cookie sent to the browser, so each user carries their own history.
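To make "signed cookie" concrete, here is a minimal sketch of the idea using only the standard library. It is not Flask's implementation (Flask delegates signing to the itsdangerous package); `sign`, `verify`, and `SECRET_KEY` are illustrative names chosen for this example.

```python
import base64
import hashlib
import hmac
import json

# Assumption: stands in for the value of FLASK_SECRET_KEY.
SECRET_KEY = b"example-secret-for-illustration-only"


def sign(data):
    """Encode `data` as base64 JSON and append an HMAC-SHA256 signature."""
    payload = base64.urlsafe_b64encode(json.dumps(data).encode("utf-8"))
    signature = hmac.new(SECRET_KEY, payload, hashlib.sha256).hexdigest().encode("ascii")
    return payload + b"." + signature


def verify(cookie):
    """Return the decoded data if the signature matches, otherwise None."""
    payload, _, signature = cookie.rpartition(b".")
    expected = hmac.new(SECRET_KEY, payload, hashlib.sha256).hexdigest().encode("ascii")
    if not hmac.compare_digest(signature, expected):
        return None
    return json.loads(base64.urlsafe_b64decode(payload))


cookie = sign({"history": [{"role": "user", "content": "What is inflation?"}]})
print(verify(cookie))         # the original dict
print(verify(b"x" + cookie))  # None: the payload changed, so the signature no longer matches
```

The payload is only encoded, not encrypted, so anyone holding the cookie can read the history. The signature only prevents edits from going undetected.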
User sends a message (POST /chat)
Browser sends {"message": "What is inflation?"} plus its session cookie.
Flask reads the session
history = session.get("history", []) retrieves this user's conversation list, not anyone else's.
History is updated and saved
After getting the reply, the new turns are appended and session["history"] = history writes them back into the cookie.
Flask sends back the signed cookie
The updated session cookie is included in the HTTP response. The browser stores it and sends it with every future request.
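Flask's default session lives entirely in the cookie, and browsers cap cookies at roughly 4 KB, so a long conversation will eventually overflow it. For production, replace the cookie-based session with a server-side store like Redis or a database, keyed by a session ID.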
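A server-side store keyed by a session ID could look roughly like the sketch below. It uses an in-memory dict as a stand-in for Redis or a database table; `HistoryStore` and its methods are hypothetical names, and a real deployment would need persistence and expiry on top of this.

```python
import uuid


class HistoryStore:
    """In-memory stand-in for Redis or a database table keyed by session ID."""

    def __init__(self):
        self._histories = {}

    def new_session_id(self):
        """Create an unguessable ID; only this small value would go in the cookie."""
        session_id = uuid.uuid4().hex
        self._histories[session_id] = []
        return session_id

    def append(self, session_id, role, content):
        """Add one turn to the history stored under `session_id`."""
        self._histories.setdefault(session_id, []).append({"role": role, "content": content})

    def get(self, session_id):
        """Return a copy so callers cannot mutate stored history by accident."""
        return list(self._histories.get(session_id, []))

    def reset(self, session_id):
        """Wipe the conversation for one session only."""
        self._histories[session_id] = []


store = HistoryStore()
alice, bob = store.new_session_id(), store.new_session_id()
store.append(alice, "user", "What is inflation?")
print(len(store.get(alice)), len(store.get(bob)))  # 1 0
```

A plain dict is lost on restart and is not shared between worker processes, which is the reason the text points to Redis or a database for anything beyond local experiments.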
history = session.get("history", []) # read this user's history (or empty list)
history.append({"role": "user", "content": user_input})
# ... build prompt, call API, parse reply ...
history.append({"role": "assistant", "content": reply})
session["history"] = history # write back; Flask signs it and sends it as a cookie
session.modified = True # required when mutating mutable objects
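Flask auto-detects a session change only when you assign to a session key, as in session["history"] = history. If you only mutate a list in place (append), Flask does not notice, and you must set session.modified = True manually; otherwise it won't re-send the updated cookie. Setting the flag after every update, as the code above does, is a cheap safeguard.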
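The detection rule is easier to remember once you see why it exists. The sketch below uses a hypothetical `TrackingDict`, a simplified stand-in for Flask's session object (the real class is more involved), which can only observe assignments made through its own `__setitem__`.

```python
class TrackingDict(dict):
    """Simplified stand-in for Flask's session: flags only top-level assignments."""

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.modified = False

    def __setitem__(self, key, value):
        super().__setitem__(key, value)
        self.modified = True  # fires on session[key] = value


session = TrackingDict(history=[])

# In-place mutation: the list changes, but the dict never sees an assignment.
session["history"].append({"role": "user", "content": "hi"})
print(session.modified)  # False

# Top-level assignment: __setitem__ runs and the flag is set.
session["history"] = session["history"]
print(session.modified)  # True
```

The append goes straight to the list object, so the containing dict has no hook that could notice it. That is the gap `session.modified = True` closes.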
6. The Chat UI โ templates/index.html
The frontend is a single HTML file with embedded CSS and JavaScript. No React, no Vue, no build step. It sends a fetch() POST to /chat on every message and renders the reply.
<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="UTF-8">
<title>aplly Chat</title>
<style>
* { box-sizing: border-box; margin: 0; padding: 0; }
body { font-family: sans-serif; background: #F3F4F6; display: flex;
flex-direction: column; height: 100vh; }
#topbar { background: #5B21B6; color: #fff; padding: .8rem 1.2rem;
font-weight: 700; font-size: 1rem; }
#messages { flex: 1; overflow-y: auto; padding: 1rem; display: flex;
flex-direction: column; gap: .7rem; }
.msg { max-width: 72%; padding: .6rem 1rem; border-radius: 12px;
font-size: .92rem; line-height: 1.55; }
.user { background: #5B21B6; color: #fff; align-self: flex-end;
border-bottom-right-radius: 3px; }
.bot { background: #fff; color: #1F1F2E; align-self: flex-start;
border: 1px solid #E5E7EB; border-bottom-left-radius: 3px; }
.thinking { color: #6B7280; font-style: italic; } /* loading state */
#input-bar { display: flex; padding: .8rem; gap: .6rem;
background: #fff; border-top: 1px solid #E5E7EB; }
#msg-input { flex: 1; padding: .55rem .9rem; border: 1.5px solid #E5E7EB;
border-radius: 8px; font-size: .92rem; outline: none; }
#msg-input:focus { border-color: #7C3AED; }
#send-btn { background: #5B21B6; color: #fff; border: none; padding: .55rem 1.2rem;
border-radius: 8px; font-weight: 700; cursor: pointer; }
#send-btn:hover { background: #7C3AED; }
#reset-btn { background: #F3F4F6; border: 1px solid #E5E7EB; padding: .55rem .9rem;
border-radius: 8px; cursor: pointer; font-size: .82rem; color: #6B7280; }
#reset-btn:hover { background: #EDE9FE; color: #5B21B6; }
</style>
</head>
<body>
<div id="topbar">🤖 aplly Assistant</div>
<div id="messages">
<div class="msg bot">Hello! Ask me anything.</div> <!-- initial greeting -->
</div>
<div id="input-bar">
<input id="msg-input" type="text" placeholder="Type a message..." />
<button id="reset-btn" onclick="resetChat()">New Chat</button>
<button id="send-btn" onclick="sendMessage()">Send</button>
</div>
<script>
const msgBox = document.getElementById("messages");
const msgInput = document.getElementById("msg-input");
// Send on Enter key
msgInput.addEventListener("keydown", e => {
if (e.key === "Enter") sendMessage();
});
function appendMsg(text, cls) {
const div = document.createElement("div");
div.className = "msg " + cls;
div.textContent = text;
msgBox.appendChild(div);
msgBox.scrollTop = msgBox.scrollHeight; // auto-scroll to latest
return div;
}
async function sendMessage() {
const text = msgInput.value.trim();
if (!text) return;
appendMsg(text, "user"); // show user message immediately
msgInput.value = "";
const thinking = appendMsg("Thinking...", "bot thinking"); // loading indicator
try {
const res = await fetch("/chat", {
method: "POST",
headers: { "Content-Type": "application/json" },
body: JSON.stringify({ message: text })
});
const data = await res.json();
thinking.remove(); // remove loading bubble
appendMsg(data.reply || data.error, "bot");
} catch (err) {
thinking.remove();
appendMsg("Connection error. Is the server running?", "bot");
}
}
async function resetChat() {
await fetch("/reset", { method: "POST" });
msgBox.innerHTML = '<div class="msg bot">Chat reset. Start fresh!</div>';
}
</script>
</body>
</html>
- appendMsg(): reusable bubble creator for both user and bot messages
- "Thinking..." bubble: shown while awaiting the API, removed on response
- scrollTop = scrollHeight: always auto-scrolls to the latest message
- Enter key listener: sends without clicking the button
7. Wiring Frontend to Backend
Let's trace one complete message through the entire system so every piece is clear:
User types & hits Send
sendMessage() is called. The text is appended to the UI as a user bubble and the input is cleared.
fetch() POSTs to /chat
The browser sends POST /chat with body {"message": "your text"} and the session cookie automatically attached.
Flask reads session, builds prompt
The server retrieves this user's history, appends the new message, and calls build_prompt() to serialise everything.
HuggingFace API returns a completion
The full serialised prompt is sent to the HF model. The model returns the entire prompt text plus its new reply appended at the end.
Flask parses & stores the reply
The reply is extracted by splitting on ### Assistant:. Both new turns are appended to history and saved back to the session.
JSON response reaches the browser
{"reply": "..."} arrives. The "Thinking..." bubble is removed and the actual reply is rendered as a bot message bubble.
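The server-side part of this trace (build the prompt, call the model, parse and store the reply) can be exercised without Flask or a network connection. In the sketch below, `fake_model` is a stub standing in for the HuggingFace call, and `handle_message` is a hypothetical helper that mirrors what the /chat route does with the history list. The system line is omitted to keep the example short.

```python
def build_prompt(history):
    """Minimal version of the tutorial's prompt builder (system line omitted)."""
    prompt = ""
    for turn in history:
        prompt += f"### {turn['role'].capitalize()}:\n{turn['content']}\n\n"
    return prompt + "### Assistant:\n"


def fake_model(prompt):
    """Stub for the model call: echoes the prompt, then a reply and an invented turn."""
    return prompt + "This is a canned reply.\n\n### User:\nAn invented follow-up"


def handle_message(history, user_input, generate):
    """Build the prompt, call the model, then parse and store the reply."""
    history = history + [{"role": "user", "content": user_input}]  # new list, input untouched
    raw = generate(build_prompt(history))
    reply = raw.split("### Assistant:")[-1].split("### User:")[0].strip()
    history.append({"role": "assistant", "content": reply})
    return reply, history


reply, history = handle_message([], "What is inflation?", fake_model)
print(reply)         # This is a canned reply.
print(len(history))  # 2
```

Passing the model call in as a function (`generate`) is a design choice for this sketch: it lets the same logic run against a stub in tests and against the real API in the route.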
8. Running & Testing
Start the server and open the browser:
cd flask-llm-chat
python app.py
 * Running on http://127.0.0.1:5000
 * Debug mode: on
Open http://127.0.0.1:5000 in your browser. You can also test the API directly with curl:
curl -X POST http://127.0.0.1:5000/chat \
-H "Content-Type: application/json" \
-d '{"message": "What is inflation?"}' # test without the browser UI
{"reply": "Inflation is the rate at which the general level of prices for goods and services rises..."}
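To check memory, send a follow-up message in the browser and confirm the reply builds on the earlier exchange. Then call /reset and send another; confirm it no longer references the earlier exchange.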
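One caveat when testing memory from the command line: a plain curl call like the one above does not keep cookies between invocations, so each request starts a new session unless you add a cookie jar. The sketch below shows a cookie-aware client built from the standard library. `make_client`, `chat_request`, and `ask` are hypothetical helper names, and the timeout value is an arbitrary choice.

```python
import json
import urllib.request
from http.cookiejar import CookieJar

BASE_URL = "http://127.0.0.1:5000"  # the address printed by `python app.py`


def make_client():
    """Return an opener that stores and resends cookies, like a browser tab."""
    jar = CookieJar()
    opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))
    return opener, jar


def chat_request(message):
    """Build (but do not send) the POST /chat request for `message`."""
    body = json.dumps({"message": message}).encode("utf-8")
    return urllib.request.Request(
        BASE_URL + "/chat",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(opener, message):
    """Send one message through `opener` and return the decoded JSON reply."""
    with opener.open(chat_request(message), timeout=60) as response:
        return json.loads(response.read().decode("utf-8"))


# With the server running, two calls through the same opener share one session:
#   opener, jar = make_client()
#   ask(opener, "What is inflation?")
#   ask(opener, "Give me a real-world example.")
```

Creating a second opener with `make_client()` gives you a second, independent session, which is a quick way to confirm that two users do not see each other's history.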
9. Common Bugs & Fixes
These are the three bugs you'll almost certainly hit the first time you run this setup:
# BUG: history updates vanish after one turn
history = session.get("history", [])
history.append({"role": "assistant", "content": reply})
# Missing: nothing tells Flask the session changed. An in-place append on the
# stored list is invisible to Flask, so the updated cookie is never re-sent.
# FIX:
session["history"] = history  # assign the key so Flask sees the change
session.modified = True  # always add this when mutating mutable objects
Model has no memory of previous turns: every reply seems to come from a fresh session.
# BUG: reply contains the full history, not just the new answer
raw = resp.json()[0]["generated_text"]
reply = raw  # raw includes the entire prompt + completion
# FIX: split on the last assistant delimiter
reply = raw.split("### Assistant:")[-1].strip()  # take only what comes after the last marker
reply = reply.split("### User:")[0].strip()  # also stop if the model hallucinates the next user turn
The reply bubble contains the full serialised conversation history, not just the new answer.
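The two-step split is easy to check in isolation. The sketch below wraps it in a hypothetical `extract_reply` helper and runs it against three shapes of raw output: prompt echoed, no prompt echoed, and an invented follow-up turn. The sample strings are made up for the test.

```python
ASSISTANT_MARK = "### Assistant:"
USER_MARK = "### User:"


def extract_reply(raw):
    """Return only the newest assistant answer from the model's raw output."""
    after_last_marker = raw.split(ASSISTANT_MARK)[-1]
    return after_last_marker.split(USER_MARK)[0].strip()


echoed = "### User:\nHi\n\n### Assistant:\nHello!\n\n### User:\nBye\n\n### Assistant:\nGoodbye!"
print(extract_reply(echoed))                                   # Goodbye!
print(extract_reply("Goodbye!"))                               # Goodbye! (no prompt echoed)
print(extract_reply(echoed + "\n\n### User:\nInvented turn"))  # Goodbye!
```

One limitation to keep in mind: a genuine answer that happens to contain the literal text "### User:" would be cut off at that point.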
# BUG: browser blocks fetch() if frontend is on a different origin
# (e.g., frontend on port 3000, Flask on port 5000)
# FIX: install flask-cors and enable it
pip install flask-cors
from flask_cors import CORS
app = Flask(__name__)
CORS(app, supports_credentials=True)  # supports_credentials is needed for session cookies
Browser console: "Access to fetch at 'http://127.0.0.1:5000/chat' from origin 'http://localhost:3000' has been blocked by CORS policy"
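It can help to know what the browser is looking for when it makes that decision. The sketch below is not flask-cors code; it is a hypothetical `cors_headers` function that shows the response headers a credentialed cross-origin request needs, assuming the separate frontend is served from http://localhost:3000.

```python
# Assumption: the origin where the separate frontend is served.
ALLOWED_ORIGINS = {"http://localhost:3000"}


def cors_headers(request_origin):
    """Return the response headers a credentialed cross-origin request needs."""
    if request_origin not in ALLOWED_ORIGINS:
        return {}  # no CORS headers: the browser blocks the response
    return {
        # With credentials, the exact origin is required; browsers reject "*".
        "Access-Control-Allow-Origin": request_origin,
        "Access-Control-Allow-Credentials": "true",
        "Vary": "Origin",
    }


print(cors_headers("http://localhost:3000"))
print(cors_headers("http://evil.example"))  # {}
```

On the browser side, a cross-origin fetch() sends cookies only when it is called with credentials: "include", so the frontend call needs that option in addition to the server-side change.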
⚡ Key Takeaways
- Always set session.modified = True after mutating a mutable session value
- Always split on your role delimiter to extract only the new reply from the model's output
- For separate frontend/backend deployments, add flask-cors with supports_credentials=True
- Set a timeout on all HuggingFace requests; without one, a slow model can block the request indefinitely
11. Knowledge Check Quiz
Questions focus on practical code behaviour: what happens, why it breaks, and what the fix is.
Q1. You append a new turn directly to the list stored in session["history"], but on the next request the turn is gone. What's most likely missing?
Q2. The model's reply in the JSON response contains the full serialised prompt plus the new answer. Which line of code extracts only the new assistant reply?
Q3. Two users are chatting simultaneously. User A's messages start appearing in User B's conversation. What caused this?
Q4. The user types a message and hits Send. The "Thinking..." bubble appears but never goes away. What is most likely wrong?
Q5. You deploy the Flask API on port 5000 and the frontend separately on port 3000. The browser throws a CORS error when fetch() is called. What is the minimal fix in app.py?
Q6. Predict what happens: the user sends 15 messages in a row, but MAX_TURNS = 10. What does the model receive on the 15th call?