Situation & Goals
A small company wants to build an ad bot. Their knowledge base: ~1,000 ads stored as .txt and .md files in a folder called out_going/.
The bot needs to understand ads semantically, not just keyword-match them. That means converting every ad into a numerical vector (embedding) that captures meaning. These embeddings will later power similarity search, clustering, recommendation, and retrieval.
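To make "closeness in meaning becomes closeness between vectors" concrete, here is a minimal sketch using hand-made 3-dimensional vectors. The numbers are invented for illustration and are not real model output; real MiniLM embeddings have 384 dimensions. The sketch ranks by cosine similarity, whereas the pipeline in this article ranks by L2 distance; both measure geometric closeness.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical toy "embeddings" (made-up numbers, 3 dims instead of 384)
toy_embeddings = {
    "cheap laptop deals":   [0.9, 0.1, 0.0],
    "affordable notebooks": [0.8, 0.2, 0.1],
    "fresh pizza delivery": [0.0, 0.2, 0.9],
}

# Rank every ad by similarity to the first one
query = toy_embeddings["cheap laptop deals"]
ranked = sorted(
    toy_embeddings,
    key=lambda name: cosine_similarity(query, toy_embeddings[name]),
    reverse=True,
)
print(ranked)
```

In this toy data the two laptop-related texts point in nearly the same direction, so they rank next to each other, while the pizza ad lands last.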
What You Will Learn
- What embeddings are and why they matter
- How FAISS stores and searches vectors on CPU
- How to build an incremental indexing pipeline
- How to persist and reload a FAISS index
- How to visualise ad clusters with PCA
- How to schedule the script on Windows
Prerequisites
- Python 3.8 or higher installed
- Comfortable with for loops and functions
- Basic understanding of lists and dicts
- Know how to run .py files from the terminal
- No ML or AI background required
Prerequisites & Installation
Install all required packages with a single pip command. Everything runs on CPU, so no CUDA setup is needed.
```
pip install sentence-transformers faiss-cpu matplotlib numpy scikit-learn
```
| Package | Why we need it |
|---|---|
| sentence-transformers | Loads the MiniLM model and converts text into float vectors |
| faiss-cpu | Stores, saves, and searches vectors efficiently. This is the CPU build; no GPU required |
| matplotlib | Draws the 2D scatter plot of ad clusters |
| numpy | Required by FAISS and SentenceTransformers for array operations |
| scikit-learn | PCA: reduces 384D vectors to 2D for the diagnostic plot |
If faiss-cpu fails to install, try pip install faiss-cpu --no-cache-dir. On Python 3.12+ you may need pip install faiss-cpu==1.8.0.
Step 1: Configuration & Data Simulation
Define the three file paths the pipeline uses, then generate 1,000 dummy ad files that simulate the real out_going/ folder. In production you skip the generation and point at your real folder.
```python
import os, glob, json
import numpy as np
import faiss
import matplotlib.pyplot as plt
from sentence_transformers import SentenceTransformer
from sklearn.decomposition import PCA

# -- The only 3 constants you ever need to change --
FOLDER  = "out_going"         # folder holding your ad files
INDEX_F = "ad_bot.index"      # persisted FAISS binary on disk
META_F  = "ad_metadata.json"  # state tracker: dict of processed filenames

os.makedirs(FOLDER, exist_ok=True)

def create_dummy_ads():
    categories = ["Tech", "Fashion", "Food", "Cars", "Travel"]
    templates = [
        "Buy our {cat} product! It is {adj}.",
        "Discover the best {cat} deals. {adj} offer today.",
        "Top {cat} service. Very {adj}.",
    ]
    adjectives = ["affordable", "reliable", "fast", "stylish"]
    for i in range(1000):
        cat  = categories[i % len(categories)]
        adj  = adjectives[i % len(adjectives)]
        body = templates[i % len(templates)].format(cat=cat, adj=adj)
        ext  = ".md" if i % 2 == 0 else ".txt"   # even i -> .md, odd i -> .txt
        with open(f"{FOLDER}/ad_{i:04d}{ext}", "w") as f:
            f.write(f"# Ad {i}\nCategory: {cat}\n{body}")
    # Summary lines matching the expected output shown below
    print(f"Created 1000 ad files in '{FOLDER}/'")
    print("500 x .txt | 500 x .md")
```
| Line / expression | What it does and why |
|---|---|
| FOLDER / INDEX_F / META_F | Three constants at the top. Changing these is all you need to adapt the script to a different project. Never hard-code paths inside functions. |
| os.makedirs(exist_ok=True) | Creates the folder if missing. exist_ok=True prevents an error if it already exists, which makes it safe for repeated scheduled runs. |
| i % len(categories) | Cycles through the list endlessly using modulo. 1,000 files ÷ 5 categories = 200 files per category. |
| f"ad_{i:04d}{ext}" | :04d zero-pads to 4 digits, giving ad_0001.txt, ad_0042.md. Files sort correctly in any file explorer. |
Expected terminal output:
```
Created 1000 ad files in 'out_going/'
500 x .txt | 500 x .md
```
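The modulo arithmetic in create_dummy_ads() can be checked without touching the disk. The sketch below replays only the index arithmetic from that function and counts the results. The helper name plan_dummy_ads is made up for this illustration.

```python
from collections import Counter

CATEGORIES = ["Tech", "Fashion", "Food", "Cars", "Travel"]

def plan_dummy_ads(n=1000):
    """Return (filename, category) pairs using the same modulo rules as create_dummy_ads()."""
    plan = []
    for i in range(n):
        cat = CATEGORIES[i % len(CATEGORIES)]
        ext = ".md" if i % 2 == 0 else ".txt"
        plan.append((f"ad_{i:04d}{ext}", cat))
    return plan

plan = plan_dummy_ads()
per_category = Counter(cat for _, cat in plan)
per_extension = Counter(name.rsplit(".", 1)[1] for name, _ in plan)

print(per_category)             # 200 files for each of the 5 categories
print(per_extension)            # 500 md and 500 txt
print(plan[1][0], plan[42][0])  # ad_0001.txt ad_0042.md
```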
Step 2: Load the Model & FAISS Index
Load (or create) the embedding model and the FAISS index at the start of every run. This "load or create" pattern is the foundation of incremental indexing.
```python
# 1. Load a lightweight, CPU-optimised embedding model
model = SentenceTransformer('all-MiniLM-L6-v2')
dim = model.get_sentence_embedding_dimension()  # returns 384

# 2. Load existing index + metadata, or start fresh
if os.path.exists(INDEX_F) and os.path.exists(META_F):
    print("Loading existing index...")
    index = faiss.read_index(INDEX_F)
    with open(META_F) as f:
        processed = json.load(f)
else:
    print("No index found. Creating IndexFlatL2...")
    index = faiss.IndexFlatL2(dim)
    processed = {}

print(f"Index size: {index.ntotal} vectors | Metadata: {len(processed)} entries")
```
| Line / expression | What it does and why |
|---|---|
| SentenceTransformer('all-MiniLM-L6-v2') | Downloads MiniLM once (roughly 80-90 MB on disk; the network has about 22 million parameters) and loads it. The first run fetches from the internet; all subsequent runs load from the local cache, so scheduled tasks work offline. |
| get_sentence_embedding_dimension() | Returns 384. This must match your FAISS index dimension exactly. If you switch models, you must rebuild the index from scratch. |
| faiss.read_index(INDEX_F) | Loads the full binary index from disk, restoring all previously indexed vectors into memory in milliseconds. |
| faiss.IndexFlatL2(dim) | Creates an exact L2 (Euclidean) distance index. "Flat" = vectors stored as-is, no compression, no approximation. Sub-millisecond search for 1,000 ads. |
| index.ntotal | The number of vectors currently stored. Zero on first run, 1,000 after full indexing. |
Expected terminal output (first run):
```
No index found. Creating IndexFlatL2...
Index size: 0 vectors | Metadata: 0 entries
```
Expected terminal output (subsequent runs):
```
Loading existing index...
Index size: 1000 vectors | Metadata: 1000 entries
```
faiss-cpu is perfectly fast for 1,000 to 100,000 vectors. GPU FAISS only starts showing an advantage at millions of vectors, and it requires CUDA. CPU is the right choice here.
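As a mental model for what a "flat" index does, the sketch below implements exact nearest-neighbour search in plain Python: compare the query against every stored vector and keep the k closest. It is a stand-in for illustration only, not FAISS code. The function name flat_l2_search and the toy 2-D vectors are assumptions. Like IndexFlatL2, it ranks by squared L2 distance.

```python
def flat_l2_search(stored, query, k):
    """Exact k-nearest-neighbour search by squared L2 distance (brute force)."""
    scored = []
    for vec_id, vec in enumerate(stored):  # IDs are simply insertion positions
        dist = sum((a - b) ** 2 for a, b in zip(vec, query))
        scored.append((dist, vec_id))
    scored.sort()                          # smallest distance first
    return scored[:k]

# Toy 2-D "index" (made-up values; real vectors here have 384 dims)
stored = [
    [0.0, 0.0],  # id 0
    [1.0, 0.0],  # id 1
    [0.0, 3.0],  # id 2
    [1.0, 1.0],  # id 3
]

results = flat_l2_search(stored, query=[0.9, 0.2], k=2)
for dist, vec_id in results:
    print(vec_id, round(dist, 2))
```

Every query touches every stored vector, which is why a flat index needs no training and returns exact results, and also why its cost grows linearly with the number of vectors.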
Step 3: The Incremental Logic
The entire "incremental" mechanism is just a set subtraction: all files - already processed = new files. If nothing is new, the script exits cleanly without burning CPU.
```python
# 1. Collect every .txt and .md currently in the folder
all_files = (
    glob.glob(os.path.join(FOLDER, "*.txt")) +
    glob.glob(os.path.join(FOLDER, "*.md"))
)

# 2. Keep only files NOT already in our metadata dict: THE incremental step
new_files = [
    f for f in all_files
    if os.path.basename(f) not in processed
]

print(f"Total: {len(all_files)} | Processed: {len(processed)} | New: {len(new_files)}")

# 3. Early-exit if nothing to do
if not new_files:
    print("Nothing new. Index is up to date.")
    exit(0)  # exit code 0 = success for Windows Task Scheduler
```
| Line / expression | What it does and why |
|---|---|
| glob.glob("*.txt") + glob.glob("*.md") | Scans the folder for matching filenames. os.path.join() makes the path work on both Windows and Linux without changes. |
| os.path.basename(f) not in processed | Extracts just the filename (not the full path) and checks it against the metadata dict. Dict key lookup is O(1) on average, so it stays fast even with 100,000 entries. |
| [f for f in all_files if ...] | List comprehension filtering. Produces only the files the script has not seen before. Equivalent to a filtered loop but more concise. |
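| exit(0) | Exit code 0 = success. Windows Task Scheduler marks the run as successful. A non-zero code (or an unhandled exception) would appear as a failure in the task history. In standalone scripts, sys.exit(0) is the more robust spelling, because the bare exit() helper is injected by the site module and is intended for interactive use. |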
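Expected terminal output (50 new files found):
```
Total: 1050 | Processed: 1000 | New: 50
```
Expected terminal output (nothing new; printed after the Total line):
```
Nothing new. Index is up to date.
```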
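This pipeline does not handle deletion. IndexFlatL2 identifies vectors by position, and removing vectors with remove_ids() shifts the IDs of the vectors that follow, which would invalidate the faiss_id values stored in the metadata. If ads are removed, delete ad_bot.index and ad_metadata.json and re-run. For a ~1,000-file corpus this takes under 2 minutes on CPU.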
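The set-subtraction idea, plus a check for the removed-ads case mentioned above, can be sketched with only the standard library. The function name diff_folder and the temporary demo folder are illustrative assumptions; processed has the same shape as the metadata dict used in this article.

```python
import glob
import os
import tempfile

def diff_folder(folder, processed):
    """Return (new_names, stale_names) by comparing the folder with the metadata dict."""
    paths = (glob.glob(os.path.join(folder, "*.txt")) +
             glob.glob(os.path.join(folder, "*.md")))
    on_disk = {os.path.basename(p) for p in paths}
    new_names = sorted(on_disk - set(processed))    # on disk, not yet indexed
    stale_names = sorted(set(processed) - on_disk)  # indexed, but the file is gone
    return new_names, stale_names

# Demo in a throwaway folder so nothing in out_going/ is touched
with tempfile.TemporaryDirectory() as folder:
    for name in ["ad_0000.md", "ad_0001.txt", "ad_0002.md"]:
        with open(os.path.join(folder, name), "w", encoding="utf-8") as f:
            f.write("demo")
    processed = {
        "ad_0000.md":  {"faiss_id": 0, "status": "indexed"},
        "ad_0999.txt": {"faiss_id": 1, "status": "indexed"},  # no longer on disk
    }
    new_names, stale_names = diff_folder(folder, processed)

print(new_names)    # ['ad_0001.txt', 'ad_0002.md']
print(stale_names)  # ['ad_0999.txt']
```

A non-empty stale list would be the signal to rebuild the index from scratch rather than append to it.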
Step 4: Encode & Save to FAISS
Read each new file, encode all texts in a single batch call (much faster than one-by-one), add vectors to FAISS, then persist everything to disk.
```python
# 1. Read all new files into memory
new_texts, new_names = [], []
for fp in new_files:
    with open(fp, "r", encoding="utf-8") as f:
        new_texts.append(f.read())
    new_names.append(os.path.basename(fp))

# 2. Batch encode: ALWAYS pass a list, not a single string
#    Returns numpy array shape (N, 384), dtype float32
vectors = model.encode(
    new_texts,
    batch_size=32,            # process 32 texts at a time
    show_progress_bar=True,   # tqdm progress bar in terminal
    convert_to_numpy=True     # FAISS requires numpy float32
)

# 3. Append vectors to the FAISS index (does not touch existing vectors)
start_id = index.ntotal       # first ID FAISS will assign; read it before add()
index.add(vectors)            # index.ntotal increases by len(new_texts)

# 4. Record each new file in metadata with its FAISS position (ID)
for i, name in enumerate(new_names):
    processed[name] = {
        "faiss_id": start_id + i,   # integer position in the index
        "status": "indexed"
    }

# 5. Persist both files to disk
faiss.write_index(index, INDEX_F)
with open(META_F, "w") as f:
    json.dump(processed, f, indent=4)

print(f"Saved. Index now holds {index.ntotal} vectors.")
```
| Line / expression | What it does and why |
|---|---|
| model.encode(list, batch_size=32) | Passing a list lets the model process 32 texts at once internally. This can be 5 to 10x faster than calling encode(text) in a loop because operations are parallelised within each batch. |
| convert_to_numpy=True | FAISS requires a numpy.ndarray of dtype float32. This flag is already the default in encode(), but spelling it out documents the requirement; passing a PyTorch tensor to index.add() instead would raise an error. |
| index.add(vectors) | Appends vectors to the in-memory FAISS index without touching existing ones. IDs are assigned automatically as sequential integers from the current ntotal. |
| processed[name] = {"faiss_id": ...} | Maps the filename to its FAISS integer ID. Without this, a search returning ID 42 gives you no way to know which ad file it corresponds to. |
| faiss.write_index(index, INDEX_F) | Serialises the full index (all vectors + structure) to a compact binary file. Next run, faiss.read_index() restores it in milliseconds. |
Expected terminal output:
```
Batches: 100%|#####| 32/32 [00:48<00:00, 1.51s/it]
Saved. Index now holds 1000 vectors.
```
Resulting files:
```
ad_bot.index       1.5 MB
ad_metadata.json   85 KB
```
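Because every search result depends on the metadata and the index agreeing, a quick consistency check after saving can catch drift early. This is a minimal sketch, not part of the original script. The function name validate_metadata is an assumption, and ntotal stands for the value of index.ntotal.

```python
def validate_metadata(processed, ntotal):
    """Return a list of problems; an empty list means metadata and index agree."""
    problems = []
    ids = sorted(entry["faiss_id"] for entry in processed.values())
    if len(ids) != ntotal:
        problems.append(
            f"metadata has {len(ids)} entries but index holds {ntotal} vectors"
        )
    if ids != list(range(len(ids))):
        problems.append("faiss_id values are not the contiguous range 0..N-1")
    return problems

# Hypothetical metadata samples for illustration
good = {
    "ad_0000.md":  {"faiss_id": 0, "status": "indexed"},
    "ad_0001.txt": {"faiss_id": 1, "status": "indexed"},
}
bad = {
    "ad_0000.md":  {"faiss_id": 0, "status": "indexed"},
    "ad_0001.txt": {"faiss_id": 5, "status": "indexed"},
}

print(validate_metadata(good, ntotal=2))  # []
print(validate_metadata(bad, ntotal=3))   # two problems reported
```

In the real script the call would look like validate_metadata(processed, index.ntotal), run just after the two files are written.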
The run leaves two files on disk: ad_bot.index and ad_metadata.json. Any Python script can now load these and find the top-5 most semantically similar ads to any query using index.search(query_vector, k=5), where query_vector is a float32 array of shape (1, 384).
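To show how a search result gets back to a filename, here is a hedged sketch of a query helper. The names invert_metadata and search_ads are hypothetical. The heavy libraries are imported inside search_ads so the small mapping helper can be tried on its own, and the model is reloaded on every call for simplicity, which a real bot would avoid.

```python
import json

def invert_metadata(processed):
    """Turn {filename: {"faiss_id": n, ...}} into {n: filename}."""
    return {entry["faiss_id"]: name for name, entry in processed.items()}

def search_ads(query, k=5, index_path="ad_bot.index", meta_path="ad_metadata.json"):
    """Return up to k (filename, squared L2 distance) pairs for a text query."""
    # Imported lazily so invert_metadata() works without these packages installed
    import faiss
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer("all-MiniLM-L6-v2")
    index = faiss.read_index(index_path)
    with open(meta_path) as f:
        id_to_name = invert_metadata(json.load(f))

    query_vector = model.encode([query], convert_to_numpy=True)  # shape (1, 384)
    distances, ids = index.search(query_vector, k)
    return [
        (id_to_name[int(i)], float(d))
        for d, i in zip(distances[0], ids[0])
        if i != -1  # FAISS pads with -1 when fewer than k vectors exist
    ]

# The mapping step on its own, using sample metadata
sample = {
    "ad_0000.md":  {"faiss_id": 0, "status": "indexed"},
    "ad_0001.txt": {"faiss_id": 1, "status": "indexed"},
}
id_to_name = invert_metadata(sample)
print(id_to_name[1])  # ad_0001.txt
```

With the index built, a call such as search_ads("cheap laptop deals") would return filenames ordered from closest to farthest.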
Step 5: Visualise Ad Clusters with PCA
384-dimensional vectors cannot be plotted directly. PCA squashes them to 2D so you can visually verify that similar-category ads cluster together, which is a sanity check that the embeddings are working correctly.
Run visualise_clusters() manually whenever you need a diagnostic report, not in the automated task.
```python
def visualise_clusters(model):
    all_paths = (glob.glob(f"{FOLDER}/*.txt") +
                 glob.glob(f"{FOLDER}/*.md"))
    texts = []
    for fp in all_paths:
        # Same explicit encoding as the indexing step in Step 4
        with open(fp, encoding="utf-8") as f:
            texts.append(f.read())

    embeddings = model.encode(texts, batch_size=64, show_progress_bar=True)

    # Reduce 384 dims to 2 dims for the scatter plot
    pca     = PCA(n_components=2)
    reduced = pca.fit_transform(embeddings)   # shape: (1000, 2)

    plt.figure(figsize=(10, 6))
    plt.scatter(reduced[:, 0], reduced[:, 1], alpha=0.5, c='#5B21B6', s=12)
    plt.title("Ad Clusters: PCA of 384-dim Embeddings")
    plt.xlabel("PC 1")
    plt.ylabel("PC 2")
    plt.grid(True, alpha=0.3)
    plt.tight_layout()
    plt.savefig("ad_clusters.png", dpi=150)
    print("Saved: ad_clusters.png")
```
| Line / expression | What it does and why |
|---|---|
| PCA(n_components=2) | Finds the 2 directions of greatest variance in 384D space and projects all vectors onto them. Relative distances (clusters) are preserved well enough for a visual sanity check. |
| fit_transform(embeddings) | fit computes the principal components; transform applies the projection. Calling both in one step is standard practice. |
| reduced[:, 0] / reduced[:, 1] | Slices the 2D array: column 0 = x-axis (PC1), column 1 = y-axis (PC2). In a working embedding model, Tech ads cluster away from Fashion ads visually. |
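The plot above draws every point in one colour, so clusters show up as unlabelled blobs. One optional refinement is to extract each ad's category and map it to a colour index. The sketch assumes each ad contains a "Category: ..." line, as the dummy files do; the helper names extract_category and colour_indices are made up for this illustration.

```python
import re

# Matches a line such as "Category: Tech" anywhere in the ad text
CATEGORY_RE = re.compile(r"^Category:[ \t]*(.+)$", re.MULTILINE)

def extract_category(text, default="Unknown"):
    """Return the category named in the ad, or a default if no such line exists."""
    match = CATEGORY_RE.search(text)
    return match.group(1).strip() if match else default

def colour_indices(texts):
    """Map each text to an integer colour index, one index per distinct category."""
    cats = [extract_category(t) for t in texts]
    palette = {cat: i for i, cat in enumerate(sorted(set(cats)))}
    return [palette[c] for c in cats], palette

texts = [
    "# Ad 0\nCategory: Tech\nBuy our Tech product! It is affordable.",
    "# Ad 1\nCategory: Fashion\nDiscover the best Fashion deals. reliable offer today.",
    "# Ad 5\nCategory: Tech\nTop Tech service. Very reliable.",
    "No category line here",
]
indices, palette = colour_indices(texts)
print(indices)  # [1, 0, 1, 2]
print(palette)  # {'Fashion': 0, 'Tech': 1, 'Unknown': 2}
```

Passing c=indices together with a categorical colormap such as cmap="tab10" to plt.scatter, in place of the fixed colour, would then tint each point by category.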
Windows Task Scheduler Setup
Save the complete script as update_embeddings.py, then follow these steps.
Find your Python path
Open Command Prompt and run:
where python
Copy the full result, e.g. C:\Users\YourName\AppData\Local\Programs\Python\Python311\python.exe
Open Task Scheduler
Press Win + R, type taskschd.msc, press Enter. Click Create Basic Task... in the right panel.
Set name and trigger
Name: "Update Ad Bot Embeddings". Set trigger to Daily. After the wizard, right-click the task > Properties > Triggers > Edit, then enable "Repeat task every 30 minutes" with Duration: Indefinitely.
Configure the action
| Field | Value |
|---|---|
| Program/script | C:\...\python.exe (the path from "Find your Python path" above) |
| Add arguments | update_embeddings.py |
| Start in (critical) | C:\path\to\project\ (sets the working directory so the script finds out_going/) |
Credentials and verification
In Properties > General, select "Run whether user is logged on or not" and enter your Windows password. Then right-click the task > Run. Check Last Run Result: it must show (0x0), which means success.
Finally, confirm that ad_bot.index and ad_metadata.json exist in your project folder with fresh modification timestamps.
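Since Task Scheduler judges a run by its exit code, it can help to funnel the whole script through one entry point that turns unhandled errors into a non-zero code and logs them. This is a minimal sketch. run_pipeline is a hypothetical placeholder for Steps 2 to 4 of this article, and failing_pipeline exists only to demonstrate the failure path.

```python
import logging

def run_pipeline():
    """Placeholder for Steps 2-4; the real function should raise on failure."""
    return None

def failing_pipeline():
    """Simulates a run that breaks, for demonstration only."""
    raise RuntimeError("simulated failure")

def main(pipeline=run_pipeline):
    """Return 0 on success, 1 on any unhandled error."""
    try:
        pipeline()
    except Exception:
        # Logs the full traceback; by default it goes to stderr
        logging.exception("Embedding update failed")
        return 1
    return 0

print(main(run_pipeline))      # 0
print(main(failing_pipeline))  # 1

# In the real script, the last line would hand the code to the OS:
#     raise SystemExit(main())
```

An early exit(0) inside the pipeline still works with this wrapper, because SystemExit is not a subclass of Exception and passes straight through.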
Key Takeaways
- Incremental = state tracking. A plain JSON dict of filenames is all you need. Dict lookup is O(1) and scales to 100,000+ files without slowing down.
- all-MiniLM-L6-v2 is a good fit for CPU: about 22 million parameters (roughly 80-90 MB on disk), 384 dims, strong semantic quality for its size, and about 1.5 seconds per batch of 32 in the sample run above.
- IndexFlatL2 is exact, needs no training or tuning, and is fast enough for corpora up to ~100K vectors.
- Always batch encode. model.encode(list) is 5 to 10x faster than calling encode(text) in a loop.
- faiss_id is the bridge from a search result back to the original file. Store it in metadata at indexing time.
- exit(0) matters. A clean zero exit code tells Windows Task Scheduler the run was successful.
- PCA plots are diagnostic only. Exclude the visualisation function from your scheduled script.