Ad Bot Custom Embeddings with FAISS

📋 In This Article

01 Situation & Goals
02 Prerequisites & Setup
03 Key Concepts
04 Step 1 — Data Simulation
05 Step 2 — Load Model & Index
06 Step 3 — Incremental Logic
07 Step 4 — Encode & Save
08 Step 5 — Visualise Clusters
09 Windows Task Scheduler
10 Knowledge Check

🎯 Situation & Goals

A small company wants to build an ad bot. Their knowledge base: ~1,000 ads stored as .txt and .md files in a folder called out_going/.

The bot needs to understand ads semantically — not just keyword-match them. That means converting every ad into a numerical vector (embedding) that captures meaning. These embeddings will later power similarity search, clustering, recommendation, and retrieval.
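To make "captures meaning" concrete, here is a minimal sketch using hand-made 3-dimensional toy vectors. The values are invented for illustration and are not real model output; the model used later in this article produces 384 dimensions. Ads about similar topics should point in similar directions, which is what cosine similarity measures.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy "embeddings" (hypothetical values, not produced by any model)
toy = {
    "Cheap laptops on sale":      [0.9, 0.1, 0.0],
    "Discount notebooks today":   [0.8, 0.2, 0.1],
    "Fresh pizza delivered fast": [0.0, 0.1, 0.9],
}

query = toy["Cheap laptops on sale"]
for text, vec in toy.items():
    # The two laptop-related ads score much higher than the pizza ad
    print(f"{cosine_similarity(query, vec):.2f}  {text}")
```

Keyword matching would see no overlap between "laptops" and "notebooks"; a vector comparison like this one can still rank them as close.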

✅ What You Will Learn

What embeddings are and why they matter
How FAISS stores and searches vectors on CPU
How to build an incremental indexing pipeline
How to persist and reload a FAISS index
How to visualise ad clusters with PCA
How to schedule the script on Windows

📌 Prerequisites

Python 3.8 or higher installed
Comfortable with for loops and functions
Basic understanding of lists and dicts
Know how to run .py files from terminal
No ML or AI background required

🏗 Key Constraints: CPU only (no GPU) · Incremental runs that never re-process old files · Scheduled via Windows Task Scheduler · Reusable index for downstream tasks.

⚙ Prerequisites & Installation

● Beginner ⏱ 2 min

Install all required packages with a single pip command. All run on CPU — no CUDA setup needed.

pip install sentence-transformers faiss-cpu matplotlib numpy scikit-learn

Package	Why we need it
sentence-transformers	Loads the MiniLM model and converts text into float vectors
faiss-cpu	Stores, saves, and searches vectors efficiently — the CPU build, no GPU required
matplotlib	Draws the 2D scatter plot of ad clusters
numpy	Required by FAISS and SentenceTransformers for array operations
scikit-learn	PCA — reduces 384D vectors to 2D for the diagnostic plot

⚠ Windows users: If faiss-cpu fails to install, try pip install faiss-cpu --no-cache-dir. On Python 3.12+ you may need pip install faiss-cpu==1.8.0.

📖 Key Concepts

● Beginner ⏱ 3 min


📁 Step 1 — Configuration & Data Simulation

● Beginner ⏱ 2 min

Define the three file paths the pipeline uses, then generate 1,000 dummy ad files that simulate the real out_going/ folder. In production you skip the generation and point at your real folder.

python

import os, glob, json
import numpy as np
import faiss
import matplotlib.pyplot as plt
from sentence_transformers import SentenceTransformer
from sklearn.decomposition import PCA

# ── The only 3 constants you ever need to change ──
FOLDER  = "out_going"          # folder holding your ad files
INDEX_F = "ad_bot.index"       # persisted FAISS binary on disk
META_F  = "ad_metadata.json"   # state tracker — dict of processed filenames

os.makedirs(FOLDER, exist_ok=True)

def create_dummy_ads():
    categories = ["Tech", "Fashion", "Food", "Cars", "Travel"]
    templates  = [
        "Buy our {cat} product! It is {adj}.",
        "Discover the best {cat} deals. {adj} offer today.",
        "Top {cat} service. Very {adj}.",
    ]
    adjectives = ["affordable", "reliable", "fast", "stylish"]
    for i in range(1000):
        cat  = categories[i % len(categories)]
        adj  = adjectives[i  % len(adjectives)]
        body = templates[i   % len(templates)].format(cat=cat, adj=adj)
        ext  = ".md" if i % 2 == 0 else ".txt"
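        with open(f"{FOLDER}/ad_{i:04d}{ext}", "w", encoding="utf-8") as f:
            f.write(f"# Ad {i}\nCategory: {cat}\n{body}")
    print(f"Created 1000 ad files in '{FOLDER}/'")
    print("  500 x .txt  |  500 x .md")

# Simulation only: generate the dummy corpus while the folder is still empty
if not os.listdir(FOLDER):
    create_dummy_ads()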

Line / expression	What it does and why
FOLDER / INDEX_F / META_F	Three constants at the top. Changing these is all you need to adapt the script to a different project. Never hard-code paths inside functions.
os.makedirs(exist_ok=True)	Creates the folder if missing. `exist_ok=True` prevents an error if it already exists — safe for repeated scheduled runs.
i % len(categories)	Cycles through the list endlessly using modulo. 1,000 files ÷ 5 categories = 200 files per category.
f"ad_{i:04d}{ext}"	`:04d` zero-pads to 4 digits → `ad_0001.txt`, `ad_0042.md`. Files sort correctly in any file explorer.
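The modulo and zero-padding claims in the table can be checked without writing any files. The sketch below simply recomputes what create_dummy_ads() would assign, using the same category list and the same expressions.

```python
from collections import Counter

categories = ["Tech", "Fashion", "Food", "Cars", "Travel"]

# Recompute, without touching the disk, what create_dummy_ads() assigns
cat_counts = Counter(categories[i % len(categories)] for i in range(1000))
ext_counts = Counter(".md" if i % 2 == 0 else ".txt" for i in range(1000))

# Zero-padded stems for a few sample indices
stems = [f"ad_{i:04d}" for i in (1, 42, 999)]

print(cat_counts)   # every category appears equally often
print(ext_counts)   # even indices get .md, odd indices get .txt
print(stems)
```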

Expected terminal output:

terminal

Created 1000 ad files in 'out_going/'
  500 x .txt  |  500 x .md

🧠 Step 2 — Load the Model & FAISS Index

● Intermediate ⏱ 3 min

Load (or create) the embedding model and the FAISS index at the start of every run. This "load or create" pattern is the foundation of incremental indexing.

python

# 1. Load a lightweight, CPU-optimised embedding model
model = SentenceTransformer('all-MiniLM-L6-v2')
dim   = model.get_sentence_embedding_dimension()  # returns 384

# 2. Load existing index + metadata, or start fresh
if os.path.exists(INDEX_F) and os.path.exists(META_F):
    print("Loading existing index...")
    index = faiss.read_index(INDEX_F)
    with open(META_F) as f:
        processed = json.load(f)
else:
    print("No index found. Creating IndexFlatL2...")
    index     = faiss.IndexFlatL2(dim)
    processed = {}

print(f"Index size: {index.ntotal} vectors | Metadata: {len(processed)} entries")

Line / expression	What it does and why
SentenceTransformer('all-MiniLM-L6-v2')	Downloads the model once (roughly 90 MB of weights; the model has about 22 million parameters) and loads MiniLM. The first run fetches it from the internet; subsequent runs load it from the local cache, so scheduled runs do not download it again.
get_sentence_embedding_dimension()	Returns 384. This must match your FAISS index dimension exactly. If you switch models, you must rebuild the index from scratch.
faiss.read_index(INDEX_F)	Loads the full binary index from disk, restoring all previously indexed vectors into memory in milliseconds.
faiss.IndexFlatL2(dim)	Creates an exact L2 (Euclidean) distance index. "Flat" = vectors stored as-is, no compression, no approximation. Sub-millisecond search for 1,000 ads.
index.ntotal	The number of vectors currently stored. Zero on first run, 1,000 after full indexing.
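Conceptually, a flat L2 index compares the query against every stored vector. The NumPy sketch below mimics that idea with random stand-in vectors; it illustrates the technique and is not how FAISS is implemented internally. Like IndexFlatL2, it reports squared Euclidean distances. The function name flat_l2_search is hypothetical.

```python
import numpy as np

def flat_l2_search(stored, query, k):
    """Exact nearest-neighbour search: squared L2 distance to every row."""
    diffs = stored - query                  # broadcast the query against all rows
    dists = np.sum(diffs * diffs, axis=1)   # squared Euclidean distances
    order = np.argsort(dists)[:k]           # row positions of the k smallest
    return dists[order], order

rng = np.random.default_rng(0)
stored = rng.random((1000, 384), dtype=np.float32)   # stand-in for 1,000 ad vectors
query = stored[42]                                   # search with an already stored vector

distances, ids = flat_l2_search(stored, query, k=5)
print(ids[0], distances[0])   # the nearest neighbour is the query's own row
```

Because every vector is compared, the result is exact; the cost grows linearly with the number of stored vectors, which is why this approach suits a corpus of this size.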

Expected terminal output — first run:

terminal — first run

No index found. Creating IndexFlatL2...
Index size: 0 vectors | Metadata: 0 entries

Expected terminal output — subsequent runs:

terminal — subsequent runs

Loading existing index...
Index size: 1000 vectors | Metadata: 1000 entries

💡 Why not GPU? faiss-cpu is comfortably fast for 1,000–100,000 vectors. GPU FAISS generally pays off only at much larger scales (millions of vectors), and it requires CUDA. CPU is the right choice here.

🔍 Step 3 — The Incremental Logic

● Intermediate ⏱ 3 min

The entire "incremental" mechanism is just a set subtraction: all files − already processed = new files. If nothing is new, the script exits cleanly without burning CPU.

python

# 1. Collect every .txt and .md currently in the folder
all_files = (
    glob.glob(os.path.join(FOLDER, "*.txt")) +
    glob.glob(os.path.join(FOLDER, "*.md"))
)

# 2. Keep only files NOT already in our metadata dict  ← THE incremental step
new_files = [
    f for f in all_files
    if os.path.basename(f) not in processed
]

print(f"Total: {len(all_files)} | Processed: {len(processed)} | New: {len(new_files)}")

# 3. Early-exit if nothing to do
if not new_files:
    print("Nothing new. Index is up to date.")
    exit(0)   # exit code 0 = success for Windows Task Scheduler

Line / expression	What it does and why
glob.glob("*.txt") + glob.glob("*.md")	Scans the folder for filenames matching each pattern. `os.path.join()` makes the path work on both Windows and Linux without changes.
os.path.basename(f) not in processed	Extracts just the filename (not the full path) and checks it against the metadata dict. Dict key lookup is O(1) — fast even with 100,000 entries.
[f for f in all_files if ...]	List comprehension filtering. Produces only the files the script has not seen before. Equivalent to a filtered loop but more concise.
exit(0)	Exit code 0 = success. Windows Task Scheduler marks the run as successful. A non-zero code (or an exception) would appear as a failure in the task history.
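The filter can be exercised in isolation. The sketch below wraps the same logic in a hypothetical helper, find_new_files(), and runs it against a temporary folder, so it does not touch out_going/ or need the model.

```python
import glob
import os
import tempfile

def find_new_files(folder, processed):
    """Return paths of .txt/.md files whose basename is not yet in `processed`."""
    all_files = (
        glob.glob(os.path.join(folder, "*.txt")) +
        glob.glob(os.path.join(folder, "*.md"))
    )
    return sorted(f for f in all_files if os.path.basename(f) not in processed)

with tempfile.TemporaryDirectory() as folder:
    # Three ads plus one unrelated file that the patterns should ignore
    for name in ("ad_0000.md", "ad_0001.txt", "ad_0002.md", "notes.csv"):
        with open(os.path.join(folder, name), "w", encoding="utf-8") as f:
            f.write("demo")

    # Pretend the first ad was indexed on an earlier run
    processed = {"ad_0000.md": {"faiss_id": 0, "status": "indexed"}}
    new_names = [os.path.basename(p) for p in find_new_files(folder, processed)]

print(new_names)
```

Sorting the result is an addition made here so that the order is predictable; glob itself makes no ordering promise.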

terminal — when 50 new ads dropped in

Total: 1050 | Processed: 1000 | New: 50

terminal — when nothing changed

Nothing new. Index is up to date.

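📌 What about deleted files? The script never removes vectors. IndexFlatL2 does offer remove_ids, but removing a vector shifts the positions of all later ones, which would invalidate the faiss_id values stored in the metadata. The simple fix when ads are removed: delete ad_bot.index and ad_metadata.json and re-run. For a ~1,000-file corpus this takes under 2 minutes on CPU.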
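Spotting that a rebuild is due can be automated with the reverse set subtraction: names in the metadata that are no longer in the folder. This is a minimal sketch; find_deleted() is a hypothetical helper and the sample data is made up.

```python
def find_deleted(processed, current_names):
    """Names recorded in metadata that no longer exist in the folder."""
    return sorted(set(processed) - set(current_names))

processed = {
    "ad_0000.md":  {"faiss_id": 0, "status": "indexed"},
    "ad_0001.txt": {"faiss_id": 1, "status": "indexed"},
    "ad_0002.md":  {"faiss_id": 2, "status": "indexed"},
}
current_names = ["ad_0000.md", "ad_0002.md"]   # ad_0001.txt was removed

stale = find_deleted(processed, current_names)
if stale:
    print(f"{len(stale)} indexed file(s) are gone; a rebuild is needed: {stale}")
```

In the real script, current_names would be the basenames of all_files from Step 3.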

⚙ Step 4 — Encode & Save to FAISS

● Intermediate ⏱ 3 min

Read each new file, encode all texts in a single batch call (much faster than one-by-one), add vectors to FAISS, then persist everything to disk.

python

# 1. Read all new files into memory
new_texts, new_names = [], []
for fp in new_files:
    with open(fp, "r", encoding="utf-8") as f:
        new_texts.append(f.read())
    new_names.append(os.path.basename(fp))

# 2. Batch encode — ALWAYS pass a list, not a single string
#    Returns numpy array shape (N, 384), dtype float32
vectors = model.encode(
    new_texts,
    batch_size=32,               # process 32 texts at a time
    show_progress_bar=True,      # tqdm progress bar in terminal
    convert_to_numpy=True        # FAISS requires numpy float32
)

# 3. Append vectors to the FAISS index (does not touch existing vectors)
start_id = index.ntotal         # position the first new vector will receive
index.add(vectors)              # index.ntotal increases by len(new_texts)

# 4. Record each new file in metadata with its FAISS position (ID)
for i, name in enumerate(new_names):
    processed[name] = {
        "faiss_id": start_id + i,  # integer position in the index
        "status":   "indexed"
    }

# 5. Persist both files to disk
faiss.write_index(index, INDEX_F)
with open(META_F, "w") as f:
    json.dump(processed, f, indent=4)
print(f"Saved. Index now holds {index.ntotal} vectors.")

Line / expression	What it does and why
model.encode(list, batch_size=32)	Passing a list lets the model process up to 32 texts at once internally. This is typically several times faster than calling `encode(text)` in a loop, because the work within each batch is vectorised; the exact speed-up depends on hardware and text length.
convert_to_numpy=True	FAISS expects a `numpy.ndarray` of dtype `float32`. This flag (already the default in sentence-transformers) makes `encode()` return NumPy arrays rather than PyTorch tensors, so the result can be passed straight to `index.add()`.
index.add(vectors)	Appends vectors to the in-memory FAISS index without touching existing ones. IDs are assigned automatically as sequential integers from the current `ntotal`.
processed[name] = {"faiss_id": ...}	Maps the filename to its FAISS integer ID. Without this, a search returning ID 42 gives you no way to know which ad file it corresponds to.
faiss.write_index(index, INDEX_F)	Serialises the full index (all vectors + structure) to a compact binary file. Next run, `faiss.read_index()` restores it in milliseconds.
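How the sequential IDs line up with the metadata across runs can be simulated in plain Python, with no FAISS involved. The helper assign_ids() is hypothetical; it assumes the starting ID is the index's vector count as it stood before the new vectors were added.

```python
def assign_ids(processed, new_names, ntotal_before):
    """Record each new file with the position its vector receives.

    `ntotal_before` stands for index.ntotal read before index.add() runs.
    Returns the vector count after the add.
    """
    for offset, name in enumerate(new_names):
        processed[name] = {"faiss_id": ntotal_before + offset, "status": "indexed"}
    return ntotal_before + len(new_names)

processed = {}
ntotal = 0

# Run 1: three ads arrive
ntotal = assign_ids(processed, ["ad_0000.md", "ad_0001.txt", "ad_0002.md"], ntotal)
# Run 2: two more ads arrive later and continue the numbering
ntotal = assign_ids(processed, ["ad_0003.txt", "ad_0004.md"], ntotal)

print(ntotal, processed["ad_0003.txt"]["faiss_id"])
```

A cheap safeguard in the real script is to compare index.ntotal with len(processed) after loading; if they differ, the two files are out of sync and a rebuild is the safest response.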

terminal — first full run

Batches: 100%|#####| 32/32 [00:48<00:00, 1.51s/it]
Saved. Index now holds 1000 vectors.
  ad_bot.index      1.5 MB
  ad_metadata.json   85 KB

✅ What you have after this step: Two files on disk, ad_bot.index and ad_metadata.json. Any Python script can now load these and find the top-5 most semantically similar ads to any query using index.search(query_vector, k=5), where query_vector is a float32 array of shape (1, 384) produced by the same model.
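A downstream lookup might look like the sketch below. The function top_k_ads() is hypothetical; it assumes `model` behaves like the SentenceTransformer loaded in Step 2, `index` like the object returned by faiss.read_index(), and `processed` is the dict loaded from ad_metadata.json.

```python
def top_k_ads(query, model, index, processed, k=5):
    """Return [(filename, squared_l2_distance), ...] for the k nearest ads."""
    # Reverse map: FAISS position -> filename
    id_to_name = {meta["faiss_id"]: name for name, meta in processed.items()}

    # Encoding a one-element list yields a (1, dim) array, as search() expects
    query_vec = model.encode([query], convert_to_numpy=True)
    distances, ids = index.search(query_vec, k)

    results = []
    for dist, idx in zip(distances[0], ids[0]):
        if idx == -1:   # FAISS pads with -1 when fewer than k vectors exist
            continue
        results.append((id_to_name[int(idx)], float(dist)))
    return results

# Intended use (requires the files written in this step):
#   model = SentenceTransformer('all-MiniLM-L6-v2')
#   index = faiss.read_index("ad_bot.index")
#   with open("ad_metadata.json") as f:
#       processed = json.load(f)
#   print(top_k_ads("cheap laptop deals", model, index, processed))
```

Smaller distances mean closer matches; with IndexFlatL2 the values are squared Euclidean distances.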

📊 Step 5 — Visualise Ad Clusters with PCA

● Intermediate ⏱ 2 min

384-dimensional vectors cannot be plotted directly. PCA squashes them to 2D so you can visually verify that similar-category ads cluster together — a sanity check that embeddings are working correctly.

⚠ Run this separately from the scheduled script: Re-encoding all 1,000 files just for a plot wastes CPU. Call visualise_clusters() manually whenever you need a diagnostic report, not in the automated task.
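One possible way to avoid the re-encoding altogether is to read the vectors back out of the saved index. This is a sketch of an alternative, not what the function in this step does. It assumes `index` is the flat index loaded with faiss.read_index(), which stores raw vectors and can return them through reconstruct_n(); vectors_and_names() is a hypothetical helper.

```python
def vectors_and_names(index, processed):
    """Fetch every stored vector plus the filename belonging to each row.

    Assumes a flat FAISS index (raw vectors are kept, so they can be
    reconstructed) and the metadata dict from ad_metadata.json.
    """
    vectors = index.reconstruct_n(0, index.ntotal)   # shape (ntotal, dim)
    # Order filenames by faiss_id so that names[i] describes vectors[i]
    names = sorted(processed, key=lambda n: processed[n]["faiss_id"])
    return vectors, names
```

The returned array could then be passed to PCA in place of freshly encoded embeddings.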

python

def visualise_clusters(model):
    all_paths = (glob.glob(f"{FOLDER}/*.txt") +
                 glob.glob(f"{FOLDER}/*.md"))
    texts = []
    for fp in all_paths:
        with open(fp, encoding="utf-8") as f:
            texts.append(f.read())

    embeddings = model.encode(texts, batch_size=64, show_progress_bar=True)

    # Reduce 384 dims to 2 dims for the scatter plot
    pca     = PCA(n_components=2)
    reduced = pca.fit_transform(embeddings)  # shape: (1000, 2)

    plt.figure(figsize=(10, 6))
    plt.scatter(reduced[:, 0], reduced[:, 1],
                alpha=0.5, c='#5B21B6', s=12)
    plt.title("Ad Clusters — PCA of 384-dim Embeddings")
    plt.xlabel("PC 1")
    plt.ylabel("PC 2")
    plt.grid(True, alpha=0.3)
    plt.tight_layout()
    plt.savefig("ad_clusters.png", dpi=150)
    print("Saved: ad_clusters.png")

Line / expression	What it does and why
PCA(n_components=2)	Finds the 2 directions of greatest variance in 384D space and projects all vectors onto them. Relative distances (clusters) are preserved well enough for a visual sanity check.
fit_transform(embeddings)	`fit` computes the principal components; `transform` applies the projection. Calling both in one step is standard practice.
reduced[:, 0] / reduced[:, 1]	Slices the 2D array: column 0 = x-axis (PC1), column 1 = y-axis (PC2). In a working embedding model, Tech ads cluster away from Fashion ads visually.
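With a single colour for every point, it is hard to tell whether Tech ads really sit apart from Fashion ads. A small sketch of one way to derive per-point colours from the "Category:" line that the dummy ads contain follows; parse_category() and color_ids() are hypothetical helpers, and real ads may not carry such a line.

```python
def parse_category(ad_text, default="Unknown"):
    """Extract the value of the 'Category: ...' line written by the dummy ads."""
    for line in ad_text.splitlines():
        if line.startswith("Category:"):
            return line.split(":", 1)[1].strip()
    return default

def color_ids(texts):
    """Map each ad's category to a small integer usable as a scatter colour."""
    categories = [parse_category(t) for t in texts]
    lookup = {cat: i for i, cat in enumerate(sorted(set(categories)))}
    return [lookup[c] for c in categories], lookup

texts = [
    "# Ad 0\nCategory: Tech\nBuy our Tech product! It is affordable.",
    "# Ad 1\nCategory: Fashion\nDiscover the best Fashion deals. reliable offer today.",
    "# Ad 5\nCategory: Tech\nTop Tech service. Very reliable.",
]
ids, lookup = color_ids(texts)
print(ids, lookup)
```

Inside visualise_clusters(), the list of integers could replace the fixed colour, for example plt.scatter(reduced[:, 0], reduced[:, 1], c=ids, cmap="tab10", alpha=0.5, s=12).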

🗓 Windows Task Scheduler Setup

● Beginner ⏱ 5 min

Save the complete script as update_embeddings.py, then follow these steps.

Find your Python path

Open Command Prompt and run:

cmd

where python

Copy the full result, e.g. C:\Users\YourName\AppData\Local\Programs\Python\Python311\python.exe

Open Task Scheduler

Press Win + R, type taskschd.msc, press Enter. Click Create Basic Task… in the right panel.

Set name and trigger

Name: "Update Ad Bot Embeddings". Set trigger to Daily. After the wizard, right-click the task → Properties → Triggers → Edit → enable "Repeat task every 30 minutes", Duration: Indefinitely.

Configure the action

Field	Value
Program/script	`C:\...\python.exe` (the path returned by `where python` earlier)
Add arguments	`update_embeddings.py`
Start in (critical)	`C:\path\to\project\` — sets working directory so the script finds `out_going/`

Credentials and verification

In Properties → General: select "Run whether user is logged on or not" and enter your Windows password. Then right-click → Run. Check Last Run Result — it must show (0x0) which means success.

✅ Verify it worked: After the first run, check that ad_bot.index and ad_metadata.json exist in your project folder with fresh modification timestamps.
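The timestamp check can also be scripted. The sketch below uses a hypothetical helper, is_fresh(), with an assumed 30-minute window that matches the repeat interval configured above; run it from the project folder.

```python
import os
import time

def is_fresh(path, max_age_minutes=30):
    """True if `path` exists and was modified within the last N minutes."""
    if not os.path.exists(path):
        return False
    age_seconds = time.time() - os.path.getmtime(path)
    return age_seconds <= max_age_minutes * 60

for name in ("ad_bot.index", "ad_metadata.json"):
    state = "fresh" if is_fresh(name) else "missing or stale"
    print(f"{name}: {state}")
```

Note that a run which finds no new ads exits before saving, so on quiet days the timestamps will legitimately be older than the window.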

🧪 Knowledge Check

● Mixed difficulty ⏱ 5 min

These questions test practical understanding, not just definitions. Read the code snippets carefully.

🎯 Key Takeaways

Incremental = state tracking. A plain JSON dict of filenames is all you need. Dict lookup is O(1) and scales to 100,000+ files without slowing down.
all-MiniLM-L6-v2 is a sensible CPU model: about 22 million parameters (roughly 90 MB on disk), 384 dims, good semantic quality for its size, and around a second or two per batch of 32 in the run shown in Step 4.
IndexFlatL2 is exact, needs no training or tuning, and is fast enough for corpora up to ~100K vectors.
Always batch encode. model.encode(list) is typically several times faster than calling encode(text) in a loop; the exact speed-up depends on hardware and text length.
faiss_id is the bridge from a search result back to the original file. Store it in metadata at indexing time.
exit(0) matters. A clean zero exit code tells Windows Task Scheduler the run was successful.
PCA plots are diagnostic only. Exclude the visualisation function from your scheduled script.