
🎯 Situation & Goals

A small company wants to build an ad bot. Their knowledge base: ~1,000 ads stored as .txt and .md files in a folder called out_going/.

The bot needs to understand ads semantically, not just keyword-match them. That means converting every ad into a numerical vector (embedding) that captures meaning. These embeddings will later power similarity search, clustering, recommendation, and retrieval.
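To make "meaning as a vector" concrete, here is a minimal sketch that compares hand-made 3-dimensional vectors with cosine similarity. The ad labels and numbers are invented for illustration; real MiniLM embeddings have 384 dimensions and come from the model, not from hand-written lists.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy "embeddings" (hypothetical values, not produced by any model)
toy_vectors = {
    "cheap laptop deal":      [0.9, 0.1, 0.0],
    "affordable notebook PC": [0.8, 0.2, 0.1],
    "fresh pizza delivery":   [0.0, 0.2, 0.9],
}

def most_similar(query, vectors):
    """Return the other key whose vector points closest to vectors[query]."""
    others = (k for k in vectors if k != query)
    return max(others, key=lambda k: cosine_similarity(vectors[query], vectors[k]))

best = most_similar("cheap laptop deal", toy_vectors)
print(best)
```

The two laptop ads share no keywords, yet their vectors point in nearly the same direction, which is the property a keyword match cannot capture.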

✅ What You Will Learn

  • What embeddings are and why they matter
  • How FAISS stores and searches vectors on CPU
  • How to build an incremental indexing pipeline
  • How to persist and reload a FAISS index
  • How to visualise ad clusters with PCA
  • How to schedule the script on Windows

📌 Prerequisites

  • Python 3.8 or higher installed
  • Comfortable with for loops and functions
  • Basic understanding of lists and dicts
  • Know how to run .py files from terminal
  • No ML or AI background required
πŸ— Key Constraints CPU only (no GPU)  Β·  Incremental runs β€” never re-process old files  Β·  Scheduled via Windows Task Scheduler  Β·  Reusable index for downstream tasks.

⚙ Prerequisites & Installation

Install all required packages with a single pip command. All of them run on CPU, so no CUDA setup is needed.

pip install sentence-transformers faiss-cpu matplotlib numpy scikit-learn

| Package | Why we need it |
| --- | --- |
| sentence-transformers | Loads the MiniLM model and converts text into float vectors |
| faiss-cpu | Stores, saves, and searches vectors efficiently; this is the CPU build, so no GPU is required |
| matplotlib | Draws the 2D scatter plot of ad clusters |
| numpy | Required by FAISS and SentenceTransformers for array operations |
| scikit-learn | Provides PCA, which reduces 384D vectors to 2D for the diagnostic plot |

⚠ Windows users: If faiss-cpu fails to install, try pip install faiss-cpu --no-cache-dir. On Python 3.12+ you may need faiss-cpu 1.8.0 or newer, for example pip install "faiss-cpu>=1.8.0".

Step 1: Configuration & Data Simulation

Define the three file paths the pipeline uses, then generate 1,000 dummy ad files that simulate the real out_going/ folder. In production you skip the generation and point at your real folder.

python
import os, glob, json
import numpy as np
import faiss
import matplotlib.pyplot as plt
from sentence_transformers import SentenceTransformer
from sklearn.decomposition import PCA

# ── The only 3 constants you ever need to change ──
FOLDER  = "out_going"          # folder holding your ad files
INDEX_F = "ad_bot.index"       # persisted FAISS binary on disk
META_F  = "ad_metadata.json"   # state tracker β€” dict of processed filenames

os.makedirs(FOLDER, exist_ok=True)

def create_dummy_ads():
    categories = ["Tech", "Fashion", "Food", "Cars", "Travel"]
    templates  = [
        "Buy our {cat} product! It is {adj}.",
        "Discover the best {cat} deals. {adj} offer today.",
        "Top {cat} service. Very {adj}.",
    ]
    adjectives = ["affordable", "reliable", "fast", "stylish"]
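    for i in range(1000):
        cat  = categories[i % len(categories)]
        adj  = adjectives[i  % len(adjectives)]
        body = templates[i   % len(templates)].format(cat=cat, adj=adj)
        ext  = ".md" if i % 2 == 0 else ".txt"
        # Write as UTF-8 so Step 4 can read the files back with the same encoding
        with open(f"{FOLDER}/ad_{i:04d}{ext}", "w", encoding="utf-8") as f:
            f.write(f"# Ad {i}\nCategory: {cat}\n{body}")
    # Report what was written (even i -> .md, odd i -> .txt)
    print(f"Created 1000 ad files in '{FOLDER}/'")
    print("  500 x .txt  |  500 x .md")

create_dummy_ads()   # simulation only: remove this call in production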

| Line / expression | What it does and why |
| --- | --- |
| FOLDER / INDEX_F / META_F | Three constants at the top. Changing these is all you need to adapt the script to a different project. Never hard-code paths inside functions. |
| os.makedirs(exist_ok=True) | Creates the folder if missing. exist_ok=True prevents an error if it already exists, which makes it safe for repeated scheduled runs. |
| i % len(categories) | Cycles through the list endlessly using modulo. 1,000 files ÷ 5 categories = 200 files per category. |
| f"ad_{i:04d}{ext}" | :04d zero-pads to 4 digits → ad_0001.txt, ad_0042.md. Files sort correctly in any file explorer. |

Expected terminal output:

terminal
Created 1000 ad files in 'out_going/'
  500 x .txt  |  500 x .md
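
The counts above can be checked without writing any files. The sketch below replays the same modulo and zero-padding logic in memory; `simulate_names` is a hypothetical helper for illustration, not part of the pipeline.

```python
from collections import Counter

CATEGORIES = ["Tech", "Fashion", "Food", "Cars", "Travel"]

def simulate_names(n=1000):
    """Replay the naming logic of create_dummy_ads() without touching disk."""
    rows = []
    for i in range(n):
        cat = CATEGORIES[i % len(CATEGORIES)]      # cycles every 5 files
        ext = ".md" if i % 2 == 0 else ".txt"      # even -> .md, odd -> .txt
        rows.append((f"ad_{i:04d}{ext}", cat))
    return rows

rows = simulate_names()
per_category = Counter(cat for _, cat in rows)
per_extension = Counter(name.rsplit(".", 1)[1] for name, _ in rows)

print(per_category["Tech"], per_extension["md"], rows[42][0])
```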

🧠 Step 2: Load the Model & FAISS Index

Load (or create) the embedding model and the FAISS index at the start of every run. This "load or create" pattern is the foundation of incremental indexing.

python
# 1. Load a lightweight, CPU-optimised embedding model
model = SentenceTransformer('all-MiniLM-L6-v2')
dim   = model.get_sentence_embedding_dimension()  # returns 384

# 2. Load existing index + metadata, or start fresh
if os.path.exists(INDEX_F) and os.path.exists(META_F):
    print("Loading existing index...")
    index = faiss.read_index(INDEX_F)
    with open(META_F) as f:
        processed = json.load(f)
else:
    print("No index found. Creating IndexFlatL2...")
    index     = faiss.IndexFlatL2(dim)
    processed = {}

print(f"Index size: {index.ntotal} vectors | Metadata: {len(processed)} entries")

| Line / expression | What it does and why |
| --- | --- |
| SentenceTransformer('all-MiniLM-L6-v2') | Downloads the model once (roughly 80 to 90 MB; the network has about 22 million parameters) and loads MiniLM. The first run fetches it from the internet; subsequent runs load from the local cache, so scheduled runs can work offline. |
| get_sentence_embedding_dimension() | Returns 384. This must match your FAISS index dimension exactly. If you switch models, you must rebuild the index from scratch. |
| faiss.read_index(INDEX_F) | Loads the full binary index from disk, restoring all previously indexed vectors into memory in milliseconds. |
| faiss.IndexFlatL2(dim) | Creates an exact L2 (Euclidean) distance index. "Flat" = vectors stored as-is, no compression, no approximation. Searching 1,000 ads typically takes well under a millisecond. |
| index.ntotal | The number of vectors currently stored. Zero on first run, 1,000 after full indexing. |
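
Conceptually, a flat index compares the query against every stored vector and keeps the k smallest distances. The pure-Python sketch below mimics that behaviour on toy 2D data. It illustrates the idea only and is not how FAISS is implemented, since FAISS runs optimised native code. `flat_search` is a hypothetical name. IndexFlatL2 reports squared Euclidean distances, so the sketch does the same.

```python
def squared_l2(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def flat_search(stored, query, k):
    """Exhaustive nearest-neighbour search: score every vector, keep the k best.

    Returns (distances, ids), where ids are positions in `stored`.
    """
    scored = sorted((squared_l2(vec, query), i) for i, vec in enumerate(stored))
    top = scored[:k]
    return [d for d, _ in top], [i for _, i in top]

stored = [[0.0, 0.0], [1.0, 0.0], [0.0, 2.0], [3.0, 3.0]]
distances, ids = flat_search(stored, query=[0.9, 0.1], k=2)
print(ids, distances)
```

Because every vector is scored, the result is exact; the cost grows linearly with the number of stored vectors, which is negligible at 1,000 ads.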

Expected terminal output (first run):

terminal (first run)
No index found. Creating IndexFlatL2...
Index size: 0 vectors | Metadata: 0 entries

Expected terminal output (subsequent runs):

terminal (subsequent runs)
Loading existing index...
Index size: 1000 vectors | Metadata: 1000 entries
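
Because the index and the metadata live in two separate files, it can be worth checking that they agree right after loading. The sketch below assumes the metadata layout this pipeline writes in Step 4 (filename mapped to a dict with a faiss_id key). `check_consistency` is a hypothetical helper; it takes the vector count as a plain integer so it can be tried without FAISS installed.

```python
def check_consistency(ntotal, processed):
    """Return a list of problems found; an empty list means the files agree.

    ntotal    -- number of vectors in the index (index.ntotal in the pipeline)
    processed -- metadata dict: filename -> {"faiss_id": int, "status": str}
    """
    problems = []
    if ntotal != len(processed):
        problems.append(
            f"index has {ntotal} vectors but metadata has {len(processed)} entries"
        )
    ids = sorted(entry["faiss_id"] for entry in processed.values())
    if ids != list(range(len(ids))):
        problems.append("faiss_id values are not the sequence 0..N-1")
    return problems

good = {
    "ad_0000.md":  {"faiss_id": 0, "status": "indexed"},
    "ad_0001.txt": {"faiss_id": 1, "status": "indexed"},
}
print(check_consistency(2, good))   # metadata matches a 2-vector index
print(check_consistency(3, good))   # index holds one vector more than metadata
```

If the check reports a problem, the safest recovery in this setup is probably to delete both files and rebuild.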

💡 Why not GPU? faiss-cpu is perfectly fast for 1,000 to 100,000 vectors. GPU FAISS typically only starts to show an advantage at millions of vectors, and it requires CUDA. CPU is the right choice here.

Step 3: The Incremental Logic

The entire "incremental" mechanism is just a set subtraction: all files − already processed = new files. If nothing is new, the script exits cleanly without burning CPU.

python
# 1. Collect every .txt and .md currently in the folder
all_files = (
    glob.glob(os.path.join(FOLDER, "*.txt")) +
    glob.glob(os.path.join(FOLDER, "*.md"))
)

# 2. Keep only files NOT already in our metadata dict  ← THE incremental step
new_files = [
    f for f in all_files
    if os.path.basename(f) not in processed
]

print(f"Total: {len(all_files)} | Processed: {len(processed)} | New: {len(new_files)}")

# 3. Early-exit if nothing to do
if not new_files:
    print("Nothing new. Index is up to date.")
    exit(0)   # exit code 0 = success for Windows Task Scheduler

| Line / expression | What it does and why |
| --- | --- |
| glob.glob("*.txt") + glob.glob("*.md") | Scans the folder for matching filenames. os.path.join() makes the path work on both Windows and Linux without changes. |
| os.path.basename(f) not in processed | Extracts just the filename (not the full path) and checks it against the metadata dict. Dict key lookup is O(1), so it stays fast even with 100,000 entries. |
| [f for f in all_files if ...] | List comprehension filtering. Produces only the files the script has not seen before. Equivalent to a filtered loop but more concise. |
| exit(0) | Exit code 0 = success. Windows Task Scheduler marks the run as successful. A non-zero code (or an exception) would appear as a failure in the task history. In a standalone script, sys.exit(0) is the more robust spelling, because exit() is a convenience helper added by the site module. |

terminal (when 50 new ads were dropped in)
Total: 1050 | Processed: 1000 | New: 50
terminal (when nothing changed)
Total: 1000 | Processed: 1000 | New: 0
Nothing new. Index is up to date.
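
📌 What about deleted files? This pipeline never removes vectors. IndexFlatL2 does offer remove_ids(), but removing a vector shifts the positions of all later ones, which would invalidate the faiss_id values stored in the metadata. If ads are removed, the simplest fix is to delete ad_bot.index and ad_metadata.json and re-run. For a ~1,000-file corpus this takes under 2 minutes on CPU.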
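
The same set arithmetic can also flag deleted ads: files on disk minus indexed names gives the new files, and indexed names minus files on disk gives the stale entries. A minimal sketch, run in a temporary directory so nothing in out_going/ is touched; `diff_folder` is a hypothetical helper, not part of the script above.

```python
import glob
import os
import tempfile

def diff_folder(folder, processed):
    """Compare files on disk with the metadata dict.

    Returns (new, missing): names on disk but not yet indexed, and names
    that are indexed but no longer on disk.
    """
    on_disk = {
        os.path.basename(p)
        for pattern in ("*.txt", "*.md")
        for p in glob.glob(os.path.join(folder, pattern))
    }
    indexed = set(processed)
    return sorted(on_disk - indexed), sorted(indexed - on_disk)

# Demo in a throwaway directory
with tempfile.TemporaryDirectory() as tmp:
    for name in ("ad_0000.md", "ad_0001.txt", "ad_0002.md"):
        with open(os.path.join(tmp, name), "w", encoding="utf-8") as f:
            f.write("demo")
    processed = {"ad_0000.md": {}, "ad_9999.txt": {}}
    new, missing = diff_folder(tmp, processed)

print(new, missing)
```

A non-empty `missing` list would be the signal to rebuild the index as described above.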

⚙ Step 4: Encode & Save to FAISS

Read each new file, encode all texts in a single batch call (much faster than one-by-one), add vectors to FAISS, then persist everything to disk.

python
# 1. Read all new files into memory
new_texts, new_names = [], []
for fp in new_files:
    with open(fp, "r", encoding="utf-8") as f:
        new_texts.append(f.read())
    new_names.append(os.path.basename(fp))

# 2. Batch encode: ALWAYS pass a list, not a single string
#    Returns numpy array shape (N, 384), dtype float32
vectors = model.encode(
    new_texts,
    batch_size=32,               # process 32 texts at a time
    show_progress_bar=True,      # tqdm progress bar in terminal
    convert_to_numpy=True        # FAISS requires numpy float32
)

# 3. Append vectors to the FAISS index (does not touch existing vectors)
index.add(vectors)              # index.ntotal increases by len(new_texts)

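# 4. Record each new file in metadata with its FAISS position (ID).
#    IDs are sequential, so the first new vector sits at (ntotal - number just added).
start_id = index.ntotal - len(new_names)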
for i, name in enumerate(new_names):
    processed[name] = {
        "faiss_id": start_id + i,  # integer position in the index
        "status":   "indexed"
    }

# 5. Persist both files to disk
faiss.write_index(index, INDEX_F)
with open(META_F, "w") as f:
    json.dump(processed, f, indent=4)
print(f"Saved. Index now holds {index.ntotal} vectors.")

| Line / expression | What it does and why |
| --- | --- |
| model.encode(list, batch_size=32) | Passing a list lets the model process 32 texts at once internally. This is typically 5 to 10x faster than calling encode(text) in a loop because operations are parallelised within each batch. |
| convert_to_numpy=True | FAISS requires a numpy.ndarray of dtype float32. This flag is already the default in encode(); spelling it out makes the return type explicit. Passing a PyTorch tensor to index.add() would raise an error. |
| index.add(vectors) | Appends vectors to the in-memory FAISS index without touching existing ones. IDs are assigned automatically as sequential integers from the current ntotal. |
| processed[name] = {"faiss_id": ...} | Maps the filename to its FAISS integer ID. Without this, a search returning ID 42 gives you no way to know which ad file it corresponds to. |
| faiss.write_index(index, INDEX_F) | Serialises the full index (all vectors + structure) to a compact binary file. Next run, faiss.read_index() restores it in milliseconds. |

terminal (first full run)
Batches: 100%|#####| 32/32 [00:48<00:00, 1.51s/it]
Saved. Index now holds 1000 vectors.
  ad_bot.index      1.5 MB
  ad_metadata.json   85 KB
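
A scheduled job can be interrupted mid-write (reboot, sleep, a killed task), which would leave a truncated JSON file. One common safeguard is to write to a temporary file and then swap it in with os.replace(). It is sketched here for the metadata file only. `save_metadata_atomic` is a hypothetical helper; the same idea could be applied to the index by writing it under a temporary name first.

```python
import json
import os
import tempfile

def save_metadata_atomic(processed, path):
    """Write JSON to a temp file in the same folder, then rename it over `path`."""
    folder = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=folder, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as f:
            json.dump(processed, f, indent=4)
        os.replace(tmp_path, path)   # swaps the new file in as a single step
    except BaseException:
        # Clean up the partial temp file, then let the error propagate
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise

# Demo in a throwaway directory
with tempfile.TemporaryDirectory() as tmp:
    target = os.path.join(tmp, "ad_metadata.json")
    save_metadata_atomic({"ad_0000.md": {"faiss_id": 0, "status": "indexed"}}, target)
    with open(target, encoding="utf-8") as f:
        reloaded = json.load(f)
    leftovers = [n for n in os.listdir(tmp) if n.endswith(".tmp")]

print(reloaded, leftovers)
```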

✅ What you have after this step: two files on disk, ad_bot.index and ad_metadata.json. Any Python script can now load these and find the top-5 most semantically similar ads to any query using index.search(query_vector, k=5), where query_vector is a float32 array of shape (1, 384) produced by the same model.
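
A search returns integer positions, so a downstream script needs to turn them back into filenames. A minimal sketch, assuming the metadata layout written in this step; the `ids` list stands in for one row of the ID array that a search would return, and `ids_to_filenames` is a hypothetical helper. FAISS pads results with -1 when fewer than k vectors are available, so those entries are skipped.

```python
def ids_to_filenames(ids, processed):
    """Map FAISS integer IDs back to ad filenames, skipping -1 placeholders."""
    id_to_name = {entry["faiss_id"]: name for name, entry in processed.items()}
    return [id_to_name[i] for i in ids if i != -1 and i in id_to_name]

processed = {
    "ad_0000.md":  {"faiss_id": 0, "status": "indexed"},
    "ad_0001.txt": {"faiss_id": 1, "status": "indexed"},
    "ad_0002.md":  {"faiss_id": 2, "status": "indexed"},
}
ids = [2, 0, -1]   # stand-in for one row of search results
print(ids_to_filenames(ids, processed))
```

Building the reverse dictionary once per script run keeps each lookup O(1), mirroring the filename check in Step 3.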

📊 Step 5: Visualise Ad Clusters with PCA

384-dimensional vectors cannot be plotted directly. PCA squashes them to 2D so you can visually verify that similar-category ads cluster together, a sanity check that the embeddings are working correctly.

⚠ Run this separately from the scheduled script: re-encoding all 1,000 files just for a plot wastes CPU. Call visualise_clusters() manually whenever you need a diagnostic report, not in the automated task.

python
def visualise_clusters(model):
    all_paths = (glob.glob(f"{FOLDER}/*.txt") +
                 glob.glob(f"{FOLDER}/*.md"))
    texts = []
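    for fp in all_paths:
        # Read as UTF-8 to match Step 4 (the Windows default encoding is locale-dependent)
        with open(fp, "r", encoding="utf-8") as f:
            texts.append(f.read())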

    embeddings = model.encode(texts, batch_size=64, show_progress_bar=True)

    # Reduce 384 dims to 2 dims for the scatter plot
    pca     = PCA(n_components=2)
    reduced = pca.fit_transform(embeddings)  # shape: (1000, 2)

    plt.figure(figsize=(10, 6))
    plt.scatter(reduced[:, 0], reduced[:, 1],
                alpha=0.5, c='#5B21B6', s=12)
    plt.title("Ad Clusters: PCA of 384-dim Embeddings")
    plt.xlabel("PC 1")
    plt.ylabel("PC 2")
    plt.grid(True, alpha=0.3)
    plt.tight_layout()
    plt.savefig("ad_clusters.png", dpi=150)
    print("Saved: ad_clusters.png")

| Line / expression | What it does and why |
| --- | --- |
| PCA(n_components=2) | Finds the 2 directions of greatest variance in 384D space and projects all vectors onto them. Relative distances (clusters) are preserved well enough for a visual sanity check. |
| fit_transform(embeddings) | fit computes the principal components; transform applies the projection. Calling both in one step is standard practice. |
| reduced[:, 0] / reduced[:, 1] | Slices the 2D array: column 0 = x-axis (PC1), column 1 = y-axis (PC2). With working embeddings, Tech ads should land away from Fashion ads; because every point is drawn in the same colour, this shows up as separate blobs rather than labelled groups. |
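
With a single colour, the plot cannot show which blob belongs to which category. If the ad files follow the dummy format from Step 1 (a line starting with "Category:"), the label can be parsed and mapped to a colour. The helper names and the palette below are assumptions made for illustration.

```python
# Hypothetical palette: one matplotlib colour name per dummy category
PALETTE = {
    "Tech": "tab:blue", "Fashion": "tab:pink", "Food": "tab:green",
    "Cars": "tab:red", "Travel": "tab:orange",
}

def parse_category(text, default="Unknown"):
    """Return the value of the first 'Category:' line, or `default`."""
    for line in text.splitlines():
        if line.startswith("Category:"):
            return line.split(":", 1)[1].strip() or default
    return default

def colours_for(texts):
    """One colour name per ad text; unknown categories fall back to grey."""
    return [PALETTE.get(parse_category(t), "grey") for t in texts]

sample = [
    "# Ad 0\nCategory: Tech\nBuy our Tech product! It is affordable.",
    "# Ad 1\nCategory: Fashion\nDiscover the best Fashion deals.",
    "no category line here",
]
print(colours_for(sample))
```

Inside visualise_clusters(), passing c=colours_for(texts) to plt.scatter() in place of the fixed hex colour would then colour each point by its category.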

🗓 Windows Task Scheduler Setup

Save the complete script as update_embeddings.py, then follow these steps.

1. Find your Python path

Open Command Prompt and run:

cmd
where python

Copy the full result, e.g. C:\Users\YourName\AppData\Local\Programs\Python\Python311\python.exe

2. Open Task Scheduler

Press Win + R, type taskschd.msc, press Enter. Click Create Basic Task… in the right panel.

3. Set name and trigger

Name: "Update Ad Bot Embeddings". Set trigger to Daily. After the wizard, right-click the task → Properties → Triggers → Edit → enable "Repeat task every 30 minutes", Duration: Indefinitely.

4. Configure the action

| Field | Value |
| --- | --- |
| Program/script | C:\...\python.exe (path from Step 1) |
| Add arguments | update_embeddings.py |
| Start in (critical) | C:\path\to\project\ (sets the working directory so the script finds out_going/) |

5. Credentials and verification

In Properties → General: select "Run whether user is logged on or not" and enter your Windows password. Then right-click → Run. Check Last Run Result: it must show (0x0), which means success.

✅ Verify it worked: after the first run, check that ad_bot.index and ad_metadata.json exist in your project folder with fresh modification timestamps.
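
A scheduled task has no visible terminal, so print() output is lost unless it is redirected. One option is to log to a file and return an explicit exit code. The sketch below is an assumption about how the script could be wrapped: `run_job` is a hypothetical helper, and the demo jobs are stand-ins for the real pipeline steps.

```python
import logging
import os
import tempfile

def run_job(job, log_path):
    """Run `job()`, log the outcome to a file, and return a process exit code."""
    logger = logging.getLogger("ad_bot")
    logger.setLevel(logging.INFO)
    handler = logging.FileHandler(log_path, encoding="utf-8")
    handler.setFormatter(logging.Formatter("%(asctime)s %(levelname)s %(message)s"))
    logger.addHandler(handler)
    try:
        job()
        logger.info("run finished")
        return 0          # Task Scheduler reports this as 0x0
    except Exception:
        logger.exception("run failed")   # also records the traceback
        return 1          # any non-zero code shows up as a failure
    finally:
        logger.removeHandler(handler)
        handler.close()

# Demo with stand-in jobs in a throwaway directory
with tempfile.TemporaryDirectory() as tmp:
    log_file = os.path.join(tmp, "demo.log")
    ok_code = run_job(lambda: None, log_file)
    fail_code = run_job(lambda: 1 / 0, log_file)
    with open(log_file, encoding="utf-8") as f:
        log_text = f.read()

print(ok_code, fail_code)
```

In the real script, the pipeline steps would live in a function, and the last line would pass the result of run_job() to sys.exit() so that Task Scheduler sees the code.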

🎯 Key Takeaways