In This Article

  1. The Reality Check: Common Problems
  2. Best Practices for Reliability
  3. Advanced Implementation
  4. Analysis of the New Workflow
  5. Summary of Best Practices

1. The Reality Check: Common Problems

In Part 1, we built a basic incremental indexer. However, in a real-world scenario—where humans are involved and systems crash—simple scripts often fail. Here are the three most common problems you will face:

A. The "Slow Loading" Bottleneck

As the company grows, the out_going folder might contain 10,000 or 100,000 ads. Loading a massive FAISS index from disk into RAM every time the Windows scheduled task runs can take seconds or even minutes. During this time, the CPU spikes and the bot is unresponsive.

B. Index Corruption (The "Broken State")

Imagine this sequence of events:

  1. Script loads Index.
  2. Script adds new vectors.
  3. System crashes/Power loss before the script saves the file to disk.

Result: The metadata (JSON) might say "Ad 1001 is processed," but the FAISS index file on disk does not contain Ad 1001. Your index is now inconsistent with your files. The ad is effectively lost until you do a full rebuild.
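One way to detect this broken state at startup is to compare the two stores before trusting either. The sketch below assumes that each processed ad contributes exactly one vector, so the vector count (index.ntotal in FAISS) should equal the number of metadata entries. The function name check_consistency and its return strings are illustrative and not part of the Part 1 code.

```python
def check_consistency(vector_count: int, metadata: dict) -> str:
    """Compare the number of stored vectors with the number of metadata entries.

    Assumes one vector per processed ad. `vector_count` would typically be
    `index.ntotal` from the loaded FAISS index.
    """
    recorded = len(metadata)
    if vector_count == recorded:
        return "consistent"
    if vector_count < recorded:
        # Metadata claims ads that never reached the index file: the crash
        # scenario described above. Those ads are missing from search.
        return "index_behind"
    # The index holds vectors the metadata does not know about; re-running
    # the indexer would add those ads a second time.
    return "metadata_behind"


# Example: metadata lists three ads, but only two vectors were saved.
metadata = {"ad_999.txt": {}, "ad_1000.txt": {}, "ad_1001.txt": {}}
print(check_consistency(2, metadata))
```

A result other than "consistent" is a signal to rebuild or repair before serving queries. What the repair looks like depends on how Part 1 stores its metadata.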

C. The "Human in the Loop" Latency

The company adds a new ad manually. The Windows Task Scheduler is set to run every hour.

Problem: The human adds the ad, but the bot won't "know" about it until the next scheduled run, which can be up to an hour away. That delay is unacceptable for a bot that must react to new ads quickly.

2. Best Practices for Reliability

To solve these problems, we need to upgrade our architecture with Atomic Operations and Event-Driven Updates.

Solution 1: Atomic Saving (Consistency)

We should never overwrite the main index file directly. Instead:

  1. Write the new index to a temporary file (e.g., ad_bot.index.tmp).
  2. Once the write is fully complete, use os.replace to rename the temp file to the real file.

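Why? When the temp file and the target are on the same file system, os.replace swaps them in one atomic step, so a reader sees either the old file or the new one, never a mixture. If the power cuts halfway through writing .tmp, the original ad_bot.index remains untouched and safe.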
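The same write-then-rename idea can be sketched for the JSON metadata using only the standard library. The helper name atomic_write_json is hypothetical. The fsync call is an extra precaution added here, not something the article's code does. For the FAISS file, the equivalent is to pass the temp path to faiss.write_index and then call os.replace.

```python
import json
import os
import tempfile


def atomic_write_json(data: dict, path: str) -> None:
    """Write `data` to `path` so readers see either the old or the new file."""
    directory = os.path.dirname(os.path.abspath(path))
    # Create the temp file in the target directory so the final rename
    # stays on one file system.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as f:
            json.dump(data, f)
            f.flush()
            os.fsync(f.fileno())  # push the bytes to disk before renaming
        os.replace(tmp_path, path)
    except BaseException:
        # Any failure before the rename leaves the original file untouched.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
```

If serialization fails partway through, only the temp file is affected and it is cleaned up, so the previous metadata file is still the one on disk.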

Solution 2: The Watchdog Pattern (Real-Time)

Instead of relying solely on a schedule, we run a lightweight "Observer" script in the background. This script watches the out_going folder.

This way, the knowledge base picks up each new ad within seconds of its creation instead of waiting for the next scheduled run.
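An observer only sees files created while it is running, so ads dropped into the folder during downtime would be missed. A catch-up scan at startup is one way to cover that gap. The sketch below is an assumption layered on the article's design: find_unindexed is a hypothetical helper, and it assumes metadata is keyed by filename.

```python
import os


def find_unindexed(folder: str, metadata: dict) -> list[str]:
    """Return paths of files in `folder` whose names are not in `metadata`."""
    pending = []
    for entry in sorted(os.scandir(folder), key=lambda e: e.name):
        # Skip subdirectories and leftover temp files.
        if (entry.is_file()
                and not entry.name.endswith(".tmp")
                and entry.name not in metadata):
            pending.append(entry.path)
    return pending
```

Each returned path could then be handed to the indexer before the watch loop begins, which keeps the hourly schedule as a fallback rather than the primary mechanism.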

3. Advanced Implementation

Below is the upgraded code. It introduces:

  1. AtomicIndexer class: Handles safe saving.
  2. AdFileHandler class: Uses the watchdog library to react to new files as soon as they appear.
  3. Error Handling: Ensures a bad file doesn't crash the whole loop.

The Python Code

python
import faiss
import json
import os
import time
from sentence_transformers import SentenceTransformer
from watchdog.observers import Observer
from watchdog.events import FileSystemEventHandler

class AtomicIndexer:
    """Keeps a FAISS index and its JSON metadata in sync on disk."""

    def __init__(self, index_path, metadata_path, model_name='all-MiniLM-L6-v2'):
        self.index_path = index_path
        self.metadata_path = metadata_path
        self.model = SentenceTransformer(model_name)
        self.dim = self.model.get_sentence_embedding_dimension()

        # Load the index once at startup; the long-running watcher reuses it in memory.
        if os.path.exists(index_path):
            self.index = faiss.read_index(index_path)
        else:
            self.index = faiss.IndexFlatL2(self.dim)

        # Metadata maps each processed filename to its indexing record.
        if os.path.exists(metadata_path):
            with open(metadata_path, 'r', encoding='utf-8') as f:
                self.metadata = json.load(f)
        else:
            self.metadata = {}

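    def add_file(self, file_path):
        """Embed one ad file and persist it. Returns True if the file was indexed."""
        filename = os.path.basename(file_path)
        # Skip files that were already processed, so repeated events are harmless.
        if filename in self.metadata:
            return False

        try:
            with open(file_path, 'r', encoding='utf-8') as f:
                text = f.read()

            # Encoding a one-item list yields a 2-D array with a single row.
            vector = self.model.encode([text])
            self.index.add(vector)
            self.metadata[filename] = {'status': 'indexed', 'size': len(text)}
            self._safe_save()
            print(f"[SUCCESS] Indexed: {filename}")
            return True

        except Exception as e:
            # A locked, unreadable, or malformed file is logged and skipped.
            print(f"[ERROR] Failed to index {filename}: {e}")
            return False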

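    def _safe_save(self):
        # Atomic index save: write to a temp file, then rename over the real one.
        temp_index = self.index_path + ".tmp"
        faiss.write_index(self.index, temp_index)
        os.replace(temp_index, self.index_path)

        # Atomic metadata save. The index is written first, so a crash between
        # the two renames leaves an extra vector rather than a missing one.
        temp_meta = self.metadata_path + ".tmp"
        with open(temp_meta, 'w') as f:
            json.dump(self.metadata, f)
        os.replace(temp_meta, self.metadata_path)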

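class AdFileHandler(FileSystemEventHandler):
    """Indexes each new file that appears in the watched folder."""

    def __init__(self, indexer):
        self.indexer = indexer

    def on_created(self, event):
        if not event.is_directory:
            # Brief pause so the writer can finish before we read the file.
            time.sleep(1)
            print(f"Detected: {event.src_path}")
            self.indexer.add_file(event.src_path)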

# Setup
indexer = AtomicIndexer("ad_bot.index", "ad_metadata.json")
event_handler = AdFileHandler(indexer)
os.makedirs('out_going', exist_ok=True)  # the observer needs an existing folder
observer = Observer()
observer.schedule(event_handler, path='out_going', recursive=False)
observer.start()

print("Watching for new ads...")
try:
    # Keep the main thread alive; events are handled on the observer thread.
    while True:
        time.sleep(1)
except KeyboardInterrupt:
    observer.stop()
observer.join()

Installation

pip install faiss-cpu sentence-transformers watchdog
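The handler above pauses for a fixed second before reading a new file, which may be too short for a large ad that is still being copied. One hedged alternative is to poll the file size until it stops changing. The helper wait_until_stable and its default values are illustrative assumptions, not part of the watchdog library.

```python
import os
import time


def wait_until_stable(path: str, checks: int = 3, interval: float = 0.2,
                      timeout: float = 10.0) -> bool:
    """Return True once the file size is unchanged for `checks` polls in a row."""
    deadline = time.monotonic() + timeout
    last_size = -1
    stable = 0
    while time.monotonic() < deadline:
        try:
            size = os.path.getsize(path)
        except OSError:
            # File vanished or is not accessible yet; start counting again.
            size = -1
        if size >= 0 and size == last_size:
            stable += 1
            if stable >= checks:
                return True
        else:
            stable = 0
        last_size = size
        time.sleep(interval)
    return False
```

In on_created, a call such as wait_until_stable(event.src_path) could stand in for time.sleep(1), with indexing skipped when it returns False. An unchanged size is only a heuristic that the writer has finished, so the try/except in add_file remains the safety net.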

4. Analysis of the New Workflow

The diagram below illustrates how we solved the consistency and latency issues.

[Figure: Robust Pipeline Visualization]

The robust pipeline includes these key improvements:

  1. Validation: File reading and embedding are wrapped in try/except, so a locked or unreadable file is logged and skipped instead of crashing the watcher.
  2. Atomic Save: Notice the step "Write .tmp -> Rename". This guarantees that ad_bot.index is never half-written.
  3. Event Trigger: The "New Ad Detected" box represents the Human-in-the-loop trigger, bypassing the need to wait for a scheduled task.
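The validation step can also be made explicit before any embedding work starts. The sketch below is one possible shape for such a check: validate_ad_file, its max_bytes limit, and the reason strings are all hypothetical, and it assumes ads are UTF-8 text files as in the indexer code.

```python
import os


def validate_ad_file(path: str, max_bytes: int = 1_000_000) -> tuple[bool, str]:
    """Return (ok, reason) describing whether `path` is safe to embed."""
    if not os.path.isfile(path):
        return False, "not a regular file"
    size = os.path.getsize(path)
    if size == 0:
        return False, "empty file"
    if size > max_bytes:
        return False, "file too large"
    try:
        with open(path, "r", encoding="utf-8") as f:
            text = f.read()
    except (OSError, UnicodeDecodeError) as exc:
        # Covers locked files, permission errors and non-UTF-8 content.
        return False, f"unreadable: {exc.__class__.__name__}"
    if not text.strip():
        return False, "whitespace only"
    return True, "ok"
```

Calling this at the top of add_file and returning early on a failed check would keep empty or binary files out of the index, while the existing try/except still catches anything unexpected.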

5. Summary of Best Practices

Problem | Old Approach | New Best Practice
Corruption | faiss.write_index(file) | Write to .tmp -> os.replace()
Latency | Check every 60 mins | Use watchdog to listen for OS events
Validation | None | try/except blocks around file reading
Consistency | Trust metadata | Check if file exists in metadata before adding

By implementing these changes, the Ad Bot's knowledge base becomes resilient to crashes, instantly responsive to human input, and consistent across reboots.