🎯 What You'll Learn

📋 Before You Begin

1. The Automated Tagging Pipeline

To implement Semantic Induction, your bot should store more than raw text: it needs structured metadata that captures what each ad is for. Extracting fields such as target audience, ad channel, and campaign goal lets you narrow a FAISS search to the ads that match a request's context, not just its wording.

For example, when you ask for a "Facebook ad for a luxury watch," the bot will search for examples that are specifically tagged with platform: Facebook and luxury: true, rather than just generic "watch" ads.

Zero-Shot Extraction

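A small LLM (such as a 7B model or a lightweight API) performs "Zero-Shot Extraction" on each ad before it is indexed: the model receives the tagging prompt with no labeled examples and returns the tags as JSON. In the code below, extract_metadata returns a fixed, simulated response in place of a real model call.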

python
import json
import pandas as pd

TAGGING_PROMPT = """
Analyze the following ad copy and return a JSON object with:
1. 'platform': (e.g., Instagram, LinkedIn, Google Search)
2. 'target_audience': (e.g., Gen Z, Business Owners, Students)
3. 'tone': (e.g., Professional, Urgent, Emotional)
4. 'ad_type': (e.g., Product Launch, Discount, Brand Awareness)

Ad Copy: "{text}"
JSON: """

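def extract_metadata(ad_text):
    # Simulated stand-in for the LLM call: a real pipeline would send
    # TAGGING_PROMPT (with {text} filled in) to the model and parse the JSON reply.
    # This stub returns the same tags for every ad.
    simulated_response = {
        "platform": "Instagram",
        "target_audience": "Tech Enthusiasts",
        "tone": "Exciting",
        "ad_type": "Product Launch"
    }
    return simulated_response

def process_new_files(file_path):
    df = pd.read_csv(file_path)
    # Tag every ad in the 'ad_copy' column
    all_metadata = [extract_metadata(text) for text in df['ad_copy']]

    # Reuse df's index so the tag columns line up with the original rows
    df_enriched = pd.concat([df, pd.DataFrame(all_metadata, index=df.index)], axis=1)
    return df_enriched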
Output
# Returns DataFrame with original columns + platform, target_audience, tone, ad_type
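
Turning a real model's reply into tags is a separate step from building the prompt. A minimal sketch of that step, assuming the model returns its JSON somewhere inside a free-text reply. The helper names `build_tagging_prompt` and `parse_tag_response`, the shortened prompt template, and the sample reply are all hypothetical and not part of any library.

```python
import json

REQUIRED_KEYS = ("platform", "target_audience", "tone", "ad_type")

# Shortened copy of the tagging prompt so this sketch is self-contained
PROMPT_TEMPLATE = 'Analyze the following ad copy and return a JSON object.\nAd Copy: "{text}"\nJSON: '

def build_tagging_prompt(ad_text):
    # str.replace avoids str.format errors if the ad copy itself contains braces
    return PROMPT_TEMPLATE.replace("{text}", ad_text)

def parse_tag_response(raw_reply):
    """Parse the text between the first '{' and the last '}' as JSON.

    Returns a dict with the required tags, or None when the reply is unusable.
    """
    start = raw_reply.find("{")
    end = raw_reply.rfind("}")
    if start == -1 or end <= start:
        return None
    try:
        tags = json.loads(raw_reply[start:end + 1])
    except json.JSONDecodeError:
        return None
    if not isinstance(tags, dict) or any(key not in tags for key in REQUIRED_KEYS):
        return None
    # Keep only the expected keys so stray fields never reach the index
    return {key: tags[key] for key in REQUIRED_KEYS}

# Made-up reply, shaped like what a chat model might return
sample_reply = ('Sure! {"platform": "Instagram", "target_audience": "Gen Z", '
                '"tone": "Urgent", "ad_type": "Discount", "extra": 1}')
prompt = build_tagging_prompt("Flash sale {today only}!")
tags = parse_tag_response(sample_reply)
bad = parse_tag_response("I cannot help with that.")
print(tags)
print(bad)
```

Returning None for unusable replies lets the caller decide whether to retry the model or skip the ad, instead of indexing partial tags.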

2. Upgraded Retrieval Logic

Now that your metadata is enriched, your few-shot retrieval can use Filtered Search. This prevents the bot from giving a 'Google Search' example for an 'Instagram' request.

python
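def get_precise_few_shots(product_query, metadata, index, encoder, platform="Instagram", k=3):
    # Assumes `metadata` has 'id', 'text', and 'platform' columns, `index` is a FAISS
    # index keyed by those IDs, and `encoder` is the model that embedded the ads
    # 1. First, we filter the metadata for the specific platform
    filtered_metadata = metadata[metadata['platform'] == platform]
    if filtered_metadata.empty or index.ntotal == 0:
        return []

    # 2. Vector search within that subset: rank every stored ad, then keep
    #    the k nearest whose IDs passed the filter
    relevant_ids = set(filtered_metadata['id'].tolist())
    query_vec = encoder.encode([product_query]).astype('float32')
    _, ids = index.search(query_vec, index.ntotal)
    top_ids = [i for i in ids[0] if i in relevant_ids][:k]

    text_by_id = filtered_metadata.set_index('id')['text']
    examples = [text_by_id[i] for i in top_ids]
    return examples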
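
A filter-then-search step can also be sketched without FAISS by computing distances only for the ads whose tags match. This is a brute-force illustration for small collections, not a FAISS API: the function name `search_within_ids`, the toy 2-D vectors, and the ad records are invented, and a real system would use embeddings from an encoder.

```python
def squared_l2(a, b):
    # Squared Euclidean distance between two equal-length vectors
    return sum((x - y) ** 2 for x, y in zip(a, b))

def search_within_ids(query_vec, vectors_by_id, allowed_ids, k=3):
    """Return up to k (id, distance) pairs, nearest first, restricted to allowed_ids."""
    candidates = [
        (ad_id, squared_l2(query_vec, vectors_by_id[ad_id]))
        for ad_id in allowed_ids
        if ad_id in vectors_by_id
    ]
    candidates.sort(key=lambda pair: pair[1])
    return candidates[:k]

# Toy library: the Google Search ad is the closest vector overall
ads = [
    {"id": 0, "platform": "Instagram", "vector": [0.9, 0.1]},
    {"id": 1, "platform": "Google Search", "vector": [1.0, 0.0]},
    {"id": 2, "platform": "Instagram", "vector": [0.2, 0.8]},
]
vectors_by_id = {ad["id"]: ad["vector"] for ad in ads}

# Step 1: metadata filter. Step 2: distance search over the survivors only.
instagram_ids = [ad["id"] for ad in ads if ad["platform"] == "Instagram"]
hits = search_within_ids([1.0, 0.0], vectors_by_id, instagram_ids, k=2)
hit_ids = [ad_id for ad_id, _ in hits]
print(hit_ids)
```

Because the filter runs first, the Google Search ad is never scored, even though its vector is identical to the query.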

3. The Semantic Induction Workflow

This approach transforms your data from a "dump" into a structured "Ad Library."

| Stage | Process | Outcome |
|---|---|---|
| Ingestion | Scanning input files | Raw data is identified |
| Induction | LLM-based Tagging | Data labeled with Tone, Audience, Platform |
| Indexing | FAISS IndexIDMap | Vectors stored with unique IDs |
| Inference | Few-Shot Prompting | Bot writes copy using exact style needed |

Indian Context Advantage

For the Indian market, you can add language or region tags. This allows the bot to distinguish between:

  • A Mumbai-based Kirana shop (local, community-focused tone)
  • A tech startup in Bengaluru (global, "Silicon Valley" tone)
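
How region and language tags separate these two kinds of ads can be sketched with plain dictionaries. The ad texts, tag values, and the `filter_by_tags` helper below are made up for illustration, and the tag names are assumptions rather than a fixed schema.

```python
# Two invented ads, tagged the way the taxonomy might label them
ad_library = [
    {
        "text": "Apni dukaan ka hisaab ab phone pe. Mohalle ka bharosa, digital suvidha.",
        "region": "Mumbai",
        "language_style": "Hinglish",
        "tone": "Community",
    },
    {
        "text": "Ship faster with our developer-first payments API.",
        "region": "Bengaluru",
        "language_style": "Formal English",
        "tone": "Global",
    },
]

def filter_by_tags(ads, **required_tags):
    """Keep only the ads whose tags equal every requested value."""
    return [
        ad for ad in ads
        if all(ad.get(key) == value for key, value in required_tags.items())
    ]

kirana_examples = filter_by_tags(ad_library, region="Mumbai", language_style="Hinglish")
startup_examples = filter_by_tags(ad_library, region="Bengaluru")
print(kirana_examples[0]["tone"], startup_examples[0]["tone"])
```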

4. Full Indian-Context Indexing Script

This script builds an indexing and retrieval pipeline for Indian-context ad generation with a specialized taxonomy. LLM tagging is simulated with keyword rules so the script runs without a model.

Indian-AdBot Implementation
python
import numpy as np
import pandas as pd
import faiss
import os
import json
from sentence_transformers import SentenceTransformer

MODEL_NAME = 'all-MiniLM-L6-v2'
DIMENSION = 384
INDEX_PATH = "/kaggle/working/indian_ad_bot.faiss"
METADATA_PATH = "/kaggle/working/indian_ad_metadata.csv"

# Literal JSON braces are doubled so TAXONOMY_PROMPT.format(text=...) fills only {text}
TAXONOMY_PROMPT = """
Analyze this Indian advertisement copy. Return ONLY a JSON object with:
{{
  "region": "e.g., Pan-India, South India, Rural, Urban",
  "language_style": "e.g., Hinglish, Formal English, Vernacular-focused",
  "retail_format": "e.g., Kirana, Quick Commerce, Modern Trade, D2C",
  "festive_context": "e.g., Diwali, Holi, Wedding Season, None",
  "audience_segment": "e.g., Gen Z, Middle-class Families, Small Biz Owners"
}}
Ad Copy: "{text}"
"""

class IndianAdBot:
    def __init__(self):
        self.encoder = SentenceTransformer(MODEL_NAME)
        self.index = self._load_or_create_index()
        self.metadata = self._load_metadata()

    def _load_or_create_index(self):
        if os.path.exists(INDEX_PATH):
            return faiss.read_index(INDEX_PATH)
        return faiss.IndexIDMap(faiss.IndexFlatL2(DIMENSION))

    def _load_metadata(self):
        if os.path.exists(METADATA_PATH):
            return pd.read_csv(METADATA_PATH)
        return pd.DataFrame(columns=['id', 'text', 'region', 'language_style', 'retail_format', 'festive_context', 'audience_segment'])

    def _simulate_llm_tagging(self, text):
        # Keyword-based stand-in for the LLM call that would use TAXONOMY_PROMPT
        lowered = text.lower()
        words = lowered.split()
        is_kirana = "kirana" in lowered or "dukaan" in lowered
        return {
            "region": "Pan-India",
            # Match "de" as a whole word; a substring test also fires on "order" or "delivery"
            "language_style": "Hinglish" if "de" in words else "Formal English",
            "retail_format": "Kirana" if is_kirana else "Modern Trade",
            "festive_context": "Diwali" if "light" in lowered else "None",
            "audience_segment": "Small Biz Owners" if is_kirana else "Urban Youth"
        }

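    def add_ads_incrementally(self, new_texts):
        if len(new_texts) == 0:
            return

        new_entries = []
        # Continue numbering from the current index size (assumes no vectors were removed)
        start_id = self.index.ntotal
        
        embeddings = self.encoder.encode(new_texts).astype('float32')
        ids = np.arange(start_id, start_id + len(new_texts)).astype('int64')

        for i, text in enumerate(new_texts):
            tags = self._simulate_llm_tagging(text)
            tags.update({'id': int(ids[i]), 'text': text})
            new_entries.append(tags)

        # Store the vectors under explicit IDs and persist the index
        self.index.add_with_ids(embeddings, ids)
        faiss.write_index(self.index, INDEX_PATH)

        # Append the new rows to the metadata table and persist it alongside the index
        new_df = pd.DataFrame(new_entries, columns=self.metadata.columns)
        if self.metadata.empty:
            self.metadata = new_df
        else:
            self.metadata = pd.concat([self.metadata, new_df], ignore_index=True)
        self.metadata.to_csv(METADATA_PATH, index=False)
        
        print(f"Successfully indexed {len(new_texts)} new ads. Total: {self.index.ntotal}")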

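    def get_few_shot_examples(self, query, filter_dict=None, k=3):
        if self.index.ntotal == 0:
            return []
        query_vec = self.encoder.encode([query]).astype('float32')
        
        if filter_dict:
            # Build a boolean mask of rows that match every requested tag
            mask = pd.Series(True, index=self.metadata.index)
            for key, value in filter_dict.items():
                mask &= (self.metadata[key] == value)
            valid_ids = set(self.metadata.loc[mask, 'id'])
            
            if not valid_ids:
                return []
            
            # Rank every stored ad, then keep the k nearest that pass the filter.
            # Searching only k * 2 neighbours could miss valid ads ranked lower.
            _, indices = self.index.search(query_vec, self.index.ntotal)
            ranked_ids = [idx for idx in indices[0] if idx in valid_ids][:k]
        else:
            _, indices = self.index.search(query_vec, k)
            # FAISS pads the result with -1 when fewer than k vectors exist
            ranked_ids = [idx for idx in indices[0] if idx != -1]

        # Look texts up by ID so the results stay in nearest-first order
        text_by_id = self.metadata.set_index('id')['text']
        return [text_by_id[idx] for idx in ranked_ids]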

bot = IndianAdBot()

raw_ads = [
    "Upgrade your Kirana store with our digital ledger app. Simple and secure.",
    "This Diwali, bring home the brightest LED lights for your home. 20% off!",
    "Order milk and eggs in 10 minutes. The fastest delivery in Bangalore."
]

bot.add_ads_incrementally(raw_ads)

user_query = "A credit app for small shop owners in rural India."
examples = bot.get_few_shot_examples(user_query, filter_dict={"retail_format": "Kirana"})

print(f"\n--- FOUND {len(examples)} FEW-SHOT EXAMPLES ---")
for i, ex in enumerate(examples):
    print(f"Example {i+1}: {ex}")
Output
Successfully indexed 3 new ads. Total: 3

--- FOUND 1 FEW-SHOT EXAMPLES ---
Example 1: Upgrade your Kirana store with our digital ledger app. Simple and secure.

5. Data Validation with Pydantic

Use Pydantic to create a professional "quality gate" for your ad bot. This ensures that if an annotator forgets to include the "Retail Format" or uses an incorrect "Language Style," the script will reject the data before it can corrupt your FAISS index.

Pydantic Validation Model
python
from pydantic import BaseModel, Field, ValidationError
from typing import Optional, List
from enum import Enum

class LanguageStyle(str, Enum):
    HINGLISH = "Hinglish"
    FORMAL_ENGLISH = "Formal English"
    VERNACULAR = "Vernacular"

class RetailFormat(str, Enum):
    KIRANA = "Kirana"
    QUICK_COMMERCE = "Quick Commerce"
    MODERN_TRADE = "Modern Trade"
    D2C = "D2C"

class AdDataPoint(BaseModel):
    text: str = Field(..., min_length=10, description="The actual ad copy")
    language_style: LanguageStyle
    retail_format: RetailFormat
    region: str = Field(default="Pan-India")
    festive_context: Optional[str] = None
    audience_segment: List[str] = Field(default_factory=list)

def validate_and_format_for_faiss(raw_data: dict):
    try:
        validated_ad = AdDataPoint(**raw_data)
        print("Validation Successful")
        return validated_ad.model_dump()
    except ValidationError as e:
        print("Validation Failed!")
        print(e.json())
        return None

good_data = {
    "text": "Arre Sharma ji! Ledger book chhodo, phone pe hisab rakho.",
    "language_style": "Hinglish",
    "retail_format": "Kirana",
    "audience_segment": ["Small Business", "Rural"]
}

validate_and_format_for_faiss(good_data)
Output
Validation Successful
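
The rejection path can be sketched in the same way. The model definitions are repeated so the block runs on its own; `bad_data` and its values are invented for illustration, with one invalid Language Style and no Retail Format.

```python
from enum import Enum
from typing import List, Optional

from pydantic import BaseModel, Field, ValidationError

class LanguageStyle(str, Enum):
    HINGLISH = "Hinglish"
    FORMAL_ENGLISH = "Formal English"
    VERNACULAR = "Vernacular"

class RetailFormat(str, Enum):
    KIRANA = "Kirana"
    QUICK_COMMERCE = "Quick Commerce"
    MODERN_TRADE = "Modern Trade"
    D2C = "D2C"

class AdDataPoint(BaseModel):
    text: str = Field(..., min_length=10)
    language_style: LanguageStyle
    retail_format: RetailFormat
    region: str = Field(default="Pan-India")
    festive_context: Optional[str] = None
    audience_segment: List[str] = Field(default_factory=list)

bad_data = {
    "text": "Fresh groceries delivered to your door every morning.",
    "language_style": "Tanglish",  # not one of the allowed LanguageStyle values
    # "retail_format" is left out entirely
}

try:
    AdDataPoint(**bad_data)
    rejected_fields = {}
except ValidationError as e:
    # Map each offending field name to the error type Pydantic reports for it
    rejected_fields = {err["loc"][0]: err["type"] for err in e.errors()}

print(rejected_fields)
```

Both problems are reported in a single ValidationError, so an annotator can fix the whole row in one pass.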

Golden Rules for Annotators

  1. Enforce Core Intent: If an ad mentions "delivery in 10 mins," tag it as Quick Commerce.
  2. The "Sharma Ji" Rule: If the copy uses a character to build trust, add "Trust-based" to audience_segment.
  3. Boundary Examples: Show the team "Bad" vs. "Good" examples.
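
Rule 1 lends itself to an automated pre-check that flags suspicious annotations for review. A minimal sketch, assuming a simple regular expression is good enough to spot "delivery in N minutes" phrasing; the pattern and the `check_quick_commerce_rule` helper are hypothetical and would need tuning on real copy.

```python
import re

# Hypothetical pattern for phrases like "in 10 mins" or "10-minute delivery"
FAST_DELIVERY = re.compile(r"\b\d{1,2}\s*-?\s*(?:min|mins|minute|minutes)\b", re.IGNORECASE)

def check_quick_commerce_rule(text, retail_format):
    """Return a warning if the copy promises delivery within minutes
    but is not tagged 'Quick Commerce'; otherwise return None."""
    mentions_delivery = "deliver" in text.lower()
    if mentions_delivery and FAST_DELIVERY.search(text) and retail_format != "Quick Commerce":
        return f"Expected 'Quick Commerce' but found '{retail_format}'"
    return None

ad = "Order milk and eggs in 10 minutes. The fastest delivery in Bangalore."
warning = check_quick_commerce_rule(ad, "Modern Trade")
no_warning = check_quick_commerce_rule(ad, "Quick Commerce")
print(warning)
```

A check like this only flags rows for a human to review; it does not replace the annotator's judgment.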

6. Bulk Upload Pipeline

The Bulk Uploader is the last step between your team's manually curated CSV files and the bot's production index: it validates each row, then indexes the text of the rows that pass.

Bulk Validation & Indexing Code
python
import pandas as pd
import numpy as np
import faiss
from pydantic import ValidationError

class BulkAdUploader:
    def __init__(self, bot_instance):
        self.bot = bot_instance
        self.stats = {"success": 0, "failed": 0, "errors": []}

    def process_csv(self, file_path):
        df = pd.read_csv(file_path)
        valid_ads_text = []
        valid_metadata_list = []
        
        print(f"Starting Bulk Upload for {len(df)} rows...")

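        for index, row in df.iterrows():
            # Drop empty cells (NaN) so optional fields fall back to their defaults
            # and absent required fields are reported as missing
            raw_data = {key: value for key, value in row.to_dict().items() if pd.notna(value)}
            if isinstance(raw_data.get('audience_segment'), str):
                raw_data['audience_segment'] = [s.strip() for s in raw_data['audience_segment'].split(',')]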

            try:
                validated_data = AdDataPoint(**raw_data)
                valid_ads_text.append(validated_data.text)
                
                meta_row = validated_data.model_dump()
                valid_metadata_list.append(meta_row)
                
                self.stats["success"] += 1
            
            except ValidationError as e:
                self.stats["failed"] += 1
                self.stats["errors"].append({
                    "row_index": index,
                    "error": e.errors()[0]['msg'],
                    "text_preview": str(raw_data.get('text'))[:30]
                })

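        # add_ads_incrementally accepts only texts and tags them itself, so the
        # validated tags in valid_metadata_list are not written to the metadata store here
        if valid_ads_text:
            self.bot.add_ads_incrementally(valid_ads_text)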
        
        self._print_summary()

    def _print_summary(self):
        print("\n--- Upload Summary ---")
        print(f"Successfully Indexed: {self.stats['success']}")
        print(f"Failed/Rejected:      {self.stats['failed']}")
        
        if self.stats["errors"]:
            print("\n--- Error Log (Top 5) ---")
            for err in self.stats["errors"][:5]:
                print(f"Row {err['row_index']}: {err['error']} | '{err['text_preview']}...'")

my_bot = IndianAdBot()
uploader = BulkAdUploader(my_bot)
uploader.process_csv('/path/to/team_batch_march.csv')

| Feature | Implementation | Benefit |
|---|---|---|
| Atomic Validation | Pydantic Enum + min_length | Prevents "junk" data |
| Error Reporting | Detailed stats["errors"] | Team knows what to fix |
| Batch Processing | List-based collection | Optimized vectorization |
| Standardization | Auto string-to-list | Handles CSV limitations |
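
For error reporting, the collected entries can also be written to a file the team works through, not just printed. A minimal sketch using the standard csv module, assuming each entry has the same three keys as stats["errors"]; the `write_error_report` helper and the sample entries are made up for illustration.

```python
import csv
import io

def write_error_report(errors, out_file):
    """Write rejected rows to CSV and return how many were written."""
    writer = csv.DictWriter(out_file, fieldnames=["row_index", "error", "text_preview"])
    writer.writeheader()
    writer.writerows(errors)
    return len(errors)

# Invented entries in the same shape as BulkAdUploader.stats["errors"]
sample_errors = [
    {"row_index": 4, "error": "Field required", "text_preview": "Order milk and eggs in 10 minu"},
    {"row_index": 9, "error": "String should have at least 10 characters", "text_preview": "Sale!"},
]

# io.StringIO stands in for open("rejected_rows.csv", "w", newline="")
buffer = io.StringIO()
written = write_error_report(sample_errors, buffer)
report_lines = buffer.getvalue().splitlines()
print(report_lines[0])
```

Keeping row_index in the report lets annotators find the offending row in the original batch file.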

💡 Key Takeaways