Semantic Induction for Indian Ad Generation

🎯 What You'll Learn

Implement a semantic induction pipeline that transforms raw ad data into structured, searchable content
Build an automated metadata tagging system using LLMs for platform, audience, tone, and ad type
Create filtered FAISS retrieval to get precise few-shot examples based on specific criteria
Apply Pydantic validation to enforce data quality standards and prevent corruption of your index
Build a bulk upload pipeline with detailed error reporting for team workflows

📋 Before You Begin

Basic Python programming knowledge
Familiarity with pandas DataFrames
Understanding of vector embeddings (conceptual level)
Python 3.9+ with pip installed

1. The Automated Tagging Pipeline

⚡ Intermediate ⏱ ~4 min

For production-quality semantic induction, your bot should store more than raw "text"; it needs to capture what each ad is for and who it speaks to. Extracting metadata such as platform, target audience, tone, and ad type lets you narrow a FAISS search to examples that actually match the request.

For example, when you ask for a "Facebook ad for a luxury watch," the bot can search for examples tagged with platform: Facebook (and, if your taxonomy includes one, a luxury tag), rather than generic "watch" ads.

Zero-Shot Extraction

We use a small LLM (like a 7B model or a lightweight API) to perform "Zero-Shot Extraction" on your files before they are indexed.

python

import json
import pandas as pd

TAGGING_PROMPT = """
Analyze the following ad copy and return a JSON object with:
1. 'platform': (e.g., Instagram, LinkedIn, Google Search)
2. 'target_audience': (e.g., Gen Z, Business Owners, Students)
3. 'tone': (e.g., Professional, Urgent, Emotional)
4. 'ad_type': (e.g., Product Launch, Discount, Brand Awareness)

Ad Copy: "{text}"
JSON: """

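def extract_metadata(ad_text):
    # Fill the template; a real pipeline would send `prompt` to the LLM
    # and parse the reply with json.loads()
    prompt = TAGGING_PROMPT.format(text=ad_text)
    # Placeholder response so the example runs without a model
    simulated_response = {
        "platform": "Instagram",
        "target_audience": "Tech Enthusiasts",
        "tone": "Exciting",
        "ad_type": "Product Launch"
    }
    return simulated_response

def process_new_files(file_path):
    df = pd.read_csv(file_path)  # expects an 'ad_copy' column
    # Tag every ad, keeping the results in row order
    all_metadata = [extract_metadata(text) for text in df['ad_copy']]

    # Reuse df's index so the tag columns line up row by row
    df_enriched = pd.concat([df, pd.DataFrame(all_metadata, index=df.index)], axis=1)
    return df_enriched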

Output

# Returns DataFrame with original columns + platform, target_audience, tone, ad_type
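
Once the simulated response is swapped for a real model call, the reply arrives as a string, and models often wrap the JSON in extra words. The sketch below shows one way to recover the four tags, assuming the reply contains a single flat JSON object. The helper `parse_tag_response`, the `REQUIRED_TAGS` tuple, and the "Unknown" fallback are illustrative choices, not part of the original pipeline.

```python
import json
import re

# The four tags requested by TAGGING_PROMPT
REQUIRED_TAGS = ("platform", "target_audience", "tone", "ad_type")

def parse_tag_response(raw_reply, fallback="Unknown"):
    """Extract the tag dictionary from a raw LLM reply (hypothetical helper)."""
    # Grab the first {...} span; assumes a flat object with no nested braces
    match = re.search(r"\{.*?\}", raw_reply, flags=re.DOTALL)
    try:
        parsed = json.loads(match.group(0)) if match else {}
    except json.JSONDecodeError:
        parsed = {}
    if not isinstance(parsed, dict):
        parsed = {}
    # Keep only the expected keys and fill any gaps with the fallback value
    return {tag: str(parsed.get(tag, fallback)) for tag in REQUIRED_TAGS}

reply = 'Sure! {"platform": "Instagram", "tone": "Urgent", "ad_type": "Discount"}'
print(parse_tag_response(reply))
```

Filling missing keys with a fallback keeps every row the same shape, so the enriched DataFrame never ends up with ragged columns.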

2. Upgraded Retrieval Logic

⚡ Intermediate ⏱ ~3 min

Now that your metadata is enriched, your few-shot retrieval can use Filtered Search. This prevents the bot from giving a 'Google Search' example for an 'Instagram' request.

python

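def get_precise_few_shots(product_query, metadata, index, encoder, platform="Instagram", k=3):
    # 1. First, we filter the metadata for the specific platform
    filtered_metadata = metadata[metadata['platform'] == platform]
    relevant_ids = set(filtered_metadata['id'].tolist())
    if not relevant_ids:
        return []

    # 2. Rank the whole index, then keep only hits whose ID passed the filter
    query_vec = encoder.encode([product_query]).astype('float32')
    _, indices = index.search(query_vec, index.ntotal)
    hits = [i for i in indices[0] if i in relevant_ids][:k]

    # 3. Return the ad copy in ranked order (assumes 'id' and 'ad_copy' columns)
    copy_by_id = filtered_metadata.set_index('id')['ad_copy']
    return [copy_by_id.loc[i] for i in hits]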
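
The filter-then-search idea can be seen in isolation with a toy example. The sketch below is a pure-Python stand-in for FAISS: the `ADS` list, its two-dimensional "embeddings", and the `filtered_nearest` helper are all made up for illustration.

```python
# Toy library: each ad has an id, a platform tag, and a made-up 2-D embedding
ADS = [
    {"id": 0, "platform": "Instagram", "text": "Reel-ready sneakers", "vec": (0.9, 0.1)},
    {"id": 1, "platform": "Google Search", "text": "Buy sneakers online", "vec": (0.8, 0.2)},
    {"id": 2, "platform": "Instagram", "text": "Festive kurta drop", "vec": (0.1, 0.9)},
]

def squared_l2(a, b):
    """Squared Euclidean distance, the metric IndexFlatL2 ranks by."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def filtered_nearest(query_vec, platform, k=3):
    # 1. Metadata filter: keep only ads tagged with the requested platform
    candidates = [ad for ad in ADS if ad["platform"] == platform]
    # 2. Vector search restricted to that subset, closest first
    candidates.sort(key=lambda ad: squared_l2(ad["vec"], query_vec))
    return [ad["text"] for ad in candidates[:k]]

print(filtered_nearest((0.85, 0.15), platform="Instagram", k=1))
```

The Google Search ad sits just as close to the query as the first Instagram ad, yet it can never be returned for an Instagram request because the filter removes it before ranking.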

3. The Semantic Induction Workflow

⚡ Beginner ⏱ ~2 min

This approach transforms your data from a "dump" into a structured "Ad Library."

Stage	Process	Outcome
Ingestion	Scanning input files	Raw data is identified
Induction	LLM-based Tagging	Data labeled with Tone, Audience, Platform
Indexing	FAISS IndexIDMap	Vectors stored with unique IDs
Inference	Few-Shot Prompting	Bot writes copy using exact style needed

Indian Context Advantage

For the Indian market, you can add language or region tags. This allows the bot to distinguish between:

A Mumbai-based Kirana shop (local, community-focused tone)
A tech startup in Bengaluru (global, "Silicon Valley" tone)

4. Full Indian-Context Indexing Script

⚡ Advanced ⏱ ~5 min

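This script builds the full pipeline for Indian-context ad retrieval: it tags each ad with an India-specific taxonomy (simulated here with keyword rules), embeds the text, stores the vectors in a FAISS index, and saves both the index and the metadata to disk.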

Indian-AdBot Implementation

python

import numpy as np
import pandas as pd
import faiss
import os
import json
from sentence_transformers import SentenceTransformer

MODEL_NAME = 'all-MiniLM-L6-v2'
DIMENSION = 384
INDEX_PATH = "/kaggle/working/indian_ad_bot.faiss"
METADATA_PATH = "/kaggle/working/indian_ad_metadata.csv"

# Literal JSON braces are doubled so that TAXONOMY_PROMPT.format(text=...) fills only {text}
TAXONOMY_PROMPT = """
Analyze this Indian advertisement copy. Return ONLY a JSON object with:
{{
  "region": "e.g., Pan-India, South India, Rural, Urban",
  "language_style": "e.g., Hinglish, Formal English, Vernacular-focused",
  "retail_format": "e.g., Kirana, Quick Commerce, Modern Trade, D2C",
  "festive_context": "e.g., Diwali, Holi, Wedding Season, None",
  "audience_segment": "e.g., Gen Z, Middle-class Families, Small Biz Owners"
}}
Ad Copy: "{text}"
"""

class IndianAdBot:
    def __init__(self):
        self.encoder = SentenceTransformer(MODEL_NAME)
        self.index = self._load_or_create_index()
        self.metadata = self._load_metadata()

    def _load_or_create_index(self):
        if os.path.exists(INDEX_PATH):
            return faiss.read_index(INDEX_PATH)
        return faiss.IndexIDMap(faiss.IndexFlatL2(DIMENSION))

    def _load_metadata(self):
        if os.path.exists(METADATA_PATH):
            return pd.read_csv(METADATA_PATH)
        return pd.DataFrame(columns=['id', 'text', 'region', 'language_style', 'retail_format', 'festive_context', 'audience_segment'])

    def _simulate_llm_tagging(self, text):
        # Keyword stand-in for the LLM call, so the script runs without a model
        lowered = text.lower()
        is_kirana = "kirana" in lowered or "dukaan" in lowered
        return {
            "region": "Pan-India",
            # Match "de" as a whole word; a substring test also fires on "delivery" or "Upgrade"
            "language_style": "Hinglish" if "de" in lowered.split() else "Formal English",
            "retail_format": "Kirana" if is_kirana else "Modern Trade",
            "festive_context": "Diwali" if "light" in lowered else "None",
            "audience_segment": "Small Biz Owners" if is_kirana else "Urban Youth"
        }

    def add_ads_incrementally(self, new_texts):
        new_entries = []
        start_id = self.index.ntotal
        
        embeddings = self.encoder.encode(new_texts).astype('float32')
        ids = np.arange(start_id, start_id + len(new_texts)).astype('int64')

        for i, text in enumerate(new_texts):
            tags = self._simulate_llm_tagging(text)
            tags.update({'id': ids[i], 'text': text})
            new_entries.append(tags)

        self.index.add_with_ids(embeddings, ids)
        faiss.write_index(self.index, INDEX_PATH)

        new_df = pd.DataFrame(new_entries)
        self.metadata = pd.concat([self.metadata, new_df], ignore_index=True)
        self.metadata.to_csv(METADATA_PATH, index=False)
        
        print(f"Successfully indexed {len(new_texts)} new ads. Total: {self.index.ntotal}")

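    def get_few_shot_examples(self, query, filter_dict=None, k=3):
        if self.index.ntotal == 0:
            return []
        query_vec = self.encoder.encode([query]).astype('float32')

        if filter_dict:
            # Build a boolean mask of rows that match every requested tag
            mask = pd.Series(True, index=self.metadata.index)
            for key, value in filter_dict.items():
                mask &= (self.metadata[key] == value)
            valid_ids = set(self.metadata.loc[mask, 'id'])

            if not valid_ids:
                return []

            # Rank the whole index, then keep the k nearest hits that pass the filter.
            # Searching only k * 2 neighbours can miss matches that rank lower.
            _, indices = self.index.search(query_vec, self.index.ntotal)
            ranked_ids = [idx for idx in indices[0] if idx in valid_ids][:k]
        else:
            _, indices = self.index.search(query_vec, k)
            # FAISS pads the result with -1 when the index holds fewer than k vectors
            ranked_ids = [idx for idx in indices[0] if idx != -1]

        # Look up texts in ranked order; isin() would return them in metadata order
        text_by_id = self.metadata.set_index('id')['text']
        return [text_by_id.loc[idx] for idx in ranked_ids]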

bot = IndianAdBot()

raw_ads = [
    "Upgrade your Kirana store with our digital ledger app. Simple and secure.",
    "This Diwali, bring home the brightest LED lights for your home. 20% off!",
    "Order milk and eggs in 10 minutes. The fastest delivery in Bangalore."
]

bot.add_ads_incrementally(raw_ads)

user_query = "A credit app for small shop owners in rural India."
examples = bot.get_few_shot_examples(user_query, filter_dict={"retail_format": "Kirana"})

print(f"\n--- FOUND {len(examples)} FEW-SHOT EXAMPLES ---")
for i, ex in enumerate(examples):
    print(f"Example {i+1}: {ex}")

Output

Successfully indexed 3 new ads. Total: 3

--- FOUND 1 FEW-SHOT EXAMPLES ---
Example 1: Upgrade your Kirana store with our digital ledger app. Simple and secure.
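
The retrieved examples still have to reach the generator, which is the few-shot prompting stage of the workflow. The sketch below shows one way to assemble them into a prompt. The helper `build_few_shot_prompt` and the prompt wording are assumptions for illustration, not a format the pipeline prescribes.

```python
def build_few_shot_prompt(product_brief, examples):
    """Assemble a generation prompt from retrieved ad examples (hypothetical helper)."""
    if examples:
        lines = ["Write one ad in the same style as the examples below.", ""]
        lines += [f"Example {i}: {ex}" for i, ex in enumerate(examples, start=1)]
        lines.append("")
    else:
        # With nothing retrieved, fall back to a plain zero-shot request
        lines = ["Write one ad for the brief below.", ""]
    lines += [f"Brief: {product_brief}", "Ad:"]
    return "\n".join(lines)

examples = ["Upgrade your Kirana store with our digital ledger app. Simple and secure."]
print(build_few_shot_prompt("A credit app for small shop owners in rural India.", examples))
```

Handling the empty case explicitly matters because a strict filter can legitimately return no examples.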

5. Data Validation with Pydantic

⚡ Intermediate ⏱ ~4 min

Use Pydantic to create a professional "quality gate" for your ad bot. This ensures that if an annotator forgets to include the "Retail Format" or uses an incorrect "Language Style," the script will reject the data before it can corrupt your FAISS index.

Pydantic Validation Model

python

from pydantic import BaseModel, Field, ValidationError
from typing import Optional, List
from enum import Enum

class LanguageStyle(str, Enum):
    HINGLISH = "Hinglish"
    FORMAL_ENGLISH = "Formal English"
    VERNACULAR = "Vernacular"

class RetailFormat(str, Enum):
    KIRANA = "Kirana"
    QUICK_COMMERCE = "Quick Commerce"
    MODERN_TRADE = "Modern Trade"
    D2C = "D2C"

class AdDataPoint(BaseModel):
    text: str = Field(..., min_length=10, description="The actual ad copy")
    language_style: LanguageStyle
    retail_format: RetailFormat
    region: str = Field(default="Pan-India")
    festive_context: Optional[str] = None
    audience_segment: List[str] = Field(default_factory=list)

def validate_and_format_for_faiss(raw_data: dict):
    try:
        validated_ad = AdDataPoint(**raw_data)
        print("Validation Successful")
        return validated_ad.model_dump()
    except ValidationError as e:
        print("Validation Failed!")
        print(e.json())
        return None

good_data = {
    "text": "Arre Sharma ji! Ledger book chhodo, phone pe hisab rakho.",
    "language_style": "Hinglish",
    "retail_format": "Kirana",
    "audience_segment": ["Small Business", "Rural"]
}

validate_and_format_for_faiss(good_data)

Output

Validation Successful
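
The quality gate is easiest to appreciate with a row that fails. The sketch below assumes Pydantic v2 and redefines a trimmed copy of the model so that it runs on its own. `MiniAd` and `bad_data` are made up for illustration.

```python
from enum import Enum
from pydantic import BaseModel, Field, ValidationError

class RetailFormat(str, Enum):
    KIRANA = "Kirana"
    QUICK_COMMERCE = "Quick Commerce"

class MiniAd(BaseModel):
    # Trimmed stand-in for AdDataPoint with just two validated fields
    text: str = Field(..., min_length=10)
    retail_format: RetailFormat

# "Corner Shop" is not in the enum and the text is shorter than 10 characters
bad_data = {"text": "Sale!", "retail_format": "Corner Shop"}

failed_fields = []
try:
    MiniAd(**bad_data)
except ValidationError as e:
    # Each error records the field path in 'loc' and a readable 'msg'
    for err in e.errors():
        failed_fields.append(err["loc"][0])
        print(f"{err['loc'][0]}: {err['msg']}")

print(failed_fields)
```

Both problems are reported in a single pass, so an annotator can fix the whole row at once instead of resubmitting it once per error.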

Golden Rules for Annotators

Enforce Core Intent: If an ad mentions "delivery in 10 mins," tag it as Quick Commerce
The "Sharma Ji" Rule: If copy uses a character to build trust, add "Trust-based" in audience_segment
Boundary Examples: Show the team "Bad" vs. "Good" examples for borderline cases
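
Rules like the first two can be partly automated as a pre-check that suggests tags for the annotator to confirm. The sketch below is one possible approach. The regular expressions and the `suggest_tags` helper are hypothetical and would need tuning against real ad copy.

```python
import re

# Hypothetical patterns; extend them as the team's guidelines evolve
QUICK_DELIVERY = re.compile(r"\b\d+\s*(?:min|mins|minutes)\b", re.IGNORECASE)
TRUSTED_CHARACTER = re.compile(r"\b\w+\s+ji\b", re.IGNORECASE)

def suggest_tags(ad_text):
    """Return tag suggestions for an annotator to confirm, not final labels."""
    suggestions = {}
    # Core-intent rule: a delivery-time promise points to Quick Commerce
    if "deliver" in ad_text.lower() and QUICK_DELIVERY.search(ad_text):
        suggestions["retail_format"] = "Quick Commerce"
    # "Sharma Ji" rule: a character addressed with an honorific signals trust-building
    if TRUSTED_CHARACTER.search(ad_text):
        suggestions["audience_segment"] = ["Trust-based"]
    return suggestions

print(suggest_tags("Order milk and eggs in 10 minutes. The fastest delivery in Bangalore."))
print(suggest_tags("Arre Sharma ji! Ledger book chhodo, phone pe hisab rakho."))
```

Treating the output as suggestions keeps the human in charge of edge cases that keyword rules cannot judge.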

6. Bulk Upload Pipeline

⚡ Advanced ⏱ ~3 min

The Bulk Uploader connects your team's manual curation to the bot's index: it validates every row of a CSV, indexes the rows that pass, and reports the rows that fail.

Bulk Validation & Indexing Code

python

import pandas as pd
import numpy as np
import faiss
from pydantic import ValidationError

class BulkAdUploader:
    def __init__(self, bot_instance):
        self.bot = bot_instance
        self.stats = {"success": 0, "failed": 0, "errors": []}

    def process_csv(self, file_path):
        df = pd.read_csv(file_path)
        valid_ads_text = []
        valid_metadata_list = []
        
        print(f"Starting Bulk Upload for {len(df)} rows...")

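        for index, row in df.iterrows():
            # Drop empty cells (NaN) so optional fields fall back to their defaults
            raw_data = {k: v for k, v in row.items() if pd.notna(v)}
            if isinstance(raw_data.get('audience_segment'), str):
                raw_data['audience_segment'] = [s.strip() for s in raw_data['audience_segment'].split(',')]

            try:
                validated_data = AdDataPoint(**raw_data)
                valid_ads_text.append(validated_data.text)

                # mode="json" turns enum members into plain strings for the CSV
                meta_row = validated_data.model_dump(mode="json")
                meta_row['audience_segment'] = ", ".join(meta_row['audience_segment'])
                valid_metadata_list.append(meta_row)

                self.stats["success"] += 1

            except ValidationError as e:
                first = e.errors()[0]
                self.stats["failed"] += 1
                self.stats["errors"].append({
                    "row_index": index,
                    # Name the offending field alongside the message
                    "error": f"{'.'.join(map(str, first['loc']))}: {first['msg']}",
                    "text_preview": str(raw_data.get('text'))[:30]
                })

        if valid_ads_text:
            # Index the validated rows directly so the curated tags are stored.
            # bot.add_ads_incrementally() would re-tag the text and discard them.
            start_id = self.bot.index.ntotal
            ids = np.arange(start_id, start_id + len(valid_ads_text)).astype('int64')
            embeddings = self.bot.encoder.encode(valid_ads_text).astype('float32')
            self.bot.index.add_with_ids(embeddings, ids)
            faiss.write_index(self.bot.index, INDEX_PATH)

            new_df = pd.DataFrame(valid_metadata_list).assign(id=ids)
            self.bot.metadata = pd.concat([self.bot.metadata, new_df], ignore_index=True)
            self.bot.metadata.to_csv(METADATA_PATH, index=False)

        self._print_summary()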

    def _print_summary(self):
        print("\n--- Upload Summary ---")
        print(f"Successfully Indexed: {self.stats['success']}")
        print(f"Failed/Rejected:      {self.stats['failed']}")
        
        if self.stats["errors"]:
            print("\n--- Error Log (Top 5) ---")
            for err in self.stats["errors"][:5]:
                print(f"Row {err['row_index']}: {err['error']} | '{err['text_preview']}...'")

my_bot = IndianAdBot()
uploader = BulkAdUploader(my_bot)
uploader.process_csv('/path/to/team_batch_march.csv')

Feature	Implementation	Benefit
Atomic Validation	Pydantic Enum + min_length	Prevents "junk" data
Error Reporting	Detailed stats["errors"]	Team knows what to fix
Batch Processing	List-based collection	Optimized vectorization
Standardization	Auto string-to-list	Handles CSV limitations
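
The error log can also be grouped to show which rule rejects the most rows. The sketch below assumes entries shaped like those stored in `stats["errors"]`. The `summarize_errors` helper and the sample entries are made up for illustration.

```python
from collections import Counter

def summarize_errors(errors):
    """Count rejected rows per error message, most frequent first."""
    return Counter(entry["error"] for entry in errors).most_common()

# Made-up entries in the same shape as BulkAdUploader.stats["errors"]
sample_errors = [
    {"row_index": 2, "error": "Field required", "text_preview": "Big sale"},
    {"row_index": 5, "error": "Field required", "text_preview": "New launch"},
    {"row_index": 7, "error": "String should have at least 10 characters", "text_preview": "Hi"},
]

for message, count in summarize_errors(sample_errors):
    print(f"{count} x {message}")
```

A message that dominates this summary usually points to a confusing column or an allowed value that is missing from the taxonomy, rather than to careless annotators.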

💡 Key Takeaways

Context Preservation: Tagging for "Retail Format" keeps the bot from using a corporate tone for local products
Incremental Persistence: Saving the FAISS index and the metadata CSV to disk lets the bot's library grow across sessions
Semantic Search: SentenceTransformer embeddings can surface "Kirana" ads for queries about "shop owners," even without shared keywords
Data Quality Loop: A high validation failure rate suggests the taxonomy or the annotator guidelines need refinement
Next Steps: Create a template CSV generator to minimize team errors
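
As a starting point for that next step, the sketch below generates a template CSV with the standard library. It assumes the columns mirror the AdDataPoint fields. `TEMPLATE_COLUMNS`, `make_template_csv`, and the tags in the sample row are illustrative.

```python
import csv
import io

# Column order mirrors the AdDataPoint fields
TEMPLATE_COLUMNS = ["text", "language_style", "retail_format",
                    "region", "festive_context", "audience_segment"]

def make_template_csv():
    """Return template CSV text: a header row plus one filled-in sample row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=TEMPLATE_COLUMNS, lineterminator="\n")
    writer.writeheader()
    writer.writerow({
        "text": "This Diwali, bring home the brightest LED lights for your home. 20% off!",
        "language_style": "Formal English",
        "retail_format": "Modern Trade",
        "region": "Pan-India",
        "festive_context": "Diwali",
        # Comma-separated, matching the string-to-list conversion in the uploader
        "audience_segment": "Middle-class Families, Urban",
    })
    return buffer.getvalue()

print(make_template_csv())
```

The csv module quotes any cell that contains a comma, so the sample ad copy and the audience list survive a round trip through a spreadsheet.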