🎯 What You'll Learn
- Implement a semantic induction pipeline that transforms raw ad data into structured, searchable content
- Build an automated metadata tagging system using LLMs for platform, audience, tone, and ad type
- Create filtered FAISS retrieval to get precise few-shot examples based on specific criteria
- Apply Pydantic validation to enforce data quality standards and prevent corruption of your index
- Build a bulk upload pipeline with detailed error reporting for team workflows
📋 Before You Begin
- Basic Python programming knowledge
- Familiarity with pandas DataFrames
- Understanding of vector embeddings (conceptual level)
- Python 3.9+ with pip installed
1. The Automated Tagging Pipeline
For industry-grade Semantic Induction, your bot shouldn't just store "text"; it needs to understand the "soul" of each ad. Once every ad carries metadata such as platform, target audience, tone, and ad type, your FAISS search becomes surgical.
For example, when you ask for a "Facebook ad for a luxury watch," the bot searches only examples tagged platform: Facebook (plus a luxury tag, if your taxonomy defines one), rather than generic "watch" ads.
Zero-Shot Extraction
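We use a small LLM (like a 7B model or a lightweight API) to perform "Zero-Shot Extraction" on your files before they are indexed: the model sees only the instructions and the ad copy, with no labeled examples. In the code below, extract_metadata returns a fixed, simulated response so the pipeline runs without a model; swap in a real LLM call for production.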
import json
import pandas as pd

TAGGING_PROMPT = """
Analyze the following ad copy and return a JSON object with:
1. 'platform': (e.g., Instagram, LinkedIn, Google Search)
2. 'target_audience': (e.g., Gen Z, Business Owners, Students)
3. 'tone': (e.g., Professional, Urgent, Emotional)
4. 'ad_type': (e.g., Product Launch, Discount, Brand Awareness)
Ad Copy: "{text}"
JSON: """

def extract_metadata(ad_text):
    """Return tag metadata for one ad. Simulated: a real version would send
    TAGGING_PROMPT.format(text=ad_text) to an LLM and json.loads() the reply."""
    simulated_response = {
        "platform": "Instagram",
        "target_audience": "Tech Enthusiasts",
        "tone": "Exciting",
        "ad_type": "Product Launch"
    }
    return simulated_response

def process_new_files(file_path):
    """Return the CSV's original columns plus platform, target_audience, tone, ad_type."""
    df = pd.read_csv(file_path)
    # Tag every ad in the 'ad_copy' column
    all_metadata = [extract_metadata(text) for text in df['ad_copy']]
    # Reuse df's index so the tag columns line up with the original rows
    df_enriched = pd.concat([df, pd.DataFrame(all_metadata, index=df.index)], axis=1)
    return df_enriched
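Once the simulated response is replaced by a real model, the reply will not always be clean JSON; models often wrap it in extra text or drop a key. The following is a minimal sketch of a defensive parser. The helper name parse_tag_response and the REQUIRED_KEYS tuple are illustrative assumptions, not part of the original pipeline.

```python
import json

# The four tags requested by TAGGING_PROMPT
REQUIRED_KEYS = ("platform", "target_audience", "tone", "ad_type")

def parse_tag_response(raw_reply):
    """Hypothetical helper: parse the JSON object found between the
    first '{' and the last '}' of an LLM reply. Returns None on failure."""
    start = raw_reply.find("{")
    end = raw_reply.rfind("}")
    if start == -1 or end <= start:
        return None
    try:
        data = json.loads(raw_reply[start:end + 1])
    except json.JSONDecodeError:
        return None
    if not isinstance(data, dict):
        return None
    # Reject replies that omit any of the tags the prompt asks for
    if any(key not in data for key in REQUIRED_KEYS):
        return None
    return {key: str(data[key]).strip() for key in REQUIRED_KEYS}

reply = 'Sure! {"platform": "Instagram", "target_audience": "Gen Z", "tone": "Urgent", "ad_type": "Discount"}'
print(parse_tag_response(reply))
print(parse_tag_response("I could not analyze this ad."))
```

Returning None instead of raising lets the caller decide whether to retry the model or skip the ad.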
2. Upgraded Retrieval Logic
Now that your metadata is enriched, your few-shot retrieval can use Filtered Search. This prevents the bot from giving a 'Google Search' example for an 'Instagram' request.
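def get_precise_few_shots(product_query, metadata, index, encoder, platform="Instagram", k=3):
    # metadata: DataFrame with 'id', 'ad_copy', 'platform'; index: FAISS index keyed by those ids
    # 1. First, we filter the metadata for the specific platform
    filtered_metadata = metadata[metadata['platform'] == platform]
    relevant_ids = set(filtered_metadata['id'].tolist())
    if not relevant_ids:
        return []
    # 2. Rank the whole index, then keep the k nearest hits inside that subset
    query_vec = encoder.encode([product_query]).astype('float32')
    _, hit_ids = index.search(query_vec, index.ntotal)
    top_ids = [i for i in hit_ids[0] if i in relevant_ids][:k]
    # 3. Return the matching ad copy, nearest first
    copy_by_id = filtered_metadata.set_index('id')['ad_copy']
    return [copy_by_id.loc[i] for i in top_ids]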
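To see the filter-then-rank idea without installing FAISS, here is a toy sketch that uses plain Python lists in place of a vector index. The ads, the 2-D "embeddings", and the function name filtered_search are all made up for illustration; a real index would hold 384-dimensional vectors.

```python
import math

# Toy library: (id, platform, embedding, ad copy). All values are invented.
ADS = [
    (0, "Instagram", (0.9, 0.1), "Drop alert: the new sneaker lands Friday."),
    (1, "Google Search", (0.8, 0.2), "Buy running shoes online. Free shipping."),
    (2, "Instagram", (0.1, 0.9), "Your skincare routine, simplified."),
    (3, "LinkedIn", (0.85, 0.15), "Equip your team with ergonomic footwear."),
]

def filtered_search(query_vec, platform, k=3):
    """Filter by platform first, then rank the survivors by L2 distance."""
    candidates = [ad for ad in ADS if ad[1] == platform]
    ranked = sorted(candidates, key=lambda ad: math.dist(query_vec, ad[2]))
    return [(ad[0], ad[3]) for ad in ranked[:k]]

results = filtered_search((0.88, 0.12), "Instagram", k=2)
for ad_id, copy in results:
    print(ad_id, copy)
```

The Google Search and LinkedIn ads sit close to the query vector, but they never reach the ranking step because the platform filter removes them first.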
3. The Semantic Induction Workflow
This approach transforms your data from a "dump" into a structured "Ad Library."
| Stage | Process | Outcome |
|---|---|---|
| Ingestion | Scanning input files | Raw data is identified |
| Induction | LLM-based Tagging | Data labeled with Tone, Audience, Platform |
| Indexing | FAISS IndexIDMap | Vectors stored with unique IDs |
| Inference | Few-Shot Prompting | Bot writes copy using exact style needed |
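The Inference stage in the table can be sketched as a small prompt builder that places the retrieved ads ahead of the new brief. The function name build_few_shot_prompt and the prompt wording are assumptions for illustration; adapt both to the model you use.

```python
def build_few_shot_prompt(brief, examples, platform):
    """Hypothetical helper: assemble retrieved ads into a few-shot prompt."""
    lines = [f"You write {platform} ad copy. Match the style of these examples."]
    for number, example in enumerate(examples, start=1):
        lines.append(f"Example {number}: {example}")
    lines.append(f"Brief: {brief}")
    lines.append("Ad copy:")
    return "\n".join(lines)

prompt = build_few_shot_prompt(
    brief="A credit app for small shop owners",
    examples=["Upgrade your Kirana store with our digital ledger app. Simple and secure."],
    platform="Facebook",
)
print(prompt)
```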
Indian Context Advantage
For the Indian market, you can add language or region tags. This allows the bot to distinguish between:
- A Mumbai-based Kirana shop (local, community-focused tone)
- A tech startup in Bengaluru (global, "Silicon Valley" tone)
4. Full Indian-Context Indexing Script
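This script builds the full pipeline for Indian-context ad generation: it embeds ads with a SentenceTransformer model, tags them with an India-specific taxonomy (region, language style, retail format, festive context, audience segment), and persists both the FAISS index and the metadata CSV to disk. LLM tagging is simulated with simple keyword rules so the script runs without a model.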
import numpy as np
import pandas as pd
import faiss
import os
import json
from sentence_transformers import SentenceTransformer
MODEL_NAME = 'all-MiniLM-L6-v2'
DIMENSION = 384
INDEX_PATH = "/kaggle/working/indian_ad_bot.faiss"
METADATA_PATH = "/kaggle/working/indian_ad_metadata.csv"
# Literal braces are doubled ({{ }}) so TAXONOMY_PROMPT.format(text=...) fills only {text}
TAXONOMY_PROMPT = """
Analyze this Indian advertisement copy. Return ONLY a JSON object with:
{{
    "region": "e.g., Pan-India, South India, Rural, Urban",
    "language_style": "e.g., Hinglish, Formal English, Vernacular-focused",
    "retail_format": "e.g., Kirana, Quick Commerce, Modern Trade, D2C",
    "festive_context": "e.g., Diwali, Holi, Wedding Season, None",
    "audience_segment": "e.g., Gen Z, Middle-class Families, Small Biz Owners"
}}
Ad Copy: "{text}"
"""
class IndianAdBot:
    def __init__(self):
        self.encoder = SentenceTransformer(MODEL_NAME)
        self.index = self._load_or_create_index()
        self.metadata = self._load_metadata()

    def _load_or_create_index(self):
        # Reuse the index saved by a previous session, if there is one
        if os.path.exists(INDEX_PATH):
            return faiss.read_index(INDEX_PATH)
        # IndexIDMap lets us attach our own integer IDs to each vector
        return faiss.IndexIDMap(faiss.IndexFlatL2(DIMENSION))

    def _load_metadata(self):
        if os.path.exists(METADATA_PATH):
            return pd.read_csv(METADATA_PATH)
        return pd.DataFrame(columns=['id', 'text', 'region', 'language_style', 'retail_format', 'festive_context', 'audience_segment'])

    def _simulate_llm_tagging(self, text):
        # Keyword rules standing in for a real LLM call with TAXONOMY_PROMPT
        lowered = text.lower()
        is_kirana = "kirana" in lowered or "dukaan" in lowered
        return {
            "region": "Pan-India",
            # Match "de" as a whole word; a substring test also fires on "ledger" or "delivery"
            "language_style": "Hinglish" if "de" in lowered.split() else "English",
            "retail_format": "Kirana" if is_kirana else "Modern Trade",
            "festive_context": "Diwali" if "light" in lowered else "None",
            "audience_segment": "Small Biz Owners" if is_kirana else "Urban Youth"
        }

    def add_ads_incrementally(self, new_texts):
        new_entries = []
        # New IDs continue from the current index size (valid while vectors are never removed)
        start_id = self.index.ntotal
        embeddings = self.encoder.encode(new_texts).astype('float32')
        ids = np.arange(start_id, start_id + len(new_texts)).astype('int64')
        for i, text in enumerate(new_texts):
            tags = self._simulate_llm_tagging(text)
            tags.update({'id': int(ids[i]), 'text': text})
            new_entries.append(tags)
        # Persist the vectors and the metadata together so they stay in step
        self.index.add_with_ids(embeddings, ids)
        faiss.write_index(self.index, INDEX_PATH)
        new_df = pd.DataFrame(new_entries)
        self.metadata = pd.concat([self.metadata, new_df], ignore_index=True)
        self.metadata.to_csv(METADATA_PATH, index=False)
        print(f"Successfully indexed {len(new_texts)} new ads. Total: {self.index.ntotal}")

    def get_few_shot_examples(self, query, filter_dict=None, k=3):
        if self.index.ntotal == 0:
            return []
        query_vec = self.encoder.encode([query]).astype('float32')
        if filter_dict:
            mask = pd.Series(True, index=self.metadata.index)
            for key, value in filter_dict.items():
                mask &= (self.metadata[key] == value)
            valid_ids = set(self.metadata[mask]['id'])
            if not valid_ids:
                return []
            # Rank every vector, then keep the k nearest that pass the filter.
            # A fixed k * 2 cutoff would miss matches that rank lower.
            _, indices = self.index.search(query_vec, self.index.ntotal)
            filtered_indices = [idx for idx in indices[0] if idx in valid_ids][:k]
        else:
            _, indices = self.index.search(query_vec, k)
            # FAISS pads the result with -1 when the index holds fewer than k vectors
            filtered_indices = [idx for idx in indices[0] if idx != -1]
        # Look up the texts in ranked order (nearest first)
        text_by_id = self.metadata.set_index('id')['text']
        return [text_by_id.loc[idx] for idx in filtered_indices]
bot = IndianAdBot()
raw_ads = [
    "Upgrade your Kirana store with our digital ledger app. Simple and secure.",
    "This Diwali, bring home the brightest LED lights for your home. 20% off!",
    "Order milk and eggs in 10 minutes. The fastest delivery in Bangalore."
]
bot.add_ads_incrementally(raw_ads)

user_query = "A credit app for small shop owners in rural India."
examples = bot.get_few_shot_examples(user_query, filter_dict={"retail_format": "Kirana"})
print(f"\n--- FOUND {len(examples)} FEW-SHOT EXAMPLES ---")
for i, ex in enumerate(examples):
    print(f"Example {i+1}: {ex}")
Successfully indexed 3 new ads. Total: 3

--- FOUND 1 FEW-SHOT EXAMPLES ---
Example 1: Upgrade your Kirana store with our digital ledger app. Simple and secure.
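Because the index and the metadata live in two separate files, they can drift apart, for instance if one file is deleted or a write is interrupted. A quick consistency check after loading might look like the sketch below. The function find_sync_issues is a hypothetical addition, and it assumes IDs are assigned as 0..ntotal-1, which is how add_ads_incrementally numbers them.

```python
def find_sync_issues(index_total, metadata_ids):
    """Hypothetical check: compare a FAISS index size with the metadata IDs."""
    issues = []
    ids = list(metadata_ids)
    if len(ids) != index_total:
        issues.append(
            f"count mismatch: index has {index_total} vectors, metadata has {len(ids)} rows"
        )
    if len(set(ids)) != len(ids):
        issues.append("duplicate IDs in metadata")
    # IDs are expected to be 0..index_total-1, so anything else is unexpected
    unexpected = sorted(set(ids) - set(range(index_total)))
    if unexpected:
        issues.append(f"IDs outside 0..{index_total - 1}: {unexpected}")
    return issues

print(find_sync_issues(3, [0, 1, 2]))  # index and metadata agree
print(find_sync_issues(3, [0, 1]))     # one metadata row is missing
```

With the bot from this section, the call would be find_sync_issues(bot.index.ntotal, bot.metadata['id']).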
5. Data Validation with Pydantic
Use Pydantic to create a "quality gate" for your ad bot: if an annotator forgets the "Retail Format" or enters a "Language Style" outside the allowed list, the script rejects the row before it can corrupt your FAISS index. The code uses the Pydantic v2 API (model_dump).
from pydantic import BaseModel, Field, ValidationError
from typing import Optional, List
from enum import Enum

class LanguageStyle(str, Enum):
    HINGLISH = "Hinglish"
    FORMAL_ENGLISH = "Formal English"
    VERNACULAR = "Vernacular"

class RetailFormat(str, Enum):
    KIRANA = "Kirana"
    QUICK_COMMERCE = "Quick Commerce"
    MODERN_TRADE = "Modern Trade"
    D2C = "D2C"

class AdDataPoint(BaseModel):
    text: str = Field(..., min_length=10, description="The actual ad copy")
    language_style: LanguageStyle  # required; must be one of the enum values
    retail_format: RetailFormat    # required; must be one of the enum values
    region: str = Field(default="Pan-India")
    festive_context: Optional[str] = None
    audience_segment: List[str] = Field(default_factory=list)

def validate_and_format_for_faiss(raw_data: dict):
    try:
        validated_ad = AdDataPoint(**raw_data)
        print("Validation Successful")
        # mode="json" converts enum members to plain strings, ready for CSV storage
        return validated_ad.model_dump(mode="json")
    except ValidationError as e:
        print("Validation Failed!")
        print(e.json())
        return None

good_data = {
    "text": "Arre Sharma ji! Ledger book chhodo, phone pe hisab rakho.",
    "language_style": "Hinglish",
    "retail_format": "Kirana",
    "audience_segment": ["Small Business", "Rural"]
}
validate_and_format_for_faiss(good_data)
Validation Successful
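The example above only exercises the happy path. To see the gate reject a row, here is a small sketch with a trimmed-down model. MiniAd and rejected_fields are illustrative stand-ins, not the AdDataPoint class itself, and the sample values are made up.

```python
from enum import Enum
from pydantic import BaseModel, Field, ValidationError

class RetailFormat(str, Enum):
    KIRANA = "Kirana"
    QUICK_COMMERCE = "Quick Commerce"

class MiniAd(BaseModel):
    # Trimmed-down stand-in for AdDataPoint, just for this demo
    text: str = Field(..., min_length=10)
    retail_format: RetailFormat

def rejected_fields(raw_data):
    """Return the names of the fields that failed validation (empty if valid)."""
    try:
        MiniAd(**raw_data)
        return []
    except ValidationError as e:
        return [str(err["loc"][0]) for err in e.errors()]

# "Too short" has 9 characters, and "Supermarket" is not in the enum
bad_data = {"text": "Too short", "retail_format": "Supermarket"}
print(rejected_fields(bad_data))
```

Both text and retail_format are reported, so an annotator can fix every problem in a row in one pass.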
Golden Rules for Annotators
- Enforce Core Intent: If an ad mentions "delivery in 10 mins," tag it as Quick Commerce
- The "Sharma Ji" Rule: If copy uses a character to build trust, add "Trust-based" in audience_segment
- Boundary Examples: Show "Bad" vs "Good" examples to team
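Some of these rules can be checked automatically before a batch reaches the validator. The sketch below covers only the first rule. The regular expression and the function name lint_annotation are assumptions, and a real check would need a broader pattern.

```python
import re

# Matches phrases such as "in 10 mins" or "in 10 minutes" (an assumed pattern)
FAST_DELIVERY = re.compile(r"\bin\s+\d+\s*(?:min|mins|minutes)\b", re.IGNORECASE)

def lint_annotation(text, retail_format):
    """Hypothetical lint: return warnings for tags that contradict the ad copy."""
    warnings = []
    if FAST_DELIVERY.search(text) and retail_format != "Quick Commerce":
        warnings.append("Mentions minutes-level delivery but is not tagged Quick Commerce")
    return warnings

print(lint_annotation("Order milk and eggs in 10 minutes.", "Modern Trade"))
print(lint_annotation("Order milk and eggs in 10 minutes.", "Quick Commerce"))
```

A warning is a prompt for human review, not an automatic rejection, since the pattern can also match copy that is not about delivery.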
6. Bulk Upload Pipeline
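The Bulk Uploader acts as the final bridge between your team's manual curation and your AI's production memory. It reads a team-curated CSV, validates each row against AdDataPoint, indexes the ads that pass, and reports the rows that were rejected.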
import pandas as pd
from pydantic import ValidationError
# Assumes AdDataPoint (Section 5) and IndianAdBot (Section 4) are already defined

class BulkAdUploader:
    def __init__(self, bot_instance):
        self.bot = bot_instance
        self.stats = {"success": 0, "failed": 0, "errors": []}

    def process_csv(self, file_path):
        df = pd.read_csv(file_path)
        valid_ads_text = []
        valid_metadata_list = []
        print(f"Starting Bulk Upload for {len(df)} rows...")
        for index, row in df.iterrows():
            # Drop empty cells (NaN) so optional fields fall back to their defaults
            raw_data = {key: value for key, value in row.items() if pd.notna(value)}
            # CSV cells are flat strings: split "A, B" into ["A", "B"]
            if isinstance(raw_data.get('audience_segment'), str):
                raw_data['audience_segment'] = [s.strip() for s in raw_data['audience_segment'].split(',')]
            try:
                validated_data = AdDataPoint(**raw_data)
                valid_ads_text.append(validated_data.text)
                meta_row = validated_data.model_dump()
                valid_metadata_list.append(meta_row)
                self.stats["success"] += 1
            except ValidationError as e:
                first_error = e.errors()[0]
                # Include the field name so the team knows which column to fix
                field = ".".join(str(part) for part in first_error['loc'])
                self.stats["failed"] += 1
                self.stats["errors"].append({
                    "row_index": index,
                    "error": f"{field}: {first_error['msg']}",
                    "text_preview": str(raw_data.get('text'))[:30]
                })
        if valid_ads_text:
            # NOTE: add_ads_incrementally re-tags each ad itself, so the validated
            # tags collected in valid_metadata_list are not stored by this call
            self.bot.add_ads_incrementally(valid_ads_text)
        self._print_summary()

    def _print_summary(self):
        print("\n--- Upload Summary ---")
        print(f"Successfully Indexed: {self.stats['success']}")
        print(f"Failed/Rejected: {self.stats['failed']}")
        if self.stats["errors"]:
            print("\n--- Error Log (Top 5) ---")
            for err in self.stats["errors"][:5]:
                print(f"Row {err['row_index']}: {err['error']} | '{err['text_preview']}...'")

my_bot = IndianAdBot()
uploader = BulkAdUploader(my_bot)
uploader.process_csv('/path/to/team_batch_march.csv')
| Feature | Implementation | Benefit |
|---|---|---|
| Atomic Validation | Pydantic Enum + min_length | Prevents "junk" data |
| Error Reporting | Detailed stats["errors"] | Team knows what to fix |
| Batch Processing | List-based collection | Optimized vectorization |
| Standardization | Auto string-to-list | Handles CSV limitations |
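The console summary shows only the first five errors. For larger batches, one option is to hand the full error list back to the team as a CSV. The helper write_error_report below is a hypothetical sketch that expects dictionaries shaped like the entries in stats["errors"]; the sample error is made up.

```python
import csv
import io

def write_error_report(errors, file_obj):
    """Hypothetical helper: dump uploader-style error dicts as CSV rows."""
    writer = csv.DictWriter(file_obj, fieldnames=["row_index", "error", "text_preview"])
    writer.writeheader()
    writer.writerows(errors)

sample_errors = [
    {"row_index": 4, "error": "Field required", "text_preview": "Fresh atta delivered"},
]
buffer = io.StringIO()  # stands in for a real file handle in this demo
write_error_report(sample_errors, buffer)
print(buffer.getvalue())
```

To write a real file, open it with newline="" (as the csv module documentation recommends) and pass the handle together with uploader.stats["errors"].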
💡 Key Takeaways
- Context Preservation: Tagging for "Retail Format" prevents corporate tone for local products
- Incremental Persistence: FAISS + CSV means your bot's memory grows with every session
- Semantic Search: SentenceTransformer retrieves "Kirana" ads for "Shop owners" queries
- Data Quality Loop: High failure rates in validation indicate taxonomy needs refinement
- Next Steps: Create a template CSV generator to minimize team errors
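As a starting point for that last step, a template generator can be as small as the sketch below. The column order and the sample row are assumptions derived from the AdDataPoint fields; write_template is a hypothetical name.

```python
import csv
import io

# Column order and sample values are assumptions based on the AdDataPoint fields
TEMPLATE_COLUMNS = [
    "text", "language_style", "retail_format",
    "region", "festive_context", "audience_segment",
]
SAMPLE_ROW = {
    "text": "Arre Sharma ji! Ledger book chhodo, phone pe hisab rakho.",
    "language_style": "Hinglish",
    "retail_format": "Kirana",
    "region": "Pan-India",
    "festive_context": "",
    # Comma-separated in one cell; the uploader splits it into a list
    "audience_segment": "Small Business, Rural",
}

def write_template(file_obj):
    """Write the header row plus one filled-in example row."""
    writer = csv.DictWriter(file_obj, fieldnames=TEMPLATE_COLUMNS)
    writer.writeheader()
    writer.writerow(SAMPLE_ROW)

buffer = io.StringIO()  # stands in for a real file handle in this demo
write_template(buffer)
print(buffer.getvalue())
```

Shipping one filled-in row alongside the header gives annotators a concrete pattern to copy, which tends to cut down on misspelled enum values.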