What You'll Learn
- Compare V1, V2, and V3 architectures, features, and performance metrics side-by-side
- Understand how modular architecture enables incremental enhancement without system redesign
- Analyze the evolution of ethical safeguards alongside capability expansion
- Identify architectural patterns that enabled smooth evolution vs. those that constrained it
- Apply lessons learned from building production-quality clinical data collection systems
- Evaluate trade-offs between pattern-based detection and machine learning approaches
Before You Begin
- Basic understanding of software development lifecycle
- Familiarity with web scraping concepts (helpful but not required)
- Interest in healthcare data systems and NLP
- No prior knowledge of Clinical Research Agent required; a comprehensive comparison is provided
Abstract
This paper presents a comprehensive retrospective on the evolution of Clinical Research Agent across three major versions, tracing the architectural decisions, feature enhancements, and lessons learned during the development of a production-quality clinical data collection system.
- V1: Foundational pipeline for clinical text collection with 85%+ test coverage and ethical crawling basics
- V2: Cultural adaptation with 300+ Indian medical abbreviations, advanced politeness framework, 92.3% F1 for Indian detection
- V3: Multi-modal extension to images with OCR (English+Hindi), 10 document types, 89% classification accuracy
The evolved system achieves 88-92% F1 scores across detection tasks while maintaining 100% robots.txt compliance and comprehensive PHI flagging.
1. Version Timeline
V1: Foundation
Text collection pipeline with keyword filtering and basic politeness delays
~2,500 lines of code
V2: Cultural Adaptation
Indian content detection + advanced politeness framework
~4,000 lines of code
V3: Multi-Modal
Image collection + OCR + document classification
~6,500 lines of code
The evolution was driven by three critical realizations during deployment and field studies:
- Cultural Blindness: V1 treated all clinical text uniformly, missing distinctive Indian patterns (s/o, d/o, "came with complaints of")
- Ethical Crawling: Basic 2-second delays were inadequate for respectful crawling of medical/academic websites
- Image Dominance: A significant portion of Indian clinical documentation exists as scanned images, not digital text
2. Version Comparison
Each version addressed specific limitations while preserving working functionality from its predecessor.
Foundation
Text Collection Pipeline
- Sequential pipeline architecture
- Keyword-based relevance filtering
- Pattern-based PHI detection (2 patterns)
- Basic politeness delays (2 seconds)
- 85%+ test coverage
- Fail-safe design (no exceptions)
Cultural Adaptation
Indian Content + Enhanced Ethics
- 50+ Indian hospital names
- 300+ Indian medical abbreviations
- Token bucket rate limiting (6 req/min)
- robots.txt compliance with caching
- Exponential backoff retry logic
- 92.3% F1 for Indian detection
Multi-Modal
Image Collection + OCR
- Clinical image detection (context-based)
- Multi-language OCR (English+Hindi)
- 10 document type classifications
- PHI detection in OCR text
- Word-level bounding boxes
- 89% document classification accuracy
3. V1: Foundation
3.1 Problem Addressed
V1 addressed the fundamental challenge: How can we automatically collect publicly available clinical text for NLP research while maintaining ethical standards?
3.2 Architecture
    ┌─────────────┐     ┌─────────────┐     ┌─────────────┐     ┌─────────────┐     ┌─────────────┐
    │    Seed     │────▶│   Scraper   │────▶│  Extractor  │────▶│   Filter    │────▶│   Storage   │
    │    URLs     │     │   (HTML)    │     │  (PDF/TXT)  │     │ (Keywords)  │     │   (Files)   │
    └─────────────┘     └─────────────┘     └─────────────┘     └─────────────┘     └─────────────┘
                               │                                       │
                               ▼                                       ▼
                        ┌─────────────┐                         ┌─────────────┐
                        │ Downloader  │                         │  PHI Check  │
                        └─────────────┘                         └─────────────┘
Figure 1: V1 sequential pipeline architecture
3.3 Key Components
Filter Module (Keyword-Based)
    def is_clinically_relevant(text: str) -> bool:
        """Check if text is clinically relevant."""
        # Step 1: Check minimum length
        if len(text) < 150:
            return False
        # Step 2: Check PHI patterns (contains_phi is the project's regex-based PHI helper)
        if contains_phi(text):
            return False  # Reject content with PHI
        # Step 3: Count keyword matches
        keywords = [
            "discharge summary", "hospital course",
            "patient history", "medications",
        ]
        text_lower = text.lower()  # Lowercase once rather than once per keyword
        matches = sum(1 for kw in keywords if kw in text_lower)
        # Need at least 2 keyword matches
        return matches >= 2
- Keyword matching: Simple, transparent, interpretable
- PHI detection: Regex patterns for SSN, 9-digit numbers
- Conservative approach: Reject when uncertain (protects privacy)
- Binary classification: Relevant or not-relevant (no scoring)
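The filter calls `contains_phi`, which this section does not show. A minimal sketch, assuming the two V1 patterns are a dashed SSN and a standalone 9-digit number; the exact regexes and the name `PHI_PATTERNS` are assumptions for illustration, not the project's code:

```python
import re

# Hypothetical V1-style PHI patterns; the project's real regexes are not shown in the source.
PHI_PATTERNS = [
    re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),  # SSN written as 123-45-6789
    re.compile(r"\b\d{9}\b"),              # any standalone 9-digit number
]

def contains_phi(text: str) -> bool:
    """Return True if any PHI pattern matches; callers then reject the text."""
    return any(p.search(text) for p in PHI_PATTERNS)

print(contains_phi("MRN 123456789 admitted"))  # True
print(contains_phi("BP 120/80, pulse 72"))     # False
```

Because the filter rejects on any match, a broad pattern such as the 9-digit rule trades some false rejections for privacy, which is consistent with the conservative approach described above.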
3.4 Performance Metrics
3.5 Limitations Identified
- Cultural Blindness: No distinction between Indian and Western clinical text
- Basic Politeness: Simple delays without rate limiting or robots.txt
- Text-Only: No support for image-based documents
- Generic Keywords: Western-centric keyword lists
- No Content Scoring: Binary relevant/not-relevant classification
4. V2: Cultural Adaptation
4.1 Problems Addressed
V2 addressed two critical V1 limitations:
- How can we detect and prioritize Indian clinical content?
- How can we implement more sophisticated crawling etiquette?
4.2 Architecture Evolution
    ┌─────────────┐     ┌──────────────────┐     ┌─────────────┐     ┌───────────────────┐     ┌─────────────┐
    │    Seed     │────▶│    Politeness    │────▶│   Scraper   │────▶│  Indian Content   │────▶│   Storage   │
    │    URLs     │     │     Manager      │     │   (HTML)    │     │      Filter       │     │   (Files)   │
    └─────────────┘     └──────────────────┘     └─────────────┘     └───────────────────┘     └─────────────┘
                          │                                            │
                          ├── Rate Limiter                             ├── Hospital Detection
                          ├── robots.txt                               ├── Abbreviation Detection
                          └── Retry Logic                              └── Pattern Detection
Figure 2: V2 architecture showing new politeness manager and enhanced Indian content detection
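The robots.txt branch of the politeness manager in Figure 2 is not shown in code in this section. One way it could work, sketched with the standard library's `urllib.robotparser`: parse each host's robots.txt once and reuse the parser for later URLs. The class name `RobotsCache`, the injectable `fetch` callable, and the user agent string are assumptions made for this example.

```python
from typing import Callable, Dict
from urllib.parse import urlparse
from urllib.request import urlopen
from urllib.robotparser import RobotFileParser

def _http_fetch(robots_url: str) -> str:
    """Default fetcher: download robots.txt as text."""
    with urlopen(robots_url, timeout=10) as resp:
        return resp.read().decode("utf-8", errors="replace")

class RobotsCache:
    """Parse each host's robots.txt once and reuse the result (hypothetical helper)."""

    def __init__(self, user_agent: str, fetch: Callable[[str], str] = _http_fetch):
        self.user_agent = user_agent
        self._fetch = fetch
        self._parsers: Dict[str, RobotFileParser] = {}

    def allowed(self, url: str) -> bool:
        parts = urlparse(url)
        host = f"{parts.scheme}://{parts.netloc}"
        if host not in self._parsers:  # cache miss: fetch and parse once
            parser = RobotFileParser()
            parser.parse(self._fetch(host + "/robots.txt").splitlines())
            self._parsers[host] = parser
        return self._parsers[host].can_fetch(self.user_agent, url)

# Demo with a fake fetcher so the example runs without network access.
calls = []
def fake_fetch(robots_url: str) -> str:
    calls.append(robots_url)
    return "User-agent: *\nDisallow: /private/\n"

cache = RobotsCache("research-bot", fetch=fake_fetch)
print(cache.allowed("https://example.org/papers/a.pdf"))  # True
print(cache.allowed("https://example.org/private/x"))     # False
print(len(calls))                                         # 1
```

The sketch leaves out cache expiry and what to do when the robots.txt fetch itself fails; both are policy choices a production crawler would need to make explicitly.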
4.3 New Components
Token Bucket Rate Limiter
    import time
    from datetime import datetime

    class RateLimiter:
        """Token bucket rate limiter for smooth rate control."""

        def __init__(self, max_requests_per_minute: int = 6):
            self.max_rpm = max_requests_per_minute
            self.burst_size = max_requests_per_minute
            self.tokens = float(self.burst_size)  # Start full
            self.last_update = datetime.now()

        def _refill_tokens(self):
            """Add tokens based on elapsed time."""
            now = datetime.now()
            elapsed = (now - self.last_update).total_seconds()
            # Rate = 6/60 = 0.1 tokens per second
            tokens_to_add = elapsed * self.max_rpm / 60.0
            self.tokens = min(self.burst_size, self.tokens + tokens_to_add)
            self.last_update = now

        def acquire(self) -> bool:
            """Block until a token is available, then consume it."""
            while True:
                self._refill_tokens()
                if self.tokens >= 1.0:
                    self.tokens -= 1.0  # Consume token
                    return True
                time.sleep(0.1)  # Wait before retry
- Bucket capacity: 6 tokens (allows initial burst)
- Refill rate: 0.1 tokens/second (6 per minute)
- Each request: Consumes 1 token
- When empty: Must wait for refill
- Benefit: Smooth rate limiting with controlled bursting
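The bullet points above imply a concrete admission schedule for requests issued back to back. The following sketch works it out on a virtual clock instead of sleeping; `admission_times` is a hypothetical helper written for this illustration, and it assumes the bucket starts full as in `RateLimiter`.

```python
from typing import List

def admission_times(n_requests: int, max_rpm: int = 6) -> List[float]:
    """Earliest time in seconds at which each back-to-back request is admitted."""
    rate = max_rpm / 60.0      # tokens per second (0.1 for 6 req/min)
    tokens = float(max_rpm)    # bucket starts full
    now = 0.0
    times = []
    for _ in range(n_requests):
        if tokens < 1.0:
            now += (1.0 - tokens) / rate  # wait until one whole token exists
            tokens = 1.0
        tokens -= 1.0                     # each request consumes one token
        times.append(round(now, 3))
    return times

print(admission_times(8))  # [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 10.0, 20.0]
```

The first six requests drain the initial burst immediately; after that the limiter settles at one request every 10 seconds.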
Indian Content Detection
    import re
    from typing import Tuple

    def is_indian_clinical_content(text: str) -> Tuple[bool, dict]:
        """Detect Indian clinical content with scoring."""
        text_lower = text.lower()
        score = 0
        analysis = {
            "indian_score": 0,
            "detected_hospitals": [],
            "detected_abbreviations": [],
            "detected_terms": [],
            "has_indian_pattern": False,
        }
        # Hospital detection: +2 points each (strong signal)
        INDIAN_HOSPITALS = ["AIIMS", "PGIMER", "Apollo"]  # excerpt; full list has 50+ hospitals
        for hospital in INDIAN_HOSPITALS:
            if hospital.lower() in text_lower:
                analysis["detected_hospitals"].append(hospital)
                score += 2
        # Abbreviation detection: +1 point each
        INDIAN_ABBREVIATIONS = ["s/o", "d/o", "w/o", "c/o", "h/o", "NAD"]  # excerpt; full list has 300+ abbrevs
        for abbrev in INDIAN_ABBREVIATIONS:
            if re.search(r"\b" + re.escape(abbrev) + r"\b", text, re.I):
                analysis["detected_abbreviations"].append(abbrev)
                score += 1
        # Pattern detection: +3 points each (strongest signal)
        indian_patterns = [
            r"\b(?:s/o|d/o|w/o|r/o)\s+\w+",  # Family relations
            r"\b(?:came with|presented with)\s+complaints?\s+of",  # Indian phrasing
            r"\bper\s+(?:abdomen|speculum)\b",  # Examination style
        ]
        for pattern in indian_patterns:
            if re.search(pattern, text_lower):
                analysis["has_indian_pattern"] = True
                score += 3
        analysis["indian_score"] = score
        analysis["is_indian"] = score >= 2  # Low threshold for high recall
        return analysis["is_indian"], analysis
- Hospital names (+2): Strong indicator of Indian content
- Abbreviations (+1): Common but not definitive alone
- Patterns (+3): Very strong indicator (documentation style)
- Threshold (≥2): Low threshold ensures high recall (91.3%)
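To make the weights concrete, here is a compact re-statement of the scoring rules applied to two made-up notes. It uses only the excerpted hospital and abbreviation entries shown above, so scores from the full 50+/300+ lists could differ; `indian_score` and both sample sentences are hypothetical.

```python
import re

HOSPITALS = ["AIIMS", "PGIMER", "Apollo"]                   # excerpt of the 50+ list
ABBREVIATIONS = ["s/o", "d/o", "w/o", "c/o", "h/o", "NAD"]  # excerpt of the 300+ list
PATTERNS = [
    r"\b(?:s/o|d/o|w/o|r/o)\s+\w+",
    r"\b(?:came with|presented with)\s+complaints?\s+of",
    r"\bper\s+(?:abdomen|speculum)\b",
]

def indian_score(text: str) -> dict:
    """Break the total score down by signal type (+2 / +1 / +3 per match)."""
    lower = text.lower()
    hospitals = sum(2 for h in HOSPITALS if h.lower() in lower)
    abbrevs = sum(1 for a in ABBREVIATIONS
                  if re.search(r"\b" + re.escape(a) + r"\b", text, re.I))
    patterns = sum(3 for p in PATTERNS if re.search(p, lower))
    return {"hospitals": hospitals, "abbreviations": abbrevs,
            "patterns": patterns, "total": hospitals + abbrevs + patterns}

print(indian_score("Ramesh s/o Suresh came with complaints of fever. Referred to AIIMS."))
# {'hospitals': 2, 'abbreviations': 1, 'patterns': 6, 'total': 9}
print(indian_score("Pt c/o chest pain, h/o HTN."))
# {'hospitals': 0, 'abbreviations': 2, 'patterns': 0, 'total': 2}
```

The second note reaches the threshold of 2 on abbreviations alone, which illustrates the recall-oriented trade-off: a low threshold catches more Indian content at the cost of admitting some text that only shares common clinical shorthand.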
4.4 Performance Improvements
| Metric | V1 | V2 | Improvement |
|---|---|---|---|
| Base Delay | 2.0s | 5.0s | +150% |
| Rate Limiting | None | 6 req/min | New |
| robots.txt | ✗ | ✓ | New |
| Retry Logic | Basic | Exponential backoff | Enhanced |
| Indian Detection F1 | N/A | 92.3% | New |
| Content Scoring | Binary | Quantitative | Enhanced |
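The "Exponential backoff" entry in the table can be illustrated with a delay schedule. This is a sketch only: the 5.0 s base reuses the V2 base delay from the table, while the doubling factor, the 300 s cap, and the optional jitter are assumptions rather than documented project settings.

```python
import random
from typing import List

def backoff_delays(retries: int, base: float = 5.0, factor: float = 2.0,
                   cap: float = 300.0, jitter: float = 0.0) -> List[float]:
    """Delay before each retry: base * factor**attempt, capped, plus optional jitter."""
    delays = []
    for attempt in range(retries):
        delay = min(cap, base * factor ** attempt)
        # jitter is a fraction of the delay; 0.0 keeps the schedule deterministic
        delay += random.uniform(0.0, jitter * delay)
        delays.append(delay)
    return delays

print(backoff_delays(4))  # [5.0, 10.0, 20.0, 40.0]
```

A nonzero `jitter` spreads retries from concurrent workers so they do not hit the same server at the same instant.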
5. V3: Multi-Modal Image Collection
5.1 Problems Addressed
V3 addressed a fundamental modality gap:
- How can we collect and process image-based clinical documents?
- How can we extract text from scanned documents with OCR?
- How can we classify document types automatically?
5.2 Architecture Evolution
    ┌─────────────┐     ┌──────────────────┐     ┌─────────────────┐     ┌───────────────────┐
    │    Seed     │────▶│    Politeness    │────▶│      Image      │────▶│  Clinical Image   │
    │    URLs     │     │     Manager      │     │    Extractor    │     │      Filter       │
    └─────────────┘     └──────────────────┘     └─────────────────┘     └───────────────────┘
                                                                                   │
                                                                                   ▼
    ┌─────────────┐     ┌──────────────────┐     ┌─────────────────┐     ┌───────────────────┐
    │  Analytics  │◀────│   PHI Detector   │◀────│    Document     │◀────│   OCR Processor   │
    │  Dashboard  │     │    (OCR Text)    │     │   Classifier    │     │     (Eng+Hin)     │
    └─────────────┘     └──────────────────┘     └─────────────────┘     └───────────────────┘
Figure 3: V3 architecture showing image collection and OCR processing pipeline
5.3 New Components
Clinical Image Detection (Context-Based)
    from typing import Dict, Tuple
    from urllib.parse import urlparse

    # clinical_patterns and INDIAN_HOSPITAL_PATTERNS are module-level lists defined elsewhere
    def is_clinical_image(
        image_url: str,
        page_content: str = "",
        page_title: str = "",
        image_alt: str = "",
    ) -> Tuple[bool, Dict]:
        # Step 1: Combine all context for analysis
        text_to_analyze = " ".join([
            image_url.lower(),            # Analyze filename & path
            page_title.lower(),           # Page context
            image_alt.lower(),            # Image description
            page_content.lower()[:2000],  # First 2000 chars
        ])
        score = 0
        # Step 2: Match clinical patterns (+2 points each)
        for pattern in clinical_patterns:  # 60+ patterns
            if pattern.lower() in text_to_analyze:
                score += 2
        # Step 3: Match Indian hospitals (+3 points each)
        for hospital in INDIAN_HOSPITAL_PATTERNS:  # 45+ hospitals
            if hospital.lower() in text_to_analyze:
                score += 3
        # Step 4: URL analysis (+3 points)
        if "discharge" in urlparse(image_url).path.lower():
            score += 3
        # Classification threshold (higher than V2)
        return score >= 5, {"clinical_score": score}  # further analysis fields elided in the source
- Context-based: Analyzes surrounding text, not pixel content
- Higher threshold (≥5): Reduces false positives for images
- URL as signal: Filenames often contain clinical indicators
- Pattern matching: Interpretable, zero-shot capability
Multi-Language OCR Processor
    from typing import Dict, Sequence

    import pytesseract
    from PIL import Image

    def process_ocr(
        image_path: str,
        languages: Sequence[str] = ("eng", "hin"),  # tuple avoids a shared mutable default
    ) -> Dict:
        """Extract text with Tesseract OCR (English+Hindi)."""
        # Create language code: "eng+hin"
        lang_code = "+".join(languages)
        img = Image.open(image_path)
        # Get detailed OCR data with bounding boxes
        ocr_data = pytesseract.image_to_data(
            img,
            lang=lang_code,
            output_type=pytesseract.Output.DICT,
        )
        # Process word-level results
        text_blocks = []
        total_confidence = 0.0
        valid_blocks = 0
        for i in range(len(ocr_data["text"])):
            text = ocr_data["text"][i].strip()
            # Confidence 0-100 (-1 for non-text rows); may arrive as a string, so convert first
            conf = float(ocr_data["conf"][i])
            # Filter by confidence threshold
            if text and conf > 60:
                text_blocks.append({
                    "text": text,
                    "confidence": conf,
                    "left": ocr_data["left"][i],  # Bounding box
                    "top": ocr_data["top"][i],
                    "width": ocr_data["width"][i],
                    "height": ocr_data["height"][i],
                })
                total_confidence += conf
                valid_blocks += 1
        # Get full text output
        full_text = pytesseract.image_to_string(img, lang=lang_code)
        avg_confidence = total_confidence / valid_blocks if valid_blocks > 0 else 0
        return {
            "success": True,
            "text": full_text.strip(),
            "confidence": avg_confidence,
            "language": lang_code,
            "blocks": text_blocks,
        }
- text: Full extracted text (plain string)
- confidence: Average confidence score (0-100)
- language: Language code used ("eng+hin")
- blocks: Word-level detections with position and confidence
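Word-level blocks make it possible to locate suspected PHI on the page, not only in the extracted text. The sketch below works directly on the `blocks` structure returned above and needs no OCR engine to run; the phone-number regex (a 10-digit number starting with 6 to 9) and the helper name `phi_boxes` are assumptions for illustration, not one of the project's 20+ patterns.

```python
import re
from typing import Dict, List

# Hypothetical pattern: a 10-digit mobile number starting with 6-9.
PHONE = re.compile(r"^[6-9]\d{9}$")

def phi_boxes(blocks: List[Dict]) -> List[Dict]:
    """Return bounding boxes of OCR words that look like PHI."""
    return [
        {key: block[key] for key in ("left", "top", "width", "height")}
        for block in blocks
        if PHONE.match(block["text"])
    ]

blocks = [
    {"text": "Contact", "confidence": 91.0, "left": 10, "top": 40, "width": 80, "height": 18},
    {"text": "9876543210", "confidence": 88.0, "left": 95, "top": 40, "width": 120, "height": 18},
]
print(phi_boxes(blocks))  # [{'left': 95, 'top': 40, 'width': 120, 'height': 18}]
```

Matching one word at a time misses PHI that OCR splits across several words, such as names or numbers with spaces, so a fuller approach would also scan the joined text and map matches back to blocks.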
Document Type Classifier
    def classify_document(ocr_text: str) -> str:
        """Classify document into one of 10 types."""
        # Ordered by specificity (most specific first)
        document_types = [
            ("discharge summary", "Discharge Summary"),
            ("discharge advice", "Discharge Summary"),
            ("prescription", "Prescription"),
            ("case sheet", "Case Sheet"),
            ("lab report", "Lab Report"),
            ("blood report", "Lab Report"),
            ("scan report", "Scan Report"),
            ("x-ray", "Scan Report"),
            ("CT scan", "Scan Report"),
            ("ECG", "ECG Report"),
            ("echocardiogram", "Echocardiogram"),
            ("ICU note", "ICU Note"),
            ("OPD", "OPD Card"),
            ("emergency", "Emergency Note"),
            # ... 10 total types
        ]
        text_lower = ocr_text.lower()
        # Find first matching type (ordered by specificity)
        for pattern, doc_type in document_types:
            # Lowercase the pattern too; otherwise "CT scan", "ECG", "ICU note", "OPD" never match
            if pattern.lower() in text_lower:
                return doc_type  # Stop at first match
        return "Unknown"
- Ordered patterns: Most specific types checked first
- First match wins: the early return prevents a later, less specific pattern from overriding the result
- Pattern-based: Zero-shot capability (no training data needed)
- Interpretable: Clear reason for each classification
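Why the ordering matters can be seen with a two-rule example. The helper `first_match` below is a generic stand-in for the classifier loop, and the sample sentence is invented; only the two rule labels come from the list above.

```python
from typing import List, Tuple

def first_match(text: str, rules: List[Tuple[str, str]]) -> str:
    """Return the label of the first rule whose pattern occurs in text."""
    lower = text.lower()
    for pattern, label in rules:
        if pattern.lower() in lower:
            return label
    return "Unknown"

specific_first = [("discharge summary", "Discharge Summary"),
                  ("emergency", "Emergency Note")]
generic_first = list(reversed(specific_first))

text = "DISCHARGE SUMMARY - admitted via emergency department"
print(first_match(text, specific_first))  # Discharge Summary
print(first_match(text, generic_first))   # Emergency Note
```

The same document gets a different label depending only on rule order, so the list has to be maintained with specificity in mind whenever a new type is added.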
5.4 Performance Metrics
6. Feature Evolution Matrix
Comprehensive comparison of features across all three versions.
| Feature | V1 | V2 | V3 |
|---|---|---|---|
| Content Modality | Text | Text | Text + Images |
| Clinical Detection | Keywords | Keywords + Indian | Context + Patterns |
| Content Scoring | Binary | Quantitative | Multi-dimensional |
| Politeness | Basic Delay | Rate Limiting | Rate Limiting |
| robots.txt | ✗ | ✓ | ✓ |
| Retry Logic | Basic | Exponential | Exponential |
| PHI Detection | 2 patterns | 20+ patterns | 20+ patterns (OCR) |
| Language Support | English | English | English + Hindi |
| Document Types | N/A | N/A | 10 types |
| Test Coverage | 85% | 88% | 86% |
| Lines of Code | ~2,500 | ~4,000 | ~6,500 |
7. Lessons Learned
7.1 Architectural Lessons
The modular pipeline architecture proved essential for evolutionary development. Each version could add or modify components without redesigning the entire system.
Recommendation: Design research software with clear component boundaries and well-defined interfaces.
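One way to express such component boundaries, sketched with `typing.Protocol`: every stage exposes the same `process` method, so a new stage can be added to the list without touching the others. The names `Stage`, `MinLength`, `KeywordFilter`, and `run_pipeline` are hypothetical and do not come from the project.

```python
from typing import Iterable, List, Protocol

class Stage(Protocol):
    """Anything with a matching process() method can serve as a pipeline stage."""
    def process(self, items: Iterable[str]) -> Iterable[str]: ...

class MinLength:
    def __init__(self, min_chars: int) -> None:
        self.min_chars = min_chars

    def process(self, items: Iterable[str]) -> Iterable[str]:
        return (t for t in items if len(t) >= self.min_chars)

class KeywordFilter:
    def __init__(self, keywords: List[str]) -> None:
        self.keywords = [k.lower() for k in keywords]

    def process(self, items: Iterable[str]) -> Iterable[str]:
        return (t for t in items if any(k in t.lower() for k in self.keywords))

def run_pipeline(stages: List[Stage], items: Iterable[str]) -> List[str]:
    """Feed the output of each stage into the next, in order."""
    for stage in stages:
        items = stage.process(items)
    return list(items)

docs = ["short", "Discharge summary for a febrile patient", "A long but unrelated paragraph"]
print(run_pipeline([MinLength(10), KeywordFilter(["discharge"])], docs))
# ['Discharge summary for a febrile patient']
```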
Moving from hardcoded values (V1) to configuration files (V2/V3) enabled customization without code changes.
Recommendation: Externalize all tunable parameters to configuration files.
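A minimal sketch of what externalized parameters could look like. The default values are the ones quoted in this document (5.0 s base delay, 6 requests per minute, score thresholds of 2 and 5, OCR confidence cutoff of 60); the class name, field names, and JSON format are assumptions.

```python
import json
from dataclasses import dataclass

@dataclass(frozen=True)
class CrawlerConfig:
    """Tunable parameters with the defaults quoted in this document."""
    base_delay_seconds: float = 5.0
    max_requests_per_minute: int = 6
    indian_score_threshold: int = 2
    image_score_threshold: int = 5
    ocr_min_confidence: float = 60.0

def load_config(raw_json: str) -> CrawlerConfig:
    """Build a config from JSON text; unknown keys raise TypeError."""
    return CrawlerConfig(**json.loads(raw_json))

cfg = load_config('{"max_requests_per_minute": 3, "ocr_min_confidence": 70}')
print(cfg.max_requests_per_minute, cfg.base_delay_seconds)  # 3 5.0
```

Rejecting unknown keys catches typos in a config file early instead of silently falling back to a default.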
The comprehensive test suite (85%+ coverage) enabled confident refactoring across versions.
Recommendation: Invest in testing infrastructure from the beginning; it enables future evolution.
7.2 Domain-Specific Lessons
Generic NLP techniques failed for Indian clinical text. Success required domain-specific knowledge: abbreviations, hospital names, documentation patterns.
Recommendation: Engage domain experts early; build knowledge bases, not just models.
As collection capabilities expanded (text → images), ethical safeguards needed parallel enhancement (basic delays → rate limiting → robots.txt).
Recommendation: Design ethics frameworks that scale with system capabilities.
Image collection required fundamentally different processing (OCR, document classification) while maintaining unified ethics.
Recommendation: Plan for modality-specific processing within unified architectural frameworks.
8. Unified Ethics Framework
All versions adhere to core ethical principles with evolving implementation:
| Principle | V1 | V2 | V3 |
|---|---|---|---|
| PHI Protection | ✓ Pattern detection | ✓ Enhanced patterns | ✓ OCR-based detection |
| Respectful Crawling | ⚠️ Basic delays | ✓ Rate limiting | ✓ Rate limiting |
| Transparency | ✓ Logging | ✓ Enhanced logging | ✓ Comprehensive audit |
| Purpose Limitation | ✓ Research only | ✓ Research only | ✓ Research only |
| Data Minimization | ✓ Relevance filtering | ✓ Indian content focus | ✓ Clinical image focus |
Ethics Evolution
9. Knowledge Check
Test your understanding of the Clinical Research Agent evolution.
Key Takeaways
- Modular pipeline architecture enabled incremental enhancement: V1 (~2,500 LOC) → V2 (~4,000 LOC) → V3 (~6,500 LOC) without system redesign
- Cultural adaptation requires domain-specific knowledge engineering: 50+ hospitals, 300+ abbreviations, 80+ patterns for Indian clinical content
- Ethics frameworks must scale with capabilities: basic delays → rate limiting → robots.txt → OCR-based PHI detection
- Multi-modal collection needs specialized pipelines: image detection (context-based), OCR (Tesseract Eng+Hin), document classification (10 types)
- Pattern-based approaches offer interpretability and zero-shot capability vs. machine learning requiring training data
- Test coverage (85%+) across all versions enabled confident refactoring and evolution