Clinical Research Agent Evolution: V1 to V3 Retrospective

🎯 What You'll Learn

Compare V1, V2, and V3 architectures, features, and performance metrics side-by-side
Understand how modular architecture enables incremental enhancement without system redesign
Analyze the evolution of ethical safeguards alongside capability expansion
Identify architectural patterns that enabled smooth evolution vs. those that constrained it
Apply lessons learned from building production-quality clinical data collection systems
Evaluate trade-offs between pattern-based detection and machine learning approaches

📋 Before You Begin

Basic understanding of software development lifecycle
Familiarity with web scraping concepts (helpful but not required)
Interest in healthcare data systems and NLP
No prior knowledge of Clinical Research Agent required — comprehensive comparison provided

⚡ Beginner ⏱ ~3 min

Abstract

This paper presents a comprehensive retrospective on the evolution of Clinical Research Agent across three major versions, tracing the architectural decisions, feature enhancements, and lessons learned during the development of a production-quality clinical data collection system.

Version Evolution at a Glance:

V1: Foundational pipeline for clinical text collection with 85%+ test coverage and ethical crawling basics
V2: Cultural adaptation with 300+ Indian medical abbreviations, advanced politeness framework, 92.3% F1 for Indian detection
V3: Multi-modal extension to images with OCR (English+Hindi), 10 document types, 89% classification accuracy

The evolved system achieves 88-92% F1 scores across detection tasks while maintaining 100% robots.txt compliance and comprehensive PHI flagging.

⚡ Beginner ⏱ ~3 min

1. Version Timeline

The timeline below summarizes the three versions; the three critical realizations from deployment and field studies that drove the evolution follow it.

V1: Foundation

Text collection pipeline with keyword filtering and basic politeness delays

~2,500 lines of code

V2: Cultural Adaptation

Indian content detection + advanced politeness framework

~4,000 lines of code

V3: Multi-Modal

Image collection + OCR + document classification

~6,500 lines of code

Three Critical Realizations:

Cultural Blindness: V1 treated all clinical text uniformly, missing distinctive Indian patterns (s/o, d/o, "came with complaints of")
Ethical Crawling: Basic 2-second delays were inadequate for respectful crawling of medical/academic websites
Image Dominance: A significant portion of Indian clinical documentation exists as scanned images, not digital text

⚡ Intermediate ⏱ ~5 min

2. Version Comparison

Each version addressed specific limitations while preserving working functionality from its predecessor.

VERSION 1

Foundation

Text Collection Pipeline

Sequential pipeline architecture
Keyword-based relevance filtering
Pattern-based PHI detection (2 patterns)
Basic politeness delays (2 seconds)
85%+ test coverage
Fail-safe design (no exceptions)

VERSION 2

Cultural Adaptation

Indian Content + Enhanced Ethics

50+ Indian hospital names
300+ Indian medical abbreviations
Token bucket rate limiting (6 req/min)
robots.txt compliance with caching
Exponential backoff retry logic
92.3% F1 for Indian detection

VERSION 3

Multi-Modal

Image Collection + OCR

Clinical image detection (context-based)
Multi-language OCR (English+Hindi)
10 document type classifications
PHI detection in OCR text
Word-level bounding boxes
89% document classification accuracy

⚡ Intermediate ⏱ ~4 min

3. V1: Foundation

3.1 Problem Addressed

V1 addressed the fundamental challenge: How can we automatically collect publicly available clinical text for NLP research while maintaining ethical standards?

3.2 Architecture

┌─────────────┐    ┌─────────────┐    ┌─────────────┐    ┌─────────────┐    ┌─────────────┐
│   Seed      │───▶│  Scraper    │───▶│  Extractor  │───▶│   Filter    │───▶│  Storage    │
│   URLs      │    │  (HTML)     │    │  (PDF/TXT)  │    │ (Keywords)  │    │  (Files)    │
└─────────────┘    └─────────────┘    └─────────────┘    └─────────────┘    └─────────────┘
                          │                                      │
                          ▼                                      ▼
                   ┌─────────────┐                       ┌─────────────┐
                   │   Downloader│                       │  PHI Check  │
                   └─────────────┘                       └─────────────┘

Figure 1: V1 sequential pipeline architecture

3.3 Key Components

Filter Module (Keyword-Based)

python

def is_clinically_relevant(text: str) -> bool:
    """Check if text is clinically relevant."""

    # Step 1: Check minimum length
    if len(text) < 150:
        return False

    # Step 2: Check PHI patterns (contains_phi is the pipeline's PHI check)
    if contains_phi(text):
        return False  # Reject content with PHI

    # Step 3: Count keyword matches (lowercase the text once, not per keyword)
    keywords = [
        "discharge summary", "hospital course",
        "patient history", "medications",
    ]
    text_lower = text.lower()
    matches = sum(1 for kw in keywords if kw in text_lower)

    # Need at least 2 keyword matches
    return matches >= 2

Design decisions:

Keyword matching: Simple, transparent, interpretable
PHI detection: Regex patterns for SSN, 9-digit numbers
Conservative approach: Reject when uncertain (protects privacy)
Binary classification: Relevant or not-relevant (no scoring)
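
The filter above calls contains_phi without showing it. A minimal sketch of what such a check could look like follows, assuming only what the design decisions state: two regex patterns, one for SSNs and one for 9-digit numbers. The exact expressions and the constant names are hypothetical, not taken from the V1 source.

```python
import re

# Hypothetical patterns: the source says only that V1 used two regexes
# (SSN and 9-digit numbers), not what they looked like.
SSN_PATTERN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")
# Exactly nine consecutive digits, not part of a longer digit run.
NINE_DIGIT_PATTERN = re.compile(r"(?<!\d)\d{9}(?!\d)")


def contains_phi(text: str) -> bool:
    """Return True if either pattern matches anywhere in the text."""
    return bool(SSN_PATTERN.search(text) or NINE_DIGIT_PATTERN.search(text))


print(contains_phi("SSN on file: 123-45-6789"))   # True
print(contains_phi("BP 120/80, pulse 72"))        # False
```

Because the filter rejects any text where this returns True, a broad pattern such as the 9-digit rule trades some recall of useful documents for privacy, which matches the conservative approach described above.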

3.4 Performance Metrics

Test Coverage: 85%
Precision: 88%
Recall: 82%
Samples/Minute: 10-20
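
V1 reports precision and recall but no F1, while later versions report F1. To compare them on one scale, the standard harmonic-mean formula can be applied to the V1 figures. The result below is derived here from the reported numbers; it is not a figure stated in the source.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (both as fractions in [0, 1])."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# V1 figures reported above: 88% precision, 82% recall.
v1_f1 = f1_score(0.88, 0.82)
print(f"V1 F1 = {v1_f1:.3f}")  # prints: V1 F1 = 0.849
```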

3.5 Limitations Identified

V1 Limitations:

Cultural Blindness: No distinction between Indian and Western clinical text
Basic Politeness: Simple delays without rate limiting or robots.txt
Text-Only: No support for image-based documents
Generic Keywords: Western-centric keyword lists
No Content Scoring: Binary relevant/not-relevant classification

⚡ Intermediate ⏱ ~6 min

4. V2: Cultural Adaptation

4.1 Problems Addressed

V2 addressed two critical V1 limitations:

How can we detect and prioritize Indian clinical content?
How can we implement more sophisticated crawling etiquette?

4.2 Architecture Evolution

┌─────────────┐    ┌──────────────────┐    ┌─────────────┐    ┌───────────────────┐    ┌─────────────┐
│   Seed      │───▶│  Politeness      │───▶│  Scraper    │───▶│  Indian Content   │───▶│  Storage    │
│   URLs      │    │  Manager         │    │  (HTML)     │    │  Filter           │    │  (Files)    │
└─────────────┘    └──────────────────┘    └─────────────┘    └───────────────────┘    └─────────────┘
                          │                                           │
                          ├── Rate Limiter                            ├── Hospital Detection
                          ├── robots.txt                              ├── Abbreviation Detection
                          └── Retry Logic                             └── Pattern Detection

Figure 2: V2 architecture showing new politeness manager and enhanced Indian content detection
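
The robots.txt branch of the politeness manager in Figure 2 is not shown in code. One way it could work is sketched below with Python's standard urllib.robotparser, caching one parsed file per host. The RobotsCache class and the injected fetch_robots callable are hypothetical names for illustration; the source does not describe how V2 fetches or caches robots.txt.

```python
from typing import Callable, Dict
from urllib.parse import urlparse
from urllib.robotparser import RobotFileParser


class RobotsCache:
    """Caches one parsed robots.txt per host (hypothetical helper)."""

    def __init__(self, fetch_robots: Callable[[str], str], user_agent: str = "*"):
        self._fetch = fetch_robots  # takes a robots.txt URL, returns its text
        self._user_agent = user_agent
        self._parsers: Dict[str, RobotFileParser] = {}

    def can_fetch(self, url: str) -> bool:
        parts = urlparse(url)
        host = f"{parts.scheme}://{parts.netloc}"
        parser = self._parsers.get(host)
        if parser is None:
            # First request to this host: fetch and parse once, then reuse.
            parser = RobotFileParser()
            parser.parse(self._fetch(host + "/robots.txt").splitlines())
            self._parsers[host] = parser
        return parser.can_fetch(self._user_agent, url)


# In-memory stand-in for a real HTTP fetch, so the example runs offline.
def demo_fetch(robots_url: str) -> str:
    return "User-agent: *\nDisallow: /private/\n"


cache = RobotsCache(demo_fetch)
print(cache.can_fetch("https://example.org/public/report.pdf"))
print(cache.can_fetch("https://example.org/private/report.pdf"))
```

Injecting the fetch function keeps the cache testable and lets the real fetch go through the same rate limiter as every other request.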

4.3 New Components

Token Bucket Rate Limiter

python

import time


class RateLimiter:
    """Token bucket rate limiter for smooth rate control."""

    def __init__(self, max_requests_per_minute: int = 6):
        self.max_rpm = max_requests_per_minute
        self.burst_size = max_requests_per_minute
        self.tokens = float(self.burst_size)  # Start full
        # Monotonic clock: unaffected by system clock adjustments
        self.last_update = time.monotonic()

    def _refill_tokens(self):
        """Add tokens based on elapsed time."""
        now = time.monotonic()
        elapsed = now - self.last_update

        # Default rate = 6/60 = 0.1 tokens per second
        tokens_to_add = elapsed * self.max_rpm / 60.0
        self.tokens = min(self.burst_size, self.tokens + tokens_to_add)
        self.last_update = now

    def acquire(self) -> bool:
        """Block until a token is available, then consume it."""
        while True:
            self._refill_tokens()
            if self.tokens >= 1.0:
                self.tokens -= 1.0  # Consume token
                return True
            time.sleep(0.1)  # Wait before retry

How token bucket works:

Bucket capacity: 6 tokens (allows initial burst)
Refill rate: 0.1 tokens/second (6 per minute)
Each request: Consumes 1 token
When empty: Must wait for refill
Benefit: Smooth rate limiting with controlled bursting
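
To make the burst-then-steady behavior concrete, the sketch below replays the same arithmetic on a simulated clock, so nothing actually sleeps. It assumes requests arrive back to back at t=0 and that the bucket starts full; simulate_token_bucket is an illustrative helper, not part of the V2 code.

```python
from typing import List


def simulate_token_bucket(num_requests: int, max_rpm: int = 6) -> List[float]:
    """Return the earliest time (seconds) at which each request is served.

    Assumes all requests are queued at t=0 and the bucket starts full.
    """
    refill_per_second = max_rpm / 60.0
    tokens = float(max_rpm)  # bucket capacity equals the per-minute limit
    now = 0.0
    served_at = []
    for _ in range(num_requests):
        if tokens < 1.0:
            # Wait exactly long enough for the missing fraction to refill.
            now += (1.0 - tokens) / refill_per_second
            tokens = 1.0
        tokens -= 1.0
        served_at.append(round(now, 1))
    return served_at


print(simulate_token_bucket(9))
# [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 10.0, 20.0, 30.0]
```

The first six requests drain the initial burst immediately; after that, one request is served every 10 seconds, which is the 0.1 tokens/second refill rate.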

Indian Content Detection

python

import re
from typing import Tuple


def is_indian_clinical_content(text: str) -> Tuple[bool, dict]:
    """Detect Indian clinical content with scoring."""
    text_lower = text.lower()
    score = 0
    analysis = {
        "indian_score": 0,
        "detected_hospitals": [],
        "detected_abbreviations": [],
        "detected_terms": [],
        "has_indian_pattern": False,
    }

    # Hospital detection: +2 points each (strong signal)
    INDIAN_HOSPITALS = [
        "AIIMS", "PGIMER", "Apollo",
        # ... 50+ hospitals
    ]
    for hospital in INDIAN_HOSPITALS:
        if hospital.lower() in text_lower:
            analysis["detected_hospitals"].append(hospital)
            score += 2

    # Abbreviation detection: +1 point each
    INDIAN_ABBREVIATIONS = [
        "s/o", "d/o", "w/o", "c/o", "h/o", "NAD",
        # ... 300+ abbrevs
    ]
    for abbrev in INDIAN_ABBREVIATIONS:
        if re.search(r"\b" + re.escape(abbrev) + r"\b", text, re.I):
            analysis["detected_abbreviations"].append(abbrev)
            score += 1

    # Pattern detection: +3 points each (strongest signal)
    indian_patterns = [
        r"\b(?:s/o|d/o|w/o|r/o)\s+\w+",  # Family relations
        r"\b(?:came with|presented with)\s+complaints?\s+of",  # Indian phrasing
        r"\bper\s+(?:abdomen|speculum)\b",  # Examination style
    ]
    for pattern in indian_patterns:
        if re.search(pattern, text_lower):
            analysis["has_indian_pattern"] = True
            score += 3

    analysis["indian_score"] = score
    analysis["is_indian"] = score >= 2  # Low threshold for high recall

    return analysis["is_indian"], analysis

Scoring weights:

Hospital names (+2): Strong indicator of Indian content
Abbreviations (+1): Common but not definitive alone
Patterns (+3): Very strong indicator (documentation style)
Threshold (≥2): Low threshold ensures high recall (91.3%)
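
A worked example of the arithmetic may help. The sketch below applies the same weights to tiny illustrative subsets of the lists and returns the per-signal breakdown. The indian_score helper and the sample sentences are made up for illustration.

```python
import re

# Small illustrative subsets; the real lists hold 50+ hospitals and 300+ abbreviations.
HOSPITALS = ["AIIMS", "PGIMER", "Apollo"]
ABBREVIATIONS = ["s/o", "d/o", "w/o", "c/o", "h/o", "NAD"]
PATTERNS = [
    r"\b(?:s/o|d/o|w/o|r/o)\s+\w+",
    r"\b(?:came with|presented with)\s+complaints?\s+of",
    r"\bper\s+(?:abdomen|speculum)\b",
]


def indian_score(text: str) -> dict:
    """Break the score down by signal type using the weights listed above."""
    lower = text.lower()
    hospitals = sum(2 for h in HOSPITALS if h.lower() in lower)
    abbrevs = sum(
        1 for a in ABBREVIATIONS
        if re.search(r"\b" + re.escape(a) + r"\b", text, re.I)
    )
    patterns = sum(3 for p in PATTERNS if re.search(p, lower))
    return {
        "hospitals": hospitals,
        "abbreviations": abbrevs,
        "patterns": patterns,
        "total": hospitals + abbrevs + patterns,
    }


print(indian_score("Ramesh s/o Suresh came with complaints of fever; referred to AIIMS."))
# {'hospitals': 2, 'abbreviations': 1, 'patterns': 6, 'total': 9}
print(indian_score("Patient c/o headache, h/o hypertension."))
# {'hospitals': 0, 'abbreviations': 2, 'patterns': 0, 'total': 2}
```

The second sample shows the effect of the low threshold: one abbreviation alone scores 1 and is rejected, but two abbreviations with no other signal already reach the threshold of 2.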

4.4 Performance Improvements

Metric	V1	V2	Improvement
Base Delay	2.0s	5.0s	+150%
Rate Limiting	None	6 req/min	New
robots.txt	❌	✅	New
Retry Logic	Basic	Exponential backoff	Enhanced
Indian Detection F1	N/A	92.3%	New
Content Scoring	Binary	Quantitative	Enhanced
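
The table lists exponential backoff without showing the schedule. A minimal sketch of how the retry delays could be computed is given below, starting from the 5.0s base delay in the table. The retry count, doubling factor, cap, and jitter parameter are assumptions, as the source does not state them.

```python
import random
from typing import List, Optional


def backoff_delays(
    base_delay: float = 5.0,   # V2 base delay from the table above
    max_retries: int = 4,      # assumed; not stated in the source
    factor: float = 2.0,       # assumed doubling factor
    max_delay: float = 120.0,  # assumed cap on any single delay
    jitter: float = 0.0,       # optional random fraction added to each delay
    rng: Optional[random.Random] = None,
) -> List[float]:
    """Return the wait time before each retry attempt."""
    rng = rng or random.Random()
    delays = []
    for attempt in range(max_retries):
        delay = min(max_delay, base_delay * factor ** attempt)
        delay += delay * jitter * rng.random()
        delays.append(delay)
    return delays


print(backoff_delays())  # [5.0, 10.0, 20.0, 40.0]
```

A non-zero jitter spreads retries from parallel workers so they do not hit the same server at the same moment.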

⚡ Advanced ⏱ ~6 min

5. V3: Multi-Modal Image Collection

5.1 Problems Addressed

V3 addressed a fundamental modality gap:

How can we collect and process image-based clinical documents?
How can we extract text from scanned documents with OCR?
How can we classify document types automatically?

5.2 Architecture Evolution

┌─────────────┐    ┌──────────────────┐    ┌─────────────────┐    ┌───────────────────┐
│   Seed      │───▶│  Politeness      │───▶│  Image          │───▶│  Clinical Image   │
│   URLs      │    │  Manager         │    │  Extractor      │    │  Filter           │
└─────────────┘    └──────────────────┘    └─────────────────┘    └───────────────────┘
                                                                        │
                                                                        ▼
┌─────────────┐    ┌──────────────────┐    ┌─────────────────┐    ┌───────────────────┐
│  Analytics  │◀───│  PHI Detector    │◀───│  Document       │◀───│  OCR Processor    │
│  Dashboard  │    │  (OCR Text)      │    │  Classifier     │    │  (Eng+Hin)        │
└─────────────┘    └──────────────────┘    └─────────────────┘    └───────────────────┘

Figure 3: V3 architecture showing image collection and OCR processing pipeline

5.3 New Components

Clinical Image Detection (Context-Based)

Key Insight: Unlike text files, images don't contain directly analyzable text. V3 analyzes the context around the image — URL, page title, alt text, and surrounding content.

python

from typing import Dict, Tuple
from urllib.parse import urlparse

# clinical_patterns and INDIAN_HOSPITAL_PATTERNS are module-level lists
# defined elsewhere in the project.


def is_clinical_image(
    image_url: str,
    page_content: str = "",
    page_title: str = "",
    image_alt: str = "",
) -> Tuple[bool, Dict]:

    # Step 1: Combine all context for analysis
    text_to_analyze = " ".join([
        image_url.lower(),       # Analyze filename & path
        page_title.lower(),      # Page context
        image_alt.lower(),       # Image description
        page_content.lower()[:2000],  # First 2000 chars
    ])

    score = 0

    # Step 2: Match clinical patterns (+2 points each)
    for pattern in clinical_patterns:  # 60+ patterns
        if pattern.lower() in text_to_analyze:
            score += 2

    # Step 3: Match Indian hospitals (+3 points each)
    for hospital in INDIAN_HOSPITAL_PATTERNS:  # 45+ hospitals
        if hospital.lower() in text_to_analyze:
            score += 3

    # Step 4: URL analysis (+3 points)
    if "discharge" in urlparse(image_url).path.lower():
        score += 3

    # Classification threshold (higher than V2)
    return score >= 5, {"clinical_score": score}  # ... further analysis fields

Design decisions:

Context-based: Analyzes surrounding text, not pixel content
Higher threshold (≥5): Reduces false positives for images
URL as signal: Filenames often contain clinical indicators
Pattern matching: Interpretable, zero-shot capability
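
The detector takes the image URL, page title, and alt text as inputs, but the step that gathers them from a page is not shown. A minimal sketch of that extraction using the standard library's html.parser follows. ImageContextParser is a hypothetical name, and the source does not say which HTML library V3 uses.

```python
from html.parser import HTMLParser
from typing import Dict, List
from urllib.parse import urljoin


class ImageContextParser(HTMLParser):
    """Collects the page title and each <img> tag's URL and alt text."""

    def __init__(self, base_url: str):
        super().__init__()
        self.base_url = base_url
        self.page_title = ""
        self.images: List[Dict[str, str]] = []
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag == "img":
            attr = dict(attrs)
            if attr.get("src"):  # skip images without a source
                self.images.append({
                    "image_url": urljoin(self.base_url, attr["src"]),
                    "image_alt": attr.get("alt") or "",
                })

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.page_title += data.strip()


html = (
    "<html><head><title>Sample discharge summary</title></head>"
    '<body><img src="/scans/discharge_01.jpg" alt="Discharge summary scan"></body></html>'
)
parser = ImageContextParser("https://example.org/cases/12")
parser.feed(html)
print(parser.page_title, parser.images)
```

Each collected dictionary, together with page_title, supplies the keyword arguments that is_clinical_image expects.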

Multi-Language OCR Processor

python

from typing import Dict, List, Optional

import pytesseract
from PIL import Image


def process_ocr(
    image_path: str,
    languages: Optional[List[str]] = None,
) -> Dict:
    """Extract text with Tesseract OCR (English+Hindi by default)."""
    # Avoid a mutable default argument
    if languages is None:
        languages = ["eng", "hin"]

    # Create language code: "eng+hin"
    lang_code = "+".join(languages)
    img = Image.open(image_path)

    # Get detailed OCR data with bounding boxes
    ocr_data = pytesseract.image_to_data(
        img,
        lang=lang_code,
        output_type=pytesseract.Output.DICT,
    )

    # Process word-level results
    text_blocks = []
    total_confidence = 0.0
    valid_blocks = 0

    for i in range(len(ocr_data["text"])):
        text = ocr_data["text"][i].strip()
        # Confidence 0-100 (-1 for non-word rows); some versions return strings
        conf = float(ocr_data["conf"][i])

        # Filter by confidence threshold
        if text and conf > 60:
            text_blocks.append({
                "text": text,
                "confidence": conf,
                "left": ocr_data["left"][i],    # Bounding box
                "top": ocr_data["top"][i],
                "width": ocr_data["width"][i],
                "height": ocr_data["height"][i],
            })
            total_confidence += conf
            valid_blocks += 1

    # Get full text output
    full_text = pytesseract.image_to_string(img, lang=lang_code)
    avg_confidence = total_confidence / valid_blocks if valid_blocks > 0 else 0

    return {
        "success": True,
        "text": full_text.strip(),
        "confidence": avg_confidence,
        "language": lang_code,
        "blocks": text_blocks,
    }

OCR output structure:

text: Full extracted text (plain string)
confidence: Average confidence (0-100) over the words kept by the >60 filter; 0 if no word passes
language: Language code used ("eng+hin")
blocks: Word-level detections with position and confidence
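
The word-level blocks make it possible to tie a PHI match back to a region of the image. The sketch below shows one way that could be done, using a single hypothetical pattern (a run of exactly nine digits) and hand-written sample blocks in the structure described above. The flag_phi_blocks helper is illustrative and is not the V3 PHI detector.

```python
import re
from typing import Dict, List

# Hypothetical pattern for illustration: exactly nine consecutive digits.
NINE_DIGITS = re.compile(r"(?<!\d)\d{9}(?!\d)")


def flag_phi_blocks(blocks: List[Dict]) -> List[Dict]:
    """Return the text and (left, top, right, bottom) box of matching blocks."""
    flagged = []
    for block in blocks:
        if NINE_DIGITS.search(block["text"]):
            flagged.append({
                "text": block["text"],
                "box": (
                    block["left"],
                    block["top"],
                    block["left"] + block["width"],
                    block["top"] + block["height"],
                ),
            })
    return flagged


sample_blocks = [
    {"text": "MRN", "confidence": 91.0, "left": 40, "top": 20, "width": 50, "height": 18},
    {"text": "123456789", "confidence": 88.0, "left": 100, "top": 20, "width": 120, "height": 18},
]
print(flag_phi_blocks(sample_blocks))
# [{'text': '123456789', 'box': (100, 20, 220, 38)}]
```

Matching word by word cannot catch identifiers that span several words, such as a full name, so a real detector would likely also scan the joined text.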

Document Type Classifier

python

def classify_document(ocr_text: str) -> str:
    """Classify document into one of 10 types."""

    # Ordered by specificity (most specific first)
    document_types = [
        ("discharge summary", "Discharge Summary"),
        ("discharge advice", "Discharge Summary"),
        ("prescription", "Prescription"),
        ("case sheet", "Case Sheet"),
        ("lab report", "Lab Report"),
        ("blood report", "Lab Report"),
        ("scan report", "Scan Report"),
        ("x-ray", "Scan Report"),
        ("CT scan", "Scan Report"),
        ("ECG", "ECG Report"),
        ("echocardiogram", "Echocardiogram"),
        ("ICU note", "ICU Note"),
        ("OPD", "OPD Card"),
        ("emergency", "Emergency Note"),
        # ... 10 total types
    ]

    text_lower = ocr_text.lower()

    # Find first matching type (ordered by specificity)
    for pattern, doc_type in document_types:
        # Lowercase the pattern too; otherwise "CT scan", "ECG", "ICU note"
        # and "OPD" can never match the lowercased text
        if pattern.lower() in text_lower:
            return doc_type  # Stop at first match

    return "Unknown"

Classification strategy:

Ordered patterns: Most specific types checked first
First match wins: the early return prevents overwriting with less specific types
Pattern-based: Zero-shot capability (no training data needed)
Interpretable: Clear reason for each classification
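
To illustrate why ordering matters, the sketch below runs a subset of the rules against a document that mentions two types. It also swaps plain substring checks for case-insensitive word-boundary regexes, so that short patterns such as "OPD" match only as whole words. That variation is an assumption made here, not something the source describes.

```python
import re
from typing import List, Tuple

# A subset of the (pattern, type) pairs above, kept in the same order.
RULES: List[Tuple[str, str]] = [
    ("discharge summary", "Discharge Summary"),
    ("prescription", "Prescription"),
    ("lab report", "Lab Report"),
    ("ECG", "ECG Report"),
    ("OPD", "OPD Card"),
]
# Compile once; word boundaries and IGNORECASE are assumptions of this sketch.
COMPILED = [
    (re.compile(r"\b" + re.escape(pattern) + r"\b", re.IGNORECASE), doc_type)
    for pattern, doc_type in RULES
]


def classify(ocr_text: str) -> str:
    """Return the type of the first rule that matches, or "Unknown"."""
    for regex, doc_type in COMPILED:
        if regex.search(ocr_text):
            return doc_type
    return "Unknown"


print(classify("DISCHARGE SUMMARY\nPrescription on discharge: Tab. Paracetamol"))
# Discharge Summary
print(classify("Follow up in OPD after two weeks"))
# OPD Card
```

The first document contains both "discharge summary" and "prescription"; it is labeled Discharge Summary only because that rule comes first in the list.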

5.4 Performance Metrics

Image Detection F1: 88.6%
Classification Accuracy: 89%
OCR Confidence: 87.1%
PHI Detection Recall: 94%

⚡ Intermediate ⏱ ~4 min

6. Feature Evolution Matrix

Comprehensive comparison of features across all three versions.

Feature	V1	V2	V3
Content Modality	Text	Text	Text + Images
Clinical Detection	Keywords	Keywords + Indian	Context + Patterns
Content Scoring	Binary	Quantitative	Multi-dimensional
Politeness	Basic Delay	Rate Limiting	Rate Limiting
robots.txt	❌	✅	✅
Retry Logic	Basic	Exponential	Exponential
PHI Detection	2 patterns	20+ patterns	20+ patterns (OCR)
Language Support	English	English	English + Hindi
Document Types	N/A	N/A	10 types
Test Coverage	85%	88%	86%
Lines of Code	~2,500	~4,000	~6,500

⚡ Intermediate ⏱ ~4 min

7. Lessons Learned

7.1 Architectural Lessons

Lesson 1: Modularity Enables Evolution

The modular pipeline architecture proved essential for evolutionary development. Each version could add or modify components without redesigning the entire system.

Recommendation: Design research software with clear component boundaries and well-defined interfaces.
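
One way to picture such a component boundary is a pipeline of interchangeable stages, each a function that transforms an item or drops it. The sketch below is a hypothetical illustration of that idea; the Stage type, run_pipeline, and the two sample stages are not taken from the project.

```python
from typing import Callable, Iterable, List, Optional

# A stage takes one item and returns a transformed item, or None to drop it.
Stage = Callable[[dict], Optional[dict]]


def run_pipeline(items: Iterable[dict], stages: List[Stage]) -> List[dict]:
    """Pass each item through the stages in order, stopping if one drops it."""
    results = []
    for item in items:
        current: Optional[dict] = item
        for stage in stages:
            current = stage(current)
            if current is None:
                break
        if current is not None:
            results.append(current)
    return results


def length_filter(item: dict) -> Optional[dict]:
    """V1-style stage: drop texts shorter than 150 characters."""
    return item if len(item["text"]) >= 150 else None


def tag_indian(item: dict) -> dict:
    """V2-style stage added later without touching the earlier one."""
    return {**item, "is_indian": "s/o" in item["text"].lower()}


docs = [{"text": "too short"}, {"text": "Ramesh s/o Suresh " + "x" * 150}]
print(len(run_pipeline(docs, [length_filter])))              # 1
print(run_pipeline(docs, [length_filter, tag_indian])[0]["is_indian"])  # True
```

Adding a capability then means appending a stage to the list, which is the kind of incremental change the lesson describes.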

Lesson 2: Configuration Over Hardcoding

Moving from hardcoded values (V1) to configuration files (V2/V3) enabled customization without code changes.

Recommendation: Externalize all tunable parameters to configuration files.
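
As a rough illustration, the tunable values that appear throughout this retrospective could be gathered into one configuration object loaded from a file. The field names and the JSON format below are assumptions; only the default values come from the text.

```python
import json
from dataclasses import dataclass, field, fields
from typing import List


@dataclass
class CrawlerConfig:
    """Hypothetical config; defaults mirror the values quoted in this document."""
    base_delay_seconds: float = 5.0
    max_requests_per_minute: int = 6
    indian_score_threshold: int = 2
    image_score_threshold: int = 5
    ocr_min_confidence: int = 60
    ocr_languages: List[str] = field(default_factory=lambda: ["eng", "hin"])


def load_config(json_text: str) -> CrawlerConfig:
    """Overlay values from a JSON document onto the defaults."""
    known = {f.name for f in fields(CrawlerConfig)}
    overrides = json.loads(json_text)
    unknown = set(overrides) - known
    if unknown:
        # Fail loudly on typos instead of silently ignoring a setting.
        raise ValueError(f"Unknown config keys: {sorted(unknown)}")
    return CrawlerConfig(**overrides)


config = load_config('{"max_requests_per_minute": 12}')
print(config.max_requests_per_minute, config.base_delay_seconds)  # 12 5.0
```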

Lesson 3: Testing Infrastructure Pays Dividends

The comprehensive test suite (85%+ coverage) enabled confident refactoring across versions.

Recommendation: Invest in testing infrastructure from the beginning; it enables future evolution.

7.2 Domain-Specific Lessons

Lesson 4: Cultural Adaptation Requires Knowledge Engineering

Generic NLP techniques failed for Indian clinical text. Success required domain-specific knowledge: abbreviations, hospital names, documentation patterns.

Recommendation: Engage domain experts early; build knowledge bases, not just models.

Lesson 5: Ethics Must Evolve with Capabilities

As collection capabilities expanded (text → images), ethical safeguards needed parallel enhancement: basic delays in V1, rate limiting and robots.txt compliance in V2, and PHI detection on OCR text in V3.

Recommendation: Design ethics frameworks that scale with system capabilities.

Lesson 6: Multi-Modal Requires Specialized Pipelines

Image collection required fundamentally different processing (OCR, document classification) while maintaining unified ethics.

Recommendation: Plan for modality-specific processing within unified architectural frameworks.

⚡ Advanced ⏱ ~3 min

8. Unified Ethics Framework

All versions adhere to core ethical principles with evolving implementation:

Principle	V1	V2	V3
PHI Protection	✅ Pattern detection	✅ Enhanced patterns	✅ OCR-based detection
Respectful Crawling	⚠️ Basic delays	✅ Rate limiting	✅ Rate limiting
Transparency	✅ Logging	✅ Enhanced logging	✅ Comprehensive audit
Purpose Limitation	✅ Research only	✅ Research only	✅ Research only
Data Minimization	✅ Relevance filtering	✅ Indian content focus	✅ Clinical image focus

Ethics Evolution

V1 Ethics: PHI pattern detection (2 patterns), basic politeness delays, research-only usage

V2 Ethics Enhancements: Expanded PHI patterns (20+), token bucket rate limiting, robots.txt compliance, exponential backoff retries

V3 Ethics Enhancements: OCR-based PHI detection, image-level PHI flagging, document type tracking, comprehensive metadata

🔑 Key Takeaways

Modular pipeline architecture enabled incremental enhancement: V1 (~2,500 LOC) → V2 (~4,000 LOC) → V3 (~6,500 LOC) without system redesign
Cultural adaptation requires domain-specific knowledge engineering: 50+ hospitals, 300+ abbreviations, 80+ patterns for Indian clinical content
Ethics frameworks must scale with capabilities: basic delays (V1) → rate limiting and robots.txt compliance (V2) → OCR-based PHI detection (V3)
Multi-modal collection needs specialized pipelines: image detection (context-based), OCR (Tesseract Eng+Hin), document classification (10 types)
Pattern-based approaches offer interpretability and zero-shot capability vs. machine learning requiring training data
Test coverage (85%+) across all versions enabled confident refactoring and evolution