
Abstract

This paper presents a retrospective on the evolution of the Clinical Research Agent across three major versions, tracing the architectural decisions, feature enhancements, and lessons learned while developing a production-quality clinical data collection system.

Version Evolution at a Glance:
  • V1: Foundational pipeline for clinical text collection with 85%+ test coverage and ethical crawling basics
  • V2: Cultural adaptation with 300+ Indian medical abbreviations, advanced politeness framework, 92.3% F1 for Indian detection
  • V3: Multi-modal extension to images with OCR (English+Hindi), 10 document types, 89% classification accuracy

The evolved system achieves 88-92% F1 scores across detection tasks while maintaining 100% robots.txt compliance and comprehensive PHI flagging.

1. Version Timeline

The evolution was driven by three critical realizations during deployment and field studies, listed after the version summaries below.

V1: Foundation

Text collection pipeline with keyword filtering and basic politeness delays

~2,500 lines of code

V2: Cultural Adaptation

Indian content detection + advanced politeness framework

~4,000 lines of code

V3: Multi-Modal

Image collection + OCR + document classification

~6,500 lines of code

Three Critical Realizations:
  1. Cultural Blindness: V1 treated all clinical text uniformly, missing distinctive Indian patterns (s/o, d/o, "came with complaints of")
  2. Ethical Crawling: Basic 2-second delays were inadequate for respectful crawling of medical/academic websites
  3. Image Dominance: A significant portion of Indian clinical documentation exists as scanned images, not digital text

2. Version Comparison

Each version addressed specific limitations while preserving working functionality from its predecessor.

VERSION 1

Foundation

Text Collection Pipeline

  • Sequential pipeline architecture
  • Keyword-based relevance filtering
  • Pattern-based PHI detection (2 patterns)
  • Basic politeness delays (2 seconds)
  • 85%+ test coverage
  • Fail-safe design (no exceptions)

VERSION 2

Cultural Adaptation

Indian Content + Enhanced Ethics

  • 50+ Indian hospital names
  • 300+ Indian medical abbreviations
  • Token bucket rate limiting (6 req/min)
  • robots.txt compliance with caching
  • Exponential backoff retry logic
  • 92.3% F1 for Indian detection

VERSION 3

Multi-Modal

Image Collection + OCR

  • Clinical image detection (context-based)
  • Multi-language OCR (English+Hindi)
  • 10 document type classifications
  • PHI detection in OCR text
  • Word-level bounding boxes
  • 89% document classification accuracy

3. V1: Foundation

3.1 Problem Addressed

V1 addressed the fundamental challenge: How can we automatically collect publicly available clinical text for NLP research while maintaining ethical standards?

3.2 Architecture

┌─────────────┐    ┌─────────────┐    ┌─────────────┐    ┌─────────────┐    ┌─────────────┐
│   Seed      │───▶│  Scraper    │───▶│  Extractor  │───▶│   Filter    │───▶│  Storage    │
│   URLs      │    │  (HTML)     │    │  (PDF/TXT)  │    │ (Keywords)  │    │  (Files)    │
└─────────────┘    └─────────────┘    └─────────────┘    └─────────────┘    └─────────────┘
                          │                                      │
                          ▼                                      ▼
                   ┌─────────────┐                       ┌─────────────┐
                   │   Downloader│                       │  PHI Check  │
                   └─────────────┘                       └─────────────┘

Figure 1: V1 sequential pipeline architecture

3.3 Key Components

Filter Module (Keyword-Based)

python
def is_clinically_relevant(text: str) -> bool:
    """Check if text is clinically relevant."""
    
    # Step 1: Check minimum length
    if len(text) < 150:
        return False

    # Step 2: Check PHI patterns
    if contains_phi(text):
        return False  # Reject content with PHI

    # Step 3: Count keyword matches
    keywords = [
        "discharge summary", "hospital course",
        "patient history", "medications",
    ]
    matches = sum(1 for kw in keywords if kw in text.lower())
    
    # Need at least 2 keyword matches
    return matches >= 2

Design decisions:
  • Keyword matching: Simple, transparent, interpretable
  • PHI detection: Regex patterns for SSN, 9-digit numbers
  • Conservative approach: Reject when uncertain (protects privacy)
  • Binary classification: Relevant or not-relevant (no scoring)
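
The filter calls `contains_phi` without showing its body. A minimal sketch follows, assuming the two V1 patterns are a dashed US Social Security number and a standalone 9-digit number. The exact regular expressions are assumptions for illustration; the source says only "SSN, 9-digit numbers".

```python
import re

# Assumed patterns (hypothetical): not taken from the project's code.
PHI_PATTERNS = [
    re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),  # SSN written as 123-45-6789
    re.compile(r"\b\d{9}\b"),              # any standalone 9-digit number
]


def contains_phi(text: str) -> bool:
    """Return True if any PHI pattern matches anywhere in the text."""
    return any(pattern.search(text) for pattern in PHI_PATTERNS)
```

Because the filter rejects on any match, a pattern this broad trades recall of usable text for privacy, which is consistent with the conservative approach listed above.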

3.4 Performance Metrics

  • Test Coverage: 85%
  • Precision: 88%
  • Recall: 82%
  • Throughput: 10-20 samples/minute
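
V1 reports precision and recall, while later versions report F1. To put the versions on one scale, the harmonic mean can be computed from the V1 figures. The result is derived arithmetic, not a number reported in the source.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


v1_f1 = f1_score(0.88, 0.82)
print(f"V1 F1 = {v1_f1:.1%}")  # V1 F1 = 84.9%
```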

3.5 Limitations Identified

V1 Limitations:
  • Cultural Blindness: No distinction between Indian and Western clinical text
  • Basic Politeness: Simple delays without rate limiting or robots.txt
  • Text-Only: No support for image-based documents
  • Generic Keywords: Western-centric keyword lists
  • No Content Scoring: Binary relevant/not-relevant classification

4. V2: Cultural Adaptation

4.1 Problems Addressed

V2 addressed two critical V1 limitations:

  1. How can we detect and prioritize Indian clinical content?
  2. How can we implement more sophisticated crawling etiquette?

4.2 Architecture Evolution

┌─────────────┐    ┌──────────────────┐    ┌─────────────┐    ┌───────────────────┐    ┌─────────────┐
│   Seed      │───▶│  Politeness      │───▶│  Scraper    │───▶│  Indian Content   │───▶│  Storage    │
│   URLs      │    │  Manager         │    │  (HTML)     │    │  Filter           │    │  (Files)    │
└─────────────┘    └──────────────────┘    └─────────────┘    └───────────────────┘    └─────────────┘
                          │                                           │
                          ├── Rate Limiter                            ├── Hospital Detection
                          ├── robots.txt                              ├── Abbreviation Detection
                          └── Retry Logic                             └── Pattern Detection

Figure 2: V2 architecture showing new politeness manager and enhanced Indian content detection
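
The robots.txt branch of the Politeness Manager is not shown in code. One way to combine compliance with caching is sketched here using the standard-library `urllib.robotparser`. The class name `RobotsCache`, the `USER_AGENT` string, and the per-host cache layout are assumptions, not the project's implementation.

```python
from urllib.parse import urlparse
from urllib.robotparser import RobotFileParser

USER_AGENT = "ClinicalResearchAgent"  # hypothetical user-agent string


class RobotsCache:
    """Cache one parsed robots.txt per host so each site is fetched once."""

    def __init__(self):
        self._parsers = {}  # "scheme://host" -> RobotFileParser

    def _load(self, base: str) -> RobotFileParser:
        parser = RobotFileParser(base + "/robots.txt")
        parser.read()  # network fetch; happens once per host
        return parser

    def allowed(self, url: str) -> bool:
        """Return True if robots.txt for the URL's host permits fetching it."""
        parts = urlparse(url)
        base = f"{parts.scheme}://{parts.netloc}"
        if base not in self._parsers:
            self._parsers[base] = self._load(base)
        return self._parsers[base].can_fetch(USER_AGENT, url)
```

A crawler would call `allowed(url)` before every request and skip the URL when it returns False. A production version would also expire cache entries so that changes to a site's robots.txt are picked up.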

4.3 New Components

Token Bucket Rate Limiter

python
import time
from datetime import datetime


class RateLimiter:
    """Token bucket rate limiter for smooth rate control."""

    def __init__(self, max_requests_per_minute: int = 6):
        self.max_rpm = max_requests_per_minute
        self.burst_size = max_requests_per_minute
        self.tokens = float(self.burst_size)  # Start full
        self.last_update = datetime.now()

    def _refill_tokens(self):
        """Add tokens based on elapsed time."""
        now = datetime.now()
        elapsed = (now - self.last_update).total_seconds()

        # At the default setting: rate = 6/60 = 0.1 tokens per second
        tokens_to_add = elapsed * self.max_rpm / 60.0
        self.tokens = min(self.burst_size, self.tokens + tokens_to_add)
        self.last_update = now

    def acquire(self) -> bool:
        """Acquire a token, blocking until one is available."""
        while True:
            self._refill_tokens()
            if self.tokens >= 1.0:
                self.tokens -= 1.0  # Consume token
                return True
            time.sleep(0.1)  # Poll again after a short wait

How token bucket works:
  • Bucket capacity: 6 tokens (allows initial burst)
  • Refill rate: 0.1 tokens/second (6 per minute)
  • Each request: Consumes 1 token
  • When empty: Must wait for refill
  • Benefit: Smooth rate limiting with controlled bursting
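
To make the burst-then-throttle behavior concrete, the following sketch computes when each of a series of back-to-back requests would be released, without sleeping. `release_times` is a hypothetical helper written for this illustration. It models the same arithmetic as the limiter but ignores the 0.1-second polling interval, so real release times can be slightly later.

```python
def release_times(n_requests: int, rpm: int = 6) -> list:
    """Earliest time (seconds) at which each back-to-back request can be sent."""
    rate = rpm / 60.0    # tokens added per second
    tokens = float(rpm)  # bucket starts full
    now = 0.0
    times = []
    for _ in range(n_requests):
        if tokens < 1.0:
            # Wait until exactly one whole token has accumulated
            now += (1.0 - tokens) / rate
            tokens = 1.0
        tokens -= 1.0
        times.append(round(now, 1))
    return times


print(release_times(9))
# [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 10.0, 20.0, 30.0]
```

The first six requests drain the initial burst immediately; after that, one request is released every 10 seconds, which matches the 6 req/min refill rate.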

Indian Content Detection

python
import re
from typing import Tuple


def is_indian_clinical_content(text: str) -> Tuple[bool, dict]:
    """Detect Indian clinical content with scoring."""
    text_lower = text.lower()
    score = 0
    analysis = {
        "indian_score": 0,
        "detected_hospitals": [],
        "detected_abbreviations": [],
        "detected_terms": [],
        "has_indian_pattern": False,
    }

    # Hospital detection: +2 points each (strong signal)
    INDIAN_HOSPITALS = ["AIIMS", "PGIMER", "Apollo"]  # excerpt; full list has 50+ hospitals
    for hospital in INDIAN_HOSPITALS:
        if hospital.lower() in text_lower:
            analysis["detected_hospitals"].append(hospital)
            score += 2

    # Abbreviation detection: +1 point each
    INDIAN_ABBREVIATIONS = ["s/o", "d/o", "w/o", "c/o", "h/o", "NAD"]  # excerpt; full list has 300+ abbrevs
    for abbrev in INDIAN_ABBREVIATIONS:
        if re.search(r"\b" + re.escape(abbrev) + r"\b", text, re.I):
            analysis["detected_abbreviations"].append(abbrev)
            score += 1

    # Pattern detection: +3 points each (strongest signal)
    indian_patterns = [
        r"\b(?:s/o|d/o|w/o|r/o)\s+\w+",  # Family relations
        r"\b(?:came with|presented with)\s+complaints?\s+of",  # Indian phrasing
        r"\bper\s+(?:abdomen|speculum)\b",  # Examination style
    ]
    for pattern in indian_patterns:
        if re.search(pattern, text_lower):
            analysis["has_indian_pattern"] = True
            score += 3

    analysis["indian_score"] = score
    analysis["is_indian"] = score >= 2  # Low threshold for high recall

    return analysis["is_indian"], analysis

Scoring weights:
  • Hospital names (+2): Strong indicator of Indian content
  • Abbreviations (+1): Common but not definitive alone
  • Patterns (+3): Very strong indicator (documentation style)
  • Threshold (≥2): Low threshold ensures high recall (91.3%)
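
The weights and threshold determine which minimal evidence is enough for a positive classification. The sketch below works through that arithmetic; `indian_score` is a hypothetical helper that takes match counts directly rather than scanning text.

```python
WEIGHTS = {"hospital": 2, "abbreviation": 1, "pattern": 3}
THRESHOLD = 2


def indian_score(hospitals: int, abbreviations: int, patterns: int) -> int:
    """Total score for the given number of matches of each signal type."""
    return (
        hospitals * WEIGHTS["hospital"]
        + abbreviations * WEIGHTS["abbreviation"]
        + patterns * WEIGHTS["pattern"]
    )


cases = {
    "one abbreviation": (0, 1, 0),
    "two abbreviations": (0, 2, 0),
    "one hospital": (1, 0, 0),
    "one pattern": (0, 0, 1),
}
for label, counts in cases.items():
    score = indian_score(*counts)
    print(f"{label}: score={score}, is_indian={score >= THRESHOLD}")
# one abbreviation: score=1, is_indian=False
# two abbreviations: score=2, is_indian=True
# one hospital: score=2, is_indian=True
# one pattern: score=3, is_indian=True
```

Only a lone abbreviation falls below the threshold. Signals also overlap in the function above: a phrase such as "s/o Ramesh" matches both the `s/o` abbreviation (+1) and the family-relation pattern (+3), scoring 4 by itself.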

4.4 Performance Improvements

Metric              | V1     | V2                  | Improvement
Base Delay          | 2.0s   | 5.0s                | +150%
Rate Limiting       | None   | 6 req/min           | New
robots.txt          | ❌     | ✅                  | New
Retry Logic         | Basic  | Exponential backoff | Enhanced
Indian Detection F1 | N/A    | 92.3%               | New
Content Scoring     | Binary | Quantitative        | Enhanced
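
The table names exponential backoff but the source does not show the schedule. A minimal sketch, assuming the delay doubles on each retry starting from the 5.0 s base delay; the doubling factor, retry count, cap, and jitter option are all assumptions.

```python
import random


def backoff_delays(
    base_delay: float = 5.0,  # V2 base delay from the table
    max_retries: int = 4,     # assumed
    cap: float = 120.0,       # assumed upper bound in seconds
    jitter: float = 0.0,      # assumed: fraction of the delay added at random
) -> list:
    """Delay before each retry: base * 2**attempt, capped, plus optional jitter."""
    delays = []
    for attempt in range(max_retries):
        delay = min(cap, base_delay * (2 ** attempt))
        delay += random.uniform(0, jitter * delay)
        delays.append(delay)
    return delays


print(backoff_delays())  # [5.0, 10.0, 20.0, 40.0]
```

Setting `jitter` above zero spreads retries from concurrent workers so they do not hit a struggling server at the same instant.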

5. V3: Multi-Modal Image Collection

5.1 Problems Addressed

V3 addressed a fundamental modality gap:

  1. How can we collect and process image-based clinical documents?
  2. How can we extract text from scanned documents with OCR?
  3. How can we classify document types automatically?

5.2 Architecture Evolution

┌─────────────┐    ┌──────────────────┐    ┌─────────────────┐    ┌───────────────────┐
│   Seed      │───▶│  Politeness      │───▶│  Image          │───▶│  Clinical Image   │
│   URLs      │    │  Manager         │    │  Extractor      │    │  Filter           │
└─────────────┘    └──────────────────┘    └─────────────────┘    └───────────────────┘
                                                                        │
                                                                        ▼
┌─────────────┐    ┌──────────────────┐    ┌─────────────────┐    ┌───────────────────┐
│  Analytics  │◀───│  PHI Detector    │◀───│  Document       │◀───│  OCR Processor    │
│  Dashboard  │    │  (OCR Text)      │    │  Classifier     │    │  (Eng+Hin)        │
└─────────────┘    └──────────────────┘    └─────────────────┘    └───────────────────┘

Figure 3: V3 architecture showing image collection and OCR processing pipeline

5.3 New Components

Clinical Image Detection (Context-Based)

Key Insight: Unlike text files, images don't contain directly analyzable text. V3 analyzes the context around the image: URL, page title, alt text, and surrounding content.

python
from typing import Dict, Tuple
from urllib.parse import urlparse

# clinical_patterns (60+ entries) and INDIAN_HOSPITAL_PATTERNS (45+ entries)
# are module-level lists of strings that are not shown in this excerpt.


def is_clinical_image(
    image_url: str,
    page_content: str = "",
    page_title: str = "",
    image_alt: str = "",
) -> Tuple[bool, Dict]:

    # Step 1: Combine all context for analysis
    text_to_analyze = " ".join([
        image_url.lower(),       # Analyze filename & path
        page_title.lower(),      # Page context
        image_alt.lower(),       # Image description
        page_content.lower()[:2000],  # First 2000 chars
    ])

    score = 0

    # Step 2: Match clinical patterns (+2 points each)
    for pattern in clinical_patterns:  # 60+ patterns
        if pattern.lower() in text_to_analyze:
            score += 2

    # Step 3: Match Indian hospitals (+3 points each)
    for hospital in INDIAN_HOSPITAL_PATTERNS:  # 45+ hospitals
        if hospital.lower() in text_to_analyze:
            score += 3

    # Step 4: URL analysis (+3 points)
    if "discharge" in urlparse(image_url).path.lower():
        score += 3

    # Classification threshold (higher than V2)
    return score >= 5, {"clinical_score": score}  # other analysis fields elided in the source

Design decisions:
  • Context-based: Analyzes surrounding text, not pixel content
  • Higher threshold (≥5): Reduces false positives for images
  • URL as signal: Filenames often contain clinical indicators
  • Pattern matching: Interpretable, zero-shot capability
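
The detector expects the page title and alt text as inputs, which have to be pulled from the HTML first. One possible way to gather them with the standard-library `html.parser` is sketched here. `ImageContextParser` is a hypothetical name, and the source does not say how the Image Extractor actually obtains this context.

```python
from html.parser import HTMLParser


class ImageContextParser(HTMLParser):
    """Collect the page title and each <img> tag's src and alt text."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.images = []  # list of {"src": ..., "alt": ...}
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag == "img":
            attr = dict(attrs)
            # Valueless attributes arrive as None, so fall back to ""
            self.images.append({
                "src": attr.get("src") or "",
                "alt": attr.get("alt") or "",
            })

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


parser = ImageContextParser()
parser.feed(
    "<html><head><title>Case report</title></head>"
    '<body><img src="/img/discharge_01.jpg" alt="Discharge summary scan"></body></html>'
)
print(parser.title)   # Case report
print(parser.images)  # [{'src': '/img/discharge_01.jpg', 'alt': 'Discharge summary scan'}]
```

Each collected `src` and `alt`, together with `parser.title`, can then be passed to `is_clinical_image`.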

Multi-Language OCR Processor

python
from typing import Dict, Sequence


def process_ocr(
    image_path: str,
    languages: Sequence[str] = ("eng", "hin"),  # tuple avoids a mutable default
) -> Dict:
    """Extract text with Tesseract OCR (English+Hindi)."""
    import pytesseract
    from PIL import Image

    # Create language code: "eng+hin"
    lang_code = "+".join(languages)
    img = Image.open(image_path)

    # Get detailed OCR data with bounding boxes
    ocr_data = pytesseract.image_to_data(
        img,
        lang=lang_code,
        output_type=pytesseract.Output.DICT,
    )

    # Process word-level results
    text_blocks = []
    total_confidence = 0.0
    valid_blocks = 0

    for i in range(len(ocr_data["text"])):
        text = ocr_data["text"][i].strip()
        # Confidence 0-100 (-1 for non-word rows); cast because the value
        # type (int, float, or str) varies across pytesseract versions
        conf = float(ocr_data["conf"][i])

        # Filter by confidence threshold
        if text and conf > 60:
            text_blocks.append({
                "text": text,
                "confidence": conf,
                "left": ocr_data["left"][i],    # Bounding box
                "top": ocr_data["top"][i],
                "width": ocr_data["width"][i],
                "height": ocr_data["height"][i],
            })
            total_confidence += conf
            valid_blocks += 1

    # Get full text output
    full_text = pytesseract.image_to_string(img, lang=lang_code)
    avg_confidence = total_confidence / valid_blocks if valid_blocks > 0 else 0

    return {
        "success": True,
        "text": full_text.strip(),
        "confidence": avg_confidence,
        "language": lang_code,
        "blocks": text_blocks,
    }

OCR output structure:
  • text: Full extracted text (plain string)
  • confidence: Average confidence score (0-100)
  • language: Language code used ("eng+hin")
  • blocks: Word-level detections with position and confidence
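
Because each block carries its position, PHI found in OCR text can be tied back to a region of the image. The sketch below shows the idea with a single illustrative pattern; `phi_regions` and the 9-digit regex are assumptions, not the project's detector.

```python
import re

# Hypothetical single pattern for illustration: a standalone 9-digit number.
NINE_DIGITS = re.compile(r"^\d{9}$")


def phi_regions(blocks: list) -> list:
    """Return (left, top, width, height) for each word block that looks like PHI."""
    return [
        (b["left"], b["top"], b["width"], b["height"])
        for b in blocks
        if NINE_DIGITS.match(b["text"])
    ]


sample_blocks = [
    {"text": "MRN", "confidence": 93, "left": 40, "top": 12, "width": 38, "height": 14},
    {"text": "123456789", "confidence": 88, "left": 84, "top": 12, "width": 90, "height": 14},
]
print(phi_regions(sample_blocks))  # [(84, 12, 90, 14)]
```

Word-level matching misses PHI that spans several words, such as a full name, so a fuller implementation would also match against the joined text and map character offsets back to blocks.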

Document Type Classifier

python
def classify_document(ocr_text: str) -> str:
    """Classify document into one of 10 types."""

    # Ordered by specificity (most specific first).
    # Patterns must be lowercase: they are matched against ocr_text.lower().
    document_types = [
        ("discharge summary", "Discharge Summary"),
        ("discharge advice", "Discharge Summary"),
        ("prescription", "Prescription"),
        ("case sheet", "Case Sheet"),
        ("lab report", "Lab Report"),
        ("blood report", "Lab Report"),
        ("scan report", "Scan Report"),
        ("x-ray", "Scan Report"),
        ("ct scan", "Scan Report"),
        ("ecg", "ECG Report"),
        ("echocardiogram", "Echocardiogram"),
        ("icu note", "ICU Note"),
        ("opd", "OPD Card"),
        ("emergency", "Emergency Note"),
        # ... 10 total types
    ]

    text_lower = ocr_text.lower()

    # Find first matching type (ordered by specificity)
    for pattern, doc_type in document_types:
        if pattern in text_lower:
            return doc_type  # Stop at first match

    return "Unknown"

Classification strategy:
  • Ordered patterns: Most specific types checked first
  • First match wins: the early return prevents a less specific type from overriding the match
  • Pattern-based: Zero-shot capability (no training data needed)
  • Interpretable: Clear reason for each classification
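
The effect of pattern order is easiest to see on a document that mentions more than one type. This is a reduced sketch with two rules and a made-up sample text; `first_match` is a hypothetical stand-in for the classifier loop.

```python
RULES = [
    ("discharge summary", "Discharge Summary"),
    ("prescription", "Prescription"),
]


def first_match(text: str, rules) -> str:
    """Return the label of the first rule whose pattern occurs in the text."""
    text_lower = text.lower()
    for pattern, label in rules:
        if pattern in text_lower:
            return label
    return "Unknown"


text = "DISCHARGE SUMMARY ... prescription on discharge: Tab. Paracetamol"
print(first_match(text, RULES))                  # Discharge Summary
print(first_match(text, list(reversed(RULES))))  # Prescription
```

The same text receives a different label when the rules are reversed, so the ordering of the pattern list is part of the classifier's behavior and needs the same care as the patterns themselves.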

5.4 Performance Metrics

  • Image Detection F1: 88.6%
  • Classification Accuracy: 89%
  • OCR Confidence: 87.1%
  • PHI Detection Recall: 94%

6. Feature Evolution Matrix

Comprehensive comparison of features across all three versions.

Feature            | V1          | V2                | V3
Content Modality   | Text        | Text              | Text + Images
Clinical Detection | Keywords    | Keywords + Indian | Context + Patterns
Content Scoring    | Binary      | Quantitative      | Multi-dimensional
Politeness         | Basic Delay | Rate Limiting     | Rate Limiting
robots.txt         | ❌          | ✅                | ✅
Retry Logic        | Basic       | Exponential       | Exponential
PHI Detection      | 2 patterns  | 20+ patterns      | 20+ patterns (OCR)
Language Support   | English     | English           | English + Hindi
Document Types     | N/A         | N/A               | 10 types
Test Coverage      | 85%         | 88%               | 86%
Lines of Code      | ~2,500      | ~4,000            | ~6,500

7. Lessons Learned

7.1 Architectural Lessons

Lesson 1: Modularity Enables Evolution

The modular pipeline architecture proved essential for evolutionary development. Each version could add or modify components without redesigning the entire system.

Recommendation: Design research software with clear component boundaries and well-defined interfaces.
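
One way to express such a component boundary is a shared stage interface that every pipeline step implements. The `Stage` protocol, `MinLengthFilter`, and `run_pipeline` below are hypothetical names used to illustrate the idea; the source does not describe the project's actual interfaces.

```python
from typing import List, Optional, Protocol


class Stage(Protocol):
    def process(self, item: dict) -> Optional[dict]:
        """Return the (possibly enriched) item, or None to drop it."""
        ...


class MinLengthFilter:
    """Drops items whose text is shorter than min_chars (150 in V1's filter)."""

    def __init__(self, min_chars: int = 150):
        self.min_chars = min_chars

    def process(self, item: dict) -> Optional[dict]:
        return item if len(item.get("text", "")) >= self.min_chars else None


def run_pipeline(item: dict, stages: List[Stage]) -> Optional[dict]:
    """Pass the item through each stage in order; stop as soon as one drops it."""
    for stage in stages:
        item = stage.process(item)
        if item is None:
            return None
    return item
```

With an interface like this, a later version can insert a new stage into the list without touching the stages that already work.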

Lesson 2: Configuration Over Hardcoding

Moving from hardcoded values (V1) to configuration files (V2/V3) enabled customization without code changes.

Recommendation: Externalize all tunable parameters to configuration files.
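
As an illustration of this recommendation, the tunable values that appear throughout this paper could live in one configuration object loaded from a file. The `CrawlConfig` class and its field names are hypothetical; the default values are the ones quoted in earlier sections.

```python
import json
from dataclasses import dataclass


@dataclass
class CrawlConfig:
    max_requests_per_minute: int = 6  # V2 rate limit
    base_delay_seconds: float = 5.0   # V2 base delay
    indian_score_threshold: int = 2   # V2 detection threshold
    image_score_threshold: int = 5    # V3 detection threshold
    ocr_min_confidence: int = 60      # V3 word-level OCR cutoff

    @classmethod
    def from_json(cls, raw: str) -> "CrawlConfig":
        """Override defaults with whatever keys the JSON document supplies."""
        return cls(**json.loads(raw))


config = CrawlConfig.from_json('{"max_requests_per_minute": 3}')
print(config.max_requests_per_minute, config.base_delay_seconds)  # 3 5.0
```

Unknown keys raise a `TypeError` at load time, which surfaces typos in a configuration file early instead of silently ignoring them.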

Lesson 3: Testing Infrastructure Pays Dividends

The comprehensive test suite (85%+ coverage) enabled confident refactoring across versions.

Recommendation: Invest in testing infrastructure from the beginning; it enables future evolution.

7.2 Domain-Specific Lessons

Lesson 4: Cultural Adaptation Requires Knowledge Engineering

Generic NLP techniques failed for Indian clinical text. Success required domain-specific knowledge: abbreviations, hospital names, documentation patterns.

Recommendation: Engage domain experts early; build knowledge bases, not just models.

Lesson 5: Ethics Must Evolve with Capabilities

As collection capabilities expanded (text → images), ethical safeguards needed parallel enhancement (basic delays → rate limiting → robots.txt).

Recommendation: Design ethics frameworks that scale with system capabilities.

Lesson 6: Multi-Modal Requires Specialized Pipelines

Image collection required fundamentally different processing (OCR, document classification) while maintaining unified ethics.

Recommendation: Plan for modality-specific processing within unified architectural frameworks.

8. Unified Ethics Framework

All versions adhere to core ethical principles with evolving implementation:

Principle           | V1                     | V2                      | V3
PHI Protection      | ✅ Pattern detection   | ✅ Enhanced patterns    | ✅ OCR-based detection
Respectful Crawling | ⚠️ Basic delays        | ✅ Rate limiting        | ✅ Rate limiting
Transparency        | ✅ Logging             | ✅ Enhanced logging     | ✅ Comprehensive audit
Purpose Limitation  | ✅ Research only       | ✅ Research only        | ✅ Research only
Data Minimization   | ✅ Relevance filtering | ✅ Indian content focus | ✅ Clinical image focus

Ethics Evolution

V1 Ethics: PHI pattern detection (2 patterns), basic politeness delays, research-only usage
V2 Ethics Enhancements: Expanded PHI patterns (20+), token bucket rate limiting, robots.txt compliance, exponential backoff retries
V3 Ethics Enhancements: OCR-based PHI detection, image-level PHI flagging, document type tracking, comprehensive metadata
