HGSOC Liquid Biopsy Expert — RAG System

Overview

A domain-specific expert RAG system that answers clinical and research questions about liquid biopsy biomarkers in high-grade serous ovarian cancer (HGSOC). Unlike generic biomedical RAG systems, this one is grounded in a curated two-tier corpus: the full-text papers from a PROSPERO-registered systematic review (the evidence base), combined with a structured research wiki covering HGSOC biology, liquid biopsy methodology, and related literature.

Tech & Architecture

Corpus curation: Docling-based extraction pipeline for full-text PDFs (tables, figures, supplementary materials); structured Markdown knowledge base (~180 documents across two tiers)
Retrieval: hybrid BM25 + dense retrieval (MedCPT / BioLORD adapters) over deterministic JSONL chunk exports; metadata filters expose corpus tier at query time
Evaluation: three-arm comparison (long-context agent vs. RAG vs. QLoRA fine-tune) scored on field accuracy, citation accuracy, and hallucination rate against extraction_v2.db (PROSPERO CRD420261405303)
Search automation: PICO-to-boolean query generation benchmarked against a frozen 2,927-record human PubMed search; the template arm recovered 134 of 158 priority records (84.8% recall)
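
To make the retrieval step concrete, here is a minimal sketch of fusing a BM25 ranking with a dense ranking and restricting results to one corpus tier. The fusion method (reciprocal rank fusion), the field names `chunk_id` and `tier`, the tier labels, and the function names are all assumptions for illustration; the project may combine scores differently. The two input rankings stand in for the output of the real BM25 and MedCPT / BioLORD retrievers.

```python
from collections import defaultdict

# Hypothetical chunk records, shaped like rows of a JSONL chunk export.
# Field names and tier labels are assumptions, not the project's schema.
CHUNKS = [
    {"chunk_id": "sr-001", "tier": "evidence"},
    {"chunk_id": "sr-002", "tier": "evidence"},
    {"chunk_id": "wiki-001", "tier": "wiki"},
    {"chunk_id": "wiki-002", "tier": "wiki"},
]


def rrf_fuse(rankings, k=60):
    """Fuse ranked lists of chunk ids with reciprocal rank fusion."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by id so the order is deterministic.
    return sorted(scores, key=lambda cid: (-scores[cid], cid))


def hybrid_search(bm25_ranking, dense_ranking, chunks, tier=None, top_k=3):
    """Apply the optional tier filter to both rankings, then fuse them."""
    allowed = {c["chunk_id"] for c in chunks if tier is None or c["tier"] == tier}
    filtered = [
        [cid for cid in ranking if cid in allowed]
        for ranking in (bm25_ranking, dense_ranking)
    ]
    return rrf_fuse(filtered)[:top_k]


# Placeholder rankings; in practice these come from the two retrievers.
bm25 = ["sr-002", "wiki-001", "sr-001"]
dense = ["wiki-001", "sr-002", "wiki-002"]

print(hybrid_search(bm25, dense, CHUNKS))                   # both tiers
print(hybrid_search(bm25, dense, CHUNKS, tier="evidence"))  # evidence tier only
```

Rank-based fusion avoids having to put BM25 and embedding-similarity scores on a common scale. That is one reason it is a common default for hybrid retrieval, though a weighted score sum would fit the same interface.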

Results & Highlights

PROSPERO-registered systematic review as the evaluation ground truth, giving the benchmark human-curated provenance rather than LLM-generated annotations
Full PRISMA pipeline automated: search recall benchmarking (Task 1), title/abstract screening with per-criterion scoring (Task 2), structured data extraction (Task 3)
Per-criterion screening labels captured for 42 PI-confirmed full-text decisions, including 9 reclassifications that mark the decision boundary where automated screeners systematically fail
Two-tier corpus design decouples SR-quality extraction (benchmark-grounded) from broader domain Q&A (wiki-grounded), enabling both rigorous evaluation and expert-level synthesis
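
As a rough illustration of how two of the scores mentioned above could be computed, the sketch below defines search recall (Task 1) and a simple per-record field accuracy (Task 3). The exact-match-after-normalisation rule, the function names, and the example field names and values are assumptions. The record ids are synthetic placeholders chosen only to reproduce the reported 134-of-158 count.

```python
def recall(recovered_ids, priority_ids):
    """Fraction of priority records that appear in the recovered set."""
    priority = set(priority_ids)
    if not priority:
        raise ValueError("priority_ids must not be empty")
    return len(priority & set(recovered_ids)) / len(priority)


def field_accuracy(predicted, gold):
    """Fraction of gold fields whose predicted value matches after normalisation.

    Assumes exact string match on stripped, lower-cased values; the project's
    real scoring rule may be more lenient (e.g. numeric tolerance).
    """
    if not gold:
        raise ValueError("gold must not be empty")

    def norm(value):
        return str(value).strip().lower()

    hits = sum(
        1
        for field, gold_value in gold.items()
        if field in predicted and norm(predicted[field]) == norm(gold_value)
    )
    return hits / len(gold)


# Synthetic ids: 158 priority records, 134 of them recovered, plus 10
# recovered records that are not priority records and do not affect recall.
priority_ids = [f"rec-{i}" for i in range(158)]
recovered_ids = priority_ids[:134] + [f"other-{i}" for i in range(10)]
print(round(recall(recovered_ids, priority_ids), 3))

# Hypothetical extraction record with one mismatched field.
gold = {"biomarker": "ctDNA", "sample_type": "Plasma", "n_patients": "45"}
predicted = {"biomarker": "ctdna", "sample_type": "serum", "n_patients": "45"}
print(field_accuracy(predicted, gold))
```

Citation accuracy and hallucination rate would need their own definitions, such as what counts as a supported claim, so they are left out of this sketch.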