
Multimodal RAG over clinical guidelines: what the benchmarks don't tell you

September 2024

At Yonata, I built a RAG pipeline over 10,000 clinical guideline chunks for a cardiovascular and diabetes pilot. We had 800 synthetic patient profiles that needed to be matched against relevant guideline sections. Five physicians would validate the output.

The system hit 89% retrieval recall in automated eval. The physicians rejected 40% of the suggestions. Here's what the benchmark missed.

The pipeline

Clinical guidelines come as PDFs — structured documents with tables, numbered lists, and embedded figures. I used Mistral to convert PDFs to markdown before chunking, which preserved table structure better than naive text extraction. Chunks were indexed in Weaviate with metadata: guideline name, section, condition, evidence grade.

The metadata schema kept the searchable index separate from full document storage. That design decision let us filter by condition and evidence grade before running vector search, so similarity was only computed over a much smaller candidate set.
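
As a rough illustration of filter-then-search (not the production Weaviate code), the sketch below narrows candidates by metadata and only then ranks by cosine similarity. The `Chunk` fields, the grade labels, and the `filtered_search` helper are assumptions made for the example.

```python
import math
from dataclasses import dataclass


@dataclass
class Chunk:
    # Hypothetical, simplified version of the indexed metadata.
    chunk_id: str
    condition: str
    evidence_grade: str
    embedding: list[float]


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def filtered_search(chunks, query_vec, condition, allowed_grades, k=5):
    """Filter on metadata first, then rank the survivors by similarity."""
    candidates = [
        c for c in chunks
        if c.condition == condition and c.evidence_grade in allowed_grades
    ]
    ranked = sorted(
        candidates,
        key=lambda c: cosine(query_vec, c.embedding),
        reverse=True,
    )
    return ranked[:k]
```

In a vector database the filter would normally be pushed into the query itself rather than applied in application code; the point here is only the order of operations.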

What 89% recall actually means

The automated eval measured recall@5: whether the relevant guideline chunk appeared among the top five retrieved results. It did 89% of the time. This felt good.
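
To make the metric concrete, here is a minimal sketch of that kind of top-k check. It assumes one labeled relevant chunk per patient profile; the function name and the dictionary shapes are illustrative, not the actual eval harness.

```python
def recall_at_k(retrieved: dict, relevant: dict, k: int = 5) -> float:
    """Fraction of queries whose labeled chunk id appears in the top-k results.

    retrieved: query id -> ranked list of chunk ids (best first)
    relevant:  query id -> the single chunk id labeled as relevant
    """
    if not relevant:
        return 0.0
    hits = sum(
        1
        for query_id, gold_id in relevant.items()
        if gold_id in retrieved.get(query_id, [])[:k]
    )
    return hits / len(relevant)
```

Note what this does not check: it says nothing about whether the chunk at rank 1 is the right recommendation for the patient, only that a labeled chunk showed up somewhere in the first k.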

But the physicians weren't evaluating retrieval. They were evaluating clinical appropriateness — whether the retrieved guideline section was the right recommendation for this specific patient, not just topically related to their condition.

A patient with Type 2 diabetes and chronic kidney disease (CKD) has different medication recommendations than a patient with Type 2 diabetes alone. Our retrieval would correctly find diabetes guideline chunks (high recall) but often missed the comorbidity-specific sections (low clinical precision).

What we changed

We added a second retrieval pass specifically for comorbidities. After the primary retrieval, a lightweight classifier extracted comorbidities from the patient profile and ran a targeted search for sections mentioning those conditions in combination. Results from both passes were merged with a weighted score, giving higher weight to chunks that matched both the primary condition and the comorbidities.
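
The merge step can be sketched roughly as follows. The weights, the score dictionaries, and the `merge_passes` name are assumptions for illustration; the actual weighting is not specified here.

```python
def merge_passes(primary, comorbidity, w_primary=0.4, w_comorbidity=0.6):
    """Combine two retrieval passes into one ranked list.

    primary, comorbidity: chunk id -> similarity score in [0, 1].
    A chunk found by only one pass contributes 0.0 for the other, so
    chunks returned by both passes naturally score higher.
    The default weights are placeholders, not tuned values.
    """
    merged = {}
    for chunk_id in set(primary) | set(comorbidity):
        merged[chunk_id] = (
            w_primary * primary.get(chunk_id, 0.0)
            + w_comorbidity * comorbidity.get(chunk_id, 0.0)
        )
    # Sort by score descending, then chunk id, so ties are deterministic.
    return sorted(merged.items(), key=lambda item: (-item[1], item[0]))


primary_hits = {"dm_general": 0.9, "dm_ckd": 0.7}
comorbidity_hits = {"dm_ckd": 0.8, "ckd_only": 0.5}
ranked = merge_passes(primary_hits, comorbidity_hits)
```

With these made-up scores, `dm_ckd` ranks first even though `dm_general` had the highest primary similarity, which is the behavior the second pass was meant to produce.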

Physician acceptance rate went from 60% to 79% in the next validation round.

The confidence scoring problem

The validation platform showed a confidence score alongside each suggestion — essentially the cosine similarity from the vector search. Physicians quickly learned to distrust it. A high cosine similarity between "patient has hypertension" and a guideline chunk about hypertension management means the chunk is topically relevant, not that it's clinically appropriate for this patient.

We replaced the raw similarity score with a composite confidence: similarity × evidence grade weight × comorbidity match score. It is still imperfect, but it reflects the clinical structure of the guidelines rather than semantic distance alone.
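
A minimal sketch of that product is below. The grade weights, the fallback weight for unknown grades, and the way the comorbidity match is computed (overlap fraction with a floor of 0.5, so a miss lowers the score without zeroing it) are all assumptions chosen for the example.

```python
# Hypothetical weights; real guidelines use different grading schemes.
GRADE_WEIGHTS = {"A": 1.0, "B": 0.8, "C": 0.6}
DEFAULT_GRADE_WEIGHT = 0.5


def confidence(similarity, evidence_grade, patient_comorbidities, chunk_conditions):
    """similarity x evidence grade weight x comorbidity match score."""
    sim = max(0.0, min(1.0, similarity))  # clamp cosine similarity to [0, 1]
    grade_weight = GRADE_WEIGHTS.get(evidence_grade, DEFAULT_GRADE_WEIGHT)

    if patient_comorbidities:
        overlap = len(patient_comorbidities & chunk_conditions) / len(
            patient_comorbidities
        )
        match_score = 0.5 + 0.5 * overlap  # assumed floor of 0.5
    else:
        match_score = 1.0  # nothing to match against

    return sim * grade_weight * match_score


matched = confidence(0.9, "A", {"ckd"}, {"t2dm", "ckd"})
unmatched = confidence(0.9, "A", {"ckd"}, {"t2dm"})
```

Two chunks with identical cosine similarity now get different scores depending on whether they cover the patient's comorbidities, which is the distinction the raw similarity could not express.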

The lesson

RAG evals measure retrieval. Users measure usefulness. For general-purpose domains these are correlated. For specialized domains — clinical, legal, financial — they diverge because domain experts apply implicit filtering rules that aren't in the query and aren't in the chunk metadata.

The right approach is to build your eval set with domain experts from the start, not to build a system and then validate it with experts. The gap between automated metrics and human judgment is information about what your system is missing.