Clinical Trial Information Retrieval System
NLP-based search engine for biomedical text — Lucene indexing, BM25 ranking, and embedding-based retrieval evaluated with TREC metrics on clinical trial documents.
Problem
Clinical trial registries contain millions of free-text documents describing eligibility criteria, interventions, and outcomes. Clinicians and researchers need to query these efficiently, but plain keyword search fails on domain-specific terminology and semantic paraphrase. This system combines classical IR with neural embeddings to bridge that gap.
Retrieval Pipeline
- Indexing: Apache Lucene inverted index with custom biomedical text analysers — tokenisation, stemming, stop-word filtering tuned for clinical terminology
- Lexical retrieval: BM25 ranking as the first-stage retriever; fast candidate set generation from the Lucene index
- Semantic re-ranking: BioBERT fine-tuned for passage relevance scoring; dense embeddings indexed with FAISS for approximate nearest-neighbour search
- Hybrid fusion: Reciprocal Rank Fusion combining the BM25 and semantic rankings into the final ranked output
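The lexical and fusion stages above can be illustrated with a minimal, pure-Python sketch. It is not the project's Lucene or BioBERT code: the toy corpus, the function names `bm25_scores` and `rrf_fuse`, and the hard-coded "semantic" ranking are all assumptions made for illustration. The BM25 parameters `k1=1.2` and `b=0.75` and the RRF constant `k=60` are commonly used defaults, not values taken from this system.

```python
import math
from collections import Counter


def bm25_scores(query_terms, docs, k1=1.2, b=0.75):
    """Score each tokenised document against the query with Okapi BM25."""
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: number of documents containing each term.
    df = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query_terms:
            if tf[term] == 0:
                continue
            # Smoothed IDF that stays non-negative for very common terms.
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Term-frequency saturation with document-length normalisation.
            norm = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


def rrf_fuse(rankings, k=60):
    """Reciprocal Rank Fusion: sum 1 / (k + rank) over all input rankings."""
    fused = Counter()
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            fused[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by document id.
    return sorted(fused, key=lambda d: (-fused[d], d))


# Hypothetical toy corpus; documents are identified by list position.
docs = [
    "metformin trial in adults with type 2 diabetes".split(),
    "exclusion criteria include renal impairment".split(),
    "insulin therapy for diabetes in adolescents".split(),
]
scores = bm25_scores("diabetes metformin".split(), docs)
bm25_ranking = sorted(range(len(docs)), key=lambda i: -scores[i])

# Stand-in for the ranking a dense re-ranker might return (assumed, not computed).
semantic_ranking = [2, 1, 0]

final_ranking = rrf_fuse([bm25_ranking, semantic_ranking])
print(bm25_ranking, final_ranking)
```

Note that the fusion step uses only rank positions, so the BM25 scores and the embedding similarities never need to be put on a common scale.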
TREC Evaluation
| Metric | BM25 only | BM25 + BioBERT |
|---|---|---|
| MAP | 0.31 | 0.42 |
| NDCG@10 | 0.38 | 0.51 |
| P@10 | 0.34 | 0.47 |
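
For reference, the three metrics in the table can be computed per query roughly as follows. This is a simplified sketch rather than the evaluation tooling used for the reported numbers: the function names and the toy judgments are hypothetical, and the NDCG variant shown (linear gain, log2 discount) is one common formulation among several. MAP is the mean of the per-query average precision over all queries.

```python
import math


def precision_at_k(ranking, relevant, k=10):
    """Fraction of the top-k results that are relevant."""
    return sum(1 for d in ranking[:k] if d in relevant) / k


def average_precision(ranking, relevant):
    """Mean of the precision values at each rank holding a relevant document."""
    hits, total = 0, 0.0
    for rank, d in enumerate(ranking, start=1):
        if d in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant) if relevant else 0.0


def ndcg_at_k(ranking, gains, k=10):
    """NDCG@k with graded gains and a log2 rank discount."""
    def dcg(values):
        # Rank 1 gets discount log2(2) = 1, rank 2 gets log2(3), and so on.
        return sum(g / math.log2(i + 2) for i, g in enumerate(values))

    actual = dcg([gains.get(d, 0) for d in ranking[:k]])
    ideal = dcg(sorted(gains.values(), reverse=True)[:k])
    return actual / ideal if ideal > 0 else 0.0


# Hypothetical judgments for a single query.
ranking = ["d1", "d2", "d3", "d4"]
relevant = {"d1", "d3"}
gains = {"d1": 1, "d3": 1}

print(precision_at_k(ranking, relevant, k=2))
print(average_precision(ranking, relevant))
print(ndcg_at_k(ranking, gains, k=10))
```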
