Clinical Trial Information Retrieval System

NLP-based search engine for biomedical text — Lucene indexing, BM25 ranking, and embedding-based retrieval evaluated with TREC metrics on clinical trial documents.

NLP
Information Retrieval
Apache Lucene
BM25
BioBERT

Clinical trial information retrieval system interface

Problem

Clinical trial registries contain millions of free-text documents describing eligibility criteria, interventions, and outcomes. Clinicians and researchers need to query these efficiently, but plain keyword search fails on domain-specific terminology and on semantic paraphrase, where a query and a relevant document describe the same concept in different words. This system combines classical IR with neural embeddings to bridge that gap.

Retrieval Pipeline

Indexing: Apache Lucene inverted index with custom biomedical text analysers — tokenisation, stemming, stop-word filtering tuned for clinical terminology
Lexical retrieval: BM25 ranking as the first-stage retriever; fast candidate set generation from the Lucene index
Semantic re-ranking: BioBERT fine-tuned for passage relevance scoring; dense embeddings produced by the model and indexed in FAISS for approximate nearest-neighbour search
Hybrid fusion: Reciprocal Rank Fusion combining the BM25 and semantic rankings (by rank position rather than raw score) into the final ranked output
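
The fusion step can be illustrated with a minimal Python sketch of Reciprocal Rank Fusion. It is not the project's actual code: the function name, the document IDs, and the smoothing constant k=60 (a commonly used default) are assumptions made for illustration. Each document receives 1 / (k + rank) from every ranked list it appears in, and the contributions are summed.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several best-first lists of document IDs into one ranking.

    k=60 is an assumed default, not a value taken from the project.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        # Ranks are 1-based: the top document contributes 1 / (k + 1).
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by ID for determinism.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical document IDs standing in for real retrieval output.
bm25_ranking = ["NCT001", "NCT002", "NCT003"]
semantic_ranking = ["NCT003", "NCT001", "NCT004"]

fused = reciprocal_rank_fusion([bm25_ranking, semantic_ranking])
print(fused)
```

Because only rank positions enter the formula, BM25 scores and embedding similarities do not need to be normalised onto a common scale before fusion.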

TREC Evaluation

Metric	BM25 only	BM25 + BioBERT
MAP	0.31	0.42
NDCG@10	0.38	0.51
P@10	0.34	0.47
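
For reference, the three metrics in the table can be computed per query roughly as in the sketch below. It is a simplified illustration, not the evaluation code behind the reported numbers: the function names and toy data are hypothetical, and NDCG uses the linear-gain formulation, which is one of several variants. MAP is the mean of the per-query average precision over all topics.

```python
import math


def precision_at_k(ranked, relevant, k=10):
    """Fraction of the top-k results that are relevant."""
    return sum(1 for doc_id in ranked[:k] if doc_id in relevant) / k


def average_precision(ranked, relevant):
    """Mean of the precision values at each relevant document's rank."""
    if not relevant:
        return 0.0
    hits, total = 0, 0.0
    for rank, doc_id in enumerate(ranked, start=1):
        if doc_id in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant)


def ndcg_at_k(ranked, gains, k=10):
    """NDCG@k with linear gain; `gains` maps doc ID to graded relevance."""
    dcg = sum(
        gains.get(doc_id, 0) / math.log2(rank + 1)
        for rank, doc_id in enumerate(ranked[:k], start=1)
    )
    ideal = sorted(gains.values(), reverse=True)[:k]
    idcg = sum(g / math.log2(rank + 1) for rank, g in enumerate(ideal, start=1))
    return dcg / idcg if idcg > 0 else 0.0


# Toy query: four retrieved documents, two of which are relevant.
ranked = ["d1", "d2", "d3", "d4"]
relevant = {"d1", "d3"}
gains = {"d1": 1, "d3": 1}

ap = average_precision(ranked, relevant)
p_at_2 = precision_at_k(ranked, relevant, k=2)
ndcg = ndcg_at_k(ranked, gains, k=10)
```

In the toy query the relevant documents sit at ranks 1 and 3, so average precision is (1/1 + 2/3) / 2.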

Stack

Java · Apache Lucene · Python · HuggingFace Transformers · BioBERT · FAISS · scikit-learn