Clinical Trial Information Retrieval System

NLP-based search engine for biomedical text — Lucene indexing, BM25 ranking, and embedding-based retrieval evaluated with TREC metrics on clinical trial documents.

  • NLP
  • Information Retrieval
  • Apache Lucene
  • BM25
  • BioBERT

Problem

Clinical trial registries contain millions of free-text documents describing eligibility criteria, interventions, and outcomes. Clinicians and researchers need to query these efficiently, but plain keyword search breaks down on domain-specific terminology and on paraphrases that share meaning without sharing words. This system combines classical IR with neural embeddings to bridge that gap.

Retrieval Pipeline

  • Indexing: Apache Lucene inverted index with custom biomedical text analysers — tokenisation, stemming, stop-word filtering tuned for clinical terminology
  • Lexical retrieval: BM25 ranking as the first-stage retriever; fast candidate set generation from the Lucene index
  • Semantic re-ranking: BioBERT fine-tuned for passage relevance scoring; dense embeddings indexed with FAISS for approximate nearest-neighbour search
  • Hybrid fusion: Reciprocal Rank Fusion combining the BM25 and semantic rankings into the final ranked output
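
The lexical and fusion stages above can be sketched in a few lines of plain Python. This is a minimal illustration, not the project's implementation: the real system scores documents inside Lucene, and the function names (bm25_scores, reciprocal_rank_fusion), the parameters k1=1.2 and b=0.75, the fusion constant k=60, and the toy documents are all assumptions chosen for the example. The neural ranking is stubbed as a hypothetical fixed order.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.2, b=0.75):
    """Score each tokenised document against a tokenised query with BM25.

    k1 and b are assumed values, not settings taken from the project.
    """
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: number of documents containing each term.
    df = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query:
            if term not in tf:
                continue
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Length normalisation: longer-than-average documents are penalised.
            norm = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse ranked lists of doc ids; each list adds 1 / (k + rank) per document."""
    fused = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            fused[doc_id] = fused.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties broken by doc id for determinism.
    return sorted(fused, key=lambda d: (-fused[d], d))


# Toy corpus (invented for illustration only).
docs = [
    "metformin for type 2 diabetes in adults".split(),
    "insulin therapy in paediatric diabetes".split(),
    "statin treatment for cardiovascular risk".split(),
]
scores = bm25_scores("diabetes metformin".split(), docs)
bm25_ranking = sorted(range(len(docs)), key=lambda i: -scores[i])
semantic_ranking = [1, 2, 0]  # hypothetical order from a neural re-ranker
print(reciprocal_rank_fusion([bm25_ranking, semantic_ranking]))
```

Fusing on ranks rather than raw scores avoids having to calibrate BM25 scores against neural similarity scores, which live on different scales.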

TREC Evaluation

Metric     BM25 only    BM25 + BioBERT
MAP        0.31         0.42
NDCG@10    0.38         0.51
P@10       0.34         0.47
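
For readers unfamiliar with these metrics, the sketch below shows how each one is computed for a single query; MAP is then the mean of average precision over all queries. It is an illustration under assumptions: the function names and the toy ranking are invented, and the NDCG variant shown (linear gain, log2 discount) is one common formulation, since the exact evaluation settings behind the reported numbers are not stated.

```python
import math


def precision_at_k(ranked, relevant, k=10):
    """Fraction of the top-k results that are relevant."""
    return sum(1 for d in ranked[:k] if d in relevant) / k


def average_precision(ranked, relevant):
    """Mean of the precision values at each rank where a relevant doc appears."""
    if not relevant:
        return 0.0
    hits, total = 0, 0.0
    for rank, doc_id in enumerate(ranked, start=1):
        if doc_id in relevant:
            hits += 1
            total += hits / rank
    # Dividing by all relevant docs penalises relevant docs never retrieved.
    return total / len(relevant)


def ndcg_at_k(ranked, gains, k=10):
    """DCG of the top-k results divided by the DCG of an ideal ordering."""
    dcg = sum(
        gains.get(doc_id, 0) / math.log2(rank + 1)
        for rank, doc_id in enumerate(ranked[:k], start=1)
    )
    ideal = sorted(gains.values(), reverse=True)[:k]
    idcg = sum(g / math.log2(rank + 1) for rank, g in enumerate(ideal, start=1))
    return dcg / idcg if idcg > 0 else 0.0


# Toy example (invented): two relevant documents, retrieved at ranks 2 and 4.
ranked = ["d3", "d1", "d7", "d2"]
relevant = {"d1", "d2"}
print(average_precision(ranked, relevant))
print(precision_at_k(ranked, relevant, k=2))
print(ndcg_at_k(ranked, {"d1": 1, "d2": 1}))
```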