How We Built a 40-Knowledge-Base System for a Government Regulator
When a government regulatory authority approached us, they had a problem that sounds deceptively simple: 10,000+ documents, 40 separate knowledge bases, and analysts spending hours manually searching for precedents.
The Challenge
Regulatory documents don't come in clean formats. We faced scanned PDFs from the 1990s, handwritten annotations, tables embedded in images, and documents in multiple languages — all requiring identical treatment.
- 10,000+ source documents, 1M+ pages total
- 40 independent knowledge bases with separate access rules
- Sub-3-second query requirement for all search operations
- Zero tolerance for hallucinations — answers must cite exact source paragraphs
Our Architecture
Ingestion layer: Mistral OCR handled the hardest part — converting scanned documents into structured text with 98.7% accuracy, preserving tables and layout context that other OCR solutions lost.
Indexing: We used a dual-index approach: Typesense for keyword search (fast, deterministic) and Weaviate for semantic vector search. Results from both are merged and re-ranked by a custom scoring function that weights recency, authority, and query relevance.
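To make the merge-and-re-rank step concrete, here is a minimal sketch of how such a scoring function might look. The `Candidate` fields, the weights, and the ten-year recency half-life are illustrative assumptions, not the values used in production, and the relevance scores are assumed to be already normalized to the 0..1 range by each engine.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Candidate:
    doc_id: str
    paragraph: int
    relevance: float   # assumed normalized 0..1 score from the search engine
    authority: float   # hypothetical 0..1 weight for the issuing source
    published: date


def recency_score(published: date, today: date, half_life_days: float = 3650.0) -> float:
    """Exponential decay: 1.0 for a document published today, 0.5 after one half-life."""
    age_days = max((today - published).days, 0)
    return 0.5 ** (age_days / half_life_days)


def merge_and_rerank(keyword_hits, vector_hits, today, weights=(0.6, 0.25, 0.15)):
    """Merge two hit lists, dedupe by (doc_id, paragraph), sort by weighted score."""
    w_rel, w_auth, w_rec = weights
    best = {}
    for hit in list(keyword_hits) + list(vector_hits):
        key = (hit.doc_id, hit.paragraph)
        # When both indices return the same paragraph, keep the higher relevance.
        if key not in best or hit.relevance > best[key].relevance:
            best[key] = hit

    def score(c: Candidate) -> float:
        return (w_rel * c.relevance
                + w_auth * c.authority
                + w_rec * recency_score(c.published, today))

    return sorted(best.values(), key=score, reverse=True)
```

A paragraph found by both indices appears once in the output, so the LLM never sees the same passage twice.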
Retrieval: Each query hits both indices simultaneously. The top 20 candidates from each are merged, re-ranked, and passed to the LLM with strict citation requirements: the model must reference the exact document and paragraph, or return no answer.
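The "cite or return nothing" rule can be enforced mechanically after generation. The sketch below assumes a hypothetical citation format, `[doc:<id> para:<n>]`, that the model is instructed to emit. The format and the `validate_answer` name are illustrative, not the production ones.

```python
import re
from typing import Optional

# Hypothetical citation format: [doc:<id> para:<n>]
CITATION_RE = re.compile(r"\[doc:(?P<doc>[\w-]+) para:(?P<para>\d+)\]")


def validate_answer(answer: str, retrieved: set[tuple[str, int]]) -> Optional[str]:
    """Return the answer only if it cites at least one passage and every
    citation points at a passage that was actually retrieved; otherwise None."""
    citations = [(m["doc"], int(m["para"])) for m in CITATION_RE.finditer(answer)]
    if not citations:
        return None  # uncited answers are rejected outright
    if any(c not in retrieved for c in citations):
        return None  # cites a passage that was never in the retrieved context
    return answer
```

Checking citations against the retrieved set, rather than against the whole corpus, also catches answers that cite a real document the model was never shown.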
Results
After 3 months in production:
- Average query time: 1.8 seconds (vs. a 45-minute manual search)
- Answer accuracy (human-verified): 94.2%
- Hallucination rate: 0.3%, all caught by the citation validation layer
- Analyst time saved: 6 hours/day per analyst (team of 40)
What We Learned
The hardest problem wasn't the AI — it was data governance. Different knowledge bases had different access rules, and analysts could only query bases they were authorized for. We built a permission layer that sits between the query router and the index, enforcing access at retrieval time rather than at display time.
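Enforcing access at retrieval time means an unauthorized knowledge base is never queried at all, so its content cannot leak into ranking or the LLM context. A minimal sketch of the idea follows. The `Analyst` type, the `search_index` callable, and the knowledge-base names are hypothetical stand-ins, not the real interfaces.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Analyst:
    name: str
    allowed_bases: frozenset  # knowledge bases this analyst may query


class PermissionDenied(Exception):
    """Raised when none of the requested knowledge bases are accessible."""


def authorized_targets(analyst, requested_bases):
    """Filter the requested bases down to those the analyst is authorized for."""
    allowed = [kb for kb in requested_bases if kb in analyst.allowed_bases]
    if not allowed:
        raise PermissionDenied(f"{analyst.name} has no access to {list(requested_bases)}")
    return allowed


def search(analyst, query, requested_bases, search_index):
    """Run the query against authorized bases only.

    `search_index(kb, query)` is a hypothetical callable wrapping the index lookup;
    it is never invoked for a base the analyst cannot access.
    """
    results = []
    for kb in authorized_targets(analyst, requested_bases):
        results.extend(search_index(kb, query))
    return results
```

Filtering at display time would instead retrieve everything and hide results afterwards, which leaves restricted passages free to influence the generated answer.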
The second hardest problem: convincing the client that a 94% accurate system was ready for production. The answer was transparency — every answer shows its source, confidence level, and what it couldn't find. Analysts learned to trust the tool because they could verify every claim.
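As a rough illustration of that transparency, an answer can be modeled as a record that always carries its sources, a confidence value, and the parts of the question it could not support. The field names and rendering below are assumptions made for this sketch, not the production schema.

```python
from dataclasses import dataclass, field


@dataclass
class Answer:
    text: str
    sources: list       # (document id, paragraph number) pairs
    confidence: float   # 0..1, however the system chooses to estimate it
    not_found: list = field(default_factory=list)  # sub-questions with no supporting passage

    def render(self) -> str:
        """Format the answer so every claim can be traced and gaps are visible."""
        lines = [self.text, f"Confidence: {self.confidence:.0%}"]
        lines += [f"Source: {doc}, paragraph {para}" for doc, para in self.sources]
        lines += [f"Not found: {item}" for item in self.not_found]
        return "\n".join(lines)
```

Showing the "not found" items alongside the answer lets an analyst tell a confident partial answer from a complete one.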