Case Study · May 25, 2026

How We Built a 40-Knowledge-Base System for a Government Regulator

When a government regulatory authority approached us, they had a problem that sounds deceptively simple: 10,000+ documents, 40 separate knowledge bases, and analysts spending hours manually searching for precedents.

The Challenge

Regulatory documents don't come in clean formats. We faced scanned PDFs from the 1990s, handwritten annotations, tables embedded in images, and documents in multiple languages, all of which had to pass through the same ingestion pipeline.

10,000+ source documents, 1M+ pages total
40 independent knowledge bases with separate access rules
Sub-3-second query requirement for all search operations
Zero tolerance for hallucinations — answers must cite exact source paragraphs

Our Architecture

Ingestion layer: Mistral OCR handled the hardest part of ingestion: converting scanned documents into structured text with 98.7% accuracy, preserving tables and layout context that other OCR solutions lost.

Indexing: We used a dual-index approach. Typesense handled BM25 keyword search (fast, deterministic), and Weaviate handled semantic vector search. Results from both are merged and re-ranked by a custom scoring function that weights recency, authority, and query relevance.
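The merge-and-re-rank step can be sketched roughly as follows. This is a minimal illustration, not the production scoring function: the `Candidate` fields, the weights, the exponential recency decay, and the ten-year half-life are all assumptions made for the example, and relevance scores are assumed to be normalized to [0, 1] before they reach this code.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Candidate:
    doc_id: str
    paragraph: int
    relevance: float  # assumed already normalized to [0, 1] per index
    authority: float  # hypothetical [0, 1] score, e.g. by issuing body
    published: date


# Hypothetical weights; the case study does not state the real values.
WEIGHTS = {"relevance": 0.6, "authority": 0.25, "recency": 0.15}


def recency_score(published: date, today: date, half_life_days: float = 3650.0) -> float:
    """Exponential decay: 1.0 if published today, 0.5 after one half-life."""
    age_days = max((today - published).days, 0)
    return 0.5 ** (age_days / half_life_days)


def score(c: Candidate, today: date) -> float:
    """Weighted sum of relevance, authority, and recency."""
    return (
        WEIGHTS["relevance"] * c.relevance
        + WEIGHTS["authority"] * c.authority
        + WEIGHTS["recency"] * recency_score(c.published, today)
    )


def merge_and_rerank(keyword_hits, vector_hits, today, limit=20):
    """Deduplicate by (doc_id, paragraph), keep the higher relevance, sort by score."""
    best = {}
    for c in list(keyword_hits) + list(vector_hits):
        key = (c.doc_id, c.paragraph)
        if key not in best or c.relevance > best[key].relevance:
            best[key] = c
    ranked = sorted(best.values(), key=lambda c: score(c, today), reverse=True)
    return ranked[:limit]
```

Deduplicating on the document and paragraph pair means a passage found by both indices is scored once, using whichever index rated it more relevant.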

Retrieval: Each query hits both indices simultaneously. The top-20 candidates from each are merged, re-ranked, and fed to the LLM with strict citation requirements: the model must reference the exact document and paragraph, or return no answer.
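One way to query both indices at the same time is to run the two searches concurrently under a shared deadline. The sketch below uses `asyncio` with placeholder coroutines; `search_keyword`, `search_vector`, and the returned ID strings are hypothetical stand-ins for the real index clients, and applying the 3-second budget at this point is an assumption.

```python
import asyncio


async def search_keyword(query: str, limit: int) -> list[str]:
    # Placeholder for the keyword-index call; returns hypothetical paragraph IDs.
    await asyncio.sleep(0.01)
    return [f"kw-{i}" for i in range(limit)]


async def search_vector(query: str, limit: int) -> list[str]:
    # Placeholder for the vector-index call; returns hypothetical paragraph IDs.
    await asyncio.sleep(0.01)
    return [f"vec-{i}" for i in range(limit)]


async def retrieve(query: str, per_index: int = 20, timeout_s: float = 3.0):
    """Run both searches concurrently; raise asyncio.TimeoutError past the deadline."""
    keyword_hits, vector_hits = await asyncio.wait_for(
        asyncio.gather(
            search_keyword(query, per_index),
            search_vector(query, per_index),
        ),
        timeout=timeout_s,
    )
    return keyword_hits, vector_hits


if __name__ == "__main__":
    kw, vec = asyncio.run(retrieve("precedent for licence revocation"))
    print(len(kw), len(vec))
```

With this shape, total retrieval latency is bounded by the slower of the two searches rather than by their sum.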

Results

After 3 months in production:

Average query time: 1.8 seconds (vs 45-minute manual search)
Answer accuracy (human-verified): 94.2%
Hallucination rate: 0.3%, all caught by the citation validation layer
Analyst time saved: 6 hours/day per analyst (team of 40)
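A citation validation layer of the kind mentioned above might look like the following sketch. It assumes the model returns each citation as a document ID, a paragraph number, and a verbatim quote, and that the retrieved paragraphs are available as a dictionary keyed by `(doc_id, paragraph)`. These structures are hypothetical; the case study does not describe the real ones.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Citation:
    doc_id: str
    paragraph: int
    quote: str  # text the model claims appears in that paragraph


@dataclass(frozen=True)
class Answer:
    text: str
    citations: tuple


def _normalize(s: str) -> str:
    """Lowercase and collapse whitespace so OCR spacing does not cause false rejects."""
    return " ".join(s.lower().split())


def validate(answer: Answer, retrieved: dict) -> Optional[Answer]:
    """Return the answer only if every citation is backed by retrieved text.

    Any uncited answer, empty quote, unknown paragraph, or quote that is not
    found in its paragraph causes the whole answer to be rejected (None).
    """
    if not answer.citations:
        return None
    for c in answer.citations:
        source = retrieved.get((c.doc_id, c.paragraph))
        if source is None or not c.quote.strip():
            return None
        if _normalize(c.quote) not in _normalize(source):
            return None
    return answer
```

Rejecting the whole answer when any single citation fails is a deliberately strict choice that matches the "cite or return no answer" requirement.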

What We Learned

The hardest problem wasn't the AI. It was data governance. Different knowledge bases had different access rules, and analysts could only query bases they were authorized for. We built a permission layer that sits between the query router and the index, enforcing access at retrieval time rather than at display time, so unauthorized content never reaches the LLM or the result set.
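Enforcing access at retrieval time can be sketched as a filter applied before any index call is made. The names `route_query`, `AccessDenied`, and the `kb-01` style identifiers are hypothetical, and the `search` argument stands in for whatever function actually queries the indices.

```python
class AccessDenied(Exception):
    """Raised when the user is not authorized for any requested knowledge base."""


def route_query(query, requested_bases, grants, search):
    """Restrict the search to knowledge bases the user is granted.

    Unauthorized bases are dropped before `search` is called, so their
    documents are never retrieved, ranked, or passed to the model.
    """
    allowed = [kb for kb in requested_bases if kb in grants]
    if not allowed:
        raise AccessDenied("no authorized knowledge base in request")
    return search(query, allowed)


if __name__ == "__main__":
    def fake_search(query, bases):
        # Stand-in for the real index call; echoes which bases were queried.
        return [f"{kb}:hit" for kb in bases]

    grants = frozenset({"kb-02", "kb-07"})
    print(route_query("licence revocation", ["kb-01", "kb-02"], grants, fake_search))
```

Filtering at display time would instead retrieve everything and hide results afterwards, which leaves restricted text in the candidate set that feeds the model.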

The second hardest problem: convincing the client that a 94% accurate system was ready for production. The answer was transparency. Every answer shows its source, confidence level, and what it couldn't find. Analysts learned to trust the tool because they could verify every claim.

