Ministry of Statistics and Programme Implementation (MoSPI), Government of India
Government

AI-enabled Intelligent Search Solutions for Documents

A RAG-powered semantic search and Q&A system for MoSPI's document archive. It accepts natural language queries by voice or text, supports Hindi, English, and Kannada, and backs each answer with deep citations that link directly to the source PDF page.

RAG, LLM, Government, Document AI, Vector Search

Impact Metrics

Extraction Accuracy: >80%. Docling preserves tabular data from complex PDFs with high fidelity.
Search Relevance: >90%. Hybrid chunking + BGE Large embeddings deliver highly relevant results.
Query Latency: 5-10 sec. Prioritizes accuracy over speed with 120B model validation.
Languages Supported: 3. Hindi, English, and Kannada for input and output.
Documents Indexed: 10,000+. Statistical reports, surveys, and policy papers.
Scalability: Full Scale. Dockerized architecture with Qdrant handling millions of vectors.
The Challenge

Vast Document Archives Locked Behind Manual Search

MoSPI relied on manual retrieval of information from a large volume of PDFs and images. Locating a specific data point buried across thousands of statistical reports, surveys, and policy papers often required specialized knowledge.

Standard OCR tools failed to preserve the structure of statistical tables, leaving numeric data unusable for analysis. Complex layouts with merged cells, multi-column formats, and Hindi text broke during extraction, producing garbled output that undermined analyst confidence.

There was no cross-document synthesis capability. Each document had to be read independently, which made it impossible to answer questions that required connecting data across multiple reports or time periods.

Key Pain Points

Hours spent manually searching document archives for single data points
Standard OCR tools failed to preserve statistical table structure
No cross-document synthesis: each document had to be read independently
Multi-language documents (Hindi, English, Kannada) unsupported by existing tools
No audit trail for which source documents backed a given data claim
The Solution

A Secure, Self-Hosted RAG Pipeline Tailored for Government Documents

We built an ingestion pipeline on Docling (open source, from IBM) rather than regex-based parsing or PyPDF, preserving the structural integrity of the tables and layouts that statistical reports depend on. The pipeline uses hybrid chunking with overlap between adjacent chunks for context continuity.
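To illustrate the chunk-overlap idea only, here is a minimal sketch in plain Python. It is not Docling's chunker and does not reflect the project's actual settings: the function name `chunk_with_overlap`, the word-level windows, and the `size` and `overlap` values are all assumptions chosen for the example.

```python
from typing import List


def chunk_with_overlap(words: List[str], size: int, overlap: int) -> List[List[str]]:
    """Split a word list into windows of `size` words that share `overlap` words."""
    if size <= 0:
        raise ValueError("size must be positive")
    if not 0 <= overlap < size:
        raise ValueError("overlap must satisfy 0 <= overlap < size")
    step = size - overlap  # how far each new window advances
    chunks: List[List[str]] = []
    for start in range(0, len(words), step):
        chunks.append(words[start:start + size])
        if start + size >= len(words):
            break  # this window already reaches the end of the text
    return chunks


# Hypothetical sentence, not taken from a MoSPI document.
text = "Table 4 reports quarterly GDP growth for each state in percent"
chunks = chunk_with_overlap(text.split(), size=5, overlap=2)
for chunk in chunks:
    print(" ".join(chunk))
```

Each window repeats the last two words of the previous one, so a sentence that straddles a chunk boundary still appears with some of its context in both chunks. A production pipeline would typically count tokens rather than words and respect table and section boundaries.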

Retrieval is vector-based: a Qdrant vector store with BGE Large embeddings captures semantic nuance and returns highly relevant search results. The system accepts both text and voice queries.
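The core of vector retrieval can be sketched without any external service. The example below is a toy stand-in, not the Qdrant client or the BGE Large model: the three-dimensional vectors, the chunk IDs, and the helper names `cosine` and `top_k` are invented for illustration. Real embeddings have far more dimensions and the similarity search runs inside the vector store.

```python
import math
from typing import Dict, List, Tuple


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vec: List[float], index: Dict[str, List[float]], k: int = 2) -> List[Tuple[str, float]]:
    """Return the k chunk IDs whose vectors are most similar to the query vector."""
    scored = [(chunk_id, cosine(query_vec, vec)) for chunk_id, vec in index.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


# Hypothetical chunk IDs and hand-made vectors standing in for real embeddings.
toy_index = {
    "cpi_report_p12": [0.9, 0.1, 0.0],
    "labour_survey_p3": [0.1, 0.9, 0.2],
    "gdp_release_p7": [0.7, 0.2, 0.4],
}
results = top_k([1.0, 0.0, 0.1], toy_index, k=2)
for chunk_id, score in results:
    print(chunk_id, round(score, 3))
```

The query vector points mostly along the first dimension, so the two chunks with the largest first component rank highest. Keeping the page number in the chunk ID (or in stored metadata) is what later makes page-level citations possible.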

Generation is handled by an OSS 120B LLM (comparable to OpenAI o3) running securely on self-hosted infrastructure. Every response includes deep citations that open the specific PDF page in a new tab. The system evolved through three phases: an initial Beta Build (Llama Scout + recursive splitting), a Refinement phase (120B model + Docling + hybrid chunking), and the Knowledge Base pivot (documents are pre-indexed so users can chat without waiting for ingestion).
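One common way to build a page-level citation link is the `#page=N` URL fragment, which most browser PDF viewers honor. The sketch below assumes each retrieved chunk carries its file name and 1-based page number; the `Citation` class, the `citation_url` helper, and the `docs.example.org` base URL are hypothetical and do not describe the deployed system.

```python
from dataclasses import dataclass
from urllib.parse import quote


@dataclass(frozen=True)
class Citation:
    file_name: str  # PDF file name as stored on the (hypothetical) document server
    page: int       # 1-based page number the cited chunk was extracted from


def citation_url(base_url: str, citation: Citation) -> str:
    """Build a link that opens the PDF at the cited page in viewers that support #page=N."""
    if citation.page < 1:
        raise ValueError("page numbers are 1-based")
    return f"{base_url.rstrip('/')}/{quote(citation.file_name)}#page={citation.page}"


url = citation_url("https://docs.example.org/pdfs/", Citation("Annual Report 2023.pdf", 42))
print(url)  # https://docs.example.org/pdfs/Annual%20Report%202023.pdf#page=42
```

Opening the link in a new tab is then a frontend concern, for example an anchor element with `target="_blank"`.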

Our Approach

1. Docling-powered document ingestion with OCR, table extraction, and layout preservation
2. Hybrid chunking strategy with chunk overlap for context continuity
3. Qdrant vector store with BGE Large embeddings for semantic search
4. Self-hosted OSS 120B LLM for secure, accurate response generation
5. Knowledge Base architecture: admins pre-process and index documents, users chat instantly
6. Voice and text input with multilingual support (Hindi, English, Kannada)
7. Deep citations linking directly to source PDF pages
8. MoSPI branding, admin panels, and role-based access control
9. Ministry SMTP integration with whitelisted IP for secure email delivery
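The split between admin-side indexing and user-side chat, combined with role-based access control, can be sketched as follows. This is an assumption-laden illustration: the `KnowledgeBase` class, the role names `admin` and `user`, and the keyword match (standing in for semantic search) are all hypothetical.

```python
from typing import Dict, List


class KnowledgeBase:
    """Toy knowledge base: admins index ahead of time, any known role can search."""

    def __init__(self) -> None:
        self._chunks: Dict[str, List[str]] = {}

    def index_document(self, role: str, doc_id: str, chunks: List[str]) -> None:
        # Indexing is the slow, admin-only step, done before users start chatting.
        if role != "admin":
            raise PermissionError("only admins may index documents")
        self._chunks[doc_id] = list(chunks)

    def search(self, role: str, term: str) -> List[str]:
        # Keyword match stands in for the semantic retrieval step.
        if role not in {"admin", "user"}:
            raise PermissionError(f"unknown role: {role}")
        needle = term.lower()
        return [
            doc_id
            for doc_id, chunks in self._chunks.items()
            if any(needle in chunk.lower() for chunk in chunks)
        ]


kb = KnowledgeBase()
kb.index_document("admin", "plfs_2023", ["Unemployment rate by state"])
hits = kb.search("user", "unemployment")
print(hits)  # ['plfs_2023']
```

Because documents are processed once at indexing time, a user query only pays for retrieval and generation, not for OCR or chunking.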

Key Features Delivered

Semantic Q&A with voice and text input
Docling-powered OCR preserving table structure and layouts
Multilingual support for Hindi, English, and Kannada
Deep citations linking directly to specific PDF pages
Knowledge Base: admins pre-index documents so users can chat without an ingestion wait
Role-based access control with MoSPI branding
LLM-powered image indexing — charts and images made searchable via picture descriptions
Query history and admin analytics dashboard
Technology Stack

Built With

SvelteKit (Frontend), Tailwind CSS + shadcn/ui, FastAPI (Backend), Socket.IO (Real-time), Qdrant (Vector DB), PostgreSQL, Docling (OCR), BGE Large (Embeddings), OSS 120B LLM (Inference), LangChain, Ollama, Docker + Nginx, Azure Cloud, Redis
Results

Outcomes Achieved

The system changed how MoSPI analysts work with India's national statistical archive: research that took hours of manual search now takes seconds of natural language querying, with full traceability through deep PDF page citations and support for Hindi, English, and Kannada.

Extraction Accuracy: >80%
Search Relevance: >90%
Query Latency: 5-10 sec
Languages Supported: 3
Documents Indexed: 10,000+
Scalability: Full Scale
