Ministry of Statistics and Programme Implementation (MoSPI), Government of India

Government

AI-enabled Intelligent Search Solutions for Documents

RAG-powered semantic search and Q&A system for MoSPI's vast document archive — enabling natural language querying with voice and text input, multilingual support (Hindi, English, Kannada), and deep citations linking directly to source PDF pages.

RAG, LLM, Government, Document AI, Vector Search

Impact Metrics

>80%

Extraction Accuracy

Docling preserves tabular data from complex PDFs with high fidelity

>90%

Search Relevance

Hybrid chunking + BGE Large embeddings deliver highly relevant results

5-10 sec

Query Latency

Prioritizes accuracy over speed with 120B model validation

3

Languages Supported

Hindi, English, and Kannada for input and output

10,000+

Documents Indexed

Statistical reports, surveys, and policy papers

Full Scale

Scalability

Dockerized architecture with Qdrant handling millions of vectors

The Challenge

Vast Document Archives Locked Behind Manual Search

MoSPI relied on manually retrieving information from a vast volume of PDFs and images. Locating specific data points buried across thousands of statistical reports, surveys, and policy papers often required specialized knowledge.

Standard OCR tools failed to preserve the structure of statistical tables, rendering numeric data unusable for analysis. Complex layouts with merged cells, multi-column formats, and Hindi text broke during extraction, producing garbage data that undermined analyst confidence.

There was no cross-document synthesis capability — each document had to be read independently, making it impossible to answer questions that required connecting data across multiple reports or time periods.

Key Pain Points

Hours spent manually searching document archives for single data points

Standard OCR tools fail to preserve statistical table structure

No cross-document synthesis — each document had to be read independently

Multi-language documents (Hindi, English, Kannada) unsupported by existing tools

No audit trail for which source documents backed a given data claim

The Solution

A Secure, Self-Hosted RAG Pipeline Tailored for Government Documents

We built an intelligent ingestion pipeline on Docling (an open-source toolkit from IBM) rather than regex-based parsing or PyPDF, preserving the structural integrity of the tables and layouts that statistical reports depend on. Documents are split with a hybrid chunking strategy, and adjacent chunks overlap to maintain context continuity.
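To make the idea of chunk overlap concrete, here is a minimal sketch in plain Python. It is not Docling's hybrid chunker: it is a simple sliding window over words, and the function name and the `max_words` and `overlap` parameters are assumptions chosen for illustration.

```python
from typing import List


def chunk_with_overlap(text: str, max_words: int = 200, overlap: int = 40) -> List[str]:
    """Split text into word windows; each window repeats the last
    `overlap` words of the previous one (illustrative values only)."""
    if not 0 <= overlap < max_words:
        raise ValueError("overlap must be non-negative and smaller than max_words")
    words = text.split()
    step = max_words - overlap  # how far the window advances each time
    chunks: List[str] = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_words]))
        if start + max_words >= len(words):
            break  # this window already reached the end of the text
    return chunks


sample = " ".join(f"w{i}" for i in range(25))
demo_chunks = chunk_with_overlap(sample, max_words=10, overlap=3)
```

The repeated words mean a sentence cut at a chunk boundary still appears whole in at least one chunk, at the cost of indexing some text twice.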

Semantic retrieval runs on a Qdrant vector store with BGE Large embeddings, which capture semantic nuance and return highly relevant search results. Queries can be entered as text or by voice.
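The retrieval step can be sketched as a nearest-neighbour lookup by cosine similarity. The example below is an in-memory stand-in written for illustration: it does not call Qdrant or BGE, the two-dimensional vectors are toy values, and the names `cosine` and `top_k` are hypothetical.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat a zero vector as matching nothing
    return dot / (norm_a * norm_b)


def top_k(query: Sequence[float], index: Dict[str, List[float]], k: int = 3) -> List[Tuple[str, float]]:
    """Return the k chunk ids whose vectors are most similar to the query."""
    scored = [(chunk_id, cosine(query, vec)) for chunk_id, vec in index.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy index: in a real deployment the vectors would come from the embedding model.
toy_index = {"chunk-a": [1.0, 0.0], "chunk-b": [0.0, 1.0], "chunk-c": [1.0, 1.0]}
hits = top_k([1.0, 0.1], toy_index, k=2)
```

A vector database performs the same ranking with approximate indexes so that it stays fast across millions of vectors.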

Generation is handled by an open-source 120B-parameter LLM (comparable to OpenAI o3) running securely on self-hosted infrastructure. Every response includes deep citations that open the specific source PDF page in a new tab. The system evolved through three phases: an initial Beta Build (Llama Scout with recursive splitting), a Refinement phase (the 120B model, Docling, and hybrid chunking), and the Knowledge Base pivot (documents are pre-indexed, so users can start chatting without waiting for ingestion).
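One common way to deep-link to a PDF page is the `#page=N` URL fragment, which most browser PDF viewers honour. The sketch below assumes that convention; whether the delivered system builds its links this way is not stated, and the function names and the `example.org` URL are placeholders.

```python
from typing import List, Tuple
from urllib.parse import urldefrag


def citation_url(pdf_url: str, page: int) -> str:
    """Build a link that opens `pdf_url` at a 1-based page number."""
    if page < 1:
        raise ValueError("PDF pages are numbered from 1")
    base, _ = urldefrag(pdf_url)  # drop any fragment already on the URL
    return f"{base}#page={page}"


def format_answer(answer: str, sources: List[Tuple[str, str, int]]) -> str:
    """Append a numbered source list (title, url, page) to an answer."""
    lines = [answer, "", "Sources:"]
    for number, (title, url, page) in enumerate(sources, start=1):
        lines.append(f"[{number}] {title}, p. {page}: {citation_url(url, page)}")
    return "\n".join(lines)


# Placeholder answer and source, for illustration only.
rendered = format_answer(
    "Example answer text.",
    [("Survey", "https://example.org/reports/survey.pdf", 12)],
)
```

Keeping the page number alongside each retrieved chunk at indexing time is what makes this kind of citation possible at answer time.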

Our Approach

Docling-powered document ingestion with OCR, table extraction, and layout preservation

Hybrid chunking strategy with chunk overlap for context continuity

Qdrant vector store with BGE Large embeddings for semantic search

Self-hosted OSS 120B LLM for secure, accurate response generation

Knowledge Base architecture — admins pre-process and index documents, users chat instantly

Voice and text input with multilingual support (Hindi, English, Kannada)

Deep citations linking directly to source PDF pages

MoSPI branding, admin panels, and role-based access control

Ministry SMTP integration with whitelisted IP for secure email delivery
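The split between admins who index and users who chat can be illustrated with a small role check. This is a sketch under assumptions: the role names, the permission table, and the `KnowledgeBase` class are hypothetical, and a substring match stands in for the real retrieval pipeline.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Dict, List


class Role(Enum):
    ADMIN = "admin"
    USER = "user"


# Hypothetical permission table: only admins may add documents.
PERMISSIONS = {Role.ADMIN: {"index", "query"}, Role.USER: {"query"}}


def require(role: Role, action: str) -> None:
    """Raise PermissionError if the role is not allowed to perform the action."""
    if action not in PERMISSIONS[role]:
        raise PermissionError(f"{role.value} may not {action}")


@dataclass
class KnowledgeBase:
    documents: Dict[str, str] = field(default_factory=dict)

    def index(self, role: Role, doc_id: str, text: str) -> None:
        require(role, "index")
        # Lowercasing stands in for the real pre-processing done ahead of time.
        self.documents[doc_id] = text.lower()

    def query(self, role: Role, term: str) -> List[str]:
        require(role, "query")
        return sorted(d for d, t in self.documents.items() if term.lower() in t)


kb = KnowledgeBase()
kb.index(Role.ADMIN, "report-1", "Quarterly Survey Results")
found = kb.query(Role.USER, "survey")
```

Because the expensive processing happens inside `index`, a user query only touches data that is already prepared.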

Key Features Delivered

Semantic Q&A with voice and text input

Docling-powered OCR preserving table structure and layouts

Multilingual support for Hindi, English, and Kannada

Deep citations linking directly to specific PDF pages

Knowledge Base: admins pre-index documents so users can chat without waiting for ingestion

Role-based access control with MoSPI branding

LLM-powered image indexing — charts and images made searchable via picture descriptions

Query history and admin analytics dashboard
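For the multilingual feature, one lightweight way to decide how to route a typed query is to check which Unicode script dominates it. The sketch below is an assumption about how such routing could work, not a description of the delivered system. It identifies the script rather than the language, since Devanagari is shared by Hindi and several other languages.

```python
from typing import Dict

# Unicode block ranges for the two Indic scripts relevant here.
DEVANAGARI = range(0x0900, 0x0980)
KANNADA = range(0x0C80, 0x0D00)


def dominant_script(text: str) -> str:
    """Return 'devanagari', 'kannada', 'latin', or 'unknown' for a query string."""
    counts: Dict[str, int] = {"devanagari": 0, "kannada": 0, "latin": 0}
    for ch in text:
        code = ord(ch)
        if code in DEVANAGARI:
            counts["devanagari"] += 1
        elif code in KANNADA:
            counts["kannada"] += 1
        elif ch.isascii() and ch.isalpha():
            counts["latin"] += 1
    best = max(counts, key=counts.get)
    return best if counts[best] > 0 else "unknown"


script_hi = dominant_script("नमस्ते")
script_kn = dominant_script("ನಮಸ್ಕಾರ")
script_en = dominant_script("hello")
```

Voice input would need a speech-to-text step first; this check applies only once the query exists as text.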

Technology Stack

Built With

SvelteKit (Frontend)

Tailwind CSS + shadcn/ui

FastAPI (Backend)

Socket.IO (Real-time)

Qdrant (Vector DB)

PostgreSQL

Docling (OCR)

BGE Large (Embeddings)

OSS 120B LLM (Inference)

LangChain

Ollama

Docker + Nginx

Azure Cloud

Redis

Results

Outcomes Achieved

The system transformed how MoSPI analysts interact with India's national statistical archive — reducing research from hours of manual search to seconds of natural language querying, with complete traceability via deep PDF page citations and support for Hindi, English, and Kannada.
