AI-enabled Intelligent Search Solutions for Documents
RAG-powered semantic search and Q&A system for MoSPI's vast document archive. It enables natural language querying by voice or text, supports Hindi, English, and Kannada, and provides deep citations that link directly to source PDF pages.
Impact Metrics
Vast Document Archives Locked Behind Manual Search
MoSPI (the Ministry of Statistics and Programme Implementation) relied on manual retrieval of information from a vast volume of PDFs and images. Locating specific data points buried across thousands of statistical reports, surveys, and policy papers often required specialized knowledge.
Standard OCR tools failed to preserve the structure of statistical tables, rendering numeric data unusable for analysis. Complex layouts with merged cells, multi-column formats, and Hindi text broke during extraction, producing garbage data that undermined analyst confidence.
There was no cross-document synthesis capability: each document had to be read independently, which made it impractical to answer questions that required connecting data across multiple reports or time periods.
Key Pain Points
A Secure, Self-Hosted RAG Pipeline Tailored for Government Documents
We built an intelligent ingestion pipeline on Docling (an open-source toolkit from IBM) rather than standard regex or PyPDF extraction, which preserves the structural integrity of the tables and layouts that statistical reports depend on. The pipeline uses hybrid chunking with chunk overlap to keep context continuous across chunk boundaries.
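To make the idea of chunk overlap concrete, here is a minimal sketch in plain Python. It is not Docling's chunker and not the production pipeline: it assumes whitespace-separated words stand in for tokens, and the function name chunk_with_overlap and its parameters are hypothetical, chosen only for illustration.

```python
from typing import List


def chunk_with_overlap(words: List[str], chunk_size: int = 8, overlap: int = 2) -> List[List[str]]:
    """Split a word list into fixed-size chunks that share `overlap` words.

    Illustrative only: a real pipeline would chunk on model tokens and
    document structure, not on whitespace-separated words.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")

    step = chunk_size - overlap  # how far the window advances each time
    chunks: List[List[str]] = []
    for start in range(0, len(words), step):
        chunks.append(words[start:start + chunk_size])
        # Stop once the current window has reached the end of the text,
        # so no trailing chunk made only of repeated words is emitted.
        if start + chunk_size >= len(words):
            break
    return chunks


if __name__ == "__main__":
    text = "a b c d e f g h i j".split()
    for chunk in chunk_with_overlap(text, chunk_size=4, overlap=1):
        print(" ".join(chunk))
```

Each chunk repeats the last words of the previous one, so a sentence or table row that straddles a boundary still appears whole in at least one chunk.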
Retrieval is vector-based and semantic: a Qdrant vector store holds BGE Large embeddings, which capture semantic nuance and return highly relevant search results. The system accepts both text and voice queries.
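The retrieval step can be sketched as a nearest-neighbour search over embedding vectors. The example below is a simplified stand-in under stated assumptions: an in-memory list replaces the Qdrant vector store, hand-written three-dimensional vectors replace BGE Large embeddings, and the names cosine and top_k, along with the document names and page numbers in the sample index, are hypothetical.

```python
import math
from typing import Dict, List, Tuple


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat zero vectors as matching nothing
    return dot / (norm_a * norm_b)


def top_k(query_vec: List[float], index: List[Dict], k: int = 2) -> List[Tuple[float, Dict]]:
    """Return the k index entries most similar to the query vector."""
    scored = [(cosine(query_vec, item["vector"]), item) for item in index]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


if __name__ == "__main__":
    # Toy index: each entry keeps the chunk's vector plus the source
    # document and page, so a hit can later be cited.
    index = [
        {"doc": "report_a.pdf", "page": 4, "vector": [0.9, 0.1, 0.0]},
        {"doc": "report_b.pdf", "page": 17, "vector": [0.0, 1.0, 0.0]},
        {"doc": "report_c.pdf", "page": 2, "vector": [0.5, 0.5, 0.0]},
    ]
    for score, item in top_k([1.0, 0.0, 0.0], index, k=2):
        print(f"{item['doc']} p.{item['page']} score={score:.3f}")
```

Storing the document and page alongside each vector is what lets a retrieved chunk be traced back to its source.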
Generation is powered by an OSS 120B LLM (comparable to OpenAI o3) running securely on self-hosted infrastructure. Every response includes deep citations that open the specific PDF page in a new tab. The system evolved through three phases: an initial Beta Build (Llama Scout with recursive splitting), a Refinement phase (120B model, Docling, and hybrid chunking), and a Knowledge Base pivot (pre-indexing documents ahead of time for zero-latency chat).
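One common way to produce a deep citation of this kind is the "#page=N" URL fragment, which most browser PDF viewers honour. The sketch below assumes that convention; whether the deployed system uses it is not stated here. The helper names build_citation_url and render_citation_link and the example.org URL are hypothetical.

```python
from html import escape
from urllib.parse import urldefrag


def build_citation_url(pdf_url: str, page: int) -> str:
    """Return a URL that asks the PDF viewer to open at the given page."""
    if page < 1:
        raise ValueError("PDF pages are numbered from 1")
    base, _old_fragment = urldefrag(pdf_url)  # drop any existing fragment
    return f"{base}#page={page}"


def render_citation_link(pdf_url: str, page: int, label: str) -> str:
    """Render an HTML anchor that opens the cited page in a new tab."""
    href = escape(build_citation_url(pdf_url, page), quote=True)
    return (
        f'<a href="{href}" target="_blank" rel="noopener noreferrer">'
        f"{escape(label)} (p. {page})</a>"
    )


if __name__ == "__main__":
    # Hypothetical document URL, used only for illustration.
    print(render_citation_link("https://example.org/reports/annual.pdf", 12, "Annual Report"))
```

The target="_blank" attribute gives the new-tab behaviour, and rel="noopener noreferrer" keeps the opened page from gaining a handle on the chat window.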
Our Approach
Key Features Delivered
Built With
Outcomes Achieved
The system transformed how MoSPI analysts interact with India's national statistical archive, reducing research from hours of manual search to seconds of natural language querying. Answers remain fully traceable through deep PDF page citations, with support for Hindi, English, and Kannada.