AI-Based Legacy Data Extraction & Processing
Automated extraction and structuring of legacy statistical data from PDFs, CSVs, and Excel files — with a human-in-the-loop Feeder system, semantic table discovery, and natural language data analytics via Text2SQL.
Vast Legacy Data Locked in Unstructured Formats
The Ministry of Statistics and Programme Implementation (MoSPI) holds vast amounts of legacy data in PDFs, CSVs, and Excel files. Extracting insights from it currently requires extensive manual effort, specialized coding knowledge, and the ability to handle complex formats such as merged cells and Hindi text.
OCR engines are powerful but not infallible: noise, split headers, and garbage characters can corrupt a database. The challenge was not just extraction but ensuring that only verified, accurate data enters the system. A naive pipeline would pass OCR errors straight into the database and undermine analyst confidence.
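One way to keep noisy OCR output away from the database is to flag suspicious rows for human review before anything is stored. The following is a minimal sketch under stated assumptions, not the project's actual checks: the function names, the allowed character set, and the 0.2 threshold are all hypothetical.

```python
import re

# Hypothetical heuristic: characters outside this set count as OCR noise.
# The Devanagari block (U+0900 to U+097F) is allowed so Hindi text is not flagged.
_ALLOWED = re.compile(r"[\w\s\u0900-\u097F.,:;%()/&'\"+-]")


def noise_ratio(cell: str) -> float:
    """Fraction of characters in a cell that fall outside the allowed set."""
    if not cell:
        return 0.0
    bad = sum(1 for ch in cell if not _ALLOWED.fullmatch(ch))
    return bad / len(cell)


def needs_review(row: list[str], expected_cols: int, max_noise: float = 0.2) -> bool:
    """Flag a row when its width is wrong or any cell looks like OCR garbage."""
    if len(row) != expected_cols:
        # A split or merged header usually shows up as a wrong column count.
        return True
    return any(noise_ratio(cell) > max_noise for cell in row)
```

Rows for which `needs_review` returns `True` would be routed to an admin instead of being written to the database; clean rows can still be reviewed, but they are less likely to need edits.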
Even after extraction, tables from different years often share the same name (e.g., 'Table 2' in 2022 vs 2023), so an AI system cannot tell them apart without human-curated metadata. Cross-table analysis, trend comparison, and statistical operations all required manual data wrangling.
Human-in-the-Loop Data Pipeline with Semantic Discovery and Text2SQL
We built a custom pipeline that automatically extracts tables from Excel, CSV, and PDF files and stores them in a relational (SQL) database. The Feeder system adds a human-in-the-loop step: after automated OCR extraction, admins can edit tables, merge tables that share common headers, rename captions for disambiguation, and approve data before it enters the vector store. Every change is tracked in audit logs.
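The staging, caption renaming, approval, and audit logging described above can be sketched with SQLite from the Python standard library. This is an illustration only: the schema, table names, and function names are assumptions, and the real system's database and Feeder interface are not shown in this case study.

```python
import csv
import io
import sqlite3
from datetime import datetime, timezone


def init_db(conn):
    """Create a hypothetical catalog, cell store, and audit log."""
    conn.executescript("""
        CREATE TABLE IF NOT EXISTS table_catalog (
            table_id INTEGER PRIMARY KEY,
            caption  TEXT NOT NULL,
            status   TEXT NOT NULL DEFAULT 'pending'
        );
        CREATE TABLE IF NOT EXISTS table_cells (
            table_id INTEGER NOT NULL REFERENCES table_catalog(table_id),
            row_idx  INTEGER NOT NULL,
            col_name TEXT NOT NULL,
            value    TEXT
        );
        CREATE TABLE IF NOT EXISTS audit_log (
            log_id    INTEGER PRIMARY KEY,
            table_id  INTEGER NOT NULL,
            admin     TEXT NOT NULL,
            action    TEXT NOT NULL,
            detail    TEXT,
            logged_at TEXT NOT NULL
        );
    """)


def _audit(conn, table_id, admin, action, detail=""):
    """Record who did what to which table, with a UTC timestamp."""
    conn.execute(
        "INSERT INTO audit_log (table_id, admin, action, detail, logged_at) "
        "VALUES (?, ?, ?, ?, ?)",
        (table_id, admin, action, detail, datetime.now(timezone.utc).isoformat()),
    )


def stage_csv(conn, caption, csv_text):
    """Load an extracted table with status 'pending' and return its table_id."""
    cur = conn.execute("INSERT INTO table_catalog (caption) VALUES (?)", (caption,))
    table_id = cur.lastrowid
    for row_idx, row in enumerate(csv.DictReader(io.StringIO(csv_text))):
        conn.executemany(
            "INSERT INTO table_cells VALUES (?, ?, ?, ?)",
            [(table_id, row_idx, col, val) for col, val in row.items()],
        )
    return table_id


def rename_caption(conn, table_id, admin, new_caption):
    """Disambiguate a caption (e.g. add the report year) and log the change."""
    old = conn.execute(
        "SELECT caption FROM table_catalog WHERE table_id = ?", (table_id,)
    ).fetchone()[0]
    conn.execute(
        "UPDATE table_catalog SET caption = ? WHERE table_id = ?",
        (new_caption, table_id),
    )
    _audit(conn, table_id, admin, "rename_caption", f"{old} -> {new_caption}")


def approve(conn, table_id, admin):
    """Mark a table as approved; only approved tables would be indexed downstream."""
    conn.execute(
        "UPDATE table_catalog SET status = 'approved' WHERE table_id = ?", (table_id,)
    )
    _audit(conn, table_id, admin, "approve")
```

In this sketch a table staged as 'Table 2' could be renamed to 'Table 2 (2023)' and then approved, leaving two audit rows behind. Storing cells in a long (row, column, value) layout is a simplification chosen so that one schema fits any extracted table.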
For data discovery, we built the MoSPI Data Intelligence Hub, a semantic search layer over 490+ indexed tables. Users describe what they are looking for in natural language, and a 4-stage retrieval pipeline (Doc Search → Table Retrieval → Semantic Filter → SQL Generation) identifies the most relevant tables from the candidate set.
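The four stages can be sketched as a single function. This is a toy illustration under stated assumptions: `TableMeta`, `retrieve`, and the thresholds are hypothetical names, token overlap (Jaccard similarity) stands in for the embedding-based semantic search a production system would use, and the final stage only emits a placeholder query rather than generated SQL.

```python
from dataclasses import dataclass


@dataclass
class TableMeta:
    doc_title: str
    caption: str
    columns: list[str]
    sql_name: str


def _tokens(text: str) -> set[str]:
    return {t for t in text.lower().replace(",", " ").split() if t}


def _score(query: str, text: str) -> float:
    """Jaccard token overlap; a stand-in for embedding similarity."""
    q, t = _tokens(query), _tokens(text)
    return len(q & t) / len(q | t) if q and t else 0.0


def retrieve(query: str, catalog: list[TableMeta], top_docs: int = 2,
             min_score: float = 0.1) -> list[tuple[TableMeta, str]]:
    # Stage 1, document search: rank source documents by title similarity.
    unique_docs = sorted({m.doc_title for m in catalog})
    docs = sorted(unique_docs, key=lambda d: _score(query, d), reverse=True)[:top_docs]

    # Stage 2, table retrieval: collect the tables that belong to those documents.
    candidates = [m for m in catalog if m.doc_title in docs]

    # Stage 3, semantic filter: keep tables whose caption and columns match the query.
    scored = [(_score(query, m.caption + " " + " ".join(m.columns)), m) for m in candidates]
    kept = [m for s, m in sorted(scored, key=lambda p: p[0], reverse=True) if s >= min_score]

    # Stage 4, SQL generation: a placeholder query naming each surviving table.
    return [(m, f'SELECT * FROM "{m.sql_name}" LIMIT 5') for m in kept]
```

Narrowing by document first keeps the later, more expensive comparison restricted to a small candidate set, which is the usual reason for staging retrieval this way.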
The system enables natural language analytics via Text2SQL: users can ask questions such as 'mean production of coke plants in November 2024' and receive SQL-backed answers with generated charts. We enriched each table with metadata (caption, column names, Q&A pairs) and built a dedicated table catalog for table-level semantic search.
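To make the Text2SQL step concrete, the sketch below shows how enriched metadata might be assembled into model context and how generated SQL might be executed defensively. Everything here is an assumption for illustration: the metadata values, the table name `coke_production`, the prompt layout, and the SELECT-only guard are hypothetical, and the model call itself is not shown.

```python
import sqlite3

# Hypothetical enriched metadata for one table, using the fields named above
# (caption, column names, Q&A pairs).
TABLE_META = {
    "sql_name": "coke_production",
    "caption": "Monthly production of coke plants",
    "columns": ["plant", "month", "production"],
    "qa_pairs": [("Which column holds the month?", "month, formatted as YYYY-MM")],
}


def build_prompt(question: str, meta: dict) -> str:
    """Assemble the context a Text2SQL model would receive (model call not shown)."""
    qa = "\n".join(f"Q: {q}\nA: {a}" for q, a in meta["qa_pairs"])
    return (
        f"Table: {meta['sql_name']}\nCaption: {meta['caption']}\n"
        f"Columns: {', '.join(meta['columns'])}\n{qa}\n"
        f"Question: {question}\nSQL:"
    )


def run_readonly(conn: sqlite3.Connection, sql: str) -> list[tuple]:
    """Execute generated SQL only if it is a single SELECT statement."""
    stripped = sql.strip().rstrip(";")
    # Conservative guard: reject non-SELECT statements and anything containing
    # a second statement. It also rejects semicolons inside string literals.
    if not stripped.lower().startswith("select") or ";" in stripped:
        raise ValueError("only single SELECT statements are allowed")
    return conn.execute(stripped).fetchall()
```

For the example question, a model given this context might return something like `SELECT AVG(production) FROM coke_production WHERE month = '2024-11'`, which `run_readonly` would execute against the approved tables. The guard is deliberately strict because generated SQL should never be able to modify human-verified data.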
Outcomes Achieved
The system transformed MoSPI's legacy data into a searchable, queryable intelligence hub — enabling analysts to find, compare, and analyze statistical tables across decades of reports using natural language, with complete audit trails and human-verified data accuracy.