
Chapter 09 — Problem Statement
Building a Retrieval-Augmented Generation (RAG) system for the Constitution of India presents unique challenges that standard chunking pipelines are not equipped to handle:
| Challenge | Why It's Hard |
|---|---|
| Footnote Contamination | Every page has 3–5 footnotes starting with numbers (e.g., "1. Ins. by Constitution (First Amendment)..."). These get embedded alongside article text, poisoning vector similarity. |
| Structural Ambiguity | Article 19 (a Fundamental Right) and Entry 19 in the Seventh Schedule (Price Control) share the same number. Semantic search cannot distinguish them. |
| Massive Document Size | 402 pages, 395+ Articles, 12 Schedules. A single wrong chunk boundary can merge two unrelated articles. |
| Amendment Noise | Dozens of [Omitted], [Repealed], and [Renumbered] annotations clutter the text. |
Core Requirement
When a user asks "What is Article 19?", the system must return only the text of Article 19 — not a footnote, not a Schedule entry, not a random amendment.
Hallucination-Resistant RAG — Constitution of India Case StudyPage 14
Chapter 10 — Phase 1: The Initial Approach & Failure
Strategy Used
Parser: LlamaParse (Agentic Tier, 10 credits/page)
Chunker: MarkdownHeaderTextSplitter (LangChain)
Embedder: Jina AI (jina-embeddings-v3)
Vector DB: Pinecone
What Happened
Constitution PDF (402 pages) → LlamaParse (Agentic Tier) → Markdown Output (merged pages) → MarkdownHeaderTextSplitter → 624 Giant Chunks (avg 5,000+ chars) → Semantic Search → Wrong Results (footnotes + noise)
Figure 1 — Phase 1 pipeline: giant noisy chunks led to wrong retrieval results.
Failure Analysis
| Problem | Root Cause | Impact |
|---|---|---|
| Only 624 chunks for 395+ Articles | LlamaParse merged pages into massive blocks, ignoring Article boundaries | Multiple articles fused into single chunks; precision retrieval impossible |
| Footnote Poisoning | Parser treated footnotes as body text | Query for "Article 19" matched footnote "19. Ins. by Constitution..." with high similarity |
| Semantic Ambiguity | Pure vector similarity cannot distinguish Article 19 from Schedule Entry 19 | LLM received garbage context → hallucinated answers |
| High Cost, Low ROI | 402 pages × 10 credits = 4,020 LlamaParse credits per sync | Multiple re-syncs during debugging burned 30K+ credits |
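The cost row above is simple arithmetic, and a short sketch makes it easy to re-run with other page counts or rates. The page count and per-page rate are the figures stated in this case study; the helper name `full_syncs_covered` is hypothetical.

```python
PAGES = 402             # page count stated in the case study
CREDITS_PER_PAGE = 10   # Agentic Tier rate stated in the case study

credits_per_sync = PAGES * CREDITS_PER_PAGE


def full_syncs_covered(total_credits: int) -> float:
    """Return how many full-document syncs a credit total corresponds to."""
    return total_credits / credits_per_sync


print(credits_per_sync)                      # 4020
print(round(full_syncs_covered(30_000), 1))  # 7.5
```

In other words, 30K credits corresponds to roughly seven and a half complete re-parses of the document.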
Key Lesson
Premium parsers (LlamaParse, Unstructured.io) excel at tables, invoices, and structured forms. On this dense, unformatted legal text with hundreds of numbered clauses, the premium parser performed worse than a well-engineered free parser.
Chapter 11 — Phase 2: The Architectural Pivot
The entire pipeline was scrapped and rebuilt from scratch using a deterministic, structure-aware Hierarchical Chunking approach.
New Architecture
Constitution PDF → PyMuPDF (Free, Local) → Page Cleaning (split at the ____ separator line)
Above the line: Clean Text (0 footnotes), kept
Below the line: Footnotes (removed), discarded
Clean Text → Article-Boundary Chunker → 3,248 Precise Chunks → Metadata Injection (article_number: 19, article_number: 370, article_number: 21A) → Pinecone Upload → Smart Retriever (Metadata Filter)
Figure 2 — The rebuilt, structure-aware ingestion and retrieval pipeline.
Strategy 1 — Aggressive Noise Isolation (The Footnote Slicer)
The Constitution PDF consistently separates article text from footnotes using a solid line of underscores.
| Metric | Before | After |
|---|---|---|
| Footnotes in index | ~500+ | 0 |
| Noise chunks | ~15% of total | 0% |
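The separator-based cleanup described in this strategy can be sketched as a small per-page function. This is a minimal illustration, not the project's actual parser: the function names and the sample page are hypothetical, and the page strings are assumed to come from a text extractor such as PyMuPDF's `page.get_text()`.

```python
import re

# Assumption: the separator is a run of at least 10 underscores; the
# threshold may need tuning for other PDFs.
SEPARATOR = re.compile(r"_{10,}")


def strip_footnotes(page_text: str) -> str:
    """Keep only the text above the first underscore separator on a page."""
    body = SEPARATOR.split(page_text, maxsplit=1)[0]
    return body.rstrip()


def clean_pages(pages: list[str]) -> str:
    """Clean each page independently, then join the bodies into one text."""
    return "\n".join(strip_footnotes(page) for page in pages)


# Illustrative sample page: article text, separator line, then a footnote.
sample_page = (
    "19. Protection of certain rights regarding freedom of speech, etc.\n"
    "(1) All citizens shall have the right...\n"
    "______________\n"
    "1. Ins. by the Constitution (First Amendment) Act, 1951.\n"
)
print(strip_footnotes(sample_page))
```

Cleaning must happen page by page, before pages are joined: once pages are merged, everything after the first separator would be discarded, including the body text of all later pages.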
Strategy 2 — Article-Boundary (Hierarchical) Chunking
Industry Term
This technique is known as Hierarchical Chunking (also called Multi-Granularity Chunking) in RAG research — the parent represents a full Article; children are smaller sub-sections used for embedding. Rather than splitting text by arbitrary character counts, chunk boundaries are made to align with the document's real logical structure, which is what makes precise, article-level retrieval possible on a 402-page legal text.
| Metric | LlamaParse (Phase 1) | PyMuPDF + Hierarchical Chunking (Phase 2) |
|---|---|---|
| Total Chunks | 624 | 3,248 |
| Avg Chunk Size | ~5,000 chars | ~400 chars |
| Article Isolation | Multiple articles per chunk | One article per parent |
| Cost per Sync | 4,020 credits | 0 credits (free) |
# parser.py: split page text at the underscore separator, discard everything below
import re
parts = re.split(r'_{10,}', page_text)
clean_text = parts[0]  # Keep only the text ABOVE the line
# chunker.py: hierarchical split on Article boundaries using regex
import re
raw_splits = re.split(r'\n(?=\d{1,3}[A-Z]*\.\s+[A-Z])', full_text)
# Result: 1,928 precise parent chunks (each = 1 Article/Clause)
# These parents are further split into 3,248 child chunks for embedding
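To make the parent/child relationship concrete, here is a minimal sketch of the two-level split. The boundary pattern is the one quoted in this chapter; everything else is an assumption for illustration, in particular the `Chunk` dataclass, the function names, and the line-based packing of children to a 400-character limit (the case study reports the average child size but not how children are cut).

```python
import re
from dataclasses import dataclass

# Boundary: a newline followed by an article number such as "19." or "21A."
# and a capitalised heading.
ARTICLE_BOUNDARY = re.compile(r"\n(?=\d{1,3}[A-Z]*\.\s+[A-Z])")


@dataclass
class Chunk:
    parent_id: int  # index of the parent (article-level) chunk
    text: str       # child text that would be embedded


def split_parents(full_text: str) -> list[str]:
    """Split cleaned text into parent chunks at article boundaries."""
    return [p.strip() for p in ARTICLE_BOUNDARY.split(full_text) if p.strip()]


def split_children(parent: str, max_chars: int = 400) -> list[str]:
    """Greedily pack a parent's lines into children of at most max_chars.

    Assumption: a single line longer than max_chars is kept whole.
    """
    children, current = [], ""
    for line in parent.splitlines():
        candidate = f"{current}\n{line}" if current else line
        if current and len(candidate) > max_chars:
            children.append(current)
            current = line
        else:
            current = candidate
    if current:
        children.append(current)
    return children


def build_chunks(full_text: str, max_chars: int = 400) -> list[Chunk]:
    """Produce child chunks that each remember which parent they came from."""
    return [
        Chunk(parent_id=i, text=child)
        for i, parent in enumerate(split_parents(full_text))
        for child in split_children(parent, max_chars)
    ]


# Illustrative sample: a Part heading followed by two articles.
sample = (
    "PART III\n"
    "19. Protection of certain rights regarding freedom of speech, etc.\n"
    "(1) All citizens shall have the right...\n"
    "20. Protection in respect of conviction for offences.\n"
    "(1) No person shall be convicted...\n"
)
parents = split_parents(sample)
print(len(parents))  # 3
```

Clause lines such as "(1) ..." do not match the boundary pattern, so they stay inside their article's parent chunk; text before the first article (here "PART III") becomes its own parent.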
Strategy 3 — Dynamic Metadata Injection
Instead of relying on the embedding model to guess which Article a chunk belongs to, the identity is hardcoded into the vector metadata at chunk-creation time.
| Metric | Count |
|---|---|
| Total chunks | 3,248 |
| Chunks with article_number tag | 2,204 (67.9%) |
| Chunks without tag (Preamble, Schedules, Headings) | 1,044 (32.1%) |
| Noise / footnote chunks | 0 |
# chunker.py: extract the article number during the split
match = re.match(r'^(\d{1,3}[A-Z]*)\.', chunk_text)
if match:
    article_num = match.group(1)  # "19", "21A", "370"
    chunk_metadata["article_number"] = article_num
{
  "id": "a3f8c2...",
  "values": [0.023, -0.041, 0.089, ...],
  "metadata": {
    "source_file": "constitution of india.pdf",
    "article_number": "19",
    "chunk_type": "parent_child",
    "is_omitted": false
  }
}
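A record of this shape could be assembled as sketched below. The metadata keys mirror the example record; the rest is assumed for illustration: the function names, the content-hash ID scheme, and the rule that flags a chunk as omitted when its text contains the word "omitted".

```python
import hashlib
import re

ARTICLE_RE = re.compile(r"^(\d{1,3}[A-Z]*)\.")
# Assumption: a chunk counts as omitted when its text carries an
# "Omitted" annotation; the project's real rule may differ.
OMITTED_RE = re.compile(r"\bomitted\b", re.IGNORECASE)


def build_metadata(chunk_text: str, source_file: str) -> dict:
    """Derive metadata from the chunk text at chunk-creation time."""
    metadata = {
        "source_file": source_file,
        "chunk_type": "parent_child",
        "is_omitted": bool(OMITTED_RE.search(chunk_text)),
    }
    match = ARTICLE_RE.match(chunk_text)
    if match:
        # The key is simply left out when no article number is found
        # (Preamble, Schedules, headings).
        metadata["article_number"] = match.group(1)
    return metadata


def build_record(chunk_text: str, embedding: list[float], source_file: str) -> dict:
    """Bundle ID, vector, and metadata into one upload-ready record."""
    # Hypothetical ID scheme: a content hash gives stable IDs across re-syncs.
    chunk_id = hashlib.sha1(chunk_text.encode("utf-8")).hexdigest()
    return {
        "id": chunk_id,
        "values": embedding,
        "metadata": build_metadata(chunk_text, source_file),
    }


record = build_record(
    "21A. Right to education.", [0.1, 0.2], "constitution of india.pdf"
)
print(record["metadata"]["article_number"])  # 21A
```

Because the pattern only matches at the start of the text, a child chunk that does not begin with the article number would need to inherit the tag from its parent rather than be re-parsed on its own.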
Strategy 4 — Smart Router & Strict Metadata Filtering
User: "What is Article 19?" → LLM Classifier (Query Router) → article_number detected?
Yes ('19'): Strict Metadata Filter (article_number = 19) → Returns ONLY Article 19 chunks → LLM Generator
No (general query): Semantic Search (normal vector similarity) → Returns top-k similar chunks → LLM Generator
Figure 3: Query router. When an article number is detected, a strict metadata filter restricts retrieval to that article's chunks, so similarity alone no longer decides what is returned.
# graph.py: retriever node
target_article = intent.get("article_number")
if target_article and target_article.lower() not in ("null", "none", ""):
    # Article number detected: apply a strict metadata filter so that only
    # that article's chunks are candidates
    pinecone_filter["article_number"] = {"$eq": target_article}

results = index.query(
    vector=query_embedding,
    top_k=25,
    filter=pinecone_filter,  # e.g. {"article_number": {"$eq": "19"}}
)
Why This Works
This is equivalent to a SQL WHERE article_number = '19' clause. Embedding similarity only ranks the chunks that pass the filter, so the database cannot return chunks from any other article, whatever their similarity score.
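The routing decision and the effect of the filter can be illustrated without a vector database. In this sketch a regex stands in for the LLM classifier (a deliberate simplification, not the method used in this case study), and `matches` is a local emulation of the `$eq` condition; all function names are hypothetical.

```python
import re
from typing import Optional

# Stand-in for the LLM classifier: picks up "Article 19", "article 21a",
# or "Art. 370". Assumption for illustration only.
ARTICLE_QUERY_RE = re.compile(
    r"\bart(?:icle|\.)?\s*(\d{1,3}[A-Z]*)\b", re.IGNORECASE
)


def detect_article(query: str) -> Optional[str]:
    """Return the normalised article number in a query, or None."""
    match = ARTICLE_QUERY_RE.search(query)
    return match.group(1).upper() if match else None


def build_filter(query: str) -> dict:
    """Build a strict filter when an article is named, else no filter."""
    article = detect_article(query)
    return {"article_number": {"$eq": article}} if article else {}


def matches(metadata: dict, flt: dict) -> bool:
    """Local emulation of the $eq semantics, for demonstration only."""
    return all(metadata.get(key) == cond["$eq"] for key, cond in flt.items())


# Three chunks: Article 19, Article 370, and an untagged Schedule entry.
chunks = [{"article_number": "19"}, {"article_number": "370"}, {}]
flt = build_filter("What is Article 19?")
filtered = [c for c in chunks if matches(c, flt)]
print(filtered)  # [{'article_number': '19'}]
```

A query such as "Entry 19 of the Seventh Schedule" yields no article number here, so it falls through to ordinary semantic search, and untagged chunks can never pass an article filter.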