Legal Hierarchical Chunking

Chapter 09 — Problem Statement

Building a Retrieval-Augmented Generation (RAG) system for the Constitution of India presents unique challenges that standard chunking pipelines are not equipped to handle:

Challenge	Why It's Hard
Footnote Contamination	Every page has 3–5 footnotes starting with numbers (e.g., "1. Ins. by Constitution (First Amendment)..."). These get embedded alongside article text, poisoning vector similarity.
Structural Ambiguity	Article 19 (a Fundamental Right) and Entry 19 in the Seventh Schedule (Price Control) share the same number. Semantic search cannot distinguish them.
Massive Document Size	402 pages, 395+ Articles, 12 Schedules. A single wrong chunk boundary can merge two unrelated articles.
Amendment Noise	Dozens of [Omitted], [Repealed], and [Renumbered] annotations clutter the text.

Core Requirement

When a user asks "What is Article 19?", the system must return only the text of Article 19 — not a footnote, not a Schedule entry, not a random amendment.

Hallucination-Resistant RAG — Constitution of India Case StudyPage 14

Chapter 10 — Phase 1: The Initial Approach & Failure

Strategy Used

Parser: LlamaParse (Agentic Tier, 10 credits/page)
Chunker: MarkdownHeaderTextSplitter (LangChain)
Embedder: Jina AI (jina-embeddings-v3)
Vector DB: Pinecone

What Happened

Constitution PDF (402 pages) → LlamaParse (Agentic Tier) → Markdown Output (merged pages) → MarkdownHeaderTextSplitter → 624 Giant Chunks (avg 5000+ chars) → Semantic Search → Wrong Results (footnotes + noise)

Figure 1 — Phase 1 pipeline: giant noisy chunks led to wrong retrieval results.

Failure Analysis

Problem	Root Cause	Impact
Only 624 chunks for 395+ Articles	LlamaParse merged pages into massive blocks, ignoring Article boundaries	Multiple articles fused into single chunks; precision retrieval impossible
Footnote Poisoning	Parser treated footnotes as body text	Query for "Article 19" matched footnote "19. Ins. by Constitution..." with high similarity
Semantic Ambiguity	Pure vector similarity cannot distinguish Article 19 from Schedule Entry 19	LLM received garbage context → hallucinated answers
High Cost, Low ROI	402 pages × 10 credits = 4,020 LlamaParse credits per sync	Multiple re-syncs during debugging burned 30K+ credits

Key Lesson

Premium parsers (LlamaParse, Unstructured.io) excel at tables, invoices, and structured forms. For dense, unformatted legal text with hundreds of numbered clauses, they performed worse in this project than a well-engineered free parser.

Chapter 11 — Phase 2: The Architectural Pivot

The entire pipeline was scrapped and rebuilt from scratch using a deterministic, structure-aware Hierarchical Chunking approach.

New Architecture

Constitution PDF → PyMuPDF (Free, Local) → Page Cleaning (split at ____)
Page Cleaning → Footnotes (removed) → Discarded
Page Cleaning → Clean Text (0 footnotes) → Article-Boundary Chunker → 3,248 Precise Chunks → Metadata Injection (article_number: 19, article_number: 370, article_number: 21A) → Pinecone Upload → Smart Retriever (Metadata Filter)

Figure 2 — The rebuilt, structure-aware ingestion and retrieval pipeline.

Strategy 1 — Aggressive Noise Isolation (The Footnote Slicer)

The Constitution PDF consistently separates article text from footnotes using a solid line of underscores.

Metric	Before	After
Footnotes in index	~500+	0
Noise chunks	~15% of total	0%
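
A minimal sketch of the slicer applied across a whole document, assuming the page texts have already been extracted (for example with PyMuPDF's `page.get_text()`) and that the separator is a run of at least ten underscores. The function names `strip_footnotes` and `clean_pages` and the sample page text are illustrative and are not taken from the project's parser.

```python
import re

# Assumed separator: a run of 10 or more underscores between body and footnotes.
FOOTNOTE_RULE = re.compile(r"_{10,}")


def strip_footnotes(page_text: str) -> tuple[str, bool]:
    """Return (body_text, had_footnotes) for a single page."""
    parts = FOOTNOTE_RULE.split(page_text, maxsplit=1)
    body = parts[0].rstrip()
    return body, len(parts) > 1


def clean_pages(pages: list[str]) -> tuple[str, int]:
    """Join the cleaned pages and count how many had a footnote block removed."""
    bodies = []
    removed = 0
    for text in pages:
        body, had_footnotes = strip_footnotes(text)
        bodies.append(body)
        removed += int(had_footnotes)
    return "\n".join(bodies), removed


# Illustrative sample pages (abbreviated, not the real page text).
pages = [
    "19. Protection of certain rights.\n(1) All citizens shall have the right...\n"
    + "_" * 20
    + "\n1. Ins. by Constitution (First Amendment)...",
    "20. Protection in respect of conviction for offences.\n(1) No person shall be convicted...",
]
full_text, removed = clean_pages(pages)
```

Pages without a separator pass through unchanged, so the same function can be run on every page without special cases.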

Strategy 2 — Article-Boundary (Hierarchical) Chunking

Industry Term

This technique is known as Hierarchical Chunking (also called Multi-Granularity Chunking) in RAG research — the parent represents a full Article; children are smaller sub-sections used for embedding. Rather than splitting text by arbitrary character counts, chunk boundaries are made to align with the document's real logical structure, which is what makes precise, article-level retrieval possible on a 402-page legal text.

Metric	LlamaParse (Phase 1)	PyMuPDF + Hierarchical Chunking (Phase 2)
Total Chunks	624	3,248
Avg Chunk Size	~5,000 chars	~400 chars
Article Isolation	Multiple articles per chunk	One Article or clause per parent
Cost per Sync	4,020 credits	0 credits (free)

import re

# parser.py: split page text at the underscore separator, discard everything below
parts = re.split(r'_{10,}', page_text)
clean_text = parts[0]  # Keep only the text ABOVE the line

# chunker.py: hierarchical split on Article boundaries using regex
raw_splits = re.split(r'\n(?=\d{1,3}[A-Z]*\.\s+[A-Z])', full_text)
# Result: 1,928 precise parent chunks (each = 1 Article/Clause)
# These parents are further split into 3,248 child chunks for embedding
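
The source does not show how parents are divided into children, so the following is only one possible sketch: whole lines are packed into children of at most `max_chars` characters, and each child keeps the index of its parent. The helper names, the 400-character default (taken from the average chunk size reported above), and the sample text are assumptions for illustration.

```python
import re

# Same boundary pattern as the chunker snippet: a newline followed by "19. Xyz".
ARTICLE_BOUNDARY = re.compile(r"\n(?=\d{1,3}[A-Z]*\.\s+[A-Z])")


def split_parents(full_text: str) -> list[str]:
    """Split the cleaned text into parent chunks at Article boundaries."""
    return [p.strip() for p in ARTICLE_BOUNDARY.split(full_text) if p.strip()]


def split_children(parent: str, max_chars: int = 400) -> list[str]:
    """Pack whole lines into children; a single over-long line is kept whole."""
    children, current = [], ""
    for line in parent.splitlines():
        candidate = f"{current}\n{line}" if current else line
        if current and len(candidate) > max_chars:
            children.append(current)
            current = line
        else:
            current = candidate
    if current:
        children.append(current)
    return children


def build_chunks(full_text: str, max_chars: int = 400) -> list[dict]:
    """Return child chunks, each linked to its parent by index."""
    chunks = []
    for parent_id, parent in enumerate(split_parents(full_text)):
        for child in split_children(parent, max_chars):
            chunks.append({"parent_id": parent_id, "text": child})
    return chunks


# Illustrative text only, not the real wording of the Constitution.
sample = (
    "19. Protection of certain rights.\n"
    "(1) All citizens shall have the right to freedom of speech.\n"
    "(2) Nothing in sub-clause (a) shall affect any existing law.\n"
    "20. Protection in respect of conviction.\n"
    "(1) No person shall be convicted except for violation of a law in force."
)
chunks = build_chunks(sample, max_chars=80)
```

Because children are cut only inside a parent, no child can span two Articles, whatever value `max_chars` takes.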

Strategy 3 — Dynamic Metadata Injection

Instead of relying on the embedding model to guess which Article a chunk belongs to, the identity is hardcoded into the vector metadata at chunk-creation time.

Metric	Count
Total chunks	3,248
Chunks with article_number tag	2,204 (67.9%)
Chunks without tag (Preamble, Schedules, Headings)	1,044 (32.1%)
Noise / footnote chunks	0

import re

# chunker.py: extract the article number during the split
match = re.match(r'^(\d{1,3}[A-Z]*)\.', chunk_text)
if match:
    article_num = match.group(1)  # "19", "21A", "370"
    chunk_metadata["article_number"] = article_num

{
    "id": "a3f8c2...",
    "values": [0.023, -0.041, 0.089, ...],
    "metadata": {
        "source_file": "constitution of india.pdf",
        "article_number": "19",
        "chunk_type": "parent_child",
        "is_omitted": false
    }
}
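
A hedged sketch of how a record with this shape could be assembled. It assumes the article number is read from the parent text and inherited by every child, that omitted provisions carry the literal marker "[Omitted]", and that the id is a SHA-1 hash of the child text; all three are illustrative choices, and `build_record` is a hypothetical helper rather than the project's code.

```python
import hashlib
import re

ARTICLE_PREFIX = re.compile(r"^(\d{1,3}[A-Z]*)\.")


def build_record(child_text, parent_text, embedding, source_file):
    """Build one vector record; untagged chunks simply omit article_number."""
    metadata = {
        "source_file": source_file,
        "chunk_type": "parent_child",
        # Assumption: an omitted provision is marked with the literal "[Omitted]".
        "is_omitted": "[omitted]" in parent_text.lower(),
    }
    match = ARTICLE_PREFIX.match(parent_text)
    if match:
        # Children inherit the tag from the parent (assumed behaviour).
        metadata["article_number"] = match.group(1)
    # Hypothetical id scheme: identical child texts would collide.
    chunk_id = hashlib.sha1(child_text.encode("utf-8")).hexdigest()
    return {"id": chunk_id, "values": embedding, "metadata": metadata}


tagged = build_record(
    "(1) The State shall provide...",
    "21A. Right to education.\n(1) The State shall provide...",
    [0.1, 0.2],
    "constitution of india.pdf",
)
untagged = build_record(
    "WE, THE PEOPLE OF INDIA",
    "WE, THE PEOPLE OF INDIA",
    [0.3, 0.4],
    "constitution of india.pdf",
)
```

Reading the number from the parent matters because only the first child of an Article begins with "21A."; later children would otherwise lose their tag.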

Strategy 4 — Smart Router & Strict Metadata Filtering

User: What is Article 19? → LLM Classifier (Query Router) → article_number detected?
Yes ('19') → Strict Metadata Filter (article_number = 19) → Returns ONLY Article 19 chunks → LLM Generator
No (general query) → Semantic Search (normal vector similarity) → Returns top-k similar chunks → LLM Generator

Figure 3 — Query router: metadata filtering bypasses semantic search entirely when an article number is detected.

# graph.py: Retriever Node
target_article = intent.get("article_number")
if target_article and target_article.lower() not in ("null", "none", ""):
    # Restrict retrieval to one article: the filter acts as an exact lookup,
    # and the query vector only ranks chunks inside that article.
    pinecone_filter["article_number"] = {"$eq": target_article}
    results = index.query(
        vector=query_embedding,
        top_k=25,
        filter=pinecone_filter,  # {"article_number": {"$eq": "19"}}
    )
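
The project detects the article number with an LLM classifier. As a swapped-in, deterministic illustration of the same routing decision, the sketch below uses a regular expression instead; `detect_article` and `build_filter` are hypothetical names, and the pattern is an assumption that only covers queries of the form "Article 19" or "article 21a".

```python
import re

# Assumed query pattern: the word "article" followed by a number such as 19 or 21A.
ARTICLE_QUERY = re.compile(r"\barticle\s+(\d{1,3}[A-Z]*)\b", re.IGNORECASE)


def detect_article(query: str):
    """Return the article number as an upper-case string, or None."""
    match = ARTICLE_QUERY.search(query)
    return match.group(1).upper() if match else None


def build_filter(query: str):
    """Return a metadata filter for article queries, or None for general ones."""
    article = detect_article(query)
    if article is None:
        return None  # caller falls back to plain semantic search
    return {"article_number": {"$eq": article}}


article_filter = build_filter("What is Article 19?")
general_filter = build_filter("What does Entry 19 of the Seventh Schedule cover?")
```

A query that mentions "Entry 19" produces no filter, so Schedule entries are never confused with Article 19 at the routing step.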

Why This Works

This acts like a SQL WHERE article_number = '19' clause: the filter restricts the candidate set before similarity is scored, so embedding similarity only ranks chunks within Article 19 and the database cannot return chunks from any other article.
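
The filter-then-rank behaviour can be illustrated without a vector database. The sketch below is an in-memory stand-in, not Pinecone's implementation: `filtered_query` is a hypothetical helper and the two-dimensional vectors are made-up sample data.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def filtered_query(records, query_vec, article_number, top_k=25):
    """Keep only records tagged with article_number, then rank by similarity."""
    candidates = [
        r for r in records
        if r["metadata"].get("article_number") == article_number
    ]
    ranked = sorted(
        candidates,
        key=lambda r: cosine(r["values"], query_vec),
        reverse=True,
    )
    return ranked[:top_k]


records = [
    {"id": "art19-a", "values": [0.2, 0.9], "metadata": {"article_number": "19"}},
    {"id": "sched-19", "values": [1.0, 0.0], "metadata": {}},  # Schedule entry, untagged
    {"id": "art21-a", "values": [0.9, 0.1], "metadata": {"article_number": "21"}},
    {"id": "art19-b", "values": [0.8, 0.3], "metadata": {"article_number": "19"}},
]
hits = filtered_query(records, query_vec=[1.0, 0.0], article_number="19")
hit_ids = [h["id"] for h in hits]
```

Here the untagged Schedule entry is the closest vector to the query, yet it is excluded because it fails the filter. Note that the tag is stored as the string "19", so the filter value must be a string as well.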
