Legal Hierarchical Chunking

Chapter 09 — Problem Statement

Building a Retrieval-Augmented Generation (RAG) system for the Constitution of India presents unique challenges that standard chunking pipelines are not equipped to handle:

| Challenge | Why It's Hard |
|---|---|
| Footnote Contamination | Every page has 3–5 footnotes starting with numbers (e.g., "1. Ins. by Constitution (First Amendment)..."). These get embedded alongside article text, poisoning vector similarity. |
| Structural Ambiguity | Article 19 (a Fundamental Right) and Entry 19 in the Seventh Schedule (Price Control) share the same number. Semantic search cannot distinguish them. |
| Massive Document Size | 402 pages, 395+ Articles, 12 Schedules. A single wrong chunk boundary can merge two unrelated articles. |
| Amendment Noise | Dozens of [Omitted], [Repealed], and [Renumbered] annotations clutter the text. |

Core Requirement

When a user asks "What is Article 19?", the system must return only the text of Article 19 — not a footnote, not a Schedule entry, not a random amendment.

Hallucination-Resistant RAG — Constitution of India Case StudyPage 14

Chapter 10 — Phase 1: The Initial Approach & Failure

Strategy Used

Parser: LlamaParse (Agentic Tier, 10 credits/page)

Chunker: MarkdownHeaderTextSplitter (LangChain)

Embedder: Jina AI (jina-embeddings-v3)

Vector DB: Pinecone

What Happened

Constitution PDF (402 pages) → LlamaParse (Agentic Tier) → Markdown Output (merged pages) → MarkdownHeaderTextSplitter → 624 Giant Chunks (avg 5000+ chars) → Semantic Search → Wrong Results (footnotes + noise)

Figure 1: Phase 1 pipeline. Giant noisy chunks led to wrong retrieval results.

Failure Analysis

| Problem | Root Cause | Impact |
|---|---|---|
| Only 624 chunks for 395+ Articles | LlamaParse merged pages into massive blocks, ignoring Article boundaries | Multiple articles fused into single chunks; precision retrieval impossible |
| Footnote Poisoning | Parser treated footnotes as body text | Query for "Article 19" matched footnote "19. Ins. by Constitution..." with high similarity |
| Semantic Ambiguity | Pure vector similarity cannot distinguish Article 19 from Schedule Entry 19 | LLM received garbage context → hallucinated answers |
| High Cost, Low ROI | 402 pages × 10 credits = 4,020 LlamaParse credits per sync | Multiple re-syncs during debugging burned 30K+ credits |
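The cost row can be checked with a few lines of arithmetic. This is a quick sanity check, assuming the 10 credits/page rate quoted for the Agentic Tier and treating the 30K figure as a lower bound; the variable names are illustrative.

```python
import math

PAGES = 402
CREDITS_PER_PAGE = 10      # Agentic Tier rate quoted in this chapter
CREDITS_BURNED = 30_000    # lower bound ("30K+") from the table

credits_per_sync = PAGES * CREDITS_PER_PAGE
full_syncs = CREDITS_BURNED / credits_per_sync

print(credits_per_sync)        # 4020
print(math.ceil(full_syncs))   # 8
```

In other words, roughly eight full re-syncs of the document are enough to reach the 30K mark.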

Key Lesson

Premium parsers (LlamaParse, Unstructured.io) excel at tables, invoices, and structured forms. On this document, a dense, unformatted legal text with hundreds of numbered clauses, the premium parser performed worse than a well-engineered free parser.

Chapter 11 — Phase 2: The Architectural Pivot

The entire pipeline was scrapped and rebuilt from scratch using a deterministic, structure-aware Hierarchical Chunking approach.

New Architecture

Main path: Constitution PDF → PyMuPDF (Free, Local) → Page Cleaning → Split at ____ → Clean Text (0 footnotes) → Article-Boundary Chunker → 3,248 Precise Chunks → Metadata Injection (article_number: 19, 370, 21A) → Pinecone Upload → Smart Retriever (Metadata Filter)

Side branch: Split at ____ → Footnotes (removed) → Discarded

Figure 2: The rebuilt, structure-aware ingestion and retrieval pipeline.

Strategy 1 — Aggressive Noise Isolation (The Footnote Slicer)

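The Constitution PDF consistently separates article text from footnotes using a solid line of underscores. Splitting each page at that line and keeping only the text above it removes the footnotes before any chunking happens.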

| Metric | Before | After |
|---|---|---|
| Footnotes in index | ~500+ | 0 |
| Noise chunks | ~15% of total | 0% |
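The slicing step can be sketched as a small function applied to each page's extracted text. This is a minimal illustration, not the project's actual parser: `strip_footnotes` and the sample page are hypothetical, and the pattern (a run of ten or more underscores) is taken from the `parser.py` snippet in this chapter.

```python
import re

# Assumption: the footnote separator is a run of 10+ underscores.
FOOTNOTE_RULE = re.compile(r"_{10,}")

def strip_footnotes(page_text: str) -> str:
    """Return only the text above the first underscore rule on a page."""
    body = FOOTNOTE_RULE.split(page_text, maxsplit=1)[0]
    return body.rstrip()

# Illustrative page text, not an exact extract from the PDF.
sample_page = (
    "19. Protection of certain rights regarding freedom of speech, etc.\n"
    "(1) All citizens shall have the right...\n"
    "______________________________\n"
    "1. Ins. by Constitution (First Amendment)...\n"
)

clean = strip_footnotes(sample_page)
print(clean)
```

A page without a separator is returned unchanged (apart from trailing whitespace), so the function is safe to run on every page.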

Strategy 2 — Article-Boundary (Hierarchical) Chunking

Industry Term

This technique is known as Hierarchical Chunking (also called Multi-Granularity Chunking) in RAG research — the parent represents a full Article; children are smaller sub-sections used for embedding. Rather than splitting text by arbitrary character counts, chunk boundaries are made to align with the document's real logical structure, which is what makes precise, article-level retrieval possible on a 402-page legal text.

| Metric | LlamaParse (Phase 1) | PyMuPDF + Hierarchical Chunking (Phase 2) |
|---|---|---|
| Total Chunks | 624 | 3,248 |
| Avg Chunk Size | ~5,000 chars | ~400 chars |
| Article Isolation | Multiple articles per chunk | One article per parent |
| Cost per Sync | 4,020 credits | 0 credits (free) |

import re

# parser.py (Strategy 1): split page text at the underscore separator, discard everything below
parts = re.split(r'_{10,}', page_text)
clean_text = parts[0]  # Keep only the text ABOVE the line

# chunker.py (Strategy 2): hierarchical split on Article boundaries using a lookahead regex
raw_splits = re.split(r'\n(?=\d{1,3}[A-Z]*\.\s+[A-Z])', full_text)
# Result: 1,928 precise parent chunks (each = 1 Article/Clause)
# These parents are further split into 3,248 child chunks for embedding
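The parent/child relationship can be illustrated end to end on a small sample. The sketch below reuses the Article-boundary regex from `chunker.py`; everything else is an assumption made for illustration, in particular `split_children`, which simply packs whole lines up to a character budget (the source does not say how children are cut), and the 400-character default, which mirrors the average chunk size reported above.

```python
import re
from dataclasses import dataclass

# Split before any line that starts like "19. Protection" or "21A. Right".
ARTICLE_START = re.compile(r"\n(?=\d{1,3}[A-Z]*\.\s+[A-Z])")

@dataclass
class Chunk:
    parent_id: int  # index of the parent (Article-level) block
    text: str       # child text that would be embedded

def split_parents(full_text: str) -> list[str]:
    """Split cleaned text into Article-level parent blocks."""
    return [p.strip() for p in ARTICLE_START.split(full_text) if p.strip()]

def split_children(parent: str, max_chars: int = 400) -> list[str]:
    """Hypothetical child splitter: pack whole lines up to max_chars."""
    children, current = [], ""
    for line in parent.splitlines():
        if current and len(current) + len(line) + 1 > max_chars:
            children.append(current)
            current = line
        else:
            current = f"{current}\n{line}" if current else line
    if current:
        children.append(current)
    return children

def build_chunks(full_text: str, max_chars: int = 400) -> list[Chunk]:
    """Produce child chunks that each remember their parent block."""
    return [
        Chunk(parent_id=i, text=child)
        for i, parent in enumerate(split_parents(full_text))
        for child in split_children(parent, max_chars)
    ]

sample = (
    "PART III\n"
    "19. Protection of certain rights regarding freedom of speech, etc.\n"
    "(1) All citizens shall have the right...\n"
    "20. Protection in respect of conviction for offences.\n"
    "21A. Right to education.\n"
)
parents = split_parents(sample)
chunks = build_chunks(sample)
```

Note that the clause line starting with "(1)" stays inside the Article 19 parent, because the lookahead only fires on lines that begin with a number, a period, and a capitalised word.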

Strategy 3 — Dynamic Metadata Injection

Instead of relying on the embedding model to guess which Article a chunk belongs to, the identity is hardcoded into the vector metadata at chunk-creation time.

| Metric | Count |
|---|---|
| Total chunks | 3,248 |
| Chunks with article_number tag | 2,204 (67.9%) |
| Chunks without tag (Preamble, Schedules, Headings) | 1,044 (32.1%) |
| Noise / footnote chunks | 0 |

# chunker.py: extract the article number during the split
match = re.match(r'^(\d{1,3}[A-Z]*)\.', chunk_text)
if match:
    article_num = match.group(1)  # "19", "21A", "370"
    chunk_metadata["article_number"] = article_num

Resulting vector record:

{
  "id": "a3f8c2...",
  "values": [0.023, -0.041, 0.089, ...],
  "metadata": {
    "source_file": "constitution of india.pdf",
    "article_number": "19",
    "chunk_type": "parent_child",
    "is_omitted": false
  }
}
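Tagging can be sketched as a pure function over chunk text. This is a minimal illustration under stated assumptions: `build_metadata` and `tag_children` are hypothetical helpers, and copying the parent's tag onto each child is one plausible way to propagate the article number, not a rule confirmed by the source.

```python
import re

# Same prefix pattern as the chunker.py snippet in this chapter.
ARTICLE_PREFIX = re.compile(r"^(\d{1,3}[A-Z]*)\.")

def build_metadata(chunk_text: str, source_file: str) -> dict:
    """Attach article_number only when the text starts with one."""
    metadata = {"source_file": source_file}
    match = ARTICLE_PREFIX.match(chunk_text)
    if match:
        metadata["article_number"] = match.group(1)
    return metadata

def tag_children(parent_text: str, children: list[str], source_file: str) -> list[dict]:
    """Assumption: every child inherits its parent's metadata."""
    base = build_metadata(parent_text, source_file)
    return [{"text": child, "metadata": dict(base)} for child in children]

tagged = build_metadata("21A. Right to education.", "constitution of india.pdf")
untagged = build_metadata("WE, THE PEOPLE OF INDIA, ...", "constitution of india.pdf")
```

Chunks such as the Preamble carry no `article_number` key at all, which keeps them out of any exact-match filter on that field.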

Strategy 4 — Smart Router & Strict Metadata Filtering

User: "What is Article 19?" → LLM Classifier (Query Router) → article_number detected?

Yes ('19'): Strict Metadata Filter (article_number = 19) → Returns ONLY Article 19 chunks → LLM Generator

No (general query): Semantic Search (normal vector similarity) → Returns top-k similar chunks → LLM Generator

Figure 3: Query router. When an article number is detected, a strict metadata filter restricts retrieval to that article, so similarity scores only rank chunks within it.
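The router's output contract can be sketched without an LLM. The pipeline described here uses an LLM classifier; the example below swaps in a regular expression purely to show the shape of the intent and how it becomes a filter. `route` and `build_filter` are hypothetical names, and the guard against "null"/"none" strings mirrors the retriever code in this chapter.

```python
import re
from typing import Optional

# Stand-in for the LLM classifier: matches "Article 19", "article 21A", etc.
ARTICLE_QUERY = re.compile(r"\b(?i:article)\s+(\d{1,3}[A-Z]*)\b")

def route(query: str) -> dict:
    """Return an intent dict with article_number set or None."""
    match = ARTICLE_QUERY.search(query)
    return {"article_number": match.group(1) if match else None}

def build_filter(intent: dict) -> Optional[dict]:
    """Turn a detected article number into an exact-match metadata filter."""
    target = intent.get("article_number")
    if target and target.lower() not in ("null", "none", ""):
        return {"article_number": {"$eq": target}}
    return None  # general query: fall back to plain similarity search

specific = build_filter(route("What is Article 19?"))
general = build_filter(route("Explain the right to education"))
```

A regex like this misses paraphrases ("the nineteenth article"), which is presumably why a classifier is used in practice; the filter-building step is the same either way.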

# graph.py: Retriever Node
# pinecone_filter is assumed to be a dict created earlier in this node
target_article = intent.get("article_number")
if target_article and target_article.lower() not in ("null", "none", ""):
    # Restrict retrieval to one article with an exact metadata match;
    # similarity then only ranks chunks inside that article
    pinecone_filter["article_number"] = {"$eq": target_article}

results = index.query(
    vector=query_embedding,
    top_k=25,
    filter=pinecone_filter,  # e.g. {"article_number": {"$eq": "19"}}
)

Why This Works

The filter acts like a SQL WHERE article_number = '19' clause applied before ranking. Whatever the embedding similarity scores are, the database cannot return chunks from any other article; the scores only order the chunks within Article 19.
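The filter-then-rank behaviour can be reproduced in memory to see why the ambiguity disappears. This is an analogy under stated assumptions, not Pinecone's implementation: `filtered_query`, the two-dimensional vectors, and the record ids are all made up for illustration.

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def filtered_query(records, query_vec, top_k, article_number=None):
    """Apply the exact-match filter first, then rank survivors by similarity."""
    candidates = [
        r for r in records
        if article_number is None
        or r["metadata"].get("article_number") == article_number
    ]
    ranked = sorted(
        candidates, key=lambda r: cosine(r["values"], query_vec), reverse=True
    )
    return ranked[:top_k]

records = [
    {"id": "art19", "values": [0.2, 0.9], "metadata": {"article_number": "19"}},
    {"id": "entry19", "values": [1.0, 0.0], "metadata": {}},  # Schedule entry, untagged
    {"id": "art21", "values": [0.9, 0.1], "metadata": {"article_number": "21"}},
]
query_vec = [1.0, 0.0]

unfiltered = filtered_query(records, query_vec, top_k=2)
filtered = filtered_query(records, query_vec, top_k=2, article_number="19")
```

Here the Article 19 record has the lowest similarity to the query, so plain top-k search misses it entirely, while the filtered call returns it and nothing else.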
