
Chunking Strategy Guide

The wrong chunk size silently degrades retrieval. This is where most RAG implementations fail: no error, just bad answers.


Strategy Comparison

| Strategy | Best For | How It Works | Tool | Cost |
| --- | --- | --- | --- | --- |
| Fixed Size | General RAG default | Splits at char limit regardless of meaning | RecursiveCharacterTextSplitter | Fast · Free |
| ⭐ Parent-Child | Legal / Medical / Deep context | Child (400) → Qdrant search. Parent (2000) → LLM context. Single lookup. | RecursiveCharacterTextSplitter (twice) | Fast · Free (my production approach) |
| Semantic | High-accuracy, precision > cost | Groups sentences by cosine similarity. Breaks where meaning shifts. | SemanticChunker (LangChain) | Slow · Expensive (API per sentence) |
| Markdown Header | Docs, Notion exports | Splits on H1/H2/H3. Chunk inherits parent heading. | MarkdownHeaderTextSplitter | Fast · Free |
| HTML Section | Web pages | Splits on <h1> <h2> <div>. Section context in metadata. | HTMLHeaderTextSplitter | Fast · Free |
| Token-Based | LLM token-limit control | Splits by actual token count, which prevents context overflow | TokenTextSplitter / tiktoken | Fast · Free |
| CSV Row | Tabular data | Each row = one chunk. Row + headers embedded together. | Pandas iterrows() | Fast · No overlap |
| Code Splitter | Source code RAG | Splits on functions/classes, not arbitrary char limits | RecursiveCharacterTextSplitter.from_language(...) | Fast · Free |
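
To make the Markdown Header row concrete, here is a minimal, dependency-free sketch of heading-based splitting. It is not LangChain's MarkdownHeaderTextSplitter: the function name `split_by_headers` and the metadata keys `h1`/`h2`/`h3` are assumptions chosen for illustration, and the sketch does not handle `#` lines inside fenced code.

```python
import re

# Matches H1-H3 lines such as "## Setup"
HEADER_RE = re.compile(r"^(#{1,3})\s+(.*)$")


def split_by_headers(markdown: str) -> list[dict]:
    """Split Markdown on H1/H2/H3 lines; each chunk records the headings above it."""
    chunks = []
    current = {}   # heading key ("h1", "h2", "h3") -> heading text in scope
    buffer = []    # body lines collected since the last heading

    def flush():
        text = "\n".join(buffer).strip()
        if text:
            chunks.append({"text": text, "metadata": dict(current)})
        buffer.clear()

    for line in markdown.splitlines():
        match = HEADER_RE.match(line)
        if match:
            flush()
            level = len(match.group(1))
            # A new heading invalidates same-level and deeper headings in scope.
            for deeper in range(level, 4):
                current.pop(f"h{deeper}", None)
            current[f"h{level}"] = match.group(2).strip()
        else:
            buffer.append(line)
    flush()
    return chunks


doc = "# Guide\nIntro text.\n## Setup\nInstall it.\n## Usage\nRun it."
chunks = split_by_headers(doc)
# chunks[2] -> {"text": "Run it.", "metadata": {"h1": "Guide", "h2": "Usage"}}
```

Because every chunk carries its heading trail in metadata, a short body such as "Run it." can still be traced back to the section it came from.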

Parent-Child Chunking: Deep Dive

This is the most important pattern for production RAG. Here's exactly why and how.

The Problem With Naive Chunking

Small chunks (400 chars) → precise vector search BUT poor LLM context
Large chunks (2000 chars) → rich LLM context BUT noisy vector search

You can't win with one size. So use two sizes.

The Solution: Two-Level Chunking

from langchain.text_splitter import RecursiveCharacterTextSplitter

# PARENT: Large, contextual, for LLM
parent_splitter = RecursiveCharacterTextSplitter(
    chunk_size=2000,    # roughly 300 words of legal text
    chunk_overlap=200,  # Preserve cross-boundary context
)

# CHILD: Small, precise, for vector search
child_splitter = RecursiveCharacterTextSplitter(
    chunk_size=400,     # ~3-4 sentences
    chunk_overlap=50,   # Minimal overlap (search precision)
)

How It Works: Step by Step

PDF (Constitution.pdf, 400 pages)
  ↓ PyMuPDFLoader
  ↓ parent_splitter → ~685 parent chunks (2000 chars each)
  ↓ For each parent → child_splitter → ~4-5 child chunks per parent
  ↓ Total → ~2,845 child chunks

WHAT GOES INTO QDRANT:
  Only CHILD chunks stored as vector points

EACH CHILD PAYLOAD CONTAINS:
  text: child text (400 chars)       ← shown as source preview in UI
  parent_text: FULL parent text      ← sent to LLM as context
  parent_id: "{file_hash}_{p_idx}"   ← for deduplication
  source_file: "Constitution_of_India_2024.pdf"
  page: 92
  chunk_type: "child"
  is_temporary: False
  uploaded_by: "system"
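
The indexing flow above can be sketched without external dependencies. The `split_text` helper is a simplified fixed-window stand-in for RecursiveCharacterTextSplitter (it cuts at exact character offsets and ignores separators), and `build_child_payloads` plus the SHA-256 digest used for `file_hash` are assumptions for illustration, not the production code. The `page` field is omitted because the sketch works on plain text rather than a loaded PDF.

```python
import hashlib


def split_text(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Fixed-window splitter: consecutive chunks share `chunk_overlap` characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks


def build_child_payloads(text: str, source_file: str) -> list[dict]:
    """Split into parents, then children; every child carries its full parent text."""
    # Hypothetical file hash: first 12 hex chars of the text's SHA-256 digest.
    file_hash = hashlib.sha256(text.encode("utf-8")).hexdigest()[:12]
    payloads = []
    for p_idx, parent in enumerate(split_text(text, 2000, 200)):
        for child in split_text(parent, 400, 50):
            payloads.append({
                "text": child,           # embedded and searched
                "parent_text": parent,   # sent to the LLM at generation time
                "parent_id": f"{file_hash}_{p_idx}",
                "source_file": source_file,
                "chunk_type": "child",
                "is_temporary": False,
                "uploaded_by": "system",
            })
    return payloads


sample = "Article text. " * 300  # 4,200 characters of placeholder text
payloads = build_child_payloads(sample, "sample.pdf")
```

Storing `parent_text` on every child duplicates the parent several times over, but it is what makes the single lookup possible: one vector search returns everything needed for generation.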

The Key Insight

# During SEARCH:
#   Small child chunks → precise retrieval (less noise)

# During GENERATION:
#   Parent text (2000 chars) → sent to LLM
#   → rich legal context with surrounding provisions,
#     definitions, and section numbers
#   → reduces the risk of hallucination

Interview Answer

"During vector search, small child chunks give precise retrieval because they carry less noise. During generation, I send the parent text (2000 chars) to the LLM, giving it rich legal context. This reduces the risk of hallucination because the parent text includes surrounding legal provisions, definitions, and section numbers."
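
At query time, several child hits often point to the same parent, which is why the payload carries a `parent_id`. Below is a minimal sketch of that deduplication step, assuming the search hits have already been converted to plain dicts with `score` and `payload` keys. The hit shape and the function name `collect_parent_context` are assumptions for illustration, not the Qdrant client API.

```python
def collect_parent_context(hits: list[dict], max_parents: int = 3) -> list[str]:
    """Return unique parent texts, ordered by their best-scoring child hit."""
    seen = set()
    parents = []
    for hit in sorted(hits, key=lambda h: h["score"], reverse=True):
        payload = hit["payload"]
        if payload["parent_id"] in seen:
            continue  # a higher-scoring child already contributed this parent
        seen.add(payload["parent_id"])
        parents.append(payload["parent_text"])
        if len(parents) >= max_parents:
            break
    return parents


hits = [
    {"score": 0.91, "payload": {"parent_id": "abc_7", "parent_text": "Parent seven"}},
    {"score": 0.88, "payload": {"parent_id": "abc_7", "parent_text": "Parent seven"}},
    {"score": 0.74, "payload": {"parent_id": "abc_2", "parent_text": "Parent two"}},
]
context = "\n\n".join(collect_parent_context(hits))
```

Without the `seen` check, two children of the same parent would send the same 2000 characters to the LLM twice and waste context window.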


Real Numbers From My Production System

| Document | Parent Chunks | Child Chunks | File Size |
| --- | --- | --- | --- |
| Constitution of India 2024 | 685 | 2,845 | 2.4 MB |
| Bharatiya Nagarik Suraksha Sanhita | 555 | 2,656 | 2.1 MB |
| Bharatiya Nyaya Sanhita 2023 | 271 | 1,311 | 0.9 MB |
| Motor Vehicles Act 1988 | 262 | 1,248 | 1.2 MB |
| Consumer Protection Act 2019 | 81 | 416 | 1.2 MB |
| IT Act 2000 (Updated) | 83 | 420 | 0.8 MB |
| TOTAL | 1,937 | 8,896 | ~8.6 MB |

Quick Reference: Sizes I Use

Standard RAG default:  chunk_size=1000, chunk_overlap=200
Legal / Medical RAG:   parent=2000, child=400, child_overlap=50
Code RAG:              language-aware splitter (function/class boundaries)
CSV RAG:               no splitter; one chunk per row + column headers
Markdown docs:         MarkdownHeaderTextSplitter first, then RecursiveCharacterTextSplitter
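
For the CSV case, pairing every value with its column header keeps each row self-describing once it is embedded. The sketch below uses the standard-library csv module rather than pandas iterrows() to stay dependency-free. The function name `rows_to_chunks` and the `header: value` text layout are assumptions for illustration.

```python
import csv
import io


def rows_to_chunks(csv_text: str) -> list[dict]:
    """One chunk per row, with each value labeled by its column header."""
    reader = csv.DictReader(io.StringIO(csv_text))
    chunks = []
    for row_number, row in enumerate(reader, start=1):
        text = " | ".join(f"{column}: {value}" for column, value in row.items())
        chunks.append({"text": text, "metadata": {"row": row_number}})
    return chunks


data = "name,role\nAsha,engineer\nRavi,analyst\n"
csv_chunks = rows_to_chunks(data)
# csv_chunks[0]["text"] -> "name: Asha | role: engineer"
```

No overlap is needed here because rows are independent records; the headers supply the context that overlap would otherwise provide.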