Architecture & Technical Documentation
8-Node Agentic RAG · LangGraph StateGraph · Zero-Cost Infrastructure
Engineered by Ambuj Kumar Tripathi — GenAI Solution Architect · RAG Systems Specialist
Built an enterprise-grade 8-node Agentic RAG system using LangGraph StateGraph with Jina v3 MRL embeddings (75% storage savings), dual-strategy chunking, 3-layer hallucination prevention, and 7-layer upload security — deployed on zero-cost infrastructure serving Indian Budget, Tax Laws, and Constitution documents.
TECH STACK
| Layer | Technology | Purpose |
|---|---|---|
| Frontend | React 18 + Vite 6 | SPA with SSE word-by-word streaming |
| Backend | FastAPI + Uvicorn | Async API server |
| Orchestration | LangGraph StateGraph | 8-node agentic pipeline with conditional routing |
| LLM | Qwen 2.5 72B via OpenRouter | Generation + Classification |
| Embeddings | Jina v3 MRL — 256d via API | 1024→256d truncation, 75% storage savings, 0 local RAM |
| VectorDB | Pinecone Serverless | Dual namespace: core brain + temp user uploads |
| Document DB | MongoDB Atlas (Motor async) | Chat history with TTL indexes — GDPR compliant |
| File Registry | Supabase PostgreSQL + Storage | SHA-256 sync engine, PDF storage |
| PDF Parsing | LlamaParse (3 tiers) + PyMuPDF | Cloud for complex tables, local for plain text |
| Security | PII Shield + JWT + SlowAPI | Aadhaar/PAN/Phone masking before LLM |
| Observability | Langfuse | LLM trace + generation spans + latency metrics |
| Resilience | pybreaker Circuit Breakers | 3 failures → 30s cooldown → auto-recovery |
| Cache | Upstash Redis | SHA-256 response cache (1hr TTL) + rate limiting |
| Deployment | Docker multi-stage on Render Free | 512MB RAM — zero GPU cost — all inference via API |
HIGH-LEVEL SYSTEM ARCHITECTURE
The complete system spans four layers — from the React frontend through the FastAPI gateway into the LangGraph orchestration engine, backed by five managed cloud services. Every component runs on free-tier infrastructure with zero GPU cost.
| Layer | Components | Deployment |
|---|---|---|
| Client | React 18 SPA, SSE streaming, Google OAuth | Served by FastAPI (same container) |
| API Gateway | FastAPI + Uvicorn, JWT auth, PII Shield, rate limiting | Render Free Tier (512MB RAM) |
| Orchestration | LangGraph 8-node StateGraph, pybreaker circuit breakers | Same container — async event loop |
| Infrastructure | Pinecone, MongoDB Atlas, Supabase, Upstash Redis, Langfuse | All managed cloud — zero self-hosted |
8-NODE PIPELINE — SYSTEM ARCHITECTURE
Every user query enters through the PII Shield and flows through the LangGraph StateGraph. The Classifier node makes a single LLM call to route the query into one of four paths — abusive, greeting, vague, or rag. The RAG path proceeds through Retriever → Generator → Hallucination Guard before results are saved to MongoDB via PostProcess and streamed back to the UI via SSE.
NODE-BY-NODE BREAKDOWN
| Node | Name | Function | Cost |
|---|---|---|---|
| 1 | Classifier | 1 LLM call classifies query (abusive/greeting/vague/rag) AND determines search scope (system_only/user_only/hybrid) — avoids unnecessary Pinecone namespace queries | 1 LLM call |
| 2 | Reject | Blocks abusive queries with a firm professional response. Regex-based — zero LLM cost. | 0 LLM calls |
| 3 | Greet | Handles greetings WITHOUT hitting VectorDB — saves Pinecone + Jina credits. 1 lightweight LLM call. | 1 LLM call |
| 4 | CrossQuestioner | If query is too vague, asks ONE clarifying question (max 2 rounds) before triggering retrieval. Prevents wasting retrieval credits on ambiguous queries. | 1 LLM call |
| 5 | Retriever | Dual Pinecone search: Core Brain (top_k=20, is_temporary=False) + Temp Uploads (top_k=5, is_temporary=True, uploaded_by=user). Deduplicates by parent_id — child vectors searched, parent text fed to LLM. | Jina embed + 2 Pinecone calls |
| 6 | Generator | Confidence < 40% → Fallback immediately (no LLM call wasted). PII Shield masks before LLM. Strict context-only system prompt: mandatory citations, Pro Tips, Follow-ups. Language mirroring: English/Hinglish/Hindi. | 1 LLM call (conditional) |
| 7 | Hallucination Guard | Separate LLM-as-judge call verifies the generated answer is grounded in retrieved context. If hallucinated → Fallback node. If grounded → PostProcess. | 1 LLM call |
| 8 | PostProcess | Saves user message + AI response to MongoDB Atlas. Logs query_type, confidence score, latency via Langfuse distributed tracing. Returns response to frontend via SSE stream. | DB write + Langfuse log |
| — | Fallback | Activated by: (a) no chunks in Pinecone, (b) confidence < 40%, (c) hallucination detected. Returns a clear 'I don't have this information' — never fabricates answers. | 0 LLM calls |
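The routing above can be sketched in plain Python. This is illustrative only — the real system wires these nodes through LangGraph's StateGraph with conditional edges, and the `classify` stub below stands in for the single LLM classification call (its heuristics are assumptions, not the production prompt):

```python
# Illustrative sketch of the 8-node routing — not the actual LangGraph graph.

def classify(query: str) -> str:
    """Stub for the Classifier node's single LLM call (assumption)."""
    q = query.lower()
    if any(w in q for w in ("idiot", "stupid")):
        return "abusive"
    if q.rstrip("!? ") in ("hi", "hello", "namaste"):
        return "greeting"
    if len(q.split()) < 3:
        return "vague"
    return "rag"

def route(query: str) -> list[str]:
    """Return the sequence of nodes the query would traverse."""
    label = classify(query)
    if label == "abusive":
        return ["classifier", "reject"]
    if label == "greeting":
        return ["classifier", "greet"]          # no VectorDB hit
    if label == "vague":
        return ["classifier", "cross_questioner"]
    # rag path: retrieval → generation → grounding check → persistence
    return ["classifier", "retriever", "generator",
            "hallucination_guard", "postprocess"]
```

The key property mirrored here is that only the `rag` branch ever touches Pinecone — greetings and rejections short-circuit before any retrieval cost is incurred.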
DATA INGESTION PIPELINE
Every PDF is fingerprinted with SHA-256 and compared against the Supabase registry. Only new or changed documents are parsed and embedded — unchanged files are skipped, saving Jina API credits. Parsing strategy is chosen based on document complexity.
3,854 chunks | 3,854 live vectors | Pinecone Serverless
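The fingerprint comparison described above can be sketched as follows; the in-memory `registry` dict stands in for the Supabase fp_file_registry table, and the function names are assumptions:

```python
import hashlib

def sha256_of(data: bytes) -> str:
    """Content fingerprint used to detect new or changed files."""
    return hashlib.sha256(data).hexdigest()

def needs_reindex(filename: str, data: bytes, registry: dict[str, str]) -> bool:
    """Compare against the stored fingerprint; only new/changed files pass.

    `registry` stands in for the Supabase fp_file_registry table (assumption).
    """
    digest = sha256_of(data)
    if registry.get(filename) == digest:
        return False   # unchanged — skip parsing + embedding, save Jina credits
    registry[filename] = digest
    return True        # new or modified — parse and embed
```

Because the digest is computed over content rather than filename or mtime, a re-uploaded identical file is always skipped regardless of when or where it was copied from.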
LLAMAPARSE 3-TIER PARSING SYSTEM
| Tier | Credits/Page | Used For | Example Documents |
|---|---|---|---|
| Agentic Plus | 45 cr/page | Infographics, charts, visual tables | Budget at a Glance, summary charts |
| Agentic | 10 cr/page | Complex financial tables, math, memoranda | Finance Bill, Tax Memorandum |
| Cost Effective | 3 cr/page | Structured legal text, clean formatting | Constitution, RBI KYC Guidelines |
| PyMuPDF (free) | 0 — local | Plain prose text, temp user uploads | PF Scheme, user-uploaded PDFs |
DUAL CHUNKING STRATEGY
Markdown tables break mid-row with character-based splitting, losing column headers. The solution: dual strategy based on parsing source. MarkdownHeaderTextSplitter for LlamaParse output keeps tables intact. Parent-Child for PyMuPDF gives precise retrieval with rich context — parent text stored in child metadata, no second DB lookup needed.
| Strategy | Chunk Size | Applied When | Key Advantage |
|---|---|---|---|
| MarkdownHeaderTextSplitter | Header-based | LlamaParse output — complex tables, structured docs | Table rows intact — column headers preserved in every chunk |
| Parent-Child Recursive | Parent=2000 Child=400 chars | PyMuPDF output — plain prose, temp uploads | Child vectors searched for precision; parent text fed to LLM — no second DB lookup |
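The Parent-Child strategy can be sketched as below. This is a simplified character-window version — production splitters (e.g. recursive splitters) respect sentence and paragraph boundaries, and the ID scheme here is an assumption:

```python
def parent_child_chunks(text: str, parent_size: int = 2000,
                        child_size: int = 400) -> list[dict]:
    """Split text into 2000-char parents, then 400-char children.

    Each child carries its parent's full text in metadata, so retrieval
    can embed/search the small child but feed the LLM the rich parent —
    with no second DB lookup at query time.
    """
    chunks = []
    for p_start in range(0, len(text), parent_size):
        parent = text[p_start:p_start + parent_size]
        parent_id = f"parent-{p_start // parent_size}"
        for c_start in range(0, len(parent), child_size):
            chunks.append({
                "id": f"{parent_id}-child-{c_start // child_size}",
                "text": parent[c_start:c_start + child_size],   # this gets embedded
                "metadata": {"parent_id": parent_id, "parent_text": parent},
            })
    return chunks
```

Storing `parent_text` directly in child metadata trades some vector-store storage for the elimination of a round-trip at retrieval time — a deliberate choice given Pinecone metadata is cheap relative to added latency.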
JINA V3 MRL EMBEDDINGS — MATRYOSHKA REPRESENTATION LEARNING
Matryoshka Representation Learning (MRL) trains embeddings such that the first N dimensions of a 1024-dimensional vector are already a high-quality 256-dimensional representation — enabling truncation at inference time without retraining. Task-specific LoRA adapters enable asymmetric search: different encoders for query vs document passage.
| Property | Value |
|---|---|
| Full embedding dimension | 1024d (standard Jina v3 output) |
| Truncated dimension used | 256d — truncated at API level via MRL |
| Pinecone storage savings | 75% reduction vs 1024d |
| Retrieval accuracy retained | ~95% preserved after truncation |
| Query adapter | retrieval.query (asymmetric — for user queries) |
| Document adapter | retrieval.passage (asymmetric — for indexed chunks) |
| RAM usage | 0 bytes local — all inference via Jina AI API |
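The truncation operation itself is simple enough to sketch locally, assuming an MRL-trained vector (in production this happens at the Jina API level, so no full 1024d vector ever reaches the container):

```python
import math

def mrl_truncate(vec: list[float], dim: int = 256) -> list[float]:
    """Keep the first `dim` dimensions of an MRL-trained embedding,
    then L2-renormalize so cosine similarity stays well-behaved.

    MRL training packs the most information into the leading dimensions,
    which is why plain truncation retains most retrieval accuracy.
    """
    head = vec[:dim]
    norm = math.sqrt(sum(x * x for x in head)) or 1.0
    return [x / norm for x in head]
```

Renormalizing after truncation matters: without it, cosine scores against vectors truncated at other lengths would not be comparable.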
3-LAYER HALLUCINATION PREVENTION SYSTEM
| Layer | Mechanism | Trigger | Action |
|---|---|---|---|
| 1 | Confidence Gate | Cosine similarity of top Pinecone match < 40% | Immediate fallback — no LLM call. Prevents low-quality generation on weak context. |
| 2 | Strict System Prompt | Every generation call | Context-only answers. MISSING INFO RULE: if context doesn't contain the answer, say so explicitly. Never invent section numbers, figures, or statistics. |
| 3 | LLM-as-Judge Guard | Post-generation, before serving to user | Separate LLM call verifies answer is grounded in retrieved context. Hallucinated → fallback. Grounded → serve to user. |
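The three layers compose as below. The `generate` and `judge_grounded` callables are placeholders for the real LLM calls (the former is assumed to run under the strict context-only prompt); the fallback wording is illustrative:

```python
def answer_or_fallback(top_score: float, generate, judge_grounded,
                       threshold: float = 0.40) -> str:
    """Three-layer hallucination defence (sketch; callables are assumptions).

    Layer 1: confidence gate — skip generation entirely on weak retrieval.
    Layer 2: `generate` runs with the strict context-only prompt (assumed).
    Layer 3: LLM-as-judge grounding check before serving the user.
    """
    FALLBACK = "I don't have this information in my documents."
    if top_score < threshold:        # Layer 1: no LLM call wasted
        return FALLBACK
    answer = generate()              # Layer 2: strict system prompt
    if not judge_grounded(answer):   # Layer 3: grounding verification
        return FALLBACK
    return answer
```

Note the ordering: the cheap check runs first, so a weak retrieval never pays for either the generation call or the judge call.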
SECURITY ARCHITECTURE — 7-LAYER UPLOAD SECURITY
| Layer | Check | HTTP Status on Fail |
|---|---|---|
| 1 | File extension must be .pdf | 415 Unsupported Media Type |
| 2 | Magic byte verification — file header must begin with %PDF- | 415 Unsupported Media Type |
| 3 | Chunked streaming read (1MB/chunk) — reject if total > 10MB — OOM attack protection | 413 Payload Too Large |
| 4 | PDF bomb protection — PyMuPDF page count capped at 500 pages | 400 Bad Request |
| 5 | IP-based rate limiting — 5 uploads/hour via SlowAPI | 429 Too Many Requests |
| 6 | Per-user file quota — max 3 active temp files per session | 429 Too Many Requests |
| 7 | SHA-256 content dedup — identical file already indexed → skip (0 API tokens consumed) | 200 Skipped |
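Layers 1–3 can be sketched as a single validation function. This is a simplified stand-in — the real endpoint raises FastAPI `HTTPException`s rather than returning status tuples, and the function name is an assumption:

```python
import io

MAX_BYTES = 10 * 1024 * 1024   # 10MB cap (Layer 3)
CHUNK = 1024 * 1024            # 1MB streaming reads

def validate_pdf_upload(filename: str, stream) -> tuple[bytes, int]:
    """Layers 1-3 of the upload checks: extension, size, magic bytes.

    Returns (data, http_status). Reading in 1MB chunks means a huge file
    is rejected as soon as the running total crosses the cap, so the full
    payload is never held in memory — the OOM-attack protection.
    """
    if not filename.lower().endswith(".pdf"):
        return b"", 415                      # Layer 1: extension
    data = bytearray()
    while True:
        chunk = stream.read(CHUNK)
        if not chunk:
            break
        data.extend(chunk)
        if len(data) > MAX_BYTES:
            return b"", 413                  # Layer 3: size cap mid-stream
    if data[:5] != b"%PDF-":                 # a real PDF starts with %PDF-
        return b"", 415                      # Layer 2: magic bytes
    return bytes(data), 200
```

Layers 4–7 (page count, rate limiting, quota, dedup) would run after this function returns a 200, since they need the parsed document or the user's session state.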
Additional Security Features
- PII Shield (regex): Masks Aadhaar, PAN, Mobile, Email, IFSC, Bank Account BEFORE LLM — no personal data ever reaches OpenRouter
- pybreaker Circuit Breakers: All external API calls wrapped — 3 failures → 30s cooldown → auto-recovery. Prevents cascading failures.
- JWT Auth: HS256 signed tokens, 7-day expiry, secure cookie handling
- Google OAuth 2.0: Authlib integration, SameSite=None cross-site cookie support
- Surgical Vector Deletion: Failed embeddings never leave orphaned vectors — SHA-256 idempotent upsert ensures clean state
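The PII Shield's masking pass can be sketched as an ordered regex substitution. The patterns below are illustrative approximations (the production shield covers more formats, including IFSC and bank account numbers):

```python
import re

# Illustrative patterns — approximations, not the production rule set.
PII_PATTERNS = [
    (re.compile(r"\b[A-Z]{5}\d{4}[A-Z]\b"), "[PAN]"),          # PAN: AAAAA9999A
    (re.compile(r"\b\d{4}\s?\d{4}\s?\d{4}\b"), "[AADHAAR]"),   # 12-digit Aadhaar
    (re.compile(r"\b[6-9]\d{9}\b"), "[MOBILE]"),               # Indian mobile
    (re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"), "[EMAIL]"),   # email address
]

def mask_pii(text: str) -> str:
    """Replace PII with placeholder tokens before text reaches the LLM."""
    for pattern, token in PII_PATTERNS:
        text = pattern.sub(token, text)
    return text
```

Ordering matters: the 12-digit Aadhaar pattern runs before the 10-digit mobile pattern so a longer match is never partially consumed by a shorter one.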
DEPLOYMENT STRATEGY — RENDER FREE TIER (512MB RAM)
| Strategy | Implementation | Problem Solved |
|---|---|---|
| Docker multi-stage | Node Alpine → Python slim | Single container — frontend served by FastAPI, no separate web server |
| Zero local ML models | All inference via Jina + OpenRouter API | Eliminates 1–2GB RAM from local model loading |
| MRL 256d embeddings | Jina v3 truncated at API level | 75% less Pinecone storage — fits free tier limits |
| gc.collect() per file | Explicit GC during sync loop | Prevents memory accumulation while processing 3,854 chunks |
| UptimeRobot ping | Every 5 min to /health endpoint | Prevents Render free tier cold starts (30s+ spin-up avoided) |
| Supabase keep-alive | /health pings fp_file_registry | Prevents Supabase 7-day database sleep |
| Batch size = 5 | 5 chunks/Jina call + 200ms pause | Respects Jina rate limits — stable for large document syncs |
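The batched-embedding loop can be sketched as follows; `embed_fn` stands in for the Jina v3 API call (an assumption), and the test below injects a fake to avoid network access:

```python
import time

EMBED_BATCH_SIZE = 5   # chunks per Jina API call
PAUSE_SECONDS = 0.2    # 200ms between batches

def embed_in_batches(chunks: list[str], embed_fn,
                     pause: float = PAUSE_SECONDS) -> list:
    """Embed chunks 5 at a time with a pause between API calls.

    Small batches keep the 512MB container's peak memory flat, and the
    inter-batch pause respects Jina rate limits during large syncs.
    """
    vectors = []
    for i in range(0, len(chunks), EMBED_BATCH_SIZE):
        batch = chunks[i:i + EMBED_BATCH_SIZE]
        vectors.extend(embed_fn(batch))
        if i + EMBED_BATCH_SIZE < len(chunks):   # no pause after last batch
            time.sleep(pause)
    return vectors
```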
INFRASTRUCTURE METRICS — ALL 3 SYSTEMS COMBINED
| Project | Chunks | Live Vectors | Vector DB | Strategy |
|---|---|---|---|---|
| Agentic Financial Parser | 18,758 | 18,758 | Pinecone Serverless | LlamaParse + Markdown |
| Citizen Legal RAG | 12,770 | 9,594 | Qdrant Cloud | Parent-Child (PyMuPDF) |
| Citizen Safety AI | Merged | Merged | Pinecone Serverless | Core Logic Absorbed |
| GRAND TOTAL | 31,528 | 28,352 | Multi-DB | Production Scale |
RESUME BULLETS — PICK 5-6
- Engineered an 8-node Agentic RAG pipeline using LangGraph StateGraph with conditional edge routing (Classifier → Retriever → Generator → Hallucination Guard → PostProcess) for real-time Indian financial document analysis
- Implemented Jina v3 MRL embeddings — 1024→256d via API-level MRL for 75% storage reduction with ~95% retrieval accuracy preserved; task-specific LoRA adapters (retrieval.query vs retrieval.passage) for asymmetric semantic search
- Designed dual-strategy chunking: MarkdownHeaderTextSplitter for LlamaParse tables (preserving table integrity) + Parent-Child Recursive Retrieval (2000→400 chars) — precise retrieval without secondary DB lookups
- Built 3-layer hallucination prevention: (1) aggressive fallback when retrieval confidence < 40%, (2) strict context-only system prompt with MISSING INFO RULE, (3) post-generation LLM-as-judge grounding verification
- Implemented 7-layer upload security (10MB OOM-safe streaming, %PDF- magic byte check, PDF bomb guard, SHA-256 dedup, IP rate limiting) with PII masking (Aadhaar/PAN/Mobile) before LLM inference and pybreaker circuit breakers
- Architected tiered document parsing: LlamaParse Agentic Plus (infographics), Agentic (complex tables), Cost Effective (structured text) + PyMuPDF free fallback — with SHA-256 sync engine preventing redundant re-indexing
- Developed real-time SSE word-by-word streaming with pipeline node visualization — ChatGPT-like progressive delivery with source citations and confidence scores
- Deployed entire production stack on free-tier (Render 512MB, Pinecone Serverless, MongoDB Atlas, Supabase, OpenRouter, Jina, Langfuse) — zero GPU cost, all inference API-based
Tech: LangGraph · Jina v3 MRL · Pinecone Serverless · OpenRouter (Qwen 72B) · LlamaParse · FastAPI · React · MongoDB · Supabase · Langfuse · pybreaker · Upstash Redis
PROJECT STRUCTURE
agentic-rag-financial-parser/
├── app/ # FastAPI Backend
│ ├── main.py # App entry point + lifespan manager
│ ├── api/ # API routes
│ │ ├── auth.py # Google OAuth + JWT (HS256, 7-day expiry)
│ │ ├── oauth.py # Authlib config
│ │ └── upload.py # 7-layer secure upload endpoint
│ ├── core/ # Core utilities
│ │ ├── config.py # Pydantic Settings (env validation)
│ │ ├── constants.py # Chunking hyperparameters + LlamaParse tiers
│ │ └── pii_shield.py # Regex PII masking (Aadhaar/PAN/Phone/Email)
│ ├── db/ # Database clients
│ │ ├── mongodb.py # Async Motor (chat_history, chunks, temp_uploads)
│ │ ├── pinecone_client.py # Vector DB (256d, cosine, AWS us-east-1)
│ │ └── supabase_client.py # PostgreSQL fp_file_registry
│ └── rag/ # RAG pipeline
│ ├── graph.py # 8-node LangGraph StateGraph (CORE)
│ ├── routes.py # Chat + Stream + History + Admin endpoints
│ ├── chunker.py # Dual chunking (MarkdownHeader + Parent-Child)
│ ├── embedder.py # Jina v3 MRL (1024->256d)
│ ├── parser.py # LlamaParse (3-tier) + PyMuPDF
│ └── sync.py # SHA-256 incremental sync engine
│
├── frontend/ # React 18 + Vite 6
│ └── src/
│ ├── App.jsx # Routes + React.lazy loading
│ ├── main.jsx # React root
│ ├── api/client.js # Axios instance with JWT interceptor
│ ├── context/AuthContext.jsx # JWT state management
│ └── pages/
│ ├── Landing.jsx # Public landing page
│ ├── Dashboard.jsx # Chat UI + SSE streaming
│ ├── Admin.jsx # Admin panel (sync/upload/chunks)
│ └── AuthCallback.jsx # OAuth callback handler
│
├── data/
│ ├── raw_pdf/ # Core knowledge base PDFs
│ └── temp_uploads/ # User temp uploads (24hr TTL)
│
├── requirements.txt # Python dependencies
├── Dockerfile # Multi-stage build (Node Alpine -> Python slim)
└── DOCUMENTATION.md # Developer setup guide
API ENDPOINTS REFERENCE
Authentication
| Method | Endpoint | Description |
|---|---|---|
| GET | /auth/login | Redirect to Google OAuth |
| GET | /auth/callback | Handle OAuth callback |
| POST | /auth/logout | Logout + cleanup temp vectors |
| POST | /api/auth/dev-login | Dev-only bypass |
| GET | /api/me | Current user info + is_admin flag |
Chat
| Method | Endpoint | Description |
|---|---|---|
| POST | /api/chat | Standard RAG query |
| POST | /api/chat/stream | SSE word-by-word streaming |
| GET | /api/chat/history | Get conversation history |
| DELETE | /api/chat/history | Clear history |
Upload
| Method | Endpoint | Description |
|---|---|---|
| POST | /api/upload/temp | Upload temp PDF (7-layer security) |
Admin (ADMIN_EMAIL only)
| Method | Endpoint | Description |
|---|---|---|
| POST | /api/admin/sync | Run SHA-256 sync engine |
| DELETE | /api/admin/documents/{name} | Delete document + vectors |
| GET | /api/admin/chunks | View pending HITL chunks |
| POST | /api/admin/chunks/approve | Approve/reject/edit chunks |
| GET | /api/admin/stats | Dashboard statistics |
512MB RAM OPTIMIZATION (RENDER FREE TIER)
Running a production RAG system on 512MB RAM requires aggressive memory optimization. Every component is tuned to prevent OOM crashes.
Critical Configurations
| Parameter | Value | Why |
|---|---|---|
| Embedding batch size | 5 chunks/call | Prevents memory spike during bulk indexing |
| Chat history limit | 6 messages | Keeps LLM context window small |
| Pinecone top_k | 12-15 (core), 5 (temp) | Fewer results = less memory for post-processing |
| LLM max_tokens | 2048 | Caps response size to prevent large allocations |
| MongoDB pool | maxPoolSize=5, minPoolSize=1 | Fewer idle connections consuming memory |
| MRL dimension | 256d (not 1024d) | 75% less memory during embedding operations |
| gc.collect() | After every PDF sync | Explicit garbage collection prevents accumulation |
Recommended Uvicorn Command
uvicorn app.main:app --host 0.0.0.0 --port ${PORT:-8000} --workers 1 --limit-concurrency 10 --timeout-keep-alive 30
--limit-concurrency 10 prevents memory exhaustion from simultaneous requests. --workers 1 ensures single process (no 512MB x N multiplication).
What NOT to Add (Memory Hogs)
| Feature | Why Avoid |
|---|---|
| Celery/Background Workers | Each worker = separate process = 512MB x N |
| In-memory caching | Redis is already external — no need for local cache |
| Large embedding batches | Keep EMBED_BATCH_SIZE = 5 |
| PDF preview generation | Loads entire PDF in memory |
| WebSocket connections | SSE is already memory-efficient |
| Local ML models (spaCy, transformers) | 500MB+ RAM — use API-based inference instead |
Crash Prevention Checklist
- --limit-concurrency 10 in uvicorn CMD
- MongoDB pool size = 5 (not default 10)
- LLM max_tokens = 2048 (not 4096)
- Pinecone top_k = 12 (not 20)
- gc.collect() after PDF processing
- Rate limiting active (10 req/min via Redis)
- Circuit breaker active (3 fail -> 30s reset)
- Chunked upload streaming (1MB chunks, not full file in memory)
Render-Specific Tips
- Free Tier Cold Starts — Instance sleeps after inactivity. First request takes 30s+. UptimeRobot HEAD ping every 5 min prevents this.
- Supabase Sleep — Free Supabase databases sleep after 7 days of inactivity. /health endpoint pings fp_file_registry to keep alive.
- Log Limits — Render free tier has log retention limits. Keep logging minimal in production.
- No Background Tasks — They run in same process, consume same 512MB. Use async/await instead.