
Agentic Financial Parser

Architecture & Technical Documentation

8-Node Agentic RAG · LangGraph StateGraph · Zero-Cost Infrastructure

Engineered by Ambuj Kumar Tripathi — GenAI Solution Architect · RAG Systems Specialist

Built an enterprise-grade 8-node Agentic RAG system using LangGraph StateGraph with Jina v3 MRL embeddings (75% storage savings), dual-strategy chunking, 3-layer hallucination prevention, and 7-layer upload security — deployed on zero-cost infrastructure serving Indian Budget, Tax Laws, and Constitution documents.


TECH STACK

| Layer | Technology | Purpose |
|---|---|---|
| Frontend | React 18 + Vite 6 | SPA with SSE word-by-word streaming |
| Backend | FastAPI + Uvicorn | Async API server |
| Orchestration | LangGraph StateGraph | 8-node agentic pipeline with conditional routing |
| LLM | Qwen 2.5 72B via OpenRouter | Generation + classification |
| Embeddings | Jina v3 MRL — 256d via API | 1024→256d truncation, 75% storage savings, 0 RAM |
| Vector DB | Pinecone Serverless | Dual namespace: core brain + temp user uploads |
| Document DB | MongoDB Atlas (Motor async) | Chat history with TTL indexes — GDPR compliant |
| File Registry | Supabase PostgreSQL + Storage | SHA-256 sync engine, PDF storage |
| PDF Parsing | LlamaParse (3 tiers) + PyMuPDF | Cloud for complex tables, local for plain text |
| Security | PII Shield + JWT + SlowAPI | Aadhaar/PAN/phone masking before the LLM |
| Observability | Langfuse | LLM traces, generation spans, latency metrics |
| Resilience | pybreaker circuit breakers | 3 failures → 30s cooldown → auto-recovery |
| Cache | Upstash Redis | SHA-256 response cache (1hr TTL) + rate limiting |
| Deployment | Docker multi-stage on Render Free | 512MB RAM, zero GPU cost, all inference via API |

8-NODE PIPELINE — SYSTEM ARCHITECTURE

Every user query enters through the PII Shield and flows through the LangGraph StateGraph. The Classifier node makes a single LLM call to route the query into one of four paths — abusive, greeting, vague, or rag. The RAG path proceeds through Retriever → Generator → Hallucination Guard before results are saved to MongoDB via PostProcess and streamed back to the UI via SSE.
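The routing logic above can be sketched in plain Python. This is an illustrative stand-in, not the production code: the real system builds these as LangGraph StateGraph nodes wired with conditional edges, and the classifier is a single LLM call rather than the toy heuristics shown here (the abusive path is omitted for brevity).

```python
# Minimal sketch of classifier-driven routing. Hypothetical node logic;
# the production Classifier is one LLM call, not keyword heuristics.

def classify(state: dict) -> str:
    """Stand-in for the Classifier node's routing decision."""
    words = state["query"].lower().split()
    if any(w in ("hello", "hi", "namaste") for w in words):
        return "greeting"
    if len(words) < 3:  # too vague -> ask a clarifying question
        return "vague"
    return "rag"

def run_pipeline(state: dict) -> dict:
    route = classify(state)
    if route == "greeting":
        # Greet node: answers without touching the vector DB
        state["answer"] = "Hello! Ask me about the Budget, Tax Laws, or the Constitution."
    elif route == "vague":
        # CrossQuestioner node: one clarifying question before retrieval
        state["answer"] = "Could you clarify what you'd like to know?"
    else:
        # RAG path: Retriever -> Generator -> Hallucination Guard -> PostProcess
        state["route"] = "rag"
        state["answer"] = "(retrieval-grounded answer)"
    return state
```

The point of the design is visible even in the sketch: cheap routes (greeting, vague) exit before any embedding or Pinecone call is made.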

NODE-BY-NODE BREAKDOWN

| Node | Name | Function | Cost |
|---|---|---|---|
| 1 | Classifier | A single LLM call classifies the query (abusive/greeting/vague/rag) AND determines search scope (system_only/user_only/hybrid), avoiding unnecessary Pinecone namespace queries | 1 LLM call |
| 2 | Reject | Blocks abusive queries with a firm, professional response. Regex-based, zero LLM cost. | 0 LLM calls |
| 3 | Greet | Handles greetings WITHOUT hitting the vector DB, saving Pinecone + Jina credits. One lightweight LLM call. | 1 LLM call |
| 4 | CrossQuestioner | If the query is too vague, asks ONE clarifying question (max 2 rounds) before triggering retrieval. Prevents wasting retrieval credits on ambiguous queries. | 1 LLM call |
| 5 | Retriever | Dual Pinecone search: core brain (top_k=20, is_temporary=False) + temp uploads (top_k=5, is_temporary=True, uploaded_by=user). Deduplicates by parent_id: child vectors are searched, parent text is fed to the LLM. | Jina embed + 2 Pinecone calls |
| 6 | Generator | Below 40% confidence → immediate fallback (no LLM call wasted). PII Shield masks input before the LLM. Strict context-only system prompt: mandatory citations, Pro Tips, follow-ups. Language mirroring: English/Hinglish/Hindi. | 1 LLM call (conditional) |
| 7 | Hallucination Guard | A separate LLM-as-judge call verifies the generated answer is grounded in the retrieved context. Hallucinated → Fallback node; grounded → PostProcess. | 1 LLM call |
| 8 | PostProcess | Saves the user message + AI response to MongoDB Atlas. Logs query_type, confidence score, and latency via Langfuse distributed tracing. Returns the response to the frontend via SSE stream. | DB write + Langfuse log |
| — | Fallback | Activated by: (a) no chunks in Pinecone, (b) confidence below 40%, (c) hallucination detected. Returns a clear "I don't have this information" — never fabricates answers. | 0 LLM calls |

DATA INGESTION PIPELINE

Every PDF is fingerprinted with SHA-256 and compared against the Supabase registry. Only new or changed documents are parsed and embedded — unchanged files are skipped, saving Jina API credits. Parsing strategy is chosen based on document complexity.
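The fingerprint-and-skip step can be sketched with the standard library. The streaming read matters: hashing in 1MB chunks keeps memory flat on a 512MB host. The registry dict here is a hypothetical stand-in for the Supabase table.

```python
import hashlib
import tempfile
from pathlib import Path

def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Stream the file in 1MB chunks so large PDFs never load fully into RAM."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

def needs_reindex(path: Path, registry: dict[str, str]) -> bool:
    """registry maps filename -> last-seen SHA-256 (stand-in for Supabase).
    Only new or changed files are parsed and embedded; identical files skip."""
    return registry.get(path.name) != sha256_of(path)

# demo: fingerprint a small file written to the temp directory
tmp = Path(tempfile.gettempdir()) / "fp_demo.pdf"
tmp.write_bytes(b"%PDF-1.7 demo")
digest = sha256_of(tmp)
```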

3,854 chunks | 3,854 live vectors | Pinecone Serverless

LLAMAPARSE 3-TIER PARSING SYSTEM

| Tier | Credits/Page | Used For | Example Documents |
|---|---|---|---|
| Agentic Plus | 45 cr/page | Infographics, charts, visual tables | Budget at a Glance, summary charts |
| Agentic | 10 cr/page | Complex financial tables, math, memoranda | Finance Bill, Tax Memorandum |
| Cost Effective | 3 cr/page | Structured legal text, clean formatting | Constitution, RBI KYC Guidelines |
| PyMuPDF (free) | 0 — local | Plain prose text, temp user uploads | PF Scheme, user-uploaded PDFs |

DUAL CHUNKING STRATEGY

Markdown tables break mid-row with character-based splitting, losing column headers. The solution: dual strategy based on parsing source. MarkdownHeaderTextSplitter for LlamaParse output keeps tables intact. Parent-Child for PyMuPDF gives precise retrieval with rich context — parent text stored in child metadata, no second DB lookup needed.

| Strategy | Chunk Size | Applied When | Key Advantage |
|---|---|---|---|
| MarkdownHeaderTextSplitter | Header-based | LlamaParse output — complex tables, structured docs | Table rows stay intact — column headers preserved in every chunk |
| Parent-Child Recursive | Parent = 2000 chars, Child = 400 chars | PyMuPDF output — plain prose, temp uploads | Child vectors searched for precision; parent text fed to the LLM — no second DB lookup |
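The parent-child scheme can be sketched as follows. This is a simplified illustration using fixed-width slices; the production splitter is recursive (splitting on separators like paragraphs and sentences) rather than cutting at exact character offsets. The key design point survives the simplification: each child chunk carries its parent's full text in metadata, so a retrieval hit needs no second database lookup.

```python
def parent_child_chunks(text: str, parent_size: int = 2000, child_size: int = 400):
    """Split text into ~2000-char parents, then ~400-char children.
    Children are what gets embedded and searched; the parent text rides
    along in each child's metadata and is fed to the LLM on a hit.
    (Illustrative fixed-width slicing, not true recursive splitting.)"""
    chunks = []
    for p_start in range(0, len(text), parent_size):
        parent = text[p_start:p_start + parent_size]
        for c_start in range(0, len(parent), child_size):
            chunks.append({
                "text": parent[c_start:c_start + child_size],  # embedded + searched
                "parent_id": p_start // parent_size,
                "parent_text": parent,  # rich context, no second DB lookup
            })
    return chunks
```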

JINA V3 MRL EMBEDDINGS — MATRYOSHKA REPRESENTATION LEARNING

Matryoshka Representation Learning (MRL) trains embeddings such that the first N dimensions of a 1024-dimensional vector are already a high-quality 256-dimensional representation — enabling truncation at inference time without retraining. Task-specific LoRA adapters enable asymmetric search: different encoders for query vs document passage.

| Property | Value |
|---|---|
| Full embedding dimension | 1024d (standard Jina v3 output) |
| Truncated dimension used | 256d — truncated at the API level via MRL |
| Pinecone storage savings | 75% reduction vs 1024d |
| Retrieval accuracy retained | ~95% preserved after truncation |
| Query adapter | retrieval.query (asymmetric — for user queries) |
| Document adapter | retrieval.passage (asymmetric — for indexed chunks) |
| RAM usage | 0 bytes local — all inference via the Jina AI API |
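What MRL truncation means mechanically can be shown in a few lines. The sketch below mimics the operation client-side for illustration; in this system the Jina API returns 256d vectors directly, so no local truncation happens. Re-normalizing after truncation keeps cosine similarity well-behaved.

```python
import math

def truncate_mrl(vec: list[float], dim: int = 256) -> list[float]:
    """Keep the first `dim` dimensions of an MRL-trained embedding and
    re-normalize to unit length. MRL training makes this prefix a usable
    embedding on its own; no retraining or re-indexing is required."""
    head = vec[:dim]
    norm = math.sqrt(sum(x * x for x in head)) or 1.0
    return [x / norm for x in head]
```

A 1024-float vector at 4 bytes per float is 4KB; the 256d prefix is 1KB, which is where the 75% storage figure comes from.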

3-LAYER HALLUCINATION PREVENTION SYSTEM

| Layer | Mechanism | Trigger | Action |
|---|---|---|---|
| 1 | Confidence Gate | Cosine similarity of top Pinecone match below 40% | Immediate fallback — 0 LLM calls. Prevents low-quality generation on weak context. |
| 2 | Strict System Prompt | Every generation call | Context-only answers. MISSING INFO RULE: if the context doesn't contain the answer, say so explicitly. Never invent section numbers, figures, or statistics. |
| 3 | LLM-as-Judge Guard | Post-generation, before serving to the user | A separate LLM call verifies the answer is grounded in the retrieved context. Hallucinated → fallback; grounded → serve to the user. |
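Layer 1 is the cheapest of the three and can be expressed in a few lines. A minimal sketch, assuming the retriever hands over a list of match scores:

```python
def confidence_gate(scores: list[float], threshold: float = 0.40) -> str:
    """Layer 1: if the best retrieval match scores below the threshold
    (or nothing was retrieved at all), skip generation entirely and
    route to the fallback node — no LLM call is spent on weak context."""
    if not scores or max(scores) < threshold:
        return "fallback"
    return "generate"
```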

SECURITY ARCHITECTURE — 7-LAYER UPLOAD SECURITY

| Layer | Check | HTTP Status on Fail |
|---|---|---|
| 1 | File extension must be .pdf | 415 Unsupported Media Type |
| 2 | Magic-byte verification — the file must start with the %PDF- signature | 415 Unsupported Media Type |
| 3 | Chunked streaming read (1MB/chunk) — reject if total > 10MB — OOM attack protection | 413 Payload Too Large |
| 4 | PDF bomb protection — PyMuPDF page count capped at 500 pages | 400 Bad Request |
| 5 | IP-based rate limiting — 5 uploads/hour via SlowAPI | 429 Too Many Requests |
| 6 | Per-user file quota — max 3 active temp files per session | 429 Too Many Requests |
| 7 | SHA-256 content dedup — identical file already indexed → skip (0 API tokens consumed) | 200 Skipped |
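Layers 2 and 3 combine naturally into one streaming pass, sketched below. The status codes match the table; the function and its tuple return shape are illustrative, not the actual FastAPI handler.

```python
import io

def validate_upload(stream, max_bytes: int = 10 * 1024 * 1024):
    """Layers 2-3: magic-byte check plus OOM-safe chunked size check.
    `stream` is any binary file-like object. Returns an HTTP-style
    (status, reason) tuple — a sketch, not the production handler."""
    header = stream.read(5)
    if header != b"%PDF-":  # PDF signature check before anything else
        return 415, "Unsupported Media Type"
    total = len(header)
    while True:
        chunk = stream.read(1 << 20)  # 1MB at a time, never the whole file
        if not chunk:
            break
        total += len(chunk)
        if total > max_bytes:  # bail out early — OOM attack protection
            return 413, "Payload Too Large"
    return 200, "OK"
```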

Additional Security Features

  • PII Shield (regex): Masks Aadhaar, PAN, Mobile, Email, IFSC, Bank Account BEFORE LLM — no personal data ever reaches OpenRouter
  • pybreaker Circuit Breakers: All external API calls wrapped — 3 failures → 30s cooldown → auto-recovery. Prevents cascading failures.
  • JWT Auth: HS256 signed tokens, 7-day expiry, secure cookie handling
  • Google OAuth 2.0: Authlib integration, SameSite=None cross-site cookie support
  • Surgical Vector Deletion: Failed embeddings never leave orphaned vectors — SHA-256 idempotent upsert ensures clean state
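The PII Shield's masking step can be sketched with stdlib regexes. The patterns below are deliberately simplified illustrations: the production regexes are stricter (e.g. format and context rules for Aadhaar and PAN), and the replacement tokens shown are hypothetical.

```python
import re

# Illustrative patterns only — production masking is stricter than these.
PII_PATTERNS = [
    (re.compile(r"\b\d{4}\s?\d{4}\s?\d{4}\b"), "[AADHAAR]"),  # 12-digit Aadhaar
    (re.compile(r"\b[A-Z]{5}\d{4}[A-Z]\b"), "[PAN]"),         # PAN: ABCDE1234F
    (re.compile(r"\b[6-9]\d{9}\b"), "[MOBILE]"),              # Indian mobile number
    (re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"), "[EMAIL]"),
]

def mask_pii(text: str) -> str:
    """Run on every query BEFORE it is sent to OpenRouter, so no
    personal identifier ever leaves the backend."""
    for pattern, token in PII_PATTERNS:
        text = pattern.sub(token, text)
    return text
```

Because masking happens before the LLM call, the guarantee holds regardless of which upstream model is configured.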

DEPLOYMENT STRATEGY — RENDER FREE TIER (512MB RAM)

| Strategy | Implementation | Problem Solved |
|---|---|---|
| Docker multi-stage | Node Alpine → Python slim | Single container — frontend served by FastAPI, no separate web server |
| Zero local ML models | All inference via Jina + OpenRouter APIs | Eliminates 1–2GB of RAM from local model loading |
| MRL 256d embeddings | Jina v3 truncated at the API level | 75% less Pinecone storage — fits free-tier limits |
| gc.collect() per file | Explicit GC during the sync loop | Prevents memory accumulation while processing 3,854 chunks |
| UptimeRobot ping | Every 5 min to the /health endpoint | Avoids Render free-tier cold starts (30s+ spin-up) |
| Supabase keep-alive | /health pings fp_file_registry | Prevents Supabase's 7-day database sleep |
| Batch size = 5 | 5 chunks per Jina call + 200ms pause | Respects Jina rate limits — stable for large document syncs |
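The batched embedding loop described in the last row can be sketched as follows. `embed_fn` is a hypothetical stand-in for the Jina embeddings client; the demo uses a fake embedder so nothing hits the network.

```python
import time

def embed_in_batches(chunks, embed_fn, batch_size: int = 5, pause_s: float = 0.2):
    """Send `batch_size` chunks per API call with a short pause between
    calls — the rate-limit-friendly pattern used during document sync."""
    vectors = []
    for i in range(0, len(chunks), batch_size):
        vectors.extend(embed_fn(chunks[i:i + batch_size]))
        if i + batch_size < len(chunks):  # no pause after the final batch
            time.sleep(pause_s)
    return vectors

# demo with a fake embedder (no network calls)
fake_calls = []
def _fake_embed(batch):
    fake_calls.append(len(batch))
    return [[0.0] for _ in batch]

demo_vectors = embed_in_batches([f"chunk-{i}" for i in range(12)],
                                _fake_embed, pause_s=0.0)
```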

INFRASTRUCTURE METRICS — ALL 3 SYSTEMS COMBINED

| Project | Chunks | Live Vectors | Vector DB | Strategy |
|---|---|---|---|---|
| Agentic Financial Parser | 3,854 | 3,854 | Pinecone Serverless | LlamaParse + Markdown |
| Citizen Legal RAG | 10,833 | 8,958 | Qdrant Cloud | Parent-Child (PyMuPDF) |
| Citizen Safety AI | 721 | 641 | Pinecone Serverless | Local Processing |
| GRAND TOTAL | 15,408 | 13,453 | Multi-DB | Production Scale |

RESUME BULLETS — PICK 5-6

  • Engineered an 8-node Agentic RAG pipeline using LangGraph StateGraph with conditional edge routing (Classifier → Retriever → Generator → Hallucination Guard → PostProcess) for real-time Indian financial document analysis
  • Implemented Jina v3 MRL embeddings — 1024→256d via API-level MRL for 75% storage reduction with ~95% retrieval accuracy preserved; task-specific LoRA adapters (retrieval.query vs retrieval.passage) for asymmetric semantic search
  • Designed dual-strategy chunking: MarkdownHeaderTextSplitter for LlamaParse tables (preserving table integrity) + Parent-Child Recursive Retrieval (2000→400 chars) — precise retrieval without secondary DB lookups
  • Built 3-layer hallucination prevention: (1) less than 40% confidence aggressive fallback, (2) strict context-only system prompt with MISSING INFO RULE, (3) post-generation LLM-as-judge grounding verification
  • Implemented 7-layer upload security (10MB OOM-safe streaming, %PDF- magic byte check, PDF bomb guard, SHA-256 dedup, IP rate limiting) with PII masking (Aadhaar/PAN/Mobile) before LLM inference and pybreaker circuit breakers
  • Architected tiered document parsing: LlamaParse Agentic Plus (infographics), Agentic (complex tables), Cost Effective (structured text) + PyMuPDF free fallback — with SHA-256 sync engine preventing redundant re-indexing
  • Developed real-time SSE word-by-word streaming with pipeline node visualization — ChatGPT-like progressive delivery with source citations and confidence scores
  • Deployed entire production stack on free-tier (Render 512MB, Pinecone Serverless, MongoDB Atlas, Supabase, OpenRouter, Jina, Langfuse) — zero GPU cost, all inference API-based

Tech: LangGraph · Jina v3 MRL · Pinecone Serverless · OpenRouter (Qwen 72B) · LlamaParse · FastAPI · React · MongoDB · Supabase · Langfuse · pybreaker · Upstash Redis