Agentic Financial Parser
Architecture & Technical Documentation
8-Node Agentic RAG · LangGraph StateGraph · Zero-Cost Infrastructure
Engineered by Ambuj Kumar Tripathi — GenAI Solution Architect · RAG Systems Specialist
Built an enterprise-grade 8-node Agentic RAG system using LangGraph StateGraph with Jina v3 MRL embeddings (75% storage savings), dual-strategy chunking, 3-layer hallucination prevention, and 7-layer upload security — deployed on zero-cost infrastructure serving Indian Budget, Tax Laws, and Constitution documents.
TECH STACK
| Layer | Technology | Purpose |
|---|---|---|
| Frontend | React 18 + Vite 6 | SPA with SSE word-by-word streaming |
| Backend | FastAPI + Uvicorn | Async API server |
| Orchestration | LangGraph StateGraph | 8-node agentic pipeline with conditional routing |
| LLM | Qwen 2.5 72B via OpenRouter | Generation + Classification |
| Embeddings | Jina v3 MRL (256d via API) | 1024→256d truncation, 75% storage savings, 0 RAM |
| VectorDB | Pinecone Serverless | Dual namespace: core brain + temp user uploads |
| Document DB | MongoDB Atlas (Motor async) | Chat history with TTL indexes — GDPR compliant |
| File Registry | Supabase PostgreSQL + Storage | SHA-256 sync engine, PDF storage |
| PDF Parsing | LlamaParse (3 tiers) + PyMuPDF | Cloud for complex tables, local for plain text |
| Security | PII Shield + JWT + SlowAPI | Aadhaar/PAN/Phone masking before LLM |
| Observability | Langfuse | LLM trace + generation spans + latency metrics |
| Resilience | pybreaker Circuit Breakers | 3 failures → 30s cooldown → auto-recovery |
| Cache | Upstash Redis | SHA-256 response cache (1hr TTL) + rate limiting |
| Deployment | Docker multi-stage on Render Free | 512MB RAM — zero GPU cost — all inference via API |
8-NODE PIPELINE — SYSTEM ARCHITECTURE
Every user query enters through the PII Shield and flows through the LangGraph StateGraph. The Classifier node makes a single LLM call to route the query into one of four paths — abusive, greeting, vague, or rag. The RAG path proceeds through Retriever → Generator → Hallucination Guard before results are saved to MongoDB via PostProcess and streamed back to the UI via SSE.
NODE-BY-NODE BREAKDOWN
| Node | Name | Function | Cost |
|---|---|---|---|
| 1 | Classifier | 1 LLM call classifies query (abusive/greeting/vague/rag) AND determines search scope (system_only/user_only/hybrid) — avoids unnecessary Pinecone namespace queries | 1 LLM call |
| 2 | Reject | Blocks abusive queries with a firm professional response. Regex-based — zero LLM cost. | 0 LLM calls |
| 3 | Greet | Handles greetings WITHOUT hitting VectorDB — saves Pinecone + Jina credits. 1 lightweight LLM call. | 1 LLM call |
| 4 | CrossQuestioner | If query is too vague, asks ONE clarifying question (max 2 rounds) before triggering retrieval. Prevents wasting retrieval credits on ambiguous queries. | 1 LLM call |
| 5 | Retriever | Dual Pinecone search: Core Brain (top_k=20, is_temporary=False) + Temp Uploads (top_k=5, is_temporary=True, uploaded_by=user). Deduplicates by parent_id — child vectors searched, parent text fed to LLM. | Jina embed + 2 Pinecone calls |
| 6 | Generator | Confidence < 40% → immediate Fallback (no LLM call wasted). PII Shield masks before LLM. Strict context-only system prompt: mandatory citations, Pro Tips, Follow-ups. Language mirroring: English/Hinglish/Hindi. | 1 LLM call (conditional) |
| 7 | Hallucination Guard | Separate LLM-as-judge call verifies the generated answer is grounded in retrieved context. If hallucinated → Fallback node. If grounded → PostProcess. | 1 LLM call |
| 8 | PostProcess | Saves user message + AI response to MongoDB Atlas. Logs query_type, confidence score, latency via Langfuse distributed tracing. Returns response to frontend via SSE stream. | DB write + Langfuse log |
| — | Fallback | Activated by: (a) no chunks in Pinecone, (b) confidence < 40%, (c) hallucination detected. Returns a clear 'I don't have this information' — never fabricates answers. | 0 LLM calls |
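The Classifier's conditional routing reduces to a mapping from its label to the next node. A framework-agnostic sketch follows (in the real system this maps onto LangGraph's conditional edges); the `classify()` keyword heuristic below is a stand-in for the single LLM call, and the route names are assumptions mirroring the table above.

```python
# Sketch of the Classifier's conditional routing. classify() is a stub
# standing in for the 1-LLM-call classifier; ROUTES mirrors the
# four-path table above (abusive / greeting / vague / rag).

ROUTES = {
    "abusive": "reject",
    "greeting": "greet",
    "vague": "cross_questioner",
    "rag": "retriever",
}

def classify(query: str) -> str:
    """Stub classifier: keyword heuristic in place of the LLM call."""
    words = query.lower().split()
    if any(w in ("hi", "hello", "namaste") for w in words):
        return "greeting"
    if len(words) < 3:
        return "vague"          # too short to retrieve against
    return "rag"

def route(query: str) -> str:
    """Return the next pipeline node for a query."""
    return ROUTES[classify(query)]
```

The dictionary-dispatch shape is what makes the pipeline cheap to extend: adding a fifth path is one classifier label plus one route entry.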
DATA INGESTION PIPELINE
Every PDF is fingerprinted with SHA-256 and compared against the Supabase registry. Only new or changed documents are parsed and embedded — unchanged files are skipped, saving Jina API credits. Parsing strategy is chosen based on document complexity.
3,854 chunks | 3,854 live vectors | Pinecone Serverless
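The skip-if-unchanged logic can be sketched in a few lines. This is a minimal sketch under assumptions: `registry` is a plain dict standing in for the Supabase `fp_file_registry` table, and `needs_reindex` is a hypothetical helper name.

```python
import hashlib

def fingerprint(pdf_bytes: bytes) -> str:
    """SHA-256 hex digest used as the document's identity."""
    return hashlib.sha256(pdf_bytes).hexdigest()

def needs_reindex(pdf_bytes: bytes, registry: dict, filename: str) -> bool:
    """True only when the file is new or its content changed.
    Unchanged files are skipped, so no parse or embed credits are spent."""
    digest = fingerprint(pdf_bytes)
    if registry.get(filename) == digest:
        return False             # unchanged -> skip parsing + embedding
    registry[filename] = digest  # new or changed -> record and re-index
    return True
```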
LLAMAPARSE 3-TIER PARSING SYSTEM
| Tier | Credits/Page | Used For | Example Documents |
|---|---|---|---|
| Agentic Plus | 45 cr/page | Infographics, charts, visual tables | Budget at a Glance, summary charts |
| Agentic | 10 cr/page | Complex financial tables, math, memoranda | Finance Bill, Tax Memorandum |
| Cost Effective | 3 cr/page | Structured legal text, clean formatting | Constitution, RBI KYC Guidelines |
| PyMuPDF (free) | 0 — local | Plain prose text, temp user uploads | PF Scheme, user-uploaded PDFs |
DUAL CHUNKING STRATEGY
Markdown tables break mid-row with character-based splitting, losing column headers. The solution: dual strategy based on parsing source. MarkdownHeaderTextSplitter for LlamaParse output keeps tables intact. Parent-Child for PyMuPDF gives precise retrieval with rich context — parent text stored in child metadata, no second DB lookup needed.
| Strategy | Chunk Size | Applied When | Key Advantage |
|---|---|---|---|
| MarkdownHeaderTextSplitter | Header-based | LlamaParse output — complex tables, structured docs | Table rows intact — column headers preserved in every chunk |
| Parent-Child Recursive | Parent=2000 Child=400 chars | PyMuPDF output — plain prose, temp uploads | Child vectors searched for precision; parent text fed to LLM — no second DB lookup |
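The Parent-Child strategy's "no second DB lookup" property comes from denormalizing the parent text into each child's metadata at index time. A minimal character-based sketch (the production splitter is recursive and respects separators; the fixed-width split below is a simplification):

```python
def parent_child_chunks(text: str, parent_size: int = 2000, child_size: int = 400):
    """Split text into ~2000-char parents, then ~400-char children.
    The child text is what gets embedded and searched; the parent's full
    text rides along in metadata so retrieval can feed rich context to
    the LLM without a second database round-trip."""
    chunks = []
    for p_idx in range(0, len(text), parent_size):
        parent = text[p_idx:p_idx + parent_size]
        parent_id = f"parent-{p_idx // parent_size}"
        for c_idx in range(0, len(parent), child_size):
            chunks.append({
                "id": f"{parent_id}-child-{c_idx // child_size}",
                "text": parent[c_idx:c_idx + child_size],   # embedded + searched
                "metadata": {"parent_id": parent_id, "parent_text": parent},
            })
    return chunks
```

Deduplicating retrieved children by `parent_id` (as the Retriever node does) then yields one context block per parent.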
JINA V3 MRL EMBEDDINGS — MATRYOSHKA REPRESENTATION LEARNING
Matryoshka Representation Learning (MRL) trains embeddings such that the first N dimensions of a 1024-dimensional vector are already a high-quality 256-dimensional representation — enabling truncation at inference time without retraining. Task-specific LoRA adapters enable asymmetric search: different encoders for query vs document passage.
| Property | Value |
|---|---|
| Full embedding dimension | 1024d (standard Jina v3 output) |
| Truncated dimension used | 256d — truncated at API level via MRL |
| Pinecone storage savings | 75% reduction vs 1024d |
| Retrieval accuracy retained | ~95% preserved after truncation |
| Query adapter | retrieval.query (asymmetric — for user queries) |
| Document adapter | retrieval.passage (asymmetric — for indexed chunks) |
| RAM usage | 0 bytes local — all inference via Jina AI API |
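Conceptually, MRL truncation is just "keep the first 256 dimensions and renormalize" — the training objective guarantees that prefix is already a good embedding. The sketch below illustrates what requesting 256d from the API amounts to (the renormalization step is an assumption; the service handles this server-side):

```python
import math

def mrl_truncate(vec: list, dim: int = 256) -> list:
    """Keep the first `dim` dimensions of an MRL-trained embedding and
    L2-renormalize so cosine similarity stays meaningful. Illustrates
    conceptually what API-level 256d truncation does."""
    head = vec[:dim]
    norm = math.sqrt(sum(x * x for x in head)) or 1.0
    return [x / norm for x in head]
```

The storage arithmetic follows directly: 256/1024 = 25% of the original float count per vector, hence the 75% Pinecone savings.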
3-LAYER HALLUCINATION PREVENTION SYSTEM
| Layer | Mechanism | Trigger | Action |
|---|---|---|---|
| 1 | Confidence Gate | Cosine similarity of top Pinecone match < 40% | Immediate fallback — no LLM call. Prevents low-quality generation on weak context. |
| 2 | Strict System Prompt | Every generation call | Context-only answers. MISSING INFO RULE: if context doesn't contain the answer, say so explicitly. Never invent section numbers, figures, or statistics. |
| 3 | LLM-as-Judge Guard | Post-generation, before serving to user | Separate LLM call verifies answer is grounded in retrieved context. Hallucinated → fallback. Grounded → serve to user. |
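Layer 1 is the cheapest of the three because it runs before any generation. A minimal sketch, assuming Pinecone-style match dicts with a `score` field and a hypothetical `FALLBACK_MSG`:

```python
FALLBACK_MSG = "I don't have this information in my sources."

def confidence_gate(matches: list, threshold: float = 0.40):
    """Layer 1: if no match clears the cosine-similarity threshold,
    short-circuit to the Fallback node without spending an LLM call."""
    if not matches or max(m["score"] for m in matches) < threshold:
        return {"route": "fallback", "answer": FALLBACK_MSG}
    return {"route": "generator", "context": matches}
```

Layers 2 and 3 then guard the cases the gate cannot catch: plausible-looking context that still lacks the answer, and a model that drifts beyond its context anyway.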
SECURITY ARCHITECTURE — 7-LAYER UPLOAD SECURITY
| Layer | Check | HTTP Status on Fail |
|---|---|---|
| 1 | File extension must be .pdf | 415 Unsupported Media Type |
| 2 | Magic byte verification — file must begin with the 5-byte %PDF- header | 415 Unsupported Media Type |
| 3 | Chunked streaming read (1MB/chunk) — reject if total > 10MB — OOM attack protection | 413 Payload Too Large |
| 4 | PDF bomb protection — PyMuPDF page count must not exceed 500 | 400 Bad Request |
| 5 | IP-based rate limiting — 5 uploads/hour via SlowAPI | 429 Too Many Requests |
| 6 | Per-user file quota — max 3 active temp files per session | 429 Too Many Requests |
| 7 | SHA-256 content dedup — identical file already indexed → skip (0 API tokens consumed) | 200 Skipped |
Additional Security Features
- PII Shield (regex): Masks Aadhaar, PAN, Mobile, Email, IFSC, Bank Account BEFORE LLM — no personal data ever reaches OpenRouter
- pybreaker Circuit Breakers: All external API calls wrapped — 3 failures → 30s cooldown → auto-recovery. Prevents cascading failures.
- JWT Auth: HS256 signed tokens, 7-day expiry, secure cookie handling
- Google OAuth 2.0: Authlib integration, SameSite=None cross-site cookie support
- Surgical Vector Deletion: Failed embeddings never leave orphaned vectors — SHA-256 idempotent upsert ensures clean state
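The PII Shield's regex masking can be sketched as an ordered pattern table applied before the prompt leaves the server. The patterns below are illustrative assumptions, not the production regexes (real Aadhaar/PAN validation also involves checksums and more format variants):

```python
import re

# Illustrative patterns (assumptions, not the production regexes).
# Order matters: 12-digit Aadhaar is matched before 10-digit mobile.
PII_PATTERNS = [
    (re.compile(r"\b\d{4}\s?\d{4}\s?\d{4}\b"), "[AADHAAR]"),
    (re.compile(r"\b[A-Z]{5}\d{4}[A-Z]\b"), "[PAN]"),
    (re.compile(r"\b[6-9]\d{9}\b"), "[MOBILE]"),
    (re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"), "[EMAIL]"),
]

def mask_pii(text: str) -> str:
    """Replace PII with placeholder tags before the prompt reaches the LLM."""
    for pattern, tag in PII_PATTERNS:
        text = pattern.sub(tag, text)
    return text
```

Because masking happens server-side and pre-prompt, the raw identifiers never appear in OpenRouter requests or Langfuse traces.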
DEPLOYMENT STRATEGY — RENDER FREE TIER (512MB RAM)
| Strategy | Implementation | Problem Solved |
|---|---|---|
| Docker multi-stage | Node Alpine → Python slim | Single container — frontend served by FastAPI, no separate web server |
| Zero local ML models | All inference via Jina + OpenRouter API | Eliminates 1–2GB RAM from local model loading |
| MRL 256d embeddings | Jina v3 truncated at API level | 75% less Pinecone storage — fits free tier limits |
| gc.collect() per file | Explicit GC during sync loop | Prevents memory accumulation across 3,854 chunk processing |
| UptimeRobot ping | Every 5 min to /health endpoint | Prevents Render free tier cold starts (30s+ spin-up avoided) |
| Supabase keep-alive | /health pings fp_file_registry | Prevents Supabase 7-day database sleep |
| Batch size = 5 | 5 chunks/Jina call + 200ms pause | Respects Jina rate limits — stable for large document syncs |
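The batch-and-pause pattern from the last row is simple but load-bearing on a free tier. A minimal sketch, where `embed_fn` stands in for the real Jina API client:

```python
import time

def embed_in_batches(chunks: list, embed_fn, batch_size: int = 5,
                     pause_s: float = 0.2):
    """Send chunks to the embedding API batch_size at a time, pausing
    200 ms between calls to stay under rate limits on large syncs.
    `embed_fn` stands in for the real Jina client."""
    vectors = []
    for i in range(0, len(chunks), batch_size):
        vectors.extend(embed_fn(chunks[i:i + batch_size]))
        if i + batch_size < len(chunks):
            time.sleep(pause_s)   # rate-limit courtesy pause
    return vectors
```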
INFRASTRUCTURE METRICS — ALL 3 SYSTEMS COMBINED
| Project | Chunks | Live Vectors | Vector DB | Strategy |
|---|---|---|---|---|
| Agentic Financial Parser | 3,854 | 3,854 | Pinecone Serverless | LlamaParse + Markdown |
| Citizen Legal RAG | 10,833 | 8,958 | Qdrant Cloud | Parent-Child (PyMuPDF) |
| Citizen Safety AI | 721 | 641 | Pinecone Serverless | Local Processing |
| GRAND TOTAL | 15,408 | 13,453 | Multi-DB | Production Scale |
RESUME BULLETS — PICK 5-6
- Engineered an 8-node Agentic RAG pipeline using LangGraph StateGraph with conditional edge routing (Classifier → Retriever → Generator → Hallucination Guard → PostProcess) for real-time Indian financial document analysis
- Implemented Jina v3 MRL embeddings — 1024→256d via API-level MRL for 75% storage reduction with ~95% retrieval accuracy preserved; task-specific LoRA adapters (retrieval.query vs retrieval.passage) for asymmetric semantic search
- Designed dual-strategy chunking: MarkdownHeaderTextSplitter for LlamaParse tables (preserving table integrity) + Parent-Child Recursive Retrieval (2000→400 chars) — precise retrieval without secondary DB lookups
- Built 3-layer hallucination prevention: (1) aggressive fallback below 40% retrieval confidence, (2) strict context-only system prompt with MISSING INFO RULE, (3) post-generation LLM-as-judge grounding verification
- Implemented 7-layer upload security (10MB OOM-safe streaming, %PDF- magic byte check, PDF bomb guard, SHA-256 dedup, IP rate limiting) with PII masking (Aadhaar/PAN/Mobile) before LLM inference and pybreaker circuit breakers
- Architected tiered document parsing: LlamaParse Agentic Plus (infographics), Agentic (complex tables), Cost Effective (structured text) + PyMuPDF free fallback — with SHA-256 sync engine preventing redundant re-indexing
- Developed real-time SSE word-by-word streaming with pipeline node visualization — ChatGPT-like progressive delivery with source citations and confidence scores
- Deployed entire production stack on free-tier (Render 512MB, Pinecone Serverless, MongoDB Atlas, Supabase, OpenRouter, Jina, Langfuse) — zero GPU cost, all inference API-based
Tech: LangGraph · Jina v3 MRL · Pinecone Serverless · OpenRouter (Qwen 72B) · LlamaParse · FastAPI · React · MongoDB · Supabase · Langfuse · pybreaker · Upstash Redis