Agentic Financial Parser

Architecture & Technical Documentation

8-Node Agentic RAG · LangGraph StateGraph · Zero-Cost Infrastructure

Engineered by Ambuj Kumar Tripathi — GenAI Solution Architect · RAG Systems Specialist

Built an enterprise-grade 8-node Agentic RAG system using LangGraph StateGraph with Jina v3 MRL embeddings (75% storage savings), dual-strategy chunking, 3-layer hallucination prevention, and 7-layer upload security — deployed on zero-cost infrastructure serving Indian Budget, Tax Laws, and Constitution documents.


TECH STACK

| Layer | Technology | Purpose |
|---|---|---|
| Frontend | React 18 + Vite 6 | SPA with SSE word-by-word streaming |
| Backend | FastAPI + Uvicorn | Async API server |
| Orchestration | LangGraph StateGraph | 8-node agentic pipeline with conditional routing |
| LLM | Qwen 2.5 72B via OpenRouter | Generation + classification |
| Embeddings | Jina v3 MRL — 256d via API | 1024→256d truncation, 75% storage savings, 0 RAM |
| VectorDB | Pinecone Serverless | Dual namespace: core brain + temp user uploads |
| Document DB | MongoDB Atlas (Motor async) | Chat history with TTL indexes — GDPR compliant |
| File Registry | Supabase PostgreSQL + Storage | SHA-256 sync engine, PDF storage |
| PDF Parsing | LlamaParse (3 tiers) + PyMuPDF | Cloud for complex tables, local for plain text |
| Security | PII Shield + JWT + SlowAPI | Aadhaar/PAN/Phone masking before LLM |
| Observability | Langfuse | LLM traces + generation spans + latency metrics |
| Resilience | pybreaker Circuit Breakers | 3 failures → 30s cooldown → auto-recovery |
| Cache | Upstash Redis | SHA-256 response cache (1hr TTL) + rate limiting |
| Deployment | Docker multi-stage on Render Free | 512MB RAM — zero GPU cost — all inference via API |

HIGH-LEVEL SYSTEM ARCHITECTURE

The complete system spans four layers — from the React frontend through the FastAPI gateway into the LangGraph orchestration engine, backed by five managed cloud services. Every component runs on free-tier infrastructure with zero GPU cost.

| Layer | Components | Deployment |
|---|---|---|
| Client | React 18 SPA, SSE streaming, Google OAuth | Served by FastAPI (same container) |
| API Gateway | FastAPI + Uvicorn, JWT auth, PII Shield, rate limiting | Render Free Tier (512MB RAM) |
| Orchestration | LangGraph 8-node StateGraph, pybreaker circuit breakers | Same container — async event loop |
| Infrastructure | Pinecone, MongoDB Atlas, Supabase, Upstash Redis, Langfuse | All managed cloud — zero self-hosted |

8-NODE PIPELINE — SYSTEM ARCHITECTURE

Every user query enters through the PII Shield and flows through the LangGraph StateGraph. The Classifier node makes a single LLM call to route the query into one of four paths — abusive, greeting, vague, or rag. The RAG path proceeds through Retriever → Generator → Hallucination Guard before results are saved to MongoDB via PostProcess and streamed back to the UI via SSE.
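The routing contract can be sketched in plain Python. This is a stdlib-only illustration, not the production code: the real system wires these paths as LangGraph StateGraph nodes with conditional edges, and the real Classifier is a single LLM call rather than the keyword heuristic used here.

```python
def classify(query: str) -> str:
    """Stand-in for the Classifier node (keyword heuristic, not an LLM call)."""
    q = query.lower()
    if any(w in q for w in ("idiot", "stupid")):   # placeholder abuse check
        return "abusive"
    if q.rstrip("!? ") in ("hi", "hello", "namaste"):
        return "greeting"
    if len(q.split()) < 3:                          # too short to retrieve on
        return "vague"
    return "rag"

# Conditional-edge map: classification label -> next node in the graph
ROUTES = {
    "abusive": "reject",
    "greeting": "greet",
    "vague": "cross_questioner",
    "rag": "retriever",
}

def route(query: str) -> str:
    """One classification decides the path — no retrieval cost for non-RAG queries."""
    return ROUTES[classify(query)]
```

The point of the single up-front classification is cost control: greetings and abusive inputs never touch Pinecone or Jina, and vague queries are clarified before any retrieval credits are spent.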

NODE-BY-NODE BREAKDOWN

| Node | Name | Function | Cost |
|---|---|---|---|
| 1 | Classifier | One LLM call classifies the query (abusive/greeting/vague/rag) AND determines search scope (system_only/user_only/hybrid) — avoids unnecessary Pinecone namespace queries | 1 LLM call |
| 2 | Reject | Blocks abusive queries with a firm, professional response. Regex-based — zero LLM cost. | 0 LLM calls |
| 3 | Greet | Handles greetings WITHOUT hitting the VectorDB — saves Pinecone + Jina credits. One lightweight LLM call. | 1 LLM call |
| 4 | CrossQuestioner | If the query is too vague, asks ONE clarifying question (max 2 rounds) before triggering retrieval. Prevents wasting retrieval credits on ambiguous queries. | 1 LLM call |
| 5 | Retriever | Dual Pinecone search: Core Brain (top_k=20, is_temporary=False) + Temp Uploads (top_k=5, is_temporary=True, uploaded_by=user). Deduplicates by parent_id — child vectors are searched, parent text is fed to the LLM. | Jina embed + 2 Pinecone calls |
| 6 | Generator | <40% confidence → immediate Fallback (no LLM call wasted). PII Shield masks before the LLM. Strict context-only system prompt: mandatory citations, Pro Tips, follow-ups. Language mirroring: English/Hinglish/Hindi. | 1 LLM call (conditional) |
| 7 | Hallucination Guard | Separate LLM-as-judge call verifies the generated answer is grounded in the retrieved context. Hallucinated → Fallback node. Grounded → PostProcess. | 1 LLM call |
| 8 | PostProcess | Saves the user message + AI response to MongoDB Atlas. Logs query_type, confidence score, and latency via Langfuse distributed tracing. Returns the response to the frontend via SSE stream. | DB write + Langfuse log |
| — | Fallback | Activated by: (a) no chunks in Pinecone, (b) confidence <40%, (c) hallucination detected. Returns a clear "I don't have this information" — never fabricates answers. | 0 LLM calls |

DATA INGESTION PIPELINE

Every PDF is fingerprinted with SHA-256 and compared against the Supabase registry. Only new or changed documents are parsed and embedded — unchanged files are skipped, saving Jina API credits. Parsing strategy is chosen based on document complexity.
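A minimal sketch of the fingerprint-and-skip logic, using only the standard library. The in-memory `registry` dict and the `plan_sync` helper are illustrative stand-ins for the Supabase fp_file_registry table and the real sync engine:

```python
import hashlib
from pathlib import Path

def fingerprint(path: Path, chunk_size: int = 1 << 20) -> str:
    """SHA-256 of a file, streamed in 1MB chunks so large PDFs never load fully into RAM."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        for block in iter(lambda: f.read(chunk_size), b""):
            h.update(block)
    return h.hexdigest()

def plan_sync(pdf_dir: Path, registry: dict[str, str]) -> list[Path]:
    """Return only the PDFs whose SHA-256 differs from the registered hash.
    Unchanged files are skipped — no parsing, no embedding, no API credits."""
    return [p for p in sorted(pdf_dir.glob("*.pdf"))
            if registry.get(p.name) != fingerprint(p)]
```

After a successful parse + embed, the new hash would be written back to the registry so the next sync run skips the file.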

3,854 chunks | 3,854 live vectors | Pinecone Serverless

LLAMAPARSE 3-TIER PARSING SYSTEM

| Tier | Credits/Page | Used For | Example Documents |
|---|---|---|---|
| Agentic Plus | 45 cr/page | Infographics, charts, visual tables | Budget at a Glance, summary charts |
| Agentic | 10 cr/page | Complex financial tables, math, memoranda | Finance Bill, Tax Memorandum |
| Cost Effective | 3 cr/page | Structured legal text, clean formatting | Constitution, RBI KYC Guidelines |
| PyMuPDF (free) | 0 — local | Plain prose text, temp user uploads | PF Scheme, user-uploaded PDFs |

DUAL CHUNKING STRATEGY

Character-based splitting breaks markdown tables mid-row and strips column headers from downstream chunks. The solution is a dual strategy keyed on parsing source: MarkdownHeaderTextSplitter keeps LlamaParse tables intact, while Parent-Child splitting for PyMuPDF output gives precise retrieval with rich context — each parent's text is stored in its children's metadata, so no second DB lookup is needed.
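The Parent-Child idea can be illustrated with a naive fixed-width splitter (the production system uses recursive, separator-aware splitting; only the sizes and the parent-in-child-metadata trick below match the real design):

```python
def parent_child_chunks(text: str, parent_size: int = 2000, child_size: int = 400):
    """Naive fixed-width sketch of Parent-Child chunking.

    Child text is what gets embedded and searched (precision); the parent's
    full text rides along in the child's metadata, so retrieval can feed the
    LLM rich context without a second DB lookup.
    """
    records = []
    for p_start in range(0, len(text), parent_size):
        parent = text[p_start:p_start + parent_size]
        parent_id = f"parent-{p_start // parent_size}"
        for c_start in range(0, len(parent), child_size):
            records.append({
                "id": f"{parent_id}-child-{c_start // child_size}",
                "text": parent[c_start:c_start + child_size],  # embedded + searched
                "metadata": {"parent_id": parent_id, "parent_text": parent},
            })
    return records
```

Deduplicating retrieved children by `parent_id` then yields one context block per parent, which is what the Retriever node passes to the Generator.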

| Strategy | Chunk Size | Applied When | Key Advantage |
|---|---|---|---|
| MarkdownHeaderTextSplitter | Header-based | LlamaParse output — complex tables, structured docs | Table rows intact — column headers preserved in every chunk |
| Parent-Child Recursive | Parent=2000, Child=400 chars | PyMuPDF output — plain prose, temp uploads | Child vectors searched for precision; parent text fed to LLM — no second DB lookup |

JINA V3 MRL EMBEDDINGS — MATRYOSHKA REPRESENTATION LEARNING

Matryoshka Representation Learning (MRL) trains embeddings such that the first N dimensions of a 1024-dimensional vector are already a high-quality 256-dimensional representation — enabling truncation at inference time without retraining. Task-specific LoRA adapters enable asymmetric search: different encoders for query vs document passage.

| Property | Value |
|---|---|
| Full embedding dimension | 1024d (standard Jina v3 output) |
| Truncated dimension used | 256d — truncated at API level via MRL |
| Pinecone storage savings | 75% reduction vs 1024d |
| Retrieval accuracy retained | ~95% preserved after truncation |
| Query adapter | retrieval.query (asymmetric — for user queries) |
| Document adapter | retrieval.passage (asymmetric — for indexed chunks) |
| RAM usage | 0 bytes local — all inference via Jina AI API |
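The truncation step itself is simple enough to show locally (in this deployment Jina performs it server-side via an API parameter, so nothing below runs in the app — this is just the math). Slice the first N dimensions, then re-normalize so cosine similarity stays well-defined:

```python
import math

def mrl_truncate(vec: list[float], dim: int = 256) -> list[float]:
    """MRL property: the first `dim` dimensions of a Matryoshka-trained
    embedding are already a usable lower-dimensional representation.
    Re-normalize after slicing so dot product == cosine similarity."""
    head = vec[:dim]
    norm = math.sqrt(sum(x * x for x in head)) or 1.0
    return [x / norm for x in head]
```

This only works because MRL training packs the most information into the leading dimensions; truncating an ordinary embedding this way would degrade retrieval far more than the ~5% loss cited above.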

3-LAYER HALLUCINATION PREVENTION SYSTEM

| Layer | Mechanism | Trigger | Action |
|---|---|---|---|
| 1 | Confidence Gate | Cosine similarity of top Pinecone match <40% | Immediate fallback — 0 LLM calls. Prevents low-quality generation on weak context. |
| 2 | Strict System Prompt | Every generation call | Context-only answers. MISSING INFO RULE: if the context doesn't contain the answer, say so explicitly. Never invent section numbers, figures, or statistics. |
| 3 | LLM-as-Judge Guard | Post-generation, before serving to user | Separate LLM call verifies the answer is grounded in retrieved context. Hallucinated → fallback. Grounded → serve to user. |
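The control flow of layers 1 and 3 fits in a few lines. In this sketch `generate` and `judge_grounded` are stand-ins for the two LLM calls (layer 2, the strict prompt, lives inside `generate`); the fallback string and 0.40 threshold mirror the table:

```python
FALLBACK = "I don't have this information in my documents."

def answer_or_fallback(top_score: float, generate, judge_grounded,
                       threshold: float = 0.40) -> str:
    """Gate on retrieval confidence BEFORE spending a generation call,
    then verify grounding with a separate judge call before serving."""
    if top_score < threshold:        # Layer 1: confidence gate — 0 LLM calls
        return FALLBACK
    draft = generate()               # Layer 2: strict context-only prompt
    if not judge_grounded(draft):    # Layer 3: LLM-as-judge grounding check
        return FALLBACK
    return draft
```

Ordering matters for cost: the cheap similarity check runs first, so weak retrievals never trigger either LLM call.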

SECURITY ARCHITECTURE — 7-LAYER UPLOAD SECURITY

| Layer | Check | HTTP Status on Fail |
|---|---|---|
| 1 | File extension must be .pdf | 415 Unsupported Media Type |
| 2 | Magic byte verification — file must begin with %PDF- | 415 Unsupported Media Type |
| 3 | Chunked streaming read (1MB/chunk) — reject if total > 10MB — OOM attack protection | 413 Payload Too Large |
| 4 | PDF bomb protection — PyMuPDF page count capped at 500 pages | 400 Bad Request |
| 5 | IP-based rate limiting — 5 uploads/hour via SlowAPI | 429 Too Many Requests |
| 6 | Per-user file quota — max 3 active temp files per session | 429 Too Many Requests |
| 7 | SHA-256 content dedup — identical file already indexed → skip (0 API tokens consumed) | 200 Skipped |
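Layers 1–3 can be sketched as a single streaming validator (an illustration, not the production endpoint — `validate_pdf_stream` is a hypothetical helper, and the real code raises FastAPI HTTPExceptions instead of returning tuples):

```python
def validate_pdf_stream(filename: str, stream, max_bytes: int = 10 * 2**20):
    """Layers 1-3: extension, %PDF- magic bytes, and a chunked size check
    that rejects oversized uploads WITHOUT buffering the file in RAM.
    Returns (ok, http_status); `stream` is any binary file-like object."""
    if not filename.lower().endswith(".pdf"):     # Layer 1: extension
        return False, 415
    head = stream.read(5)
    if head != b"%PDF-":                          # Layer 2: magic bytes
        return False, 415
    total = len(head)
    while chunk := stream.read(1 << 20):          # Layer 3: 1MB chunks
        total += len(chunk)
        if total > max_bytes:
            return False, 413                     # stop reading immediately
    return True, 200
```

The key OOM defense is in layer 3: the loop aborts as soon as the running total crosses 10MB, so a multi-gigabyte upload never occupies more than ~1MB of memory.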

Additional Security Features

  • PII Shield (regex): Masks Aadhaar, PAN, Mobile, Email, IFSC, Bank Account BEFORE LLM — no personal data ever reaches OpenRouter
  • pybreaker Circuit Breakers: All external API calls wrapped — 3 failures → 30s cooldown → auto-recovery. Prevents cascading failures.
  • JWT Auth: HS256 signed tokens, 7-day expiry, secure cookie handling
  • Google OAuth 2.0: Authlib integration, SameSite=None cross-site cookie support
  • Surgical Vector Deletion: Failed embeddings never leave orphaned vectors — SHA-256 idempotent upsert ensures clean state
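The PII Shield's masking step can be sketched with stdlib regexes. The patterns below are simplified stand-ins for the production rules (real Aadhaar/PAN validation involves checksums these regexes don't attempt):

```python
import re

# Order matters: the 12-digit Aadhaar pattern runs before the 10-digit
# mobile pattern so a mobile regex never fires inside an Aadhaar number.
PII_PATTERNS = [
    (re.compile(r"\b\d{4}\s?\d{4}\s?\d{4}\b"), "[AADHAAR]"),  # 12-digit Aadhaar
    (re.compile(r"\b[A-Z]{5}\d{4}[A-Z]\b"), "[PAN]"),         # PAN: ABCDE1234F
    (re.compile(r"\b[6-9]\d{9}\b"), "[MOBILE]"),              # Indian mobile
    (re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"), "[EMAIL]"),
]

def mask_pii(text: str) -> str:
    """Applied to every query BEFORE it is sent to the LLM provider,
    so raw identifiers never leave the server."""
    for pattern, token in PII_PATTERNS:
        text = pattern.sub(token, text)
    return text
```

Because masking happens server-side before the OpenRouter call, the LLM only ever sees placeholder tokens, never the underlying identifiers.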

DEPLOYMENT STRATEGY — RENDER FREE TIER (512MB RAM)

| Strategy | Implementation | Problem Solved |
|---|---|---|
| Docker multi-stage | Node Alpine → Python slim | Single container — frontend served by FastAPI, no separate web server |
| Zero local ML models | All inference via Jina + OpenRouter APIs | Eliminates 1–2GB RAM from local model loading |
| MRL 256d embeddings | Jina v3 truncated at API level | 75% less Pinecone storage — fits free-tier limits |
| gc.collect() per file | Explicit GC during the sync loop | Prevents memory accumulation while processing 3,854 chunks |
| UptimeRobot ping | Every 5 min to /health endpoint | Prevents Render free-tier cold starts (30s+ spin-up avoided) |
| Supabase keep-alive | /health pings fp_file_registry | Prevents Supabase 7-day database sleep |
| Batch size = 5 | 5 chunks/Jina call + 200ms pause | Respects Jina rate limits — stable for large document syncs |
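The rate-limit-friendly embedding loop amounts to a batching generator plus a pause. A minimal sketch, with `embed_batch` standing in for the Jina API client:

```python
import time

def batched(items, size=5):
    """Yield fixed-size slices — 5 chunks per Jina call in this deployment."""
    for i in range(0, len(items), size):
        yield items[i:i + size]

def embed_all(chunks, embed_batch, pause_s=0.2):
    """Embed chunks 5 at a time with a 200ms pause between API calls,
    keeping both the rate limiter and the 512MB RAM budget happy."""
    vectors = []
    for batch in batched(chunks, 5):
        vectors.extend(embed_batch(batch))
        time.sleep(pause_s)
    return vectors
```

Small batches also bound peak memory: only 5 chunks' worth of text and vectors are in flight at any moment, which matters on a 512MB instance.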

INFRASTRUCTURE METRICS — ALL 3 SYSTEMS COMBINED

| Project | Chunks | Live Vectors | Vector DB | Strategy |
|---|---|---|---|---|
| Agentic Financial Parser | 18,758 | 18,758 | Pinecone Serverless | LlamaParse + Markdown |
| Citizen Legal RAG | 12,770 | 9,594 | Qdrant Cloud | Parent-Child (PyMuPDF) |
| Citizen Safety AI | Merged | Merged | Pinecone Serverless | Core logic absorbed |
| GRAND TOTAL | 31,528 | 28,352 | Multi-DB | Production scale |

RESUME BULLETS — PICK 5-6

  • Engineered an 8-node Agentic RAG pipeline using LangGraph StateGraph with conditional edge routing (Classifier → Retriever → Generator → Hallucination Guard → PostProcess) for real-time Indian financial document analysis
  • Implemented Jina v3 MRL embeddings — 1024→256d via API-level MRL for 75% storage reduction with ~95% retrieval accuracy preserved; task-specific LoRA adapters (retrieval.query vs retrieval.passage) for asymmetric semantic search
  • Designed dual-strategy chunking: MarkdownHeaderTextSplitter for LlamaParse tables (preserving table integrity) + Parent-Child Recursive Retrieval (2000→400 chars) — precise retrieval without secondary DB lookups
  • Built 3-layer hallucination prevention: (1) <40% confidence aggressive fallback, (2) strict context-only system prompt with MISSING INFO RULE, (3) post-generation LLM-as-judge grounding verification
  • Implemented 7-layer upload security (10MB OOM-safe streaming, %PDF- magic byte check, PDF bomb guard, SHA-256 dedup, IP rate limiting) with PII masking (Aadhaar/PAN/Mobile) before LLM inference and pybreaker circuit breakers
  • Architected tiered document parsing: LlamaParse Agentic Plus (infographics), Agentic (complex tables), Cost Effective (structured text) + PyMuPDF free fallback — with SHA-256 sync engine preventing redundant re-indexing
  • Developed real-time SSE word-by-word streaming with pipeline node visualization — ChatGPT-like progressive delivery with source citations and confidence scores
  • Deployed entire production stack on free-tier (Render 512MB, Pinecone Serverless, MongoDB Atlas, Supabase, OpenRouter, Jina, Langfuse) — zero GPU cost, all inference API-based

Tech: LangGraph · Jina v3 MRL · Pinecone Serverless · OpenRouter (Qwen 72B) · LlamaParse · FastAPI · React · MongoDB · Supabase · Langfuse · pybreaker · Upstash Redis


PROJECT STRUCTURE

agentic-rag-financial-parser/
├── app/                            # FastAPI Backend
│   ├── main.py                     # App entry point + lifespan manager
│   ├── api/                        # API routes
│   │   ├── auth.py                 # Google OAuth + JWT (HS256, 7-day expiry)
│   │   ├── oauth.py                # Authlib config
│   │   └── upload.py               # 7-layer secure upload endpoint
│   ├── core/                       # Core utilities
│   │   ├── config.py               # Pydantic Settings (env validation)
│   │   ├── constants.py            # Chunking hyperparameters + LlamaParse tiers
│   │   └── pii_shield.py           # Regex PII masking (Aadhaar/PAN/Phone/Email)
│   ├── db/                         # Database clients
│   │   ├── mongodb.py              # Async Motor (chat_history, chunks, temp_uploads)
│   │   ├── pinecone_client.py      # Vector DB (256d, cosine, AWS us-east-1)
│   │   └── supabase_client.py      # PostgreSQL fp_file_registry
│   └── rag/                        # RAG pipeline
│       ├── graph.py                # 8-node LangGraph StateGraph (CORE)
│       ├── routes.py               # Chat + Stream + History + Admin endpoints
│       ├── chunker.py              # Dual chunking (MarkdownHeader + Parent-Child)
│       ├── embedder.py             # Jina v3 MRL (1024→256d)
│       ├── parser.py               # LlamaParse (3-tier) + PyMuPDF
│       └── sync.py                 # SHA-256 incremental sync engine
├── frontend/                       # React 18 + Vite 6
│   └── src/
│       ├── App.jsx                 # Routes + React.lazy loading
│       ├── main.jsx                # React root
│       ├── api/client.js           # Axios instance with JWT interceptor
│       ├── context/AuthContext.jsx # JWT state management
│       └── pages/
│           ├── Landing.jsx         # Public landing page
│           ├── Dashboard.jsx       # Chat UI + SSE streaming
│           ├── Admin.jsx           # Admin panel (sync/upload/chunks)
│           └── AuthCallback.jsx    # OAuth callback handler
├── data/
│   ├── raw_pdf/                    # Core knowledge base PDFs
│   └── temp_uploads/               # User temp uploads (24hr TTL)
├── requirements.txt                # Python dependencies
├── Dockerfile                      # Multi-stage build (Node Alpine → Python slim)
└── DOCUMENTATION.md                # Developer setup guide

API ENDPOINTS REFERENCE

Authentication

| Method | Endpoint | Description |
|---|---|---|
| GET | /auth/login | Redirect to Google OAuth |
| GET | /auth/callback | Handle OAuth callback |
| POST | /auth/logout | Logout + cleanup of temp vectors |
| POST | /api/auth/dev-login | Dev-only bypass |
| GET | /api/me | Current user info + is_admin flag |

Chat

| Method | Endpoint | Description |
|---|---|---|
| POST | /api/chat | Standard RAG query |
| POST | /api/chat/stream | SSE word-by-word streaming |
| GET | /api/chat/history | Get conversation history |
| DELETE | /api/chat/history | Clear history |

Upload

| Method | Endpoint | Description |
|---|---|---|
| POST | /api/upload/temp | Upload temp PDF (7-layer security) |

Admin (ADMIN_EMAIL only)

| Method | Endpoint | Description |
|---|---|---|
| POST | /api/admin/sync | Run SHA-256 sync engine |
| DELETE | /api/admin/documents/{name} | Delete document + vectors |
| GET | /api/admin/chunks | View pending HITL chunks |
| POST | /api/admin/chunks/approve | Approve/reject/edit chunks |
| GET | /api/admin/stats | Dashboard statistics |

512MB RAM OPTIMIZATION (RENDER FREE TIER)

Running a production RAG system on 512MB RAM requires aggressive memory optimization. Every component is tuned to prevent OOM crashes.

Critical Configurations

| Parameter | Value | Why |
|---|---|---|
| Embedding batch size | 5 chunks/call | Prevents memory spikes during bulk indexing |
| Chat history limit | 6 messages | Keeps the LLM context window small |
| Pinecone top_k | 12–15 (core), 5 (temp) | Fewer results = less memory for post-processing |
| LLM max_tokens | 2048 | Caps response size to prevent large allocations |
| MongoDB pool | maxPoolSize=5, minPoolSize=1 | Fewer idle connections consuming memory |
| MRL dimension | 256d (not 1024d) | 75% less memory during embedding operations |
| gc.collect() | After every PDF sync | Explicit garbage collection prevents accumulation |
uvicorn app.main:app --host 0.0.0.0 --port ${PORT:-8000} --workers 1 --limit-concurrency 10 --timeout-keep-alive 30

--limit-concurrency 10 prevents memory exhaustion from simultaneous requests, and --workers 1 ensures a single process (no 512MB × N multiplication).

What NOT to Add (Memory Hogs)

| Feature | Why Avoid |
|---|---|
| Celery/background workers | Each worker = separate process = 512MB × N |
| In-memory caching | Redis is already external — no need for a local cache |
| Large embedding batches | Keep EMBED_BATCH_SIZE = 5 |
| PDF preview generation | Loads the entire PDF into memory |
| WebSocket connections | SSE is already memory-efficient |
| Local ML models (spaCy, transformers) | 500MB+ RAM — use API-based inference instead |

Crash Prevention Checklist

  • --limit-concurrency 10 in uvicorn CMD
  • MongoDB pool size = 5 (not default 10)
  • LLM max_tokens = 2048 (not 4096)
  • Pinecone top_k = 12 (not 20)
  • gc.collect() after PDF processing
  • Rate limiting active (10 req/min via Redis)
  • Circuit breaker active (3 fail -> 30s reset)
  • Chunked upload streaming (1MB chunks, not full file in memory)

Render-Specific Tips

  1. Free Tier Cold Starts — Instance sleeps after inactivity. First request takes 30s+. UptimeRobot HEAD ping every 5 min prevents this.
  2. Supabase Sleep — Free Supabase databases sleep after 7 days of inactivity. /health endpoint pings fp_file_registry to keep alive.
  3. Log Limits — Render free tier has log retention limits. Keep logging minimal in production.
  4. No Background Tasks — They run in same process, consume same 512MB. Use async/await instead.