Architecture & Technical Documentation
8-Node Agentic RAG · LangGraph StateGraph · Zero-Cost Infrastructure
Engineered by Ambuj Kumar Tripathi — GenAI Solution Architect · RAG Systems Specialist
Built an enterprise-grade 8-node Agentic RAG system using LangGraph StateGraph with Jina v3 MRL embeddings (75% storage savings), dual-strategy chunking, 3-layer hallucination prevention, and 7-layer upload security — deployed on zero-cost infrastructure serving Indian Budget, Tax Laws, and Constitution documents.
TECH STACK
| Layer | Technology | Purpose |
|---|---|---|
| Frontend | React 18 + Vite 6 | SPA with SSE word-by-word streaming |
| Backend | FastAPI + Uvicorn | Async API server |
| Orchestration | LangGraph StateGraph | 8-node agentic pipeline with conditional routing |
| LLM | Qwen 2.5 72B via OpenRouter | Generation + Classification |
| Embeddings | Jina v3 MRL — 256d via API | 1024→256d truncation, 75% storage savings, 0 local RAM |
| VectorDB | Pinecone Serverless | Dual namespace: core brain + temp user uploads |
| Document DB | MongoDB Atlas (Motor async) | Chat history with TTL indexes — GDPR compliant |
| File Registry | Supabase PostgreSQL + Storage | SHA-256 sync engine, PDF storage |
| PDF Parsing | LlamaParse (3 tiers) + PyMuPDF | Cloud for complex tables, local for plain text |
| Security | PII Shield + JWT + SlowAPI | Aadhaar/PAN/Phone masking before LLM |
| Observability | Langfuse | LLM trace + generation spans + latency metrics |
| Resilience | pybreaker Circuit Breakers | 3 failures → 30s cooldown → auto-recovery |
| Cache | Upstash Redis | SHA-256 response cache (1hr TTL) + rate limiting |
| Deployment | Docker multi-stage on Render Free | 512MB RAM — zero GPU cost — all inference via API |
HIGH-LEVEL SYSTEM ARCHITECTURE
The complete system spans four layers — from the React frontend through the FastAPI gateway into the LangGraph orchestration engine, backed by five managed cloud services. Every component runs on free-tier infrastructure with zero GPU cost.
| Layer | Components | Deployment |
|---|---|---|
| Client | React 18 SPA, SSE streaming, Google OAuth | Served by FastAPI (same container) |
| API Gateway | FastAPI + Uvicorn, JWT auth, PII Shield, rate limiting | Render Free Tier (512MB RAM) |
| Orchestration | LangGraph 8-node StateGraph, pybreaker circuit breakers | Same container — async event loop |
| Infrastructure | Pinecone, MongoDB Atlas, Supabase, Upstash Redis, Langfuse | All managed cloud — zero self-hosted |
8-NODE PIPELINE — SYSTEM ARCHITECTURE
Every user query enters through the PII Shield and flows through the LangGraph StateGraph. The Classifier node makes a single LLM call to route the query into one of four paths — abusive, greeting, vague, or rag. The RAG path proceeds through Retriever → Generator → Hallucination Guard before results are saved to MongoDB via PostProcess and streamed back to the UI via SSE.
NODE-BY-NODE BREAKDOWN
| Node | Name | Function | Cost |
|---|---|---|---|
| 1 | Classifier | 1 LLM call classifies query (abusive/greeting/vague/rag) AND determines search scope (system_only/user_only/hybrid) — avoids unnecessary Pinecone namespace queries | 1 LLM call |
| 2 | Reject | Blocks abusive queries with a firm professional response. Regex-based — zero LLM cost. | 0 LLM calls |
| 3 | Greet | Handles greetings WITHOUT hitting VectorDB — saves Pinecone + Jina credits. 1 lightweight LLM call. | 1 LLM call |
| 4 | CrossQuestioner | If query is too vague, asks ONE clarifying question (max 2 rounds) before triggering retrieval. Prevents wasting retrieval credits on ambiguous queries. | 1 LLM call |
| 5 | Retriever | Dual Pinecone search: Core Brain (top_k=20, is_temporary=False) + Temp Uploads (top_k=5, is_temporary=True, uploaded_by=user). Deduplicates by parent_id — child vectors searched, parent text fed to LLM. | Jina embed + 2 Pinecone calls |
| 6 | Generator | Confidence < 40% → Fallback immediately (no LLM call wasted). PII Shield masks before LLM. Strict context-only system prompt: mandatory citations, Pro Tips, Follow-ups. Language mirroring: English/Hinglish/Hindi. | 1 LLM call (conditional) |
| 7 | Hallucination Guard | Separate LLM-as-judge call verifies the generated answer is grounded in retrieved context. If hallucinated → Fallback node. If grounded → PostProcess. | 1 LLM call |
| 8 | PostProcess | Saves user message + AI response to MongoDB Atlas. Logs query_type, confidence score, latency via Langfuse distributed tracing. Returns response to frontend via SSE stream. | DB write + Langfuse log |
| — | Fallback | Activated by: (a) no chunks in Pinecone, (b) confidence < 40%, (c) hallucination detected. Returns a clear 'I don't have this information' — never fabricates answers. | 0 LLM calls |
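The routing above can be sketched in plain Python. This is illustrative only — the real system wires these nodes through LangGraph's StateGraph with conditional edges, and the `classify` stub below stands in for the single LLM classification call (its heuristics are assumptions, not the production prompt):

```python
# Illustrative sketch of the 8-node routing — not the actual LangGraph graph.

def classify(query: str) -> str:
    """Stub for the Classifier node's single LLM call (assumption)."""
    q = query.lower()
    if any(w in q for w in ("idiot", "stupid")):
        return "abusive"
    if q.rstrip("!? ") in ("hi", "hello", "namaste"):
        return "greeting"
    if len(q.split()) < 3:
        return "vague"
    return "rag"

def route(query: str) -> list[str]:
    """Return the sequence of nodes the query would traverse."""
    label = classify(query)
    if label == "abusive":
        return ["classifier", "reject"]
    if label == "greeting":
        return ["classifier", "greet"]          # no VectorDB hit
    if label == "vague":
        return ["classifier", "cross_questioner"]
    # rag path: retrieval → generation → grounding check → persistence
    return ["classifier", "retriever", "generator",
            "hallucination_guard", "postprocess"]
```

The key property mirrored here is that only the `rag` branch ever touches Pinecone — greetings and rejections short-circuit before any retrieval cost is incurred.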
DATA INGESTION PIPELINE
Every PDF is fingerprinted with SHA-256 and compared against the Supabase registry. Only new or changed documents are parsed and embedded — unchanged files are skipped, saving Jina API credits. Parsing strategy is chosen based on document complexity.
3,854 chunks | 3,854 live vectors | Pinecone Serverless
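The fingerprint comparison described above can be sketched as follows; the in-memory `registry` dict stands in for the Supabase fp_file_registry table, and the function names are assumptions:

```python
import hashlib

def sha256_of(data: bytes) -> str:
    """Content fingerprint used to detect new or changed files."""
    return hashlib.sha256(data).hexdigest()

def needs_reindex(filename: str, data: bytes, registry: dict[str, str]) -> bool:
    """Compare against the stored fingerprint; only new/changed files pass.

    `registry` stands in for the Supabase fp_file_registry table (assumption).
    """
    digest = sha256_of(data)
    if registry.get(filename) == digest:
        return False   # unchanged — skip parsing + embedding, save Jina credits
    registry[filename] = digest
    return True        # new or modified — parse and embed
```

Because the digest is computed over content rather than filename or mtime, a re-uploaded identical file is always skipped regardless of when or where it was copied from.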
LLAMAPARSE 3-TIER PARSING SYSTEM
| Tier | Credits/Page | Used For | Example Documents |
|---|---|---|---|
| Agentic Plus | 45 cr/page | Infographics, charts, visual tables | Budget at a Glance, summary charts |
| Agentic | 10 cr/page | Complex financial tables, math, memoranda | Finance Bill, Tax Memorandum |
| Cost Effective | 3 cr/page | Structured legal text, clean formatting | Constitution, RBI KYC Guidelines |
| PyMuPDF (free) | 0 — local | Plain prose text, temp user uploads | PF Scheme, user-uploaded PDFs |
DUAL CHUNKING STRATEGY
Markdown tables break mid-row with character-based splitting, losing column headers. The solution: dual strategy based on parsing source. MarkdownHeaderTextSplitter for LlamaParse output keeps tables intact. Parent-Child for PyMuPDF gives precise retrieval with rich context — parent text stored in child metadata, no second DB lookup needed.
| Strategy | Chunk Size | Applied When | Key Advantage |
|---|---|---|---|
| MarkdownHeaderTextSplitter | Header-based | LlamaParse output — complex tables, structured docs | Table rows intact — column headers preserved in every chunk |
| Parent-Child Recursive | Parent=2000 Child=400 chars | PyMuPDF output — plain prose, temp uploads | Child vectors searched for precision; parent text fed to LLM — no second DB lookup |
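The Parent-Child strategy can be sketched as below. This is a simplified character-window version — production splitters (e.g. recursive splitters) respect sentence and paragraph boundaries, and the ID scheme here is an assumption:

```python
def parent_child_chunks(text: str, parent_size: int = 2000,
                        child_size: int = 400) -> list[dict]:
    """Split text into 2000-char parents, then 400-char children.

    Each child carries its parent's full text in metadata, so retrieval
    can embed/search the small child but feed the LLM the rich parent —
    with no second DB lookup at query time.
    """
    chunks = []
    for p_start in range(0, len(text), parent_size):
        parent = text[p_start:p_start + parent_size]
        parent_id = f"parent-{p_start // parent_size}"
        for c_start in range(0, len(parent), child_size):
            chunks.append({
                "id": f"{parent_id}-child-{c_start // child_size}",
                "text": parent[c_start:c_start + child_size],   # this gets embedded
                "metadata": {"parent_id": parent_id, "parent_text": parent},
            })
    return chunks
```

Storing `parent_text` directly in child metadata trades some vector-store storage for the elimination of a round-trip at retrieval time — a deliberate choice given Pinecone metadata is cheap relative to added latency.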
JINA V3 MRL EMBEDDINGS — MATRYOSHKA REPRESENTATION LEARNING
Matryoshka Representation Learning (MRL) trains embeddings such that the first N dimensions of a 1024-dimensional vector are already a high-quality 256-dimensional representation — enabling truncation at inference time without retraining. Task-specific LoRA adapters enable asymmetric search: different encoders for query vs document passage.
| Property | Value |
|---|---|
| Full embedding dimension | 1024d (standard Jina v3 output) |
| Truncated dimension used | 256d — truncated at API level via MRL |
| Pinecone storage savings | 75% reduction vs 1024d |
| Retrieval accuracy retained | ~95% preserved after truncation |
| Query adapter | retrieval.query (asymmetric — for user queries) |
| Document adapter | retrieval.passage (asymmetric — for indexed chunks) |
| RAM usage | 0 bytes local — all inference via Jina AI API |
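The truncation operation itself is simple enough to sketch locally, assuming an MRL-trained vector (in production this happens at the Jina API level, so no full 1024d vector ever reaches the container):

```python
import math

def mrl_truncate(vec: list[float], dim: int = 256) -> list[float]:
    """Keep the first `dim` dimensions of an MRL-trained embedding,
    then L2-renormalize so cosine similarity stays well-behaved.

    MRL training packs the most information into the leading dimensions,
    which is why plain truncation retains most retrieval accuracy.
    """
    head = vec[:dim]
    norm = math.sqrt(sum(x * x for x in head)) or 1.0
    return [x / norm for x in head]
```

Renormalizing after truncation matters: without it, cosine scores against vectors truncated at other lengths would not be comparable.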
3-LAYER HALLUCINATION PREVENTION SYSTEM
| Layer | Mechanism | Trigger | Action |
|---|---|---|---|
| 1 | Confidence Gate | Cosine similarity of top Pinecone match < 40% | Immediate fallback — no LLM call. Prevents low-quality generation on weak context. |
| 2 | Strict System Prompt | Every generation call | Context-only answers. MISSING INFO RULE: if context doesn't contain the answer, say so explicitly. Never invent section numbers, figures, or statistics. |
| 3 | LLM-as-Judge Guard | Post-generation, before serving to user | Separate LLM call verifies answer is grounded in retrieved context. Hallucinated → fallback. Grounded → serve to user. |
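The three layers compose as below. The `generate` and `judge_grounded` callables are placeholders for the real LLM calls (the former is assumed to run under the strict context-only prompt); the fallback wording is illustrative:

```python
def answer_or_fallback(top_score: float, generate, judge_grounded,
                       threshold: float = 0.40) -> str:
    """Three-layer hallucination defence (sketch; callables are assumptions).

    Layer 1: confidence gate — skip generation entirely on weak retrieval.
    Layer 2: `generate` runs with the strict context-only prompt (assumed).
    Layer 3: LLM-as-judge grounding check before serving the user.
    """
    FALLBACK = "I don't have this information in my documents."
    if top_score < threshold:        # Layer 1: no LLM call wasted
        return FALLBACK
    answer = generate()              # Layer 2: strict system prompt
    if not judge_grounded(answer):   # Layer 3: grounding verification
        return FALLBACK
    return answer
```

Note the ordering: the cheap check runs first, so a weak retrieval never pays for either the generation call or the judge call.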
SECURITY ARCHITECTURE — 7-LAYER UPLOAD SECURITY
| Layer | Check | HTTP Status on Fail |
|---|---|---|
| 1 | File extension must be .pdf | 415 Unsupported Media Type |
| 2 | Magic byte verification — file header must begin with %PDF- | 415 Unsupported Media Type |
| 3 | Chunked streaming read (1MB/chunk) — reject if total > 10MB — OOM attack protection | 413 Payload Too Large |
| 4 | PDF bomb protection — PyMuPDF page count capped at 500 pages | 400 Bad Request |
| 5 | IP-based rate limiting — 5 uploads/hour via SlowAPI | 429 Too Many Requests |
| 6 | Per-user file quota — max 3 active temp files per session | 429 Too Many Requests |
| 7 | SHA-256 content dedup — identical file already indexed → skip (0 API tokens consumed) | 200 Skipped |
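Layers 1–3 can be sketched as a single validation function. This is a simplified stand-in — the real endpoint raises FastAPI `HTTPException`s rather than returning status tuples, and the function name is an assumption:

```python
import io

MAX_BYTES = 10 * 1024 * 1024   # 10MB cap (Layer 3)
CHUNK = 1024 * 1024            # 1MB streaming reads

def validate_pdf_upload(filename: str, stream) -> tuple[bytes, int]:
    """Layers 1-3 of the upload checks: extension, size, magic bytes.

    Returns (data, http_status). Reading in 1MB chunks means a huge file
    is rejected as soon as the running total crosses the cap, so the full
    payload is never held in memory — the OOM-attack protection.
    """
    if not filename.lower().endswith(".pdf"):
        return b"", 415                      # Layer 1: extension
    data = bytearray()
    while True:
        chunk = stream.read(CHUNK)
        if not chunk:
            break
        data.extend(chunk)
        if len(data) > MAX_BYTES:
            return b"", 413                  # Layer 3: size cap mid-stream
    if data[:5] != b"%PDF-":                 # a real PDF starts with %PDF-
        return b"", 415                      # Layer 2: magic bytes
    return bytes(data), 200
```

Layers 4–7 (page count, rate limiting, quota, dedup) would run after this function returns a 200, since they need the parsed document or the user's session state.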
Additional Security Features
- PII Shield (regex): Masks Aadhaar, PAN, Mobile, Email, IFSC, Bank Account BEFORE LLM — no personal data ever reaches OpenRouter
- pybreaker Circuit Breakers: All external API calls wrapped — 3 failures → 30s cooldown → auto-recovery. Prevents cascading failures.
- JWT Auth: HS256 signed tokens, 7-day expiry, secure cookie handling
- Google OAuth 2.0: Authlib integration, SameSite=None cross-site cookie support
- Surgical Vector Deletion: Failed embeddings never leave orphaned vectors — SHA-256 idempotent upsert ensures clean state
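The PII Shield's masking pass can be sketched as an ordered regex substitution. The patterns below are illustrative approximations (the production shield covers more formats, including IFSC and bank account numbers):

```python
import re

# Illustrative patterns — approximations, not the production rule set.
PII_PATTERNS = [
    (re.compile(r"\b[A-Z]{5}\d{4}[A-Z]\b"), "[PAN]"),          # PAN: AAAAA9999A
    (re.compile(r"\b\d{4}\s?\d{4}\s?\d{4}\b"), "[AADHAAR]"),   # 12-digit Aadhaar
    (re.compile(r"\b[6-9]\d{9}\b"), "[MOBILE]"),               # Indian mobile
    (re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"), "[EMAIL]"),   # email address
]

def mask_pii(text: str) -> str:
    """Replace PII with placeholder tokens before text reaches the LLM."""
    for pattern, token in PII_PATTERNS:
        text = pattern.sub(token, text)
    return text
```

Ordering matters: the 12-digit Aadhaar pattern runs before the 10-digit mobile pattern so a longer match is never partially consumed by a shorter one.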
DEPLOYMENT STRATEGY — RENDER FREE TIER (512MB RAM)
| Strategy | Implementation | Problem Solved |
|---|---|---|
| Docker multi-stage | Node Alpine → Python slim | Single container — frontend served by FastAPI, no separate web server |
| Zero local ML models | All inference via Jina + OpenRouter API | Eliminates 1–2GB RAM from local model loading |
| MRL 256d embeddings | Jina v3 truncated at API level | 75% less Pinecone storage — fits free tier limits |
| gc.collect() per file | Explicit GC during sync loop | Prevents memory accumulation while processing 3,854 chunks |
| UptimeRobot ping | Every 5 min to /health endpoint | Prevents Render free tier cold starts (30s+ spin-up avoided) |
| Supabase keep-alive | /health pings fp_file_registry | Prevents Supabase 7-day database sleep |
| Batch size = 5 | 5 chunks/Jina call + 200ms pause | Respects Jina rate limits — stable for large document syncs |
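The batched-embedding loop can be sketched as follows; `embed_fn` stands in for the Jina v3 API call (an assumption), and the test below injects a fake to avoid network access:

```python
import time

EMBED_BATCH_SIZE = 5   # chunks per Jina API call
PAUSE_SECONDS = 0.2    # 200ms between batches

def embed_in_batches(chunks: list[str], embed_fn,
                     pause: float = PAUSE_SECONDS) -> list:
    """Embed chunks 5 at a time with a pause between API calls.

    Small batches keep the 512MB container's peak memory flat, and the
    inter-batch pause respects Jina rate limits during large syncs.
    """
    vectors = []
    for i in range(0, len(chunks), EMBED_BATCH_SIZE):
        batch = chunks[i:i + EMBED_BATCH_SIZE]
        vectors.extend(embed_fn(batch))
        if i + EMBED_BATCH_SIZE < len(chunks):   # no pause after last batch
            time.sleep(pause)
    return vectors
```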
INFRASTRUCTURE METRICS — ALL 3 SYSTEMS COMBINED
| Project | Chunks | Live Vectors | Vector DB | Strategy |
|---|---|---|---|---|
| Agentic Financial Parser | 18,758 | 18,758 | Pinecone Serverless | LlamaParse + Markdown |
| Citizen Legal RAG | 12,770 | 9,594 | Qdrant Cloud | Parent-Child (PyMuPDF) |
| Citizen Safety AI | Merged | Merged | Pinecone Serverless | Core Logic Absorbed |
| GRAND TOTAL | 31,528 | 28,352 | Multi-DB | Production Scale |
RESUME BULLETS — PICK 5-6
- Engineered an 8-node Agentic RAG pipeline using LangGraph StateGraph with conditional edge routing (Classifier → Retriever → Generator → Hallucination Guard → PostProcess) for real-time Indian financial document analysis
- Implemented Jina v3 MRL embeddings — 1024→256d via API-level MRL for 75% storage reduction with ~95% retrieval accuracy preserved; task-specific LoRA adapters (retrieval.query vs retrieval.passage) for asymmetric semantic search
- Designed dual-strategy chunking: MarkdownHeaderTextSplitter for LlamaParse tables (preserving table integrity) + Parent-Child Recursive Retrieval (2000→400 chars) — precise retrieval without secondary DB lookups
- Built 3-layer hallucination prevention: (1) aggressive fallback when retrieval confidence < 40%, (2) strict context-only system prompt with MISSING INFO RULE, (3) post-generation LLM-as-judge grounding verification
- Implemented 7-layer upload security (10MB OOM-safe streaming, %PDF- magic byte check, PDF bomb guard, SHA-256 dedup, IP rate limiting) with PII masking (Aadhaar/PAN/Mobile) before LLM inference and pybreaker circuit breakers
- Architected tiered document parsing: LlamaParse Agentic Plus (infographics), Agentic (complex tables), Cost Effective (structured text) + PyMuPDF free fallback — with SHA-256 sync engine preventing redundant re-indexing
- Developed real-time SSE word-by-word streaming with pipeline node visualization — ChatGPT-like progressive delivery with source citations and confidence scores
- Deployed entire production stack on free-tier (Render 512MB, Pinecone Serverless, MongoDB Atlas, Supabase, OpenRouter, Jina, Langfuse) — zero GPU cost, all inference API-based
Tech: LangGraph · Jina v3 MRL · Pinecone Serverless · OpenRouter (Qwen 72B) · LlamaParse · FastAPI · React · MongoDB · Supabase · Langfuse · pybreaker · Upstash Redis
PROJECT STRUCTURE
agentic-rag-financial-parser/
├── app/ # FastAPI Backend
│ ├── main.py # App entry point + lifespan manager
│ ├── api/ # API routes
│ │ ├── auth.py # Google OAuth + JWT (HS256, 7-day expiry)
│ │ ├── oauth.py # Authlib config
│ │ └── upload.py # 7-layer secure upload endpoint
│ ├── core/ # Core utilities
│ │ ├── config.py # Pydantic Settings (env validation)
│ │ ├── constants.py # Chunking hyperparameters + LlamaParse tiers
│ │ └── pii_shield.py # Regex PII masking (Aadhaar/PAN/Phone/Email)
│ ├── db/ # Database clients
│ │ ├── mongodb.py # Async Motor (chat_history, chunks, temp_uploads)
│ │ ├── pinecone_client.py # Vector DB (256d, cosine, AWS us-east-1)
│ │ └── supabase_client.py # PostgreSQL fp_file_registry
│ └── rag/ # RAG pipeline
│ ├── graph.py # 8-node LangGraph StateGraph (CORE)
│ ├── routes.py # Chat + Stream + History + Admin endpoints
│ ├── chunker.py # Dual chunking (MarkdownHeader + Parent-Child)
│ ├── embedder.py # Jina v3 MRL (1024->256d)
│ ├── parser.py # LlamaParse (3-tier) + PyMuPDF
│ └── sync.py # SHA-256 incremental sync engine
│
├── frontend/ # React 18 + Vite 6
│ └── src/
│ ├── App.jsx # Routes + React.lazy loading
│ ├── main.jsx # React root
│ ├── api/client.js # Axios instance with JWT interceptor
│ ├── context/AuthContext.jsx # JWT state management
│ └── pages/
│ ├── Landing.jsx # Public landing page
│ ├── Dashboard.jsx # Chat UI + SSE streaming
│ ├── Admin.jsx # Admin panel (sync/upload/chunks)
│ └── AuthCallback.jsx # OAuth callback handler
│
├── data/
│ ├── raw_pdf/ # Core knowledge base PDFs
│ └── temp_uploads/ # User temp uploads (24hr TTL)
│
├── requirements.txt # Python dependencies
├── Dockerfile # Multi-stage build (Node Alpine -> Python slim)
└── DOCUMENTATION.md # Developer setup guide
API ENDPOINTS REFERENCE
Authentication
| Method | Endpoint | Description |
|---|---|---|
| GET | /auth/login | Redirect to Google OAuth |
| GET | /auth/callback | Handle OAuth callback |
| POST | /auth/logout | Logout + cleanup temp vectors |
| POST | /api/auth/dev-login | Dev-only bypass |
| GET | /api/me | Current user info + is_admin flag |
Chat
| Method | Endpoint | Description |
|---|---|---|
| POST | /api/chat | Standard RAG query |
| POST | /api/chat/stream | SSE word-by-word streaming |
| GET | /api/chat/history | Get conversation history |
| DELETE | /api/chat/history | Clear history |
Upload
| Method | Endpoint | Description |
|---|---|---|
| POST | /api/upload/temp | Upload temp PDF (7-layer security) |
Admin (ADMIN_EMAIL only)
| Method | Endpoint | Description |
|---|---|---|
| POST | /api/admin/sync | Run SHA-256 sync engine |
| DELETE | /api/admin/documents/{name} | Delete document + vectors |
| GET | /api/admin/chunks | View pending HITL chunks |
| POST | /api/admin/chunks/approve | Approve/reject/edit chunks |
| GET | /api/admin/stats | Dashboard statistics |
512MB RAM OPTIMIZATION (RENDER FREE TIER)
Running a production RAG system on 512MB RAM requires aggressive memory optimization. Every component is tuned to prevent OOM crashes.
Critical Configurations
| Parameter | Value | Why |
|---|---|---|
| Embedding batch size | 5 chunks/call | Prevents memory spike during bulk indexing |
| Chat history limit | 6 messages | Keeps LLM context window small |
| Pinecone top_k | 12-15 (core), 5 (temp) | Fewer results = less memory for post-processing |
| LLM max_tokens | 2048 | Caps response size to prevent large allocations |
| MongoDB pool | maxPoolSize=5, minPoolSize=1 | Fewer idle connections consuming memory |
| MRL dimension | 256d (not 1024d) | 75% less memory during embedding operations |
| gc.collect() | After every PDF sync | Explicit garbage collection prevents accumulation |
Recommended Uvicorn Command
uvicorn app.main:app --host 0.0.0.0 --port ${PORT:-8000} --workers 1 --limit-concurrency 10 --timeout-keep-alive 30
--limit-concurrency 10 prevents memory exhaustion from simultaneous requests. --workers 1 ensures single process (no 512MB x N multiplication).
What NOT to Add (Memory Hogs)
| Feature | Why Avoid |
|---|---|
| Celery/Background Workers | Each worker = separate process = 512MB x N |
| In-memory caching | Redis is already external — no need for local cache |
| Large embedding batches | Keep EMBED_BATCH_SIZE = 5 |
| PDF preview generation | Loads entire PDF in memory |
| WebSocket connections | SSE is already memory-efficient |
| Local ML models (spaCy, transformers) | 500MB+ RAM — use API-based inference instead |
Crash Prevention Checklist
- --limit-concurrency 10 in uvicorn CMD
- MongoDB pool size = 5 (not default 10)
- LLM max_tokens = 2048 (not 4096)
- Pinecone top_k = 12 (not 20)
- gc.collect() after PDF processing
- Rate limiting active (10 req/min via Redis)
- Circuit breaker active (3 fail -> 30s reset)
- Chunked upload streaming (1MB chunks, not full file in memory)
Render-Specific Tips
- Free Tier Cold Starts — Instance sleeps after inactivity. First request takes 30s+. UptimeRobot HEAD ping every 5 min prevents this.
- Supabase Sleep — Free Supabase databases sleep after 7 days of inactivity. /health endpoint pings fp_file_registry to keep alive.
- Log Limits — Render free tier has log retention limits. Keep logging minimal in production.
- No Background Tasks — They run in same process, consume same 512MB. Use async/await instead.