Chapter 05 — Security Framework
5.1 Seven-Layer Secure Upload Pipeline
Every file upload passes through a sequential 7-layer security pipeline. Each layer is independently enforced — a file must clear ALL seven layers to proceed to parsing and indexing. Any layer failure terminates the upload with an explicit error response.
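The fail-fast behaviour described above can be sketched as an ordered list of checks, where the first failing layer raises and later layers never run. This is a minimal illustration, not the project's actual code: `Upload`, `UploadRejected`, `run_pipeline`, and the two example checks are hypothetical names, and the upload is held in memory only to keep the example short.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


class UploadRejected(Exception):
    """Raised by a layer to terminate the upload with an explicit reason."""

    def __init__(self, layer: str, reason: str) -> None:
        super().__init__(f"{layer}: {reason}")
        self.layer = layer
        self.reason = reason


@dataclass
class Upload:
    filename: str
    data: bytes  # simplification: a real pipeline would stream this


# Each layer is a (name, check) pair; check raises UploadRejected on failure.
Layer = tuple[str, Callable[[Upload], None]]


def run_pipeline(upload: Upload, layers: Sequence[Layer]) -> list[str]:
    """Run the layers in order and return the names of the layers passed."""
    passed = []
    for name, check in layers:
        check(upload)  # an exception here stops the remaining layers
        passed.append(name)
    return passed


def reject_empty(upload: Upload) -> None:
    """Example layer: zero-byte rejection."""
    if len(upload.data) == 0:
        raise UploadRejected("L2", "file is empty")


def require_pdf_extension(upload: Upload) -> None:
    """Example layer: exact, case-sensitive .pdf suffix (assumed semantics)."""
    if not upload.filename.endswith(".pdf"):
        raise UploadRejected("L3", "only .pdf files are accepted")
```

A caller would catch `UploadRejected` at the request boundary and translate `layer` and `reason` into the explicit error response.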
L1 — Streaming Size Limit (10MB)
Enforces a strict 10 MB ceiling at the streaming boundary. Oversized files are rejected before they are fully buffered into RAM, which protects the 512 MB RAM budget against memory exhaustion and denial-of-service uploads.
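One way to enforce such a ceiling is to read the upload in fixed-size chunks and abort as soon as the running total exceeds the limit. The sketch below assumes a file-like byte stream; `read_with_limit`, `FileTooLarge`, the 64 KB chunk size, and the binary interpretation of "10 MB" are illustrative assumptions.

```python
from typing import BinaryIO

MAX_UPLOAD_BYTES = 10 * 1024 * 1024  # 10 MB ceiling (binary megabytes assumed)
CHUNK_SIZE = 64 * 1024               # read granularity; value is an assumption


class FileTooLarge(Exception):
    """Raised when the stream exceeds the configured ceiling."""


def read_with_limit(stream: BinaryIO, limit: int = MAX_UPLOAD_BYTES,
                    chunk_size: int = CHUNK_SIZE) -> bytes:
    """Read the stream chunk by chunk, aborting once `limit` is exceeded."""
    chunks = []
    total = 0
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break  # end of stream
        total += len(chunk)
        if total > limit:
            # Stop before buffering more than the limit allows.
            raise FileTooLarge(f"upload exceeds {limit} bytes")
        chunks.append(chunk)
    return b"".join(chunks)
```

Because the check runs per chunk, the bytes retained in `chunks` never exceed the limit, regardless of how large the incoming stream claims to be.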
L2 — Zero-Byte Rejection
Immediately rejects files with 0 bytes before any processing begins. Prevents null-pointer exceptions, empty-document indexing errors, and edge cases in downstream parsing libraries.
L3 — .PDF Extension Whitelist
Strict file extension check — only files with the exact .pdf extension are accepted. Rejects .exe, .docx, .html, or any other extension regardless of claimed content type.
L4 — MIME Type & Magic Byte Verification
Deep file inspection beyond extension checking. Reads the actual file signature (magic bytes at byte offset 0) and verifies it matches the PDF specification (%PDF-). Prevents disguised malicious files that carry a .pdf extension but contain executable or other harmful content.
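The signature check itself is small. A minimal sketch, assuming the first bytes of the upload are already available; `has_pdf_signature` is a hypothetical helper name.

```python
PDF_MAGIC = b"%PDF-"  # signature expected at byte offset 0


def has_pdf_signature(head: bytes) -> bool:
    """Return True if the leading bytes match the PDF file signature."""
    return head[:len(PDF_MAGIC)] == PDF_MAGIC
```

For example, `has_pdf_signature(b"MZ\x90\x00")` is false even if the file was uploaded as `report.pdf`, because those bytes begin a Windows executable rather than a PDF.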
L5 — Page Count Limit (500 Pages / PDF Bomb Protection)
Rejects PDFs exceeding 500 pages. Protects against ZIP-bomb-style PDF exploits (e.g., deeply nested or recursively referenced PDFs) that could exhaust memory or CPU during parsing — particularly critical in a 512MB RAM environment.
L6 — JWT Authentication
Verifies the user's JSON Web Token before any file is accepted into the pipeline. Requests with a missing, invalid, or expired token are rejected at the authentication boundary, and no file data is processed.
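To illustrate what token verification involves, the sketch below checks an HS256-signed token using only the standard library. The signing algorithm, the mandatory `exp` claim, and the helper names are assumptions; the document does not state how tokens are issued, and a production system would normally rely on a vetted JWT library instead.

```python
import base64
import hashlib
import hmac
import json
import time
from typing import Optional


def _b64url_decode(segment: str) -> bytes:
    return base64.urlsafe_b64decode(segment + "=" * (-len(segment) % 4))


def _b64url_encode(raw: bytes) -> str:
    return base64.urlsafe_b64encode(raw).rstrip(b"=").decode("ascii")


def sign_hs256(payload: dict, secret: bytes) -> str:
    """Issue a token (included only so the example is self-contained)."""
    header = _b64url_encode(json.dumps({"alg": "HS256", "typ": "JWT"}).encode())
    body = _b64url_encode(json.dumps(payload).encode())
    sig = hmac.new(secret, f"{header}.{body}".encode("ascii"), hashlib.sha256)
    return f"{header}.{body}.{_b64url_encode(sig.digest())}"


def verify_hs256(token: str, secret: bytes, now: Optional[float] = None) -> dict:
    """Return the payload if signature and expiry are valid, else raise."""
    try:
        header_b64, body_b64, sig_b64 = token.split(".")
        header = json.loads(_b64url_decode(header_b64))
        payload = json.loads(_b64url_decode(body_b64))
        signature = _b64url_decode(sig_b64)
    except (ValueError, TypeError) as exc:
        raise PermissionError("malformed token") from exc
    if not isinstance(header, dict) or not isinstance(payload, dict):
        raise PermissionError("malformed token")
    if header.get("alg") != "HS256":
        raise PermissionError("unexpected algorithm")
    signed = f"{header_b64}.{body_b64}".encode("ascii")
    expected = hmac.new(secret, signed, hashlib.sha256).digest()
    if not hmac.compare_digest(signature, expected):  # constant-time compare
        raise PermissionError("bad signature")
    current = time.time() if now is None else now
    exp = payload.get("exp")
    if not isinstance(exp, (int, float)) or current >= exp:
        raise PermissionError("token expired")
    return payload
```

The upload handler would call `verify_hs256` first and return an authentication error on `PermissionError`, so that the request body is never read for rejected requests.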
Agentic Financial Parser v2.0 — Technical DocumentationPage 9
L7 — SHA-256 Deduplication
Computes a SHA-256 cryptographic hash of each uploaded file and checks it against the file registry. If an identical file already exists, re-indexing is completely skipped — protecting LlamaParse API quota, Pinecone write operations, and compute resources.
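The deduplication check can be sketched with `hashlib` and a lookup keyed by digest. The in-memory `FileRegistry` below is a hypothetical stand-in for the real file registry, whose storage backend is not described here.

```python
import hashlib


class FileRegistry:
    """In-memory stand-in for the file registry (hypothetical)."""

    def __init__(self) -> None:
        self._seen: set[str] = set()

    def register_if_new(self, data: bytes) -> tuple[str, bool]:
        """Return (sha256_hex, is_new); is_new is False for duplicates."""
        digest = hashlib.sha256(data).hexdigest()
        if digest in self._seen:
            return digest, False  # identical content: skip re-indexing
        self._seen.add(digest)
        return digest, True
```

Only when `is_new` is true would the file continue to parsing and indexing; a renamed copy of the same file produces the same digest and is skipped.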
5.2 PII Shield — Regex-Based Masking
A lightweight PII masking layer intercepts all user query text before it reaches the LLM. Masking uses regular expressions only, with no NLP model and no external API call, so it adds negligible latency. Sensitive Indian personal identifiers are replaced with clearly labelled placeholder tokens.
| PII Category | Pattern Detected | Masked As |
|---|---|---|
| Aadhaar Number | 12-digit unique ID pattern | [AADHAAR_MASKED] |
| PAN Card | 10-character format: 5 letters, 4 digits, 1 letter (e.g., ABCDE1234F) | [PAN_MASKED] |
| Bank Account | Indian bank account patterns | [BANK_ACCOUNT_MASKED] |
| Email Address | Standard email regex | [EMAIL_MASKED] |
| Phone Number | Indian mobile & landline | [PHONE_MASKED] |
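The table above could be implemented roughly as follows. The exact expressions used by the project are not given, so every pattern here is an assumption: the phone pattern covers mobile numbers only, the bank account pattern is a generic 9 to 18 digit run, and the order of application is chosen so that longer or more specific identifiers are masked before generic digit runs.

```python
import re

# (pattern, placeholder) pairs, applied in order. All patterns are
# illustrative assumptions, not the project's actual expressions.
PII_PATTERNS = [
    (re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
     "[EMAIL_MASKED]"),
    # PAN: 5 uppercase letters, 4 digits, 1 uppercase letter.
    (re.compile(r"\b[A-Z]{5}[0-9]{4}[A-Z]\b"), "[PAN_MASKED]"),
    # Mobile: optional +91 or 0 prefix, then 10 digits starting with 6-9.
    (re.compile(r"(?<!\d)(?:\+91[\s-]?|0)?[6-9]\d{9}(?!\d)"),
     "[PHONE_MASKED]"),
    # Aadhaar: 12 digits, optionally grouped 4-4-4, first digit 2-9.
    (re.compile(r"(?<!\d)[2-9]\d{3}[\s-]?\d{4}[\s-]?\d{4}(?!\d)"),
     "[AADHAAR_MASKED]"),
    # Bank account: any remaining run of 9 to 18 digits (assumed range).
    (re.compile(r"(?<!\d)\d{9,18}(?!\d)"), "[BANK_ACCOUNT_MASKED]"),
]


def mask_pii(text: str) -> str:
    """Replace each detected identifier with its placeholder token."""
    for pattern, placeholder in PII_PATTERNS:
        text = pattern.sub(placeholder, text)
    return text
```

Some inputs are inherently ambiguous for a pure regex approach: a bare 12-digit number could be an Aadhaar number or a bank account, and this sketch resolves the tie in favour of Aadhaar simply because that pattern runs first.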
5.3 Circuit Breakers — pybreaker
All three external API calls are wrapped with pybreaker circuit breakers. If any API fails 3 consecutive times, its circuit opens for 30 seconds. While the circuit is open, calls fail fast and a graceful degradation response is served immediately, so cascading failures do not propagate through the LangGraph pipeline.
| Protected API | Failure Threshold | Circuit Open | Fallback |
|---|---|---|---|
| Jina v3 Embedding API | 3 consecutive failures | 30 seconds | Graceful degradation response |
| OpenRouter (LLM) | 3 consecutive failures | 30 seconds | Graceful degradation response |
| Tavily Web Search API | 3 consecutive failures | 30 seconds | Graceful degradation response |
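The behaviour in the table can be illustrated with a small standard-library stand-in. This is not pybreaker itself; in pybreaker the equivalent settings are the `fail_max` and `reset_timeout` arguments of `CircuitBreaker`. The `SimpleBreaker` class, its `call` method, and the injectable clock are hypothetical and exist only to make the state transitions visible.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class SimpleBreaker:
    """Opens after `fail_max` consecutive failures for `reset_timeout` seconds."""

    def __init__(self, fail_max: int = 3, reset_timeout: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.fail_max = fail_max
        self.reset_timeout = reset_timeout
        self._clock = clock
        self._failures = 0
        self._opened_at: Optional[float] = None  # None means closed

    def call(self, func: Callable[[], T], fallback: Callable[[], T]) -> T:
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self.reset_timeout:
                return fallback()  # open: fail fast without calling the API
            # Timeout elapsed: allow one trial call through (half-open).
        try:
            result = func()
        except Exception:
            self._failures += 1
            # Open on reaching the threshold, or re-open if the trial failed.
            if self._failures >= self.fail_max or self._opened_at is not None:
                self._opened_at = self._clock()
            return fallback()
        # Success closes the circuit and resets the consecutive-failure count.
        self._failures = 0
        self._opened_at = None
        return result
```

One breaker instance per protected API keeps the failure counts independent, so an outage in web search does not open the circuit for embeddings or the LLM.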