Chapter 05 — Security Framework

5.1 Seven-Layer Secure Upload Pipeline

Every file upload passes through a sequential 7-layer security pipeline. Each layer is independently enforced — a file must clear ALL seven layers to proceed to parsing and indexing. Any layer failure terminates the upload with an explicit error response.

L1 — Streaming Size Limit (10MB)

Enforces a strict 10MB ceiling at the streaming boundary. Oversized files are rejected before they are fully buffered into RAM, which is critical for keeping the service within its 512MB RAM budget and for blunting memory-exhaustion or denial-of-service uploads.
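A minimal sketch of how a streaming limit of this kind might be enforced. The names read_with_limit, UploadTooLarge, and CHUNK_SIZE are hypothetical, and interpreting 10MB as 10 * 1024 * 1024 bytes is an assumption; the source does not show the actual implementation.

```python
import io
from typing import BinaryIO

MAX_UPLOAD_BYTES = 10 * 1024 * 1024  # assumed reading of the 10MB ceiling
CHUNK_SIZE = 64 * 1024               # hypothetical read size per chunk


class UploadTooLarge(Exception):
    """Raised as soon as the stream exceeds the size ceiling."""


def read_with_limit(stream: BinaryIO, limit: int = MAX_UPLOAD_BYTES) -> bytes:
    """Read a stream chunk by chunk, aborting once `limit` is exceeded."""
    chunks = []
    total = 0
    while True:
        chunk = stream.read(CHUNK_SIZE)
        if not chunk:
            break
        total += len(chunk)
        if total > limit:
            # Abort here so the remainder of the upload is never read.
            raise UploadTooLarge(f"upload exceeds {limit} bytes")
        chunks.append(chunk)
    return b"".join(chunks)
```

Because the check runs per chunk, memory use stays bounded by roughly the limit plus one chunk, regardless of how large the incoming upload claims to be.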

L2 — Zero-Byte Rejection

Immediately rejects files with 0 bytes before any processing begins. Prevents null-pointer exceptions, empty-document indexing errors, and edge cases in downstream parsing libraries.

L3 — .PDF Extension Whitelist

Strict file extension check — only files with the exact .pdf extension are accepted. Rejects .exe, .docx, .html, or any other extension regardless of claimed content type.
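The zero-byte and extension checks (L2 and L3) could be sketched together as follows. The function name precheck and the error strings are hypothetical, and treating the comparison as case-sensitive is an assumption based on the phrase "exact .pdf extension".

```python
from pathlib import PurePath
from typing import Optional

ALLOWED_EXTENSION = ".pdf"


def precheck(filename: str, size_bytes: int) -> Optional[str]:
    """Return an error message, or None if the file passes both checks."""
    if size_bytes == 0:
        return "empty file rejected"
    # Only the final suffix counts, so "report.pdf.exe" is rejected.
    if PurePath(filename).suffix != ALLOWED_EXTENSION:
        return "only .pdf files are accepted"
    return None
```

Under this sketch, precheck("q3_report.pdf", 2048) passes, while an empty file or a name such as "q3_report.pdf.exe" returns an error message.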

L4 — MIME Type & Magic Byte Verification

Deep file inspection beyond extension checking. Reads the actual file signature (magic bytes at byte offset 0) and verifies that it matches the header defined by the PDF specification (%PDF-). This catches disguised files that carry a .pdf extension but contain executable or other non-PDF content.
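A minimal sketch of the magic-byte part of this check, assuming the upload is available as a seekable binary stream. The function name has_pdf_signature is hypothetical, and the MIME type half of the layer is not shown.

```python
import io
from typing import BinaryIO

PDF_MAGIC = b"%PDF-"  # signature expected at byte offset 0


def has_pdf_signature(stream: BinaryIO) -> bool:
    """Check the first bytes of the stream against the PDF signature."""
    start = stream.tell()
    header = stream.read(len(PDF_MAGIC))
    stream.seek(start)  # rewind so later layers still see the whole file
    return header == PDF_MAGIC
```

A renamed executable fails this check even though its filename ends in .pdf, because its first bytes are not %PDF-.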

L5 — Page Count Limit (500 Pages / PDF Bomb Protection)

Rejects PDFs exceeding 500 pages. Protects against ZIP-bomb-style PDF exploits (e.g., deeply nested or recursively referenced PDFs) that could exhaust memory or CPU during parsing — particularly critical in a 512MB RAM environment.

L6 — JWT Authentication

Verifies the user's JSON Web Token (JWT) before any file is accepted into the pipeline. Upload requests that are unauthenticated or that carry an expired token are rejected at the authentication boundary, and no file data is processed.
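To illustrate what token verification involves, here is a standard-library sketch that checks an HS256 signature and the exp claim. The source does not say which signing algorithm or library the system uses, so HS256, the function name verify_hs256, and the AuthError type are all assumptions; a production service would normally rely on a maintained JWT library rather than hand-rolled code like this.

```python
import base64
import hashlib
import hmac
import json
import time
from typing import Optional


class AuthError(Exception):
    """Raised when a token is malformed, forged, or expired."""


def _b64url_decode(segment: str) -> bytes:
    # JWT segments omit base64 padding; restore it before decoding.
    return base64.urlsafe_b64decode(segment + "=" * (-len(segment) % 4))


def verify_hs256(token: str, secret: bytes, now: Optional[float] = None) -> dict:
    """Return the payload if the signature and expiry are valid."""
    try:
        header_b64, payload_b64, sig_b64 = token.split(".")
        header = json.loads(_b64url_decode(header_b64))
        payload = json.loads(_b64url_decode(payload_b64))
        signature = _b64url_decode(sig_b64)
    except ValueError as exc:
        raise AuthError("malformed token") from exc
    if not isinstance(header, dict) or not isinstance(payload, dict):
        raise AuthError("malformed token")
    if header.get("alg") != "HS256":
        raise AuthError("unexpected algorithm")
    signed_part = f"{header_b64}.{payload_b64}".encode()
    expected = hmac.new(secret, signed_part, hashlib.sha256).digest()
    if not hmac.compare_digest(signature, expected):
        raise AuthError("bad signature")
    exp = payload.get("exp")
    current = time.time() if now is None else now
    if not isinstance(exp, (int, float)) or current >= exp:
        raise AuthError("token expired or missing exp")
    return payload
```

The sketch rejects the request before any file bytes are touched, which matches the boundary behaviour described above.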

Agentic Financial Parser v2.0 — Technical DocumentationPage 9

L7 — SHA-256 Deduplication

Computes a SHA-256 cryptographic hash of each uploaded file and checks it against the file registry. If an identical file already exists, re-indexing is completely skipped — protecting LlamaParse API quota, Pinecone write operations, and compute resources.
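A minimal sketch of hash-based deduplication. The FileRegistry class is a hypothetical in-memory stand-in for the file registry; the source does not describe how the real registry is stored.

```python
import hashlib


def sha256_hex(data: bytes) -> str:
    """Return the SHA-256 digest of the file contents as hex."""
    return hashlib.sha256(data).hexdigest()


class FileRegistry:
    """Hypothetical in-memory stand-in for the file registry."""

    def __init__(self) -> None:
        self._seen = {}  # digest -> filename of the first upload

    def register(self, data: bytes, filename: str) -> bool:
        """Return True if the file is new and should be indexed."""
        digest = sha256_hex(data)
        if digest in self._seen:
            return False  # identical bytes already indexed: skip
        self._seen[digest] = filename
        return True
```

Because the hash covers the file contents rather than the name, re-uploading the same PDF under a different filename is still detected as a duplicate.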

5.2 PII Shield — Regex-Based Masking

A lightweight PII masking layer intercepts all user query text before it reaches the LLM. The masking runs as pure regex, with no NLP model and no external API call, so the added latency is negligible. Sensitive Indian personal identifiers are replaced with clearly labelled placeholder tokens.

| PII Category | Pattern Detected | Masked As |
|---|---|---|
| Aadhaar Number | 12-digit unique ID pattern | [AADHAAR_MASKED] |
| PAN Card | ABCDE1234F 10-char format | [PAN_MASKED] |
| Bank Account | Indian bank account patterns | [BANK_ACCOUNT_MASKED] |
| Email Address | Standard email regex | [EMAIL_MASKED] |
| Phone Number | Indian mobile & landline | [PHONE_MASKED] |
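The masking step can be sketched with a few regular expressions. The patterns below are illustrative assumptions, not the system's actual rules: they cover Aadhaar, PAN, email, and mobile numbers only, and omit bank accounts and landlines because the source does not specify those patterns. The function name mask_pii is hypothetical.

```python
import re

# Order matters: emails and phone numbers are masked before Aadhaar so
# that digit runs inside them are not misread as 12-digit IDs.
PII_PATTERNS = [
    (re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b"), "[EMAIL_MASKED]"),
    # Mobile number: optional +91 prefix, then 10 digits starting 6-9.
    (re.compile(r"(?<!\d)(?:\+91[ -]?)?[6-9]\d{9}(?!\d)"), "[PHONE_MASKED]"),
    # Aadhaar: 12 digits, optionally grouped 4-4-4 with spaces or hyphens.
    (re.compile(r"\b\d{4}[ -]?\d{4}[ -]?\d{4}\b"), "[AADHAAR_MASKED]"),
    # PAN: five letters, four digits, one letter (ABCDE1234F).
    (re.compile(r"\b[A-Z]{5}\d{4}[A-Z]\b"), "[PAN_MASKED]"),
]


def mask_pii(text: str) -> str:
    """Replace each recognised identifier with its placeholder token."""
    for pattern, placeholder in PII_PATTERNS:
        text = pattern.sub(placeholder, text)
    return text


query = "PAN ABCDE1234F, Aadhaar 2345 6789 0123, mail a.user@example.com"
print(mask_pii(query))
# PAN [PAN_MASKED], Aadhaar [AADHAAR_MASKED], mail [EMAIL_MASKED]
```

Applying the patterns in a fixed order keeps the behaviour predictable when two categories could match the same digits.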

5.3 Circuit Breakers — pybreaker

All three external API calls are wrapped with pybreaker circuit breakers. If any API fails 3 consecutive times, the circuit opens for 30 seconds — instantly serving graceful degradation responses instead of allowing cascading failures to propagate through the LangGraph pipeline.

| Protected API | Failure Threshold | Circuit Open Duration | Fallback |
|---|---|---|---|
| Jina v3 Embedding API | 3 consecutive failures | 30 seconds | Graceful degradation response |
| OpenRouter (LLM) | 3 consecutive failures | 30 seconds | Graceful degradation response |
| Tavily Web Search API | 3 consecutive failures | 30 seconds | Graceful degradation response |
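To make the open/closed behaviour concrete, here is a standard-library sketch of a circuit breaker with the thresholds from the table. It is a simplified stand-in for illustration, not pybreaker itself; SimpleBreaker, CircuitOpenError, and call_with_fallback are hypothetical names. In pybreaker, the corresponding settings are presumably the fail_max and reset_timeout parameters of CircuitBreaker, for example CircuitBreaker(fail_max=3, reset_timeout=30).

```python
import time


class CircuitOpenError(Exception):
    """Raised when a call is refused because the circuit is open."""


class SimpleBreaker:
    def __init__(self, fail_max=3, reset_timeout=30.0, clock=time.monotonic):
        self.fail_max = fail_max
        self.reset_timeout = reset_timeout
        self._clock = clock
        self._failures = 0
        self._opened_at = None  # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self.reset_timeout:
                raise CircuitOpenError("circuit open, call skipped")
            # Timeout elapsed: allow one trial call (half-open state).
            self._opened_at = None
            self._failures = self.fail_max - 1
        try:
            result = func(*args, **kwargs)
        except Exception:
            self._failures += 1
            if self._failures >= self.fail_max:
                self._opened_at = self._clock()
            raise
        self._failures = 0  # any success resets the consecutive count
        return result


def call_with_fallback(breaker, func, fallback):
    """Return func() through the breaker, or the fallback on any failure."""
    try:
        return breaker.call(func)
    except Exception:
        return fallback
```

While the circuit is open, the wrapped API is not called at all, so the fallback is returned immediately instead of waiting on a failing service.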
