Chapter 05 — Security Framework

5.1 Seven-Layer Secure Upload Pipeline

Every file upload passes through a sequential 7-layer security pipeline. Each layer is enforced independently, and a file must clear all seven layers before it proceeds to parsing and indexing. A failure at any layer terminates the upload with an explicit error response.

L1 — Streaming Size Limit (10MB)

Enforces a strict 10MB ceiling at the streaming boundary. Oversized files are rejected before they are fully buffered into RAM, which protects the 512MB RAM budget from memory exhaustion and denial-of-service uploads.
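A minimal sketch of a streaming size check, assuming the upload arrives as a file-like object that can be read in chunks. The names read_with_limit, UploadTooLarge, and CHUNK_SIZE are hypothetical, and treating 10MB as 10 * 1024 * 1024 bytes is an assumption; the source does not say which framework hook performs this check.

```python
import io
from typing import BinaryIO

MAX_UPLOAD_BYTES = 10 * 1024 * 1024  # 10MB ceiling (binary megabytes assumed)
CHUNK_SIZE = 64 * 1024               # assumed chunk size, not from the source


class UploadTooLarge(Exception):
    """Raised as soon as the running byte count passes the ceiling."""


def read_with_limit(stream: BinaryIO, limit: int = MAX_UPLOAD_BYTES) -> bytes:
    """Read a stream chunk by chunk, aborting once more than `limit` bytes arrive."""
    received = 0
    chunks = []
    while True:
        chunk = stream.read(CHUNK_SIZE)
        if not chunk:
            break
        received += len(chunk)
        if received > limit:
            # Stop here: the remainder of the body is never read or buffered.
            raise UploadTooLarge(f"upload exceeds {limit} bytes")
        chunks.append(chunk)
    return b"".join(chunks)


if __name__ == "__main__":
    print(len(read_with_limit(io.BytesIO(b"%PDF-1.7 small file"))))
```

Because the count is checked after every chunk, at most one chunk beyond the limit is ever held in memory, regardless of how large the sender claims the file to be.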

L2 — Zero-Byte Rejection

Immediately rejects files with 0 bytes before any processing begins. Prevents null-pointer exceptions, empty-document indexing errors, and edge cases in downstream parsing libraries.

L3 — .PDF Extension Whitelist

Strict file extension check — only files with the exact .pdf extension are accepted. Rejects .exe, .docx, .html, or any other extension regardless of claimed content type.

L4 — MIME Type & Magic Byte Verification

Deep file inspection beyond extension checking. Reads the actual file signature (magic bytes at byte offset 0) and verifies it matches the PDF specification (%PDF-). Prevents disguised malicious files that carry a .pdf extension but contain executable or other harmful content.
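The signature check can be sketched as follows. This covers only the magic-byte comparison at offset 0; the MIME type check is not shown because the source does not say how the MIME type is obtained. The function name has_pdf_signature is hypothetical.

```python
PDF_MAGIC = b"%PDF-"  # file signature named in this section


def has_pdf_signature(head: bytes) -> bool:
    """Return True only if the upload's first bytes are the PDF signature."""
    # Compare at offset 0 exactly; a signature appearing later does not count.
    return head[:len(PDF_MAGIC)] == PDF_MAGIC


if __name__ == "__main__":
    print(has_pdf_signature(b"%PDF-1.7\n..."))   # genuine PDF header
    print(has_pdf_signature(b"MZ\x90\x00"))       # Windows executable header
```

Only the first five bytes are needed, so the check can run on the first chunk of the stream before the rest of the file is read.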

L5 — Page Count Limit (500 Pages / PDF Bomb Protection)

Rejects PDFs exceeding 500 pages. Protects against ZIP-bomb-style PDF exploits (e.g., deeply nested or recursively referenced PDFs) that could exhaust memory or CPU during parsing — particularly critical in a 512MB RAM environment.

L6 — JWT Authentication

Verifies the user's JSON Web Token before any file is accepted into the pipeline. Unauthenticated or token-expired upload requests are rejected at the authentication boundary — no file data is processed.
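As an illustration of what token verification involves, here is a standard-library sketch for HS256-signed tokens. The source names neither the signing algorithm nor the JWT library, so HS256, the requirement that an exp claim be present, and the function names are all assumptions; a real deployment would normally rely on a maintained JWT library rather than hand-rolled verification.

```python
import base64
import hashlib
import hmac
import json
import time
from typing import Optional


def _b64url_decode(segment: str) -> bytes:
    # JWT segments omit base64 padding; restore it before decoding.
    return base64.urlsafe_b64decode(segment + "=" * (-len(segment) % 4))


def _b64url_encode(raw: bytes) -> str:
    return base64.urlsafe_b64encode(raw).rstrip(b"=").decode("ascii")


def sign_hs256(claims: dict, secret: bytes) -> str:
    """Build a compact HS256 token (included only to produce sample input)."""
    header = _b64url_encode(json.dumps({"alg": "HS256", "typ": "JWT"}).encode())
    payload = _b64url_encode(json.dumps(claims).encode())
    mac = hmac.new(secret, f"{header}.{payload}".encode(), hashlib.sha256)
    return f"{header}.{payload}.{_b64url_encode(mac.digest())}"


def verify_hs256(token: str, secret: bytes, now: Optional[float] = None) -> Optional[dict]:
    """Return the claims if the signature is valid and the token is unexpired, else None."""
    try:
        header_b64, payload_b64, signature_b64 = token.split(".")
        header = json.loads(_b64url_decode(header_b64))
        claims = json.loads(_b64url_decode(payload_b64))
        signature = _b64url_decode(signature_b64)
    except (ValueError, TypeError):
        return None  # malformed token
    if not isinstance(header, dict) or not isinstance(claims, dict):
        return None
    if header.get("alg") != "HS256":
        return None  # refuse any algorithm other than the expected one
    expected = hmac.new(secret, f"{header_b64}.{payload_b64}".encode(), hashlib.sha256).digest()
    if not hmac.compare_digest(signature, expected):
        return None  # signature mismatch
    current = time.time() if now is None else now
    exp = claims.get("exp")
    if not isinstance(exp, (int, float)) or exp <= current:
        return None  # missing or expired "exp" claim
    return claims
```

An upload handler would call verify_hs256 on the request's token first and return an authentication error when it yields None, so that no file bytes are read for unauthenticated requests.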

Agentic Financial Parser v2.0 — Technical DocumentationPage 9

L7 — SHA-256 Deduplication

Computes a SHA-256 cryptographic hash of each uploaded file and checks it against the file registry. If an identical file already exists, re-indexing is completely skipped — protecting LlamaParse API quota, Pinecone write operations, and compute resources.
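The deduplication step can be sketched with hashlib. The FileRegistry class below is a hypothetical in-memory stand-in; the source does not describe how the real file registry is stored.

```python
import hashlib
from typing import Dict, Tuple


def sha256_of(data: bytes) -> str:
    """Hex-encoded SHA-256 digest of the file contents."""
    return hashlib.sha256(data).hexdigest()


class FileRegistry:
    """Hypothetical in-memory registry keyed by content hash."""

    def __init__(self) -> None:
        self._by_hash: Dict[str, str] = {}

    def register(self, filename: str, data: bytes) -> Tuple[bool, str]:
        """Return (is_new, digest); is_new is False when the content was seen before."""
        digest = sha256_of(data)
        if digest in self._by_hash:
            return False, digest  # identical content: caller skips re-indexing
        self._by_hash[digest] = filename
        return True, digest


if __name__ == "__main__":
    registry = FileRegistry()
    print(registry.register("q1_report.pdf", b"%PDF-1.7 sample"))
    print(registry.register("copy_of_q1.pdf", b"%PDF-1.7 sample"))
```

Because the key is the content hash rather than the filename, a renamed copy of an already indexed file is still recognised as a duplicate.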

5.2 PII Shield — Regex-Based Masking

A lightweight PII masking layer intercepts all user query text before it reaches the LLM. The masking runs as pure regex, with no NLP model and no external API call, so the latency it adds is negligible. Sensitive Indian personal identifiers are replaced with clearly labelled placeholder tokens.

PII Category	Pattern Detected	Masked As
Aadhaar Number	12-digit unique ID pattern	[AADHAAR_MASKED]
PAN Card	10-character format (e.g., ABCDE1234F)	[PAN_MASKED]
Bank Account	Indian bank account patterns	[BANK_ACCOUNT_MASKED]
Email Address	Standard email regex	[EMAIL_MASKED]
Phone Number	Indian mobile & landline	[PHONE_MASKED]
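A minimal sketch of regex-based masking for four of the categories in the table. The exact patterns used by the system are not given in the source, so every expression below is an assumption; bank account and landline patterns are left out because their formats are not specified. The name mask_pii is hypothetical.

```python
import re

# (placeholder, pattern) pairs, applied in order. All patterns are assumptions.
PII_PATTERNS = [
    ("[EMAIL_MASKED]", re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")),
    # PAN: five letters, four digits, one letter (e.g., ABCDE1234F).
    ("[PAN_MASKED]", re.compile(r"\b[A-Z]{5}[0-9]{4}[A-Z]\b")),
    # Aadhaar: 12 digits, optionally grouped 4-4-4 with spaces or hyphens.
    ("[AADHAAR_MASKED]", re.compile(r"(?<!\d)\d{4}[ -]?\d{4}[ -]?\d{4}(?!\d)")),
    # Indian mobile: optional +91 prefix, then 10 digits starting with 6-9.
    ("[PHONE_MASKED]", re.compile(r"(?<!\d)(?:\+91[ -]?)?[6-9]\d{9}(?!\d)")),
]


def mask_pii(text: str) -> str:
    """Replace every match of each pattern with its placeholder token."""
    for placeholder, pattern in PII_PATTERNS:
        text = pattern.sub(placeholder, text)
    return text


if __name__ == "__main__":
    print(mask_pii("My PAN is ABCDE1234F and my Aadhaar is 1234 5678 9012"))
```

The order matters in this sketch: the 12-digit Aadhaar pattern runs before the 10-digit phone pattern so that a longer identifier is never partially consumed by a shorter one.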

5.3 Circuit Breakers — pybreaker

Calls to all three external APIs are wrapped with pybreaker circuit breakers. If an API fails 3 consecutive times, its circuit opens for 30 seconds. While the circuit is open, calls return a graceful degradation response immediately instead of allowing cascading failures to propagate through the LangGraph pipeline.

Protected API	Failure Threshold	Circuit Open Duration	Fallback
Jina v3 Embedding API	3 consecutive failures	30 seconds	Graceful degradation response
OpenRouter (LLM)	3 consecutive failures	30 seconds	Graceful degradation response
Tavily Web Search API	3 consecutive failures	30 seconds	Graceful degradation response
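To make the behaviour in the table concrete, here is a standard-library sketch of a breaker with the same thresholds. It is a stand-in written for illustration, not pybreaker itself; in pybreaker these thresholds would correspond to CircuitBreaker(fail_max=3, reset_timeout=30). The class and function names below are hypothetical, and the fallback value is a placeholder for whatever graceful degradation response the pipeline returns.

```python
import time


class CircuitOpenError(Exception):
    """Raised when a call is refused because the circuit is open."""


class SimpleCircuitBreaker:
    def __init__(self, fail_max=3, reset_timeout=30.0, clock=time.monotonic):
        self.fail_max = fail_max
        self.reset_timeout = reset_timeout
        self._clock = clock          # injectable so the timeout can be simulated
        self._failures = 0
        self._opened_at = None       # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self.reset_timeout:
                raise CircuitOpenError("circuit open; call skipped")
            # Timeout elapsed: allow one trial call; one more failure reopens.
            self._opened_at = None
            self._failures = self.fail_max - 1
        try:
            result = func(*args, **kwargs)
        except Exception:
            self._failures += 1
            if self._failures >= self.fail_max:
                self._opened_at = self._clock()
            raise
        self._failures = 0           # any success resets the consecutive count
        return result


def call_with_fallback(breaker, func, fallback):
    """Return func()'s result, or the fallback if the call fails or is refused."""
    try:
        return breaker.call(func)
    except Exception:
        return fallback
```

With this arrangement the fourth call after three consecutive failures never reaches the remote API: the breaker refuses it locally and the caller receives the fallback at once.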

