🛡️ Upload Security & Optimization Architecture
Indian Legal AI Expert — Production-Grade File Upload Pipeline
A multi-layered defense system protecting API quota, server resources, and vector database integrity.
Architecture Overview
Security Layers — Deep Dive
Layer 1: File Extension Validation
```python
if not file.filename.endswith(".pdf"):
    raise HTTPException(status_code=415, detail="Only PDF files are allowed")
```
| Property | Detail |
|---|---|
| What it blocks | .exe, .js, .html, .zip, renamed binaries |
| HTTP Status | 415 Unsupported Media Type |
| Why not sufficient alone | Attackers can rename malware.exe → malware.pdf |
[!NOTE] This is the first line of defense — fast and cheap, but easily bypassed. That's why Layer 2 exists.
Layer 2: Magic Bytes Verification
```python
file_header = await file.read(4)
if file_header != b"%PDF":
    raise HTTPException(status_code=415, detail="Invalid file: not a real PDF")
await file.seek(0)  # Reset cursor for the full read
```
| Property | Detail |
|---|---|
| What it checks | First 4 bytes of file content must be %PDF (hex: 25 50 44 46) |
| What it blocks | Renamed executables, disguised malware, polyglot files |
| HTTP Status | 415 Unsupported Media Type |
| Cost | Near-zero — reads only 4 bytes before deciding |
[!IMPORTANT] This is a content-level check, not a filename check. A file renamed from `.exe` to `.pdf` still fails here, because its first four bytes are not `%PDF`.
Layer 3: Chunked Streaming with OOM Protection
```python
MAX_UPLOAD_SIZE = 10 * 1024 * 1024  # 10MB hard limit

chunks = []
total_size = 0
while True:
    chunk = await file.read(1024 * 1024)  # Read 1MB at a time
    if not chunk:
        break
    total_size += len(chunk)
    if total_size > MAX_UPLOAD_SIZE:
        raise HTTPException(status_code=413, detail="max 10MB allowed")
    chunks.append(chunk)
```
| Property | Detail |
|---|---|
| Max file size | 10MB |
| Read strategy | 1MB chunks (never loads entire file into memory at once) |
| What it prevents | Out-of-Memory (OOM) attacks, server crash from oversized uploads |
| HTTP Status | 413 Payload Too Large |
[!TIP] Unlike `await file.read()`, which loads the entire file into RAM at once, chunked reading pulls 1MB at a time and aborts as soon as the running total crosses the limit — so memory use is bounded at ~10MB no matter how large the upload is. A 10GB malicious upload is rejected after reading just 10MB.
Layer 4: PDF Bomb Protection
```python
MAX_PAGES = 500

docs = PyMuPDFLoader(tmp_path).load()
if len(docs) > MAX_PAGES:
    raise ValueError(f"PDF too large: {len(docs)} pages (max {MAX_PAGES})")
```
| Property | Detail |
|---|---|
| Max pages | 500 |
| What it blocks | PDF bombs — small files that decompress into thousands of pages |
| Why 10MB check isn't enough | A 5MB PDF can contain 50,000+ pages via compression tricks |
[!CAUTION] A PDF bomb is a file that passes the 10MB size check but explodes in memory when parsed. Example: A 2MB PDF with 100,000 blank pages would consume gigabytes of RAM during text extraction and chunking. This layer prevents that.
Layer 5: IP-Based Rate Limiting
```python
@router.post("/upload")
@limiter.limit("5/hour")
async def upload_temp_file(request: Request, ...):
    ...
```
| Property | Detail |
|---|---|
| Limit | 5 uploads per hour per IP address |
| Library | SlowAPI (built on top of limits) |
| HTTP Status | 429 Too Many Requests |
| Key function | IP-based (get_remote_address) |
[!NOTE] This is a network-level throttle. Even if a user creates multiple accounts, they're still rate-limited by their IP address. It prevents automated abuse scripts from rapidly consuming API quota.
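The decorator above assumes a `limiter` object configured elsewhere in the app. A minimal SlowAPI wiring sketch (module layout and names are assumptions, not the project's actual code) might look like:

```python
from fastapi import FastAPI, Request
from slowapi import Limiter, _rate_limit_exceeded_handler
from slowapi.errors import RateLimitExceeded
from slowapi.util import get_remote_address

# Key every request by client IP, so per-IP limits apply across accounts
limiter = Limiter(key_func=get_remote_address)

app = FastAPI()
app.state.limiter = limiter
# Return 429 Too Many Requests when a limit is exceeded
app.add_exception_handler(RateLimitExceeded, _rate_limit_exceeded_handler)

@app.post("/upload")
@limiter.limit("5/hour")
async def upload_temp_file(request: Request):
    # SlowAPI requires the Request parameter to resolve the client IP
    return {"ok": True}
```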
Layer 6: Per-User File Quota
```python
MAX_TEMP_FILES = 3

existing_files = get_user_temp_files(user_email)
if len(existing_files) >= MAX_TEMP_FILES:
    # Allow re-upload of the same filename (hash-check/replace in Layer 7)
    existing_names = [f["file_name"] for f in existing_files]
    if file.filename not in existing_names:
        raise HTTPException(
            status_code=429,
            detail=f"Upload limit reached: max {MAX_TEMP_FILES} files allowed."
        )
```
| Property | Detail |
|---|---|
| Max files per user | 3 unique files simultaneously |
| Scope | Per authenticated user (Google OAuth email) |
| Smart re-upload | Same filename allowed (triggers Layer 7 hash check) |
| Max exposure | 3 files × 10MB = 30MB per user session |
[!IMPORTANT] This layer has intelligent re-upload detection. If a user already has 3 files but uploads one with a filename that already exists, it's allowed through — because Layer 7 will either skip it (same content) or replace it (different content). No quota growth.
Layer 7: SHA-256 Content-Aware Deduplication
```python
from qdrant_client.models import Filter, FieldCondition, MatchValue
from app.utils.helpers import calculate_sha256

file_hash = calculate_sha256(file_bytes)  # SHA-256 of full content

# Query Qdrant: does this exact file already exist for this user?
# Note: scroll() returns a (points, next_offset) tuple
points, _ = client.scroll(
    collection_name=COLLECTION_NAME,
    scroll_filter=Filter(must=[
        FieldCondition(key="source_file", match=MatchValue(value=file_name)),
        FieldCondition(key="uploaded_by", match=MatchValue(value=user_email)),
        FieldCondition(key="is_temporary", match=MatchValue(value=True)),
    ]),
    limit=1,
    with_payload=["file_hash"],
)

if points and points[0].payload.get("file_hash") == file_hash:
    return {"skipped": True, "reason": "Identical file already indexed"}
```
| Property | Detail |
|---|---|
| Algorithm | SHA-256 (256-bit cryptographic hash) |
| Comparison scope | Per user + per filename + hash match |
| What it saves | Embedding API tokens (Jina AI) — the most expensive operation |
| API tokens saved | ~4,500+ tokens per duplicate skip |
[!TIP] This optimization reduced duplicate upload API costs from ~4,500 tokens per upload to 0 tokens. Verified with Jina AI dashboard — multiple duplicate uploads showed zero token increase.
Temporary Vector Lifecycle
Attack Vector Analysis
| Attack | Protection | Result |
|---|---|---|
| Upload .exe as PDF | Layer 1 + Layer 2 | ⛔ Blocked (extension + magic bytes) |
| 100MB file upload | Layer 3 | ⛔ Rejected at 10MB (chunked read) |
| PDF bomb (2MB → 50K pages) | Layer 4 | ⛔ Blocked at 500 pages |
| Rapid-fire upload spam | Layer 5 | ⛔ Rate limited (5/hour/IP) |
| Upload 100 different files | Layer 6 | ⛔ Capped at 3 files/user |
| Same file to waste tokens | Layer 7 | ⚡ Hash skip (0 tokens) |
| New account creation spam | Layer 5 + Layer 6 | ⛔ Same IP rate limited + 3 files max per account |
Measured Performance Impact
| Metric | Before Optimization | After Optimization |
|---|---|---|
| Duplicate file upload cost | ~4,500 tokens/upload | 0 tokens |
| Jina API calls for duplicate | 1 full embed_documents | 0 API calls |
| Response time for duplicate | 3-8 seconds | <100ms |
| Qdrant writes for duplicate | Full upsert | 0 writes |
Tech Stack
| Component | Technology |
|---|---|
| Backend | FastAPI (Python) |
| Vector Database | Qdrant Cloud |
| Embedding Model | Jina AI v2 (768-dim) |
| Authentication | Google OAuth 2.0 + JWT |
| Rate Limiting | SlowAPI |
| Hash Algorithm | SHA-256 (hashlib) |
| PDF Processing | PyMuPDF |
| Chunking Strategy | Parent-Child (LangChain) |
What is Deduplication?
Deduplication (de-duplication) is the process of identifying and eliminating duplicate copies of data. Instead of processing the same content multiple times and wasting compute resources, the system detects that it has already seen this exact data before and skips the expensive re-processing.
In the context of AI/ML systems, deduplication is critical because embedding generation (converting text into vector representations) is the most expensive operation — it consumes API tokens, network bandwidth, and compute time. Every unnecessary embedding call is wasted money.
How SHA-256 Hashing Enables Deduplication
SHA-256 (Secure Hash Algorithm, 256-bit) converts any input — regardless of size — into a fixed 64-character hexadecimal string called a hash or fingerprint.
```text
Input:        "The Constitution of India..." (50 pages, 200KB)
SHA-256 Hash: "a7f3b2c91d4e8f06..." (always 64 characters)
```
Key properties:
- Deterministic — same input always produces the same hash
- Collision-resistant — even a 1-character change produces a completely different hash
- Fast — computing a hash takes microseconds, embedding takes seconds
- One-way — you cannot reverse a hash back to the original file
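These properties can be verified directly with Python's standard `hashlib` (the illustrative inputs here are stand-ins, not real document bytes):

```python
import hashlib

h1 = hashlib.sha256(b"The Constitution of India...").hexdigest()
h2 = hashlib.sha256(b"The Constitution of India...").hexdigest()
h3 = hashlib.sha256(b"the Constitution of India...").hexdigest()  # one char changed

print(len(h1))    # fixed-length digest: 64 hex characters, regardless of input size
print(h1 == h2)   # deterministic: True
print(h1 == h3)   # avalanche effect: False
```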
How Deduplication Works in This Project
Step-by-step flow in our system:
| Step | What Happens | Cost |
|---|---|---|
| 1 | User uploads report.pdf (first time) | — |
| 2 | calculate_sha256(file_bytes) → generates hash a7f3b2c9... | ~0.001ms |
| 3 | Query Qdrant: "Does this user have a file with hash a7f3b2c9...?" | ~50ms |
| 4 | No match found → Full index pipeline runs | ~4,500 tokens |
| 5 | Vectors stored in Qdrant with file_hash: "a7f3b2c9..." in payload | — |
| 6 | User uploads report.pdf again (same file) | — |
| 7 | calculate_sha256(file_bytes) → same hash a7f3b2c9... | ~0.001ms |
| 8 | Query Qdrant: "Does this user have a file with hash a7f3b2c9...?" | ~50ms |
| 9 | ✅ Match found! → Return {"skipped": true} immediately | 0 tokens |
[!IMPORTANT] The deduplication check costs essentially nothing (~50ms Qdrant query) but saves ~4,500+ Jina AI tokens per duplicate upload. Over multiple users and sessions, this translates to massive API cost savings while maintaining zero impact on user experience.
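The nine-step flow above can be condensed into a short sketch. Here a plain dict stands in for the Qdrant payload lookup so the decision logic is visible end to end (function and variable names are illustrative, not the project's actual helpers):

```python
import hashlib

def handle_upload(file_bytes, file_name, user_email, index):
    """Dedup decision: skip if this exact content is already indexed.

    `index` stands in for Qdrant: {(user_email, file_name): file_hash}.
    """
    file_hash = hashlib.sha256(file_bytes).hexdigest()  # steps 2/7: hash is ~free
    key = (user_email, file_name)
    if index.get(key) == file_hash:                     # steps 3/8: payload lookup
        return {"skipped": True}                        # step 9: 0 tokens spent
    index[key] = file_hash                              # step 5: store hash in payload
    # ... full chunk + embed + upsert pipeline runs here (step 4, ~4,500 tokens)
    return {"skipped": False}

index = {}
first = handle_upload(b"%PDF-1.7 ...", "report.pdf", "user@example.com", index)
second = handle_upload(b"%PDF-1.7 ...", "report.pdf", "user@example.com", index)
# first indexes the file; second is a hash skip
```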
The Before vs. After: Fixing the "Silent Failure" Bug
The Problem (Before): Previously, duplicate files uploaded in the same session were still consuming Jina API quota. This happened because of a silent failure in the deduplication logic.
In index_temp_file(), the code attempted to query Qdrant to check if the hash already existed:
```python
# BROKEN CODE (missing imports)
existing = client.scroll(
    scroll_filter=Filter(...)  # Filter, FieldCondition, MatchValue were NOT imported!
)
```
Because the `qdrant_client.models` classes were never imported, this code raised a `NameError`. But since it was wrapped in a broad `try...except` block, the error was silently swallowed and the code assumed the file was "new". It then re-chunked and re-embedded the entire duplicate file, wasting tokens every single time.
The Fix (Now): By simply adding the missing import statement:
```python
from qdrant_client.models import PointStruct, Filter, FieldCondition, MatchValue
```
The Qdrant query now executes successfully. It correctly finds the existing file_hash for that user, skips the embedding process entirely, and returns 0 tokens used.
Session Logout Behavior:
This deduplication optimization operates strictly within a user's session. When the user logs out, the /auth/logout endpoint calls cleanup_user_temp_vectors(user_email), which deletes all vectors in Qdrant where is_temporary=True and uploaded_by=user_email.
Therefore:
- During session: Duplicate uploads = 0 tokens (Hash Skip).
- On Logout: All session vectors are completely wiped from Qdrant.
- Next Session: Uploading the same file will be treated as "new" (because the previous vectors were deleted upon the last logout), and it will be indexed normally.
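The cleanup filter described above (`is_temporary=True` AND `uploaded_by=user_email`) can be sketched with an in-memory list standing in for the Qdrant collection, so the selection logic is testable in isolation (the real implementation issues a single Qdrant delete with an equivalent `Filter`):

```python
def cleanup_user_temp_vectors(points, user_email):
    """Keep only points that are NOT (temporary AND owned by this user)."""
    return [
        p for p in points
        if not (p.get("is_temporary") and p.get("uploaded_by") == user_email)
    ]

store = [
    {"uploaded_by": "a@x.com", "is_temporary": True,  "source_file": "f1.pdf"},
    {"uploaded_by": "a@x.com", "is_temporary": False, "source_file": "perm.pdf"},
    {"uploaded_by": "b@x.com", "is_temporary": True,  "source_file": "f2.pdf"},
]
store = cleanup_user_temp_vectors(store, "a@x.com")
# a@x.com's temporary vectors are wiped; permanent and other-user vectors survive
```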