Skip to main content

🛡️ Upload Security & Optimization Architecture

Indian Legal AI Expert — Production-Grade File Upload Pipeline
A multi-layered defense system protecting API quota, server resources, and vector database integrity.


Architecture Overview


Security Layers — Deep Dive

Layer 1: File Extension Validation

if not file.filename.endswith(".pdf"):
raise HTTPException(status_code=415, detail="Only PDF files are allowed")
PropertyDetail
What it blocks.exe, .js, .html, .zip, renamed binaries
HTTP Status415 Unsupported Media Type
Why not sufficient aloneAttackers can rename malware.exemalware.pdf

[!NOTE] This is terminal line defense — fast and cheap, but easily bypassed. That's why Layer 2 exists.


Layer 2: Magic Bytes Verification

file_header = await file.read(4)
if file_header != b"%PDF":
raise HTTPException(status_code=415, detail="Invalid file: not a real PDF")
await file.seek(0) # Reset for full read
PropertyDetail
What it checksFirst 4 bytes of file content must be %PDF (hex: 25 50 44 46)
What it blocksRenamed executables, disguised malware, polyglot files
HTTP Status415 Unsupported Media Type
CostNear-zero — reads only 4 bytes before deciding

[!IMPORTANT] This is a content-level check, not a filename check. A file renamed from .exe to .pdf will fail here because its binary header won't start with %PDF.


Layer 3: Chunked Streaming with OOM Protection

MAX_UPLOAD_SIZE = 10 * 1024 * 1024  # 10MB hard limit

chunks = []
total_size = 0
while True:
chunk = await file.read(1024 * 1024) # Read 1MB at a time
if not chunk:
break
total_size += len(chunk)
if total_size > MAX_UPLOAD_SIZE:
raise HTTPException(status_code=413, detail="max 10MB allowed")
chunks.append(chunk)
PropertyDetail
Max file size10MB
Read strategy1MB chunks (never loads entire file into memory at once)
What it preventsOut-of-Memory (OOM) attacks, server crash from oversized uploads
HTTP Status413 Payload Too Large

[!TIP] Unlike await file.read() which loads the entire file into RAM, chunked reading caps memory usage at ~1MB regardless of the upload size. A 10GB malicious upload will be rejected after reading just 10MB.


Layer 4: PDF Bomb Protection

MAX_PAGES = 500
docs = PyMuPDFLoader(tmp_path).load()

if len(docs) > MAX_PAGES:
raise ValueError(f"PDF too large: {len(docs)} pages (max {MAX_PAGES})")
PropertyDetail
Max pages500
What it blocksPDF bombs — small files that decompress into thousands of pages
Why 10MB check isn't enoughA 5MB PDF can contain 50,000+ pages via compression tricks

[!CAUTION] A PDF bomb is a file that passes the 10MB size check but explodes in memory when parsed. Example: A 2MB PDF with 100,000 blank pages would consume gigabytes of RAM during text extraction and chunking. This layer prevents that.


Layer 5: IP-Based Rate Limiting

@router.post("/upload")
@limiter.limit("5/hour")
async def upload_temp_file(request: Request, ...):
PropertyDetail
Limit5 uploads per hour per IP address
LibrarySlowAPI (built on top of limits)
HTTP Status429 Too Many Requests
Key functionIP-based (get_remote_address)

[!NOTE] This is a network-level throttle. Even if a user creates multiple accounts, they're still rate-limited by their IP address. It prevents automated abuse scripts from rapidly consuming API quota.


Layer 6: Per-User File Quota

MAX_TEMP_FILES = 3
existing_files = get_user_temp_files(user_email)

if len(existing_files) >= MAX_TEMP_FILES:
# Allow re-upload of same filename (hash-check/replace)
existing_names = [f["file_name"] for f in existing_files]
if file.filename not in existing_names:
raise HTTPException(
status_code=429,
detail=f"Upload limit reached: max {MAX_TEMP_FILES} files allowed."
)
PropertyDetail
Max files per user3 unique files simultaneously
ScopePer authenticated user (Google OAuth email)
Smart re-uploadSame filename allowed (triggers Layer 7 hash check)
Max exposure3 files × 10MB = 30MB per user session

[!IMPORTANT] This layer has intelligent re-upload detection. If a user already has 3 files but uploads one with a filename that already exists, it's allowed through — because Layer 7 will either skip it (same content) or replace it (different content). No quota growth.


Layer 7: SHA-256 Content-Aware Deduplication

from app.utils.helpers import calculate_sha256

file_hash = calculate_sha256(file_bytes) # SHA-256 of full content

# Query Qdrant: does this exact file already exist for this user?
existing = client.scroll(
scroll_filter=Filter(must=[
FieldCondition(key="source_file", match=MatchValue(value=file_name)),
FieldCondition(key="uploaded_by", match=MatchValue(value=user_email)),
FieldCondition(key="is_temporary", match=MatchValue(value=True)),
]),
limit=1,
with_payload=["file_hash"]
)

if points and points[0].payload.get("file_hash") == file_hash:
return {"skipped": True, "reason": "Identical file already indexed"}
PropertyDetail
AlgorithmSHA-256 (256-bit cryptographic hash)
Comparison scopePer user + per filename + hash match
What it savesEmbedding API tokens (Jina AI) — the most expensive operation
API tokens saved~4,500+ tokens per duplicate skip

[!TIP] This optimization reduced duplicate upload API costs from ~4,500 tokens per upload to 0 tokens. Verified with Jina AI dashboard — multiple duplicate uploads showed zero token increase.


Temporary Vector Lifecycle


Attack Vector Analysis

AttackProtectionResult
Upload .exe as PDFLayer 1 + Layer 2⛔ Blocked (extension + magic bytes)
100MB file uploadLayer 3⛔ Rejected at 10MB (chunked read)
PDF bomb (2MB → 50K pages)Layer 4⛔ Blocked at 500 pages
Rapid-fire upload spamLayer 5⛔ Rate limited (5/hour/IP)
Upload 100 different filesLayer 6⛔ Capped at 3 files/user
Same file to waste tokensLayer 7⚡ Hash skip (0 tokens)
New account creation spamLayer 5 + Layer 6⛔ Same IP rate limited + 3 files max per account

Measured Performance Impact

MetricBefore OptimizationAfter Optimization
Duplicate file upload cost~4,500 tokens/upload0 tokens
Jina API calls for duplicate1 full embed_documents0 API calls
Response time for duplicate3-8 seconds<100ms
Qdrant writes for duplicateFull upsert0 writes

Tech Stack

ComponentTechnology
BackendFastAPI (Python)
Vector DatabaseQdrant Cloud
Embedding ModelJina AI v2 (768-dim)
AuthenticationGoogle OAuth 2.0 + JWT
Rate LimitingSlowAPI
Hash AlgorithmSHA-256 (hashlib)
PDF ProcessingPyMuPDF
Chunking StrategyParent-Child (LangChain)

What is Deduplication?

Deduplication (de-duplication) is the process of identifying and eliminating duplicate copies of data. Instead of processing the same content multiple times and wasting compute resources, the system detects that it has already seen this exact data before and skips the expensive re-processing.

In the context of AI/ML systems, deduplication is critical because embedding generation (converting text into vector representations) is the most expensive operation — it consumes API tokens, network bandwidth, and compute time. Every unnecessary embedding call is wasted money.

How SHA-256 Hashing Enables Deduplication

SHA-256 (Secure Hash Algorithm, 256-bit) converts any input — regardless of size — into a fixed 64-character hexadecimal string called a hash or fingerprint.

Input: "The Constitution of India..."  (50 pages, 200KB)
SHA-256 Hash: "a7f3b2c91d4e8f06..." (always 64 characters)

Key properties:

  • Deterministic — same input always produces the same hash
  • Unique — even 1 character change produces a completely different hash
  • Fast — computing a hash takes microseconds, embedding takes seconds
  • One-way — you cannot reverse a hash back to the original file

How Deduplication Works in This Project

Step-by-step flow in our system:

StepWhat HappensCost
1User uploads report.pdf (first time)
2calculate_sha256(file_bytes) → generates hash a7f3b2c9...~0.001ms
3Query Qdrant: "Does this user have a file with hash a7f3b2c9...?"~50ms
4No match found → Full index pipeline runs~4,500 tokens
5Vectors stored in Qdrant with file_hash: "a7f3b2c9..." in payload
6User uploads report.pdf again (same file)
7calculate_sha256(file_bytes) → same hash a7f3b2c9...~0.001ms
8Query Qdrant: "Does this user have a file with hash a7f3b2c9...?"~50ms
9✅ Match found! → Return {"skipped": true} immediately0 tokens

[!IMPORTANT] The deduplication check costs essentially nothing (~50ms Qdrant query) but saves ~4,500+ Jina AI tokens per duplicate upload. Over multiple users and sessions, this translates to massive API cost savings while maintaining zero impact on user experience.

The Before vs. After: Fixing the "Silent Failure" Bug

The Problem (Before): Previously, duplicate files uploaded in the same session were still consuming Jina API quota. This happened because of a silent failure in the deduplication logic.

In index_temp_file(), the code attempted to query Qdrant to check if the hash already existed:

# BROKEN CODE (Missing Imports)
existing = client.scroll(
scroll_filter=Filter(...) # Filter, FieldCondition, MatchValue were NOT imported!
)

Because the qdrant_client.models were not imported, this code threw a NameError. However, because it was wrapped in a broad try...except block, the error was silently caught, and the code assumed the file was "new". It then re-chunked and re-embedded the entire duplicate file, wasting tokens every single time.

The Fix (Now): By simply adding the missing import statement:

from qdrant_client.models import PointStruct, Filter, FieldCondition, MatchValue

The Qdrant query now executes successfully. It correctly finds the existing file_hash for that user, skips the embedding process entirely, and returns 0 tokens used.

Session Logout Behavior: This deduplication optimization operates strictly within a user's session. When the user logs out, the /auth/logout endpoint calls cleanup_user_temp_vectors(user_email), which deletes all vectors in Qdrant where is_temporary=True and uploaded_by=user_email.

Therefore:

  1. During session: Duplicate uploads = 0 tokens (Hash Skip).
  2. On Logout: All session vectors are completely wiped from Qdrant.
  3. Next Session: Uploading the same file will be treated as "new" (because the previous vectors were deleted upon the last logout), and it will be indexed normally.