🐛 Issue Resolved: Fixing Multiple Uploads Re-Indexing Bug

Date: March 21, 2026
Context: Temporary File Uploads (Session-Based Knowledge Base)


🛑 The Core Problem (Before the Fix)

When a user uploaded a temporary PDF during a session:

  1. It was correctly chunked, embedded via Jina AI, and stored in Qdrant with is_temporary=True.
  2. THE BUG: If the user uploaded the EXACT SAME PDF multiple times in the SAME SESSION, the system was expected to detect it, skip the embedding process, and return {"skipped": true}.
  3. Instead, the application re-processed, re-chunked, and re-embedded the duplicate file every single time, wasting ~4,500+ Jina API tokens per upload.

Cause: The "Silent Failure"

In the verification logic (backend/app/rag/pipeline.py), the code searched Qdrant to see if the file's SHA-256 hash already existed.

# BROKEN CODE (what was happening before)
try:
    existing = client.scroll(
        scroll_filter=Filter(...)  # <--- ERROR HERE: 'Filter' was not imported!
    )
except Exception as e:
    # The error was silently caught! The code assumed the file was "new"
    # and proceeded to waste tokens on a full re-index.
    pass

Because Filter, FieldCondition, and MatchValue from qdrant_client.models were missing, a NameError was thrown. The broad try...except block caught this silently. The application never checked the hash and blindly re-indexed the file.
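This failure mode is easy to reproduce in isolation. The sketch below (illustrative, not the project's actual code) shows how a NameError raised inside a bare except reads as "no duplicate found":

```python
# Illustrative repro: 'Filter' is deliberately never imported, mirroring the bug.

def file_already_indexed() -> bool:
    try:
        # Raises NameError at call time because Filter is undefined
        Filter(must=[])
        return True
    except Exception:
        # The bare except swallows the NameError; the pipeline concludes
        # the file is "new" and proceeds to a full re-index.
        return False

print(file_already_indexed())  # False on every call, regardless of Qdrant's contents
```

This is why narrowing the except clause (or at least logging the exception) is as important as the missing import itself.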


✅ The Solution (After the Fix)

1. Correcting the Hash Check Logic

Adding the single missing import statement lets the Qdrant filter execute. The backend now compares the uploaded file's SHA-256 hash with the existing vectors in that session.

The Fix (pipeline.py):

# Added the missing core models
from qdrant_client.models import PointStruct, Filter, FieldCondition, MatchValue

# Qdrant successfully executes this search now
existing = client.scroll(
    scroll_filter=Filter(must=[
        FieldCondition(key="source_file", match=MatchValue(value=file_name)),
        FieldCondition(key="uploaded_by", match=MatchValue(value=user_email)),
        FieldCondition(key="is_temporary", match=MatchValue(value=True)),
    ]),
    limit=1,
    with_payload=["file_hash"],
)

Result: Uploading the same file multiple times now consumes 0 Jina API tokens and returns almost instantly (under 100 ms).
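The dedup decision itself reduces to one hash comparison. A minimal sketch of that step, assuming the hash is computed over the raw upload bytes and stored under the file_hash payload key (the helper names here are hypothetical):

```python
import hashlib

def sha256_of(data: bytes) -> str:
    """Identical bytes always yield the same digest, so re-uploads are detectable."""
    return hashlib.sha256(data).hexdigest()

def should_skip(upload_bytes: bytes, stored_payloads: list) -> bool:
    """True when a vector in this session already carries the upload's hash."""
    new_hash = sha256_of(upload_bytes)
    return any(p.get("file_hash") == new_hash for p in stored_payloads)

session_payloads = [{"file_hash": sha256_of(b"%PDF-1.7 report")}]
print(should_skip(b"%PDF-1.7 report", session_payloads))  # True  -> skip, 0 tokens
print(should_skip(b"%PDF-1.7 other", session_payloads))   # False -> full index
```

Hashing the bytes rather than trusting the filename means a renamed copy of the same PDF is still recognized as a duplicate.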


2. Session Lifecycle (Natural Behavior)

The system now enforces clean boundaries across a user's session.

  1. During an Active Session:

    • Upload File A: Jina AI is called (~4,500 tokens). File is indexed.
    • Upload File A again: Hash matches. Qdrant skips embedding. 0 tokens.
    • Upload File B: Jina AI is called (~4,000 tokens). File is indexed.
  2. On Logout (/auth/logout):

    • The logout endpoint triggers cleanup_user_temp_vectors().
    • Qdrant deletes all vectors matching is_temporary=True and this user's email.
    • The temporary brain is wiped clean.
  3. Next Session:

    • If the user logs in again and uploads "File A", it is treated as a brand new file (because its previous vectors were deleted on logout). It will be indexed normally.

🛡️ Bonus: 2 New Anti-Abuse Protections Added

Even with the hash skip working, a malicious user could still drain the API quota by uploading thousands of different files. To protect the application, two new layers were added:

Protection 1: IP-Based Rate Limiting (5/hour)

Added to /api/upload endpoint in routes.py:

from fastapi import Depends, File, Request, UploadFile

from app.core.limiter import limiter

@router.post("/upload")
@limiter.limit("5/hour")  # Blocks excessive uploads from the same IP
async def upload_temp_file(
    request: Request,
    file: UploadFile = File(...),
    user: dict = Depends(get_current_user),
):
    ...
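The "5/hour" rule is essentially a per-IP window counter. A stdlib-only sketch of the underlying idea (the real app presumably delegates this to the library behind app.core.limiter, which is not shown in the post):

```python
import time
from collections import defaultdict, deque

class IPRateLimiter:
    """Sliding window: allow at most `limit` requests per `window` seconds per key."""

    def __init__(self, limit: int = 5, window: float = 3600.0):
        self.limit = limit
        self.window = window
        self._hits = defaultdict(deque)  # ip -> timestamps of recent requests

    def allow(self, ip: str, now=None) -> bool:
        now = time.monotonic() if now is None else now
        hits = self._hits[ip]
        while hits and now - hits[0] >= self.window:  # drop entries older than window
            hits.popleft()
        if len(hits) >= self.limit:
            return False  # quota exhausted -> caller should return HTTP 429
        hits.append(now)
        return True

limiter = IPRateLimiter(limit=5, window=3600)
results = [limiter.allow("203.0.113.7", now=t) for t in range(6)]
print(results)  # [True, True, True, True, True, False]
```

Once the oldest hit ages past the window, requests from that IP are admitted again, which matches the "5 per hour" semantics of the decorator.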

Protection 2: Max Temporary Files per User (3 files max)

Added smart logic to limit active Temporary vectors:

MAX_TEMP_FILES = 3
existing_files = get_user_temp_files(user_email)

if len(existing_files) >= MAX_TEMP_FILES:
    # If the user uploads a DIFFERENT file name while already at 3 files, block it.
    # However, if they re-upload the SAME file name, let it pass so it can trigger
    # the hash-skip (return skipped) or hash-mismatch (replace/delete old vectors).
    existing_names = [f["file_name"] for f in existing_files]
    if file.filename not in existing_names:
        raise HTTPException(
            status_code=429,
            detail=f"Upload limit reached: max {MAX_TEMP_FILES} files allowed. "
                   "Delete an existing file first."
        )
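get_user_temp_files itself is not shown in the post; its essential job is to collapse chunk-level Qdrant payloads into one record per distinct file name. A hypothetical sketch of that reduction:

```python
# Hypothetical helper: each indexed chunk carries file-level payload fields,
# so counting "files" means deduplicating the chunk payloads by file_name.

def distinct_temp_files(chunk_payloads: list) -> list:
    """Collapse per-chunk payloads down to one record per file_name."""
    seen = {}
    for payload in chunk_payloads:
        seen.setdefault(payload["file_name"], {
            "file_name": payload["file_name"],
            "file_hash": payload.get("file_hash"),
        })
    return list(seen.values())

chunks = [
    {"file_name": "a.pdf", "file_hash": "h1"},
    {"file_name": "a.pdf", "file_hash": "h1"},  # second chunk of the same file
    {"file_name": "b.pdf", "file_hash": "h2"},
]
print(len(distinct_temp_files(chunks)))  # 2 distinct files
```

Counting distinct files (not chunks) is what makes the 3-file quota meaningful, since a single PDF can produce dozens of chunk vectors.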

Max Cost Exposure: Even if someone creates a fake account, they can only upload a maximum of 3 unique files (max 10MB each) per session, strictly capping Jina API token exposure.


Summary

The pipeline for handling temporary documents is now production-ready. It saves API tokens via hash deduplication, restricts abuse via rate limiting and per-user file quotas, and reliably cleans up its own data on logout without interfering with the static administrative core brain.