SHA-256 Sync Engine for Qdrant

The problem: you have 6 legal PDFs in production. The government releases an updated Constitution. How do you update the vectors without duplicating everything, without downtime, and without re-indexing files that didn't change?

The Problem

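Naive re-indexing deletes every vector and re-embeds every file on each sync. That causes three problems:

Rate limits: the Jina AI free tier allows 1M tokens/month
Cost: re-embedding all 8,896 child chunks on every sync uses most of that quota
Availability: users keep hitting the API while vectors are being deleted and rebuilt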

The Solution — Content Hash Registry

Store the SHA-256 hash of every indexed file in Supabase Postgres, keyed by file name. On each sync:

For each file in /legal_docs/:
  Compute SHA-256 hash of file content
  Look up the stored hash by file name

  Case 1: File name not in registry             → NEW FILE   → index it
  Case 2: File name in registry, hash differs   → UPDATED    → delete old vectors + re-index
  Case 3: Registry entry has no file in folder  → DELETED    → delete vectors from Qdrant
  Case 4: Hash matches registry                 → UNCHANGED  → skip (zero API calls)

Only changed files get re-embedded. Unchanged files: 0 API calls.
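
The four cases above can be sketched as a pure function that compares two mappings of file name to hash. This is an illustration only: `plan_sync` and the sample file names are hypothetical, and no embedding or Qdrant call is made.

```python
from typing import Dict, List


def plan_sync(current: Dict[str, str], registry: Dict[str, str]) -> Dict[str, List[str]]:
    """Classify files by comparing on-disk hashes with stored hashes.

    Both arguments map file name -> SHA-256 hex digest.
    """
    plan: Dict[str, List[str]] = {"new": [], "updated": [], "deleted": [], "unchanged": []}

    # Cases 1, 2 and 4: walk the files that exist on disk.
    for name, file_hash in current.items():
        stored = registry.get(name)
        if stored is None:
            plan["new"].append(name)
        elif stored != file_hash:
            plan["updated"].append(name)
        else:
            plan["unchanged"].append(name)

    # Case 3: registry entries whose file is no longer on disk.
    for name in registry:
        if name not in current:
            plan["deleted"].append(name)

    return plan


if __name__ == "__main__":
    # Placeholder hashes; real values would be 64-character hex digests.
    registry = {"constitution.pdf": "aaa", "penal_code.pdf": "bbb", "old_act.pdf": "ccc"}
    current = {"constitution.pdf": "zzz", "penal_code.pdf": "bbb", "new_act.pdf": "ddd"}
    print(plan_sync(current, registry))
```

Only the names under "new" and "updated" would need embedding calls; "deleted" needs a Qdrant delete and "unchanged" needs nothing.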

Implementation

Document Registry (Supabase Postgres)

CREATE TABLE document_registry (
  id            SERIAL PRIMARY KEY,
  file_name     TEXT NOT NULL UNIQUE,
  file_hash     TEXT NOT NULL,            -- SHA-256 of file bytes
  file_size     INTEGER,
  chunk_count   INTEGER,                  -- Total child chunks indexed
  last_indexed  TIMESTAMP DEFAULT NOW(),
  status        TEXT DEFAULT 'active'     -- active | deleted
);
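
Because file_name is UNIQUE, recording a re-indexed file is naturally an upsert rather than a plain insert. The sketch below shows one way that could look. It uses an in-memory SQLite database purely as a stand-in so it runs anywhere; `record_indexed` and `load_registry` are hypothetical helper names, and against Postgres the placeholders and timestamp function would differ (for example `%s` and `NOW()`).

```python
import sqlite3

# SQLite stand-in for the document_registry table (types adapted to SQLite).
SCHEMA = """
CREATE TABLE document_registry (
  id           INTEGER PRIMARY KEY,
  file_name    TEXT NOT NULL UNIQUE,
  file_hash    TEXT NOT NULL,
  file_size    INTEGER,
  chunk_count  INTEGER,
  last_indexed TEXT DEFAULT CURRENT_TIMESTAMP,
  status       TEXT DEFAULT 'active'
)
"""

# Insert a new row, or overwrite the existing row for the same file name.
UPSERT = """
INSERT INTO document_registry (file_name, file_hash, file_size, chunk_count)
VALUES (?, ?, ?, ?)
ON CONFLICT (file_name) DO UPDATE SET
  file_hash    = excluded.file_hash,
  file_size    = excluded.file_size,
  chunk_count  = excluded.chunk_count,
  last_indexed = CURRENT_TIMESTAMP,
  status       = 'active'
"""


def record_indexed(conn: sqlite3.Connection, file_name: str, file_hash: str,
                   file_size: int, chunk_count: int) -> None:
    """Store the hash of a file that has just been indexed."""
    conn.execute(UPSERT, (file_name, file_hash, file_size, chunk_count))
    conn.commit()


def load_registry(conn: sqlite3.Connection) -> dict:
    """Return {file_name: file_hash} for rows that are still active."""
    rows = conn.execute(
        "SELECT file_name, file_hash FROM document_registry WHERE status = 'active'"
    )
    return dict(rows.fetchall())


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    conn.execute(SCHEMA)
    record_indexed(conn, "constitution.pdf", "hash_v1", 1000, 10)
    record_indexed(conn, "constitution.pdf", "hash_v2", 1200, 12)  # same file, new content
    print(load_registry(conn))
```

Filtering on status = 'active' when loading is a design assumption here: it keeps rows marked as deleted from being reported as removed files again on later syncs.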

Sync Engine (Python)

import hashlib
from pathlib import Path
from qdrant_client.models import Filter, FieldCondition, MatchValue

def compute_sha256(file_path: Path) -> str:
    sha256 = hashlib.sha256()
    with open(file_path, "rb") as f:
        for chunk in iter(lambda: f.read(8192), b""):
            sha256.update(chunk)
    return sha256.hexdigest()


async def sync_documents(docs_dir: Path):
    """Bring Qdrant and the registry in line with the PDFs in docs_dir."""
    current_files = {f.name: f for f in docs_dir.glob("*.pdf")}
    registry = await get_registry()  # Dict[filename → hash]

    for filename, filepath in current_files.items():
        current_hash = compute_sha256(filepath)
        stored_hash = registry.get(filename)

        if stored_hash is None:
            # CASE 1: New file
            print(f"NEW: {filename}")
            await index_document(filepath, current_hash)

        elif stored_hash != current_hash:
            # CASE 2: File content changed
            # Until re-indexing finishes, this file's vectors are absent from search.
            print(f"UPDATED: {filename}")
            await delete_vectors_by_filename(filename)  # Delete old
            await index_document(filepath, current_hash)  # Re-index

        # CASE 4: Hash matches the registry, so skip the file (no embedding calls)

    for filename in registry:
        if filename not in current_files:
            # CASE 3: File removed
            print(f"DELETED: {filename}")
            await delete_vectors_by_filename(filename)
            # delete_vectors_by_filename already calls remove_from_registry, so this
            # status update only has an effect if that helper keeps the row.
            await update_registry_status(filename, "deleted")
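
To see the sync loop behave end to end without Jina or Qdrant, the sketch below runs three passes over a temporary folder. It is a simplified, synchronous stand-in: `sync_once` is a hypothetical name, a plain dict replaces the Supabase registry, and indexing and deletion are reduced to recording the action that would be taken.

```python
import hashlib
import tempfile
from pathlib import Path
from typing import Dict, List, Tuple


def compute_sha256(file_path: Path) -> str:
    """Hash the file in 8 KiB reads so large PDFs are never loaded whole."""
    sha256 = hashlib.sha256()
    with open(file_path, "rb") as f:
        for chunk in iter(lambda: f.read(8192), b""):
            sha256.update(chunk)
    return sha256.hexdigest()


def sync_once(docs_dir: Path, registry: Dict[str, str]) -> List[Tuple[str, str]]:
    """Run one sync pass and return the (action, file name) pairs it would perform."""
    actions: List[Tuple[str, str]] = []
    current = {f.name: f for f in docs_dir.glob("*.pdf")}

    for name, path in sorted(current.items()):
        digest = compute_sha256(path)
        stored = registry.get(name)
        if stored is None:
            actions.append(("new", name))
        elif stored != digest:
            actions.append(("updated", name))
        else:
            continue  # unchanged: nothing to embed, nothing to record
        registry[name] = digest  # stand-in for the registry write after indexing

    for name in sorted(set(registry) - set(current)):
        actions.append(("deleted", name))
        del registry[name]

    return actions


def demo() -> Tuple[list, list, list]:
    with tempfile.TemporaryDirectory() as tmp:
        docs = Path(tmp)
        (docs / "a.pdf").write_bytes(b"version 1 of a")
        (docs / "b.pdf").write_bytes(b"version 1 of b")
        registry: Dict[str, str] = {}

        first = sync_once(docs, registry)   # both files are new
        second = sync_once(docs, registry)  # nothing changed

        (docs / "a.pdf").write_bytes(b"version 2 of a")
        (docs / "b.pdf").unlink()
        third = sync_once(docs, registry)   # one update, one deletion
    return first, second, third


if __name__ == "__main__":
    for actions in demo():
        print(actions)
```

The second pass returning no actions is the property the engine relies on: running the sync repeatedly against unchanged files costs only the hashing.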

Deletion by Payload Filter

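async def delete_vectors_by_filename(filename: str):
    """Delete every point whose payload field source_file equals filename."""
    # Passing a Filter as points_selector removes all matching points in one request.
    # The call is not awaited, so `qdrant` must be a synchronous QdrantClient;
    # with AsyncQdrantClient it would need `await`.
    qdrant.delete(
        collection_name="legal_docs",
        points_selector=Filter(
            must=[FieldCondition(
                key="source_file",
                match=MatchValue(value=filename)
            )]
        )
    )
    await remove_from_registry(filename)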

Deterministic Vector IDs (Idempotent)

import uuid

def make_vector_id(file_hash: str, parent_idx: int, child_idx: int) -> str:
    # Same content + same position = same ID always
    seed = f"{file_hash}_{parent_idx}_{child_idx}"
    return str(uuid.uuid5(uuid.NAMESPACE_DNS, seed))

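# Qdrant upsert with deterministic IDs:
# Re-index same file (same hash) = same IDs = overwrite in place
# No duplicates. No orphaned vectors.
# A changed file has a new hash, so its IDs change too; that is why
# Case 2 deletes the old vectors by filename before re-indexing.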
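
The idempotence claim can be checked directly. The sketch below reuses `make_vector_id` as defined above; `ids_for_file` and the hash strings are hypothetical, chosen only to show that the same hash reproduces the same IDs while a different hash produces a disjoint set.

```python
import uuid
from typing import Set


def make_vector_id(file_hash: str, parent_idx: int, child_idx: int) -> str:
    # Same content + same position = same ID always
    seed = f"{file_hash}_{parent_idx}_{child_idx}"
    return str(uuid.uuid5(uuid.NAMESPACE_DNS, seed))


def ids_for_file(file_hash: str, parents: int, children_per_parent: int) -> Set[str]:
    """Build the full ID set for a file with a fixed parent/child layout."""
    return {
        make_vector_id(file_hash, p, c)
        for p in range(parents)
        for c in range(children_per_parent)
    }


if __name__ == "__main__":
    first_run = ids_for_file("hash_v1", parents=3, children_per_parent=4)
    second_run = ids_for_file("hash_v1", parents=3, children_per_parent=4)
    after_edit = ids_for_file("hash_v2", parents=3, children_per_parent=4)

    print(first_run == second_run)           # re-indexing the same content reuses IDs
    print(first_run.isdisjoint(after_edit))  # edited content gets entirely new IDs
```

Since an edited file shares no IDs with its previous version, an upsert alone would leave the old points in place, so they have to be removed by payload filter.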

Production Numbers

Sync Scenario	API Calls	Time
All 6 files unchanged	0	~200ms
1 file updated (Constitution: 685 parents, 2,845 children)	~570 Jina calls	~3 min
New file added (100 pages)	~200 Jina calls	~45s
File deleted	0	~1s

Why Not Just Re-Index Everything Each Time?

1M Jina tokens/month free limit.
Full re-index of 8,896 child chunks:
  ~8,896 chunks × avg 80 tokens = ~711,680 tokens per run
  = 71% of monthly quota in one sync
  
With SHA-256 sync:
  Typical month: 1-2 document updates
  Token usage: ~140,000 tokens
  Monthly quota remaining: 86%
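
The quota arithmetic above can be reproduced in a few lines. The constants come from the figures in this section; the function names are hypothetical, and the 80-token average is the estimate used above rather than a measured value.

```python
MONTHLY_QUOTA_TOKENS = 1_000_000  # Jina AI free tier, per this section
TOTAL_CHILD_CHUNKS = 8_896
AVG_TOKENS_PER_CHUNK = 80         # estimate used in this section


def embedding_tokens(chunk_count: int, avg_tokens: int = AVG_TOKENS_PER_CHUNK) -> int:
    """Estimated tokens needed to embed chunk_count chunks."""
    return chunk_count * avg_tokens


def percent_of_quota(tokens: int, quota: int = MONTHLY_QUOTA_TOKENS) -> float:
    """Share of the monthly quota, in percent, rounded to one decimal."""
    return round(100 * tokens / quota, 1)


full_reindex = embedding_tokens(TOTAL_CHILD_CHUNKS)
incremental = 140_000  # typical monthly usage with hash-based sync, per this section

print(full_reindex, percent_of_quota(full_reindex))
print(percent_of_quota(incremental), 100 - percent_of_quota(incremental))
print(incremental // AVG_TOKENS_PER_CHUNK)  # chunks that 140,000 tokens can cover
```

At 80 tokens per chunk, 140,000 tokens covers about 1,750 re-embedded chunks, which is in line with one or two average-sized documents out of the six.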
