Running RAG on 512MB RAM

Render's free tier gives you 512MB. A HuggingFace embedding model alone takes 400MB. Here's every trick I used to make a full production RAG stack survive this constraint.

Problem 1 — No Persistent Disk (Render Free Tier)

Render free tier spins down after 15 minutes of inactivity. When it spins back up, any in-memory state is gone. If ChromaDB is built in memory at startup — gone.

Fix: Pickle serialization to /tmp (the only writable path on Render free).

import pickle
from pathlib import Path

CACHE_PATH = Path("/tmp/chroma_cache.pkl")

def load_or_build_vectorstore():
    if CACHE_PATH.exists():
        with open(CACHE_PATH, "rb") as f:
            return pickle.load(f)

    # Build from scratch
    docs = load_all_pdfs()
    vectorstore = Chroma.from_documents(docs, embedding_function)

    with open(CACHE_PATH, "wb") as f:
        pickle.dump(vectorstore, f)

    return vectorstore

Catch: /tmp also clears on cold start. So for persistent storage across cold starts: commit pre-built ChromaDB to git (as described in Citizen Safety AI).

Problem 2 — LangChain Calling Embedding API Unnecessarily

Even with a pre-loaded ChromaDB, LangChain's Chroma wrapper was calling the embedding API on every query initialization — not for search, but for internal validation.

Fix: Bypass the LangChain wrapper entirely. Use the native chromadb client.

# ❌ LangChain wrapper — calls embedding API unnecessarily
from langchain.vectorstores import Chroma
vectorstore = Chroma(persist_directory="./chroma_db",
                     embedding_function=embeddings)  # API call here!

# ✅ Native chromadb client — no unnecessary API calls
import chromadb
client = chromadb.PersistentClient(path="./chroma_db")
collection = client.get_collection("safety_docs")

# Search without embedding wrapper:
query_vector = embeddings.embed_query(query)  # One intentional call
results = collection.query(
    query_embeddings=[query_vector],
    n_results=5
)

Saved ~3 unnecessary embedding API calls per request on the free tier.

Problem 3 — The Embedding Model Graveyard

HuggingFace sentence-transformers:
  all-MiniLM-L6-v2   → 400MB RAM → OOM kill
  all-mpnet-base-v2   → 450MB RAM → OOM kill

Google Gemini (embedding-001):
  100 RPM, 1500 req/month quota
  First indexing run exhausted monthly quota
  Also deprecated January 14, 2026

Google Gemini (text-embedding-004):
  Also deprecated January 14, 2026

Jina AI (jina-embeddings-v2-base-en):
  API-based → 0MB local RAM
  1M tokens/month free
  Stable, no deprecations found
  ✅ This one works

Problem 4 — ChromaDB Telemetry Deadlock

ChromaDB sends an analytics ping on startup. On Render's restricted network, this DNS lookup hung for 30+ seconds — blocking every cold start.

# ❌ Default ChromaDB — hangs on startup on restricted networks
import chromadb
client = chromadb.PersistentClient(path="./chroma_db")

# ✅ Disable telemetry DNS call
from chromadb.config import Settings
client = chromadb.PersistentClient(
    path="./chroma_db",
    settings=Settings(anonymized_telemetry=False)
)

Fixed cold start from 35+ seconds to ~8 seconds.

Hard-Won Lessons

API-based embeddings are not a compromise — they're the correct choice for constrained servers. 0MB local RAM, generous free tiers, no version management.
LangChain abstractions hide API calls — when debugging unexpected rate limit errors, check if the wrapper is calling the embedding API more times than you think.
Every dependency has a RAM cost — at 512MB, you need to know the startup RAM of every import. import spacy; nlp = spacy.load("en_core_web_sm") = 200MB you didn't plan for.
Telemetry is a silent failure mode — ChromaDB, LangSmith, and others all have optional telemetry that DNS-resolves on startup. Disable all of them on constrained infra.
The constraint made the system better — every unnecessary dependency was removed. Every API call was counted. The result is a leaner, faster, more predictable system than if I'd had 16GB of RAM.

Problem 1 — No Persistent Disk (Render Free Tier)​

Problem 2 — LangChain Calling Embedding API Unnecessarily​

Problem 3 — The Embedding Model Graveyard​

Problem 4 — ChromaDB Telemetry Deadlock​

Hard-Won Lessons​

Problem 1 — No Persistent Disk (Render Free Tier)

Problem 2 — LangChain Calling Embedding API Unnecessarily

Problem 3 — The Embedding Model Graveyard

Problem 4 — ChromaDB Telemetry Deadlock

Hard-Won Lessons