Embedding Models Comparison

On 512MB servers, local embedding models are not an option. They will OOM kill your server before the first request.

Model Comparison Table

Model	Dimensions	Type	Best For	RAM on 512MB	Cost
⭐ Jina AI v2 base-en	768	API	Constrained servers — my production choice	~0 MB (API call)	1M tokens free/month
OpenAI text-embedding-3-small	1536	API	General RAG, best quality/cost balance	~0 MB (API call)	$0.02 per 1M tokens
OpenAI text-embedding-3-large	3072 (MRL: → 256)	API + MRL	Max accuracy — truncate to 512 for 80% storage savings	~0 MB (API call)	$0.13 per 1M tokens
all-MiniLM-L6-v2	384	⚠️ Local	Offline dev/testing ONLY	~400 MB — OOM RISK	Free but kills server
all-mpnet-base-v2	768	⚠️ Local	Higher accuracy than MiniLM — local dev only	~450 MB — OOM KILL	Free local only
Google textembedding-gecko	768	API	GCP ecosystem, good multilingual	~0 MB	Paid after free quota

Why I Use Jina AI in Production

from langchain_community.embeddings import JinaEmbeddings

embeddings = JinaEmbeddings(
    jina_api_key="your_key",
    model_name="jina-embeddings-v2-base-en"
)

# Batch embedding — critical for large files
# Jina free tier: per-request token limits
# Solution: batch at 5 chunks per call with 200ms pause

async def embed_in_batches(chunks: list[str], batch_size: int = 5):
    vectors = []
    for i in range(0, len(chunks), batch_size):
        batch = chunks[i:i + batch_size]
        batch_vectors = embeddings.embed_documents(batch)
        vectors.extend(batch_vectors)
        await asyncio.sleep(0.2)  # Rate limit protection
    return vectors

Why Jina over others on free tier:

1M tokens/month free — enough for full legal corpus + ongoing queries
API-based = 0MB local RAM overhead
768-dim = same as Google's gecko, good quality
Exponential backoff when rate limited: 3s → 6s → 12s → 24s → 48s

Matryoshka Representation Learning (MRL)

MRL is an embedding training technique where the full vector encodes meaning in its first N dimensions. You can truncate without retraining the model.

Configuration	Dimensions	Storage/Vector	Accuracy Loss	Best For
Full	3072	~12 KB/vector	0% baseline	Unlimited storage
MRL Truncated 1024	1024	~4 KB/vector	<1%	Balanced production
⭐ MRL Truncated 512	512	~2 KB/vector	<1.5%	512MB RAM servers
MRL Truncated 256	256	~1 KB/vector	<2%	Extreme constraints

How to Use MRL with OpenAI

from openai import OpenAI

client = OpenAI()

response = client.embeddings.create(
    model="text-embedding-3-large",
    input="Your text here",
    dimensions=512   # MRL truncation — pass this parameter
)

vector = response.data[0].embedding
# Returns 512-dim vector instead of 3072
# Store in Qdrant with vector_size=512
# Cosine similarity still works correctly

Storage Savings

512 dims instead of 3072 = 80% less Qdrant storage with <1.5% accuracy loss. On the free 1GB Qdrant tier, this means you can store 5x more vectors before hitting the limit.

The Embedding Model Graveyard (Things I Tried)

HuggingFace Transformers  → OOM kill immediately on Render 512MB
Gemini Embedding 001      → 100 RPM, 1500 req/month quota
                            First full indexing run exhausted monthly
                            quota before app finished starting
text-embedding-004        → Deprecated January 14, 2026
embedding-001             → Deprecated, returns 404
all-MiniLM-L6-v2         → ~400MB RAM, instant OOM on 512MB

Jina AI                   → ✅ Stable, generous free tier,
                            API-based (0 local RAM)
                            This finally worked.

Model Comparison Table​

Why I Use Jina AI in Production​

Matryoshka Representation Learning (MRL)​

How to Use MRL with OpenAI​

The Embedding Model Graveyard (Things I Tried)​

Model Comparison Table

Why I Use Jina AI in Production

Matryoshka Representation Learning (MRL)

How to Use MRL with OpenAI

The Embedding Model Graveyard (Things I Tried)