Embedding Models Comparison

On 512MB servers, local embedding models are not an option: loading one gets the process OOM-killed before it serves its first request.


Model Comparison Table

| Model | Dimensions | Type | Best For | RAM on a 512MB server | Cost |
|---|---|---|---|---|---|
| ⭐ Jina AI v2 base-en | 768 | API | Constrained servers (my production choice) | ~0 MB (API call) | 1M tokens free/month |
| OpenAI text-embedding-3-small | 1536 | API | General RAG, best quality/cost balance | ~0 MB (API call) | $0.02 per 1M tokens |
| OpenAI text-embedding-3-large | 3072 (MRL: down to 256) | API + MRL | Max accuracy; truncate to 512 for ~83% storage savings | ~0 MB (API call) | $0.13 per 1M tokens |
| all-MiniLM-L6-v2 | 384 | ⚠️ Local | Offline dev/testing ONLY | ~400 MB (OOM risk) | Free, but kills the server |
| all-mpnet-base-v2 | 768 | ⚠️ Local | Higher accuracy than MiniLM; local dev only | ~450 MB (OOM kill) | Free, local only |
| Google textembedding-gecko | 768 | API | GCP ecosystem, good multilingual | ~0 MB (API call) | Paid after free quota |
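
To put the paid rows in perspective, here is a rough cost estimate using the per-token prices listed in the table. The 5M-token corpus size is a made-up example, and `embedding_cost_usd` is a helper name invented for this sketch, not part of any SDK.

```python
# Prices (USD per 1M tokens) copied from the comparison table above.
PRICE_PER_MILLION_TOKENS = {
    "text-embedding-3-small": 0.02,
    "text-embedding-3-large": 0.13,
}


def embedding_cost_usd(model: str, total_tokens: int) -> float:
    """Estimate the one-off cost of embedding `total_tokens` tokens."""
    return total_tokens / 1_000_000 * PRICE_PER_MILLION_TOKENS[model]


if __name__ == "__main__":
    corpus_tokens = 5_000_000  # hypothetical corpus size
    for model in PRICE_PER_MILLION_TOKENS:
        print(f"{model}: ${embedding_cost_usd(model, corpus_tokens):.2f}")
```

Re-indexing after a chunking change costs the same amount again, so the real figure depends on how often you rebuild.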

Why I Use Jina AI in Production

import asyncio

from langchain_community.embeddings import JinaEmbeddings

embeddings = JinaEmbeddings(
    jina_api_key="your_key",
    model_name="jina-embeddings-v2-base-en",
)

# Batch embedding: critical for large files.
# The Jina free tier enforces per-request token limits,
# so send 5 chunks per call with a 200 ms pause between calls.

async def embed_in_batches(chunks: list[str], batch_size: int = 5) -> list[list[float]]:
    vectors = []
    for i in range(0, len(chunks), batch_size):
        batch = chunks[i:i + batch_size]
        # embed_documents is a blocking call; run it in a worker thread
        # so it does not stall the event loop.
        batch_vectors = await asyncio.to_thread(embeddings.embed_documents, batch)
        vectors.extend(batch_vectors)
        await asyncio.sleep(0.2)  # Rate limit protection
    return vectors

Why Jina over others on free tier:

  • 1M tokens/month free: enough for the full legal corpus plus ongoing queries
  • API-based = 0MB local RAM overhead
  • 768-dim = same as Google's gecko, good quality
  • Rate limits are manageable with exponential backoff: 3s → 6s → 12s → 24s → 48s
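
The backoff schedule in the last bullet can be sketched as a small retry wrapper. This is a minimal illustration, not the code running in production: `RateLimitError` and `with_backoff` are hypothetical names, and you would need to map whatever exception your HTTP client raises on a 429 response onto the error type used here.

```python
import time
from typing import Callable, Sequence, TypeVar

T = TypeVar("T")

# Doubling delays matching the schedule above: 3s, 6s, 12s, 24s, 48s.
BACKOFF_DELAYS = [3 * 2**attempt for attempt in range(5)]


class RateLimitError(Exception):
    """Hypothetical stand-in for the error your client raises on HTTP 429."""


def with_backoff(
    call: Callable[[], T],
    delays: Sequence[float] = BACKOFF_DELAYS,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run call(); on RateLimitError, wait for the next delay and retry."""
    for delay in delays:
        try:
            return call()
        except RateLimitError:
            sleep(delay)
    # Final attempt after the last wait; any error now propagates to the caller.
    return call()
```

Usage would look like `with_backoff(lambda: embeddings.embed_documents(batch))`, with the `sleep` argument swapped for a stub in tests so they do not actually wait.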

Matryoshka Representation Learning (MRL)

MRL is an embedding training technique that concentrates the most important information in the earliest dimensions, so the first N dimensions of a vector form a usable embedding on their own. You can truncate the vector without retraining the model.
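
If you already have full-length vectors stored, truncation can also be done client-side. The sketch below assumes the vectors come from an MRL-trained model (truncating an ordinary embedding this way degrades quality much more), and `truncate_embedding` is a helper name made up for this example. The truncated prefix is rescaled to unit length so that dot-product and cosine scores stay comparable.

```python
import math


def truncate_embedding(vector: list[float], dims: int) -> list[float]:
    """Keep the first `dims` values of an MRL embedding and re-normalize."""
    if not 0 < dims <= len(vector):
        raise ValueError("dims must be between 1 and len(vector)")
    head = vector[:dims]
    norm = math.sqrt(sum(x * x for x in head))
    if norm == 0.0:
        return head  # all-zero prefix: nothing to rescale
    return [x / norm for x in head]
```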

| Configuration | Dimensions | Storage/Vector | Accuracy Loss | Best For |
|---|---|---|---|---|
| Full | 3072 | ~12 KB/vector | 0% (baseline) | Unlimited storage |
| MRL Truncated 1024 | 1024 | ~4 KB/vector | <1% | Balanced production |
| ⭐ MRL Truncated 512 | 512 | ~2 KB/vector | <1.5% | 512MB RAM servers |
| MRL Truncated 256 | 256 | ~1 KB/vector | <2% | Extreme constraints |

How to Use MRL with OpenAI

from openai import OpenAI

client = OpenAI()

response = client.embeddings.create(
    model="text-embedding-3-large",
    input="Your text here",
    dimensions=512,  # MRL truncation: pass this parameter
)

vector = response.data[0].embedding
# Returns 512-dim vector instead of 3072
# Store in Qdrant with vector_size=512
# Cosine similarity still works correctly

Storage Savings

512 dims instead of 3072 means about 83% less vector storage in Qdrant, with <1.5% accuracy loss. On the free 1GB Qdrant tier, you can store roughly 6x more vectors before hitting the limit.
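
The storage figures can be sanity-checked with a few lines of arithmetic. The sketch assumes vectors are stored as 4-byte float32 values and ignores payload and index overhead, so real Qdrant usage will be somewhat higher; the function names are invented for this example.

```python
BYTES_PER_FLOAT32 = 4  # assumption: vectors are stored as float32


def bytes_per_vector(dims: int) -> int:
    """Raw size of one vector, excluding payload and index overhead."""
    return dims * BYTES_PER_FLOAT32


def storage_saving(full_dims: int, truncated_dims: int) -> float:
    """Fraction of raw vector storage saved by truncating."""
    return 1 - truncated_dims / full_dims


def capacity_multiplier(full_dims: int, truncated_dims: int) -> float:
    """How many truncated vectors fit in the space of one full vector."""
    return full_dims / truncated_dims


if __name__ == "__main__":
    print(bytes_per_vector(3072))                    # 12288 bytes, about 12 KB
    print(bytes_per_vector(512))                     # 2048 bytes, about 2 KB
    print(round(storage_saving(3072, 512), 3))       # 0.833
    print(capacity_multiplier(3072, 512))            # 6.0
```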


The Embedding Model Graveyard (Things I Tried)

HuggingFace Transformers → OOM kill immediately on Render 512MB
Gemini Embedding 001 → 100 RPM, 1500 req/month quota; the first full indexing run exhausted the monthly quota before the app finished starting
text-embedding-004 → Deprecated January 14, 2026
embedding-001 → Deprecated, returns 404
all-MiniLM-L6-v2 → ~400MB RAM, instant OOM on 512MB

Jina AI → ✅ Stable, generous free tier, API-based (0 local RAM). This finally worked.