Semantic Search & RAG: How Your AI Understands Before Answering
What embeddings are, why vectors are replacing keywords, and how RAG turns a generic LLM into a business expert — with real data, zero hallucinations.
Ask ChatGPT something about your company. About your own articles. About your products, pricing or internal policies. The answer will be polite, elaborate… and quite likely made up. Language models know nothing about your business — they know about everything else.
RAG (Retrieval-Augmented Generation) solves exactly this. Instead of expecting the LLM to "remember" your content, you first search for relevant information in your own knowledge base and then pass it to the model as context. The result: precise answers, with sources, free of hallucinations.
In this article we explain the entire pipeline — from what embeddings are to how we implemented the Codex chatbot you're using right now (yes, the button at the bottom right). Feel free to try it while you read.
The problem RAG solves
An LLM has a knowledge cutoff (usually months behind) and knows nothing about your private content. Fine-tuning is expensive and slow. RAG injects real context into every question — instantly updatable, no retraining needed, at a cost per query of ~$0.0008.
What Are Embeddings?
An embedding is the numerical representation of a text's meaning. Instead of treating words as character strings, an embedding model converts them into vectors — lists of numbers with hundreds or thousands of dimensions — where texts with similar meanings end up close together in vector space.
"cat" → [0.23, -0.41, 0.87, 0.12, ..., -0.33] // 1024 dimensions
"feline" → [0.25, -0.39, 0.85, 0.14, ..., -0.31] // Very close to "cat"!
"automobile" → [-0.67, 0.52, -0.11, 0.83, ..., 0.44] // Far from both
// Cosine distance:
// cat ↔ feline: 0.97 (nearly identical semantically)
// cat ↔ automobile: 0.12 (no semantic relationship)

The magic is that this works beyond synonyms. "How do you protect my customer data?" and "CRM security and compliance" end up close in vector space even though they share almost no words. The model understands the intent, not the letters.
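Under the hood, that similarity number is just the cosine of the angle between two vectors. A minimal sketch in JavaScript, using toy 4-dimensional vectors instead of real 1,024-dimension embeddings:

```javascript
// Cosine similarity: dot(a, b) / (|a| * |b|).
// 1.0 = same direction (identical meaning), ~0 = unrelated.
function cosineSimilarity(a, b) {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Toy vectors (real bge-m3 embeddings have 1,024 dimensions):
const cat = [0.23, -0.41, 0.87, 0.12];
const feline = [0.25, -0.39, 0.85, 0.14];

console.log(cosineSimilarity(cat, feline) > 0.9); // true: near-synonyms land close
```

In production this comparison happens inside Vectorize's index, not in your Worker; the sketch only illustrates the metric.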
Dimensions
Each embedding has hundreds or thousands of dimensions. More dimensions = more semantic nuances captured. bge-m3, the model we use, generates vectors of 1,024 dimensions.
Multilingual by default
Modern models like bge-m3 are multilingual — a question in Spanish can find content in English because the meaning lives in the same vector space, regardless of language.
Instant generation
Generating the embedding of a question takes ~50ms on Workers AI. It's not like training a model — it's like converting text to its semantic "coordinate". It's an inference operation, not a training one.
Keyword Search vs Semantic Search
Traditional search (SQL `LIKE '%keyword%'` or full-text search) looks for exact word matches. It works well when you know exactly which words to use. But humans don't think in keywords; we think in questions.
| Aspect | Keyword Search | Semantic Search |
|---|---|---|
| What it searches | Word matching | Meaning similarity |
| Synonyms | Doesn't understand them | Captures them automatically |
| Natural questions | Poor — depends on word overlap | Excellent — understands intent |
| Multilingual | Requires separate indexes per language | A single vector space works for all |
| Scalability | Excellent with SQL indexes | Requires vector database (ANN) |
| Cost | Virtually zero | Low (embedding + vector query) |
Real example in Codex
A user asks "how do you handle data security?". With keyword search, the text would need to contain exactly "security" and "data". With semantic search, it also finds results about "GDPR", "compliance", "multi-tenant isolation", "encryption" and "Turnstile" — even though those words don't appear in the question.
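You can see the keyword failure mode in a few lines of code. The sketch below scores a document by the fraction of query words it literally contains (roughly what `LIKE`/full-text matching relies on); the chunk text is an invented stand-in for real Codex content:

```javascript
// Fraction of query words that literally appear in the document.
function keywordOverlap(query, doc) {
  const docWords = new Set(doc.toLowerCase().split(/\W+/));
  const queryWords = query.toLowerCase().split(/\W+/).filter(Boolean);
  const hits = queryWords.filter(w => docWords.has(w)).length;
  return hits / queryWords.length;
}

const question = 'how do you handle data security';
// Highly relevant content that happens to use different vocabulary:
const relevantChunk = 'GDPR compliance, multi-tenant isolation and encryption at rest';

console.log(keywordOverlap(question, relevantChunk)); // 0: zero shared words
```

A keyword engine would rank this chunk at zero; an embedding model places both texts close together in vector space because their meanings align.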
What Is RAG and Why Does It Work?
RAG (Retrieval-Augmented Generation) is an architecture pattern that combines two phases: first you retrieve relevant information from your knowledge base, then you pass it to the generative model as context. The model doesn't need to "know" your data — it just needs to read it at the right moment.
User question
"How does Cadences connect with external systems?" — The user writes in natural language, without thinking about keywords.
Question embedding
Workers AI converts the question into a 1,024-dimension vector using bge-m3 (~50ms, free on Cloudflare).
Vector search (Retrieval)
The vector is compared against all indexed chunks in Vectorize using cosine distance. The 5 most similar fragments are returned — even if they share no words with the question.
Enriched context
The full text of each chunk is retrieved from D1 and injected into the LLM prompt as context. The model reads your actual content — it doesn't make it up.
Generated response (Generation)
DeepSeek V3.2 generates a response based exclusively on the retrieved context. If the information isn't in the chunks, it says "I don't have information about that" — eliminating hallucinations.
RAG vs Fine-Tuning vs Full Context
There are three ways to "teach" an LLM about your business. Each has its place, but for most enterprise use cases, RAG is the right choice.
| Method | Cost | Update Speed | Accuracy | Scalability |
|---|---|---|---|---|
| Full context | High (expensive tokens) | Instant | Excellent (if it fits) | Limited to context window |
| Fine-tuning | Very high ($100s–$1000s) | Days/weeks | Variable | Requires retraining |
| RAG | ~$0.0008/query | Minutes | High (verifiable sources) | 100K+ documents |
When NOT to use RAG?
- ⚠ When your entire knowledge base fits in the context window (~200K tokens = ~150K words). In that case, pass everything directly.
- ⚠ When you need the model to change its style or personality, not its knowledge. That's fine-tuning.
- ⚠ When questions are purely conversational ("hi, how are you?") without needing specific data.
Vectorize: The Vector Database at the Edge
A vector database stores embeddings and lets you search for the most similar ones to a query vector. The standard algorithm is ANN (Approximate Nearest Neighbors) — it doesn't compare against every vector one by one (that would be way too slow with millions), but uses index structures to find the closest ones in milliseconds.
Cloudflare Vectorize is Cloudflare's native solution. It lives at the edge (alongside Workers and D1), has a generous free tier, and integrates directly with Workers AI for generating embeddings — without leaving the ecosystem, without inter-service network latency.
Dual storage: Vectorize + D1
Vectors (embeddings) live in Vectorize for fast search. Original texts live in D1 (relational SQL). When Vectorize returns IDs of similar chunks, D1 provides the full text. Each service does what it does best.
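A hypothetical sketch of how the two stores stay in sync: one function produces both records from a chunk, keyed by the same ID. The field names beyond Vectorize's `id`/`values`/`namespace`/`metadata` are illustrative, not the actual Codex schema:

```javascript
// Build the paired records for one chunk: the vector goes to Vectorize,
// the full text goes to D1, and a shared ID ties them together.
function buildChunkRecords(chunk) {
  const id = `${chunk.slug}#${chunk.index}`;
  const vectorRecord = {
    id,                              // same ID in both stores
    values: chunk.embedding,         // 1,024 floats from bge-m3
    namespace: chunk.lang,           // 'es' or 'en'
    metadata: { title: chunk.title, url: chunk.url }
  };
  const d1Row = { id, chunk_text: chunk.text, title: chunk.title, url: chunk.url };
  return { vectorRecord, d1Row };
}

const { vectorRecord, d1Row } = buildChunkRecords({
  slug: 'semantic-search-rag', index: 0, lang: 'en',
  title: 'Semantic Search & RAG', url: '/codex/semantic-search-rag',
  text: 'RAG combines retrieval with generation...',
  embedding: new Array(1024).fill(0.01) // placeholder vector
});
console.log(vectorRecord.id === d1Row.id); // true: the lookup key is shared
```

When Vectorize returns its top-K matches, those IDs go straight into the D1 `WHERE id IN (...)` query; neither store duplicates the other's data.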
Cosine distance
The metric that compares two vectors. Strictly, the score Vectorize returns is a similarity: 1.0 = identical, 0.0 = unrelated. In practice, a result with a score > 0.55 is relevant, > 0.70 is highly relevant, and > 0.85 is practically an exact match.
Namespaces for filtering
Vectorize supports namespaces — logical partitions within the same index. We use the language (es, en) as namespace so that Spanish searches only compare against Spanish content.
[[vectorize]]
binding = "VECTORIZE"
index_name = "codex-knowledge" # 1024 dimensions, cosine
# Free tier: 5M dimensions stored ≈ 4,880 vectors of 1,024 dims
# At ~4 chunks per article, that covers ~1,200 articles

Chunking: The Art of Splitting Text
You can't generate an embedding of a full 4,000-word article — the meaning gets diluted. You need to split it into fragments (chunks) small enough to be semantically precise, but large enough to retain context.
📏 Chunk size: ~400 words
The sweet spot for models like bge-m3. Too short (50 words) loses context. Too long (1000+ words) dilutes meaning. 400 words captures ~2-3 coherent paragraphs.
🔄 Overlap: 50 words
Each chunk overlaps 50 words with the previous one. This prevents ideas crossing chunk boundaries from being lost. If the key point sits at the border, both chunks capture it.
🧹 Pre-processing cleanup
Before chunking, all HTML, CSS, SVG, Tailwind classes, scripts, and metadata are stripped. Only the pure text a human would read remains. This dramatically improves embedding quality.
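The size and overlap rules above can be sketched as a single function. This is a simplified word-based splitter, assuming the HTML has already been stripped; the real pipeline may also respect paragraph boundaries:

```javascript
// Split cleaned text into ~maxWords-word chunks, with `overlap` words
// shared between consecutive chunks so boundary ideas appear in both.
function chunkText(text, maxWords = 400, overlap = 50) {
  const words = text.split(/\s+/).filter(Boolean);
  const chunks = [];
  const step = maxWords - overlap;
  for (let start = 0; start < words.length; start += step) {
    chunks.push(words.slice(start, start + maxWords).join(' '));
    if (start + maxWords >= words.length) break; // last chunk reached
  }
  return chunks;
}

// A 1,000-word article yields 3 chunks: words 0-400, 350-750, 700-1000.
const article = Array.from({ length: 1000 }, (_, i) => `w${i}`).join(' ');
const chunks = chunkText(article);
console.log(chunks.length); // 3
```

With `maxWords = 400` and `overlap = 50`, consecutive chunks share their boundary words, so an idea that straddles the cut survives intact in at least one chunk.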
Real Codex numbers
- ✦ 24 articles (12 in Spanish + 12 in English)
- ✦ 98 chunks generated (~4 chunks per article)
- ✦ 98 vectors of 1,024 dimensions in Vectorize
- ✦ ~100K dimensions stored (2% of the 5M free tier)
- ✦ Full indexing in < 2 minutes
The Complete Codex Chatbot Stack
The chatbot you see on this website is a complete RAG system running 100% on Cloudflare. Here's each component and why we chose it:
| Component | Technology | Cost | Function |
|---|---|---|---|
| Embeddings | Workers AI bge-m3 | Free | Converts text → vector (1,024 dims) |
| Vector store | Cloudflare Vectorize | Free (current tier) | Stores and searches vectors by similarity |
| Text + metadata | Cloudflare D1 | Free (current tier) | Stores text chunks, titles, URLs |
| Primary LLM | DeepSeek V3.2 | $0.28 / 1M tokens in | Generates responses with RAG context |
| Fallback LLM | Workers AI Llama 3.1 | Free | Backup if DeepSeek unavailable |
| Frontend | Astro + vanilla JS | Static | Floating chat widget, voice, markdown |
// 1. Embed the question (~50ms)
const embedding = await env.AI.run('@cf/baai/bge-m3', {
  text: [question]
});

// 2. Vector search in Vectorize (~30ms)
const results = await env.VECTORIZE.query(embedding.data[0], {
  topK: 5,
  namespace: lang, // 'es' or 'en'
  returnMetadata: 'all'
});
// Vectorize's `filter` option matches metadata, not scores;
// apply the similarity threshold in code:
const matches = results.matches.filter(m => m.score >= 0.55);

// 3. Fetch full text from D1
const chunks = await db.prepare(
  'SELECT chunk_text, title, url FROM codex_chunks WHERE id IN (...)'
).all();

// 4. Generate response with DeepSeek
const response = await fetch('https://api.deepseek.com/chat/completions', {
  method: 'POST', // fetch defaults to GET; a JSON body requires POST
  headers: {
    'Content-Type': 'application/json',
    'Authorization': `Bearer ${env.DEEPSEEK_API_KEY}` // binding name illustrative
  },
  body: JSON.stringify({
    model: 'deepseek-chat',
    messages: [
      { role: 'system', content: systemPrompt + context },
      ...history,
      { role: 'user', content: question }
    ]
  })
});

Cost Analysis: ~$0.0008 per Query
One of RAG's biggest advantages over other architectures is cost. By using services with free tiers for embeddings and vector storage, the only variable cost is the generative LLM. Here's the real breakdown:
Question embedding
$0.0000
Workers AI bge-m3: free on the Workers AI tier. No practical limit for normal use.
Vector search
$0.0000
Vectorize: 30M queried dimensions/month included in the free tier. More than enough.
D1 read
$0.0000
D1: 5M row reads/day on the free tier. Each query reads ~5 rows.
DeepSeek V3.2 (response)
~$0.0008
~2K tokens input (context + question) × $0.28/1M + ~500 tokens output × $0.42/1M ≈ $0.0008.
Estimated monthly cost
With 1,000 queries/month (a reasonable volume for a technical blog): ~$0.80/month. The equivalent with GPT-4o would be ~$60/month. DeepSeek V3.2 offers quality comparable to GPT-4 at ~1.3% of the price.
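The arithmetic behind those numbers, as a small helper (DeepSeek prices taken from the stack table above):

```javascript
// Only the LLM is metered; embeddings, Vectorize and D1 stay in free tiers.
// DeepSeek V3.2: $0.28 / 1M input tokens, $0.42 / 1M output tokens.
function queryCostUSD(inputTokens, outputTokens) {
  return (inputTokens * 0.28 + outputTokens * 0.42) / 1_000_000;
}

const perQuery = queryCostUSD(2000, 500); // ~2K in (context + question), ~500 out
console.log(perQuery.toFixed(4));          // "0.0008"
console.log((perQuery * 1000).toFixed(2)); // "0.77" per 1,000 queries/month
```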
Voice, Markdown & Sources: The Chat Experience
A RAG system is only useful if the interface is good. The Codex chatbot includes features that go beyond a simple text field:
Voice input
Native browser Web Speech API. Speak your question and it's transcribed automatically. No additional cost, no extra dependencies.
Markdown rendering
Responses render with formatting: bold, italic, lists, inline code, code blocks and clickable links.
Verifiable sources
Every response includes links to the original articles used as context, with their similarity score. You can verify the information.
Dark mode + i18n
Adapts to system theme, supports Spanish and English, and is responsive — full-screen on mobile.
From 24 Articles to 100K Documents
The Codex pipeline currently indexes 24 articles. But the architecture is designed to scale several orders of magnitude without changing technologies:
| Scale | Vectors | Search latency | Tier required |
|---|---|---|---|
| Blog (current) | ~100 | ~30ms | Free |
| Medium wiki | ~5,000 | ~30ms | Free |
| Enterprise knowledge base | ~50,000 | ~35ms | Workers Paid ($5/mo) |
| Massive documentation | ~200,000 | ~40ms | Workers Paid ($5/mo) |
Vectorize uses HNSW (Hierarchical Navigable Small World) as its indexing algorithm — the same family of algorithms used by Pinecone or pgvector. Search latency grows logarithmically, not linearly: going from 100 to 100K vectors barely adds 10ms.
Common Mistakes and How to Avoid Them
❌ Chunks too long
If a chunk has 2,000 words about 5 different topics, its embedding will be a blurred average of all of them. Search will be imprecise. Solution: 300–500 words per chunk.
❌ Not cleaning the HTML
Generating embeddings of `<div class="flex items-center gap-3 mb-6">` wastes dimensions on irrelevant information. Solution: Strip all markup before embedding.
❌ Similarity threshold too low
Accepting results with score < 0.4 injects irrelevant context that confuses the LLM. Solution: Use minScore ≥ 0.55 and verify empirically.
❌ Ignoring the system prompt
Without clear instructions, the LLM may invent information outside the retrieved context. Solution: System prompt that explicitly says "only answer with information from the provided context".
❌ Not including useful metadata
If the LLM only sees loose text without knowing which article it comes from, it can't cite sources. Solution: Include title, URL and category of each chunk in the context.
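The last two fixes, a grounding instruction plus per-chunk metadata, can be combined in one assembly step. The prompt wording and function name below are illustrative, not the exact Codex implementation:

```javascript
// Assemble the system prompt from retrieved chunks, including title and URL
// so the model can cite sources, plus an instruction that pins answers
// to the provided context.
function buildSystemPrompt(chunks) {
  const context = chunks
    .map((c, i) => `[${i + 1}] ${c.title} (${c.url})\n${c.text}`)
    .join('\n\n');
  return (
    'Answer ONLY with information from the context below. ' +
    'If the answer is not in the context, say you do not have that information. ' +
    'Cite sources by title and URL.\n\nContext:\n' + context
  );
}

const prompt = buildSystemPrompt([
  { title: 'Semantic Search & RAG', url: '/codex/rag', text: 'RAG retrieves then generates...' }
]);
console.log(prompt.includes('/codex/rag')); // true: the model can cite the URL
```

This string becomes the `system` message in the chat completion call, ahead of the conversation history and the user's question.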
AI That Knows Your Stuff
RAG is not a trend — it's the pattern that solves the fundamental problem of LLMs: they know a lot about everything, but nothing about your business. With embeddings, semantic search and generation based on real context, you can have an assistant that answers precisely about your content, cites sources and costs less than $1/month.
The Codex chatbot is the living proof. Ask it anything about Cadences — it searches 98 chunks from 24 articles, finds the most relevant ones by semantic similarity, and generates a grounded response with DeepSeek. All at the edge, all serverless, all for ~$0.0008 per question.
Technical summary
- ✦ Embeddings: Workers AI bge-m3, 1,024 dimensions, multilingual, free
- ✦ Vector DB: Cloudflare Vectorize with cosine distance, namespaces per language
- ✦ Text store: D1 with chunks, metadata (title, URL, category), indexed by slug
- ✦ LLM: DeepSeek V3.2 ($0.28/$0.42 per 1M tokens) + fallback Workers AI Llama
- ✦ Chunking: ~400 words with 50-word overlap, full HTML stripping
- ✦ Pipeline: question → embedding → Vectorize query → D1 fetch → DeepSeek → response with sources
- ✦ Real cost: ~$0.0008/query, ~$0.80/month for 1,000 questions
- ✦ Rate limit: 20 queries/day per IP (configurable), Turnstile-ready
Cadences Engineering
Technical documentation from the engineering team