AI & ML · 15 min read

Semantic Search & RAG: How Your AI Understands Before Answering

What embeddings are, why vectors are replacing keywords, and how RAG turns a generic LLM into a business expert — with real data, zero hallucinations.

[Image: Neural network representing semantic search and embeddings]

Ask ChatGPT something about your company. About your own articles. About your products, pricing or internal policies. The answer will be polite, elaborate… and quite likely made up. Language models know nothing about your business — they know about everything else.

RAG (Retrieval-Augmented Generation) solves exactly this. Instead of expecting the LLM to "remember" your content, you first search for relevant information in your own knowledge base and then pass it to the model as context. The result: precise answers, with sources, free of hallucinations.

In this article we explain the entire pipeline — from what embeddings are to how we implemented the Codex chatbot you're using right now (yes, the button at the bottom right). Feel free to try it while you read.

🎯

The problem RAG solves

An LLM has a knowledge cutoff (usually months behind) and knows nothing about your private content. Fine-tuning is expensive and slow. RAG injects real context into every question — instantly updatable, no retraining needed, at a cost per query of ~$0.0008.

Fundamentals

What Are Embeddings?

An embedding is the numerical representation of a text's meaning. Instead of treating words as character strings, an embedding model converts them into vectors — lists of numbers with hundreds or thousands of dimensions — where texts with similar meanings end up close together in vector space.

// Conceptual embedding example
"cat"        → [0.23, -0.41, 0.87, 0.12, ..., -0.33]  // 1024 dimensions
"feline"     → [0.25, -0.39, 0.85, 0.14, ..., -0.31]  // Very close to "cat"!
"automobile" → [-0.67, 0.52, -0.11, 0.83, ..., 0.44]  // Far from both

// Cosine similarity (1.0 = identical direction):
// cat ↔ feline:     0.97 (nearly identical semantically)
// cat ↔ automobile: 0.12 (no semantic relationship)

The magic is that this works beyond synonyms. "How do you protect my customer data?" and "CRM security and compliance" end up close in vector space even though they share almost no words. The model understands the intent, not the letters.
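The "closeness" above is usually measured with cosine similarity: the dot product of the two vectors divided by the product of their magnitudes. A minimal sketch (illustrative, using toy 3-dimension vectors rather than real 1,024-dimension embeddings):

```javascript
// Cosine similarity: 1.0 = same direction (same meaning), ~0 or negative = unrelated
function cosineSimilarity(a, b) {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Toy vectors mirroring the conceptual example (real embeddings have 1,024 dims)
const cat = [0.23, -0.41, 0.87];
const feline = [0.25, -0.39, 0.85];
const automobile = [-0.67, 0.52, -0.11];

console.log(cosineSimilarity(cat, feline));     // high, close to 1
console.log(cosineSimilarity(cat, automobile)); // low, here even negative
```

Vector databases compute exactly this comparison at scale; the function itself is just a few multiplications per dimension.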

📐

Dimensions

Each embedding has hundreds or thousands of dimensions. More dimensions = more semantic nuances captured. bge-m3, the model we use, generates vectors of 1,024 dimensions.

🌐

Multilingual by default

Modern models like bge-m3 are multilingual — a question in Spanish can find content in English because the meaning lives in the same vector space, regardless of language.

Instant generation

Generating the embedding of a question takes ~50ms on Workers AI. It's not like training a model — it's like converting text to its semantic "coordinate". It's an inference operation, not a training one.

Comparison

Keyword Search vs Semantic Search

Traditional search (SQL LIKE '%keyword%' or full-text search) looks for exact word matches. It works well when you know exactly which words to use. But humans don't think in keywords — we think in questions.

| Aspect | Keyword Search | Semantic Search |
|---|---|---|
| What it searches | Word matching | Meaning similarity |
| Synonyms | Doesn't understand them | Captures them automatically |
| Natural questions | Poor — depends on word overlap | Excellent — understands intent |
| Multilingual | Requires separate indexes per language | A single vector space works for all |
| Scalability | Excellent with SQL indexes | Requires vector database (ANN) |
| Cost | Virtually zero | Low (embedding + vector query) |
💡

Real example in Codex

A user asks "how do you handle data security?". With keyword search, the text would need to contain exactly "security" and "data". With semantic search, it also finds results about "GDPR", "compliance", "multi-tenant isolation", "encryption" and "Turnstile" — even though those words don't appear in the question.

Architecture

What Is RAG and Why Does It Work?

RAG (Retrieval-Augmented Generation) is an architecture pattern that combines two phases: first you retrieve relevant information from your knowledge base, then you pass it to the generative model as context. The model doesn't need to "know" your data — it just needs to read it at the right moment.

1. User question — "How does Cadences connect with external systems?" The user writes in natural language, without thinking about keywords.

2. Question embedding — Workers AI converts the question into a 1,024-dimension vector using bge-m3 (~50ms, free on Cloudflare).

3. Vector search (Retrieval) — the vector is compared against the indexed chunks in Vectorize using cosine similarity. The 5 most similar fragments are returned, even if they share no words with the question.

4. Enriched context — the full text of each chunk is retrieved from D1 and injected into the LLM prompt as context. The model reads your actual content instead of making it up.

5. Generated response (Generation) — DeepSeek V3.2 generates a response based exclusively on the retrieved context. If the information isn't in the chunks, it says "I don't have information about that", eliminating hallucinations.

Strategies

RAG vs Fine-Tuning vs Full Context

There are three ways to "teach" an LLM about your business. Each has its place, but for most enterprise use cases, RAG is the right choice.

| Method | Cost | Update speed | Accuracy | Scalability |
|---|---|---|---|---|
| Full context | High (expensive tokens) | Instant | Excellent (if it fits) | Limited to context window |
| Fine-tuning | Very high ($100s–$1000s) | Days/weeks | Variable | Requires retraining |
| RAG | ~$0.0008/query | Minutes | High (verifiable sources) | 100K+ documents |

When NOT to use RAG?

  • When your entire knowledge base fits in the context window (~200K tokens = ~150K words). In that case, pass everything directly.
  • When you need the model to change its style or personality, not its knowledge. That's fine-tuning.
  • When questions are purely conversational ("hi, how are you?") without needing specific data.
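For the first case, a quick back-of-the-envelope check is enough. This sketch assumes ~1.33 tokens per English word — a common approximation that varies by tokenizer — and the ~200K-token window mentioned above:

```javascript
// Rough heuristic: does the whole knowledge base fit in the context window?
// Assumption: ~1.33 tokens per English word (varies by tokenizer and language).
const TOKENS_PER_WORD = 1.33;
const CONTEXT_WINDOW = 200_000; // tokens

function fitsInContext(totalWords) {
  return totalWords * TOKENS_PER_WORD <= CONTEXT_WINDOW;
}

console.log(fitsInContext(150_000)); // ~150K words ≈ 200K tokens: borderline fit
console.log(fitsInContext(500_000)); // too big — this is RAG territory
```

If the answer is "yes" today but your content grows steadily, RAG still wins: it scales without ever re-sending the whole corpus per query.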
Infrastructure

Vectorize: The Vector Database at the Edge

A vector database stores embeddings and lets you search for the most similar ones to a query vector. The standard algorithm is ANN (Approximate Nearest Neighbors) — it doesn't compare against every vector one by one (that would be way too slow with millions), but uses index structures to find the closest ones in milliseconds.

Cloudflare Vectorize is Cloudflare's native solution. It lives at the edge (alongside Workers and D1), has a generous free tier, and integrates directly with Workers AI for generating embeddings — without leaving the ecosystem, without inter-service network latency.

🗄️

Dual storage: Vectorize + D1

Vectors (embeddings) live in Vectorize for fast search. Original texts live in D1 (relational SQL). When Vectorize returns IDs of similar chunks, D1 provides the full text. Each service does what it does best.

📊

Cosine similarity

The metric that measures similarity between two vectors (Vectorize calls the index metric "cosine"). A score of 1.0 = identical, 0.0 = unrelated. In practice, a result with score > 0.55 is relevant, > 0.70 is highly relevant, > 0.85 is practically an exact match.
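Those cutoffs can be encoded directly when post-filtering matches. A sketch with the thresholds above — the band labels are ours, not a Vectorize feature:

```javascript
// Map a cosine similarity score to the relevance bands used in Codex
function relevance(score) {
  if (score > 0.85) return 'near-exact match';
  if (score > 0.70) return 'highly relevant';
  if (score > 0.55) return 'relevant';
  return 'discard'; // too weak to inject as LLM context
}

console.log(relevance(0.97)); // near-exact match
console.log(relevance(0.62)); // relevant
console.log(relevance(0.40)); // discard
```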

🏷️

Namespaces for filtering

Vectorize supports namespaces — logical partitions within the same index. We use the language (es, en) as namespace so that Spanish searches only compare against Spanish content.

// Codex index configuration in wrangler.toml
[[vectorize]]
binding = "VECTORIZE"
index_name = "codex-knowledge"  # 1024 dimensions, cosine
# Free tier: 200K vectors, 5M dims stored/month
# Enough for ~195 articles × ~30 chunks = ~5,850 vectors
Data Preparation

Chunking: The Art of Splitting Text

You can't generate an embedding of a full 4,000-word article — the meaning gets diluted. You need to split it into fragments (chunks) small enough to be semantically precise, but large enough to retain context.

📏 Chunk size: ~400 words

The sweet spot for models like bge-m3. Too short (50 words) loses context. Too long (1000+ words) dilutes meaning. 400 words captures ~2-3 coherent paragraphs.

🔄 Overlap: 50 words

Each chunk overlaps 50 words with the previous one. This prevents ideas crossing chunk boundaries from being lost. If the key point sits at the border, both chunks capture it.

🧹 Pre-processing cleanup

Before chunking, all HTML, CSS, SVG, Tailwind classes, scripts, and metadata are stripped. Only the pure text a human would read remains. This dramatically improves embedding quality.
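Put together, a minimal chunker looks like this — word-based splitting with overlap after stripping markup. This is an illustrative sketch of the technique, not the exact Codex pipeline code:

```javascript
// Strip markup so only human-readable text gets embedded
function stripHtml(html) {
  return html
    .replace(/<script[\s\S]*?<\/script>/gi, ' ') // drop scripts entirely
    .replace(/<style[\s\S]*?<\/style>/gi, ' ')   // drop CSS
    .replace(/<[^>]+>/g, ' ')                    // remove remaining tags
    .replace(/\s+/g, ' ')
    .trim();
}

// Split into ~400-word chunks, each overlapping the previous by 50 words
function chunkText(text, size = 400, overlap = 50) {
  const words = text.split(/\s+/);
  const chunks = [];
  for (let start = 0; start < words.length; start += size - overlap) {
    chunks.push(words.slice(start, start + size).join(' '));
    if (start + size >= words.length) break; // final chunk reached
  }
  return chunks;
}

const clean = stripHtml('<div class="flex"><p>Hello RAG world</p></div>');
console.log(clean); // "Hello RAG world"
```

With these defaults, a 1,000-word article yields three chunks: words 0–399, 350–749 and 700–999, so any idea sitting at a boundary appears in two chunks.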

📊

Real Codex numbers

  • 24 articles (12 in Spanish + 12 in English)
  • 98 chunks generated (~4 chunks per article)
  • 98 vectors of 1,024 dimensions in Vectorize
  • ~100K dimensions stored (~2% of the 5M-dimension free tier)
  • Full indexing in < 2 minutes
Implementation

The Complete Codex Chatbot Stack

The chatbot you see on this website is a complete RAG system running 100% on Cloudflare. Here's each component and why we chose it:

| Component | Technology | Cost | Function |
|---|---|---|---|
| Embeddings | Workers AI bge-m3 | Free | Converts text → vector (1,024 dims) |
| Vector store | Cloudflare Vectorize | Free (current tier) | Stores and searches vectors by similarity |
| Text + metadata | Cloudflare D1 | Free (current tier) | Stores text chunks, titles, URLs |
| Primary LLM | DeepSeek V3.2 | $0.28 / 1M input tokens | Generates responses with RAG context |
| Fallback LLM | Workers AI Llama 3.1 | Free | Backup if DeepSeek unavailable |
| Frontend | Astro + vanilla JS | Static | Floating chat widget, voice, markdown |
// Simplified RAG endpoint flow (codex-chat.js)
// 1. Embed the question (~50ms)
const embedding = await env.AI.run('@cf/baai/bge-m3', {
  text: [question]
});

// 2. Vector search in Vectorize (~30ms)
const results = await env.VECTORIZE.query(embedding.data[0], {
  topK: 5,
  namespace: lang,         // 'es' or 'en'
  returnMetadata: 'all'
});
// Vectorize has no built-in score cutoff — filter matches in code
const matches = results.matches.filter(m => m.score >= 0.55);

// 3. Fetch full text from D1
const chunks = await db.prepare(
  'SELECT chunk_text, title, url FROM codex_chunks WHERE id IN (...)'
).all();

// 4. Generate response with DeepSeek
const response = await fetch('https://api.deepseek.com/chat/completions', {
  method: 'POST',
  headers: {
    'Authorization': `Bearer ${env.DEEPSEEK_API_KEY}`,
    'Content-Type': 'application/json'
  },
  body: JSON.stringify({
    model: 'deepseek-chat',
    messages: [
      { role: 'system', content: systemPrompt + context },
      ...history,
      { role: 'user', content: question }
    ]
  })
});
Economics

Cost Analysis: ~$0.0008 per Query

One of RAG's biggest advantages over other architectures is cost. By using services with free tiers for embeddings and vector storage, the only variable cost is the generative LLM. Here's the real breakdown:

  • Question embedding ($0.0000): Workers AI bge-m3 is free on the Workers AI tier. No practical limit for normal use.
  • Vector search ($0.0000): Vectorize includes 30M queries/month in the free tier. More than enough.
  • D1 read ($0.0000): D1 allows 5M reads/day on the free tier. Each query reads ~5 rows.
  • DeepSeek V3.2 response (~$0.0008): ~2K input tokens (context + question) × $0.28/1M + ~500 output tokens × $0.42/1M ≈ $0.0008.
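Spelling out that last line item (token counts are the article's own estimates):

```javascript
// DeepSeek V3.2 pricing per the article: $0.28 / 1M input, $0.42 / 1M output
const INPUT_PRICE = 0.28 / 1_000_000;   // $ per input token
const OUTPUT_PRICE = 0.42 / 1_000_000;  // $ per output token

function queryCost(inputTokens, outputTokens) {
  return inputTokens * INPUT_PRICE + outputTokens * OUTPUT_PRICE;
}

const perQuery = queryCost(2000, 500);
console.log(perQuery.toFixed(4));          // "0.0008" ($0.00077 exact)
console.log((perQuery * 1000).toFixed(2)); // "0.77" — ~$0.80/month at 1,000 queries
```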

💰

Estimated monthly cost

With 1,000 queries/month (a reasonable volume for a technical blog): ~$0.80/month. The equivalent with GPT-4o would be ~$60/month. DeepSeek V3.2 offers GPT-4 comparable quality at 1.3% of the price.

Interface

Voice, Markdown & Sources: The Chat Experience

A RAG system is only useful if the interface is good. The Codex chatbot includes features that go beyond a simple text field:

🎙️

Voice input

Native browser Web Speech API. Speak your question and it's automatically transcribed, at no additional cost (note that some browsers process the audio server-side).

📝

Markdown rendering

Responses render with formatting: bold, italic, lists, inline code, code blocks and clickable links.

📎

Verifiable sources

Every response includes links to the original articles used as context, with their similarity score. You can verify the information.

🌙

Dark mode + i18n

Adapts to system theme, supports Spanish and English, and is responsive — full-screen on mobile.

Scalability

From 24 Articles to 100K Documents

The Codex pipeline currently indexes 24 articles. But the architecture is designed to scale several orders of magnitude without changing technologies:

| Scale | Vectors | Search latency | Tier required |
|---|---|---|---|
| Blog (current) | ~100 | ~30ms | Free |
| Medium wiki | ~5,000 | ~30ms | Free |
| Enterprise knowledge base | ~50,000 | ~35ms | Workers Paid ($5/mo) |
| Massive documentation | ~200,000 | ~40ms | Workers Paid ($5/mo) |

Vectorize uses HNSW (Hierarchical Navigable Small World) as its indexing algorithm — the same family of algorithms used by Pinecone or pgvector. Search latency grows logarithmically, not linearly: going from 100 to 100K vectors barely adds 10ms.
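To see why logarithmic growth matters, compare a brute-force scan (one comparison per stored vector) against a rough log2 proxy for graph-based search. This is an illustration of the scaling behavior, not Vectorize's actual internals — real HNSW visits a small constant number of neighbors per layer:

```javascript
// Brute force touches every vector; an HNSW-style index descends
// roughly O(log n) levels (constant factors deliberately omitted).
function bruteForceComparisons(n) { return n; }
function hnswComparisonsApprox(n) { return Math.ceil(Math.log2(n)); }

for (const n of [100, 5_000, 200_000]) {
  console.log(n, bruteForceComparisons(n), hnswComparisonsApprox(n));
}
// At 200,000 vectors: 200,000 linear comparisons vs ~18 levels of descent
```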

Best Practices

Common Mistakes and How to Avoid Them

❌ Chunks too long

If a chunk has 2,000 words about 5 different topics, its embedding will be a blurred average of all of them. Search will be imprecise. Solution: 300–500 words per chunk.

❌ Not cleaning the HTML

Generating embeddings of <div class="flex items-center gap-3 mb-6"> wastes dimensions on irrelevant information. Solution: Strip all markup before embedding.

❌ Similarity threshold too low

Accepting results with score < 0.4 injects irrelevant context that confuses the LLM. Solution: Use minScore ≥ 0.55 and verify empirically.

❌ Ignoring the system prompt

Without clear instructions, the LLM may invent information outside the retrieved context. Solution: System prompt that explicitly says "only answer with information from the provided context".

❌ Not including useful metadata

If the LLM only sees loose text without knowing which article it comes from, it can't cite sources. Solution: Include title, URL and category of each chunk in the context.

Conclusion

AI That Knows Your Stuff

RAG is not a trend — it's the pattern that solves the fundamental problem of LLMs: they know a lot about everything, but nothing about your business. With embeddings, semantic search and generation based on real context, you can have an assistant that answers precisely about your content, cites sources and costs less than $1/month.

The Codex chatbot is the living proof. Ask it anything about Cadences — it searches 98 chunks from 24 articles, finds the most relevant ones by semantic similarity, and generates a grounded response with DeepSeek. All at the edge, all serverless, all for ~$0.0008 per question.

🧠

Technical summary

  • Embeddings: Workers AI bge-m3, 1,024 dimensions, multilingual, free
  • Vector DB: Cloudflare Vectorize with the cosine similarity metric, namespaces per language
  • Text store: D1 with chunks, metadata (title, URL, category), indexed by slug
  • LLM: DeepSeek V3.2 ($0.28/$0.42 per 1M tokens) + fallback Workers AI Llama
  • Chunking: ~400 words with 50-word overlap, full HTML stripping
  • Pipeline: question → embedding → Vectorize query → D1 fetch → DeepSeek → response with sources
  • Real cost: ~$0.0008/query, ~$0.80/month for 1,000 questions
  • Rate limit: 20 queries/day per IP (configurable), Turnstile-ready

Cadences Engineering

Technical documentation from the engineering team