A Practical Guide to AI Embeddings — Beyond the Hype

May 27, 2026 · ScaledByDesign
Tags: ai, embeddings, vector-database, semantic-search, machine-learning

What Embeddings Actually Are

An embedding is a list of numbers (a vector) that captures the meaning of text, images, or other data. Items with similar meaning get vectors that point in similar directions, which is what cosine similarity measures. That's it: the concept is simple even if the math behind it is complex.

// Two sentences with similar meaning
const embedding1 = await embed("How do I return a product?");
// → [0.023, -0.41, 0.87, 0.12, ... ] (1536 numbers)
 
const embedding2 = await embed("What's your return policy?");
// → [0.025, -0.39, 0.85, 0.14, ... ] (similar numbers!)
 
const embedding3 = await embed("The weather is nice today");
// → [-0.71, 0.33, -0.12, 0.56, ...] (very different numbers)
 
// Cosine similarity
cosineSimilarity(embedding1, embedding2); // 0.94 (very similar)
cosineSimilarity(embedding1, embedding3); // 0.11 (not related)
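
The `cosineSimilarity` function used above is not defined in the snippet. A minimal sketch, assuming both embeddings are plain number arrays of the same length; returning 0 for a zero vector is a choice made for this sketch, not a standard:

```typescript
// Cosine similarity: dot(a, b) / (|a| * |b|). Result ranges from -1 to 1.
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) {
    throw new Error(`Dimension mismatch: ${a.length} vs ${b.length}`);
  }
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  const denom = Math.sqrt(normA) * Math.sqrt(normB);
  // A zero vector has no direction; treat similarity as 0 (assumption of this sketch)
  return denom === 0 ? 0 : dot / denom;
}
```

If your provider returns unit-length vectors, the dot product alone gives the same ranking, which is why some vector stores offer an inner-product operator as a cheaper alternative.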

Choosing an Embedding Model

| Model                    | Dimensions | Speed     | Quality | Cost               |
|--------------------------|------------|-----------|---------|--------------------|
| text-embedding-3-small   | 1536       | Fast      | Good    | $0.02/1M tokens    |
| text-embedding-3-large   | 3072       | Medium    | Better  | $0.13/1M tokens    |
| Cohere embed-v3          | 1024       | Fast      | Good    | $0.10/1M tokens    |
| all-MiniLM-L6-v2 (local) | 384        | Very fast | OK      | Free (self-hosted) |
| nomic-embed-text (local) | 768        | Fast      | Good    | Free (self-hosted) |

Recommendation:
  → Start with text-embedding-3-small (best cost/quality for most use cases)
  → Use local models if you process >10M documents (cost savings)
  → Use text-embedding-3-large for legal/medical (accuracy-critical)
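
The cost column turns into a budget with simple arithmetic. A rough sketch using the prices from the table; `estimateEmbeddingCost` and the model keys are hypothetical names, and the document count and average tokens per document are inputs you would measure on your own corpus:

```typescript
// Price in USD per 1M tokens, copied from the table above (verify against current provider pricing)
const PRICE_PER_MILLION_TOKENS: Record<string, number> = {
  "text-embedding-3-small": 0.02,
  "text-embedding-3-large": 0.13,
  "cohere-embed-v3": 0.1,
};

// Estimated one-time cost in USD to embed a whole corpus once
function estimateEmbeddingCost(
  model: string,
  documentCount: number,
  avgTokensPerDoc: number,
): number {
  const price = PRICE_PER_MILLION_TOKENS[model];
  if (price === undefined) throw new Error(`Unknown model: ${model}`);
  const totalTokens = documentCount * avgTokensPerDoc;
  return (totalTokens / 1e6) * price;
}
```

For example, 10 million documents at an assumed 500 tokens each is 5 billion tokens: about $100 with text-embedding-3-small and about $650 with text-embedding-3-large, before any re-embedding on updates or model changes.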

Chunking Strategy

Before embedding documents, you need to split them into chunks. This is where most implementations fail:

// ✗ Bad: fixed-size chunks that split sentences
function badChunking(text: string, size: number = 500): string[] {
  const chunks: string[] = [];
  for (let i = 0; i < text.length; i += size) {
    chunks.push(text.slice(i, i + size)); // Might cut mid-sentence
  }
  return chunks;
}
 
// ✓ Good: semantic chunking that respects document structure
// Note: the thresholds below are character counts, a rough proxy for tokens
interface Chunk {
  content: string;
}

function semanticChunking(markdown: string): Chunk[] {
  const chunks: Chunk[] = [];

  // Split by headers first (natural document boundaries)
  const sections = markdown.split(/\n#{1,3}\s/);

  for (const rawSection of sections) {
    const section = rawSection.trim();
    if (!section) continue; // Skip empty sections

    if (section.length < 200) {
      // Too small: merge with the previous chunk, or keep it if it is the first section
      if (chunks.length > 0) {
        chunks[chunks.length - 1].content += "\n" + section;
      } else {
        chunks.push({ content: section });
      }
    } else if (section.length > 1000) {
      // Too large: split by paragraphs
      const paragraphs = section.split(/\n\n/);
      let currentChunk = "";

      for (const para of paragraphs) {
        // Flush only when there is something to flush, so an oversized
        // first paragraph does not produce an empty chunk
        if (currentChunk && (currentChunk + para).length > 800) {
          chunks.push({ content: currentChunk.trim() });
          currentChunk = para;
        } else {
          currentChunk += (currentChunk ? "\n\n" : "") + para;
        }
      }
      if (currentChunk.trim()) chunks.push({ content: currentChunk.trim() });
    } else {
      chunks.push({ content: section });
    }
  }

  return chunks;
}

Chunk Overlap

Add overlap between chunks so context isn't lost at boundaries:

function chunkWithOverlap(paragraphs: string[], targetSize: number = 600, overlap: number = 100): string[] {
  const chunks: string[] = [];
  let current = "";
 
  for (const para of paragraphs) {
    if ((current + para).length > targetSize && current.length > 0) {
      chunks.push(current);
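      // Start next chunk with overlap from end of current
      // (guard against overlap <= 0: slice(-0) would return the whole string)
      const tail = overlap > 0 ? current.slice(-overlap) : "";
      current = (tail ? tail + "\n\n" : "") + para;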
    } else {
      current += (current ? "\n\n" : "") + para;
    }
  }
  if (current) chunks.push(current);
  return chunks;
}
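
One caveat with taking the overlap via `current.slice(-overlap)`: the slice can begin in the middle of a word. A possible refinement, sketched here with a hypothetical helper named `overlapTail`, snaps the overlap forward to the next word boundary:

```typescript
// Returns roughly the last `overlap` characters of `text`, starting at a word boundary.
// `overlapTail` is a hypothetical helper, not part of the original snippet.
function overlapTail(text: string, overlap: number): string {
  if (overlap <= 0) return "";
  if (text.length <= overlap) return text;

  const tail = text.slice(-overlap);
  const charBefore = text.charAt(text.length - overlap - 1);

  // If the character just before the tail is whitespace, the tail already starts on a word
  if (/\s/.test(charBefore)) return tail.replace(/^\s+/, "");

  // Otherwise drop the partial first word; if the tail has no whitespace, keep it as-is
  const firstSpace = tail.search(/\s/);
  return firstSpace === -1 ? tail : tail.slice(firstSpace).replace(/^\s+/, "");
}
```

The result can be shorter than `overlap` characters, which is usually an acceptable trade for not feeding the model half a word.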

Storing and Querying Embeddings

Vector Database Options

Managed services:
  → Pinecone: Easiest to start, good for < 10M vectors
  → Weaviate Cloud: Full-featured, good hybrid search
  → Qdrant Cloud: Best performance/cost ratio

Self-hosted:
  → pgvector (PostgreSQL): Good for < 1M vectors if you already run Postgres
  → Qdrant: Best for large scale self-hosted
  → Chroma: Good for prototyping, not for production scale

Recommendation:
  → Already use PostgreSQL? Start with pgvector
  → Need scale (>5M vectors)? Use Qdrant or Pinecone
  → Prototyping? Use Chroma locally

pgvector Example

-- Enable the extension
CREATE EXTENSION IF NOT EXISTS vector;
 
-- Create table with vector column
CREATE TABLE documents (
  id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
  content TEXT NOT NULL,
  embedding vector(1536),  -- Match your model's dimensions
  metadata JSONB DEFAULT '{}',
  created_at TIMESTAMPTZ DEFAULT NOW()
);
 
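-- Create index for fast similarity search
-- (build it after loading data: ivfflat derives its clusters from existing rows)
CREATE INDEX ON documents 
  USING ivfflat (embedding vector_cosine_ops)
  WITH (lists = 100);  -- Tune: pgvector suggests rows / 1000 up to ~1M rows, sqrt(rows) above that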
 
-- Query: find similar documents
SELECT id, content, 1 - (embedding <=> $1::vector) AS similarity
FROM documents
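WHERE metadata->>'category' = 'returns'  -- Metadata filter; with an ivfflat index pgvector applies it after the index scan, so very selective filters can return fewer than 5 rows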
ORDER BY embedding <=> $1::vector
LIMIT 5;
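
From application code, the `$1::vector` parameter is usually sent as a string in pgvector's text format, `[x1,x2,...]`. A small sketch of that conversion, plus the `lists` sizing heuristic from the pgvector README (rows / 1000 up to 1M rows, sqrt(rows) above that). Both function names are hypothetical:

```typescript
// Serialize an embedding into pgvector's text format, e.g. "[0.1,-2,3]"
function toVectorLiteral(embedding: number[]): string {
  if (embedding.length === 0) throw new Error("Embedding is empty");
  if (!embedding.every((x) => Number.isFinite(x))) {
    throw new Error("Embedding contains NaN or Infinity");
  }
  return `[${embedding.join(",")}]`;
}

// Starting point for the ivfflat `lists` parameter, to be tuned against real recall and latency
function suggestIvfflatLists(rowCount: number): number {
  const lists = rowCount <= 1000000 ? rowCount / 1000 : Math.sqrt(rowCount);
  return Math.max(1, Math.round(lists));
}
```

Pass the output of `toVectorLiteral(queryEmbedding)` as the bound parameter rather than interpolating it into the SQL string.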

Production Pipeline

// Complete embedding pipeline for a knowledge base
async function indexDocuments(documents: Document[]) {
  for (const doc of documents) {
    // 1. Chunk the document
    const chunks = semanticChunking(doc.content);
 
    // 2. Generate embeddings in batches
    const embeddings = await batchEmbed(chunks.map(c => c.content), {
      batchSize: 100,  // Most APIs accept batches
    });
 
    // 3. Store with metadata
    await vectorDB.upsert(
      chunks.map((chunk, i) => ({
        id: `${doc.id}-chunk-${i}`,
        embedding: embeddings[i],
        metadata: {
          documentId: doc.id,
          title: doc.title,
          chunkIndex: i,
          category: doc.category,
          updatedAt: doc.updatedAt,
        },
        content: chunk.content,
      }))
    );
  }
}
 
// Search with metadata filtering
async function search(query: string, filters?: SearchFilters) {
  const queryEmbedding = await embed(query);
 
  return vectorDB.search(queryEmbedding, {
    topK: 5,
    minScore: 0.7,  // Drop weak matches; tune per model, since score ranges differ
    filter: filters ? buildFilter(filters) : undefined,
  });
}
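
`batchEmbed` is left undefined in the pipeline above. One way it might look, as a sketch: the `embedBatch` parameter is a hypothetical stand-in for whatever call your provider's client exposes, injected here so the batching logic stays provider-agnostic. Note that this version takes that function as an extra argument, unlike the two-argument call in the pipeline.

```typescript
// Hypothetical shape of a provider call: many texts in, one vector per text out
type EmbedBatchFn = (texts: string[]) => Promise<number[][]>;

// Split an array into consecutive slices of at most `size` items
function toBatches<T>(items: T[], size: number): T[][] {
  if (!Number.isInteger(size) || size <= 0) {
    throw new Error("size must be a positive integer");
  }
  const batches: T[][] = [];
  for (let i = 0; i < items.length; i += size) {
    batches.push(items.slice(i, i + size));
  }
  return batches;
}

// Embed texts batch by batch, preserving input order
async function batchEmbed(
  texts: string[],
  embedBatch: EmbedBatchFn,
  options: { batchSize: number } = { batchSize: 100 },
): Promise<number[][]> {
  const results: number[][] = [];
  for (const batch of toBatches(texts, options.batchSize)) {
    const vectors = await embedBatch(batch);
    // Fail loudly if the provider returns a different count, so chunks and vectors never misalign
    if (vectors.length !== batch.length) {
      throw new Error(`Expected ${batch.length} embeddings, got ${vectors.length}`);
    }
    results.push(...vectors);
  }
  return results;
}
```

Batches run sequentially here to stay within rate limits; running a few in parallel is a reasonable next step once you know your provider's limits.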

Common Mistakes

Mistake 1: Embedding entire documents
  → Problem: Long text dilutes the embedding signal
  → Fix: Chunk into 200-800 token segments

Mistake 2: Not cleaning text before embedding
  → Problem: HTML tags, navigation text pollute embeddings
  → Fix: Strip HTML, remove boilerplate, keep content only

Mistake 3: Mixing embedding models
  → Problem: Vectors from different models aren't comparable
  → Fix: Re-embed everything if you change models

Mistake 4: No metadata filtering
  → Problem: Vector search across all docs when you know the category
  → Fix: Filter by metadata FIRST, then vector search within results

Mistake 5: Not evaluating retrieval quality
  → Problem: You don't know if search results are actually good
  → Fix: Build a test set of queries with expected results, measure recall
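
The fix for Mistake 5 can start very small. A sketch of recall@k over a hand-built test set; `EvalCase`, `recallAtK`, and `meanRecallAtK` are hypothetical names, and `retrieve` stands in for your own search function returning ranked IDs:

```typescript
interface EvalCase {
  query: string;
  expectedIds: string[]; // IDs a good result set should contain
}

// Recall@k for one query: fraction of expected IDs found in the top-k retrieved IDs
function recallAtK(retrievedIds: string[], expectedIds: string[], k: number): number {
  if (expectedIds.length === 0) return 1; // Nothing to find
  const topK = new Set(retrievedIds.slice(0, k));
  const hits = expectedIds.filter((id) => topK.has(id)).length;
  return hits / expectedIds.length;
}

// Mean recall@k across the whole test set
async function meanRecallAtK(
  cases: EvalCase[],
  retrieve: (query: string) => Promise<string[]>,
  k: number,
): Promise<number> {
  if (cases.length === 0) return 0;
  let total = 0;
  for (const testCase of cases) {
    const retrieved = await retrieve(testCase.query);
    total += recallAtK(retrieved, testCase.expectedIds, k);
  }
  return total / cases.length;
}
```

Track this number whenever you change chunk sizes, models, or filters; a few dozen representative queries is often enough to catch regressions.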

Embeddings are the foundation, not the product. Get chunking, storage, and retrieval right, and everything built on top — semantic search, RAG, recommendations — will work better. Get them wrong, and no amount of prompt engineering will save you.
