A Practical Guide to AI Embeddings — Beyond the Hype

May 27, 2026 · ScaledByDesign

ai · embeddings · vector-database · semantic-search · machine-learning

What Embeddings Actually Are

An embedding is a list of numbers (a vector) that captures the meaning of text, images, or other data. Items with similar meaning get vectors that point in similar directions, which is what makes them comparable. That's it: the concept is simple even if the math behind it is complex.

// Two sentences with similar meaning
const embedding1 = await embed("How do I return a product?");
// → [0.023, -0.41, 0.87, 0.12, ... ] (1536 numbers)
 
const embedding2 = await embed("What's your return policy?");
// → [0.025, -0.39, 0.85, 0.14, ... ] (similar numbers!)
 
const embedding3 = await embed("The weather is nice today");
// → [-0.71, 0.33, -0.12, 0.56, ...] (very different numbers)
 
// Cosine similarity
cosineSimilarity(embedding1, embedding2); // 0.94 (very similar)
cosineSimilarity(embedding1, embedding3); // 0.11 (not related)
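
The snippet above calls cosineSimilarity without defining it. One way such a helper could be written is sketched below; it is an illustration, not the implementation of any particular library, and returning 0 for a zero-length vector is an assumed convention.

```typescript
// Cosine similarity: dot(a, b) / (|a| * |b|).
// The result ranges from -1 (opposite) through 0 (unrelated) to 1 (same direction).
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) {
    throw new Error(`Dimension mismatch: ${a.length} vs ${b.length}`);
  }

  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }

  // Assumed convention: a zero vector has no direction, so report 0.
  if (normA === 0 || normB === 0) return 0;

  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}
```

Because the score depends only on direction, vectors from a model that already returns unit-length embeddings can be compared with a plain dot product instead.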

Choosing an Embedding Model

| Model                    | Dimensions | Speed     | Quality | Cost               |
|--------------------------|------------|-----------|---------|--------------------|
| text-embedding-3-small   | 1536       | Fast      | Good    | $0.02/1M tokens    |
| text-embedding-3-large   | 3072       | Medium    | Better  | $0.13/1M tokens    |
| Cohere embed-v3          | 1024       | Fast      | Good    | $0.10/1M tokens    |
| all-MiniLM-L6-v2 (local) | 384        | Very fast | OK      | Free (self-hosted) |
| nomic-embed-text (local) | 768        | Fast      | Good    | Free (self-hosted) |

Recommendation:
  → Start with text-embedding-3-small (best cost/quality for most use cases)
  → Use local models if you process >10M documents (cost savings)
  → Use text-embedding-3-large for legal/medical (accuracy-critical)
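
To compare the hosted options against self-hosting, it helps to put a number on a single indexing pass. The sketch below uses the per-token prices from the table; the function name and the 500-tokens-per-document average are assumptions for illustration, so substitute figures measured on your own corpus.

```typescript
// Rough API cost of embedding a corpus once.
// pricePerMillionTokens comes from the table above; the other inputs are yours to measure.
function estimateEmbeddingCost(
  documentCount: number,
  avgTokensPerDocument: number,
  pricePerMillionTokens: number
): number {
  const totalTokens = documentCount * avgTokensPerDocument;
  return (totalTokens / 1_000_000) * pricePerMillionTokens;
}

// Assumed workload: 10M documents averaging 500 tokens each (5B tokens total).
const smallModelCost = estimateEmbeddingCost(10_000_000, 500, 0.02);
const largeModelCost = estimateEmbeddingCost(10_000_000, 500, 0.13);

console.log(smallModelCost.toFixed(2)); // "100.00"
console.log(largeModelCost.toFixed(2)); // "650.00"
```

This covers one pass only. Re-embedding after a model change and embedding every incoming query add to the total, and self-hosting has its own hardware and operations cost.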

Chunking Strategy

Before embedding documents, you need to split them into chunks. This is where most implementations fail:

// ✗ Bad: fixed-size chunks that split sentences
function badChunking(text: string, size: number = 500): string[] {
  const chunks = [];
  for (let i = 0; i < text.length; i += size) {
    chunks.push(text.slice(i, i + size)); // Might cut mid-sentence
  }
  return chunks;
}
 
// ✓ Good: semantic chunking that respects document structure
interface Chunk {
  content: string;
}

function semanticChunking(markdown: string): Chunk[] {
  const chunks: Chunk[] = [];

  // Split before headers (natural document boundaries).
  // The lookahead keeps each header attached to its own section.
  const sections = markdown.split(/\n(?=#{1,3}\s)/);

  for (const rawSection of sections) {
    const section = rawSection.trim();
    if (!section) continue;

    if (section.length < 200 && chunks.length > 0) {
      // Too small: merge with previous chunk
      chunks[chunks.length - 1].content += "\n\n" + section;
    } else if (section.length > 1000) {
      // Too large: split by paragraphs
      const paragraphs = section.split(/\n\n+/);
      let currentChunk = "";

      for (const para of paragraphs) {
        // Only flush when there is something to flush, so no empty chunks are emitted
        if (currentChunk && (currentChunk + "\n\n" + para).length > 800) {
          chunks.push({ content: currentChunk.trim() });
          currentChunk = para;
        } else {
          currentChunk += (currentChunk ? "\n\n" : "") + para;
        }
      }
      if (currentChunk.trim()) chunks.push({ content: currentChunk.trim() });
    } else {
      // Right-sized, or a short first section with nothing to merge into
      chunks.push({ content: section });
    }
  }

  return chunks;
}

Chunk Overlap

Add overlap between chunks so context isn't lost at boundaries:

function chunkWithOverlap(paragraphs: string[], targetSize: number = 600, overlap: number = 100): string[] {
  const chunks: string[] = [];
  let current = "";
 
  for (const para of paragraphs) {
    if ((current + para).length > targetSize && current.length > 0) {
      chunks.push(current);
      // Start next chunk with overlap from end of current.
      // slice(-0) returns the whole string, so handle overlap <= 0 explicitly.
      const tail = overlap > 0 ? current.slice(-overlap) : "";
      current = tail ? tail + "\n\n" + para : para;
    } else {
      current += (current ? "\n\n" : "") + para;
    }
  }
  if (current) chunks.push(current);
  return chunks;
}
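
Taking the overlap with current.slice(-overlap) cuts at a fixed character offset, so the carried-over text can start in the middle of a word. A minimal sketch of a helper that snaps the overlap to the next word boundary follows; overlapTail is a hypothetical name, and splitting on whitespace is an assumption that suits space-delimited languages.

```typescript
// Returns at most `overlap` trailing characters of `text`,
// dropping a leading partial word if the cut landed mid-word.
function overlapTail(text: string, overlap: number): string {
  if (overlap <= 0) return "";
  if (text.length <= overlap) return text;

  const tail = text.slice(-overlap);
  const charBeforeCut = text.charAt(text.length - overlap - 1);
  const cutMidWord = /\S/.test(charBeforeCut) && /\S/.test(tail.charAt(0));
  if (!cutMidWord) return tail.trimStart();

  // Skip the partial word; if the tail is one unbroken word, keep it as-is.
  const firstWhitespace = tail.search(/\s/);
  return firstWhitespace === -1 ? tail : tail.slice(firstWhitespace).trimStart();
}

console.log(overlapTail("The quick brown fox jumps", 9)); // "fox jumps"
console.log(overlapTail("The quick brown fox jumps", 7)); // "jumps"
```

Swapping this in for the raw slice keeps the overlap readable at the price of a slightly shorter tail.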

Storing and Querying Embeddings

Vector Database Options

Managed services:
  → Pinecone: Easiest to start, good for < 10M vectors
  → Weaviate Cloud: Full-featured, good hybrid search
  → Qdrant Cloud: Best performance/cost ratio

Self-hosted:
  → pgvector (PostgreSQL): Good for < 1M vectors, especially if you already run Postgres
  → Qdrant: Best for large scale self-hosted
  → Chroma: Good for prototyping, not for production scale

Recommendation:
  → Already use PostgreSQL? Start with pgvector
  → Need scale (>5M vectors)? Use Qdrant or Pinecone
  → Prototyping? Use Chroma locally

pgvector Example

-- Enable the extension
CREATE EXTENSION IF NOT EXISTS vector;
 
-- Create table with vector column
CREATE TABLE documents (
  id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
  content TEXT NOT NULL,
  embedding vector(1536),  -- Match your model's dimensions
  metadata JSONB DEFAULT '{}',
  created_at TIMESTAMPTZ DEFAULT NOW()
);
 
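-- Create index for fast similarity search
-- (build it after loading data so the ivfflat clusters reflect real rows)
CREATE INDEX ON documents 
  USING ivfflat (embedding vector_cosine_ops)
  WITH (lists = 100);  -- Tune: pgvector's docs suggest rows / 1000 up to 1M rows, sqrt(rows) beyond that
 
-- Query: find similar documents
-- <=> is cosine distance, so 1 - distance gives cosine similarity
SELECT id, content, 1 - (embedding <=> $1::vector) AS similarity
FROM documents
WHERE metadata->>'category' = 'returns'  -- With an approximate index the filter runs after the index scan, so fewer than 5 rows may come back
ORDER BY embedding <=> $1::vector
LIMIT 5;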
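
The $1::vector parameter expects pgvector's text format, a bracketed comma-separated list such as [0.1,0.2,0.3]. A minimal sketch of building that string from a number[] is shown below; the helper name toVectorLiteral and the dimension check are illustrative assumptions, not part of pgvector or any Postgres client.

```typescript
// Formats an embedding as a pgvector text literal, e.g. "[0.1,-0.5,2]".
// expectedDimensions should match the column definition, vector(1536) in the table above.
function toVectorLiteral(embedding: number[], expectedDimensions: number): string {
  if (embedding.length !== expectedDimensions) {
    throw new Error(
      `Expected ${expectedDimensions} dimensions, got ${embedding.length}`
    );
  }
  if (!embedding.every((value) => Number.isFinite(value))) {
    throw new Error("Embedding contains NaN or Infinity");
  }
  return `[${embedding.join(",")}]`;
}

console.log(toVectorLiteral([0.1, -0.5, 2], 3)); // "[0.1,-0.5,2]"
```

The resulting string is what you would bind as the first query parameter in whichever Postgres client you use. Checking dimensions up front gives a clearer error than the one the database raises on a mismatched vector.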

Production Pipeline

// Complete embedding pipeline for a knowledge base
async function indexDocuments(documents: Document[]) {
  for (const doc of documents) {
    // 1. Chunk the document
    const chunks = semanticChunking(doc.content);
 
    // 2. Generate embeddings in batches
    const embeddings = await batchEmbed(chunks.map(c => c.content), {
      batchSize: 100,  // Most APIs accept batches
    });
 
    // 3. Store with metadata
    await vectorDB.upsert(
      chunks.map((chunk, i) => ({
        id: `${doc.id}-chunk-${i}`,
        embedding: embeddings[i],
        metadata: {
          documentId: doc.id,
          title: doc.title,
          chunkIndex: i,
          category: doc.category,
          updatedAt: doc.updatedAt,
        },
        content: chunk.content,
      }))
    );
  }
}
 
// Search with metadata filtering
async function search(query: string, filters?: SearchFilters) {
  const queryEmbedding = await embed(query);
 
  return vectorDB.search(queryEmbedding, {
    topK: 5,
    minScore: 0.7,  // Only return relevant results
    filter: filters ? buildFilter(filters) : undefined,
  });
}
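
The pipeline above relies on a batchEmbed helper that is not shown. One possible shape for it is sketched here under a different name, embedInBatches, with the provider call injected as a function; both the name and the EmbedBatchFn signature are assumptions, so adapt them to whatever your embedding client actually exposes.

```typescript
// Hypothetical signature for a provider call that embeds several texts at once
// and returns one vector per input, in the same order.
type EmbedBatchFn = (texts: string[]) => Promise<number[][]>;

// Splits texts into batches of at most batchSize, embeds them sequentially,
// and returns the vectors in the original input order.
async function embedInBatches(
  texts: string[],
  embedBatch: EmbedBatchFn,
  batchSize: number = 100
): Promise<number[][]> {
  if (batchSize < 1) throw new Error("batchSize must be at least 1");

  const embeddings: number[][] = [];
  for (let start = 0; start < texts.length; start += batchSize) {
    const batch = texts.slice(start, start + batchSize);
    const vectors = await embedBatch(batch);

    // Guard against a provider returning fewer vectors than inputs,
    // which would silently misalign chunks and embeddings.
    if (vectors.length !== batch.length) {
      throw new Error(
        `Expected ${batch.length} embeddings, received ${vectors.length}`
      );
    }
    embeddings.push(...vectors);
  }
  return embeddings;
}
```

Batches run one after another here to stay within rate limits. Running a few in parallel is a reasonable next step once you know your provider's quotas.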

Common Mistakes

Mistake 1: Embedding entire documents
  → Problem: Long text dilutes the embedding signal
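  → Fix: Chunk into 200-800 token segments (the chunking examples above count characters, so scale the limits: English text averages roughly 4 characters per token)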

Mistake 2: Not cleaning text before embedding
  → Problem: HTML tags, navigation text pollute embeddings
  → Fix: Strip HTML, remove boilerplate, keep content only

Mistake 3: Mixing embedding models
  → Problem: Vectors from different models aren't comparable
  → Fix: Re-embed everything if you change models

Mistake 4: No metadata filtering
  → Problem: Vector search across all docs when you know the category
  → Fix: Filter by metadata FIRST, then vector search within results

Mistake 5: Not evaluating retrieval quality
  → Problem: You don't know if search results are actually good
  → Fix: Build a test set of queries with expected results, measure recall
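
For Mistake 5, measuring recall can be as small as the sketch below: for each test query, check what fraction of the expected chunk IDs appear in the top k results. The EvalCase shape and the SearchFn signature are assumptions for illustration; wire searchIds to whatever your search function returns.

```typescript
interface EvalCase {
  query: string;
  expectedIds: string[]; // IDs of chunks a good search should return
}

// Hypothetical search signature: returns result IDs, best match first.
type SearchFn = (query: string, k: number) => Promise<string[]>;

// Fraction of expected IDs that appear in the top k retrieved IDs.
function recallAtK(retrievedIds: string[], expectedIds: string[], k: number): number {
  if (expectedIds.length === 0) return 1; // nothing to find counts as a pass
  const topK = new Set(retrievedIds.slice(0, k));
  const hits = expectedIds.filter((id) => topK.has(id)).length;
  return hits / expectedIds.length;
}

// Average recall@k over the whole test set.
async function meanRecallAtK(
  cases: EvalCase[],
  searchIds: SearchFn,
  k: number = 5
): Promise<number> {
  if (cases.length === 0) return 0;
  let total = 0;
  for (const testCase of cases) {
    const retrieved = await searchIds(testCase.query, k);
    total += recallAtK(retrieved, testCase.expectedIds, k);
  }
  return total / cases.length;
}
```

Re-run the same test set whenever you change the chunking strategy, the embedding model, or the minScore threshold, so each change is compared against a fixed baseline.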

Embeddings are the foundation, not the product. Get chunking, storage, and retrieval right, and everything built on top — semantic search, RAG, recommendations — will work better. Get them wrong, and no amount of prompt engineering will save you.


