RAG Pipeline: Ingesting Web Content at Scale
Build a production RAG system that ingests web content cleanly. From URL to vector database in minutes.
Alex Rivera
ML Engineer

Retrieval-Augmented Generation (RAG) systems are only as good as their data. Garbage in, garbage out. Here's how to build a web content ingestion pipeline that produces high-quality vectors.
The RAG Ingestion Pipeline
URL → Fetch → Clean → Chunk → Embed → Store → Retrieve → Generate
Step 1: Fetch with Tryb
Use Tryb to fetch clean markdown from any URL:
const { markdown, title, url } = await tryb.read(sourceUrl, {
  clean_with_ai: true,
  use_cache: true
});
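If you're ingesting many URLs, Tryb's batch endpoint (used in the complete pipeline below) fetches them in a single call instead of N sequential requests.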
Step 2: Intelligent Chunking
Don't chunk by raw character count; chunk along semantic boundaries so each chunk is a self-contained unit of meaning:
function chunkByHeadings(markdown: string, maxTokens = 500) {
  // Split at H1-H3 headings, keeping the heading with its section
  const sections = markdown.split(/(?=^#{1,3} )/m);
  const chunks: string[] = [];
  for (const section of sections) {
    if (countTokens(section) <= maxTokens) {
      chunks.push(section);
    } else {
      // Split long sections by paragraph
      chunks.push(...splitByParagraph(section, maxTokens));
    }
  }
  return chunks;
}
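The chunker leans on two helpers that aren't shown above. Here's a minimal sketch, assuming a rough 4-characters-per-token heuristic (swap in a real tokenizer such as tiktoken for accurate counts):

function countTokens(text: string): number {
  // Rough estimate: ~4 characters per token for English text.
  // Replace with a real tokenizer in production.
  return Math.ceil(text.length / 4);
}

function splitByParagraph(section: string, maxTokens: number): string[] {
  // Greedily pack paragraphs into chunks that stay under the token budget.
  // A single oversized paragraph still becomes one chunk; split further if needed.
  const paragraphs = section.split(/\n{2,}/);
  const chunks: string[] = [];
  let current = '';
  for (const para of paragraphs) {
    if (current && countTokens(current + '\n\n' + para) > maxTokens) {
      chunks.push(current);
      current = para;
    } else {
      current = current ? current + '\n\n' + para : para;
    }
  }
  if (current) chunks.push(current);
  return chunks;
}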
Step 3: Generate Embeddings
const embeddings = await openai.embeddings.create({
  model: "text-embedding-3-small",
  input: chunks
});
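One caveat: the embeddings endpoint caps how many inputs a single request can carry (2,048 per request as of this writing), so large chunk lists need batching. A minimal wrapper; embedAll and the 512 batch size are our own choices, not part of the pipeline above:

async function embedAll(chunks: string[], batchSize = 512): Promise<number[][]> {
  const vectors: number[][] = [];
  for (let i = 0; i < chunks.length; i += batchSize) {
    // Send one sub-batch per request to stay under the input cap
    const { data } = await openai.embeddings.create({
      model: "text-embedding-3-small",
      input: chunks.slice(i, i + batchSize)
    });
    vectors.push(...data.map((d) => d.embedding));
  }
  return vectors;
}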
Step 4: Store in Vector DB
// Using Pinecone; url and title come from the tryb.read call in step 1.
// Upserts go through an index handle, not the top-level client.
await pinecone.index('web-content').upsert(
  chunks.map((chunk, i) => ({
    id: `${url}-${i}`,
    values: embeddings.data[i].embedding,
    metadata: { url, title, chunk, timestamp: Date.now() }
  }))
);
Complete Pipeline Code
import { TrybClient } from '@tryb/sdk';
import { Pinecone } from '@pinecone-database/pinecone';
import OpenAI from 'openai';

const tryb = new TrybClient(process.env.TRYB_API_KEY!);
const pinecone = new Pinecone();
const openai = new OpenAI();

async function ingestUrls(urls: string[]) {
  // Batch fetch with Tryb
  const { results } = await tryb.batch(urls);
  for (const page of results) {
    if (!page.success) continue;

    // Chunk content
    const chunks = chunkByHeadings(page.markdown);
    if (chunks.length === 0) continue; // nothing to embed

    // Embed chunks
    const embeddings = await openai.embeddings.create({
      model: "text-embedding-3-small",
      input: chunks
    });

    // Store in Pinecone
    await pinecone.index('web-content').upsert(
      chunks.map((chunk, i) => ({
        id: `${page.url}-${i}`,
        values: embeddings.data[i].embedding,
        metadata: {
          url: page.url,
          title: page.title,
          chunk,
          ingested_at: new Date().toISOString()
        }
      }))
    );
  }
}
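The diagram above ends with Retrieve → Generate, so here's a minimal query-side sketch to close the loop; the topK value and the retrieve helper are illustrative, not prescribed by the pipeline. Feed the returned chunks into your prompt for the Generate step.

async function retrieve(query: string, topK = 5) {
  // Embed the query with the same model used at ingest time
  const { data } = await openai.embeddings.create({
    model: "text-embedding-3-small",
    input: query
  });

  // Fetch the nearest chunks and return their stored text
  const res = await pinecone.index('web-content').query({
    vector: data[0].embedding,
    topK,
    includeMetadata: true
  });
  return res.matches.map((m) => m.metadata?.chunk as string);
}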
Best Practices
- Deduplicate URLs: Track ingested URLs to avoid duplicates (see the sketch after this list)
- Refresh stale content: Re-ingest URLs older than 7-30 days
- Store raw content: Keep original markdown for re-embedding later
- Add metadata: Include source, date, and relevance scores
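A minimal sketch of the first two practices, using an in-memory map; swap in a database table for real deployments, and note the names here are illustrative:

const SEVEN_DAYS_MS = 7 * 24 * 60 * 60 * 1000;

// url -> last ingest time; persist this in a real system
const ingestedAt = new Map<string, number>();

// Keep only URLs that are new or stale enough to re-ingest
function urlsNeedingIngest(urls: string[]): string[] {
  const now = Date.now();
  return urls.filter((url) => {
    const last = ingestedAt.get(url);
    return last === undefined || now - last > SEVEN_DAYS_MS;
  });
}

// Record a successful ingest so the URL is skipped until it goes stale
function markIngested(url: string) {
  ingestedAt.set(url, Date.now());
}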
Alex Rivera
ML Engineer at Tryb
Alex specializes in LLM optimization and RAG systems.