LLM & RAG
Dec 3, 2024 · 9 min read

RAG Pipeline: Ingesting Web Content at Scale

Build a production RAG system that ingests web content cleanly. From URL to vector database in minutes.

Alex Rivera

ML Engineer

Retrieval-Augmented Generation (RAG) systems are only as good as their data. Garbage in, garbage out. Here's how to build a web content ingestion pipeline that produces high-quality vectors.

The RAG Ingestion Pipeline

URL → Fetch → Clean → Chunk → Embed → Store → Retrieve → Generate

Step 1: Fetch with Tryb

Use Tryb to fetch clean markdown from any URL:

const { markdown, title, url } = await tryb.read(sourceUrl, {
  clean_with_ai: true,
  use_cache: true
});

Step 2: Intelligent Chunking

Don't chunk by character count—chunk by semantic boundaries:

function chunkByHeadings(markdown: string, maxTokens = 500) {
  // Split on h1-h3 headings so each chunk follows a semantic boundary
  const sections = markdown.split(/(?=^#{1,3} )/m);
  const chunks: string[] = [];

  for (const section of sections) {
    if (countTokens(section) <= maxTokens) {
      chunks.push(section);
    } else {
      // Split long sections by paragraph (helpers sketched below)
      chunks.push(...splitByParagraph(section, maxTokens));
    }
  }

  return chunks;
}
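
The countTokens and splitByParagraph helpers aren't defined above. Here's a minimal sketch of both, assuming a rough four-characters-per-token estimate; swap in a real tokenizer (e.g. tiktoken) if you need exact counts:

// Rough token estimate: ~4 characters per token for English text.
// Replace with a real tokenizer (e.g. tiktoken) for exact counts.
function countTokens(text: string): number {
  return Math.ceil(text.length / 4);
}

// Greedily pack paragraphs into chunks that stay under maxTokens.
// An oversized single paragraph is kept whole in this sketch.
function splitByParagraph(section: string, maxTokens: number): string[] {
  const paragraphs = section.split(/\n{2,}/).filter((p) => p.trim().length > 0);
  const chunks: string[] = [];
  let current = '';

  for (const para of paragraphs) {
    const candidate = current ? `${current}\n\n${para}` : para;
    if (countTokens(candidate) <= maxTokens || !current) {
      current = candidate;
    } else {
      chunks.push(current);
      current = para;
    }
  }
  if (current) chunks.push(current);
  return chunks;
}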

Step 3: Generate Embeddings

const embeddings = await openai.embeddings.create({
  model: "text-embedding-3-small",
  input: chunks
});
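
One caveat: the embeddings endpoint caps how many inputs a single request can carry, so batch large chunk lists rather than sending them all at once. A minimal sketch, assuming a conservative batch size of 100 (tune it to your provider's limits):

// Embed chunks in batches to stay under per-request input limits.
// BATCH_SIZE is an assumption; adjust it for your provider.
const BATCH_SIZE = 100;

async function embedChunks(chunks: string[]): Promise<number[][]> {
  const vectors: number[][] = [];
  for (let i = 0; i < chunks.length; i += BATCH_SIZE) {
    const batch = chunks.slice(i, i + BATCH_SIZE);
    const response = await openai.embeddings.create({
      model: "text-embedding-3-small",
      input: batch
    });
    vectors.push(...response.data.map((d) => d.embedding));
  }
  return vectors;
}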

Step 4: Store in Vector DB

// Using Pinecone: upsert one record per chunk into a named index
await pinecone.index('web-content').upsert(
  chunks.map((chunk, i) => ({
    id: `${url}-${i}`,
    values: embeddings.data[i].embedding,
    metadata: { url, title, chunk, timestamp: Date.now() }
  }))
);

Complete Pipeline Code

import { TrybClient } from '@tryb/sdk';
import { Pinecone } from '@pinecone-database/pinecone';
import OpenAI from 'openai';

const tryb = new TrybClient(process.env.TRYB_API_KEY);
const pinecone = new Pinecone();
const openai = new OpenAI();

async function ingestUrls(urls: string[]) {
  // Batch fetch with Tryb
  const { results } = await tryb.batch(urls);
  
  for (const page of results) {
    if (!page.success) continue;
    
    // Chunk content (chunkByHeadings from Step 2)
    const chunks = chunkByHeadings(page.markdown);
    
    // Embed chunks
    const embeddings = await openai.embeddings.create({
      model: "text-embedding-3-small",
      input: chunks
    });
    
    // Store in Pinecone
    await pinecone.index('web-content').upsert(
      chunks.map((chunk, i) => ({
        id: `${page.url}-${i}`,
        values: embeddings.data[i].embedding,
        metadata: { 
          url: page.url, 
          title: page.title, 
          chunk,
          ingested_at: new Date().toISOString()
        }
      }))
    );
  }
}
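
The diagram at the top also covers Retrieve → Generate, which the ingestion code above stops short of. Here's a minimal retrieval sketch, assuming the same 'web-content' index and embedding model; topK and the returned fields are placeholders, not a prescribed setup:

async function retrieve(question: string, topK = 5) {
  // Embed the question with the same model used at ingest time
  const { data } = await openai.embeddings.create({
    model: "text-embedding-3-small",
    input: question
  });

  // Query Pinecone for the nearest chunks, with their metadata
  const results = await pinecone.index('web-content').query({
    vector: data[0].embedding,
    topK,
    includeMetadata: true
  });

  // Return the chunk text and source URL for prompt assembly
  return results.matches.map((m) => ({
    chunk: m.metadata?.chunk as string,
    url: m.metadata?.url as string,
    score: m.score
  }));
}

Feed the returned chunks into your prompt as context, citing each chunk's url so answers stay traceable to their sources.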

Best Practices

  • Deduplicate URLs: Track ingested URLs to avoid duplicates (a bookkeeping sketch follows this list)
  • Refresh stale content: Re-ingest URLs older than 7–30 days
  • Store raw content: Keep original markdown for re-embedding later
  • Add metadata: Include source, date, and relevance scores
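
A minimal bookkeeping sketch for the deduplication and refresh points, assuming an in-memory map; in production you'd persist this in your database of record:

// Track when each URL was last ingested so new URLs are fetched,
// fresh ones are skipped, and stale ones are re-ingested.
// REFRESH_DAYS is an assumption; tune it to how often sources change.
const REFRESH_DAYS = 14;
const lastIngested = new Map<string, number>();

function urlsNeedingIngest(urls: string[]): string[] {
  const cutoff = Date.now() - REFRESH_DAYS * 24 * 60 * 60 * 1000;
  return urls.filter((url) => {
    const last = lastIngested.get(url);
    return last === undefined || last < cutoff; // new or stale
  });
}

function markIngested(urls: string[]) {
  const now = Date.now();
  for (const url of urls) lastIngested.set(url, now);
}

Call urlsNeedingIngest before ingestUrls and markIngested once ingestion succeeds.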

Related Guides

  • LLM Context Window Optimization
  • Building Web-Aware AI Agents

RAG · Vector Database · Embeddings · Tutorial
Alex Rivera

ML Engineer at Tryb

Alex specializes in LLM optimization and RAG systems.

