RAG Pipeline: Ingesting Web Content at Scale
Build a production RAG system that ingests web content cleanly. From URL to vector database in minutes.
Alex Rivera
ML Engineer

Retrieval-Augmented Generation (RAG) systems are only as good as their data. Garbage in, garbage out. Here's how to build a web content ingestion pipeline that produces high-quality vectors.
The RAG Ingestion Pipeline
URL → Fetch → Clean → Chunk → Embed → Store → Retrieve → Generate
Step 1: Fetch with Tryb
Use Tryb to fetch clean markdown from any URL:
const { markdown, title, url } = await tryb.read(sourceUrl, {
  clean_with_ai: true,
  use_cache: true
});
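If you're ingesting many URLs, Tryb's batch endpoint (used in the complete pipeline below) fetches them in a single call instead of N sequential requests.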
Step 2: Intelligent Chunking
Don't chunk by raw character count; chunk along semantic boundaries so each chunk is a self-contained unit of meaning:
function chunkByHeadings(markdown: string, maxTokens = 500) {
  // Split at H1-H3 headings, keeping the heading with its section
  const sections = markdown.split(/(?=^#{1,3} )/m);
  const chunks: string[] = [];
  for (const section of sections) {
    if (countTokens(section) <= maxTokens) {
      chunks.push(section);
    } else {
      // Split long sections by paragraph
      chunks.push(...splitByParagraph(section, maxTokens));
    }
  }
  return chunks;
}
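The chunker leans on two helpers that aren't shown above. Here's a minimal sketch, assuming a rough 4-characters-per-token heuristic (swap in a real tokenizer such as tiktoken for accurate counts):

function countTokens(text: string): number {
  // Rough estimate: ~4 characters per token for English text.
  // Replace with a real tokenizer in production.
  return Math.ceil(text.length / 4);
}

function splitByParagraph(section: string, maxTokens: number): string[] {
  // Greedily pack paragraphs into chunks that stay under the token budget.
  // A single oversized paragraph still becomes one chunk; split further if needed.
  const paragraphs = section.split(/\n{2,}/);
  const chunks: string[] = [];
  let current = '';
  for (const para of paragraphs) {
    if (current && countTokens(current + '\n\n' + para) > maxTokens) {
      chunks.push(current);
      current = para;
    } else {
      current = current ? current + '\n\n' + para : para;
    }
  }
  if (current) chunks.push(current);
  return chunks;
}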
Step 3: Generate Embeddings
const embeddings = await openai.embeddings.create({
  model: "text-embedding-3-small",
  input: chunks
});
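One caveat: the embeddings endpoint caps how many inputs a single request can carry (2,048 per request as of this writing), so large chunk lists need batching. A minimal wrapper; embedAll and the 512 batch size are our own choices, not part of the pipeline above:

async function embedAll(chunks: string[], batchSize = 512): Promise<number[][]> {
  const vectors: number[][] = [];
  for (let i = 0; i < chunks.length; i += batchSize) {
    // Send one sub-batch per request to stay under the input cap
    const { data } = await openai.embeddings.create({
      model: "text-embedding-3-small",
      input: chunks.slice(i, i + batchSize)
    });
    vectors.push(...data.map((d) => d.embedding));
  }
  return vectors;
}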
Step 4: Store in Vector DB
// Using Pinecone; url and title come from the tryb.read call in step 1.
// Upserts go through an index handle, not the top-level client.
await pinecone.index('web-content').upsert(
  chunks.map((chunk, i) => ({
    id: `${url}-${i}`,
    values: embeddings.data[i].embedding,
    metadata: { url, title, chunk, timestamp: Date.now() }
  }))
);
Complete Pipeline Code
import { TrybClient } from '@tryb/sdk';
import { Pinecone } from '@pinecone-database/pinecone';
import OpenAI from 'openai';

const tryb = new TrybClient(process.env.TRYB_API_KEY!);
const pinecone = new Pinecone();
const openai = new OpenAI();

async function ingestUrls(urls: string[]) {
  // Batch fetch with Tryb
  const { results } = await tryb.batch(urls);
  for (const page of results) {
    if (!page.success) continue;

    // Chunk content
    const chunks = chunkByHeadings(page.markdown);
    if (chunks.length === 0) continue; // nothing to embed

    // Embed chunks
    const embeddings = await openai.embeddings.create({
      model: "text-embedding-3-small",
      input: chunks
    });

    // Store in Pinecone
    await pinecone.index('web-content').upsert(
      chunks.map((chunk, i) => ({
        id: `${page.url}-${i}`,
        values: embeddings.data[i].embedding,
        metadata: {
          url: page.url,
          title: page.title,
          chunk,
          ingested_at: new Date().toISOString()
        }
      }))
    );
  }
}
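The diagram above ends with Retrieve → Generate, so here's a minimal query-side sketch to close the loop; the topK value and the retrieve helper are illustrative, not prescribed by the pipeline. Feed the returned chunks into your prompt for the Generate step.

async function retrieve(query: string, topK = 5) {
  // Embed the query with the same model used at ingest time
  const { data } = await openai.embeddings.create({
    model: "text-embedding-3-small",
    input: query
  });

  // Fetch the nearest chunks and return their stored text
  const res = await pinecone.index('web-content').query({
    vector: data[0].embedding,
    topK,
    includeMetadata: true
  });
  return res.matches.map((m) => m.metadata?.chunk as string);
}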
Best Practices
- Deduplicate URLs: Track ingested URLs to avoid duplicates (see the sketch after this list)
- Refresh stale content: Re-ingest URLs older than 7-30 days
- Store raw content: Keep original markdown for re-embedding later
- Add metadata: Include source, date, and relevance scores
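A minimal sketch of the first two practices, using an in-memory map; swap in a database table for real deployments, and note the names here are illustrative:

const SEVEN_DAYS_MS = 7 * 24 * 60 * 60 * 1000;

// url -> last ingest time; persist this in a real system
const ingestedAt = new Map<string, number>();

// Keep only URLs that are new or stale enough to re-ingest
function urlsNeedingIngest(urls: string[]): string[] {
  const now = Date.now();
  return urls.filter((url) => {
    const last = ingestedAt.get(url);
    return last === undefined || now - last > SEVEN_DAYS_MS;
  });
}

// Record a successful ingest so the URL is skipped until it goes stale
function markIngested(url: string) {
  ingestedAt.set(url, Date.now());
}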
Alex Rivera
ML Engineer at Tryb
Alex specializes in LLM optimization and RAG systems.