LLM Context Window Optimization: Stop Wasting Tokens on HTML
Raw HTML wastes 60-80% of your LLM context on junk. Here's how to extract only the content that matters.
Alex Rivera
ML Engineer

Every token counts. When you feed raw HTML to an LLM, you're paying for navigation menus, cookie notices, and tracking scripts alongside the content you actually want. Below, we walk through three extraction strategies and compare the results.
The Token Waste Problem
Analyze any modern webpage and you'll find:
- 20-30%: Actual article content
- 25-35%: Navigation and header/footer
- 15-25%: Ads and promotional content
- 10-20%: Scripts, styles, and metadata
- 5-10%: Cookie banners and popups
A 10,000-token page might contain only 2,500 tokens of useful content. At GPT-4 input pricing of roughly $0.01 per 1K tokens, that's about $0.075 wasted per page on junk.
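To make the arithmetic explicit, here's a minimal sketch of that estimate. The $0.01-per-1K-tokens rate is an assumption matching the figures above; substitute your model's actual input price.

```ts
// Estimate tokens and dollars wasted on non-content markup for one page.
// Assumes ~$0.01 per 1K input tokens; adjust for your model's pricing.
function estimateWaste(totalTokens: number, contentTokens: number, usdPer1kTokens = 0.01) {
  const wastedTokens = totalTokens - contentTokens;
  return { wastedTokens, wastedUsd: (wastedTokens / 1000) * usdPer1kTokens };
}

// A 10,000-token page with 2,500 tokens of real content:
console.log(estimateWaste(10_000, 2_500)); // { wastedTokens: 7500, wastedUsd: 0.075 }
```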
Content Extraction Strategies
Strategy 1: DOM-Based Extraction
Simple but unreliable: look for content in `<article>`, `<main>`, or site-specific CSS classes:

```ts
// Grab the first <article> element's text, if the page has one
const content = document.querySelector("article")?.textContent;
```
Problem: different sites use different structures, so the selectors quickly become a maintenance nightmare.
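In practice, this usually turns into a hand-maintained map of selectors, one entry per site, that silently breaks whenever a site redesigns. A hypothetical sketch (the domains and selectors are made up):

```ts
// Hypothetical per-site selector map: every new source needs a new entry,
// and every redesign silently breaks the entry for that site.
const SELECTORS: Record<string, string> = {
  "example-news.com": "article .story-body",
  "some-blog.example": "main #post-content",
  "docs.example.org": ".markdown-body",
};

function extractContent(doc: Document, hostname: string): string | null {
  const selector = SELECTORS[hostname] ?? "article, main";
  return doc.querySelector(selector)?.textContent ?? null;
}
```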
Strategy 2: Readability Algorithms
Mozilla's Readability and similar tools score DOM elements by text density:
```ts
import { Readability } from "@mozilla/readability";

// Readability scores the DOM and returns the article it finds (or null)
const article = new Readability(document).parse();
```
Problem: it still lets some junk through and has no notion of semantic importance.
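If you're extracting server-side rather than in a browser, Readability needs a DOM to work on. A sketch using jsdom, assuming the page HTML has already been fetched:

```ts
import { JSDOM } from "jsdom";
import { isProbablyReaderable, Readability } from "@mozilla/readability";

function extractArticle(html: string, url: string) {
  // Build a DOM from the fetched HTML so Readability can score it
  const { document } = new JSDOM(html, { url }).window;

  // Cheap pre-check: skip pages Readability is unlikely to handle well
  if (!isProbablyReaderable(document)) return null;

  // Returns { title, content, textContent, excerpt, ... } or null
  return new Readability(document).parse();
}
```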
Strategy 3: AI-Powered Extraction (Tryb)
Use an LLM to identify and extract only the main content:
```ts
// One call: fetch the page, strip boilerplate with an LLM, return clean Markdown
const { markdown } = await tryb.read(url, { clean_with_ai: true });
```
Benefit: 95%+ accuracy, understands context, outputs clean Markdown.
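Once you have clean Markdown, the downstream call is simple: put the extracted content straight into the prompt instead of raw HTML. A sketch using the OpenAI Node SDK (the model and prompt here are just placeholders):

```ts
import OpenAI from "openai";

const openai = new OpenAI();

async function summarize(url: string) {
  // Extract only the main content, then prompt on that instead of raw HTML
  const { markdown } = await tryb.read(url, { clean_with_ai: true });

  const completion = await openai.chat.completions.create({
    model: "gpt-4o-mini", // placeholder; use whichever model you target
    messages: [
      { role: "system", content: "Summarize the following article in three bullet points." },
      { role: "user", content: markdown },
    ],
  });

  return completion.choices[0].message.content;
}
```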
Before/After Comparison
| Metric | Raw HTML | Tryb Cleaned |
|---|---|---|
| Token count | 12,450 | 2,890 |
| Content ratio | 23% | 98% |
| GPT-4 cost | $0.12 | $0.03 |
| Response quality | Poor (confused by junk) | Excellent |
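Exact counts vary from page to page; you can reproduce the comparison for your own URLs by counting tokens on both versions. A sketch assuming the js-tiktoken package:

```ts
import { encodingForModel } from "js-tiktoken";

// Count tokens the way the target model's tokenizer would
const enc = encodingForModel("gpt-4");
const tokenCount = (text: string) => enc.encode(text).length;

async function compare(url: string) {
  const rawHtml = await (await fetch(url)).text();
  const { markdown } = await tryb.read(url, { clean_with_ai: true });

  console.log("raw HTML tokens:  ", tokenCount(rawHtml));
  console.log("cleaned MD tokens:", tokenCount(markdown));
}
```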
RAG Pipeline Integration
For RAG systems, clean extraction is even more critical. Junk content pollutes your vector database and degrades retrieval quality.
```ts
// RAG ingestion pipeline: extract clean content, chunk it, embed it, store it
async function ingestUrl(url: string) {
  // 1. Extract clean content as Markdown
  const { markdown, title } = await tryb.read(url);

  // 2. Chunk intelligently (by section)
  const chunks = chunkByHeadings(markdown);

  // 3. Embed each chunk and store it alongside its source metadata
  for (const chunk of chunks) {
    const { data } = await openai.embeddings.create({
      model: "text-embedding-3-small", // any embedding model works here
      input: chunk,
    });
    await vectorDb.insert({ url, title, chunk, embedding: data[0].embedding });
  }
}
```
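The pipeline assumes a `chunkByHeadings` helper that isn't defined above. One minimal way to implement it, assuming standard Markdown headings, is to split at heading lines so each chunk covers a single section:

```ts
// Split Markdown into chunks at heading lines (#, ##, ### ...),
// keeping each heading together with the body text that follows it.
function chunkByHeadings(markdown: string): string[] {
  const chunks: string[] = [];
  let current: string[] = [];

  for (const line of markdown.split("\n")) {
    if (/^#{1,6}\s/.test(line) && current.length > 0) {
      chunks.push(current.join("\n").trim());
      current = [];
    }
    current.push(line);
  }
  if (current.length > 0) chunks.push(current.join("\n").trim());

  return chunks.filter((chunk) => chunk.length > 0);
}
```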
Start Optimizing Today
Try the Tryb Playground to see the difference clean extraction makes for your LLM applications.

Alex Rivera
ML Engineer at Tryb
Alex specializes in LLM optimization and RAG systems. Former research engineer at Anthropic.


