Get Started

Production-grade web extraction infrastructure for AI applications

Thunderbit Open API turns any web page into clean, structured data your LLMs can actually use — while transparently handling JavaScript rendering, anti-bot protection, geo-routing, and proxy rotation.

Why Thunderbit

| Pain point | Without Thunderbit | With Thunderbit |
| --- | --- | --- |
| JavaScript-heavy SPAs | Self-host headless Chrome, debug timeouts, watch memory leaks | renderMode: "full" |
| CAPTCHA / bot walls | Rotate proxies, solve puzzles, watch IPs burn | We absorb it |
| Geo-blocked content | Manage proxy pools per country | countryCode: "DE" |
| HTML noise (ads, nav, popups) | Hand-write per-site readability heuristics | Auto-stripped Markdown |
| Structured extraction | Train extractors, maintain CSS selectors that break weekly | JSON Schema → JSON output |
| Scaling to 10k+ URLs | Build your own queue, retry, dedupe, status board | Batch endpoint + webhook |
| LLM token costs | Feed the model raw HTML and pay for it | Pre-distilled Markdown, 5–10× fewer tokens |

Three core endpoints

🔥 Distill — page → clean Markdown

curl -X POST https://openapi.thunderbit.com/openapi/v1/distill \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"url": "https://example.com/article"}'

Returns LLM-ready Markdown with metadata stripped. The result uses 5–10× fewer tokens than the raw HTML.
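For callers working from Python rather than the shell, the same request can be sketched with only the standard library. This is a minimal sketch, not an official client: the helper names `build_distill_request` and `distill` are hypothetical, and because the response layout is not described in this section, the parsed JSON body is returned unchanged.

```python
import json
import urllib.request
from typing import Any

DISTILL_URL = "https://openapi.thunderbit.com/openapi/v1/distill"


def build_distill_request(page_url: str, api_key: str) -> urllib.request.Request:
    """Build the same POST request as the curl example (hypothetical helper)."""
    body = json.dumps({"url": page_url}).encode("utf-8")
    return urllib.request.Request(
        DISTILL_URL,
        data=body,
        method="POST",
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
    )


def distill(page_url: str, api_key: str, timeout: float = 60.0) -> Any:
    """Send the request and return the parsed JSON body as-is.

    The response fields are not documented in this section, so no
    particular key is assumed here.
    """
    request = build_distill_request(page_url, api_key)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)
```

Keeping request construction separate from sending makes the headers and body easy to inspect before any network call is made.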

🧠 Extract — JSON Schema → structured fields

curl -X POST https://openapi.thunderbit.com/openapi/v1/extract \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://example.com/product",
    "schema": {
      "type": "object",
      "properties": {
        "name":  { "type": "string" },
        "price": { "type": "number" }
      },
      "required": ["name", "price"]
    }
  }'

The AI reads the description attached to each field in your schema, so be specific: "product MSRP in USD before discount" beats "price".
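As an illustration of that advice, the schema from the request above can carry a standard JSON Schema `description` on each field. The sketch below only builds the request body. The description wording is illustrative, and `build_extract_payload` is a hypothetical helper; only the `url` and `schema` keys come from the request shown earlier.

```python
import json

# Same fields as the curl example, with a JSON Schema "description"
# added to each property. The wording of each description is illustrative.
PRODUCT_SCHEMA = {
    "type": "object",
    "properties": {
        "name": {
            "type": "string",
            "description": "Product title as shown on the page, without the brand tagline",
        },
        "price": {
            "type": "number",
            "description": "Product MSRP in USD before discount",
        },
    },
    "required": ["name", "price"],
}


def build_extract_payload(page_url: str, schema: dict) -> str:
    """Serialize the request body for the extract endpoint (hypothetical helper)."""
    return json.dumps({"url": page_url, "schema": schema}, indent=2)


if __name__ == "__main__":
    # Prints the JSON body that would be passed to curl's -d flag.
    print(build_extract_payload("https://example.com/product", PRODUCT_SCHEMA))
```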

⚡ Batch — up to 100 URLs, async with webhooks

curl -X POST https://openapi.thunderbit.com/openapi/v1/batch/distill \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "urls": ["https://example.com/page1", "https://example.com/page2"],
    "webhook": {
      "url":    "https://your-server.com/webhook/distill",
      "secret": "whsec_your_secret_key"
    }
  }'

Submit the batch, wait for the webhook to fire, then fetch the results. See Batch Lifecycle.
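Because a batch accepts up to 100 URLs, a longer list has to be split across several submissions. The following is a minimal sketch of that split, assuming the limit applies per request. `chunk_urls` and `build_batch_payloads` are hypothetical helpers, and removing duplicates before splitting is a design choice here, not an API requirement.

```python
import json
from typing import Iterator

MAX_URLS_PER_BATCH = 100  # limit stated for the batch endpoint


def chunk_urls(urls: list[str], size: int = MAX_URLS_PER_BATCH) -> Iterator[list[str]]:
    """Yield consecutive slices of at most `size` URLs, dropping duplicates first."""
    if size < 1:
        raise ValueError("size must be at least 1")
    unique = list(dict.fromkeys(urls))  # preserves first-seen order
    for start in range(0, len(unique), size):
        yield unique[start:start + size]


def build_batch_payloads(urls: list[str], webhook_url: str, webhook_secret: str) -> list[str]:
    """Return one JSON request body per batch, mirroring the curl example."""
    webhook = {"url": webhook_url, "secret": webhook_secret}
    return [
        json.dumps({"urls": chunk, "webhook": webhook})
        for chunk in chunk_urls(urls)
    ]
```

Each returned body can then be POSTed to the batch/distill endpoint with the same headers as the curl example.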

Resources