Recipes

RAG Knowledge Base

Build a vector store from a documentation site using batch distill

Turn any documentation site into a searchable RAG knowledge base. Submit URLs in batch, poll until done (or use webhooks), then index the resulting Markdown into your vector store.

Flow

  1. Discover URLs to ingest (sitemap, crawl, or curated list)
  2. Submit them in a single /batch/distill job
  3. Wait for completion (poll or webhook)
  4. Embed each result's markdown into your vector store
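
For step 1, one way to discover URLs is to read the site's sitemap. The sketch below is illustrative only: it assumes the site publishes a standard sitemaps.org XML sitemap, and the helper name urls_from_sitemap, the prefix filter, and the sample XML are all hypothetical. It parses the XML with the standard library and returns de-duplicated page URLs.

```python
import xml.etree.ElementTree as ET

# Standard sitemap namespace defined by the sitemaps.org protocol.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def urls_from_sitemap(xml_text, prefix=""):
    """Return the <loc> URLs in a sitemap document, optionally filtered by prefix."""
    root = ET.fromstring(xml_text)
    urls = []
    for loc in root.findall(".//sm:loc", SITEMAP_NS):
        url = (loc.text or "").strip()
        if url and url.startswith(prefix):
            urls.append(url)
    # Drop duplicates while keeping the original order.
    return list(dict.fromkeys(urls))


# Hypothetical sitemap content; in practice, fetch it from the site's sitemap URL.
sample = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://docs.example.com/guide/intro</loc></url>
  <url><loc>https://docs.example.com/guide/setup</loc></url>
  <url><loc>https://docs.example.com/blog/news</loc></url>
</urlset>"""

urls = urls_from_sitemap(sample, prefix="https://docs.example.com/guide/")
```

Note that a sitemap index file also uses <loc> elements, so this helper would return the child sitemap URLs rather than page URLs in that case; those would need to be fetched and parsed in turn.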

Implementation

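import time

import httpx

API = "https://openapi.thunderbit.com/openapi/v1"
H = {"Authorization": "Bearer YOUR_API_KEY"}


def embed_and_store(url, markdown):
    """Placeholder: chunk, embed, and upsert into your own vector store."""
    raise NotImplementedError


# Submit all URLs as a single batch job.
urls = [f"https://docs.example.com/page-{i}" for i in range(50)]
resp = httpx.post(f"{API}/batch/distill",
                  headers=H,
                  json={"urls": urls, "include": ["metadata"]},
                  timeout=30)
resp.raise_for_status()
batch_id = resp.json()["data"]["id"]

# Poll every 10 seconds until the job reaches a terminal state.
while True:
    resp = httpx.get(f"{API}/batch/distill/{batch_id}", headers=H, timeout=30)
    resp.raise_for_status()
    status = resp.json()["data"]
    if status["status"] in ("COMPLETED", "FAILED", "CANCELLED"):
        break
    time.sleep(10)

# Index only the pages that were distilled successfully.
for r in status.get("results", []):
    if r["status"] == "SUCCEEDED":
        embed_and_store(r["url"], r["markdown"])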
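
The embed_and_store call above is left to the reader. As a rough sketch of the first half of that step, the hypothetical helper below splits a result's Markdown into heading-based chunks and prefixes each chunk with a title. The name chunk_markdown, the max_chars limit, and the way the title is passed in are assumptions for illustration, not part of the API.

```python
import re


def chunk_markdown(markdown, title="", max_chars=1500):
    """Split Markdown at headings, then cap each chunk at max_chars characters."""
    # Split just before every line that starts with 1-6 '#' characters and a space.
    sections = re.split(r"^(?=#{1,6} )", markdown, flags=re.MULTILINE)
    chunks = []
    for section in sections:
        section = section.strip()
        if not section:
            continue
        # Oversized sections are cut into fixed-size pieces.
        for start in range(0, len(section), max_chars):
            piece = section[start:start + max_chars]
            # Prefix the page title so each chunk keeps some context.
            chunks.append(f"{title}\n\n{piece}" if title else piece)
    return chunks


doc = "# Intro\nHello.\n\n## Setup\nInstall it.\n"
chunks = chunk_markdown(doc, title="Example Docs")
```

Each chunk can then be passed to whatever embedding model and vector store you use. This simple splitter also breaks on lines beginning with "# " inside code blocks, so a production chunker would likely need to track fenced regions.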

Tips

  • Use include: ["metadata"] so each result carries title / description for chunk headers
  • For 100+ URLs, prefer webhooks over polling; see Webhooks
  • Re-running on the same URLs is fine; bypass cache with forceRefresh: true if content changes often
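
Where polling is kept for smaller batches, it may help to bound the wait and back off between requests. The sketch below is one possible approach, not an official client: wait_for_batch and its parameters are hypothetical, and fetch_status is any callable you supply that returns the "data" object from the batch status endpoint.

```python
import time

# Terminal states used by the polling loop in this recipe.
TERMINAL = ("COMPLETED", "FAILED", "CANCELLED")


def wait_for_batch(fetch_status, timeout_s=600, interval_s=10,
                   max_interval_s=60, sleep=time.sleep, clock=time.monotonic):
    """Poll fetch_status() until a terminal status, or raise after timeout_s."""
    deadline = clock() + timeout_s
    while True:
        data = fetch_status()
        if data["status"] in TERMINAL:
            return data
        if clock() >= deadline:
            raise TimeoutError("batch did not finish before the deadline")
        sleep(interval_s)
        # Back off gradually so long-running jobs generate fewer requests.
        interval_s = min(interval_s * 1.5, max_interval_s)
```

With the variables from the implementation above, a call might look like wait_for_batch(lambda: httpx.get(f"{API}/batch/distill/{batch_id}", headers=H).json()["data"]). The sleep and clock parameters exist only so the loop can be exercised in tests without real delays.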

This recipe is being expanded with vector store wiring (Pinecone / Weaviate / pgvector) — check back soon.