RAG Knowledge Base
Build a vector store from a documentation site using batch distill
Turn any documentation site into a searchable RAG knowledge base. Submit URLs in batch, poll until done (or use webhooks), then index the resulting Markdown into your vector store.
Flow
- Discover URLs to ingest (sitemap, crawl, or curated list)
- Submit them in a single /batch/distill job
- Wait for completion (poll or webhook)
- Embed each result's markdown field into your vector store
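For the first step, a sitemap is often the quickest source of URLs. The following is a minimal sketch, assuming the site publishes a standard sitemaps.org `urlset` file; the helper name `urls_from_sitemap` and the sample XML are illustrative and not part of the API.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org 0.9 protocol
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def urls_from_sitemap(xml_text: str, prefix: str = "") -> list[str]:
    """Return the <loc> URLs of a sitemap, optionally filtered by prefix."""
    root = ET.fromstring(xml_text)
    urls = []
    for loc in root.findall("sm:url/sm:loc", SITEMAP_NS):
        url = (loc.text or "").strip()
        if url and url.startswith(prefix):
            urls.append(url)
    # Drop duplicates while keeping the sitemap's order
    return list(dict.fromkeys(urls))


# Hypothetical sitemap content, inlined so the example runs offline
sample = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://docs.example.com/guide/intro</loc></url>
  <url><loc>https://docs.example.com/guide/setup</loc></url>
  <url><loc>https://docs.example.com/blog/news</loc></url>
  <url><loc>https://docs.example.com/guide/intro</loc></url>
</urlset>"""

urls = urls_from_sitemap(sample, prefix="https://docs.example.com/guide/")
print(urls)
```

The resulting list can be passed as the urls field of the /batch/distill request. Filtering by prefix keeps unrelated sections (a blog, a changelog) out of the knowledge base.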
Implementation
import time

import httpx

API = "https://openapi.thunderbit.com/openapi/v1"
H = {"Authorization": "Bearer YOUR_API_KEY"}

urls = [f"https://docs.example.com/page-{i}" for i in range(50)]

# Submit all URLs as one batch job
job = httpx.post(f"{API}/batch/distill",
                 headers=H,
                 json={"urls": urls, "include": ["metadata"]}).json()
batch_id = job["data"]["id"]

# Poll every 10 seconds until the job reaches a terminal state
while True:
    status = httpx.get(f"{API}/batch/distill/{batch_id}", headers=H).json()["data"]
    if status["status"] in ("COMPLETED", "FAILED", "CANCELLED"):
        break
    time.sleep(10)

# Index only the pages that were distilled successfully.
# embed_and_store is a placeholder for your own embedding and storage code.
for r in status["results"]:
    if r["status"] == "SUCCEEDED":
        embed_and_store(r["url"], r["markdown"])
Tips
- Use include: ["metadata"] so each result carries title / description for chunk headers
- For 100+ URLs, prefer webhooks over polling (see Webhooks)
- Re-running on the same URLs is fine; bypass cache with forceRefresh: true if content changes often
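The implementation calls embed_and_store without defining it. The following is one possible sketch, not the recipe's official wiring: it splits each page's Markdown into paragraph-based chunks, prefixes every chunk with the page title as a chunk header, and appends the chunks to an in-memory list. `chunk_markdown`, `toy_embed`, and `STORE` are hypothetical names; `toy_embed` is a hashing placeholder standing in for a real embedding model, and `STORE` stands in for a real vector store. Where the title sits inside a result is not shown in this recipe, so it is passed in as a plain argument.

```python
import hashlib
import math

STORE: list[dict] = []  # stand-in for a real vector store


def chunk_markdown(markdown: str, title: str = "", max_chars: int = 1000) -> list[str]:
    """Group blank-line-separated paragraphs into chunks of about max_chars,
    prefixing each chunk with the page title when one is given."""
    chunks, current = [], ""
    for para in (p.strip() for p in markdown.split("\n\n")):
        if not para:
            continue
        # Start a new chunk when adding this paragraph would exceed the limit
        if current and len(current) + len(para) + 2 > max_chars:
            chunks.append(current)
            current = ""
        current = f"{current}\n\n{para}" if current else para
    if current:
        chunks.append(current)
    header = f"{title}\n\n" if title else ""
    return [header + c for c in chunks]


def toy_embed(text: str, dim: int = 8) -> list[float]:
    """Placeholder embedding: hashes words into a small unit-length vector.
    Replace with a call to a real embedding model."""
    vec = [0.0] * dim
    for word in text.lower().split():
        h = int(hashlib.sha256(word.encode()).hexdigest(), 16)
        vec[h % dim] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]


def embed_and_store(url: str, markdown: str, title: str = "") -> int:
    """Chunk one page, embed each chunk, and append it to STORE.
    Returns the number of chunks stored."""
    chunks = chunk_markdown(markdown, title)
    for i, chunk in enumerate(chunks):
        STORE.append({"id": f"{url}#{i}", "url": url,
                      "text": chunk, "vector": toy_embed(chunk)})
    return len(chunks)


n = embed_and_store("https://docs.example.com/page-0",
                    "# Intro\n\nFirst paragraph.\n\nSecond paragraph.",
                    title="Intro")
print(n, STORE[0]["id"])
```

A single paragraph longer than max_chars is kept as one oversized chunk in this sketch. Deriving each chunk's id from the URL and the chunk index means that re-running the batch on the same URLs can overwrite earlier entries in a store that supports upserts, rather than duplicating them.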
Related
This recipe is being expanded with vector store wiring (Pinecone / Weaviate / pgvector) — check back soon.