RAG Knowledge Base

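Turn any documentation site into a searchable RAG knowledge base. Submit the URLs as a batch, poll until the job finishes (or use a webhook), then index the resulting Markdown in your vector store.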

Workflow

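Collect the URLs to ingest (from a sitemap, a crawler, or a hand-curated list)
Submit them in a single /batch/distill job
Wait for completion (polling or a webhook)
Embed each result's markdown into your vector store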
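
For the first step, a sitemap is often the quickest source of URLs. The following is a minimal sketch, assuming the site publishes a standard sitemaps.org `urlset` file; the helper name `urls_from_sitemap` and its `prefix` filter are illustrative and not part of the API.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def urls_from_sitemap(xml_text: str, prefix: str = "") -> list[str]:
    """Return the <loc> URLs of a sitemap, de-duplicated, in document order.

    Only URLs starting with `prefix` are kept (an empty prefix keeps all).
    """
    root = ET.fromstring(xml_text)
    seen: set[str] = set()
    urls: list[str] = []
    for loc in root.findall("sm:url/sm:loc", SITEMAP_NS):
        url = (loc.text or "").strip()
        if url and url.startswith(prefix) and url not in seen:
            seen.add(url)
            urls.append(url)
    return urls
```

The returned list can be passed as the `urls` field of the batch request. Sitemap index files, which point to further sitemaps, are not handled by this sketch.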

Implementation

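import time

import httpx

API = "https://openapi.thunderbit.com/openapi/v1"
H = {"Authorization": "Bearer YOUR_API_KEY"}


def embed_and_store(url: str, markdown: str) -> None:
    """Placeholder: replace with your own chunk, embed, and upsert logic."""
    raise NotImplementedError


# Submit all URLs as a single batch job.
urls = [f"https://docs.example.com/page-{i}" for i in range(50)]
resp = httpx.post(f"{API}/batch/distill",
                  headers=H,
                  json={"urls": urls, "include": ["metadata"]})
resp.raise_for_status()  # fail early on HTTP errors such as a bad API key
batch_id = resp.json()["data"]["id"]

# Poll every 10 seconds until the job reaches a terminal state.
while True:
    status = httpx.get(f"{API}/batch/distill/{batch_id}", headers=H).json()["data"]
    if status["status"] in ("COMPLETED", "FAILED", "CANCELLED"):
        break
    time.sleep(10)

# Index only the pages that were distilled successfully.
for r in status["results"]:
    if r["status"] == "SUCCEEDED":
        embed_and_store(r["url"], r["markdown"])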
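
The polling loop above runs until the job ends, however long that takes. A possible variation, sketched here with a hypothetical helper `wait_for_batch`, adds an overall deadline. The status fetch is passed in as a callable, so the helper makes no assumptions about the API beyond the three terminal status values used above; the default interval and timeout are arbitrary.

```python
import time
from typing import Callable

# Terminal states, taken from the polling loop above.
TERMINAL = {"COMPLETED", "FAILED", "CANCELLED"}


def wait_for_batch(fetch_status: Callable[[], dict],
                   interval: float = 10.0,
                   timeout: float = 1800.0,
                   sleep: Callable[[float], None] = time.sleep,
                   clock: Callable[[], float] = time.monotonic) -> dict:
    """Call fetch_status() until it reports a terminal status.

    Raises TimeoutError if the job is still running once `timeout`
    seconds have elapsed. `sleep` and `clock` are injectable for testing.
    """
    deadline = clock() + timeout
    while True:
        status = fetch_status()
        if status["status"] in TERMINAL:
            return status
        if clock() >= deadline:
            raise TimeoutError(
                f"batch still {status['status']} after {timeout}s")
        sleep(interval)
```

With the names from the script above, the loop could then be replaced by `status = wait_for_batch(lambda: httpx.get(f"{API}/batch/distill/{batch_id}", headers=H).json()["data"])`.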

Tips

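Pass include: ["metadata"] so that each result carries a title and description you can use as a chunk header
For 100 or more URLs, prefer a webhook over polling; see Webhooks
Re-running the same URLs is fine; if the content changes often, set forceRefresh: true to bypass the cache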
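
To illustrate the first tip, here is one way the title and description might be used as a chunk header. It is only a sketch: the function name `chunk_with_header`, the heading-based split, and the `max_chars` limit are all assumptions. The title and description are passed in explicitly because this page does not specify where they sit in the result object.

```python
import re


def chunk_with_header(markdown: str, title: str = "", description: str = "",
                      max_chars: int = 1500) -> list[str]:
    """Split markdown at headings and prefix each chunk with title/description.

    Note: this naive split also triggers on '# ' lines inside code blocks.
    """
    header = "\n".join(part for part in (title, description) if part)
    # Split before every line that starts with 1-6 '#' characters and a space.
    sections = re.split(r"(?m)^(?=#{1,6} )", markdown)
    chunks: list[str] = []
    for section in sections:
        section = section.strip()
        # Hard-wrap oversized sections so no chunk body exceeds max_chars.
        for start in range(0, len(section), max_chars):
            body = section[start:start + max_chars]
            chunks.append(f"{header}\n\n{body}" if header else body)
    return chunks
```

Repeating the page-level header in every chunk gives each embedded chunk some context about the page it came from, at the cost of a few extra tokens per chunk.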
