RAG Knowledge Base

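Turn any documentation site into a searchable RAG knowledge base. Submit the URLs as a batch, poll until the job finishes (or use a Webhook), then index the generated Markdown into your vector store.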

Workflow

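Identify the URLs to ingest (sitemap, crawl, or a hand-curated list)
Submit them all at once in a single /batch/distill job
Wait for the job to finish (polling or Webhook)
Embed the markdown of each result into your vector store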
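
For the first step, a sitemap is often the quickest source of URLs. The following is a minimal sketch, assuming the site publishes a standard sitemaps.org `urlset` file; the helper name `urls_from_sitemap` and the sample XML are illustrative and not part of the API. It parses an already downloaded sitemap with the standard library only.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def urls_from_sitemap(xml_text: str, prefix: str = "") -> list[str]:
    """Return the <loc> values of a sitemap, optionally filtered by URL prefix."""
    root = ET.fromstring(xml_text)
    locs = [el.text.strip() for el in root.findall(".//sm:loc", SITEMAP_NS) if el.text]
    # Drop duplicates while preserving the original order.
    unique = dict.fromkeys(locs)
    return [u for u in unique if u.startswith(prefix)]


# Hypothetical sitemap content, inlined so the example runs offline.
sample = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://docs.example.com/page-0</loc></url>
  <url><loc>https://docs.example.com/page-1</loc></url>
  <url><loc>https://docs.example.com/page-1</loc></url>
  <url><loc>https://docs.example.com/blog/post</loc></url>
</urlset>"""

urls = urls_from_sitemap(sample, prefix="https://docs.example.com/page-")
print(urls)
```

Note that a sitemap index file also uses `<loc>` elements, but they point at further sitemaps rather than pages, so such a file would need one more level of fetching before its entries are submitted.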

Implementation

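import time

import httpx

API = "https://openapi.thunderbit.com/openapi/v1"
H = {"Authorization": "Bearer YOUR_API_KEY"}


def embed_and_store(url, markdown):
    """Placeholder: chunk, embed, and write to your own vector store."""
    raise NotImplementedError


# Submit every URL in a single batch job.
urls = [f"https://docs.example.com/page-{i}" for i in range(50)]
resp = httpx.post(f"{API}/batch/distill",
                  headers=H,
                  json={"urls": urls, "include": ["metadata"]})
resp.raise_for_status()
batch_id = resp.json()["data"]["id"]

# Poll every 10 seconds until the job reaches a terminal state.
while True:
    resp = httpx.get(f"{API}/batch/distill/{batch_id}", headers=H)
    resp.raise_for_status()
    status = resp.json()["data"]
    if status["status"] in ("COMPLETED", "FAILED", "CANCELLED"):
        break
    time.sleep(10)

# Index only the pages that were distilled successfully.
for r in status.get("results", []):
    if r["status"] == "SUCCEEDED":
        embed_and_store(r["url"], r["markdown"])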
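
The snippet above leaves `embed_and_store` to the reader. One possible shape is sketched below, under loud assumptions: `chunk_markdown`, `placeholder_embed`, and the in-memory `STORE` list are all hypothetical stand-ins, and a real pipeline would swap in an actual embedding model and vector store client. The sketch splits each page at Markdown headings and caps chunk length by character count.

```python
import re

STORE: list[dict] = []  # stand-in for a real vector store


def chunk_markdown(markdown: str, max_chars: int = 1500) -> list[str]:
    """Split Markdown at headings, then cap each piece at max_chars."""
    # Split just before every line that starts with 1-6 '#' and a space.
    sections = re.split(r"(?m)^(?=#{1,6} )", markdown)
    chunks = []
    for section in sections:
        section = section.strip()
        # Slice long sections into fixed-size windows; empty sections yield nothing.
        for start in range(0, len(section), max_chars):
            chunks.append(section[start:start + max_chars])
    return chunks


def placeholder_embed(text: str) -> list[float]:
    """Placeholder only: replace with a call to a real embedding model."""
    return [float(len(text))]


def embed_and_store(url: str, markdown: str, embed=placeholder_embed) -> int:
    """Chunk one page, embed each chunk, and append the records to STORE."""
    chunks = chunk_markdown(markdown)
    for i, chunk in enumerate(chunks):
        STORE.append({"id": f"{url}#{i}", "url": url, "text": chunk, "vector": embed(chunk)})
    return len(chunks)


n = embed_and_store("https://docs.example.com/page-0",
                    "# Intro\nHello.\n\n## Setup\nInstall it.")
print(n, [rec["id"] for rec in STORE])
```

Splitting on headings keeps each chunk topically coherent, which tends to help retrieval. The regular expression does not understand fenced code, so a `# comment` line inside a code block would also start a new chunk; a Markdown parser would be a more robust choice if that matters for your content.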

Tips

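Add include: ["metadata"] so that each result carries title / description, which work well as chunk headers
For more than 100 URLs, prefer Webhooks over polling; see Webhooks for details
Re-running the same batch of URLs is fine; if the content changes often, add forceRefresh: true to bypass the cache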
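
To illustrate the chunk-header tip, here is a small sketch that prefixes each chunk with page-level context before embedding. The helper `with_header` is hypothetical, and the shape of `result` below (a `metadata` object holding `title` and `description`) is an assumption about where those fields appear in a result; check an actual response before relying on it.

```python
from typing import Optional


def with_header(chunk: str, url: str,
                title: Optional[str] = None,
                description: Optional[str] = None) -> str:
    """Prefix a chunk with page-level context so it is embedded with its source."""
    lines = []
    if title:
        lines.append(f"Title: {title}")
    if description:
        lines.append(f"Description: {description}")
    lines.append(f"Source: {url}")
    return "\n".join(lines) + "\n\n" + chunk


# Assumed result shape, for illustration only.
result = {
    "url": "https://docs.example.com/page-0",
    "markdown": "Run the installer.",
    "metadata": {"title": "Install", "description": "How to install"},
}
meta = result.get("metadata") or {}
out = with_header(result["markdown"], result["url"],
                  meta.get("title"), meta.get("description"))
print(out)
```

Missing fields are simply skipped, so the helper still works for results that come back without metadata.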
