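Integrations
LlamaIndex
Plug Thunderbit into a LlamaIndex pipeline as a Reader or as a tool
LlamaIndex calls this component a "Reader" rather than a loader, but the pattern is the same as with LangChain: Thunderbit produces clean Markdown, and LlamaIndex chunks and indexes it.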
Install
pip install llama-index-core httpx
As a Reader
from llama_index.core import Document
import httpx

API = "https://openapi.thunderbit.com/openapi/v1"
H = {"Authorization": "Bearer YOUR_API_KEY"}

class ThunderbitReader:
    def load_data(self, urls: list[str]) -> list[Document]:
        job = httpx.post(f"{API}/batch/distill",
                         headers=H,
                         json={"urls": urls,
                               "include": ["metadata"]}).json()
        # poll until COMPLETED (see the Batch Job Lifecycle guide)
        return [
            Document(text=r["markdown"],
                     metadata={"source": r["url"], **r.get("metadata", {})})
            for r in job["data"]["results"] if r["status"] == "SUCCEEDED"
        ]
docs = ThunderbitReader().load_data(["https://docs.example.com"])
Then pass the result to VectorStoreIndex.from_documents(docs) as usual.
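The Reader leaves the polling step as a comment. The sketch below shows one way that loop might look. It is not Thunderbit's documented lifecycle: `poll_until_completed` and `fetch_job` are hypothetical names, and the location of the job status at `job["data"]["status"]` is an assumption. Only the `COMPLETED` value comes from the comment in the Reader. The real status endpoint is described in the Batch Job Lifecycle guide, so the helper takes a callable instead of guessing a URL.

```python
import time
from typing import Any, Callable


def poll_until_completed(
    fetch_job: Callable[[], dict[str, Any]],
    interval: float = 2.0,
    timeout: float = 120.0,
    sleep: Callable[[float], None] = time.sleep,
) -> dict[str, Any]:
    """Call fetch_job() until the job reports COMPLETED or the timeout expires.

    fetch_job is a hypothetical caller-supplied function that returns the
    current job payload as a dict (for example, a wrapper around an HTTP GET
    to the job-status endpoint from the Batch Job Lifecycle guide).
    """
    deadline = time.monotonic() + timeout
    while True:
        job = fetch_job()
        # Assumption: the job-level status is stored at job["data"]["status"].
        status = job.get("data", {}).get("status")
        if status == "COMPLETED":
            return job
        if time.monotonic() >= deadline:
            raise TimeoutError(f"job not COMPLETED after {timeout}s (last status: {status})")
        # Wait before asking again so the API is not hammered.
        sleep(interval)
```

Inside `load_data`, the returned payload would stand in for `job` before the list comprehension runs. The `sleep` parameter exists only so the loop can be exercised without real delays.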
As an agent tool (FunctionTool)
from llama_index.core.tools import FunctionTool

def read_url(url: str) -> str:
    """Fetch a URL and return clean Markdown."""
    resp = httpx.post(f"{API}/distill",
                      headers=H,
                      json={"url": url, "renderMode": "basic"},
                      timeout=60.0)
    resp.raise_for_status()
    return resp.json()["data"]["markdown"]

read_tool = FunctionTool.from_defaults(fn=read_url)
Related docs
This integration will be expanded into a llama-index-readers-thunderbit package. Stay tuned.