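Docs to llm.txt

Convert any documentation site into a single Markdown file for LLMs

Distill an entire documentation site into a single llm.txt that you can paste into any LLM context, RAG pipeline, or local model. This is useful for ingesting unfamiliar libraries, internal wikis, and product documentation.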

Flow

Distill the index page with include: ["links"] to discover every linked URL
Filter the link list by URL pattern (for example, /docs/ or /guide/)
Pass the filtered URLs to /batch/distill
Concatenate the resulting Markdown into a single file
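
The filtering step can be sketched as a small standalone helper. This is only an illustration under assumptions: `filter_doc_urls` is a hypothetical name, and it assumes the links come back as absolute URL strings. It matches patterns against the URL path only, drops `#fragment` anchors, and removes duplicates so the same page is not distilled twice.

```python
import re
from urllib.parse import urlsplit


def filter_doc_urls(links, patterns=(r"^/docs/", r"^/guide/")):
    """Keep links whose path matches any pattern; drop fragments and duplicates.

    Hypothetical helper: assumes `links` is an iterable of absolute URL strings.
    """
    compiled = [re.compile(p) for p in patterns]
    seen = set()
    kept = []
    for link in links:
        parts = urlsplit(link)
        # Strip the #fragment so /docs/a#x and /docs/a count as one page.
        clean = parts._replace(fragment="").geturl()
        if clean in seen:
            continue
        # Match against the path only, so the hostname cannot trigger a match.
        if any(rx.search(parts.path) for rx in compiled):
            seen.add(clean)
            kept.append(clean)
    return kept
```

Matching on the path rather than the full URL avoids false positives from hosts such as docs.example.com, which would otherwise satisfy a loose pattern like `docs`.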

Implementation

import re

import httpx

API = "https://openapi.thunderbit.com/openapi/v1"
H = {"Authorization": "Bearer YOUR_API_KEY"}

# 1. Pull the index page + outbound links
index = httpx.post(f"{API}/distill",
                   headers=H,
                   json={"url": "https://docs.example.com",
                         "include": ["links"]}).json()["data"]

# 2. Filter to docs paths
doc_urls = [u for u in index["links"] if re.search(r"/docs/", u)]

# 3. Batch distill
job = httpx.post(f"{API}/batch/distill",
                 headers=H,
                 json={"urls": doc_urls}).json()["data"]

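# 4. Poll, concatenate
# (poll loop omitted; see the RAG Knowledge Base recipe)
# The loop below assumes `job` holds the completed batch, including "results".

with open("llm.txt", "w", encoding="utf-8") as f: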
    for r in job["results"]:
        if r["status"] == "SUCCEEDED":
            f.write(f"# {r['url']}\n\n{r['markdown']}\n\n---\n\n")

Tips

Set a size cap: once llm.txt exceeds roughly 1 MB, it starts to strain the token budget
Sort by URL or section order so diffs stay stable between runs
Pair this with a CI job to keep llm.txt up to date as the source docs change
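
The first two tips can be combined into one concatenation step. The following is a minimal sketch, not part of the API: `build_llm_txt` and `max_bytes` are hypothetical names, and the result fields (`url`, `status`, `markdown`) are assumed to have the shape used in the implementation above. Pages are written in URL order, and any page that would push the output past the cap is skipped and reported.

```python
def build_llm_txt(results, max_bytes=1_000_000):
    """Concatenate successful results in URL order, skipping pages that would
    push the output past max_bytes. Returns (text, skipped_urls).

    Hypothetical helper: assumes each result is a dict with "url", "status",
    and "markdown" keys.
    """
    chunks = []
    skipped = []
    total = 0
    ok = [r for r in results if r["status"] == "SUCCEEDED"]
    # Sorting by URL keeps the output order stable between runs.
    for r in sorted(ok, key=lambda r: r["url"]):
        chunk = f"# {r['url']}\n\n{r['markdown']}\n\n---\n\n"
        size = len(chunk.encode("utf-8"))  # measure bytes, not characters
        if total + size > max_bytes:
            skipped.append(r["url"])
            continue
        chunks.append(chunk)
        total += size
    return "".join(chunks), skipped
```

A possible usage, assuming `job` is the completed batch: `text, skipped = build_llm_txt(job["results"])`, then write `text` to llm.txt and log `skipped` so dropped pages are visible in the CI output.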
