Batch Job Lifecycle
Sync vs async, job states, partial results, and webhooks
Batch endpoints (/batch/distill, /batch/extract) are asynchronous. A job moves through a small state machine; understanding the states helps you decide between polling and webhooks, and how to handle partial failures.
Sync vs async
| | Sync (/distill, /extract) | Async (/batch/distill, /batch/extract) |
|---|---|---|
| URLs per request | 1 | up to 100 |
| Response | Full result in HTTP body | Job ID; results fetched separately |
| Latency | One request, one wait | Submit → poll or webhook → fetch |
| Best for | Real-time UX, agent tools, ad-hoc lookups | Scheduled jobs, RAG ingestion, monitoring |
| Failure mode | One bad URL fails the call | One bad URL fails its row, job continues |
| Concurrency cost | One slot per call | One slot for the whole batch |
When in doubt: single URL → sync; many URLs or no rush → async.
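The rule of thumb above, combined with the 100-URL limit per batch request, can be sketched as a small planning helper. The name `plan_requests` and the `(path, urls)` tuple shape are illustrative assumptions, not part of the API; the helper only decides which endpoint path to call and how to split the input.

```python
from typing import List, Tuple

MAX_BATCH_URLS = 100  # per-request limit from the table above


def plan_requests(urls: List[str], endpoint: str = "distill") -> List[Tuple[str, List[str]]]:
    """Return (path, urls) pairs: one sync call for a single URL,
    otherwise async batches of at most MAX_BATCH_URLS each."""
    if not urls:
        return []
    if len(urls) == 1:
        return [(f"/{endpoint}", urls)]
    return [
        (f"/batch/{endpoint}", urls[i:i + MAX_BATCH_URLS])
        for i in range(0, len(urls), MAX_BATCH_URLS)
    ]
```

For 250 URLs this yields three batch requests of 100, 100, and 50 URLs. A single URL with no latency requirement could equally go through the batch endpoint; the sketch simply follows the default stated above.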
Job states
| Status | Meaning |
|---|---|
| PENDING | Job accepted, queued |
| PROCESSING | At least one URL is being processed |
| COMPLETED | All URLs reached a terminal state (success or failure) |
| FAILED | Fatal job-level error (rare; usually a single URL fails, not the whole job) |
| CANCELLED | User-initiated cancellation via DELETE |
A URL failure does not fail the job. Each item in results[] carries its own status: PENDING, PROCESSING, SUCCEEDED, or FAILED. The job moves to COMPLETED once every row reaches a terminal state.
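Because job status and row status are tracked separately, it can help to summarize both at once. The following is a minimal sketch that assumes the job object is a dict with a top-level `status` and a `results` list whose items each carry a `status`; the function name `summarize` and the keys of its return value are hypothetical.

```python
from collections import Counter
from typing import Any, Dict

JOB_TERMINAL = {"COMPLETED", "FAILED", "CANCELLED"}
ROW_TERMINAL = {"SUCCEEDED", "FAILED"}


def summarize(job: Dict[str, Any]) -> Dict[str, Any]:
    """Count rows by status and report whether the job itself is finished."""
    rows = job.get("results", [])
    counts = Counter(r["status"] for r in rows)
    return {
        "job_done": job["status"] in JOB_TERMINAL,
        "rows_done": sum(counts[s] for s in ROW_TERMINAL),
        "rows_total": len(rows),
        "succeeded": counts["SUCCEEDED"],
        "failed": counts["FAILED"],
    }
```

A COMPLETED job with a non-zero `failed` count is the normal partial-failure case: the job finished, but some rows need a retry or manual review.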
Polling vs webhooks
| Job size | Recommended | Why |
|---|---|---|
| < 10 URLs | Poll every 5–10 s | Webhook overhead isn't worth the wiring |
| 10–100 URLs | Webhook | Polling every 5 s burns ~60 round-trips on a 5-minute job |
| > 100 URLs (multiple batches) | Webhook | Each batch fires once on completion |
See Webhooks for payload shape, signature verification (HMAC-SHA256), and retry behavior.
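As a rough illustration of HMAC-SHA256 signature verification, a receiver might recompute the digest over the raw request body and compare it in constant time. This sketch assumes the signature is a hex-encoded digest of the raw body keyed with a shared secret; the actual header name, encoding, and signed content are defined on the Webhooks page, and `verify_signature` is a hypothetical helper.

```python
import hashlib
import hmac


def verify_signature(secret: str, raw_body: bytes, received_signature: str) -> bool:
    """Recompute HMAC-SHA256 over the raw body and compare in constant time.

    Assumption: the sender transmits a hex digest of the unmodified body.
    """
    expected = hmac.new(secret.encode("utf-8"), raw_body, hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected.encode("utf-8"), received_signature.encode("utf-8"))
```

Verify against the bytes exactly as received, before any JSON parsing, since re-serializing the payload can change whitespace or key order and break the comparison.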
Partial results
GET /batch/distill/{id} works while the job is still PROCESSING — you get whatever has finished so far. Useful for dashboards that stream rows as they complete.
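```python
import time

import httpx

API = "https://openapi.thunderbit.com/openapi/v1"
H = {"Authorization": "Bearer YOUR_API_KEY"}
urls = ["https://example.com/a", "https://example.com/b"]  # placeholder input

def yield_to_dashboard(rows):  # stand-in for your own dashboard hook
    print(f"{len(rows)} rows finished")

# Submit the batch; the response carries the job ID.
job = httpx.post(f"{API}/batch/distill", headers=H, json={"urls": urls}).json()
batch_id = job["data"]["id"]

while True:
    body = httpx.get(f"{API}/batch/distill/{batch_id}", headers=H).json()["data"]
    # Every row that has succeeded so far, including rows seen on earlier polls.
    succeeded = [r for r in body["results"] if r["status"] == "SUCCEEDED"]
    yield_to_dashboard(succeeded)
    # Stop once the job itself reaches a terminal state.
    if body["status"] in ("COMPLETED", "FAILED", "CANCELLED"):
        break
    time.sleep(5)
```
Cancellation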
DELETE /batch/distill/{id} (or /batch/extract/{id}) only works on PENDING or PROCESSING jobs. Once a job hits a terminal state, it stays there. Already-processed URLs in a cancelled job remain billable; in-flight URLs that hadn't finished are not.
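The state restriction above can be enforced client-side before sending the request. This is a minimal sketch: `cancel_job` is a hypothetical helper, and `client` is assumed to be any object with an httpx-style `delete(url, headers=...)` method (for example an `httpx.Client`). How the API responds to a DELETE on a job that is already terminal is not covered here.

```python
from typing import Any, Dict

CANCELLABLE = {"PENDING", "PROCESSING"}


def cancel_job(client: Any, api: str, batch_id: str, status: str,
               headers: Dict[str, str], kind: str = "distill") -> bool:
    """Send DELETE only when the last known status allows cancellation.

    Returns True if a request was sent, False if it was skipped.
    """
    if status not in CANCELLABLE:
        return False
    client.delete(f"{api}/batch/{kind}/{batch_id}", headers=headers)
    return True
```

The status passed in is only the last one you observed; a job can finish between your last poll and the DELETE, so treat the guard as a way to avoid pointless calls rather than a guarantee that the cancellation takes effect.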