
Batch Job Lifecycle

Sync vs async, job states, partial results, and webhooks

Batch endpoints (/batch/distill, /batch/extract) are asynchronous. A job moves through a small state machine; understanding the states helps you choose between polling and webhooks and plan for partial failures.

Sync vs async

| | Sync (/distill, /extract) | Async (/batch/distill, /batch/extract) |
|---|---|---|
| URLs per request | 1 | Up to 100 |
| Response | Full result in HTTP body | Job ID; results fetched separately |
| Latency | One request, one wait | Submit → poll or webhook → fetch |
| Best for | Real-time UX, agent tools, ad-hoc lookups | Scheduled jobs, RAG ingestion, monitoring |
| Failure mode | One bad URL fails the call | One bad URL fails its row, job continues |
| Concurrency cost | One slot per call | One slot for the whole batch |

When in doubt: single URL → sync; many URLs or no rush → async.
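Since a batch request accepts at most 100 URLs, a larger crawl has to be split client-side into multiple jobs. A minimal chunking sketch (the helper name and the example URLs are illustrative, not part of the API):

```python
def chunk_urls(urls, batch_size=100):
    """Split a URL list into batches that fit the 100-URL-per-request limit."""
    return [urls[i:i + batch_size] for i in range(0, len(urls), batch_size)]

# 250 URLs -> three jobs of 100, 100, and 50 URLs
batches = chunk_urls([f"https://example.com/page/{i}" for i in range(250)])
```

Each resulting batch becomes one POST to /batch/distill (and, per the table above, one concurrency slot).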

Job states

| Status | Meaning |
|---|---|
| PENDING | Job accepted, queued |
| PROCESSING | At least one URL is being processed |
| COMPLETED | All URLs reached a terminal state (success or failure) |
| FAILED | Fatal job-level error (rare; usually one URL fails, not the whole job) |
| CANCELLED | User-initiated cancellation via DELETE |

A URL failure does not fail the job. Each item in results[] carries its own status: PENDING, PROCESSING, SUCCEEDED, or FAILED. The job moves to COMPLETED once every row reaches a terminal state.
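This split between job status and per-row status means a dashboard usually wants two things from one response: a status breakdown and a "fully terminal" flag. A small sketch, assuming the `data` payload shape shown in the polling example below (a top-level `status` plus a `results` list of rows with their own `status`):

```python
from collections import Counter

TERMINAL = {"SUCCEEDED", "FAILED"}  # per-row terminal states

def summarize(body):
    """Count per-row statuses and report whether every row is terminal.

    `body` is the `data` object from GET /batch/distill/{id}.
    """
    counts = Counter(r["status"] for r in body["results"])
    done = all(r["status"] in TERMINAL for r in body["results"])
    return counts, done

# A job with one failed row still reaches COMPLETED:
counts, done = summarize({
    "status": "COMPLETED",
    "results": [{"status": "SUCCEEDED"}, {"status": "FAILED"},
                {"status": "SUCCEEDED"}],
})
```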

Polling vs webhooks

| Job size | Recommended | Why |
|---|---|---|
| < 10 URLs | Poll every 5–10 s | Webhook overhead isn't worth the wiring |
| 10–100 URLs | Webhook | Polling burns ~60 round-trips on a 5-minute job |
| > 100 URLs (multiple batches) | Webhook | Each batch fires once on completion |

See Webhooks for payload shape, signature verification (HMAC-SHA256), and retry behavior.
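The core of HMAC-SHA256 verification is the same everywhere: recompute the digest over the raw request body and compare in constant time. A sketch under stated assumptions; the header name, hex encoding, and secret format here are placeholders, and the Webhooks page is authoritative for the exact scheme:

```python
import hashlib
import hmac

def verify_signature(payload: bytes, signature_hex: str, secret: str) -> bool:
    """Recompute HMAC-SHA256 over the raw body and compare in constant time.

    Use the raw bytes of the request body, not a re-serialized JSON dict:
    re-serialization can reorder keys and change whitespace, breaking the MAC.
    """
    expected = hmac.new(secret.encode(), payload, hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, signature_hex)
```

`hmac.compare_digest` avoids the timing side channel that a plain `==` comparison would leak.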

Partial results

GET /batch/distill/{id} works while the job is still PROCESSING — you get whatever has finished so far. Useful for dashboards that stream rows as they complete.

import time

import httpx

API = "https://openapi.thunderbit.com/openapi/v1"
H = {"Authorization": "Bearer YOUR_API_KEY"}

# Submit the batch; the response carries the job ID, not the results.
job = httpx.post(f"{API}/batch/distill", headers=H,
                 json={"urls": urls}).json()
batch_id = job["data"]["id"]

seen = set()  # result indices already streamed to the dashboard
while True:
    body = httpx.get(f"{API}/batch/distill/{batch_id}", headers=H).json()["data"]
    # Push only rows that newly reached SUCCEEDED since the last poll.
    fresh = []
    for i, r in enumerate(body["results"]):
        if r["status"] == "SUCCEEDED" and i not in seen:
            seen.add(i)
            fresh.append(r)
    yield_to_dashboard(fresh)
    if body["status"] in ("COMPLETED", "FAILED", "CANCELLED"):
        break
    time.sleep(5)

Cancellation

DELETE /batch/distill/{id} (or /batch/extract/{id}) only works on PENDING or PROCESSING jobs. Once a job hits a terminal state, it stays there. Already-processed URLs in a cancelled job remain billable; in-flight URLs that hadn't finished are not.
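Because the DELETE only succeeds from PENDING or PROCESSING, it is worth guarding client-side rather than issuing it blindly. A minimal sketch, assuming an httpx-style client object with a `.delete` method (the helper name is illustrative):

```python
CANCELLABLE = {"PENDING", "PROCESSING"}

def try_cancel(client, batch_id, current_status):
    """Issue DELETE /batch/distill/{id} only when it can still succeed."""
    if current_status not in CANCELLABLE:
        return False  # terminal states (COMPLETED/FAILED/CANCELLED) stay put
    client.delete(f"/batch/distill/{batch_id}")
    return True
```

The status check is advisory, not a guarantee: the job may reach a terminal state between your last poll and the DELETE, so still handle a rejected cancellation from the API.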