Thunderbit Open API
Get Started
Thunderbit Open API provides powerful web distillation and intelligent data extraction capabilities, turning any web content into LLM-ready formats.
Key Features
- Web Distillation: Convert web pages into clean Markdown format, perfect for AI applications
- AI-Powered Extraction: Extract structured data using schemas or natural language prompts
- Batch Processing: Process multiple URLs simultaneously with asynchronous job management
- Enterprise-Ready: Handles JavaScript rendering, anti-bot measures, proxies, and dynamic content automatically
What We Handle For You
- Dynamic Content: JavaScript-rendered pages, SPAs, and AJAX-loaded content
- Anti-Bot Protection: Automatic handling of CAPTCHAs and bot detection systems
- Content Processing: Intelligent cleaning and formatting for optimal AI consumption
- Metadata Extraction: Automatic extraction of titles, descriptions, and structured metadata
Authentication
All API requests require an API Key in the Header:
Authorization: Bearer <YOUR_API_KEY>
Get your API key from the Thunderbit Dashboard.
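The curl examples later in this document translate directly to other HTTP clients. As a minimal sketch, the required header can be built like this in Python (the `THUNDERBIT_API_KEY` environment variable name is an assumption, not part of the API):

```python
import os

def auth_headers(api_key: str) -> dict:
    """Build the headers every Thunderbit Open API request needs."""
    return {
        "Authorization": f"Bearer {api_key}",
        "Content-Type": "application/json",
    }

# Reading the key from the environment keeps it out of source code.
headers = auth_headers(os.environ.get("THUNDERBIT_API_KEY", "YOUR_API_KEY"))
```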
Rate Limits
| Plan | Request Limit | Concurrency | Best For |
|---|---|---|---|
| Free | 10 requests/min | 2 concurrent | Testing & prototyping |
| Pro | 100 requests/min | 10 concurrent | Production apps |
| Enterprise | 1000 requests/min | 50 concurrent | Large-scale operations |
Output Formats
- Markdown: Clean, LLM-optimized markdown format
- Structured Data: JSON output based on your schema
- Metadata: Automatic extraction of page metadata
Base URL
https://open.thunderbit.com/v1 (Production server)
Authentication
Type: HTTP Bearer (JWT). Supply the API key from the Thunderbit Dashboard in every request header: `Authorization: Bearer YOUR_API_KEY`
Error Responses
BadRequest
Invalid request parameters
Unauthorized
Authentication failed, invalid API Key
RateLimited
Too many requests, rate limit triggered
- X-RateLimit-Limit: Rate limit ceiling
- X-RateLimit-Remaining: Remaining requests
- X-RateLimit-Reset: Reset timestamp
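A client can use these headers to decide how long to wait before retrying. The sketch below assumes the header values are integers, with `X-RateLimit-Reset` being a Unix timestamp in seconds (a common convention, not confirmed by this document):

```python
import time

def backoff_seconds(headers: dict, now: float = None) -> float:
    """Seconds to wait before the next request, based on X-RateLimit-* headers."""
    remaining = int(headers.get("X-RateLimit-Remaining", 0))
    if remaining > 0:
        return 0.0  # still within quota, no need to wait
    reset = int(headers.get("X-RateLimit-Reset", 0))
    if now is None:
        now = time.time()
    # Wait until the window resets; never return a negative delay.
    return max(0.0, reset - now)
```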
Schemas
Error
Standard error response format
- success (boolean): Always false for error responses
- error (object): Error details
- code (string): One of the following error codes:
- INVALID_URL: Invalid URL format
- URL_NOT_ACCESSIBLE: Unable to access target URL
- TIMEOUT: Request timeout
- QUOTA_EXCEEDED: Quota exhausted
- RATE_LIMITED: Rate limit triggered
- INVALID_SCHEMA: Invalid Schema format
- EXTRACTION_FAILED: AI extraction failed
- BATCH_SIZE_EXCEEDED: Batch request count exceeded limit
- INVALID_WEBHOOK_URL: Invalid Webhook URL format or not HTTPS
- WEBHOOK_DELIVERY_FAILED: Webhook callback delivery failed
- message (string): Error description message
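The error envelope above can be mapped onto a typed exception on the client side. This is a sketch, not an official SDK; the choice of which codes count as retryable is an assumption for illustration:

```python
class ThunderbitError(Exception):
    """Wraps the standard error response (code + message)."""
    def __init__(self, code: str, message: str):
        super().__init__(f"{code}: {message}")
        self.code = code
        self.message = message

# Assumed-retryable codes; adjust to your own policy.
RETRYABLE = {"TIMEOUT", "RATE_LIMITED", "URL_NOT_ACCESSIBLE"}

def raise_for_error(body: dict) -> None:
    """Raise ThunderbitError if the response body signals failure."""
    if body.get("success"):
        return
    err = body.get("error", {})
    raise ThunderbitError(err.get("code", "UNKNOWN"), err.get("message", ""))
```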
Metadata
Page metadata extracted from HTML meta tags, Open Graph, and Twitter Cards
DistillResult
Result of distilling a web page into clean Markdown format
ExtractResult
Result of AI-powered structured data extraction from a web page
BatchJob
Status and results of a batch processing job
Distill
Convert web pages into clean, LLM-ready Markdown format. Handles JavaScript rendering, dynamic content, and anti-bot protection automatically.
POST /distill - Distill Single Page
Convert a web page into clean, LLM-ready Markdown format.
Use Cases:
- Prepare web content for RAG (Retrieval-Augmented Generation)
- Extract article content for AI processing
- Convert documentation pages to markdown
- Process dynamic web applications
What's Included:
- Clean markdown content with preserved structure
- Automatic removal of ads, navigation, and boilerplate
- Metadata extraction (title, description, language)
- JavaScript rendering for dynamic content
- Automatic handling of anti-bot measures
Output Format:
Returns markdown optimized for LLM consumption with minimal noise and maximum signal.
Request Body
- url (string) *required: The URL of the web page to distill
- timeout (number): Request timeout in milliseconds (default: 30000, max: 60000)
- waitFor (number): Time to wait (in milliseconds) after page load for dynamic content to render before extracting content
- includeTags (string[]): Only include content from these HTML tags (e.g., ['article', 'main', 'div.content'])
- excludeTags (string[]): Exclude content from these HTML tags (e.g., ['nav', 'footer', 'aside'])
- headers (object): Custom HTTP headers to send with the request
Response (200): Success response
- success (boolean): Whether the request succeeded
- data (object):
- url (string): The URL that was distilled
- markdown (string): Clean markdown content extracted from the page
- html (string): Raw HTML content (optional, only if requested)
- metadata (object):
- title (string): Page title extracted from <title> tag or Open Graph
- description (string): Meta description or excerpt
- language (string): Detected language code (ISO 639-1)
- author (string): Article author if available
- publishedDate (string): Publication date if available
- image (string): Featured image URL from Open Graph or Twitter Card
- sourceURL (string): Original URL (may differ from requested URL due to redirects)
- statusCode (integer): HTTP status code of the response
- contentLength (integer): Length of the markdown content in characters
- links (object[]): Links found in the content
Example Request
curl 'https://open.thunderbit.com/v1/distill' \
--header 'Authorization: Bearer YOUR_SECRET_TOKEN' \
--header 'Content-Type: application/json' \
--data '{"url":"https://example.com/article","timeout":30000,"waitFor":2000,"includeTags":["article","main"],"excludeTags":["nav","footer","aside"],"headers":{"User-Agent":"MyBot/1.0"}}'
Example Response
{
"success": true,
"data": {
"url": "https://example.com/article",
"markdown": "# Article Title\n\nContent...",
"html": "<article>...</article>",
"metadata": {
"title": "string",
"description": "string",
"language": "string",
"author": "string",
"publishedDate": "2025-01-01T00:00:00Z",
"image": "string",
"sourceURL": "string",
"statusCode": 1,
"contentLength": 1
},
"links": [
{
"text": "string",
"href": "string"
}
]
}
}
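The curl request above can be assembled and its response unpacked with a small helper sketch; the function names here are illustrative, not part of any official SDK (sending the request itself is left to your HTTP client of choice):

```python
import json

def build_distill_payload(url, timeout=30000, wait_for=None,
                          include_tags=None, exclude_tags=None):
    """Serialize a /distill request body from the documented parameters."""
    body = {"url": url, "timeout": timeout}
    if wait_for is not None:
        body["waitFor"] = wait_for
    if include_tags:
        body["includeTags"] = include_tags
    if exclude_tags:
        body["excludeTags"] = exclude_tags
    return json.dumps(body)

def parse_distill(body: dict):
    """Return (title, markdown) from a successful /distill response."""
    data = body["data"]
    return data["metadata"].get("title", ""), data["markdown"]
```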
POST /batch/distill - Batch Distill Multiple Pages
Distill multiple web pages simultaneously with asynchronous processing.
Use Cases:
- Process entire website sections or categories
- Batch import content into your knowledge base
- Large-scale content migration
- Periodic content updates from multiple sources
How It Works:
1. Submit a batch job with up to 100 URLs
2. Receive a job ID immediately
3. Poll the status endpoint or receive webhook notification
4. Retrieve all results when complete
Features:
- Asynchronous processing for high throughput
- Automatic retry on failures
- Webhook notifications when complete
- Detailed per-URL status and error reporting
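A request body honoring the constraints above (at most 100 URLs, HTTPS-only webhook) can be validated client-side before submission. This is a sketch; the `X-Webhook-Secret` header name is an example of a custom callback header, not an API requirement:

```python
import json

MAX_BATCH_URLS = 100  # documented cap for /batch/distill

def build_batch_payload(urls, webhook_url=None, timeout=30000):
    """Serialize a /batch/distill body, rejecting inputs the API would refuse."""
    if len(urls) > MAX_BATCH_URLS:
        # Would trigger BATCH_SIZE_EXCEEDED server-side.
        raise ValueError(f"batch limited to {MAX_BATCH_URLS} URLs, got {len(urls)}")
    body = {"urls": urls, "timeout": timeout}
    if webhook_url is not None:
        if not webhook_url.startswith("https://"):
            # Would trigger INVALID_WEBHOOK_URL server-side.
            raise ValueError("webhook URL must be HTTPS")
        body["webhook"] = {"url": webhook_url,
                           "headers": {"X-Webhook-Secret": "my-secret"}}
    return json.dumps(body)
```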
Request Body
- urls (string[]) *required: List of URLs to distill, maximum 100
- timeout (number): Timeout per request in milliseconds, default 30000
- webhook (object): Webhook callback configuration, notifies when task completes
- url (string): Webhook callback URL, must be HTTPS
- headers (object): Custom callback headers, can be used for authentication
Response (200): Success response
- success (boolean): Whether the request succeeded
- data (object):
- id (string): Batch task ID
- status (string): Job status (processing, completed, or failed)
- total (integer): Total number of URLs in the batch
- completed (integer): Number of URLs processed so far
- creditsUsed (integer): Credits consumed by the job so far
Example Request
curl 'https://open.thunderbit.com/v1/batch/distill' \
--header 'Authorization: Bearer YOUR_SECRET_TOKEN' \
--header 'Content-Type: application/json' \
--data '{"urls":["https://example.com/page1","https://example.com/page2"],"timeout":1,"webhook":{"url":"string","headers":{}}}'
Example Response
{
"success": true,
"data": {
"id": "batch_abc123",
"status": "processing",
"total": 3,
"completed": 0,
"creditsUsed": 0
}
}
GET /batch/distill/{id} - Get Batch Distill Job Status
Check the status and retrieve results of a batch distill job.
Response States:
- processing: Job is currently running
- completed: All URLs have been processed
- failed: Job encountered a fatal error
Polling Best Practices:
- Poll every 5-10 seconds for jobs with <10 URLs
- Poll every 30-60 seconds for larger jobs
- Use webhooks for better efficiency
Partial Results:
You can retrieve completed results while the job is still processing. The results array will contain all URLs processed so far.
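The polling guidance above can be captured in a small loop. `fetch_status` below stands in for a `GET /batch/distill/{id}` call so the sketch can be exercised without network access; in real use it would issue the HTTP request:

```python
import time

def poll_batch(fetch_status, job_id, interval=5.0, max_polls=120):
    """Poll until the batch job reaches a terminal state, then return its data."""
    for _ in range(max_polls):
        data = fetch_status(job_id)["data"]
        if data["status"] in ("completed", "failed"):
            return data
        time.sleep(interval)  # 5-10s for small jobs, 30-60s for large ones
    raise TimeoutError(f"batch job {job_id} still processing after {max_polls} polls")
```

For long-running jobs, webhooks avoid polling entirely; this loop is a fallback when a callback endpoint is not available.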
Parameters
- id (string) *required: Batch task ID
Response (200): Success response
- success (boolean): Whether the request succeeded
- data (object):
- id (string): Batch task ID
- status (string): Job status (processing, completed, or failed)
- total (integer): Total number of URLs in the batch
- completed (integer): Number of URLs processed so far
- creditsUsed (integer): Credits consumed by the job so far
- results (object[]): Per-URL results processed so far
Example Request
curl 'https://open.thunderbit.com/v1/batch/distill/{id}' \
--header 'Authorization: Bearer YOUR_SECRET_TOKEN'
Example Response
{
"success": true,
"data": {
"id": "batch_abc123",
"status": "string",
"total": 1,
"completed": 1,
"creditsUsed": 1,
"results": [
{
"url": "string",
"success": true,
"markdown": "string",
"error": {
"code": "string",
"message": "string"
}
}
]
}
}
Extract
AI-powered structured data extraction. Define your desired data structure using JSON Schema or natural language prompts, and let our AI extract the information for you.
POST /extract - AI-Powered Structured Extraction
Extract structured data from web pages using AI. Define your desired output structure with JSON Schema, and our AI will intelligently extract the information.
Use Cases:
- Extract product information from e-commerce pages
- Parse job listings into structured format
- Extract contact information and business details
- Convert news articles into structured data
- Scrape pricing tables and specifications
How It Works:
1. Provide a URL and a JSON Schema defining your desired structure
2. Our AI analyzes the page content
3. Extracts data matching your schema
4. Returns validated JSON output
Schema Definition:
Use JSON Schema to define your desired output structure:
- Field types: string, number, boolean, array, object
- Field descriptions: Help the AI understand what to extract
- Required fields: Mark critical fields as required
- Nested structures: Support for complex, nested data
Example Schema:
{
"type": "object",
"properties": {
"title": {
"type": "string",
"description": "Product name or title"
},
"price": {
"type": "number",
"description": "Current price in USD"
},
"availability": {
"type": "boolean",
"description": "Whether the product is in stock"
},
"features": {
"type": "array",
"items": {"type": "string"},
"description": "List of key product features"
}
},
"required": ["title", "price"]
}
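The example schema above can be double-checked against the returned `extract` object on the client side. The sketch below is stdlib-only and deliberately minimal, covering just the `required` list described above; a full JSON Schema library would also check types and nesting:

```python
def missing_required(schema: dict, extracted: dict) -> list:
    """Names from the schema's `required` list that are absent in the result."""
    return [f for f in schema.get("required", []) if f not in extracted]

# The example schema from this section, as a Python dict.
product_schema = {
    "type": "object",
    "properties": {
        "title": {"type": "string", "description": "Product name or title"},
        "price": {"type": "number", "description": "Current price in USD"},
    },
    "required": ["title", "price"],
}
```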
Request Body
- url (string) *required: The URL of the web page to extract data from
- schema (object) *required: Data structure definition in JSON Schema format
- timeout (number): Request timeout in milliseconds, default 30000
Response (200): Success response
- success (boolean): Whether the request succeeded
- data (object):
- url (string): The URL that was extracted
- extract (object): Extracted structured data matching your schema
- metadata (object):
- sourceURL (string): Original URL (may differ from requested URL due to redirects)
- statusCode (integer): HTTP status code of the response
- extractedAt (string): Extraction timestamp (ISO 8601)
- confidence (number): AI confidence score (0-1) for the extraction quality
- processingTime (integer): Time taken to process in milliseconds
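One practical use of the confidence score is gating downstream processing. A sketch, assuming the response shape above; the 0.8 threshold is an arbitrary example, not an API recommendation:

```python
def accept_extraction(body: dict, min_confidence: float = 0.8):
    """Return the extracted data only if the AI confidence clears the threshold."""
    data = body.get("data", {})
    conf = data.get("metadata", {}).get("confidence", 0.0)
    if conf >= min_confidence:
        return data.get("extract")
    return None  # caller can retry, flag for review, or discard
```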
Example Request
curl 'https://open.thunderbit.com/v1/extract' \
--header 'Authorization: Bearer YOUR_SECRET_TOKEN' \
--header 'Content-Type: application/json' \
--data '{"url":"https://example.com/product","schema":{"type":"object","properties":{"name":{"type":"string"},"price":{"type":"number"}},"required":["name","price"]},"timeout":1}'
Example Response
{
"success": true,
"data": {
"url": "https://example.com/product",
"extract": {
"name": "iPhone 15 Pro",
"price": 999,
"currency": "USD"
},
"metadata": {
"sourceURL": "string",
"statusCode": 1,
"extractedAt": "2025-01-01T00:00:00Z",
"confidence": 1,
"processingTime": 1
}
}
}
POST /extract/batch - Batch Extract Multiple Pages
Extract structured data from multiple URLs simultaneously using AI.
Use Cases:
- Scrape product catalogs from multiple pages
- Extract data from search result pages
- Batch process listings or directory pages
- Collect competitive intelligence at scale
How It Works:
1. Submit up to 50 URLs with a single schema
2. Get immediate job ID response
3. All URLs are extracted using the same schema
4. Poll for status or receive webhook notification
5. Retrieve all structured results at once
Features:
- Same schema applied to all URLs
- Parallel processing for speed
- Individual error handling per URL
- Webhook notifications available
Request Body
- urls (string[]) *required: List of URLs to extract data from, maximum 50
- schema (object) *required: Data structure definition in JSON Schema format
- timeout (number): Timeout per request in milliseconds, default 30000
- webhook (object): Webhook callback configuration, notifies when task completes
- url (string): Webhook callback URL, must be HTTPS
- headers (object): Custom callback headers, can be used for authentication
Response (200): Success response
- success (boolean): Whether the request succeeded
- data (object):
- id (string): Batch task ID
- status (string): Job status (processing, completed, or failed)
- total (integer): Total number of URLs in the batch
- completed (integer): Number of URLs extracted so far
Example Request
curl 'https://open.thunderbit.com/v1/extract/batch' \
--header 'Authorization: Bearer YOUR_SECRET_TOKEN' \
--header 'Content-Type: application/json' \
--data '{"urls":["string"],"schema":{},"timeout":1,"webhook":{"url":"string","headers":{}}}'
Example Response
{
"success": true,
"data": {
"id": "batch_ext_xyz789",
"status": "string",
"total": 1,
"completed": 1
}
}
GET /extract/batch/{id} - Get Batch Extract Job Status
Check status and retrieve extracted data from a batch extraction job.
Response States:
- processing: Extraction in progress
- completed: All extractions finished
- failed: Job failed (check error details)
Results Format:
Each URL in the results array contains:
- Extracted data matching your schema
- Success/failure status
- Individual error messages if applicable
- Confidence scores for extraction quality
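Since each URL carries its own success flag and error, a common pattern is to partition the results array before further processing. A sketch assuming the result shape described above:

```python
def split_results(results: list):
    """Partition batch-extract results into {url: extract} and {url: error} maps."""
    ok, failed = {}, {}
    for r in results:
        if r.get("success"):
            ok[r["url"]] = r.get("extract", {})
        else:
            err = r.get("error", {})
            failed[r["url"]] = f"{err.get('code')}: {err.get('message')}"
    return ok, failed
```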
Parameters
- id (string) *required: Batch task ID
Response (200): Success response
- success (boolean): Whether the request succeeded
- data (object):
- id (string): Batch task ID
- status (string): Job status (processing, completed, or failed)
- total (integer): Total number of URLs in the batch
- completed (integer): Number of URLs extracted so far
- creditsUsed (integer): Credits consumed by the job
- results (object[]): Per-URL extraction results
Example Request
curl 'https://open.thunderbit.com/v1/extract/batch/{id}' \
--header 'Authorization: Bearer YOUR_SECRET_TOKEN'
Example Response
{
"success": true,
"data": {
"id": "string",
"status": "string",
"total": 1,
"completed": 1,
"creditsUsed": 1,
"results": [
{
"url": "string",
"success": true,
"extract": {},
"error": {
"code": "string",
"message": "string"
}
}
]
}
}