If you've followed an Amazon scraping tutorial only to hit a wall of CAPTCHAs, 503 errors, or completely empty results — welcome to the club. Most Python Amazon scraping guides floating around the internet were written in 2022 or 2023, and they use selectors and techniques that Amazon has long since patched.
I've spent years building data extraction tools at Thunderbit, and one thing I can tell you from the trenches: Amazon is one of the hardest sites to scrape reliably. The platform changes its HTML structure constantly, deploys a six-layer anti-bot defense, and even serves different page layouts to different users via A/B testing. In this guide, I'm going to walk you through a Python Amazon scraper that actually works in 2025 — with verified CSS selectors, a layered anti-blocking strategy, and guidance on scheduling and exporting that most tutorials skip entirely. And for folks who just need the data without wrestling with Python, I'll also show you how you can do the same job in about two clicks.
What Is Amazon Product Scraping?
Amazon product scraping is the process of programmatically extracting publicly available data — product names, prices, ratings, review counts, images, availability, and more — from Amazon's product and search result pages. Instead of manually copying information from hundreds of listings, a scraper visits each page, reads the HTML, and pulls out the data you specify into a structured format like CSV, Excel, or a database.
Think of it as hiring a tireless intern who can visit a thousand product pages in the time it takes you to finish your morning coffee. Except this intern never misspells anything and doesn't need lunch breaks.
Why Scrape Amazon Products with Python?
Amazon hosts hundreds of millions of listings across 30+ categories, powered by millions of third-party sellers. Third-party sellers now represent 69% of total GMV. Manually monitoring even a fraction of that catalog is impossible. Here's why teams scrape Amazon:
| Use Case | Who Benefits | What They Extract |
|---|---|---|
| Price monitoring & repricing | Ecommerce ops, marketplace sellers | Prices, availability, seller info |
| Competitor analysis | Product managers, brand teams | Product features, ratings, review counts |
| Market research | Analysts, new product teams | Category trends, pricing distributions |
| Lead generation | Sales teams | Seller names, brand info, contact data |
| Affiliate marketing | Content creators, deal sites | Prices, deals, product details |
| Inventory tracking | Supply chain, procurement | Stock status, delivery estimates |
The scale of Amazon's pricing alone makes automation essential: Amazon changes prices millions of times per day, with the average product's price updating roughly every 10 minutes. By contrast, competitors like Best Buy and Walmart change prices only about 50,000 times per month. No human team can keep up.

Python gives you full control over the scraping process — you decide what to extract, how to handle errors, and where to store the data. But it also means you're responsible for maintenance, anti-blocking, and keeping up with Amazon's frequent HTML changes.
What You Can Scrape from Amazon (and What You Can't)
From publicly accessible product pages, you can typically extract:
- Product title (name, brand)
- Price (current, original, deal price)
- Rating (star average)
- Review count
- Product images (main image URL)
- Availability / stock status
- ASIN (Amazon Standard Identification Number)
- Product description and bullet points
- Seller information
- Product variations (size, color, etc.)
What you should avoid:
- Data behind login walls: Extended review pages, personal account data, order history
- Personal information: Buyer names, addresses, payment info
- Copyrighted content for republishing: Product descriptions and images are fine for analysis, but don't republish them as your own
Amazon's robots.txt blocks 50+ named bots (including GPTBot, Scrapy, and ClaudeBot) and disallows paths like user accounts, carts, and wishlists. Product detail pages are not explicitly disallowed, but Amazon's Terms of Service do prohibit automated access. Courts have generally distinguished between ToS violations (a civil matter) and criminal violations under the CFAA — more on legality at the end of this guide.
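You can inspect how a robots.txt file treats different bots and paths with Python's standard-library robotparser. The excerpt below is illustrative of the pattern described above, not Amazon's actual file:

```python
from urllib import robotparser

# Illustrative excerpt in the style of Amazon's robots.txt -- NOT the real file.
SAMPLE_ROBOTS = """\
User-agent: *
Disallow: /gp/cart
Disallow: /wishlist/

User-agent: GPTBot
Disallow: /
""".splitlines()

rp = robotparser.RobotFileParser()
rp.parse(SAMPLE_ROBOTS)

print(rp.can_fetch("*", "https://www.amazon.com/dp/B0DGNFM9YJ"))      # True
print(rp.can_fetch("*", "https://www.amazon.com/gp/cart/view.html"))  # False
print(rp.can_fetch("GPTBot", "https://www.amazon.com/dp/B0DGNFM9YJ")) # False
```

Checking can_fetch before scraping a path is a cheap way to stay on the right side of the "stick to public product pages" advice above.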
Tools and Libraries You'll Need
Here's the Python stack for this tutorial:
| Library | Purpose | Why We Use It |
|---|---|---|
requests | HTTP requests | Simple, widely supported |
beautifulsoup4 | HTML parsing | Easy CSS selector-based extraction |
lxml | Fast HTML parser | Used as BeautifulSoup's parser backend |
curl_cffi | TLS fingerprint impersonation | Critical for bypassing Amazon's detection |
pandas | Data structuring & export | DataFrames, CSV/Excel export |
Optional (for JavaScript-rendered content):
selenium or playwright — headless browser automation
Setting Up Your Python Environment
Open your terminal and run:
```bash
mkdir amazon-scraper && cd amazon-scraper
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate
pip install requests beautifulsoup4 lxml curl_cffi pandas
```
Verify everything installed:
```python
import requests, bs4, curl_cffi, pandas
print("All good!")
```
If you see "All good!" with no errors, you're ready.

Why Most Amazon Scraping Tutorials Break (and How This One Is Different)
This is the part most guides skip, and it's the reason you're probably reading this article in the first place.
Amazon frequently updates its HTML structure, class names, and element IDs. The scraping community reports that scrapers routinely need weekly fixes due to DOM shifts and fingerprinting changes. The most infamous casualty? The selector #priceblock_ourprice, which appeared in hundreds of tutorials from 2018–2023. That ID no longer exists on Amazon product pages.
A quick comparison of what's broken vs. what works now:
| Data Point | Broken Selector (Pre-2024) | Working 2025 Selector |
|---|---|---|
| Price | #priceblock_ourprice | div#corePriceDisplay_desktop_feature_div span.a-price .a-offscreen |
| Title | #productTitle | span#productTitle (still works) |
| Rating | span.a-icon-alt (sometimes wrong context) | #acrPopover span.a-icon-alt |
| Review Count | #acrCustomerReviewCount | span#acrCustomerReviewText |
| Availability | #availability span | div#availability span.a-size-medium |
Every code snippet in this guide was tested against live Amazon pages in 2025. I'll show you the actual CSS selectors alongside expected output — no copy-pasting from 2022.
Before You Start
- Difficulty: Intermediate (basic Python knowledge assumed)
- Time Required: ~30–45 minutes for the full tutorial; ~10 minutes for the basic scraper
- What You'll Need: Python 3.9+, Chrome browser (for inspecting Amazon pages), a terminal, and optionally the Thunderbit Chrome extension if you want to compare the no-code approach
Step 1: Send Your First Request to Amazon
Navigate to any Amazon product page in your browser and copy the URL. We'll start with a simple requests.get():
```python
import requests

url = "https://www.amazon.com/dp/B0DGNFM9YJ"
response = requests.get(url)
print(response.status_code)
print(response.text[:500])
```
Run this, and you'll almost certainly get a 503 status code or a page that says "To discuss automated access to Amazon data please contact…" That's Amazon's WAF (Web Application Firewall) detecting your Python script. A bare requests.get() without proper headers achieves roughly a 2% success rate against Amazon.
You should see something like: 503 and a block page in the HTML. That's expected — we'll fix it in the next step.
Step 2: Set Up Custom Headers and TLS Impersonation
Simply adding a User-Agent header isn't enough anymore. Amazon compares your HTTP headers against your TLS fingerprint. If you claim to be Chrome 120 but your TLS handshake reveals Python's requests library, you're flagged and blocked.
The most reliable approach in 2025 is to use curl_cffi with browser impersonation:
```python
from curl_cffi import requests as cfreq

headers = {
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,*/*;q=0.8",
    "Accept-Language": "en-US,en;q=0.9",
    "Accept-Encoding": "gzip, deflate, br",
    "Referer": "https://www.google.com/",
    "DNT": "1",
    "Connection": "keep-alive",
    "Upgrade-Insecure-Requests": "1",
}

url = "https://www.amazon.com/dp/B0DGNFM9YJ"
response = cfreq.get(url, headers=headers, impersonate="chrome124")
print(response.status_code)
print(len(response.text))
```
With curl_cffi impersonating Chrome 124, success rates jump to approximately 94% — a 47x improvement over plain requests. You should now see a 200 status code and a much longer HTML response (100,000+ characters).
If you still get a 503, try a different impersonate value (e.g., "chrome131") or add a short delay before retrying.
Step 3: Parse the HTML and Extract Product Data
Now that we have the full HTML, let's extract the data using BeautifulSoup with verified 2025 selectors:
```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(response.text, "lxml")

# Product Title
title_el = soup.select_one("span#productTitle")
title = title_el.get_text(strip=True) if title_el else None

# Price (multiple fallbacks for different page layouts)
price_el = soup.select_one(
    "div#corePriceDisplay_desktop_feature_div span.a-price .a-offscreen"
)
if not price_el:
    price_el = soup.select_one("span.priceToPay .a-offscreen")
if not price_el:
    price_el = soup.select_one(".apexPriceToPay .a-offscreen")
price = price_el.get_text(strip=True) if price_el else None

# Rating
rating_el = soup.select_one("#acrPopover span.a-icon-alt")
rating = rating_el.get_text(strip=True) if rating_el else None

# Review Count
reviews_el = soup.select_one("span#acrCustomerReviewText")
reviews = reviews_el.get_text(strip=True) if reviews_el else None

# Availability
avail_el = soup.select_one("div#availability span")
availability = avail_el.get_text(strip=True) if avail_el else None

# Main Image URL
img_el = soup.select_one("#landingImage")
image_url = img_el.get("src") if img_el else None

print(f"Title: {title}")
print(f"Price: {price}")
print(f"Rating: {rating}")
print(f"Reviews: {reviews}")
print(f"Availability: {availability}")
print(f"Image: {image_url}")
```
Expected output (example):
```text
Title: Apple AirPods Pro (2nd Generation) with USB-C
Price: $189.99
Rating: 4.7 out of 5 stars
Reviews: 98,432 ratings
Availability: In Stock
Image: https://m.media-amazon.com/images/I/61SUj2...
```
Notice the multiple fallback selectors for price — Amazon uses different containers depending on the product type, deal status, and A/B test variant. Wrapping each extraction in a conditional check prevents your scraper from crashing when a selector doesn't match.
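One way to keep those fallbacks manageable as Amazon adds more layout variants is a small helper that tries selectors in order and returns the first match. The select_first helper below is a sketch, not part of any library, and the inline HTML is a minimal stand-in for one of Amazon's price containers:

```python
from bs4 import BeautifulSoup

def select_first(soup, selectors):
    """Return stripped text from the first selector that matches, else None."""
    for sel in selectors:
        el = soup.select_one(sel)
        if el:
            return el.get_text(strip=True)
    return None

PRICE_SELECTORS = [
    "div#corePriceDisplay_desktop_feature_div span.a-price .a-offscreen",
    "span.priceToPay .a-offscreen",
    ".apexPriceToPay .a-offscreen",
]

# Minimal HTML mimicking one of Amazon's price containers:
html = (
    '<div id="corePriceDisplay_desktop_feature_div">'
    '<span class="a-price"><span class="a-offscreen">$189.99</span></span></div>'
)
soup = BeautifulSoup(html, "html.parser")
print(select_first(soup, PRICE_SELECTORS))  # $189.99
```

When Amazon ships a new price container, you append one selector to the list instead of adding another if/else branch.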
Step 4: Scrape Multiple Products from Search Results
To build a real dataset, you'll want to start from an Amazon search results page, collect ASINs, then scrape each product detail page.
```python
import time
import random

def get_search_asins(keyword, max_pages=1):
    """Collect ASINs from Amazon search results."""
    asins = []
    for page in range(1, max_pages + 1):
        search_url = f"https://www.amazon.com/s?k={keyword}&page={page}"
        resp = cfreq.get(search_url, headers=headers, impersonate="chrome124")
        if resp.status_code != 200:
            print(f"Search page {page} returned {resp.status_code}")
            break
        search_soup = BeautifulSoup(resp.text, "lxml")
        results = search_soup.select('div[data-component-type="s-search-result"]')
        for r in results:
            asin = r.get("data-asin")
            if asin:
                asins.append(asin)
        print(f"Page {page}: found {len(results)} products")
        time.sleep(random.uniform(2, 5))  # Polite delay
    return asins

asins = get_search_asins("wireless+earbuds", max_pages=2)
print(f"Collected {len(asins)} ASINs")
```
Each ASIN maps to a clean product URL: https://www.amazon.com/dp/{ASIN}. This is more reliable than using the full search result URLs, which can contain session-specific parameters.
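If your product links come from somewhere other than search results (affiliate feeds, old spreadsheets), a quick regex can normalize any messy URL down to its ASIN and rebuild the clean /dp/ form. The asin_from_url helper below is a sketch based on Amazon's common URL shapes:

```python
import re

def asin_from_url(url):
    """Extract the 10-character ASIN from a messy Amazon product URL."""
    m = re.search(r"/(?:dp|gp/product)/([A-Z0-9]{10})", url)
    return m.group(1) if m else None

messy = ("https://www.amazon.com/Apple-AirPods-Pro/dp/B0DGNFM9YJ/"
         "ref=sr_1_3?keywords=earbuds&qid=1700000000")
asin = asin_from_url(messy)
print(asin)                                 # B0DGNFM9YJ
print(f"https://www.amazon.com/dp/{asin}")  # clean canonical URL
```

Deduplicating on ASIN rather than full URL also prevents scraping the same product twice under different tracking parameters.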
Step 5: Handle Pagination and Scrape at Scale
Now let's combine search collection and detail page scraping into a full pipeline:
```python
import pandas as pd

def scrape_product(asin):
    """Scrape a single Amazon product detail page."""
    url = f"https://www.amazon.com/dp/{asin}"
    try:
        resp = cfreq.get(url, headers=headers, impersonate="chrome124")
        if resp.status_code != 200:
            return None
        soup = BeautifulSoup(resp.text, "lxml")
        title_el = soup.select_one("span#productTitle")
        price_el = (
            soup.select_one("div#corePriceDisplay_desktop_feature_div span.a-price .a-offscreen")
            or soup.select_one("span.priceToPay .a-offscreen")
            or soup.select_one(".apexPriceToPay .a-offscreen")
        )
        rating_el = soup.select_one("#acrPopover span.a-icon-alt")
        reviews_el = soup.select_one("span#acrCustomerReviewText")
        avail_el = soup.select_one("div#availability span")
        img_el = soup.select_one("#landingImage")
        return {
            "asin": asin,
            "title": title_el.get_text(strip=True) if title_el else None,
            "price": price_el.get_text(strip=True) if price_el else None,
            "rating": rating_el.get_text(strip=True) if rating_el else None,
            "reviews": reviews_el.get_text(strip=True) if reviews_el else None,
            "availability": avail_el.get_text(strip=True) if avail_el else None,
            "image_url": img_el.get("src") if img_el else None,
            "url": url,
        }
    except Exception as e:
        print(f"Error scraping {asin}: {e}")
        return None

# Scrape all collected ASINs
products = []
for i, asin in enumerate(asins):
    print(f"Scraping {i+1}/{len(asins)}: {asin}")
    product = scrape_product(asin)
    if product:
        products.append(product)
    time.sleep(random.uniform(2, 5))  # Random delay between requests

df = pd.DataFrame(products)
print(f"\nScraped {len(df)} products successfully")
print(df.head())
```
The random delay between 2–5 seconds is critical. Perfectly regular timing (e.g., exactly 3 seconds every time) looks suspicious to Amazon's behavioral analysis. Random intervals mimic human browsing patterns.
Step 6: Save Scraped Amazon Data to CSV
```python
df.to_csv("amazon_products.csv", index=False, encoding="utf-8-sig")
print("Saved to amazon_products.csv")
```
You should now have a clean CSV with columns for ASIN, title, price, rating, reviews, availability, image URL, and product URL. This is where most tutorials stop — but if you're building a real workflow, CSV is just the beginning.
Anti-Blocking Deep Dive: How to Keep Your Scraper Running
Getting blocked is the number-one problem for anyone who tries to scrape Amazon products with Python. Amazon's six-layer defense includes IP reputation analysis, TLS fingerprinting, browser environment checks, behavioral biometrics, CAPTCHAs, and ML-driven anomaly detection. Below is a layered strategy to address each one.
Rotate User-Agents and Full Headers
A single static User-Agent gets flagged fast. Rotate through a list of current browser strings:
```python
import random

USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/143.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/143.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:121.0) Gecko/20100101 Firefox/121.0",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/26.0 Safari/605.1.15",
]

def get_headers():
    return {
        "User-Agent": random.choice(USER_AGENTS),
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
        "Accept-Language": "en-US,en;q=0.9",
        "Accept-Encoding": "gzip, deflate, br",
        "Referer": "https://www.google.com/",
        "DNT": "1",
        "Connection": "keep-alive",
    }
```
One detail that trips people up: your Accept-Language must match the geographic location implied by your IP. Sending Accept-Language: en-US from a German IP is a red flag.
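One way to keep the two consistent is to tag each proxy with its exit-node country and derive Accept-Language from that tag. The proxy entries and locale strings below are placeholders, not real endpoints:

```python
import random

# Hypothetical proxy pool; each entry is tagged with its exit-node country.
PROXY_POOL = [
    {"url": "http://user:pass@us-proxy.example.com:8080", "country": "US"},
    {"url": "http://user:pass@de-proxy.example.com:8080", "country": "DE"},
]

# Accept-Language values that match each country's implied locale.
LOCALES = {
    "US": "en-US,en;q=0.9",
    "DE": "de-DE,de;q=0.9,en;q=0.5",
}

def pick_proxy_and_headers():
    """Pick a proxy and build headers whose locale matches its IP."""
    proxy = random.choice(PROXY_POOL)
    geo_headers = {"Accept-Language": LOCALES[proxy["country"]]}
    return proxy["url"], geo_headers

proxy_url, geo_headers = pick_proxy_and_headers()
print(proxy_url, geo_headers["Accept-Language"])
```

In a real scraper you'd merge geo_headers into the full header set from get_headers() before each request.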
TLS Fingerprint Impersonation with curl_cffi
We covered this in Step 2, but it's worth emphasizing: this single technique provides the biggest improvement in success rate. Standard Python requests achieves about 2% success against Amazon. With curl_cffi impersonation, you're at roughly 94%. That's the difference between a working scraper and a broken one.
```python
from curl_cffi import requests as cfreq

# Rotate impersonation targets too
BROWSERS = ["chrome120", "chrome124", "chrome131"]

response = cfreq.get(
    url,
    headers=get_headers(),
    impersonate=random.choice(BROWSERS),
)
```
Proxy Rotation
For scraping more than a handful of pages, you'll need proxy rotation. Amazon tracks IP addresses and will block any single IP that sends too many requests.
```python
PROXIES = [
    "http://user:pass@proxy1.example.com:8080",
    "http://user:pass@proxy2.example.com:8080",
    "http://user:pass@proxy3.example.com:8080",
]

proxy = random.choice(PROXIES)
response = cfreq.get(
    url,
    headers=get_headers(),
    impersonate="chrome124",
    proxies={"http": proxy, "https": proxy},
)
```
Residential proxies are more effective than datacenter proxies (Amazon blocks datacenter IP ranges proactively), but they're also more expensive. For a small project, you might start with a small residential proxy pool and scale up as needed.
Rate Limiting and Exponential Backoff
No competitor article I found covers this, but it's essential. When you get a 503 or CAPTCHA response, don't just retry immediately — that's a fast path to a permanent ban.
```python
import time
import random

def fetch_with_backoff(url, max_retries=3):
    """Fetch a URL with exponential backoff on failure."""
    for attempt in range(max_retries):
        response = cfreq.get(
            url,
            headers=get_headers(),
            impersonate=random.choice(BROWSERS),
        )
        if response.status_code == 200:
            return response
        # Exponential backoff with jitter
        wait = min(2 ** attempt + random.uniform(0, 1), 30)
        print(f"Attempt {attempt+1} failed ({response.status_code}). Waiting {wait:.1f}s...")
        time.sleep(wait)
    return None  # All retries exhausted
```
The formula wait = min(2^attempt + jitter, max_delay) ensures your delays grow (2s, 4s, 8s...) but never exceed a reasonable cap. The random jitter prevents your retry pattern from being fingerprinted.
Selenium or Playwright Fallback for JS-Rendered Content
Some Amazon pages (especially those with dynamic pricing widgets or variation selectors) require JavaScript to render fully. When curl_cffi returns incomplete HTML, a headless browser is your fallback:
```python
from playwright.sync_api import sync_playwright

def scrape_with_browser(url):
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        page = browser.new_page()
        page.goto(url, wait_until="domcontentloaded")
        page.wait_for_timeout(3000)  # Let JS render
        html = page.content()
        browser.close()
        return html
```
This is slower — 3–5 seconds per page vs. under 1 second with curl_cffi. Use it only when needed.
In my experience, curl_cffi handles 90%+ of Amazon product pages without a browser.
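A simple heuristic for deciding when to fall back: check whether the fast curl_cffi response actually contains the markup you need. The marker strings below are an assumption based on the selectors used earlier in this guide:

```python
def needs_browser(html):
    """Heuristic: fall back to a headless browser when key markers are missing."""
    markers = ["productTitle", "a-price"]  # assumption: IDs/classes used earlier
    return not all(m in html for m in markers)

# Usage sketch:
#   html = resp.text                      # fast curl_cffi path
#   if needs_browser(html):
#       html = scrape_with_browser(url)   # slow headless-browser fallback

print(needs_browser("<span id='productTitle'>X</span>"))  # True: no price markup
```

This keeps the expensive browser path reserved for the small fraction of pages that genuinely need it.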
Anti-Blocking Summary
| Technique | Difficulty | Effectiveness | Covered by Most Tutorials? |
|---|---|---|---|
| Custom User-Agent | Easy | Low (Amazon detects patterns) | Yes |
| Full header rotation | Easy | Medium | Rarely |
| TLS impersonation (curl_cffi) | Medium | High (~94% success) | Almost never |
| Proxy rotation | Medium | High | Briefly, if at all |
| Rate limiting + exponential backoff | Easy | Medium | No |
| Selenium/Playwright fallback | Medium | High (for JS content) | Mentioned, not demonstrated |
Beyond CSV: Export Scraped Amazon Data to Google Sheets, Airtable, and More
Every tutorial I reviewed stops at CSV export. But real business workflows need data in Google Sheets, databases, or tools like Airtable and Notion.
Export to Google Sheets with gspread
First, set up a Google service account (one-time setup):
- Go to the Google Cloud Console → APIs & Services → Credentials
- Create a service account and download the JSON key file
- Save it to ~/.config/gspread/service_account.json
- Share your target spreadsheet with the client_email from the JSON file
Then:
```python
import gspread
from gspread_dataframe import set_with_dataframe

gc = gspread.service_account()
sh = gc.open("Amazon Scrape Data")
worksheet = sh.sheet1
set_with_dataframe(worksheet, df)
print("Data exported to Google Sheets!")
```
This writes your entire DataFrame directly to a Google Sheet — live, shareable, and ready for dashboards.
Store in SQLite for Local Analysis
For larger datasets or historical tracking, SQLite is perfect — no server setup, just a single file:
```python
import sqlite3

conn = sqlite3.connect("amazon_products.db")
df.to_sql("products", conn, if_exists="append", index=False)
print(f"Stored {len(df)} products in SQLite")

# Query later:
historical = pd.read_sql_query(
    "SELECT * FROM products WHERE price IS NOT NULL ORDER BY rowid DESC LIMIT 100",
    conn,
)
```
The No-Code Alternative
If you don't want to maintain Python export scripts, Thunderbit offers free export to Google Sheets, Airtable, Notion, Excel, CSV, and JSON — including image fields that render directly in Airtable and Notion. No gspread setup, no API credentials, no code at all. For teams that need data flowing into their existing tools, it's a significant time saver.
Scheduling Automated Amazon Scrapes — The Missing Chapter
Price monitoring and inventory tracking require recurring scrapes, not one-off runs. Yet I couldn't find a single competitor article that covers scheduling. Here's how to automate your Python scraper.
Cron Jobs (Linux/macOS)
Open your crontab:
```bash
crontab -e
```
Add a line to run your scraper daily at 6 AM:
```bash
0 6 * * * cd /path/to/amazon-scraper && /path/to/venv/bin/python scraper.py >> ~/scraper.log 2>&1
```
Or every 6 hours:
```bash
0 */6 * * * cd /path/to/amazon-scraper && /path/to/venv/bin/python scraper.py >> ~/scraper.log 2>&1
```
Windows Task Scheduler
Create a batch file run_scraper.bat:
```bat
@echo off
cd /d "C:\path\to\amazon-scraper"
call venv\Scripts\activate
python scraper.py
deactivate
```
Then open Task Scheduler → Create Basic Task → set your trigger (Daily, Hourly) → Action: "Start a program" → browse to run_scraper.bat.
GitHub Actions (Free Tier)
For a cloud-based schedule with zero infrastructure:
```yaml
name: Amazon Scraper
on:
  schedule:
    - cron: "0 6 * * *"  # Daily at 6 AM UTC
  workflow_dispatch:      # Manual trigger
jobs:
  scrape:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - name: Set up Python
        uses: actions/setup-python@v4
        with:
          python-version: "3.11"
      - name: Install dependencies
        run: pip install -r requirements.txt
      - name: Run scraper
        run: python scraper.py
      - name: Commit results
        run: |
          git config user.name 'GitHub Actions'
          git config user.email 'actions@github.com'
          git add data/
          git diff --staged --quiet || git commit -m "Update scraped data"
          git push
```
Store proxy credentials in GitHub Secrets, and you've got a free, automated scraping pipeline.
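Inside scraper.py, read those secrets from the environment rather than hardcoding them. PROXY_URL is a hypothetical secret name; you'd expose it in the workflow with an `env:` mapping like `PROXY_URL: ${{ secrets.PROXY_URL }}`:

```python
import os

# PROXY_URL is a hypothetical GitHub Secret exposed via the workflow's env block.
proxy_url = os.environ.get("PROXY_URL")  # None when the secret isn't set

proxies = {"http": proxy_url, "https": proxy_url} if proxy_url else None
print("Using proxy" if proxies else "No proxy configured")
```

This keeps credentials out of your repository history while letting the same script run locally without a proxy.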
No-Code Alternative: Thunderbit's Scheduled Scraper
For teams that don't want to manage cron syntax or cloud infrastructure, Thunderbit offers a built-in scheduled scraper. You describe the schedule in plain English (e.g., "every day at 8 AM" or "every Monday"), add your Amazon URLs, and click "Schedule." No terminal, no YAML files, no deployment pipeline. It's particularly useful for ecommerce teams running continuous price or inventory monitoring.
Python DIY vs. Scraper API vs. No-Code: Which Approach Should You Use?
This is a question I see on forums constantly, and no top-ranking article provides a structured answer. So here's my honest take:
| Criteria | Python + BS4/curl_cffi | Scraper API (ScraperAPI, Oxylabs) | No-Code (Thunderbit) |
|---|---|---|---|
| Setup time | 30–60 min | 10–20 min | ~2 minutes |
| Coding required | Yes (Python) | Yes (API calls) | None |
| Anti-blocking built-in | No (DIY) | Yes | Yes |
| Handles JS rendering | Only with Selenium/Playwright | Varies by provider | Yes (Browser or Cloud mode) |
| Scheduling | DIY (cron/cloud) | Some offer it | Built-in |
| Cost | Free (+ proxy costs) | $30–100+/mo | Free tier available |
| Maintenance | High (selectors break) | Low | None (AI adapts) |
| Best for | Developers wanting full control | Scale & reliability at volume | Speed, non-developers, business users |
Python is the right choice if you want to learn, customize every detail, and don't mind ongoing maintenance. Scraper APIs handle anti-blocking for you but still require code. And Thunderbit is the fastest path for sales, ecommerce ops, or anyone who just needs the data — no selectors, no code, no maintenance when Amazon changes its HTML.
How Thunderbit Scrapes Amazon Products in 2 Clicks
I'm biased, of course — my team built this. But the workflow genuinely is this simple:
- Install the Thunderbit Chrome extension
- Navigate to an Amazon search results or product page
- Click "AI Suggest Fields" (or use the instant Amazon scraper template)
- Click "Scrape"
Thunderbit's AI reads the page, identifies the data structure, and extracts everything into a clean table. You can export to Excel, Google Sheets, Airtable, or Notion for free. The real payoff: when Amazon changes its HTML next week (and it will), Thunderbit's AI adapts automatically. No broken scripts, no selector updates.
For enriching product lists with detail-page data, Thunderbit's Subpage Scraping feature automatically follows links to product pages and pulls in additional fields like images, descriptions, and variations — something that takes significant extra code in Python.
Tips to Keep Your Python Amazon Scraper Working Long-Term
If you're going the Python route, here's how to minimize maintenance headaches:
- Check selectors regularly. Amazon changes them often. Bookmark this article — I'll update the selector table as things change.
- Monitor your success rate. Track the ratio of 200 responses vs. 503s/CAPTCHAs. Set up an alert (even a simple email) when your success rate drops below 80%.
- Store raw HTML. Save the full HTML response alongside your parsed data. If selectors change, you can re-parse historical data without re-scraping.
- Rotate proxies and User-Agents frequently. Static fingerprints get flagged within hours at scale.
- Use exponential backoff. Never retry immediately after a block.
- Containerize with Docker. Wrap your scraper in a Docker container for easy deployment and portability.
- Add data validation. Check that prices are numeric, ratings are between 1–5, and titles aren't empty. Catching bad records at scrape time is far cheaper than cleaning them out of your dataset later.
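The success-rate monitoring tip above can be sketched in a few lines. The SuccessMonitor class below is a hypothetical helper, not a library API:

```python
from collections import deque

class SuccessMonitor:
    """Hypothetical helper: rolling success rate over the last N requests."""

    def __init__(self, window=100, threshold=0.8):
        self.results = deque(maxlen=window)
        self.threshold = threshold

    def record(self, status_code):
        self.results.append(status_code == 200)

    @property
    def rate(self):
        return sum(self.results) / len(self.results) if self.results else 1.0

    def should_alert(self, min_samples=10):
        return len(self.results) >= min_samples and self.rate < self.threshold

monitor = SuccessMonitor()
for code in [200] * 7 + [503] * 3:
    monitor.record(code)

print(f"Success rate: {monitor.rate:.0%}")  # Success rate: 70%
print(monitor.should_alert())               # True
```

Call monitor.record(response.status_code) after every fetch, and wire should_alert() to whatever notification channel you already use (email, Slack, a log line).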
Or, if all of that sounds like more work than you signed up for, consider whether a no-code tool like Thunderbit might be a better fit for your use case. There's no shame in choosing the faster path — I've spent enough years debugging scrapers to know that sometimes the best code is the code you don't have to write.
Legal and Ethical Considerations When Scraping Amazon
Since this comes up in every conversation about Amazon scraping, a quick note on the legal landscape:
- Publicly available data scraping is generally legal in the US. The landmark hiQ v. LinkedIn ruling (2022) established that accessing public data doesn't violate the CFAA. More recently, rulings such as Meta v. Bright Data (2024) reinforced this principle.
- Amazon's ToS prohibit automated access. This is a civil matter (breach of contract), not a criminal one. Courts have generally distinguished between the two.
- Amazon v. Perplexity (2025) is an active case involving AI scraping of Amazon pages. A preliminary injunction was issued in March 2026. This is worth watching.
- Stick to public pages. Don't scrape login-protected content, personal data, or anything behind authentication.
- Respect rate limits. Don't hammer Amazon's servers. A delay of 2–5 seconds between requests is reasonable.
- Use data responsibly. Scrape for analysis, not for republishing copyrighted content.
- Consult legal counsel for large-scale commercial use, especially if you're in the EU (GDPR applies to personal data).
For a deeper dive, see our guide on whether web scraping is legal.
Wrapping Up
You now have a working Python Amazon scraper with verified 2025 selectors, a layered anti-blocking strategy that goes far beyond "add a User-Agent," practical scheduling options for continuous monitoring, and export methods that get your data into Google Sheets, databases, or any tool your team uses.
Quick summary:
- Python + curl_cffi + BeautifulSoup gives you full control and a ~94% success rate when combined with TLS impersonation
- Anti-blocking requires multiple layers: header rotation, TLS impersonation, proxy rotation, rate limiting, and exponential backoff
- Scheduling turns a one-off script into a continuous monitoring pipeline (cron, GitHub Actions, or Thunderbit's built-in scheduler)
- Export beyond CSV — Google Sheets, SQLite, Airtable, Notion — is where real business value lives
- Thunderbit offers a 2-click alternative for non-developers or anyone who'd rather spend their time analyzing data instead of debugging selectors
If you want to try the code, everything in this guide is ready to copy and run. And if you'd rather skip the coding entirely, Thunderbit lets you test the no-code approach on Amazon right away.
For more, see the other scraping guides on the Thunderbit blog, and watch step-by-step walkthroughs on the Thunderbit YouTube channel.
Happy scraping — and may your selectors survive until the next Amazon update.
FAQs
1. Why does my Python Amazon scraper get blocked after a few requests?
Amazon uses a six-layer defense system: IP reputation analysis, TLS fingerprinting (JA3/JA4), browser environment detection, behavioral biometrics, CAPTCHA challenges, and ML-driven anomaly detection. A basic requests script with just a User-Agent header achieves only about 2% success. You need TLS impersonation (curl_cffi), full header rotation, proxy rotation, and rate limiting with random jitter to maintain reliable access.
2. What Python libraries are best for scraping Amazon products in 2025?
curl_cffi for TLS-impersonated HTTP requests (the biggest single improvement), BeautifulSoup4 with lxml for HTML parsing, pandas for data structuring and export, and Selenium or Playwright as a fallback for JavaScript-rendered content. Python remains the most widely used language among scraping developers.
3. Is it legal to scrape Amazon product data?
Scraping publicly available data is generally legal in the US, supported by rulings like hiQ v. LinkedIn and Meta v. Bright Data. Amazon's Terms of Service prohibit automated access, but courts distinguish between ToS violations (civil) and criminal violations. Always avoid login-protected content, respect rate limits, and consult legal counsel for large-scale commercial use.
4. Can I scrape Amazon without writing any code?
Yes. Tools like let you scrape Amazon products in 2 clicks with a Chrome extension. Its AI-powered field detection automatically structures the data, and you can export to Excel, Google Sheets, Airtable, or Notion for free. When Amazon changes its HTML, Thunderbit's AI adapts without any manual updates.
5. How often does Amazon change its HTML selectors, and how do I keep my scraper updated?
Frequently and without notice. The scraping community reports that a significant share of crawlers need weekly fixes due to DOM changes. To stay ahead, monitor your scraper's success rate, store raw HTML for re-parsing, and check selectors against live pages regularly. Alternatively, AI-powered tools like Thunderbit adapt automatically, eliminating this maintenance burden.