How I Scrape Yelp with Python Without Getting Blocked

Yelp holds hundreds of millions of customer reviews across local businesses, and getting that data into a usable format has never been harder. Yelp's 2024–2025 anti-bot crackdown has quietly broken most existing Python scraping tutorials.

If you've tried running a Yelp scraper recently and hit a wall of 403 errors, empty HTML responses, or CAPTCHAs that weren't there six months ago, you're not imagining things. Yelp now runs TLS/JA3 fingerprinting, rotating obfuscated CSS class names, and aggressive IP reputation scoring, meaning the old requests + BeautifulSoup approach that every tutorial still recommends dies on the first request. I've spent weeks testing different approaches against Yelp's current stack, and this guide covers everything that actually works in 2025: the official Fusion API (and why it probably won't be enough), a full Python scraping workflow with a layered anti-blocking strategy, and a 2-click no-code alternative with Thunderbit for readers who just want the data without the debugging marathon.

Why Scrape Yelp with Python (and Who Actually Benefits)

Before writing a single line of code, what's the actual business case for Yelp data? The platform isn't just a restaurant review site — it's effectively a live database of local businesses with structured contact info, ratings, categories, hours, and hundreds of millions of customer reviews.

Here's who benefits most and what they're extracting:

Use Case	Key Data Fields	Why It Matters
Sales & lead generation	Business name, phone, website, address, category, rating	Build targeted prospect lists of local SMBs — 4 of 5 Yelp users are purchase-ready on arrival
Competitive intelligence	Reviews, star ratings, review volume, sentiment	Monitor competitor reputation, identify service gaps, track trends
Market research & NLP	Full review text, dates, reviewer metadata	Sentiment analysis, topic modeling — Yelp reviews are one of the most-used NLP corpora in academic research
Real estate & site selection	Business density, category mix, review quality by area	Franchise and retail site selection — Yelp sells Location Intelligence as a licensed B2B product for exactly this
Ecommerce & operations	Pricing signals, customer complaints, service hours	Track how competitors are reviewed, identify operational patterns

The common thread: the real goal is structured data, and Python is just one vehicle to get there. Some readers will want full programmatic control. Others just need a spreadsheet of plumber contact info in Austin. Both paths are covered here.

Yelp Fusion API vs. Python Web Scraping: Which Should You Use?

Most guides skip this decision entirely and jump straight into code without evaluating whether the official Fusion API (now rebranded as the "Yelp Places API") would have been sufficient. In my experience, that evaluation saves hours of wasted effort, because the API is great for some things and completely inadequate for others.

What the Fusion API Actually Gives You

The Fusion API provides structured business search, business details, autocomplete, and a reviews endpoint. It's authorized, well-documented, and doesn't require anti-bot gymnastics.

But the reviews endpoint is where things fall apart. Here's what Yelp staff have confirmed on GitHub:

"The Yelp API does not return full review text. Three review excerpts of 160 characters are provided by default." —

That's not a bug; it's by design. The API caps out at 3 review excerpts (7 on Premium), each truncated to ~160 characters. No review metadata (useful/funny/cool votes), no reviewer history, no owner replies. And the daily call limit fell to 300–500 for new keys after May 2023, down from 5,000. Entry pricing starts at $29/month.
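To make the limitation concrete, here is a minimal sketch of a request to the reviews endpoint using only the standard library. The `/v3/businesses/{id}/reviews` path and Bearer-token header follow Yelp's public API documentation; the business alias and the `YELP_API_KEY` string are placeholders, and the sample payload is invented for illustration rather than captured from a real response.

```python
import json
import urllib.request

API_ROOT = "https://api.yelp.com/v3"


def build_reviews_request(business_id: str, api_key: str) -> urllib.request.Request:
    """Prepare (but do not send) a GET request for a business's review excerpts."""
    return urllib.request.Request(
        f"{API_ROOT}/businesses/{business_id}/reviews",
        headers={"Authorization": f"Bearer {api_key}", "Accept": "application/json"},
    )


def excerpt_lengths(payload: dict) -> list[int]:
    """Return the character length of each review excerpt in an API payload."""
    return [len(review.get("text", "")) for review in payload.get("reviews", [])]


# Invented payload: a short list of truncated excerpts, as described above.
sample = {"reviews": [{"rating": 5, "text": "Great tacos, friendly staff..."}]}

req = build_reviews_request("some-restaurant-austin", "YELP_API_KEY")
print(req.full_url)
print(excerpt_lengths(sample))

# To actually call the API (needs a real key and network access):
# with urllib.request.urlopen(req) as resp:
#     payload = json.load(resp)
```

Running `excerpt_lengths` on a real response is a quick way to confirm for yourself how much review text the API returns before deciding whether to scrape.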

The Decision Framework

Factor	Yelp Fusion API	Python Web Scraping	Thunderbit (No-Code)
Full reviews	❌ Only 3 excerpts (~160 chars each)	✅ All reviews via GraphQL	✅ All visible reviews
Rate limits	300–500/day (new); 5,000 (legacy)	Self-managed (proxy budget)	Credit-based
Setup effort	~15 min (API key + SDK)	Hours to days	~2 minutes
Business fields	~20 structured fields	Unlimited (parse HTML/JSON)	AI-suggested fields
Anti-bot handling	N/A (authorized)	Must build yourself	Handled automatically
Legal risk	✅ Authorized	⚠️ ToS gray area	⚠️ Same as scraping
Cost	$29/mo minimum	Free (+ proxy costs $0.75–$4/GB)	Free tier available
Maintenance	Low (API stable)	High (selectors rot, anti-bot escalates)	Low (AI re-adapts)

Use the Fusion API if: you need basic business info, small-scale lookups, or an authorized integration — and 3 review snippets per business is enough.

Use Python scraping if: you need full review text, all reviews for a business, review metadata, more than 240 results per search, or your budget is below $29/month.

Use Thunderbit if: you want the data fast without writing or maintaining code. More on this in the no-code section below.

The No-Code Shortcut: Scrape Yelp with Thunderbit (No Python Needed)

Before the Python deep-dive, here's the fastest path for readers whose real goal is the data, not the coding exercise. Every competitor guide assumes Python proficiency, but in my work at Thunderbit, I've seen that a huge chunk of people searching "scrape Yelp" are sales reps, ops managers, and small business owners who just want a spreadsheet of local businesses — not a crash course in TLS fingerprinting.

Thunderbit already ships pre-built Yelp templates:

A business template extracts business name, rating, contact details, address, hours, category
A review template extracts reviewer username, review content, rating, date, reviewer location

How It Works in Practice

Open a Yelp search results page or business page in Chrome
Click AI Suggest Fields: the AI reads the page and proposes columns (business name, rating, review count, price range, category, address, phone, URL)
Click Scrape — done

For the pre-built Yelp templates, it's even simpler: open the template, click Scrape.

Subpage scraping handles the enrichment loop automatically — start from a Yelp search results page, enable subpage scraping, and Thunderbit visits each business page to pull hours, full reviews, website, photos, and amenities. No additional setup.

Pagination is automatic: both click-based and scroll-based, handled out of the box.

Exports are free on every tier — Excel, Google Sheets, Airtable, Notion, CSV, JSON. No pandas, no CSV-writing code.

Time Comparison

Scenario	Python Scraper	Thunderbit
First run	Hours to days (write selectors, handle pagination, proxies, retry logic)	~30 seconds with the pre-built Yelp template
When Yelp changes markup	Manually rewrite selectors	Click AI Suggest Fields again — re-adapts automatically
When IP gets banned	Debug, rotate proxy pools, re-test	Cloud mode handles IP rotation
Export to Google Sheets	Write OAuth + pandas glue	One click, free

If you try Thunderbit first and find it covers your needs, you can skip the rest of this article. If you need full programmatic control, custom fields, or scale beyond a few thousand records per month — read on.

Python Libraries for Scraping Yelp: Which One to Pick

"Should I use Scrapy, BS4+requests, or Selenium?" is one of the most common questions in r/webscraping threads about Yelp. And yet every tutorial just picks their favorite library and moves on without explaining why. Here's the honest breakdown.

The 2025 Reality: `requests + BeautifulSoup` Is Broken for Yelp

The stack that every canonical Yelp tutorial recommends — pip install requests beautifulsoup4 — gets you blocked on the first request in 2025. Not the 50th. The first.

The reason: Python's requests library ships a TLS/JA3 fingerprint that doesn't match any real browser. Yelp's anti-bot layer flags it at the TLS-handshake level, before your User-Agent header is even read. I tested this repeatedly — fresh IP, realistic headers, randomized delays — and still hit 403 Forbidden immediately with vanilla requests.

The Library Decision Matrix

Library	Best For	Handles JS?	Anti-Bot?	Learning Curve	Speed
`requests` + `BeautifulSoup`	~~Simple single-page scraping~~ (broken for Yelp)	❌	❌	Very low	Fast (until blocked)
`httpx` async + `parsel`	Large-scale async scraping	❌	❌	Low	Very fast
`curl_cffi` + `parsel`	Yelp-specific: TLS impersonation	❌	✅ TLS/JA3/HTTP2	Low	Very fast
`Scrapy` 2.14	Full crawl pipelines with pagination	Partial (via scrapy-playwright)	AutoThrottle, retry middleware	Medium-High	Fast
`Selenium` 4.43 / `Playwright` 1.58	JS-heavy pages, CAPTCHA workarounds	✅	Partial	Medium	Slow (~10–30 pages/min)
Thunderbit	Non-coders, quick extraction	✅ (browser)	Built-in (Cloud mode)	Very low	Fast

The `curl_cffi` Revelation

The library that changed my Yelp scraping workflow is curl_cffi, a Python binding for curl-impersonate. It emits the same TLS/JA3 + HTTP/2 fingerprint as real Chrome, and its API is a near drop-in replacement for requests:

from curl_cffi import requests

# impersonate selects the browser whose TLS/HTTP2 fingerprint is emitted.
r = requests.get(
    "https://www.yelp.com/biz/some-restaurant",
    impersonate="chrome131",
)
print(r.status_code, len(r.text))

That single change (from curl_cffi import requests plus impersonate="chrome131") gets past Yelp's TLS fingerprint check without spinning up a browser. In my testing, it's the difference between instant 403s and clean 200 responses.

My recommended stack for Yelp in 2025: curl_cffi + parsel + jmespath + residential proxies. If you need a full crawl pipeline with scheduling, wrap it in Scrapy 2.14 with a curl_cffi-based downloader middleware.

Setting Up Your Python Environment to Scrape Yelp

Difficulty: Intermediate
Time Required: ~15 minutes for setup, 1–2 hours for a working scraper
What You'll Need: Python 3.10+ (3.12 recommended), a terminal, and optionally a residential proxy provider

Step 1: Create a Virtual Environment and Install Packages

python3.12 -m venv .venv
source .venv/bin/activate  # On Windows: .venv\Scripts\activate
pip install "curl_cffi>=0.11" "parsel>=1.9" "jmespath>=1.0" pandas

What each package does:

curl_cffi — makes HTTP requests with Chrome's TLS fingerprint (the anti-bot bypass)
parsel — CSS/XPath selectors for parsing HTML (same engine Scrapy uses, lighter weight)
jmespath — declarative JSON querying (cleaner than nested dict access for Yelp's embedded JSON)
pandas — data export to CSV/Excel

Optional but useful:

pip install fake-useragent  # Note: repo archived April 2026 but still installable

Step-by-Step: How to Scrape Yelp with Python

This is the core tutorial. The key insight that makes everything more resilient: skip CSS selectors, pull hidden JSON instead. Yelp randomizes CSS class names at build time (y-css-14xwok2 one week, y-css-hcq7b9 the next), so any scraper pinned to them breaks within weeks. The embedded JSON payloads — application/ld+json schema and react-root-props — are stable.

Step 2: Scrape Yelp Search Results

Yelp search URLs follow a predictable pattern: https://www.yelp.com/search?find_desc={term}&find_loc={location}. The search results data is embedded in a <script data-id="react-root-props"> tag as JSON — not rendered in the CSS-class soup.

import json
import random
import re
import time
from urllib.parse import urlencode

import jmespath
from curl_cffi import requests
from parsel import Selector

# Keep the UA version in line with the impersonate target used below.
HEADERS = {
    "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
                  "AppleWebKit/537.36 (KHTML, like Gecko) "
                  "Chrome/131.0.0.0 Safari/537.36",
    "accept": "text/html,application/xhtml+xml,application/xml;q=0.9,"
              "image/avif,image/webp,image/apng,*/*;q=0.8",
    "accept-language": "en-US,en;q=0.9",
    "accept-encoding": "gzip, deflate, br",
    "cookie": "intl_splash=false",
}


def scrape_search(term: str, location: str, max_pages: int = 3):
    results = []
    for page in range(max_pages):
        # urlencode escapes spaces and commas in values like "Austin, TX".
        query = urlencode(
            {"find_desc": term, "find_loc": location, "start": page * 10}
        )
        url = f"https://www.yelp.com/search?{query}"
        r = requests.get(url, headers=HEADERS, impersonate="chrome131")
        if r.status_code != 200:
            print(f"Blocked on page {page}: {r.status_code}")
            break
        sel = Selector(text=r.text)
        script = sel.xpath(
            "//script[@data-id='react-root-props']/text()"
        ).get() or ""
        # Greedy match: a lazy one would stop at the first "};" inside the JSON.
        m = re.search(r"react_root_props\s*=\s*(\{.*\});", script, re.S)
        if not m:
            print(f"No react-root-props found on page {page}: possible soft block")
            break
        data = json.loads(m.group(1))
        # Keep only list entries that are business results, then project fields.
        businesses = jmespath.search(
            "legacyProps.searchAppProps.searchPageProps"
            ".mainContentComponentsListProps"
            "[?searchResultBusiness].searchResultBusiness.{"
            "name: name, url: businessUrl, rating: rating, "
            "reviews: reviewCount, phone: phone, "
            "neighborhoods: neighborhoods}",
            data,
        ) or []
        results.extend(businesses)
        # Random pause between pages so the timing is not predictable.
        time.sleep(random.uniform(3, 7))
    return results

You should get back a list of dicts with business names, URLs, ratings, and review counts. If react-root-props is missing from the response, you've been served a block shell — rotate your IP and retry.

The Cookie: intl_splash=false header is a standard workaround for Yelp's country-splash redirect. Without it, non-US IPs hit a splash page that looks like a soft block but isn't.

Step 3: Scrape Yelp Business Pages

Each business URL from the search results leads to a detail page with richer data. The most stable extraction target is the <script type="application/ld+json"> block — it contains structured schema.org data that Yelp maintains for SEO and doesn't obfuscate.

# Reuses requests, Selector, json, and HEADERS from Step 2.
def scrape_business(biz_url: str) -> dict:
    url = f"https://www.yelp.com{biz_url}" if biz_url.startswith("/") else biz_url
    r = requests.get(url, headers=HEADERS, impersonate="chrome131")
    if r.status_code != 200:
        return {"url": url, "error": r.status_code}
    sel = Selector(text=r.text)
    # Encoded business ID, needed later for the reviews endpoint.
    biz_id = sel.css('meta[name="yelp-biz-id"]::attr(content)').get()
    for raw in sel.css('script[type="application/ld+json"]::text').getall():
        try:
            data = json.loads(raw)
        except json.JSONDecodeError:
            continue
        # A block may hold a single node or a list of nodes.
        for node in (data if isinstance(data, list) else [data]):
            if not isinstance(node, dict):
                continue
            if node.get("@type") in (
                "Restaurant", "LocalBusiness", "FoodEstablishment",
                "HealthAndBeautyBusiness", "HomeAndConstructionBusiness",
            ):
                rating = node.get("aggregateRating") or {}
                return {
                    "biz_id": biz_id,
                    "name": node.get("name"),
                    "rating": rating.get("ratingValue"),
                    "review_count": rating.get("reviewCount"),
                    "address": node.get("address"),
                    "telephone": node.get("telephone"),
                    "price_range": node.get("priceRange"),
                    "hours": node.get("openingHours"),
                    "url": url,
                }
    # No matching schema.org node found; return what we have.
    return {"biz_id": biz_id, "url": url}

The meta[name="yelp-biz-id"] value is the encoded business ID you'll need for the reviews endpoint. Grab it here — you'll use it in the next step.

Step 4: Scrape Yelp Reviews with Pagination

This is where the Fusion API falls short and scraping shines. Yelp's internal GraphQL batch endpoint returns full review text, reviewer info, dates, ratings, and vote counts — everything the API withholds.

The endpoint is https://www.yelp.com/gql/batch, and it uses a static documentId for the GetBusinessReviewFeed operation. Pagination works via a base64-encoded cursor.

import base64
import json
import random
import time

# Reuses requests and HEADERS from Step 2.
GQL_URL = "https://www.yelp.com/gql/batch"
DOC_ID = "ef51f33d1b0eccc958dddbf6cde15739c48b34637a00ebe316441031d4bf7681"


def fetch_reviews(enc_biz_id: str, num_pages: int = 5):
    all_reviews = []
    for page in range(num_pages):
        offset = page * 10
        # The pagination cursor is a base64-encoded JSON object.
        cursor = base64.b64encode(
            json.dumps({"version": 1, "offset": offset}).encode()
        ).decode()
        payload = [{
            "operationName": "GetBusinessReviewFeed",
            "variables": {
                "encBizId": enc_biz_id,
                "reviewsPerPage": 10,
                "after": cursor,
                "sortBy": "DATE_DESC",
                "language": "en",
            },
            "extensions": {
                "operationType": "query",
                "documentId": DOC_ID,
            },
        }]
        r = requests.post(
            GQL_URL,
            json=payload,
            headers={
                **HEADERS,
                "content-type": "application/json",
                "x-apollo-operation-name": "GetBusinessReviewFeed",
                "apollographql-client-name": "yelp-main-frontend",
            },
            impersonate="chrome131",
        )
        if r.status_code != 200:
            print(f"Review fetch failed at offset {offset}: {r.status_code}")
            break
        data = r.json()
        # Navigate the response structure to extract reviews
        try:
            reviews = data[0]["data"]["business"]["reviews"]["edges"]
            for edge in reviews:
                node = edge.get("node") or {}
                all_reviews.append({
                    # "or {}" guards against explicit nulls in the response.
                    "reviewer": (node.get("author") or {}).get("displayName"),
                    "rating": node.get("rating"),
                    "date": node.get("localizedDate"),
                    "text": (node.get("text") or {}).get("full"),
                })
        except (KeyError, IndexError, TypeError, AttributeError):
            break
        # Random pause between pages so the timing is not predictable.
        time.sleep(random.uniform(3, 7))
    return all_reviews

Each page returns 10 reviews. Increment the offset in the base64 cursor to paginate. The sortBy parameter accepts DATE_DESC (newest first), RATING_ASC, RATING_DESC, and others.
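The cursor arithmetic is easy to check in isolation. This sketch mirrors the cursor shape used in the tutorial code (`{"version": 1, "offset": N}`); the helper names are hypothetical, and whether Yelp accepts exactly this shape is an assumption carried over from the code above.

```python
import base64
import json


def make_cursor(offset: int) -> str:
    """Encode a review offset in the cursor shape used in the tutorial code."""
    raw = json.dumps({"version": 1, "offset": offset}).encode()
    return base64.b64encode(raw).decode()


def read_cursor(cursor: str) -> dict:
    """Decode a cursor back into its JSON payload for debugging."""
    return json.loads(base64.b64decode(cursor))


def page_cursors(total_reviews: int, per_page: int = 10) -> list[str]:
    """Build one cursor per page needed to cover total_reviews."""
    return [make_cursor(start) for start in range(0, total_reviews, per_page)]


cursors = page_cursors(25)
print(len(cursors), [read_cursor(c)["offset"] for c in cursors])
```

Feeding the `review_count` from the business page into `page_cursors` gives the number of requests needed up front, which helps when budgeting proxy traffic.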

Step 5: Export Your Scraped Yelp Data

import json

import pandas as pd

# Assumes businesses (from Steps 2-3) and all_reviews (from Step 4)
# are lists of dicts collected earlier.
df_businesses = pd.DataFrame(businesses)
df_businesses.to_csv("yelp_businesses.csv", index=False)

df_reviews = pd.DataFrame(all_reviews)
df_reviews.to_csv("yelp_reviews.csv", index=False)

# Or save as JSON for flexibility
with open("yelp_data.json", "w", encoding="utf-8") as f:
    json.dump({"businesses": businesses, "reviews": all_reviews}, f, indent=2)

For readers on the no-code path, Thunderbit exports the same data straight to Excel, Google Sheets, Airtable, or Notion — no pandas or file-writing code needed.
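One wrinkle with CSV export: schema.org fields such as `address` come back as nested dicts, which end up as stringified blobs in a single column. A minimal sketch of flattening records first, using only the standard library (the record below is invented, and the dotted column naming is just one possible convention):

```python
import csv
import io


def flatten(record: dict, prefix: str = "") -> dict:
    """Flatten nested dicts into dotted column names; join lists with '; '."""
    flat = {}
    for key, value in record.items():
        name = f"{prefix}{key}"
        if isinstance(value, dict):
            flat.update(flatten(value, prefix=f"{name}."))
        elif isinstance(value, list):
            flat[name] = "; ".join(str(item) for item in value)
        else:
            flat[name] = value
    return flat


def to_csv(records: list[dict]) -> str:
    """Serialize flattened records to CSV text, using the union of all columns."""
    rows = [flatten(r) for r in records]
    columns = list(dict.fromkeys(col for row in rows for col in row))
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=columns, restval="")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Invented record in the general shape a business scrape might return.
sample = [{
    "name": "Taco Spot",
    "address": {"streetAddress": "1 Main St", "addressLocality": "Austin"},
    "hours": ["Mo 11:00-21:00", "Tu 11:00-21:00"],
}]
print(to_csv(sample))
```

The same flattened dicts can be passed straight to `pd.DataFrame` if you prefer to stay with pandas for the export itself.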

The Anti-Blocking Playbook: How to Scrape Yelp Without Getting Blocked

This section is the whole reason the article exists. Yelp's anti-bot measures have gotten significantly tougher since late 2024: TLS/JA3 fingerprinting, rotating obfuscated CSS class names, and aggressive IP reputation scoring are all in play. Most existing guides are outdated because they were written before this crackdown.

The strategy is layered. Each layer reduces your block rate; together, they make sustained scraping viable.

Layer 1: Realistic Request Headers

Default Python requests headers send User-Agent: python-requests/2.x — blocked instantly. But even a realistic User-Agent isn't enough. Yelp checks the full header set for consistency.

# UA and Client Hints versions match the impersonate="chrome131" target.
# ":authority" is an HTTP/2 pseudo-header set by the client, so it is not
# listed here as a regular header.
FULL_HEADERS = {
    "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
                  "AppleWebKit/537.36 (KHTML, like Gecko) "
                  "Chrome/131.0.0.0 Safari/537.36",
    "accept": "text/html,application/xhtml+xml,application/xml;q=0.9,"
              "image/avif,image/webp,image/apng,*/*;q=0.8",
    "accept-language": "en-US,en;q=0.9",
    "accept-encoding": "gzip, deflate, br",
    "sec-ch-ua": '"Google Chrome";v="131", "Chromium";v="131", "Not_A Brand";v="24"',
    "sec-ch-ua-mobile": "?0",
    "sec-ch-ua-platform": '"Windows"',
    "sec-fetch-dest": "document",
    "sec-fetch-mode": "navigate",
    "sec-fetch-site": "same-origin",
    "sec-fetch-user": "?1",
    "upgrade-insecure-requests": "1",
    "referer": "https://www.yelp.com/",
    "cookie": "intl_splash=false",
}

Three mistakes that get you flagged:

UA claims Chrome but sec-ch-ua is missing or contradicts the UA version
sec-ch-ua-platform says "Windows" but the UA string says macOS
Same exact UA across thousands of requests from one IP — rotate a pool of 10–20 recent Chrome/Firefox/Safari strings
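The first two mistakes can be caught mechanically before a request ever leaves your machine. A rough sketch of such a check follows; `header_mismatches` is a hypothetical helper, it only understands Chrome-style headers, and it reflects the consistency rules listed above rather than any documented Yelp logic.

```python
import re


def header_mismatches(headers: dict) -> list[str]:
    """Return human-readable problems found in a Chrome-style header set."""
    problems = []
    ua = headers.get("user-agent", "")
    hints = headers.get("sec-ch-ua", "")
    platform = headers.get("sec-ch-ua-platform", "").strip('"')

    ua_version = re.search(r"Chrome/(\d+)", ua)
    if ua_version:
        if not hints:
            problems.append("UA claims Chrome but sec-ch-ua is missing")
        elif f'v="{ua_version.group(1)}"' not in hints:
            problems.append("sec-ch-ua version contradicts the UA version")

    # Map the platform hint to the token expected inside the UA string.
    expected_token = {"Windows": "Windows NT", "macOS": "Macintosh", "Linux": "Linux"}
    token = expected_token.get(platform)
    if token and token not in ua:
        problems.append(f"sec-ch-ua-platform says {platform} but the UA does not")
    return problems


bad = {
    "user-agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) "
                  "AppleWebKit/537.36 (KHTML, like Gecko) "
                  "Chrome/131.0.0.0 Safari/537.36",
    "sec-ch-ua": '"Chromium";v="124", "Google Chrome";v="124"',
    "sec-ch-ua-platform": '"Windows"',
}
print(header_mismatches(bad))
```

Running a check like this over every entry in a rotation pool at startup is cheaper than discovering a mismatch through a burned IP.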

Layer 2: Rate Limiting and Random Delays

Predictable timing patterns are a red flag. Add randomized sleep intervals and implement exponential backoff on error responses.

import random
import time


def polite_get(client_get, url, attempt=0):
    """GET with exponential backoff on block-like statuses, then a random pause."""
    r = client_get(url, headers=FULL_HEADERS, impersonate="chrome131")
    if r.status_code in (403, 429, 503):
        if attempt >= 4:
            raise RuntimeError(f"Blocked after {attempt + 1} attempts on {url}")
        # 2, 4, 8, 16 seconds plus up to 1 second of jitter.
        backoff = 2 ** (attempt + 1) + random.random()
        print(f"  Got {r.status_code}, backing off {backoff:.1f}s (attempt {attempt + 1})")
        time.sleep(backoff)
        return polite_get(client_get, url, attempt + 1)
    time.sleep(random.uniform(3, 7))
    return r

Parameter	Recommended Value
Random sleep between requests	`random.uniform(3, 7)` seconds
Backoff on 429/403/503	2 → 4 → 8 → 16s, max 5 attempts
Concurrent workers per IP	1 (serialize per IP; use proxies for parallelism)
Max sustained rate per residential IP	~1 req / 5s (~12 rpm)
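The per-IP ceiling in the table can be enforced with a small throttle keyed by proxy. This is a sketch under my own assumptions: the class name is hypothetical, the 5-second default simply mirrors the table, and the clock and sleep functions are injectable so the demo runs instantly.

```python
import time


class PerKeyThrottle:
    """Enforce a minimum gap between requests that share a key (e.g. one proxy IP)."""

    def __init__(self, min_interval: float = 5.0, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._last_seen = {}

    def wait(self, key: str) -> float:
        """Sleep just long enough to respect the interval; return seconds slept."""
        now = self._clock()
        last = self._last_seen.get(key)
        delay = 0.0 if last is None else max(0.0, self.min_interval - (now - last))
        if delay:
            self._sleep(delay)
        self._last_seen[key] = now + delay
        return delay


# Demo with a fake clock so the example runs without real waiting.
fake_now = [100.0]
slept = []
throttle = PerKeyThrottle(5.0, clock=lambda: fake_now[0], sleep=slept.append)
first = throttle.wait("proxy-1")    # no prior request on this key
fake_now[0] = 102.0                 # pretend 2 seconds have passed
second = throttle.wait("proxy-1")   # must wait the remaining 3 seconds
other = throttle.wait("proxy-2")    # different key, no wait
print(first, second, other)
```

In real use, construct it with the defaults and call `throttle.wait(proxy_id)` right before each request; adding the random jitter from the table on top keeps the timing from becoming too regular.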

Layer 3: User-Agent and Session Rotation

Rotate through a pool of real browser User-Agent strings. Persist sessions and cookies to mimic real browsing behavior — Yelp uses cookie-based detection, so creating a fresh session for every request is itself suspicious.

# Complete, well-formed UA strings; truncated ones are themselves a signal.
# Pair each entry with a matching impersonate target where one exists.
UA_POOL = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:125.0) Gecko/20100101 Firefox/125.0",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:125.0) Gecko/20100101 Firefox/125.0",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/17.4.1 Safari/605.1.15",
    # Add 5-10 more recent strings
]
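Rotation and persistence pull in opposite directions, so one way to reconcile them is to rotate per session rather than per request. The sketch below is an illustration with hypothetical names (`ua_for_session`, `SessionState`): it picks a User-Agent deterministically from a session ID and keeps cookies alongside it.

```python
import hashlib

# Short illustrative pool; in practice reuse the UA_POOL defined above.
POOL = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/131.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/131.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:125.0) "
    "Gecko/20100101 Firefox/125.0",
]


def ua_for_session(session_id: str, pool: list[str] = POOL) -> str:
    """Pick a User-Agent deterministically so one session never changes identity."""
    digest = hashlib.sha256(session_id.encode()).digest()
    return pool[int.from_bytes(digest[:4], "big") % len(pool)]


class SessionState:
    """Keep the User-Agent and cookies that belong to one logical visitor."""

    def __init__(self, session_id: str):
        self.user_agent = ua_for_session(session_id)
        self.cookies = {"intl_splash": "false"}

    def headers(self) -> dict:
        cookie = "; ".join(f"{k}={v}" for k, v in self.cookies.items())
        return {"user-agent": self.user_agent, "cookie": cookie}


a1, a2 = SessionState("austin-plumbers"), SessionState("austin-plumbers")
print(a1.user_agent == a2.user_agent)
```

With curl_cffi, a `requests.Session()` object can play the cookie-keeping role directly; the point of the sketch is only that identity changes should happen between sessions, not inside one.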

Layer 4: Proxy Rotation

At any real volume, you need residential proxies. Datacenter and free proxies do not work on Yelp — Yelp's IP-reputation layer preemptively 403s AWS, GCP, and DigitalOcean IP ranges.

Provider	Entry $/GB	Notes
IPRoyal	$1.75/GB	Cheapest; runs the most-cited Yelp tutorial
Decodo (ex-Smartproxy)	$3.20–$3.50	Best GB/$ ratio at volume
Bright Data	$4.00 (PAYG)	150M+ IP pool; dedicated Yelp Proxies page
Oxylabs	$6.00–$8.00	Premium; 10M+ IPs
Aluvia (mobile SIM)	$3.00	Real US carrier mobile IPs, positioned for Yelp

Rotating residential (new IP per request) works best for high-volume search crawls. Sticky sessions (hold one IP for 10 minutes) are better when persisting cookies across a business-page → reviews → pagination flow.

Layer 5: Detecting and Handling Blocks

Not every block looks the same. Yelp often serves a generic "page not available" shell rather than a CAPTCHA, which is why naive scrapers think they're getting data when they're actually getting empty responses.

BLOCK_MARKERS = (
    "captcha", "px-captcha", "page not available",
    "access denied", "unusual traffic",
)


def is_blocked(resp) -> bool:
    """Return True when a response looks like a hard or soft block."""
    if resp.status_code in (401, 403, 429, 503):
        return True
    body = resp.text.lower()
    if any(m in body for m in BLOCK_MARKERS):
        return True
    url = str(resp.url)
    # Search pages should embed react-root-props; if it is missing,
    # Yelp served a stripped block response.
    if "/search" in url and "react-root-props" not in body:
        return True
    # Business pages should carry the ld+json block parsed in Step 3.
    if "/biz/" in url and "application/ld+json" not in body:
        return True
    return False

Signal	Meaning
HTTP 403	Hard block — IP/header/TLS burned
HTTP 429	Rate limited — often recoverable with backoff
HTTP 503	Generic block or load shedding
Redirect to `/error` or "page not available" body	Soft block
Empty with only	Challenge page waiting for JS
`captcha` / `g-recaptcha` / `px-captcha` in body	Escalated — CAPTCHA required
Missing `react-root-props` on a listing page	Stripped block response

Layer 6: The Resilient Parsing Trick — Hidden JSON over CSS Selectors

Worth repeating: Yelp randomizes CSS class names at build time. A scraper pinned to h3.y-css-14xwok2 will break within weeks when Yelp redeploys with h3.y-css-hcq7b9.

The payloads that don't move:

<script type="application/ld+json"> — schema.org structured data (name, address, phone, rating, hours)
<script data-id="react-root-props"> — full search results data as JSON
https://www.yelp.com/gql/batch — GraphQL reviews endpoint with a stable documentId

If you're parsing CSS classes, you're building on sand. Parse the JSON instead.

Layer 7: The Stealth Browser Fallback

Escalate to a headless browser only when curl_cffi + residential proxies can't get through — typically when Yelp serves a JavaScript challenge page or CAPTCHA.

For 95% of business/search/review scraping, curl_cffi + hidden JSON + residential proxies is faster, cheaper, and more reliable than a browser. But when you do need a browser:

Tool	Status (2025)	Notes
rebrowser-playwright	Recommended starting point	Drop-in Playwright patched to fix CDP leaks
nodriver	Best-in-class for Chrome stealth	Successor to undetected-chromedriver; avoids WebDriver protocol entirely
patchright	Actively maintained Playwright fork	Passes modern detection tests
playwright-stealth	Mature	Patches `navigator.webdriver`, strips `HeadlessChrome` from UA

Skip vanilla Selenium for Yelp. It's too easily fingerprinted.

Yelp Fusion API vs. Python Scraping vs. Thunderbit: Full Comparison

Dimension	Yelp Fusion API	Python Scraping	Thunderbit
Full review text	❌ 3 excerpts × ~160 chars	✅ Unlimited (GraphQL)	✅ Built-in review template
Review metadata (votes, owner replies)	❌	✅	✅ Via AI-suggested fields
Photos	❌ (0 on Base)	✅ Unlimited	✅
Max results per search	240 (was 1,000 pre-2024)	Unlimited (paginated)	Unlimited
Daily rate limit	300–500 (new) / 5,000 (legacy)	Proxy budget only	Credit-based (3,000/mo on Pro)
Setup effort	~15 min	Hours to days	~2 minutes
Anti-bot handling	N/A	Your problem	Handled (Cloud mode)
Legal risk	Low (authorized)	Medium (ToS gray area)	Medium (same as scraping)
Cost (entry)	$29/mo	~$0.75–$4/GB proxies + dev time	Free tier
Cost (heavy use)	$643+/mo	$50–$500/mo proxies + dev time	$38–$49/mo
Data export	JSON	CSV/JSON (you write it)	Excel / Sheets / Airtable / Notion — free
Maintenance	Low	High (selectors rot, anti-bot escalates)	Low (AI re-adapts)

Legal and Ethical Tips for Scraping Yelp

I'm not a lawyer, and this isn't legal advice. But the legal landscape has shifted enough in the last two years that you should know the basics before investing time in a Yelp scraping project.

What Yelp's Terms of Service say: Yelp's Terms of Service explicitly prohibit using "any robot, spider... or other automated device" to "access, retrieve, copy, scrape, or index any portion of the Service." They also added language about "AI Technologies and/or other automated tools."

: "Yelp does not allow any scraping of the site."

What robots.txt says: Yelp's robots.txt has a wildcard User-agent: * / Disallow: / and specifically blocks GPTBot, ClaudeBot, PerplexityBot, CCBot, and Meta-ExternalAgent. Only Googlebot, Bingbot, and a few social-media crawlers are whitelisted.
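You can check how a given crawler name fares against rules like these with the standard library's `urllib.robotparser`. The rules below are a simplified stand-in written from the description above, not a copy of Yelp's actual file, and `my-research-bot` is a made-up agent name.

```python
from urllib.robotparser import RobotFileParser

# Simplified stand-in based on the description above, NOT Yelp's actual file.
SAMPLE_ROBOTS = """\
User-agent: Googlebot
Allow: /

User-agent: GPTBot
Disallow: /

User-agent: *
Disallow: /
"""

parser = RobotFileParser()
parser.parse(SAMPLE_ROBOTS.splitlines())

for agent in ("Googlebot", "GPTBot", "my-research-bot"):
    allowed = parser.can_fetch(agent, "https://www.yelp.com/biz/some-restaurant")
    print(f"{agent}: {'allowed' if allowed else 'disallowed'}")

# To check the live file instead (network access required):
# parser.set_url("https://www.yelp.com/robots.txt"); parser.read()
```

Any agent that is not explicitly listed falls through to the wildcard entry, which is why an unnamed scraper is disallowed everywhere under rules of this shape.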

The legal precedent that matters: In Meta v. Bright Data (N.D. Cal. Jan 2024), the court ruled that scraping publicly available, logged-out data did not violate Meta's Terms of Service. The key distinction: logged-out public data vs. logged-in data. Separately, hiQ v. LinkedIn established that scraping public data likely doesn't violate the CFAA, but hiQ still lost on state tort claims (trespass to chattels, misappropriation) and was hit with a $500,000 judgment.

Practical guidelines:

Scrape only publicly available, logged-out pages
Rate-limit your requests (the delays in this guide serve double duty as ethical rate limits)
Don't resell raw review text attributed to named users — respect reviewer privacy
Comply with local data protection laws (CCPA, GDPR)
Don't log in to scrape — that crosses the authorization line
Treat business info (name/address/phone/rating) as public factual data; treat review text as more sensitive

Consult a legal professional for your specific situation.

Wrapping Up

Three paths, one goal.

The Yelp Fusion API is the authorized, low-maintenance option — but it caps at 3 review excerpts and starts at $29/month. Python scraping gives you full control over every data point on Yelp, but it requires real investment: curl_cffi for TLS impersonation, residential proxies, randomized delays, hidden JSON parsing, and ongoing maintenance as Yelp's defenses evolve. Thunderbit gets you from "I need Yelp data" to "here's my spreadsheet" in about 30 seconds, with no code and no proxy configuration.

The anti-blocking essentials that actually work in 2025: realistic headers with full Client Hints, curl_cffi for TLS fingerprint impersonation, randomized delays with exponential backoff, residential proxy rotation, and — above all — parsing hidden JSON (application/ld+json and react-root-props) instead of fragile CSS selectors.

Not sure which path fits? Try Thunderbit first. If it covers your needs, you've saved yourself hours. If you need more control (full programmatic pipelines, custom fields, tight CRM integration), the Python guide above has you covered.


FAQs

Can I scrape Yelp for free with Python?

Yes, using free libraries like curl_cffi, parsel, and jmespath. But at any real volume (more than a few dozen pages), you'll need paid residential proxies, which run roughly $0.75–$4/GB per the comparison table above. Thunderbit also offers a free tier with 6 pages/month for quick, no-code extraction.

Does Yelp block scrapers?

Yes, aggressively. Yelp uses TLS/JA3 fingerprinting, rotating obfuscated CSS class names, and aggressive IP reputation scoring. Vanilla requests gets blocked on the first hit. The layered anti-blocking strategy in this guide (curl_cffi for TLS impersonation, realistic headers, random delays, and residential proxies) is what works in 2025.

Is the Yelp Fusion API better than scraping?

Depends on your needs. The API is authorized and low-risk, but it only returns 3 review excerpts of ~160 characters each, caps search results at 240, and starts at $29/month. If you need full review text, review metadata, or more than a few hundred records per day, scraping is the only option.

How do I scrape Yelp reviews with Python?

Use curl_cffi with impersonate="chrome131" to fetch the business page, pull the encoded business ID from <meta name="yelp-biz-id">, then POST to https://www.yelp.com/gql/batch with the GetBusinessReviewFeed operation and paginate via a base64-encoded after cursor. The step-by-step code is in the tutorial section above. The is also a solid reference implementation.

Can I scrape Yelp without coding?

Yes. Thunderbit ships pre-built Yelp business and review templates. Open a Yelp page, click AI Suggest Fields, click Scrape. Exports to Google Sheets, Excel, Airtable, and Notion are free on every tier, including the free plan.
