Last week, I tried to pull hotel ratings and review counts for about 200 properties across three European cities from TripAdvisor. My first script — a basic requests.get() with default headers — returned a beautiful 403 Forbidden on every single request. Not a single byte of useful data.
TripAdvisor is one of the richest public data sources in the travel industry: a vast archive of traveler reviews, 8+ million business listings, and roughly 460 million unique monthly visitors. It influences a substantial share of annual travel spending. But getting that data programmatically? That's where things get tricky. TripAdvisor uses DataDome bot detection, Cloudflare WAF, TLS fingerprinting, and JavaScript challenges — a layered defense stack that blocks most naive scraping attempts before they even start. This guide is the single resource I wish I'd had: a head-to-head comparison of three Python scraping approaches (plus a no-code option), complete code for each, a structured anti-bot troubleshooting section, and reusable patterns that work across hotels, restaurants, and attractions. Whether you're a Python beginner or an experienced developer, this should save you a lot of wasted 403s.
Don't Want to Write Code? Scrape TripAdvisor the Easy Way
I want to be upfront about something. A lot of people searching "scrape TripAdvisor with Python" aren't actually married to the idea of writing code. They just want the data — hotel names, ratings, review counts, prices — in a spreadsheet, fast. If that sounds like you, there's a much shorter path.
Thunderbit is an AI-powered Chrome extension we built that can read any TripAdvisor page and automatically suggest the right columns to extract. The core workflow is genuinely two clicks:
- Open a TripAdvisor listing page (e.g., "Hotels in Paris" search results).
- Click "AI Suggest Fields" in the Thunderbit sidebar. The AI scans the page and proposes columns like Hotel Name, Rating, Review Count, Price, and Location.
- Click "Scrape." Thunderbit extracts data from every listing on the page — and handles pagination automatically if you need more results.
- Export to Excel, Google Sheets, Airtable, or Notion. Exports are free on every plan.
Thunderbit works across hotels, restaurants, and attractions without any configuration changes — the AI adapts to whatever's on the page. For paginated results, it auto-detects "Next" buttons and infinite scroll. And because it runs inside your real Chrome browser, it inherits your session cookies and browser fingerprint, which gives it a natural advantage against bot detection.
You can try it on the free plan — the free tier gives you 6 pages/month, enough to test the workflow.
If you need programmatic control, custom parsing logic, or plan to scrape 10,000+ pages, Python is the way to go. Keep reading.
Why Scrape TripAdvisor with Python?
TripAdvisor data has direct, measurable business impact. One study found that a 1-point increase in a hotel's 100-point Global Review Index leads to a 0.89% increase in average daily rate and a 1.42% increase in Revenue Per Available Room. A separate analysis showed that an exogenous 1-star increase in TripAdvisor rating translates to $55,000–$75,000 in additional yearly revenue for an average hotel. Reviews aren't just vanity metrics — they're revenue drivers.
Here's how different teams use TripAdvisor data:
| Use Case | Who Benefits | Data Needed |
|---|---|---|
| Hotel competitor analysis | Hotel chains, revenue managers | Ratings, prices, review volume, amenities |
| Restaurant market research | Restaurant groups, food brands | Cuisine types, price ranges, review sentiment |
| Attraction trend tracking | Tour operators, tourism boards | Popularity rankings, seasonal patterns |
| Sentiment analysis | Researchers, data analysts | Full review text, star ratings, dates |
| Lead generation | Sales teams, travel agencies | Business names, contact info, locations |
Why Python specifically? Three reasons. First, the ecosystem: BeautifulSoup, Selenium, Playwright, Scrapy, httpx, pandas — Python has more mature scraping and data analysis libraries than any other language. Second, popularity: most scraping developers and tutorials use Python, which means more community support, more StackOverflow answers, and more up-to-date guides. Third, the pipeline advantage: you can scrape with BeautifulSoup, clean with pandas, run sentiment analysis with Hugging Face Transformers, and build dashboards — all in one language. No context switching.
Three Ways to Scrape TripAdvisor with Python (Compared)
Every competing guide picks one approach and runs with it. That's not helpful when you're trying to decide before writing code. Here's the comparison table I wish someone had given me:
| Approach | Speed | JS Support | Anti-Bot Resistance | Complexity | Best For |
|---|---|---|---|---|---|
| requests + BeautifulSoup | ⚡ Fast (~120–200 pages/min raw) | ❌ None | ⚠️ Low | Easy | Static listing pages, small-scale projects |
| Selenium / Headless Browser | 🐢 Slow (~8–20 pages/min) | ✅ Full | ⚠️ Medium | Medium | Dynamic content, "Read more" clicks, cookie banners |
| Hidden JSON / GraphQL API | ⚡⚡ Fastest (~200–600 pages/min raw) | N/A | ✅ Higher | Hard | Large-scale review/hotel extraction |
| No-code (Thunderbit) | ⚡ Fast | ✅ Built-in | ✅ Built-in | Easiest | Non-devs, quick one-off exports |
A few important caveats. Those raw speeds are theoretical — TripAdvisor's rate limits (~10–15 requests per minute per IP) constrain actual throughput to roughly 10 pages/minute per IP regardless of approach. The hidden JSON method gets you the most data per request, which means fewer total requests and less exposure to rate limiting. Selenium is 5x slower than request-based approaches in real-world benchmarks, but it's the only option when you need to click buttons or render JavaScript.
The rest of this guide walks through all three Python methods with complete code. Pick the one that fits your situation, or combine them (I often use requests+BS4 for listing pages and hidden JSON for detail pages).
Setting Up Your Python Environment
Before diving in, let's get the environment ready. You'll need Python 3.10+ (I recommend 3.12 or 3.13 — all major packages support them with no known issues).
Install everything at once:
```bash
pip install requests beautifulsoup4 selenium httpx parsel pandas curl-cffi
```
Package notes:
- `requests` (2.33.1) — HTTP requests; requires Python 3.10+
- `beautifulsoup4` (4.14.3) — HTML parsing
- `selenium` (4.43.0) — Browser automation; requires Python 3.10+
- `httpx` (0.28.1) — Async HTTP client
- `parsel` (1.11.0) — CSS/XPath selectors (lighter than BS4)
- `pandas` (3.0.2) — Data export; requires Python 3.11+
- `curl_cffi` (0.15.0) — TLS fingerprint impersonation (critical for bypassing Cloudflare)
ChromeDriver: If you're using Selenium, good news — since Selenium 4.6+, Selenium Manager automatically downloads and caches the correct ChromeDriver binary. No manual installation needed. It resolves version matching dynamically, so you don't have to worry about Chrome version mismatches.
Virtual environment (recommended):
```bash
python -m venv tripadvisor-scraper
source tripadvisor-scraper/bin/activate   # macOS/Linux
tripadvisor-scraper\Scripts\activate      # Windows
```
Approach 1: Scrape TripAdvisor with Requests and BeautifulSoup
This is the simplest approach. It works well for scraping listing pages (hotel search results, restaurant lists) where the data you need is present in the static HTML. No browser, no JavaScript rendering, minimal resource usage.
Understanding TripAdvisor URL Patterns
TripAdvisor URLs follow predictable patterns by category:
- Hotels: `https://www.tripadvisor.com/Hotels-g{locationId}-{Location_Name}-Hotels.html`
- Restaurants: `https://www.tripadvisor.com/Restaurants-g{locationId}-{Location_Name}.html`
- Attractions: `https://www.tripadvisor.com/Attractions-g{locationId}-Activities-{Location_Name}.html`
Pagination uses the `oa` (offset anchor) parameter, inserted into the URL. Each page shows 30 results:

- Page 1: base URL (no `oa` parameter)
- Page 2: `Hotels-g187768-oa30-Italy-Hotels.html`
- Page 3: `Hotels-g187768-oa60-Italy-Hotels.html`
For review pages, the offset parameter is `or`, with increments of 10:

- Page 1: `Reviews-or0-Hotel_Name.html`
- Page 2: `Reviews-or10-Hotel_Name.html`
To get reviews in all languages, append ?filterLang=ALL to the URL.
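These patterns are mechanical enough to wrap in a small builder. Here's a sketch, assuming the `oa` offset scheme above applies to all three categories (the function name and signature are my own):

```python
def listing_url(category: str, geo_id: str, name: str, page: int = 0) -> str:
    """Build a TripAdvisor listing URL for a 0-indexed page (30 results per page).
    Assumes the oa-offset pattern described above; verify against live URLs."""
    parts = {
        "hotels": ("Hotels", f"{name}-Hotels"),
        "restaurants": ("Restaurants", name),
        "attractions": ("Attractions", f"Activities-{name}"),
    }
    prefix, suffix = parts[category]
    offset = page * 30
    oa = f"oa{offset}-" if offset else ""  # page 1 has no oa segment
    return f"https://www.tripadvisor.com/{prefix}-g{geo_id}-{oa}{suffix}.html"
```

Generating a crawl list is then just `[listing_url("hotels", "187768", "Italy", p) for p in range(5)]`.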
Sending Requests with Realistic Headers
TripAdvisor checks headers aggressively. A request with default Python headers gets blocked instantly. You need to mimic a real Chrome browser:
```python
import requests
import time
import random

session = requests.Session()
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/135.0.0.0 Safari/537.36",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,*/*;q=0.8",
    "Accept-Language": "en-US,en;q=0.9",
    "Accept-Encoding": "gzip, deflate, br",
    "Referer": "https://www.tripadvisor.com/",
    "Sec-Fetch-Dest": "document",
    "Sec-Fetch-Mode": "navigate",
    "Sec-Fetch-Site": "none",
    "Sec-CH-UA": '"Google Chrome";v="135", "Not-A.Brand";v="8", "Chromium";v="135"',
    "Sec-CH-UA-Mobile": "?0",
    "Sec-CH-UA-Platform": '"Windows"',
}
session.headers.update(headers)

url = "https://www.tripadvisor.com/Hotels-g187147-Paris_Ile_de_France-Hotels.html"
response = session.get(url)
print(f"Status: {response.status_code}")
print(f"Content length: {len(response.text)} characters")
```
Key detail: TripAdvisor validates that your User-Agent and Sec-CH-UA Client Hints headers are consistent. If you claim to be Chrome 135 in the User-Agent but your Sec-CH-UA says Chrome 120, you'll get flagged. Always rotate entire header sets together, not individual headers.
Parsing Listings with BeautifulSoup
Once you have a successful response, extract the data using BeautifulSoup. TripAdvisor uses `data-automation` and `data-test-attribute` attributes that are more stable than CSS class names (which change frequently):
```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(response.text, "html.parser")

# Find all hotel listing cards
cards = soup.select('div[data-test-attribute="location-results-card"]')

hotels = []
for card in cards:
    # Hotel name
    title_el = card.select_one('div[data-automation="hotel-card-title"]')
    name = title_el.get_text(strip=True) if title_el else None
    # Link to detail page
    link_el = card.select_one('div[data-automation="hotel-card-title"] a')
    link = "https://www.tripadvisor.com" + link_el["href"] if link_el else None
    # Rating
    rating_el = card.select_one('[data-automation="bubbleRatingValue"]')
    rating = rating_el.get_text(strip=True) if rating_el else None
    # Review count
    review_el = card.select_one('[data-automation="bubbleReviewCount"]')
    review_count = review_el.get_text(strip=True).replace(",", "").split()[0] if review_el else None
    hotels.append({
        "name": name,
        "rating": rating,
        "review_count": review_count,
        "url": link,
    })

print(f"Found {len(hotels)} hotels on this page")
for h in hotels[:3]:
    print(h)
```
A note on selectors: TripAdvisor uses obfuscated CSS class names (like `FGwzt`, `yyzcQ`) that change with every site update. The `data-automation` and `data-test-target` attributes are far more stable. Always prefer data attributes over class names.
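One way to bake in that preference is a tiny fallback helper (my own sketch) that tries stable data-attribute selectors first and resorts to obfuscated class names only when it must:

```python
def select_text(card, selectors):
    """Return the stripped text of the first selector that matches, else None.
    List stable data-attribute selectors first, fragile class names last."""
    for sel in selectors:
        el = card.select_one(sel)
        if el:
            return el.get_text(strip=True)
    return None

# Usage with a BeautifulSoup card (second selector is a hypothetical class fallback):
# rating = select_text(card, ['[data-automation="bubbleRatingValue"]', "span.FGwzt"])
```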
Handling Pagination
To scrape multiple pages, loop through the offset parameter with a polite delay between requests:
```python
import pandas as pd

all_hotels = []
base_url = "https://www.tripadvisor.com/Hotels-g187147-oa{offset}-Paris_Ile_de_France-Hotels.html"

for page in range(5):  # First 5 pages
    offset = page * 30
    url = base_url.format(offset=offset) if page > 0 else "https://www.tripadvisor.com/Hotels-g187147-Paris_Ile_de_France-Hotels.html"
    response = session.get(url)
    if response.status_code != 200:
        print(f"Page {page + 1}: Got status {response.status_code}, stopping.")
        break
    soup = BeautifulSoup(response.text, "html.parser")
    cards = soup.select('div[data-test-attribute="location-results-card"]')
    for card in cards:
        title_el = card.select_one('div[data-automation="hotel-card-title"]')
        name = title_el.get_text(strip=True) if title_el else None
        rating_el = card.select_one('[data-automation="bubbleRatingValue"]')
        rating = rating_el.get_text(strip=True) if rating_el else None
        review_el = card.select_one('[data-automation="bubbleReviewCount"]')
        review_count = review_el.get_text(strip=True).replace(",", "").split()[0] if review_el else None
        all_hotels.append({"name": name, "rating": rating, "review_count": review_count})
    print(f"Page {page + 1}: {len(cards)} hotels found")
    time.sleep(random.uniform(3, 7))  # Random delay to avoid rate limiting

df = pd.DataFrame(all_hotels)
print(f"\nTotal hotels scraped: {len(df)}")
```
The `time.sleep(random.uniform(3, 7))` is important. TripAdvisor's rate-limit threshold is roughly 10–15 requests per minute per IP. Going faster than that triggers CAPTCHAs or 429 errors.
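If you'd rather enforce that ceiling explicitly than scatter sleeps through every loop, a minimal limiter helps. This is my own sketch; the 10/minute default sits under the threshold described above:

```python
import time

class RateLimiter:
    """Cap outgoing requests at `per_minute` by spacing consecutive calls."""

    def __init__(self, per_minute: float = 10):
        self.min_interval = 60.0 / per_minute
        self.last = 0.0

    def wait(self) -> None:
        """Block until at least min_interval has passed since the last call."""
        now = time.monotonic()
        remaining = self.min_interval - (now - self.last)
        if remaining > 0:
            time.sleep(remaining)
        self.last = time.monotonic()
```

Call `limiter.wait()` immediately before each `session.get(...)`.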
Limitations of This Approach
Where does this fall apart? The requests+BS4 approach fails when:
- TripAdvisor serves JavaScript-rendered content (some search result pages require JS)
- Review text is truncated behind "Read more" buttons
- Anti-bot measures escalate to JavaScript challenges or CAPTCHAs
- You need data that only appears after client-side rendering (prices, availability)
For these scenarios, you need either Selenium (Approach 2) or the hidden JSON method (Approach 3).
Approach 2: Scrape TripAdvisor with Selenium (Headless Browser)
Selenium launches a real browser, which means it can render JavaScript, click buttons, handle cookie consent banners, and interact with dynamic content. The cost: it's roughly 5x slower than request-based approaches and uses 300–500MB of RAM per browser instance.
Configuring Selenium with Anti-Detection Settings
Out of the box, Selenium is trivially detectable. TripAdvisor's fingerprinting catches it immediately. You need to disable automation flags:
```python
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

options = Options()
options.add_argument("--headless=new")  # Use new headless mode (Chrome 112+)
options.add_argument("--disable-blink-features=AutomationControlled")
options.add_argument("--window-size=1920,1080")
options.add_argument("user-agent=Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/135.0.0.0 Safari/537.36")
options.add_experimental_option("excludeSwitches", ["enable-automation"])
options.add_experimental_option("useAutomationExtension", False)

driver = webdriver.Chrome(options=options)

# Remove webdriver property from navigator
driver.execute_cdp_cmd("Page.addScriptToEvaluateOnNewDocument", {
    "source": "Object.defineProperty(navigator, 'webdriver', {get: () => undefined})"
})
```
Is this enough for TripAdvisor? For small-scale scraping (under 50 pages), this setup with residential proxies usually works. For larger volumes, you may need undetected-chromedriver or nodriver — TripAdvisor's DataDome protection analyzes over 1,000 signals per request, including TLS fingerprints that vanilla Selenium can't spoof.
Scraping Hotel Search Results with Selenium
```python
import time
from selenium.common.exceptions import NoSuchElementException

url = "https://www.tripadvisor.com/Hotels-g187147-Paris_Ile_de_France-Hotels.html"
driver.get(url)

# Wait for hotel cards to load
wait = WebDriverWait(driver, 15)
wait.until(EC.presence_of_element_located((By.CSS_SELECTOR, 'div[data-test-attribute="location-results-card"]')))

# Handle cookie consent popup (if it appears)
try:
    cookie_btn = driver.find_element(By.ID, "onetrust-accept-btn-handler")
    cookie_btn.click()
    time.sleep(1)
except NoSuchElementException:
    pass  # No cookie popup

# Extract hotel data
cards = driver.find_elements(By.CSS_SELECTOR, 'div[data-test-attribute="location-results-card"]')
hotels = []
for card in cards:
    try:
        name = card.find_element(By.CSS_SELECTOR, 'div[data-automation="hotel-card-title"]').text
    except NoSuchElementException:
        name = None
    try:
        rating = card.find_element(By.CSS_SELECTOR, '[data-automation="bubbleRatingValue"]').text
    except NoSuchElementException:
        rating = None
    try:
        reviews = card.find_element(By.CSS_SELECTOR, '[data-automation="bubbleReviewCount"]').text
    except NoSuchElementException:
        reviews = None
    hotels.append({"name": name, "rating": rating, "review_count": reviews})

print(f"Scraped {len(hotels)} hotels")
for h in hotels[:3]:
    print(h)
```
This took about 8 seconds for a single page on my machine — compared to under 1 second with requests+BS4. That 8x difference adds up fast when you're scraping hundreds of pages.
Expanding "Read More" and Scraping Full Reviews
Review pages truncate long reviews behind a "Read more" button. Selenium can click it:
```python
from selenium.common.exceptions import NoSuchElementException

review_url = "https://www.tripadvisor.com/Hotel_Review-g187147-d188726-Reviews-Le_Marais_Hotel-Paris_Ile_de_France.html"
driver.get(review_url)
wait.until(EC.presence_of_element_located((By.CSS_SELECTOR, 'div[data-reviewid]')))
time.sleep(2)

# Click all "Read more" buttons
read_more_buttons = driver.find_elements(By.XPATH, '//button//*[contains(text(), "Read more")]/..')
for btn in read_more_buttons:
    try:
        driver.execute_script("arguments[0].click();", btn)
        time.sleep(0.3)
    except Exception:
        pass

# Extract reviews
review_elements = driver.find_elements(By.CSS_SELECTOR, 'div[data-reviewid]')
reviews = []
for rev in review_elements:
    try:
        title = rev.find_element(By.CSS_SELECTOR, 'div[data-test-target="review-title"]').text
    except NoSuchElementException:
        title = None
    try:
        body = rev.find_element(By.CSS_SELECTOR, 'q.IRsGHoPm span').text
    except NoSuchElementException:
        try:
            body = rev.find_element(By.CSS_SELECTOR, 'p.partial_entry').text
        except NoSuchElementException:
            body = None
    try:
        rating_class = rev.find_element(By.CSS_SELECTOR, 'div[data-test-target="review-rating"] span').get_attribute("class")
        # Rating encoded in class like "ui_bubble_rating bubble_50" = 5.0
        rating_num = [c for c in rating_class.split() if "bubble_" in c][0].replace("bubble_", "")
        rating = int(rating_num) / 10
    except (NoSuchElementException, IndexError):
        rating = None
    reviews.append({"title": title, "body": body, "rating": rating})

print(f"Scraped {len(reviews)} reviews")
```
Adding Proxy Rotation to Selenium
For sustained scraping, you'll need proxy rotation. selenium-wire has been deprecated since January 2024, so use Chrome's built-in proxy support instead:
```python
# With authentication-free proxy
proxy = "http://your-proxy-address:port"
options.add_argument(f"--proxy-server={proxy}")

# For proxies with authentication, use a Chrome extension or Selenium 4's BiDi protocol
```
For rotating proxies programmatically, create a new driver instance with a different proxy for each batch of requests. It's not elegant, but it's reliable.
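A sketch of that batch-per-proxy pattern (the helper and its parameters are my own; plug your driver construction in where the comment indicates):

```python
from itertools import cycle

def proxy_batches(urls, proxies, batch_size=10):
    """Yield (proxy, url_batch) pairs, cycling round-robin through the proxy pool.
    For each pair, build a fresh driver configured with --proxy-server=<proxy>."""
    pool = cycle(proxies)
    for i in range(0, len(urls), batch_size):
        yield next(pool), urls[i:i + batch_size]
```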
Approach 3: The Hidden JSON Method (Skip HTML Parsing Entirely)
Most guides skip this approach entirely, which is a shame — it's the fastest and cleanest of the three. TripAdvisor embeds structured data as JSON directly in its HTML pages — inside <script> tags as JavaScript variables like pageManifest and urqlCache. Extracting this JSON gives you cleaner data (ratings as numbers, dates in ISO format) with fewer requests and no need for JavaScript rendering.
Finding the Embedded JSON in Page Source
The key insight: you can use a simple requests.get() to fetch the page, then extract the JSON from the raw HTML without ever rendering JavaScript.
```python
import requests
import re
import json

headers = {
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/135.0.0.0 Safari/537.36",
    "Accept-Language": "en-US,en;q=0.9",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Referer": "https://www.tripadvisor.com/",
    "Sec-CH-UA": '"Google Chrome";v="135", "Not-A.Brand";v="8", "Chromium";v="135"',
    "Sec-CH-UA-Mobile": "?0",
    "Sec-CH-UA-Platform": '"macOS"',
}

url = "https://www.tripadvisor.com/Hotel_Review-g188590-d194317-Reviews-NH_City_Centre_Amsterdam.html"
response = requests.get(url, headers=headers)

# Extract the pageManifest JSON blob
match = re.search(r"pageManifest:({.+?})};", response.text)
if match:
    page_data = json.loads(match.group(1))
    print("Found pageManifest data")
    print(f"Keys: {list(page_data.keys())[:10]}")
```
How to find the variable name yourself: Open any TripAdvisor hotel page in Chrome, right-click → View Page Source, then Ctrl+F for `pageManifest`, `urqlCache`, or `aggregateRating`. The data is there, waiting to be parsed.
Parsing the JSON and Extracting Structured Data
TripAdvisor also embeds application/ld+json schema.org data that's even easier to extract:
```python
from parsel import Selector

sel = Selector(text=response.text)

# Extract JSON-LD structured data
json_ld_scripts = sel.xpath("//script[@type='application/ld+json']/text()").getall()
for script in json_ld_scripts:
    data = json.loads(script)
    if isinstance(data, dict) and data.get("@type") in ["Hotel", "Restaurant", "TouristAttraction"]:
        print(f"Name: {data.get('name')}")
        print(f"Rating: {data.get('aggregateRating', {}).get('ratingValue')}")
        print(f"Review Count: {data.get('aggregateRating', {}).get('reviewCount')}")
        print(f"Price Range: {data.get('priceRange')}")
        print(f"Address: {data.get('address', {}).get('streetAddress')}")
        print(f"Coordinates: {data.get('geo', {}).get('latitude')}, {data.get('geo', {}).get('longitude')}")
        break
```
The JSON-LD data is embedded in static HTML and does NOT require JavaScript rendering. It gives you the property name, aggregate rating, review count, address, coordinates, price range, and photo URLs — all without parsing a single HTML tag.
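Because it's static, you can even pull the JSON-LD with nothing but the standard library, which is handy for quick checks without parsel installed. A regex-based sketch of my own, assuming the common double-quoted `type` attribute form:

```python
import json
import re

def extract_json_ld(html, wanted=("Hotel", "Restaurant", "TouristAttraction")):
    """Return the first schema.org JSON-LD object whose @type is in `wanted`."""
    pattern = r'<script type="application/ld\+json">(.*?)</script>'
    for m in re.finditer(pattern, html, re.DOTALL):
        try:
            data = json.loads(m.group(1))
        except json.JSONDecodeError:
            continue  # malformed or unrelated script block
        if isinstance(data, dict) and data.get("@type") in wanted:
            return data
    return None
```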
For richer data (individual reviews, rating breakdowns, amenity lists), you need the urqlCache object:
```python
# Extract urqlCache for detailed review data
cache_match = re.search(r'"urqlCache"\s*:\s*({.+?})\s*,\s*"redux"', response.text)
if cache_match:
    cache_data = json.loads(cache_match.group(1))
    # Navigate the cache to find review data
    for key, value in cache_data.items():
        if "reviews" in str(value).lower()[:100]:
            reviews_data = json.loads(value.get("data", "{}")) if isinstance(value, dict) else None
            if reviews_data:
                print(f"Found review cache entry: {key[:50]}...")
                break
```
The exact JSON paths change occasionally when TripAdvisor updates its frontend, but the general structure — JSON-LD for summary data, urqlCache for detailed data — has been stable for years.
Reverse-Engineering TripAdvisor's GraphQL API (Advanced)
For large-scale extraction, TripAdvisor's GraphQL endpoints return structured data directly. This is the fastest method but requires the most maintenance.
```python
import httpx
import json
import random
import string

def generate_request_id():
    """Generate the X-Requested-By header value"""
    random_chars = ''.join(random.choices(string.ascii_letters + string.digits, k=180))
    return f"TNI1625!{random_chars}"

# Search for hotels in Paris
search_payload = [{
    "variables": {
        "request": {
            "query": "hotels in Paris",
            "limit": 10,
            "scope": "WORLDWIDE",
            "locale": "en-US",
            "scopeGeoId": 1,
            "searchCenter": None,
            "types": ["LOCATION", "QUERY_SUGGESTION", "RESCUE_RESULT"],
            "locationTypes": ["GEO", "AIRPORT", "ACCOMMODATION", "ATTRACTION", "EATERY", "NEIGHBORHOOD"]
        }
    },
    "extensions": {
        "preRegisteredQueryId": "84b17ed122fbdbd4"
    }
}]

graphql_headers = {
    "Content-Type": "application/json",
    "Accept": "*/*",
    "Accept-Language": "en-US,en;q=0.9",
    "Origin": "https://www.tripadvisor.com",
    "Referer": "https://www.tripadvisor.com/Hotels",
    "X-Requested-By": generate_request_id(),
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/135.0.0.0 Safari/537.36",
}

with httpx.Client() as client:
    response = client.post(
        "https://www.tripadvisor.com/data/graphql/ids",
        json=search_payload,
        headers=graphql_headers
    )
    if response.status_code == 200:
        results = response.json()
        print(json.dumps(results, indent=2)[:1000])
    else:
        print(f"GraphQL request failed: {response.status_code}")
```
For fetching reviews via GraphQL:
```python
review_payload = [{
    "variables": {
        "locationId": 194317,  # NH City Centre Amsterdam
        "offset": 0,
        "limit": 20,
        "filters": {},
        "sortType": None,
        "sortBy": "date",
        "language": "en",
        "doMachineTranslation": False,
        "photosPerReviewLimit": 3
    },
    "extensions": {
        "preRegisteredQueryId": "ef1a9f94012220d3"
    }
}]

with httpx.Client() as client:
    response = client.post(
        "https://www.tripadvisor.com/data/graphql/ids",
        json=review_payload,
        headers=graphql_headers
    )
    if response.status_code == 200:
        data = response.json()
        reviews = data[0]["data"]["locations"][0]["reviewListPage"]["reviews"]
        total = data[0]["data"]["locations"][0]["reviewListPage"]["totalCount"]
        print(f"Total reviews: {total}")
        for r in reviews[:3]:
            print(f"  [{r['rating']}/5] {r['title']} - {r['createdDate']}")
```
Important caveat: The preRegisteredQueryId values (like 84b17ed122fbdbd4 for search and ef1a9f94012220d3 for reviews) can break when TripAdvisor redeploys. When they do, your requests will fail silently. You'll need to re-discover the query IDs by monitoring network requests in browser DevTools.
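A cheap guard against that silent-failure mode is to validate the response shape before digging into it. A heuristic sketch (the function and its checks are my own):

```python
def graphql_ok(payload) -> bool:
    """Return True if a TripAdvisor GraphQL response list looks usable.
    A stale preRegisteredQueryId often yields HTTP 200 with an errors array
    or empty data rather than an HTTP error code."""
    if not isinstance(payload, list) or not payload:
        return False
    first = payload[0]
    return isinstance(first, dict) and bool(first.get("data")) and not first.get("errors")
```

Call it on `response.json()` and re-discover the query ID in DevTools whenever it starts returning False.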
Why This Method Reduces the Need for Proxies
The math is simple. With requests+BS4, scraping 100 hotel detail pages requires 100 requests. With the hidden JSON method, each request returns all the data you need from a single page load — no additional requests for expanding reviews or loading dynamic content. With GraphQL, a single API call can return 20 reviews at once. Fewer requests = less exposure to rate limiting = less need for proxy rotation. For small-to-medium projects (under 1,000 pages), you may not need proxies at all if you add sensible delays.
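To put rough numbers on the planning (the per-request figures are this section's assumptions: ~10 reviews per HTML review page, up to 20 per GraphQL call):

```python
import math

def calls_required(n_reviews: int, method: str) -> int:
    """Estimate how many requests it takes to collect n_reviews."""
    per_request = {"html": 10, "graphql": 20}[method]
    return math.ceil(n_reviews / per_request)
```

At 1,000 reviews that's 100 HTML page fetches versus 50 GraphQL calls, before any retries.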
Scrape Hotels, Restaurants, and Attractions with One Reusable Script
Four out of five competing guides only cover hotels. But TripAdvisor has three core content categories, and the URL patterns and data fields differ between them. Here's how to build one function that handles all three.
Data Fields Available per Category
| Field | Hotels | Restaurants | Attractions |
|---|---|---|---|
| Name | ✅ | ✅ | ✅ |
| Rating | ✅ | ✅ | ✅ |
| Review count | ✅ | ✅ | ✅ |
| Price/Price range | ✅ | ✅ | Sometimes |
| Address | ✅ | ✅ | ✅ |
| Cuisine type | ❌ | ✅ | ❌ |
| Duration/Tour type | ❌ | ❌ | ✅ |
| Amenities | ✅ | ❌ | ❌ |
| Coordinates | ✅ | ✅ | ✅ |
Building a Reusable scrape_tripadvisor() Function
```python
import requests
from bs4 import BeautifulSoup
import pandas as pd
import time
import random

def scrape_tripadvisor(category, location_id, location_name, num_pages=3):
    """
    Scrape TripAdvisor listings across hotels, restaurants, or attractions.

    Args:
        category: "hotels", "restaurants", or "attractions"
        location_id: TripAdvisor geo ID (e.g., "187147" for Paris)
        location_name: URL-friendly name (e.g., "Paris_Ile_de_France")
        num_pages: Number of pages to scrape
    """
    url_patterns = {
        "hotels": "https://www.tripadvisor.com/Hotels-g{geo}-oa{offset}-{name}-Hotels.html",
        "restaurants": "https://www.tripadvisor.com/Restaurants-g{geo}-oa{offset}-{name}.html",
        "attractions": "https://www.tripadvisor.com/Attractions-g{geo}-oa{offset}-Activities-{name}.html",
    }
    first_page_patterns = {
        "hotels": "https://www.tripadvisor.com/Hotels-g{geo}-{name}-Hotels.html",
        "restaurants": "https://www.tripadvisor.com/Restaurants-g{geo}-{name}.html",
        "attractions": "https://www.tripadvisor.com/Attractions-g{geo}-Activities-{name}.html",
    }
    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/135.0.0.0 Safari/537.36",
        "Accept-Language": "en-US,en;q=0.9",
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
        "Referer": "https://www.tripadvisor.com/",
        "Sec-CH-UA": '"Google Chrome";v="135", "Not-A.Brand";v="8", "Chromium";v="135"',
        "Sec-CH-UA-Mobile": "?0",
        "Sec-CH-UA-Platform": '"Windows"',
    }
    session = requests.Session()
    session.headers.update(headers)

    all_items = []
    for page in range(num_pages):
        offset = page * 30
        if page == 0:
            url = first_page_patterns[category].format(geo=location_id, name=location_name)
        else:
            url = url_patterns[category].format(geo=location_id, offset=offset, name=location_name)
        response = session.get(url)
        if response.status_code != 200:
            print(f"  Page {page + 1}: Status {response.status_code}, stopping.")
            break
        soup = BeautifulSoup(response.text, "html.parser")
        cards = soup.select('div[data-test-attribute="location-results-card"]')
        for card in cards:
            item = {"category": category}
            title_el = card.select_one('div[data-automation="hotel-card-title"]') or card.select_one('a[data-automation]')
            item["name"] = title_el.get_text(strip=True) if title_el else None
            rating_el = card.select_one('[data-automation="bubbleRatingValue"]')
            item["rating"] = rating_el.get_text(strip=True) if rating_el else None
            review_el = card.select_one('[data-automation="bubbleReviewCount"]')
            item["review_count"] = review_el.get_text(strip=True) if review_el else None
            all_items.append(item)
        print(f"  Page {page + 1}: {len(cards)} items found")
        time.sleep(random.uniform(3, 7))

    return pd.DataFrame(all_items)

# Usage examples
print("=== Hotels in Paris ===")
hotels_df = scrape_tripadvisor("hotels", "187147", "Paris_Ile_de_France", num_pages=2)
print(hotels_df.head())

print("\n=== Restaurants in Rome ===")
restaurants_df = scrape_tripadvisor("restaurants", "187791", "Rome_Lazio", num_pages=2)
print(restaurants_df.head())

print("\n=== Attractions in Barcelona ===")
attractions_df = scrape_tripadvisor("attractions", "187497", "Barcelona_Catalonia", num_pages=2)
print(attractions_df.head())
```
One function, three categories, zero code duplication. If TripAdvisor changes a selector, you fix it in one place.
What to Do When TripAdvisor Blocks You (Anti-Bot Troubleshooting)
This is the section I needed most when I started scraping TripAdvisor, and it's the section no competing guide provides in a structured way. TripAdvisor layers DataDome (which analyzes over 1,000 signals per request) and Cloudflare WAF together. Here's a diagnostic table for the most common failure modes:
| Symptom | Likely Cause | Fix |
|---|---|---|
| HTTP 403 response | Missing or suspicious headers; Cloudflare JS challenge | Set realistic User-Agent, Accept-Language, Referer, and Sec-CH-UA headers. Ensure header consistency. |
| CAPTCHA page instead of data | Rate limiting or browser fingerprinting | Rotate residential proxies, add random delays (2–7 seconds between requests) |
| Empty HTML or blank page body | JavaScript not rendered by requests | Switch to Selenium or extract from hidden JSON in page source |
| Partial reviews / "Read more" not expanding | Content loaded on click event | Use Selenium .click() or extract from embedded JSON blob |
| Reviews only in one language | Missing language parameter | Append ?filterLang=ALL to the review URL |
| Data stops loading after N pages | Session-based rate limit | Rotate sessions, clear cookies between batches |
| HTTP 1020 Access Denied | IP/ASN banned by Cloudflare | Switch from datacenter to residential proxies |
| Challenge loop (infinite CAPTCHA) | Broken cookie persistence | Warm up sessions by visiting homepage first; maintain cookie jar |
Retry Logic with Exponential Backoff
No competing article actually shows this code. Here's a reusable retry function:
```python
import time
import random
import requests

def fetch_with_retry(session, url, max_retries=4, base_delay=2, max_delay=60):
    """
    Fetch a URL with exponential backoff and jitter.
    Rotates User-Agent on each retry.
    """
    user_agents = [
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/135.0.0.0 Safari/537.36",
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/135.0.0.0 Safari/537.36",
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/134.0.0.0 Safari/537.36",
        "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/135.0.0.0 Safari/537.36",
    ]
    for attempt in range(max_retries):
        # Rotate User-Agent on retry
        if attempt > 0:
            session.headers["User-Agent"] = random.choice(user_agents)
        try:
            response = session.get(url, timeout=30)
            if response.status_code == 200:
                return response
            if response.status_code == 429:
                # Respect Retry-After header if present
                retry_after = int(response.headers.get("Retry-After", base_delay * (2 ** attempt)))
                print(f"  Rate limited (429). Waiting {retry_after}s...")
                time.sleep(retry_after)
                continue
            if response.status_code in (403, 503):
                wait = min(base_delay * (2 ** attempt) + random.uniform(0, 1), max_delay)
                print(f"  Got {response.status_code}. Retry {attempt + 1}/{max_retries} in {wait:.1f}s...")
                time.sleep(wait)
                continue
            # Other error codes — don't retry
            print(f"  Unexpected status {response.status_code} for {url}")
            return response
        except requests.exceptions.Timeout:
            wait = min(base_delay * (2 ** attempt) + random.uniform(0, 1), max_delay)
            print(f"  Timeout. Retry {attempt + 1}/{max_retries} in {wait:.1f}s...")
            time.sleep(wait)
    print(f"  All {max_retries} retries exhausted for {url}")
    return None
```
Rotating Headers, Proxies, and Sessions
For sustained scraping, maintain a pool of header sets and rotate them together:
```python
import random
import requests

HEADER_SETS = [
    {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/135.0.0.0 Safari/537.36",
        "Sec-CH-UA": '"Google Chrome";v="135", "Not-A.Brand";v="8", "Chromium";v="135"',
        "Sec-CH-UA-Platform": '"Windows"',
    },
    {
        "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/135.0.0.0 Safari/537.36",
        "Sec-CH-UA": '"Google Chrome";v="135", "Not-A.Brand";v="8", "Chromium";v="135"',
        "Sec-CH-UA-Platform": '"macOS"',
    },
    {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/134.0.0.0 Safari/537.36",
        "Sec-CH-UA": '"Google Chrome";v="134", "Not-A.Brand";v="8", "Chromium";v="134"',
        "Sec-CH-UA-Platform": '"Windows"',
    },
]

PROXY_LIST = [
    "http://user:pass@residential-proxy-1:port",
    "http://user:pass@residential-proxy-2:port",
    # Add more residential proxies
]

def get_rotated_session():
    """Create a new session with rotated headers and proxy."""
    session = requests.Session()
    # Pick a random header set
    header_set = random.choice(HEADER_SETS)
    base_headers = {
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
        "Accept-Language": "en-US,en;q=0.9",
        "Accept-Encoding": "gzip, deflate, br",
        "Referer": "https://www.tripadvisor.com/",
        "Sec-Fetch-Dest": "document",
        "Sec-Fetch-Mode": "navigate",
        "Sec-CH-UA-Mobile": "?0",
    }
    base_headers.update(header_set)
    session.headers.update(base_headers)
    # Pick a random proxy
    if PROXY_LIST:
        proxy = random.choice(PROXY_LIST)
        session.proxies = {"http": proxy, "https": proxy}
    return session
```
Proxy type matters. Datacenter proxies get blocked almost immediately by TripAdvisor (HTTP 1020 Access Denied). Residential proxies are mandatory for sustained scraping — they route through consumer ISPs and are indistinguishable from real users. Expect to pay $2.50–$8.40/GB depending on the provider.
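The diagnostic table also recommends rotating sessions and clearing cookies between batches. The way I structure that is to split the URL list into batches and hand each batch a fresh session. Here is a self-contained sketch of just the batching logic; in the full scraper, each new batch would call `get_rotated_session()` from above and the fetch itself would go through `fetch_with_retry()`:

```python
def rotating_fetch_plan(urls, rotate_every=10):
    """Split URLs into batches; in the real scraper each batch gets its own
    fresh session (new headers, new proxy, empty cookie jar)."""
    batches = []
    for i in range(0, len(urls), rotate_every):
        batches.append((i // rotate_every, urls[i:i + rotate_every]))
    return batches

urls = [f"https://www.tripadvisor.com/page-{n}.html" for n in range(25)]
plan = rotating_fetch_plan(urls, rotate_every=10)
print(len(plan))  # 3 batches: 10 + 10 + 5 URLs
```

Keeping the batch size small (10 to 20 requests per session) means a single flagged session burns only a small slice of your crawl.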
Exporting and Storing Your Scraped TripAdvisor Data
Once you have the data, getting it into a usable format is straightforward.
CSV Export (Most Common)
```python
import pandas as pd

df = pd.DataFrame(all_hotels)
df.to_csv("tripadvisor_hotels_paris.csv", index=False, encoding="utf-8-sig")
print(f"Exported {len(df)} rows to CSV")
```
The `encoding="utf-8-sig"` argument is important — it ensures Excel correctly displays non-Latin characters (French accents, Chinese characters, etc.) when opening the CSV.
JSON Export (For Nested Data)
When you have reviews nested under hotels, JSON preserves the hierarchy:
```python
import json
import pandas as pd

# Hierarchical structure
hotel_data = {
    "property_id": "d194317",
    "name": "NH City Centre Amsterdam",
    "rating": 4.0,
    "reviews": [
        {"title": "Great location", "rating": 5, "date": "2025-03-15", "text": "..."},
        {"title": "Average stay", "rating": 3, "date": "2025-03-10", "text": "..."},
    ],
}

# Write the nested structure to disk as JSON
with open("tripadvisor_hotel_nested.json", "w", encoding="utf-8") as f:
    json.dump(hotel_data, f, ensure_ascii=False, indent=2)

# For flat analysis, use json_normalize
flat_reviews = pd.json_normalize(
    hotel_data,
    record_path="reviews",
    meta=["property_id", "name"],
)
flat_reviews.to_csv("reviews_flat.csv", index=False)
```
Two-File Approach for Relational Data
For large datasets, I use two CSV files:
- `hotels.csv` — One row per property (flat)
- `reviews.csv` — One row per review, with `property_id` as a foreign key
This makes it easy to join in pandas, load into a database, or import into BI tools.
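With that layout, joining the two files back together in pandas is a single merge on the foreign key. The toy frames below stand in for the two CSVs:

```python
import pandas as pd

# Toy data mirroring the two-file layout described above.
hotels = pd.DataFrame([
    {"property_id": "d194317", "name": "NH City Centre Amsterdam", "rating": 4.0},
])
reviews = pd.DataFrame([
    {"property_id": "d194317", "title": "Great location", "rating": 5},
    {"property_id": "d194317", "title": "Average stay", "rating": 3},
])

# Join each review onto its parent hotel; the overlapping "rating"
# column gets disambiguating suffixes.
merged = reviews.merge(hotels, on="property_id", suffixes=("_review", "_hotel"))
print(len(merged))  # 2
```

The same `property_id` key also works as a JOIN column if you load the two CSVs into SQLite or a BI tool instead.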
If you don't want to deal with any of this export logic, Thunderbit lets you export to Excel, Google Sheets, Airtable, or Notion — all free, all without code. Useful when you need to share results with non-technical teammates.
Tips for Responsible and Efficient TripAdvisor Scraping
Responsible scraping in six bullets:
- Check `robots.txt`: TripAdvisor's robots.txt blocks AI training bots (GPTBot, ClaudeBot, etc.) entirely. Standard crawlers face selective path restrictions. Review it at `tripadvisor.com/robots.txt`.
- Add delays: 3–7 seconds between requests is a safe range. Going faster than 10–15 requests per minute per IP triggers rate limiting.
- Scrape only public data. Don't log in to access restricted content.
- Store data securely and comply with GDPR/CCPA if handling personal information (reviewer names, etc.).
- Consider TripAdvisor's official API if you need commercial-scale data. It offers access to business details plus up to 5 reviews and 5 photos per location — limited, but legal and stable.
- Be aware of legal context: The 2025 EU Ryanair ruling strengthened ToS-based scraping prohibitions across the EU. TripAdvisor's Terms of Service explicitly prohibit scraping. Scrape responsibly and at your own risk.
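The robots.txt check in the first bullet can be scripted with the standard library's `urllib.robotparser`. The rules below are illustrative, not a copy of TripAdvisor's actual file; in practice you would point `set_url()` at `https://www.tripadvisor.com/robots.txt` and call `read()`:

```python
from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
# Parse rules from text. These two entries are made-up examples,
# not TripAdvisor's real robots.txt.
rp.parse([
    "User-agent: GPTBot",
    "Disallow: /",
    "",
    "User-agent: *",
    "Disallow: /members/",
])

print(rp.can_fetch("GPTBot", "https://www.tripadvisor.com/Hotels"))    # False
print(rp.can_fetch("MyScraper", "https://www.tripadvisor.com/Hotels")) # True
```

Running this check once per crawl (and caching the parsed rules) costs nothing and keeps you on the right side of the site's published policy.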
Wrapping Up
That's the full picture.
- Requests + BeautifulSoup is the simplest path. It works for static listing pages, requires minimal setup, and is fast. Start here if you're scraping fewer than 100 pages and don't need JavaScript-rendered content.
- Selenium handles everything requests can't: dynamic content, "Read more" buttons, cookie banners. It's 5x slower and resource-heavy, but it's the only option when you need to interact with the page.
- Hidden JSON / GraphQL is the cleanest and fastest approach. It gives you structured data without parsing HTML, reduces the number of requests (and therefore the need for proxies), and returns data in analysis-ready formats. It requires more reverse-engineering upfront and occasional maintenance when TripAdvisor changes its data structure.
The reusable scrape_tripadvisor() function covers hotels, restaurants, and attractions. You shouldn't need a second tutorial.
And if you decide mid-tutorial that coding isn't for you — or you just need 50 hotels in a spreadsheet by end of day — Thunderbit can do it in two clicks with AI-powered field detection, automatic pagination, and free export to Excel or Google Sheets. No Python required.
If you want to go deeper, we have more scraping walkthroughs on the Thunderbit blog.
FAQs
1. Is it legal to scrape TripAdvisor?
TripAdvisor's Terms of Service explicitly prohibit scraping. However, US courts have generally held that scraping publicly available data (not behind a login) does not violate the Computer Fraud and Abuse Act. That said, the 2025 EU court ruling in the Ryanair case strengthened ToS-based restrictions in Europe. Scrape only public data, respect robots.txt, don't republish copyrighted content, and consult legal counsel if you're using the data commercially.
2. Can I scrape TripAdvisor without Python?
Yes. No-code tools like Thunderbit can scrape TripAdvisor directly from your browser with AI-powered field detection and automatic pagination. You can also use browser extensions, Google Sheets add-ons, or commercial scraping APIs. Python gives you the most control and flexibility, but it's not the only option.
3. How do I avoid getting blocked when scraping TripAdvisor?
The key tactics: use realistic and consistent headers (especially User-Agent and Sec-CH-UA), rotate residential proxies (datacenter IPs get blocked immediately), add random delays of 3–7 seconds between requests, use the hidden JSON method to minimize total requests, implement retry logic with exponential backoff, and warm up sessions by visiting the homepage before scraping deep pages.
4. What data can I scrape from TripAdvisor?
Hotels, restaurants, and attractions — including names, ratings, review counts, price ranges, addresses, coordinates, amenities (hotels), cuisine types (restaurants), tour durations (attractions), and full review text with individual ratings and dates. The hidden JSON and GraphQL approaches return the richest data per request.
5. How many pages can I scrape from TripAdvisor per day?
With a single IP and sensible delays: roughly 600–1,000 pages per day. With 20 rotating residential proxies: approximately 200,000–300,000 pages per day using request-based approaches. Selenium is slower — expect 8,000–12,000 pages per day per proxy. The hidden JSON/GraphQL approach gets you the most data per request, so you may need far fewer total pages to get the same amount of information.