If you've searched "scrape IMDb with Python" recently, you've probably noticed something: most of the tutorials you find are broken. Not "slightly outdated" broken — more like "returns zero results and a wall of NoneType errors" broken.
I've spent the last few weeks testing every major IMDb scraping tutorial I could find — GeeksforGeeks, Medium, freeCodeCamp, Kaggle notebooks, you name it. The vast majority reference CSS selectors (td.titleColumn, td.ratingColumn) that haven't existed since June 2023, when IMDb redesigned its Top 250 page. The result? Forums full of developers asking "why does my code return empty?" and maintainers of popular libraries responding with some variant of "Not much we can do about it, beside fixing every parser." This guide covers two Python methods that actually work right now, how to handle pagination and common errors, when Python isn't even the right tool, and how to future-proof your scraper so it doesn't join the graveyard.
What Does It Mean to Scrape IMDb with Python?
Web scraping is the process of programmatically extracting data from web pages — instead of manually copying and pasting, you write a script that does it for you. When we talk about "scraping IMDb," we mean pulling structured movie data (titles, ratings, genres, cast, runtime, vote counts) from IMDb's web pages using Python.
The typical Python stack for this involves three libraries: requests (to fetch the web page), BeautifulSoup (to parse the HTML and find the data), and pandas (to organize and export the results). Some tutorials also use Selenium or Playwright for pages that require JavaScript rendering, but as you'll see, there are faster approaches.
One important caveat: everything in this guide is verified against IMDb's current page structure as of mid-2025. IMDb changes things roughly every 6–12 months, so if you're reading this in 2027, some selectors may have shifted. (I'll explain how to handle that, too.)
Why Scrape IMDb with Python? Real-World Use Cases
Before writing a single line of code, what would you actually do with IMDb data? The answer depends on who you are.
The IMDb review dataset is one of the most widely used NLP benchmarks in the world — the foundational paper by Maas et al. (2011) has accumulated an enormous citation count, and the dataset is built into TensorFlow, Keras, and PyTorch. On Hugging Face, the stanfordnlp/imdb dataset gets 213,321 downloads per month and has been used to train over 1,500 models. So if you're in machine learning, you're probably already familiar with IMDb data.
But the use cases extend well beyond academia:
| Use Case | Who It's For | Data Fields Needed |
|---|---|---|
| Movie recommendation engine | Data scientists, hobbyists | Titles, genres, ratings, cast |
| Streaming platform content strategy | Product/content teams | Ratings, votes, release year, genres |
| Sentiment analysis / NLP training | ML researchers, students | Reviews, ratings |
| Competitive content analysis | Entertainment industry analysts | Box office, release dates, ratings trends |
| Film tourism research | Tourism boards, travel companies | Filming locations, popularity metrics |
| Academic research | University researchers | Any structured movie metadata |
Film tourism alone is an industry worth billions by most estimates. Netflix spent over $17 billion on content in 2024, with a substantial share of viewing driven by personalized recommendations. The point is: IMDb data feeds real decisions across industries.
Your Options for Getting IMDb Data (Before You Write a Line of Code)
This is the section most tutorials skip entirely. They jump straight to pip install beautifulsoup4 without asking whether Python scraping is even the right approach for your situation.
Here's the full landscape:
| Path | Best For | Pros | Cons |
|---|---|---|---|
| Python + BeautifulSoup | Learning, custom extraction | Full control, flexible | Fragile selectors, breaks often |
| JSON-LD / __NEXT_DATA__ extraction | Developers who want stability | Handles JS content, more resilient | Requires understanding JSON structure |
| IMDb Official Datasets | Large-scale analysis, academic use | Legal, complete, 26M+ titles, daily updates | TSV format, no reviews/images |
| Cinemagoer (IMDbPY) library | Programmatic per-title lookups | Pythonic API, rich fields | 88 open issues, last release May 2023 |
| TMDb API | Movie metadata + images | Free API key, JSON, well-documented | Different source (not IMDb ratings) |
| Thunderbit (no-code) | Non-coders, quick exports | 2-click scraping, AI suggests fields, exports to Excel/Sheets | Credit-based for large scrapes |
A few notes on these options. Cinemagoer hasn't had a PyPI release since May 2023 and most of its parsers broke after IMDb's June 2025 redesign — I wouldn't recommend it for production use right now. TMDb is excellent but uses its own rating system, not IMDb's. And IMDb's official enterprise API, sold via AWS Data Exchange, is priced for large businesses, so that's not an option for most of us.
For readers who don't want to write code at all, Thunderbit reads the IMDb page, suggests extraction fields automatically (title, rating, year, genre), and exports to Excel, Google Sheets, Airtable, or Notion in two clicks. The AI adapts when IMDb changes its layout, so there are no selectors to maintain. More on that later.
Now, for those who do want to write Python — here are two methods that work.
Method 1: Scrape IMDb with Python Using BeautifulSoup (Traditional Approach)
This is the classic approach you'll find in most tutorials. It works, but I want to be upfront: it's the most fragile of the methods I'll cover. IMDb's CSS class names are auto-generated and change with redesigns. That said, it's the best way to learn web scraping fundamentals.
Step 1: Install and Import Your Python Libraries
You need four packages:
```bash
pip install requests beautifulsoup4 pandas lxml
```
Here's what each does:
- requests — sends HTTP requests to fetch the web page
- beautifulsoup4 — parses the HTML so you can search for specific elements
- pandas — organizes the extracted data into tables and exports to CSV/Excel
- lxml — a fast HTML parser (BeautifulSoup can use it as a backend)
Your import block:
```python
import requests
from bs4 import BeautifulSoup
import pandas as pd
```
Step 2: Send an HTTP Request to IMDb
This is where most beginners hit their first wall. IMDb blocks requests that don't include a proper User-Agent header — you'll get a 403 Forbidden error. The default Python Requests user-agent string (python-requests/2.31.0) is flagged immediately.
```python
url = "https://www.imdb.com/chart/top/"
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
    "Accept-Language": "en-US,en;q=0.9"
}
response = requests.get(url, headers=headers)
if response.status_code != 200:
    print(f"Failed to fetch page: {response.status_code}")
else:
    print("Page fetched successfully")
```
The Accept-Language header matters too — without it, IMDb may return content in a different language based on your IP's geolocation.
Step 3: Parse the HTML with BeautifulSoup
Once you have the HTML, create a BeautifulSoup object and start hunting for the right elements. Open the IMDb Top 250 page in Chrome, right-click on a movie title, and click "Inspect" to see the underlying HTML structure.
```python
soup = BeautifulSoup(response.text, "lxml")
```
As of mid-2025, the Top 250 page uses these selectors:
- Movie container: li.ipc-metadata-list-summary-item
- Title: h3.ipc-title__text
- Year: span.cli-title-metadata-item (first span)
- Rating: span.ipc-rating-star--rating
Fair warning: those ipc- prefixed class names are generated by IMDb's component system. They've been stable since the June 2023 redesign, but there's no guarantee they won't change again.
Step 4: Extract Movie Data (Title, Year, Rating)
Here's where I differ from most tutorials: I include try/except error handling. None of the competitor guides I reviewed do this, which is exactly why their code breaks silently when a selector changes.
```python
movies = []
movie_items = soup.select("li.ipc-metadata-list-summary-item")
for item in movie_items:
    try:
        title_tag = item.select_one("h3.ipc-title__text")
        title = title_tag.text.strip() if title_tag else "N/A"
        year_tag = item.select_one("span.cli-title-metadata-item")
        year = year_tag.text.strip() if year_tag else "N/A"
        rating_tag = item.select_one("span.ipc-rating-star--rating")
        rating = rating_tag.text.strip() if rating_tag else "N/A"
        movies.append({
            "title": title,
            "year": year,
            "rating": rating
        })
    except Exception as e:
        print(f"Error parsing movie: {e}")
        continue
print(f"Extracted {len(movies)} movies")
```
Step 5: Save to CSV or Excel with Pandas
```python
df = pd.DataFrame(movies)
df.to_csv("imdb_top_250.csv", index=False)
df.to_excel("imdb_top_250.xlsx", index=False)  # requires openpyxl installed
print(df.head())
```
Sample output:
```
                         title  year rating
0  1. The Shawshank Redemption  1994    9.3
1             2. The Godfather  1972    9.2
2           3. The Dark Knight  2008    9.0
3     4. The Godfather Part II  1974    9.0
4              5. 12 Angry Men  1957    9.0
```
That works. But it's held together with CSS selectors that could break any day — which brings us to the approach I actually recommend.
Method 2: The JSON-LD Trick — Skip HTML Parsing Entirely
This is the technique that no competitor article covers, and it's the one I'd use for any serious project. IMDb embeds structured data as JSON-LD (JavaScript Object Notation for Linked Data) in <script type="application/ld+json"> tags on every page. This data follows the Schema.org standard, is used by Google for rich search results, and changes far less frequently than CSS class names.
The Apify IMDb Scraper, a production-grade tool, uses the extraction priority order: "JSON-LD > NEXT_DATA > DOM." That's the hierarchy I'd recommend too.
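To make that priority order concrete, here's a minimal sketch of a fallback extractor. The function name and return shape are my own inventions, not Apify's API — only the "JSON-LD > __NEXT_DATA__ > DOM" ordering comes from the text above.

```python
import json
from bs4 import BeautifulSoup

def extract_movies(html: str):
    """Try JSON-LD first, then __NEXT_DATA__, then raw DOM selectors.

    Illustrative sketch of the JSON-LD > __NEXT_DATA__ > DOM priority
    order; the helper name and return format are hypothetical.
    """
    soup = BeautifulSoup(html, "html.parser")

    # 1. JSON-LD: Schema.org data, the most stable layer
    tag = soup.find("script", {"type": "application/ld+json"})
    if tag and tag.string:
        data = json.loads(tag.string)
        if isinstance(data, dict) and data.get("itemListElement"):
            return [e["item"]["name"] for e in data["itemListElement"]], "json-ld"

    # 2. __NEXT_DATA__: the React hydration payload
    tag = soup.find("script", {"id": "__NEXT_DATA__"})
    if tag and tag.string:
        return json.loads(tag.string), "next-data"

    # 3. DOM selectors: last resort, most fragile
    items = soup.select("li.ipc-metadata-list-summary-item h3.ipc-title__text")
    return [t.text.strip() for t in items], "dom"
```

The caller gets both the data and a label saying which layer produced it, which is handy for logging when the preferred layers start failing.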
Why JSON-LD Is More Reliable Than CSS Selectors
| Approach | Handles JS Content? | Resilient to UI Changes? | Speed | Complexity |
|---|---|---|---|---|
| BeautifulSoup + CSS selectors | ❌ No | ⚠️ Fragile (class names shift) | Fast | Low |
| JSON-LD extraction | ✅ Yes | ✅ Follows Schema.org standard | Fast | Low-Medium |
| __NEXT_DATA__ JSON extraction | ✅ Yes | ✅ Fairly stable | Fast | Low-Medium |
| Selenium / Playwright | ✅ Yes | ⚠️ Fragile | Slow | Medium-High |
| Thunderbit (no-code, 2-click) | ✅ Yes (AI reads page) | ✅ AI adapts automatically | Fast | None |
CSS class names like ipc-metadata-list-summary-item are auto-generated by IMDb's React component system and change with every redesign. The JSON-LD schema represents the actual data model, not the presentation layer. It's like the difference between reading a book's table of contents versus trying to identify chapters by their font size.

Step-by-Step: Extract IMDb Data from JSON-LD
Step 1: Fetch the Page
Same as before — use requests with a proper User-Agent header.
```python
import requests
from bs4 import BeautifulSoup
import json
import pandas as pd

url = "https://www.imdb.com/chart/top/"
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
    "Accept-Language": "en-US,en;q=0.9"
}
response = requests.get(url, headers=headers)
soup = BeautifulSoup(response.text, "lxml")
```
Step 2: Find the JSON-LD Script Tag
```python
script_tag = soup.find("script", {"type": "application/ld+json"})
if not script_tag:
    print("No JSON-LD found on this page")
else:
    data = json.loads(script_tag.string)
    print(f"Found JSON-LD with type: {data.get('@type', 'unknown')}")
```
Step 3: Parse the Structured Data
On the Top 250 page, the JSON-LD contains an itemListElement array with all 250 movies. Each entry includes position, name, URL, aggregateRating, datePublished, genre, description, director, and actor arrays.
```python
movies = []
for item in data.get("itemListElement", []):
    movie = item.get("item", {})
    rating_info = movie.get("aggregateRating", {})
    genre = movie.get("genre", [])  # may be a single string or a list
    movies.append({
        "rank": item.get("position"),
        "title": movie.get("name"),
        "url": movie.get("url"),
        "rating": rating_info.get("ratingValue"),
        "vote_count": rating_info.get("ratingCount"),
        "date_published": movie.get("datePublished"),
        "genre": ", ".join(genre) if isinstance(genre, list) else genre,
        "description": movie.get("description"),
    })
```
Step 4: Export to CSV
```python
df = pd.DataFrame(movies)
df.to_csv("imdb_top_250_json_ld.csv", index=False)
print(df.head())
```
Sample output:
```
   rank                     title                                    url  rating  vote_count date_published                 genre
0     1  The Shawshank Redemption  https://www.imdb.com/title/tt0111161/     9.3     2900000     1994-10-14                 Drama
1     2             The Godfather  https://www.imdb.com/title/tt0068646/     9.2     2000000     1972-03-24          Crime, Drama
2     3           The Dark Knight  https://www.imdb.com/title/tt0468569/     9.0     2800000     2008-07-18  Action, Crime, Drama
```
All 250 movies. Clean, structured, no CSS selector gymnastics. And because this data follows the Schema.org standard (which Google depends on for search results), it's far less likely to change than the visual layout.
Bonus: __NEXT_DATA__ for Individual Movie Pages
For richer data from individual title pages (runtime, full cast, plot summary, poster images), IMDb also embeds a __NEXT_DATA__ JSON object. This is the data React uses to hydrate the page — it can't be removed without breaking the site.
```python
# On an individual movie page like /title/tt0111161/
next_data_tag = soup.find("script", {"id": "__NEXT_DATA__"})
if next_data_tag:
    next_data = json.loads(next_data_tag.string)
    above_fold = next_data["props"]["pageProps"]["aboveTheFoldData"]
    title = above_fold["titleText"]["text"]
    year = above_fold["releaseYear"]["year"]
    rating = above_fold["ratingsSummary"]["aggregateRating"]
    # runtime can be null on some titles, so guard before .get("seconds")
    runtime_seconds = (above_fold.get("runtime") or {}).get("seconds", 0)
    genres = [g["text"] for g in above_fold["genres"]["genres"]]
    plot = above_fold["plot"]["plotText"]["plainText"]
```
Use JSON-LD for chart/list pages, __NEXT_DATA__ for individual title pages. That's the production-grade approach.
Why Your IMDb Scraper Keeps Breaking (And How to Fix It)
This is the single most-reported pain point across every IMDb scraping forum I checked. Users write: "Some of the code broke because of UI changes" and "Not working in 2024!" — and the response is usually silence or "try Selenium."
The root cause is IMDb's ongoing migration to a React/Next.js frontend. Here's the timeline of major breaking changes:
| Date | What Changed | What Broke |
|---|---|---|
| Nov 2022 | Name Pages redesigned | Old name-page scrapers |
| April 2023 | Title subpages redesigned | Bio, awards, news scrapers |
| June 2023 | Top 250 page redesigned | All td.titleColumn / td.ratingColumn selectors |
| Oct 2023 | Advanced Search redesigned | Search-based scrapers |
| June 2025 | /reference pages redesigned | Cinemagoer library (most parsers) |
That's roughly one major breaking change every 6–12 months. If your scraper relies on CSS class names, you're on a treadmill.
Common Errors and How to Fix Them
Empty results / NoneType errors
The most common error. You'll see AttributeError: 'NoneType' object has no attribute 'text'. This means BeautifulSoup couldn't find the element you're looking for — usually because the CSS class name changed or the content is rendered by JavaScript.
Fix: Switch to JSON-LD extraction (Method 2 above). The data is in the initial HTML response, no JavaScript required.
403 Forbidden
IMDb uses bot-detection systems to identify and block automated traffic. The #1 trigger is a missing or obviously fake User-Agent header. This failure mode is documented across open-source projects, and an IMDb employee has publicly acknowledged the issue.
Fix: Always include a realistic browser User-Agent string and Accept-Language: en-US header. Use requests.Session() for connection pooling.
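That advice, folded into one reusable helper (the function name is mine, not part of any library):

```python
import requests

# One Session reuses the TCP connection and applies the same headers to
# every request — faster and less bot-like than repeated requests.get calls.
def make_imdb_session() -> requests.Session:
    session = requests.Session()
    session.headers.update({
        "User-Agent": (
            "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
            "AppleWebKit/537.36 (KHTML, like Gecko) "
            "Chrome/120.0.0.0 Safari/537.36"
        ),
        "Accept-Language": "en-US,en;q=0.9",
    })
    return session

# usage: response = make_imdb_session().get("https://www.imdb.com/chart/top/")
```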
Only 25 results returned
IMDb search pages and "Most Popular" lists use lazy loading — they only render about 25 results initially and load more via AJAX as you scroll.
Fix: Use URL parameter pagination (covered in the next section) or switch to the Top 250 page, which loads all 250 movies in a single response.
Selectors suddenly stop working
Old selectors that no longer work: td.titleColumn, td.ratingColumn, .lister-item-header, .inline-block.ratings-imdb-rating. If your code uses any of these, it's broken.
Fix: Prefer data-testid attributes (like h1[data-testid="hero-title-block__title"]) over auto-generated class names. Better yet, use JSON-LD.
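Here's what a data-testid lookup looks like against a toy HTML snippet. The attribute value is the one quoted above; for anything else, verify the actual attribute in DevTools before relying on it:

```python
from bs4 import BeautifulSoup

# Toy snippet standing in for a real title page response
html = '<h1 data-testid="hero-title-block__title">The Shawshank Redemption</h1>'
soup = BeautifulSoup(html, "html.parser")

# Attribute selectors target the data hook, not the styling class,
# so a restyle that renames CSS classes leaves this selector intact.
title_tag = soup.select_one('[data-testid="hero-title-block__title"]')
title = title_tag.text.strip() if title_tag else "N/A"
print(title)
```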
A Decision Framework: Short-Term vs. Long-Term Fixes
- Quick fix: Add try/except blocks around every selector, validate HTTP status codes, and log errors instead of crashing
- Medium-term fix: Switch from CSS selectors to JSON-LD extraction (Method 2)
- Long-term fix: Use IMDb's official datasets for large-scale analysis, or a tool like Thunderbit that uses AI to re-read the page structure fresh each time — no selectors to maintain, the AI adapts to layout changes automatically
Beyond the 25-Result Wall: Scraping IMDb Pagination and Large Datasets
Every competitor tutorial I reviewed scrapes exactly one page. Nobody covers pagination. But if you need more than a single list, you'll hit walls fast.
Pages That Don't Need Pagination
Good news: the Top 250 page loads all 250 movies in a single server-rendered response. The JSON-LD and __NEXT_DATA__ both contain the complete dataset. No pagination required.
How IMDb Search Pagination Works
IMDb search pages use a start= URL parameter, incrementing by 50:
```
https://www.imdb.com/search/title/?groups=top_1000&start=1
https://www.imdb.com/search/title/?groups=top_1000&start=51
https://www.imdb.com/search/title/?groups=top_1000&start=101
```
Here's a Python loop that pages through results:
```python
import time
import requests
from bs4 import BeautifulSoup

all_movies = []
for start in range(1, 1001, 50):  # pages through the top 1000
    url = f"https://www.imdb.com/search/title/?groups=top_1000&start={start}"
    response = requests.get(url, headers=headers)  # headers from earlier
    if response.status_code != 200:
        print(f"Failed at start={start}: {response.status_code}")
        break
    soup = BeautifulSoup(response.text, "lxml")
    # Extract movies using your preferred method
    # ...
    print(f"Scraped page starting at {start}")
    time.sleep(3)  # Be respectful — IMDb blocks after ~50 rapid requests
```
That time.sleep(3) matters. Community reports suggest IMDb starts blocking IPs after approximately 50 rapid requests. A random delay between 2–5 seconds is a good practice.
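A small helper for that randomized delay (the name polite_pause is my own):

```python
import random
import time

def polite_pause(low: float = 2.0, high: float = 5.0) -> float:
    """Sleep for a random interval so request timing looks less robotic.

    Returns the delay actually used, which is handy for logging.
    """
    delay = random.uniform(low, high)
    time.sleep(delay)
    return delay
```

Call polite_pause() at the bottom of each loop iteration instead of a fixed time.sleep(3).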
When to Skip Scraping Entirely: IMDb's Official Bulk Datasets
For truly large-scale needs, IMDb provides 7 free TSV files on its official datasets site, refreshed daily:
| File | Contents | Size |
|---|---|---|
| title.basics.tsv.gz | Titles, types, genres, runtime, year | ~800 MB |
| title.ratings.tsv.gz | Average rating, number of votes | ~25 MB |
| title.crew.tsv.gz | Directors, writers | ~300 MB |
| title.principals.tsv.gz | Top-billed cast/crew | ~2 GB |
| title.akas.tsv.gz | Alternative titles by region | ~1.5 GB |
| title.episode.tsv.gz | TV episode info | ~200 MB |
| name.basics.tsv.gz | People: name, birth year, known-for titles | ~700 MB |
Loading them into Pandas is straightforward:
```python
ratings = pd.read_csv("title.ratings.tsv.gz", sep="\t", compression="gzip")
basics = pd.read_csv("title.basics.tsv.gz", sep="\t", compression="gzip", low_memory=False)

# Merge on tconst (IMDb title ID)
merged = basics.merge(ratings, on="tconst")
top_movies = merged[merged["titleType"] == "movie"].nlargest(250, "averageRating")
```
These datasets cover 26+ million titles. No pagination, no selectors, no 403 errors. The license is for personal and non-commercial use only — you can't republish or resell the data.
The No-Code Shortcut: Thunderbit Handles Pagination for You
For readers who need paginated IMDb data but don't want to write pagination logic, Thunderbit supports both click-based pagination and infinite scroll natively. You tell it to scrape, it handles the rest — including scrolling through lazy-loaded content.
Scrape IMDb with Python: The Complete Working Code (Copy-Paste Ready)
Here are two self-contained scripts you can run right now.
Script A: BeautifulSoup Method (CSS Selectors)
```python
import requests
from bs4 import BeautifulSoup
import pandas as pd

url = "https://www.imdb.com/chart/top/"
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
    "Accept-Language": "en-US,en;q=0.9"
}

response = requests.get(url, headers=headers)
if response.status_code != 200:
    print(f"Error: {response.status_code}")
    exit()

soup = BeautifulSoup(response.text, "lxml")
movie_items = soup.select("li.ipc-metadata-list-summary-item")
movies = []
for item in movie_items:
    try:
        title = item.select_one("h3.ipc-title__text")
        year = item.select_one("span.cli-title-metadata-item")
        rating = item.select_one("span.ipc-rating-star--rating")
        movies.append({
            "title": title.text.strip() if title else "N/A",
            "year": year.text.strip() if year else "N/A",
            "rating": rating.text.strip() if rating else "N/A",
        })
    except Exception as e:
        print(f"Skipping movie due to error: {e}")

df = pd.DataFrame(movies)
df.to_csv("imdb_top250_bs4.csv", index=False)
print(f"Saved {len(df)} movies")
print(df.head())
```
Script B: JSON-LD Method (Recommended)
```python
import requests
from bs4 import BeautifulSoup
import json
import pandas as pd

url = "https://www.imdb.com/chart/top/"
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
    "Accept-Language": "en-US,en;q=0.9"
}

response = requests.get(url, headers=headers)
if response.status_code != 200:
    print(f"Error: {response.status_code}")
    exit()

soup = BeautifulSoup(response.text, "lxml")
script_tag = soup.find("script", {"type": "application/ld+json"})
if not script_tag:
    print("No JSON-LD data found")
    exit()

data = json.loads(script_tag.string)
movies = []
for item in data.get("itemListElement", []):
    movie = item.get("item", {})
    rating_info = movie.get("aggregateRating", {})
    directors = movie.get("director", [])
    director_names = ", ".join(
        d.get("name", "") for d in (directors if isinstance(directors, list) else [directors])
    )
    genre = movie.get("genre", [])  # may be a string or a list
    movies.append({
        "rank": item.get("position"),
        "title": movie.get("name"),
        "url": movie.get("url"),
        "rating": rating_info.get("ratingValue"),
        "votes": rating_info.get("ratingCount"),
        "year": movie.get("datePublished", "")[:4],
        "genre": ", ".join(genre) if isinstance(genre, list) else genre,
        "director": director_names,
        "description": movie.get("description"),
    })

df = pd.DataFrame(movies)
df.to_csv("imdb_top250_jsonld.csv", index=False)
print(f"Saved {len(df)} movies")
print(df.head())
```
Both scripts include error handling and produce clean CSV output. Script B gives you richer data — director, description, URL — and is more resilient to layout changes.
How to Scrape IMDb Without Writing Any Code (Using Thunderbit)
Not everyone needs or wants to write Python. Maybe you're an operations analyst who just needs this week's top-rated movies in a spreadsheet. Maybe you're a content strategist who wants to compare genre trends across years. In those cases, writing a scraper is overkill.
Here's how to get the same data using Thunderbit:
Before you start:
- Difficulty: Beginner
- Time Required: ~2 minutes
- What You'll Need: Chrome browser, the Thunderbit extension (free tier works)
Step 1: Navigate to the IMDb page you want to scrape. Open the IMDb Top 250 (or any other IMDb list/search page) in Chrome.
Step 2: Click "AI Suggest Fields" in the Thunderbit sidebar. The AI scans the page and recommends columns — typically Title, Year, Rating, Genre, and a few others depending on the page. You'll see a preview table with the suggested fields.
Step 3: Adjust fields if needed. Remove columns you don't need, or add custom ones by clicking "+ Add Column" and describing what you want in plain English (e.g., "Director name" or "Number of votes").
Step 4: Click "Scrape." Thunderbit extracts the data. For pages with infinite scroll or pagination, it handles the scrolling automatically.
Step 5: Export. Click the export button and choose your format — Excel, Google Sheets, CSV, Airtable, or Notion. The data lands in your destination in seconds.
The key advantage here isn't just convenience — it's that Thunderbit's AI reads the page structure fresh each time. When IMDb changes its layout (and it will), the AI adapts. No selectors to update, no code to fix. For anyone who's been burned by a broken scraper at 2 AM before a deadline, that's worth a lot.
Thunderbit also supports subpage scraping — you can click into each movie's detail page and enrich your table with cast, director, runtime, and other fields that aren't visible on the list page.
Is It Legal to Scrape IMDb? What You Need to Know
Users explicitly ask this in forums: "Is something like this legal?… IMDb does not want people scraping their website." It's a fair question, and no competitor article addresses it.
IMDb's robots.txt: The Top 250 chart (/chart/top/), individual title pages (/title/ttXXXXXXX/), and name pages (/name/nmXXXXXXX/) are NOT blocked by robots.txt. Blocked paths include /find, /_json/*, /search/name-text, /user/ur*/ratings, and various AJAX endpoints. There's no Crawl-delay directive specified.
IMDb's Conditions of Use: The relevant clause states: "You may not use data mining, robots, screen scraping, or similar data gathering and extraction tools on this site, except with our express written consent." An additional clause prohibits resale or commercial use of scraped data.
What this means in practice: Recent 2024 court rulings (Meta v. Bright Data, X Corp v. Bright Data) found that Terms of Service may not bind users who never agreed to them — if you're scraping publicly available data without logging in, the ToS enforceability is debatable. But this is an evolving legal area.
Safe alternatives: IMDb's official datasets are explicitly sanctioned for personal and non-commercial use. TMDb's API is permissive with a free API key. Both are solid options if you want to stay clearly in the clear.
Practical guidance: If you do scrape, use a respectful crawl rate (time.sleep(3) between requests), set proper headers, and don't hit paths blocked by robots.txt. For commercial projects, consult a legal professional or use the official datasets/API.
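You can automate the robots.txt check with Python's standard library. Here the rules are supplied inline for illustration (a subset of the blocked paths listed above); in a real script you'd call rp.set_url("https://www.imdb.com/robots.txt") and rp.read() instead of rp.parse:

```python
from urllib.robotparser import RobotFileParser

# Check candidate URLs against robots.txt rules before fetching them.
rp = RobotFileParser()
rp.parse("""User-agent: *
Disallow: /find
Disallow: /search/name-text
""".splitlines())

print(rp.can_fetch("*", "https://www.imdb.com/chart/top/"))  # True
print(rp.can_fetch("*", "https://www.imdb.com/find"))        # False
```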
We've covered the legal side of web scraping in more depth on the Thunderbit blog.
Conclusion: Pick the Right Way to Scrape IMDb with Python
The short version:
- BeautifulSoup + CSS selectors: Good for learning the fundamentals. Expect it to break every 6–12 months. Always include error handling.
- JSON-LD extraction: The approach I'd recommend for any ongoing Python project. Follows the Schema.org standard, changes far less often than CSS classes, and gives you clean structured data without JavaScript rendering.
- __NEXT_DATA__ JSON: Use this as a supplement for richer data on individual title pages (runtime, full cast, plot, poster images).
- IMDb Official Datasets: The best choice for large-scale analysis. 26M+ titles, updated daily, no scraping required. Personal/non-commercial use only.
- Thunderbit: The best choice for non-coders or anyone who wants data fast without maintaining code. AI adapts to layout changes, handles pagination, exports to Excel/Sheets/Airtable/Notion.
Bookmark this guide — I'll update it when IMDb's structure changes next. And if you want to skip the code entirely, give Thunderbit a try and see how fast you can get from an IMDb page to a clean spreadsheet. If you're working with other sites too, our broader web scraping guides cover the full workflow.
FAQs
Is it legal to scrape IMDb?
IMDb's Terms of Service prohibit scraping without consent, but the enforceability of ToS on publicly accessible data is legally debatable after recent 2024 court rulings. The safest options are IMDb's official datasets (personal/non-commercial use) or the TMDb API (free key). If you do scrape, respect robots.txt, use reasonable delays between requests, and avoid blocked paths. For commercial use, consult a legal professional.
Why does my IMDb scraper return empty results?
Almost always, the cause is outdated CSS selectors — class names like td.titleColumn and td.ratingColumn haven't existed since June 2023. The fix is to switch to JSON-LD extraction (parse the <script type="application/ld+json"> tag) or update your selectors to the current ipc- prefixed classes. Also verify you're including a proper User-Agent header, as a missing header triggers a 403 error that can appear as empty results.
How do I scrape more than 25 results from IMDb?
The Top 250 page loads all 250 movies in a single response — no pagination needed. For search results, use the start= URL parameter (incrementing by 50) to page through results. For example: start=1, start=51, start=101. Add a time.sleep(3) between requests to avoid getting blocked. Alternatively, IMDb's official datasets contain 26M+ titles with no pagination required.
What is __NEXT_DATA__ and why should I use it to scrape IMDb?
__NEXT_DATA__ is a JSON object embedded in a <script id="__NEXT_DATA__"> tag on IMDb's React/Next.js pages. It contains the complete structured data that React uses to render the page — titles, ratings, cast, genres, runtime, and more. Because it represents the underlying data model rather than the visual layout, it's more resilient to UI redesigns than CSS selectors. Use it alongside JSON-LD for the most robust extraction approach.
Can I scrape IMDb without coding?
Yes. Two main options: (1) Download IMDb's official datasets — 7 TSV files covering 26M+ titles, updated daily, free for non-commercial use. (2) Use Thunderbit, which reads the IMDb page, suggests extraction fields automatically, and exports to Excel, Google Sheets, or CSV in two clicks — no code, no selectors to maintain.