If you've searched "scrape IMDb with Python" recently, you've probably noticed something: most of the tutorials you find are broken. Not "slightly outdated" broken — more like "returns zero results and a wall of NoneType errors" broken.
I've spent the last few weeks testing every major IMDb scraping tutorial I could find — GeeksforGeeks, Medium, freeCodeCamp, Kaggle notebooks, you name it. The vast majority reference CSS selectors (td.titleColumn, td.ratingColumn) that haven't existed since June 2023, when IMDb redesigned its Top 250 page. The result? Forums full of developers asking "why does my code return empty?" and maintainers of popular libraries responding with some variant of "Not much we can do about it, beside fixing every parser." This guide covers two Python methods that actually work right now, how to handle pagination and common errors, when Python isn't even the right tool, and how to future-proof your scraper so it doesn't join the graveyard.
What Does It Mean to Scrape IMDb with Python?
Web scraping is the process of programmatically extracting data from web pages — instead of manually copying and pasting, you write a script that does it for you. When we talk about "scraping IMDb," we mean pulling structured movie data (titles, ratings, genres, cast, runtime, vote counts) from IMDb's web pages using Python.
The typical Python stack for this involves three libraries: requests (to fetch the web page), BeautifulSoup (to parse the HTML and find the data), and pandas (to organize and export the results). Some tutorials also use Selenium or Playwright for pages that require JavaScript rendering, but as you'll see, there are faster approaches.
One important caveat: everything in this guide is verified against IMDb's current page structure as of mid-2025. IMDb changes things roughly every 6–12 months, so if you're reading this in 2027, some selectors may have shifted. (I'll explain how to handle that, too.)
Why Scrape IMDb with Python? Real-World Use Cases
Before writing a single line of code, what would you actually do with IMDb data? The answer depends on who you are.
The IMDb review dataset is one of the most widely used NLP benchmarks in the world — the foundational paper by Maas et al. (2011) has accumulated an enormous citation count, and the dataset is built into TensorFlow, Keras, and PyTorch. On Hugging Face, the stanfordnlp/imdb dataset gets 213,321 downloads per month and has been used to train over 1,500 models. So if you're in machine learning, you're probably already familiar with IMDb data.
But the use cases extend well beyond academia:
| Use Case | Who It's For | Data Fields Needed |
|---|---|---|
| Movie recommendation engine | Data scientists, hobbyists | Titles, genres, ratings, cast |
| Streaming platform content strategy | Product/content teams | Ratings, votes, release year, genres |
| Sentiment analysis / NLP training | ML researchers, students | Reviews, ratings |
| Competitive content analysis | Entertainment industry analysts | Box office, release dates, ratings trends |
| Film tourism research | Tourism boards, travel companies | Filming locations, popularity metrics |
| Academic research | University researchers | Any structured movie metadata |
Film tourism alone is an industry worth billions by most estimates. Netflix spent over $17 billion on content in 2024, with a substantial share of viewing driven by personalized recommendations. The point is: IMDb data feeds real decisions across industries.
Your Options for Getting IMDb Data (Before You Write a Line of Code)
This is the section most tutorials skip entirely. They jump straight to pip install beautifulsoup4 without asking whether Python scraping is even the right approach for your situation.
Here's the full landscape:
| Path | Best For | Pros | Cons |
|---|---|---|---|
| Python + BeautifulSoup | Learning, custom extraction | Full control, flexible | Fragile selectors, breaks often |
| JSON-LD / __NEXT_DATA__ extraction | Developers who want stability | Handles JS content, more resilient | Requires understanding JSON structure |
| IMDb Official Datasets | Large-scale analysis, academic use | Legal, complete, 26M+ titles, daily updates | TSV format, no reviews/images |
| Cinemagoer (IMDbPY) library | Programmatic per-title lookups | Pythonic API, rich fields | 88 open issues, last release May 2023 |
| TMDb API | Movie metadata + images | Free API key, JSON, well-documented | Different source (not IMDb ratings) |
| Thunderbit (no-code) | Non-coders, quick exports | 2-click scraping, AI suggests fields, exports to Excel/Sheets | Credit-based for large scrapes |
A few notes on these options. Cinemagoer hasn't had a PyPI release since May 2023 and most of its parsers broke after IMDb's June 2025 redesign — I wouldn't recommend it for production use right now. TMDb is excellent but uses its own rating system, not IMDb's. And IMDb's official enterprise API, sold via AWS Data Exchange, is priced for large businesses, so that's not an option for most of us.
For readers who don't want to write code at all, Thunderbit reads the IMDb page, suggests extraction fields automatically (title, rating, year, genre), and exports to Excel, Google Sheets, Airtable, or Notion in two clicks. The AI adapts when IMDb changes its layout, so there are no selectors to maintain. More on that later.
Now, for those who do want to write Python — here are two methods that work.
Method 1: Scrape IMDb with Python Using BeautifulSoup (Traditional Approach)
This is the classic approach you'll find in most tutorials. It works, but I want to be upfront: it's the most fragile of the methods I'll cover. IMDb's CSS class names are auto-generated and change with redesigns. That said, it's the best way to learn web scraping fundamentals.
Step 1: Install and Import Your Python Libraries
You need four packages:
```bash
pip install requests beautifulsoup4 pandas lxml
```
Here's what each does:
- requests — sends HTTP requests to fetch the web page
- beautifulsoup4 — parses the HTML so you can search for specific elements
- pandas — organizes the extracted data into tables and exports to CSV/Excel
- lxml — a fast HTML parser (BeautifulSoup can use it as a backend)
Your import block:
```python
import requests
from bs4 import BeautifulSoup
import pandas as pd
```
Step 2: Send an HTTP Request to IMDb
This is where most beginners hit their first wall. IMDb blocks requests that don't include a proper User-Agent header — you'll get a 403 Forbidden error. The default Python Requests user-agent string (python-requests/2.31.0) is flagged immediately.
```python
url = "https://www.imdb.com/chart/top/"
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
    "Accept-Language": "en-US,en;q=0.9"
}
response = requests.get(url, headers=headers)
if response.status_code != 200:
    print(f"Failed to fetch page: {response.status_code}")
else:
    print("Page fetched successfully")
```
The Accept-Language header matters too — without it, IMDb may return content in a different language based on your IP's geolocation.
Step 3: Parse the HTML with BeautifulSoup
Once you have the HTML, create a BeautifulSoup object and start hunting for the right elements. Open the IMDb Top 250 page in Chrome, right-click on a movie title, and click "Inspect" to see the underlying HTML structure.
```python
soup = BeautifulSoup(response.text, "lxml")
```
As of mid-2025, the Top 250 page uses these selectors:
- Movie container: li.ipc-metadata-list-summary-item
- Title: h3.ipc-title__text
- Year: span.cli-title-metadata-item (first span)
- Rating: span.ipc-rating-star--rating
Fair warning: those ipc- prefixed class names are generated by IMDb's component system. They've been stable since the June 2023 redesign, but there's no guarantee they won't change again.
Step 4: Extract Movie Data (Title, Year, Rating)
Here's where I differ from most tutorials: I include try/except error handling. None of the competitor guides I reviewed do this, which is exactly why their code breaks silently when a selector changes.
```python
movies = []
movie_items = soup.select("li.ipc-metadata-list-summary-item")
for item in movie_items:
    try:
        title_tag = item.select_one("h3.ipc-title__text")
        title = title_tag.text.strip() if title_tag else "N/A"
        year_tag = item.select_one("span.cli-title-metadata-item")
        year = year_tag.text.strip() if year_tag else "N/A"
        rating_tag = item.select_one("span.ipc-rating-star--rating")
        rating = rating_tag.text.strip() if rating_tag else "N/A"
        movies.append({
            "title": title,
            "year": year,
            "rating": rating
        })
    except Exception as e:
        print(f"Error parsing movie: {e}")
        continue
print(f"Extracted {len(movies)} movies")
```
Step 5: Save to CSV or Excel with Pandas
```python
df = pd.DataFrame(movies)
df.to_csv("imdb_top_250.csv", index=False)
df.to_excel("imdb_top_250.xlsx", index=False)  # requires openpyxl installed
print(df.head())
```
Sample output:
```
                         title  year rating
0  1. The Shawshank Redemption  1994    9.3
1             2. The Godfather  1972    9.2
2           3. The Dark Knight  2008    9.0
3     4. The Godfather Part II  1974    9.0
4              5. 12 Angry Men  1957    9.0
```
That works. But it's held together with CSS selectors that could break any day — which brings us to the approach I actually recommend.
Method 2: The JSON-LD Trick — Skip HTML Parsing Entirely
This is the technique that no competitor article covers, and it's the one I'd use for any serious project. IMDb embeds structured data as JSON-LD (JavaScript Object Notation for Linked Data) in <script type="application/ld+json"> tags on every page. This data follows the Schema.org standard, is used by Google for rich search results, and changes far less frequently than CSS class names.
The Apify IMDb Scraper, a production-grade tool, uses the extraction priority order: "JSON-LD > NEXT_DATA > DOM." That's the hierarchy I'd recommend too.
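To make that priority order concrete, here's a minimal sketch of a fallback extractor. The function name and return shape are my own inventions, not Apify's API — only the "JSON-LD > __NEXT_DATA__ > DOM" ordering comes from the text above.

```python
import json
from bs4 import BeautifulSoup

def extract_movies(html: str):
    """Try JSON-LD first, then __NEXT_DATA__, then raw DOM selectors.

    Illustrative sketch of the JSON-LD > __NEXT_DATA__ > DOM priority
    order; the helper name and return format are hypothetical.
    """
    soup = BeautifulSoup(html, "html.parser")

    # 1. JSON-LD: Schema.org data, the most stable layer
    tag = soup.find("script", {"type": "application/ld+json"})
    if tag and tag.string:
        data = json.loads(tag.string)
        if isinstance(data, dict) and data.get("itemListElement"):
            return [e["item"]["name"] for e in data["itemListElement"]], "json-ld"

    # 2. __NEXT_DATA__: the React hydration payload
    tag = soup.find("script", {"id": "__NEXT_DATA__"})
    if tag and tag.string:
        return json.loads(tag.string), "next-data"

    # 3. DOM selectors: last resort, most fragile
    items = soup.select("li.ipc-metadata-list-summary-item h3.ipc-title__text")
    return [t.text.strip() for t in items], "dom"
```

The caller gets both the data and a label saying which layer produced it, which is handy for logging when the preferred layers start failing.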
Why JSON-LD Is More Reliable Than CSS Selectors
| Approach | Handles JS Content? | Resilient to UI Changes? | Speed | Complexity |
|---|---|---|---|---|
| BeautifulSoup + CSS selectors | ❌ No | ⚠️ Fragile (class names shift) | Fast | Low |
| JSON-LD extraction | ✅ Yes | ✅ Follows Schema.org standard | Fast | Low-Medium |
| __NEXT_DATA__ JSON extraction | ✅ Yes | ✅ Fairly stable | Fast | Low-Medium |
| Selenium / Playwright | ✅ Yes | ⚠️ Fragile | Slow | Medium-High |
| Thunderbit (no-code, 2-click) | ✅ Yes (AI reads page) | ✅ AI adapts automatically | Fast | None |
CSS class names like ipc-metadata-list-summary-item are auto-generated by IMDb's React component system and change with every redesign. The JSON-LD schema represents the actual data model, not the presentation layer. It's like the difference between reading a book's table of contents versus trying to identify chapters by their font size.

Step-by-Step: Extract IMDb Data from JSON-LD
Step 1: Fetch the Page
Same as before — use requests with a proper User-Agent header.
```python
import requests
from bs4 import BeautifulSoup
import json
import pandas as pd

url = "https://www.imdb.com/chart/top/"
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
    "Accept-Language": "en-US,en;q=0.9"
}
response = requests.get(url, headers=headers)
soup = BeautifulSoup(response.text, "lxml")
```
Step 2: Find the JSON-LD Script Tag
```python
script_tag = soup.find("script", {"type": "application/ld+json"})
if not script_tag:
    print("No JSON-LD found on this page")
else:
    data = json.loads(script_tag.string)
    print(f"Found JSON-LD with type: {data.get('@type', 'unknown')}")
```
Step 3: Parse the Structured Data
On the Top 250 page, the JSON-LD contains an itemListElement array with all 250 movies. Each entry includes position, name, URL, aggregateRating, datePublished, genre, description, director, and actor arrays.
```python
movies = []
for item in data.get("itemListElement", []):
    movie = item.get("item", {})
    rating_info = movie.get("aggregateRating", {})
    genre = movie.get("genre", [])  # may be a single string or a list
    movies.append({
        "rank": item.get("position"),
        "title": movie.get("name"),
        "url": movie.get("url"),
        "rating": rating_info.get("ratingValue"),
        "vote_count": rating_info.get("ratingCount"),
        "date_published": movie.get("datePublished"),
        "genre": ", ".join(genre) if isinstance(genre, list) else genre,
        "description": movie.get("description"),
    })
```
Step 4: Export to CSV
```python
df = pd.DataFrame(movies)
df.to_csv("imdb_top_250_json_ld.csv", index=False)
print(df.head())
```
Sample output:
```
   rank                     title                                    url  rating  vote_count date_published                 genre
0     1  The Shawshank Redemption  https://www.imdb.com/title/tt0111161/     9.3     2900000     1994-10-14                 Drama
1     2             The Godfather  https://www.imdb.com/title/tt0068646/     9.2     2000000     1972-03-24          Crime, Drama
2     3           The Dark Knight  https://www.imdb.com/title/tt0468569/     9.0     2800000     2008-07-18  Action, Crime, Drama
```
All 250 movies. Clean, structured, no CSS selector gymnastics. And because this data follows the Schema.org standard (which Google depends on for search results), it's far less likely to change than the visual layout.
Bonus: __NEXT_DATA__ for Individual Movie Pages
For richer data from individual title pages (runtime, full cast, plot summary, poster images), IMDb also embeds a __NEXT_DATA__ JSON object. This is the data React uses to hydrate the page — it can't be removed without breaking the site.
```python
# On an individual movie page like /title/tt0111161/
next_data_tag = soup.find("script", {"id": "__NEXT_DATA__"})
if next_data_tag:
    next_data = json.loads(next_data_tag.string)
    above_fold = next_data["props"]["pageProps"]["aboveTheFoldData"]
    title = above_fold["titleText"]["text"]
    year = above_fold["releaseYear"]["year"]
    rating = above_fold["ratingsSummary"]["aggregateRating"]
    # runtime can be null on some titles, so guard before .get("seconds")
    runtime_seconds = (above_fold.get("runtime") or {}).get("seconds", 0)
    genres = [g["text"] for g in above_fold["genres"]["genres"]]
    plot = above_fold["plot"]["plotText"]["plainText"]
```
Use JSON-LD for chart/list pages, __NEXT_DATA__ for individual title pages. That's the production-grade approach.
Why Your IMDb Scraper Keeps Breaking (And How to Fix It)
This is the single most-reported pain point across every IMDb scraping forum I checked. Users write: "Some of the code broke because of UI changes" and "Not working in 2024!" — and the response is usually silence or "try Selenium."
The root cause is IMDb's ongoing migration to a React/Next.js frontend. Here's the timeline of major breaking changes:
| Date | What Changed | What Broke |
|---|---|---|
| Nov 2022 | Name Pages redesigned | Old name-page scrapers |
| April 2023 | Title subpages redesigned | Bio, awards, news scrapers |
| June 2023 | Top 250 page redesigned | All td.titleColumn / td.ratingColumn selectors |
| Oct 2023 | Advanced Search redesigned | Search-based scrapers |
| June 2025 | /reference pages redesigned | Cinemagoer library (most parsers) |
That's roughly one major breaking change every 6–12 months. If your scraper relies on CSS class names, you're on a treadmill.
Common Errors and How to Fix Them
Empty results / NoneType errors
The most common error. You'll see AttributeError: 'NoneType' object has no attribute 'text'. This means BeautifulSoup couldn't find the element you're looking for — usually because the CSS class name changed or the content is rendered by JavaScript.
Fix: Switch to JSON-LD extraction (Method 2 above). The data is in the initial HTML response, no JavaScript required.
403 Forbidden
IMDb uses bot-detection systems to identify and block automated traffic. The #1 trigger is a missing or obviously fake User-Agent header. This failure mode is documented across open-source projects, and an IMDb employee has publicly acknowledged the issue.
Fix: Always include a realistic browser User-Agent string and Accept-Language: en-US header. Use requests.Session() for connection pooling.
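That advice, folded into one reusable helper (the function name is mine, not part of any library):

```python
import requests

# One Session reuses the TCP connection and applies the same headers to
# every request — faster and less bot-like than repeated requests.get calls.
def make_imdb_session() -> requests.Session:
    session = requests.Session()
    session.headers.update({
        "User-Agent": (
            "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
            "AppleWebKit/537.36 (KHTML, like Gecko) "
            "Chrome/120.0.0.0 Safari/537.36"
        ),
        "Accept-Language": "en-US,en;q=0.9",
    })
    return session

# usage: response = make_imdb_session().get("https://www.imdb.com/chart/top/")
```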
Only 25 results returned
IMDb search pages and "Most Popular" lists use lazy loading — they only render about 25 results initially and load more via AJAX as you scroll.
Fix: Use URL parameter pagination (covered in the next section) or switch to the Top 250 page, which loads all 250 movies in a single response.
Selectors suddenly stop working
Old selectors that no longer work: td.titleColumn, td.ratingColumn, .lister-item-header, .inline-block.ratings-imdb-rating. If your code uses any of these, it's broken.
Fix: Prefer data-testid attributes (like h1[data-testid="hero-title-block__title"]) over auto-generated class names. Better yet, use JSON-LD.
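Here's what a data-testid lookup looks like against a toy HTML snippet. The attribute value is the one quoted above; for anything else, verify the actual attribute in DevTools before relying on it:

```python
from bs4 import BeautifulSoup

# Toy snippet standing in for a real title page response
html = '<h1 data-testid="hero-title-block__title">The Shawshank Redemption</h1>'
soup = BeautifulSoup(html, "html.parser")

# Attribute selectors target the data hook, not the styling class,
# so a restyle that renames CSS classes leaves this selector intact.
title_tag = soup.select_one('[data-testid="hero-title-block__title"]')
title = title_tag.text.strip() if title_tag else "N/A"
print(title)
```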
A Decision Framework: Short-Term vs. Long-Term Fixes
- Quick fix: Add try/except blocks around every selector, validate HTTP status codes, and log errors instead of crashing
- Medium-term fix: Switch from CSS selectors to JSON-LD extraction (Method 2)
- Long-term fix: Use IMDb's official datasets for large-scale analysis, or a tool like Thunderbit that uses AI to re-read the page structure fresh each time — no selectors to maintain, the AI adapts to layout changes automatically
Beyond the 25-Result Wall: Scraping IMDb Pagination and Large Datasets
Every competitor tutorial I reviewed scrapes exactly one page. Nobody covers pagination. But if you need more than a single list, you'll hit walls fast.
Pages That Don't Need Pagination
Good news: the Top 250 page loads all 250 movies in a single server-rendered response. The JSON-LD and __NEXT_DATA__ both contain the complete dataset. No pagination required.
How IMDb Search Pagination Works
IMDb search pages use a start= URL parameter, incrementing by 50:
```
https://www.imdb.com/search/title/?groups=top_1000&start=1
https://www.imdb.com/search/title/?groups=top_1000&start=51
https://www.imdb.com/search/title/?groups=top_1000&start=101
```
Here's a Python loop that pages through results:
```python
import time
import requests
from bs4 import BeautifulSoup

all_movies = []
for start in range(1, 1001, 50):  # pages through the top 1000
    url = f"https://www.imdb.com/search/title/?groups=top_1000&start={start}"
    response = requests.get(url, headers=headers)  # headers from earlier
    if response.status_code != 200:
        print(f"Failed at start={start}: {response.status_code}")
        break
    soup = BeautifulSoup(response.text, "lxml")
    # Extract movies using your preferred method
    # ...
    print(f"Scraped page starting at {start}")
    time.sleep(3)  # Be respectful — IMDb blocks after ~50 rapid requests
```
That time.sleep(3) matters. Community reports suggest IMDb starts blocking IPs after approximately 50 rapid requests. A random delay between 2–5 seconds is a good practice.
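A small helper for that randomized delay (the name polite_pause is my own):

```python
import random
import time

def polite_pause(low: float = 2.0, high: float = 5.0) -> float:
    """Sleep for a random interval so request timing looks less robotic.

    Returns the delay actually used, which is handy for logging.
    """
    delay = random.uniform(low, high)
    time.sleep(delay)
    return delay
```

Call polite_pause() at the bottom of each loop iteration instead of a fixed time.sleep(3).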
When to Skip Scraping Entirely: IMDb's Official Bulk Datasets
For truly large-scale needs, IMDb provides 7 free TSV files on its official datasets site, refreshed daily:
| File | Contents | Size |
|---|---|---|
| title.basics.tsv.gz | Titles, types, genres, runtime, year | ~800 MB |
| title.ratings.tsv.gz | Average rating, number of votes | ~25 MB |
| title.crew.tsv.gz | Directors, writers | ~300 MB |
| title.principals.tsv.gz | Top-billed cast/crew | ~2 GB |
| title.akas.tsv.gz | Alternative titles by region | ~1.5 GB |
| title.episode.tsv.gz | TV episode info | ~200 MB |
| name.basics.tsv.gz | People: name, birth year, known-for titles | ~700 MB |
Loading them into Pandas is straightforward:
```python
ratings = pd.read_csv("title.ratings.tsv.gz", sep="\t", compression="gzip")
basics = pd.read_csv("title.basics.tsv.gz", sep="\t", compression="gzip", low_memory=False)

# Merge on tconst (IMDb title ID)
merged = basics.merge(ratings, on="tconst")
top_movies = merged[merged["titleType"] == "movie"].nlargest(250, "averageRating")
```
These datasets cover 26+ million titles. No pagination, no selectors, no 403 errors. The license is for personal and non-commercial use only — you can't republish or resell the data.
The No-Code Shortcut: Thunderbit Handles Pagination for You
For readers who need paginated IMDb data but don't want to write pagination logic, Thunderbit supports both click-based pagination and infinite scroll natively. You tell it to scrape, it handles the rest — including scrolling through lazy-loaded content.
Scrape IMDb with Python: The Complete Working Code (Copy-Paste Ready)
Here are two self-contained scripts you can run right now.
Script A: BeautifulSoup Method (CSS Selectors)
```python
import requests
from bs4 import BeautifulSoup
import pandas as pd

url = "https://www.imdb.com/chart/top/"
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
    "Accept-Language": "en-US,en;q=0.9"
}

response = requests.get(url, headers=headers)
if response.status_code != 200:
    print(f"Error: {response.status_code}")
    exit()

soup = BeautifulSoup(response.text, "lxml")
movie_items = soup.select("li.ipc-metadata-list-summary-item")
movies = []
for item in movie_items:
    try:
        title = item.select_one("h3.ipc-title__text")
        year = item.select_one("span.cli-title-metadata-item")
        rating = item.select_one("span.ipc-rating-star--rating")
        movies.append({
            "title": title.text.strip() if title else "N/A",
            "year": year.text.strip() if year else "N/A",
            "rating": rating.text.strip() if rating else "N/A",
        })
    except Exception as e:
        print(f"Skipping movie due to error: {e}")

df = pd.DataFrame(movies)
df.to_csv("imdb_top250_bs4.csv", index=False)
print(f"Saved {len(df)} movies")
print(df.head())
```
Script B: JSON-LD Method (Recommended)
```python
import requests
from bs4 import BeautifulSoup
import json
import pandas as pd

url = "https://www.imdb.com/chart/top/"
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
    "Accept-Language": "en-US,en;q=0.9"
}

response = requests.get(url, headers=headers)
if response.status_code != 200:
    print(f"Error: {response.status_code}")
    exit()

soup = BeautifulSoup(response.text, "lxml")
script_tag = soup.find("script", {"type": "application/ld+json"})
if not script_tag:
    print("No JSON-LD data found")
    exit()

data = json.loads(script_tag.string)
movies = []
for item in data.get("itemListElement", []):
    movie = item.get("item", {})
    rating_info = movie.get("aggregateRating", {})
    directors = movie.get("director", [])
    director_names = ", ".join(
        d.get("name", "") for d in (directors if isinstance(directors, list) else [directors])
    )
    genre = movie.get("genre", [])  # may be a string or a list
    movies.append({
        "rank": item.get("position"),
        "title": movie.get("name"),
        "url": movie.get("url"),
        "rating": rating_info.get("ratingValue"),
        "votes": rating_info.get("ratingCount"),
        "year": movie.get("datePublished", "")[:4],
        "genre": ", ".join(genre) if isinstance(genre, list) else genre,
        "director": director_names,
        "description": movie.get("description"),
    })

df = pd.DataFrame(movies)
df.to_csv("imdb_top250_jsonld.csv", index=False)
print(f"Saved {len(df)} movies")
print(df.head())
```
Both scripts include error handling and produce clean CSV output. Script B gives you richer data — director, description, URL — and is more resilient to layout changes.
How to Scrape IMDb Without Writing Any Code (Using Thunderbit)
Not everyone needs or wants to write Python. Maybe you're an operations analyst who just needs this week's top-rated movies in a spreadsheet. Maybe you're a content strategist who wants to compare genre trends across years. In those cases, writing a scraper is overkill.
Here's how to get the same data using Thunderbit:
Before you start:
- Difficulty: Beginner
- Time Required: ~2 minutes
- What You'll Need: Chrome browser, the Thunderbit extension (free tier works)
Step 1: Navigate to the IMDb page you want to scrape. Open the IMDb Top 250 (or any other IMDb list/search page) in Chrome.
Step 2: Click "AI Suggest Fields" in the Thunderbit sidebar. The AI scans the page and recommends columns — typically Title, Year, Rating, Genre, and a few others depending on the page. You'll see a preview table with the suggested fields.
Step 3: Adjust fields if needed. Remove columns you don't need, or add custom ones by clicking "+ Add Column" and describing what you want in plain English (e.g., "Director name" or "Number of votes").
Step 4: Click "Scrape." Thunderbit extracts the data. For pages with infinite scroll or pagination, it handles the scrolling automatically.
Step 5: Export. Click the export button and choose your format — Excel, Google Sheets, CSV, Airtable, or Notion. The data lands in your destination in seconds.
The key advantage here isn't just convenience — it's that Thunderbit's AI reads the page structure fresh each time. When IMDb changes its layout (and it will), the AI adapts. No selectors to update, no code to fix. For anyone who's been burned by a broken scraper at 2 AM before a deadline, that's worth a lot.
Thunderbit also supports subpage scraping — you can click into each movie's detail page and enrich your table with cast, director, runtime, and other fields that aren't visible on the list page.
Is It Legal to Scrape IMDb? What You Need to Know
Users explicitly ask this in forums: "Is something like this legal?… IMDb does not want people scraping their website." It's a fair question, and no competitor article addresses it.
IMDb's robots.txt: The Top 250 chart (/chart/top/), individual title pages (/title/ttXXXXXXX/), and name pages (/name/nmXXXXXXX/) are NOT blocked by robots.txt. Blocked paths include /find, /_json/*, /search/name-text, /user/ur*/ratings, and various AJAX endpoints. There's no Crawl-delay directive specified.
IMDb's Conditions of Use: The relevant clause states: "You may not use data mining, robots, screen scraping, or similar data gathering and extraction tools on this site, except with our express written consent." An additional clause prohibits resale or commercial use of scraped data.
What this means in practice: Recent 2024 court rulings (Meta v. Bright Data, X Corp v. Bright Data) found that Terms of Service may not bind users who never agreed to them — if you're scraping publicly available data without logging in, the ToS enforceability is debatable. But this is an evolving legal area.
Safe alternatives: IMDb's official datasets are explicitly sanctioned for personal and non-commercial use. TMDb's API is permissive with a free API key. Both are solid options if you want to stay clearly in the clear.
Practical guidance: If you do scrape, use a respectful crawl rate (time.sleep(3) between requests), set proper headers, and don't hit paths blocked by robots.txt. For commercial projects, consult a legal professional or use the official datasets/API.
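You can automate the robots.txt check with Python's standard library. Here the rules are supplied inline for illustration (a subset of the blocked paths listed above); in a real script you'd call rp.set_url("https://www.imdb.com/robots.txt") and rp.read() instead of rp.parse:

```python
from urllib.robotparser import RobotFileParser

# Check candidate URLs against robots.txt rules before fetching them.
rp = RobotFileParser()
rp.parse("""User-agent: *
Disallow: /find
Disallow: /search/name-text
""".splitlines())

print(rp.can_fetch("*", "https://www.imdb.com/chart/top/"))  # True
print(rp.can_fetch("*", "https://www.imdb.com/find"))        # False
```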
We've covered the legal side of web scraping in more depth on the Thunderbit blog.
Conclusion: Pick the Right Way to Scrape IMDb with Python
The short version:
- BeautifulSoup + CSS selectors: Good for learning the fundamentals. Expect it to break every 6–12 months. Always include error handling.
- JSON-LD extraction: The approach I'd recommend for any ongoing Python project. Follows the Schema.org standard, changes far less often than CSS classes, and gives you clean structured data without JavaScript rendering.
- __NEXT_DATA__ JSON: Use this as a supplement for richer data on individual title pages (runtime, full cast, plot, poster images).
- IMDb Official Datasets: The best choice for large-scale analysis. 26M+ titles, updated daily, no scraping required. Personal/non-commercial use only.
- Thunderbit: The best choice for non-coders or anyone who wants data fast without maintaining code. AI adapts to layout changes, handles pagination, exports to Excel/Sheets/Airtable/Notion.
Bookmark this guide — I'll update it when IMDb's structure changes next. And if you want to skip the code entirely, give Thunderbit a try and see how fast you can get from an IMDb page to a clean spreadsheet. If you're working with other sites too, our broader web scraping guides cover the full workflow.
FAQs
Is it legal to scrape IMDb?
IMDb's Terms of Service prohibit scraping without consent, but the enforceability of ToS on publicly accessible data is legally debatable after recent 2024 court rulings. The safest options are IMDb's official datasets (personal/non-commercial use) or the TMDb API (free key). If you do scrape, respect robots.txt, use reasonable delays between requests, and avoid blocked paths. For commercial use, consult a legal professional.
Why does my IMDb scraper return empty results?
Almost always, the cause is outdated CSS selectors — class names like td.titleColumn and td.ratingColumn haven't existed since June 2023. The fix is to switch to JSON-LD extraction (parse the <script type="application/ld+json"> tag) or update your selectors to the current ipc- prefixed classes. Also verify you're including a proper User-Agent header, as a missing header triggers a 403 error that can appear as empty results.
How do I scrape more than 25 results from IMDb?
The Top 250 page loads all 250 movies in a single response — no pagination needed. For search results, use the start= URL parameter (incrementing by 50) to page through results. For example: start=1, start=51, start=101. Add a time.sleep(3) between requests to avoid getting blocked. Alternatively, IMDb's official datasets contain 26M+ titles with no pagination required.
What is __NEXT_DATA__ and why should I use it to scrape IMDb?
__NEXT_DATA__ is a JSON object embedded in a <script id="__NEXT_DATA__"> tag on IMDb's React/Next.js pages. It contains the complete structured data that React uses to render the page — titles, ratings, cast, genres, runtime, and more. Because it represents the underlying data model rather than the visual layout, it's more resilient to UI redesigns than CSS selectors. Use it alongside JSON-LD for the most robust extraction approach.
Can I scrape IMDb without coding?
Yes. Two main options: (1) Download IMDb's official datasets — 7 TSV files covering 26M+ titles, updated daily, free for non-commercial use. (2) Use Thunderbit, which reads the IMDb page, suggests extraction fields automatically, and exports to Excel, Google Sheets, or CSV in two clicks — no code, no selectors to maintain.