Google pays $60 million a year to license Reddit data. OpenAI's deal is reportedly $70 million. That should tell you something about what's buried inside those comment threads. If you've ever tried to manually collect discussion threads, comments, or sentiment data from Reddit, you already know the frustration: endless scrolling, copy-pasting, and tab overload.
I spent a good chunk of last quarter helping our team at Thunderbit research how people actually extract Reddit data in 2025. The landscape has shifted dramatically since Reddit's 2023 API pricing overhaul, and most guides online are either outdated or only cover one method. So I pulled together everything that actually works right now — four distinct approaches, from full Python scripting to zero-code extraction — so you can pick the one that fits your skill level and use case. Whether you're building an NLP dataset, monitoring a subreddit for brand mentions, or just want a spreadsheet of trending posts, this guide has you covered.
What Is Reddit Scraping (and Why Does It Matter)?
Reddit scraping is the process of programmatically extracting posts, comments, user data, and metadata from Reddit's pages or API. Instead of manually browsing threads and copying text, you use a script or tool to collect structured data at scale.
Why bother? Reddit hosts an enormous number of active communities and generates a staggering volume of posts and comments every day. It's where people share unfiltered opinions about products, services, competitors, and trends — the kind of authentic signal that's nearly impossible to find on polished review sites or corporate blogs. Google pays roughly $60 million a year for a Reddit content license, and OpenAI's deal is reportedly worth $70 million. If the biggest AI companies on Earth are paying tens of millions a year for this data, it's worth learning how to access it yourself.
Why Scrape Reddit with Python in 2025?
Python is the default language for Reddit scraping — PRAW, requests, BeautifulSoup, and pandas cover every step from API calls to data export. But the "why" goes beyond tooling.
Here are the most common use cases I see across business and research teams:
| Use Case | Who Benefits | Example |
|---|---|---|
| Market research & validation | Product managers, founders | Mining r/SaaS or r/Entrepreneur for repeated pain points |
| Sentiment analysis | Marketing, brand teams | Tracking how people talk about your product vs. competitors |
| Lead generation | Sales teams | Finding "looking for a tool that does X" posts in niche subreddits |
| Content ideation | Content marketers | Spotting trending questions and topics in r/marketing or r/SEO |
| Academic / NLP research | Researchers, data scientists | Building labeled datasets from comment threads for emotion classification |
| Competitive intelligence | Strategy, ops | Monitoring competitor subreddits for recurring complaints |
Reddit's user base keeps growing, with daily active users up 24% year-over-year. And after Google's August 2024 core update, Reddit content became dramatically more visible in organic search results.
Translation: the data you scrape from Reddit is increasingly the same data Google surfaces to searchers.
Which Method Should You Use to Scrape Reddit? (Quick Comparison)
The most common question in Reddit scraping forums is literally "Which method should I use?" So I built this table. Pick your row and go.
| Criteria | PRAW | .json Endpoint | BeautifulSoup (HTML) | No-Code (Thunderbit) |
|---|---|---|---|---|
| Setup complexity | Medium (API app + pip install) | None (just a URL) | Medium (pip + DOM inspection) | Very low (Chrome extension) |
| API key required? | Yes | No | No | No |
| Comment scraping | Deep (nested trees) | Limited (top-level) | Manual parsing | AI-structured |
| Pagination | Built-in | Manual (after param) | Manual | Auto |
| Rate limiting | 100 req/min (managed by PRAW) | ~10 req/min (unauthenticated) | Risk of IP blocks | Handled by tool |
| Best for | Full-featured projects, research | Quick one-off data grabs | Learning/customization | Non-coders, quick exports |
| Export options | CSV, JSON (manual code) | JSON (raw) | Custom (manual code) | Excel, Google Sheets, Airtable, Notion |
If you want full-featured Python projects with deep comment extraction, start with Method 1 (PRAW). Need a quick data grab in the next 10 minutes with no setup? Try Method 2 (the .json trick). Want to learn HTML scraping or need custom fields? Go with Method 3 (BeautifulSoup). And if you'd rather skip Python entirely and just get the data, jump to Method 4 (Thunderbit).
What Changed: Reddit's 2023–2024 API Pricing Update (and What Still Works for Free)
Almost no scraping guide talks about this — and it's the single most important context for anyone scraping Reddit today.
In June 2023, Reddit introduced paid tiers for API access for the first time since 2008. The fallout was massive:
- Pushshift died for public use. Reddit revoked Pushshift's API access in May 2023. Researchers who relied on it — a large body of published papers cited Pushshift — lost their primary data source overnight. Successor projects cover historical data dumps, but there's no public live API replacement.
- Third-party apps shut down. Apollo, Reddit is Fun, Sync, BaconReader, and others all closed by June 30, 2023, after Reddit quoted Apollo's developer an estimated $20 million a year in API fees.
- Over 8,500 subreddits went dark in protest, including r/funny (40M subscribers), r/gaming, and r/science.
What's still free in 2025:
Reddit's free Data API tier remains available for non-commercial, personal, and academic use — 100 queries per minute per OAuth client ID. PRAW works perfectly under this tier for moderate scraping. Unauthenticated access (including the .json endpoint) is capped at roughly 10 requests per minute.
The practical takeaway: For small-to-medium scraping tasks, the free tier is more than sufficient. For large-scale or commercial use, you'll need to either contact Reddit for enterprise access, use the .json endpoint or BeautifulSoup (which don't require API keys), or use a tool like Thunderbit that doesn't depend on Reddit's API at all.
Before You Start
- Difficulty: Beginner to Intermediate (varies by method)
- Time Required: ~15–30 minutes for Methods 1–3; ~5 minutes for Method 4
- What You'll Need:
- Python 3.8+ installed (for Methods 1–3)
- A Reddit account (for Method 1)
- Chrome browser (for Method 4)
- The Thunderbit Chrome extension (for Method 4)
Method 1: How to Scrape Reddit with Python Using PRAW (Step-by-Step)
PRAW (Python Reddit API Wrapper) is the most popular and best-documented way to scrape Reddit with Python. It handles authentication, rate limiting, and pagination for you, and it's actively maintained — the latest stable release is PRAW 7.8.1 (October 2024), supporting Python 3.8 through 3.13.
Step 1: Create a Reddit App and Get Your API Credentials
Go to reddit.com/prefs/apps and scroll to the bottom. Click "are you a developer? create an app..."
Fill in the form:
- Name: anything descriptive (e.g., "my-reddit-scraper")
- App type: select script
- Redirect URI: enter http://localhost:8080 (required but unused for script apps)
- Description: optional
Click Create app. You'll see your credentials:
- client_id — the 14-character string directly under the app name (labeled "personal use script")
- client_secret — the field labeled "secret"
You'll also need to accept Reddit's developer terms and data API terms before app creation completes.
One heads-up: since late 2024, new developers may need to submit an access request and wait for approval. This is the biggest friction point for first-time PRAW users, and there's no way around it.
Step 2: Install PRAW and Create a Reddit Instance
Open your terminal and run:
```bash
pip install praw pandas
```
Then create a read-only Reddit instance:
```python
import praw

reddit = praw.Reddit(
    client_id="YOUR_CLIENT_ID",
    client_secret="YOUR_CLIENT_SECRET",
    user_agent="python:reddit-scraper:v1.0 (by u/yourname)",
)
# reddit.read_only is True by default for script apps without a password
```
The user_agent format matters. Reddit actively throttles generic strings like python-requests/2.x. Use Reddit's recommended format: platform:app_id:version (by u/username).
Step 3: Scrape Posts from a Subreddit
Here's how to fetch the top posts from r/python for the past month and store them in a pandas DataFrame:
```python
import pandas as pd

subreddit = reddit.subreddit("python")
rows = []
for post in subreddit.top(time_filter="month", limit=500):
    rows.append({
        "id": post.id,
        "title": post.title,
        "selftext": post.selftext,
        "score": post.score,
        "upvote_ratio": post.upvote_ratio,
        "num_comments": post.num_comments,
        "author": str(post.author) if post.author else "[deleted]",
        "created_utc": post.created_utc,
        "url": post.url,
        "permalink": f"https://reddit.com{post.permalink}",
    })

df = pd.DataFrame(rows)
print(df.head())
```
You can swap .top() for .hot(), .new(), or .controversial(), and time_filter accepts "all", "day", "hour", "month", "week", or "year".
Fair warning: Reddit caps any listing at roughly 1,000 items, no matter how high you set limit. That's a Reddit-side ceiling, not a PRAW limitation.
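A common workaround when you need more than the cap allows is to pull several sortings (hot, new, top) and deduplicate on post id, since they overlap but aren't identical. The merge_listings helper below is a minimal sketch of that idea (my own helper name, not a PRAW API), shown with plain dicts standing in for PRAW submissions; with real PRAW objects you'd key on post.id instead of post["id"]:

```python
def merge_listings(*listings):
    """Merge several post listings (e.g. from .hot(), .new(), .top()),
    keeping the first occurrence of each post id."""
    seen = {}
    for listing in listings:
        for post in listing:
            seen.setdefault(post["id"], post)
    return list(seen.values())

# Plain dicts standing in for PRAW submission objects:
hot = [{"id": "a1", "title": "First"}, {"id": "b2", "title": "Second"}]
new = [{"id": "b2", "title": "Second"}, {"id": "c3", "title": "Third"}]
merged = merge_listings(hot, new)
print(len(merged))  # 3 unique posts
```

Even with this trick you can't recover arbitrary history; for true archives you're in historical-dump territory, not the live API.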
Step 4: Export Reddit Data to CSV or Excel
```python
df.to_csv("reddit_python_top.csv", index=False)
df.to_json("reddit_python_top.json", orient="records", lines=True)
```
PRAW handles rate limiting automatically — it reads the X-Ratelimit-Remaining and X-Ratelimit-Reset headers on every response and sleeps between calls as needed. For moderate scraping, you rarely need to add manual delays.
How to Scrape Reddit Comments with Python (Deep Nested Threads)
Scraping comments is where most people hit a wall.
Reddit stores comments as a tree: each comment can have child comments, and some branches are collapsed behind "load more comments" links. In PRAW's world, these hidden branches are represented as MoreComments objects.
Here's the mental model:
```
Submission (t3_abc123)
├── Comment A (top-level)
│   ├── Reply A1
│   │   └── Reply A1a
│   └── Reply A2
├── Comment B (top-level)
│   └── MoreComments (hidden — "load more comments")
└── MoreComments (hidden — "continue this thread")
```
Using replace_more() to Fetch All Hidden Comments
The replace_more() method walks the comment tree and replaces each MoreComments placeholder with the actual comments it points to:
```python
submission = reddit.submission(id="abcdef")
submission.comments.replace_more(limit=10)  # practical cap for large threads
all_comments = submission.comments.list()   # flattened breadth-first
```
Setting limit=None replaces every single MoreComments node — but on a thread with 5,000+ comments, this can take several minutes because each replacement is one API request returning at most ~100 comments. For large threads, I recommend starting with limit=10 or limit=20 and increasing only if you need completeness.
Flattening Nested Comments into a Table
```python
rows = []
for c in all_comments:
    rows.append({
        "comment_id": c.id,
        "parent_id": c.parent_id,  # t1_xxx = parent comment, t3_xxx = submission
        "depth": c.depth,
        "author": str(c.author) if c.author else "[deleted]",
        "body": c.body,
        "score": c.score,
        "created_utc": c.created_utc,
        "is_submitter": c.is_submitter,
    })

comments_df = pd.DataFrame(rows)
```
Top-level comments have parent_id starting with t3_ (the submission's fullname). The depth column tells you how deeply nested each comment is — useful for filtering or visualization. One gotcha: len(all_comments) usually won't match submission.num_comments because deleted, removed, and spam-filtered comments are excluded from the tree.
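Once the tree is flattened, the parent_id prefix alone is enough to separate top-level comments from replies. Here's a minimal sketch using plain dicts in place of PRAW comment objects (split_by_level is a hypothetical helper of mine, not part of PRAW):

```python
def split_by_level(rows):
    """Partition flattened comment rows into top-level comments and replies.
    Top-level comments point at the submission (parent_id starts with 't3_');
    replies point at another comment ('t1_')."""
    top = [r for r in rows if r["parent_id"].startswith("t3_")]
    replies = [r for r in rows if r["parent_id"].startswith("t1_")]
    return top, replies

rows = [
    {"comment_id": "aaa", "parent_id": "t3_abc123", "depth": 0},
    {"comment_id": "bbb", "parent_id": "t1_aaa", "depth": 1},
    {"comment_id": "ccc", "parent_id": "t3_abc123", "depth": 0},
]
top, replies = split_by_level(rows)
print(len(top), len(replies))  # 2 1
```

The same filter works directly on the comments_df DataFrame with str.startswith if you'd rather stay in pandas.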
Method 2: The .json Endpoint Trick — Scrape Reddit Without an API Key
Append .json to any Reddit URL. That's it. You get structured JSON back — no authentication, no app registration, no pip install.
Example: https://www.reddit.com/r/python/hot.json
Forum users mention this trick constantly, yet almost no tutorial covers it.
A Working Python Code Snippet
```python
import requests

headers = {"User-Agent": "python:reddit-scraper:v1.0 (by /u/yourname)"}
r = requests.get(
    "https://www.reddit.com/r/python/hot.json",
    headers=headers,
    params={"limit": 100},
)
data = r.json()
for post in data["data"]["children"]:
    p = post["data"]
    print(p["title"], p["score"], p["num_comments"], p["author"])
```
The User-Agent header is critical. Reddit blocks or throttles generic user agents like python-requests/2.31.0 — as community reports confirm, "this rate limiting is based on user-agent." Use the same descriptive format as PRAW.
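If you do hit a 429 despite a descriptive User-Agent, exponential backoff is the standard fix. The sketch below is deliberately transport-agnostic so it can be demonstrated without a network call; backoff_delays and get_with_backoff are my own helper names, and in real use fetch would be something like lambda: requests.get(url, headers=headers):

```python
import time

def backoff_delays(max_tries, base=2.0):
    """Exponential backoff schedule: 2.0, 4.0, 8.0, ... seconds."""
    return [base ** n for n in range(1, max_tries + 1)]

def get_with_backoff(fetch, max_tries=4, sleep=time.sleep):
    """Call fetch() repeatedly, backing off whenever it returns HTTP 429."""
    response = None
    for delay in backoff_delays(max_tries):
        response = fetch()
        if response.status_code != 429:
            return response
        sleep(delay)  # rate-limited: wait, then try again
    return response  # still 429 after all retries; caller decides what to do

# Demo with a stub that returns 429 twice, then 200:
class Stub:
    def __init__(self, status_code):
        self.status_code = status_code

responses = [Stub(429), Stub(429), Stub(200)]
result = get_with_backoff(lambda: responses.pop(0), sleep=lambda s: None)
print(result.status_code)  # 200
```

Injecting sleep as a parameter also makes the backoff behavior trivially testable without real waiting.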
How to Handle Pagination with the after Parameter
The .json endpoint returns ~25 results by default (max 100 per request). To get more, use the after cursor from the response:
```python
import requests, time

headers = {"User-Agent": "python:reddit-scraper:v1.0 (by /u/yourname)"}
after = None
all_posts = []
for _ in range(10):  # up to ~1000 posts
    r = requests.get(
        "https://www.reddit.com/r/python/hot.json",
        headers=headers,
        params={"limit": 100, "after": after},
    )
    data = r.json()
    all_posts.extend(data["data"]["children"])
    after = data["data"].get("after")
    if not after:
        break
    time.sleep(6)  # ~10 QPM = one request every 6 seconds
```
The after value is a cursor token (format: t3_xxxxxx). Like PRAW, the hard ceiling is ~1,000 items total across paginated requests.
Limitations of the .json Method
- No deep comment tree access — you get top-level comments plus one level of "more" stubs, but no auto-expansion like PRAW's replace_more()
- Read-only — no voting, posting, or moderation
- ~10 requests per minute for unauthenticated traffic — aggressive loops trigger 429 errors
- Same 1,000-item listing cap as the authenticated API
This method is best for quick one-off grabs, prototyping, or situations where you don't want to register an API app.
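To make the comment limitation concrete: a thread's JSON endpoint (https://www.reddit.com/comments/<post_id>.json) returns a two-element array, the post listing first and the comment listing second, where each child is either a real comment (kind "t1") or an unexpanded "more" stub. The sketch below parses that shape; the payload is a hand-built miniature fixture mimicking Reddit's response structure, and parse_comment_listing is a hypothetical helper:

```python
def parse_comment_listing(payload):
    """Split the second element of a /comments/<id>.json response into
    real comments (kind 't1') and unexpanded 'more' stub ids."""
    children = payload[1]["data"]["children"]
    comments = [c["data"] for c in children if c["kind"] == "t1"]
    more_ids = [mid for c in children if c["kind"] == "more"
                for mid in c["data"]["children"]]
    return comments, more_ids

# Miniature fixture mimicking Reddit's two-listing response shape:
payload = [
    {"data": {"children": [{"kind": "t3", "data": {"title": "Post"}}]}},
    {"data": {"children": [
        {"kind": "t1", "data": {"id": "aaa", "body": "Top comment"}},
        {"kind": "more", "data": {"children": ["bbb", "ccc"]}},
    ]}},
]
comments, more_ids = parse_comment_listing(payload)
print(len(comments), more_ids)  # 1 ['bbb', 'ccc']
```

Those stub ids are exactly what PRAW's replace_more() resolves for you with extra API calls; with the raw .json method, they stay unexpanded.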
Method 3: How to Scrape Reddit with BeautifulSoup (HTML Parsing)
If you've done any web scraping before, you probably know BeautifulSoup. The key insight for Reddit specifically: use old.reddit.com instead of the new React-based frontend. The old interface is server-rendered, lighter, and far easier to parse — and as of this writing, it's still online and scraper-friendly.
Setting Up Requests and BeautifulSoup
```bash
pip install requests beautifulsoup4
```
```python
import requests
from bs4 import BeautifulSoup

headers = {"User-Agent": "python:reddit-scraper:v1.0 (by /u/yourname)"}
r = requests.get("https://old.reddit.com/r/python/", headers=headers)
soup = BeautifulSoup(r.text, "html.parser")
```
Extracting Post Data from the DOM
On old.reddit.com, each post lives inside a <div> with class thing. The most stable selectors are the data-* attributes:
```python
for thing in soup.select("div#siteTable > div.thing"):
    title_el = thing.select_one("a.title")
    print({
        "title": title_el.get_text(strip=True) if title_el else None,
        "author": thing.get("data-author"),
        "score": thing.get("data-score"),
        "comments": thing.get("data-comments-count"),
        "domain": thing.get("data-domain"),
        "url": title_el.get("href") if title_el else None,
    })
```
Prefer the data-* attributes over nested class selectors — Reddit has tweaked class names over the years, but the data attributes are template-driven and rarely change.
Handling Pagination on old.reddit.com
```python
import time

url = "https://old.reddit.com/r/python/"
all_rows = []
while url:
    r = requests.get(url, headers=headers)
    soup = BeautifulSoup(r.text, "html.parser")
    for thing in soup.select("div#siteTable > div.thing"):
        title_el = thing.select_one("a.title")
        all_rows.append({
            "title": title_el.get_text(strip=True) if title_el else None,
            "author": thing.get("data-author"),
            "score": thing.get("data-score"),
            "comments": thing.get("data-comments-count"),
            "url": title_el.get("href") if title_el else None,
        })
    nxt = soup.select_one("span.next-button a")
    url = nxt["href"] if nxt else None
    time.sleep(2)  # politeness delay
```
When to Use BeautifulSoup vs. PRAW
BeautifulSoup is a good fit when you want to learn DOM scraping, don't want to register an OAuth app, or need custom fields PRAW doesn't expose. But it's more fragile — HTML structure can change without warning, IP blocking is more aggressive in 2025 than it used to be, and you have to write all the pagination and error-handling code yourself. For reliability and depth, PRAW wins.
Method 4: How to Scrape Reddit Without Code Using Thunderbit
A confession: a lot of people searching "how to scrape Reddit with Python" don't actually want to write Python. They want the data. If that's you, this section is your escape hatch.
Thunderbit is an AI-powered Chrome extension our team built specifically for this kind of use case — extracting structured data from web pages without writing code.
Step 1: Install Thunderbit and Open a Reddit Page
Install the Thunderbit Chrome extension, then navigate to any Reddit subreddit or post page (e.g., reddit.com/r/python).
No API key, no Python environment, no terminal commands.
Step 2: Click "AI Suggest Fields" and Let AI Read the Page
Click the Thunderbit icon in your browser toolbar, then hit "AI Suggest Fields." Thunderbit's AI scans the page and automatically suggests columns like Post Title, User Name, Upvotes, Comments Count, Date Posted, Post Description, Community Name, and Post URL.
You can add, remove, or rename columns as needed. For example, if you only care about post titles and scores, just delete the other fields.
Step 3: Click "Scrape" and Export Your Data
Hit "Scrape" and Thunderbit extracts the data, handling pagination automatically. Once the table is populated, export directly to Excel, Google Sheets, Airtable, or Notion — no CSV code needed.
For deeper data, Thunderbit's subpage scraping lets you click into individual threads and enrich your table with comment data automatically. This is conceptually similar to PRAW's replace_more() — but without writing a single line of code.
Bonus: Scheduled Scraping for Ongoing Reddit Monitoring
If you need to track a subreddit daily — say, monitoring brand mentions in r/SaaS or competitor discussions in a niche community — Thunderbit's scheduled scraper handles repeat runs. You describe the interval in plain English (e.g., "every weekday at 9am") and the tool does the rest, delivering fresh data to your connected spreadsheet or database.
You can learn more about Thunderbit's Reddit scraping capabilities on the Thunderbit blog.
Tips and Best Practices for Scraping Reddit with Python
I've learned most of these the hard way — they apply regardless of which method you chose above.
Respect Reddit's Terms of Service and Rate Limits
Reddit's Terms of Service explicitly prohibit commercial scraping without written approval — and that applies to all access methods, not just the API. For personal, academic, and internal research use, the free OAuth tier and Thunderbit's workflows are within reasonable-use bounds.
Rate limit cheat sheet:
| Scenario | Limit | What Happens |
|---|---|---|
| Authenticated (OAuth) | 60–100 req/min | PRAW manages this automatically |
| Unauthenticated (.json, HTML) | ~10–30 req/min | 429 Too Many Requests |
| Generic User-Agent | Heavily throttled | 403 Forbidden or silent blocks |
Always set a descriptive User-Agent string. This is the single most common reason first-time scrapers hit 429 or 403 errors.
Store and Structure Your Data Cleanly
- Use pandas DataFrames with explicit column order for predictable CSV/Excel exports
- Convert created_utc to human-readable timestamps: pd.to_datetime(df["created_utc"], unit="s")
- Deduplicate on id when scraping across multiple sortings (hot, new, and top often overlap)
- Handle deleted authors: str(post.author) if post.author else "[deleted]"
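Put together, those cleanup steps form a short pandas pipeline. The sketch below runs on a tiny in-memory sample (the rows are made up for illustration):

```python
import pandas as pd

raw = pd.DataFrame([
    {"id": "a1", "title": "Post A", "author": "alice", "created_utc": 1700000000},
    {"id": "a1", "title": "Post A", "author": "alice", "created_utc": 1700000000},  # duplicate from overlapping sorts
    {"id": "b2", "title": "Post B", "author": None, "created_utc": 1700003600},
])

df = (
    raw.drop_duplicates(subset="id")  # dedupe on post id
       .assign(
           created=lambda d: pd.to_datetime(d["created_utc"], unit="s"),
           author=lambda d: d["author"].fillna("[deleted]"),  # normalize deleted accounts
       )
       [["id", "title", "author", "created"]]  # explicit column order
)
print(df.to_string(index=False))
```

The same chain drops straight onto the DataFrames built in Methods 1–3 before you call to_csv or to_excel.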
Handle Common Errors Gracefully
| Error | Cause | Fix |
|---|---|---|
| 429 Too Many Requests | Exceeding rate limit (60-100 req/min for OAuth) | Implement exponential backoff; check X-Ratelimit-Reset header |
| 403 Forbidden | Bad User-Agent or blocked IP | Use a unique, descriptive UA string; ensure OAuth app is active |
| None author | Deleted or suspended account | Wrap with if post.author else "[deleted]" |
| prawcore.TooManyRequests | PRAW-level rate limit buffer triggered | Increase ratelimit_seconds or spread requests evenly |
| 5xx or 413 on large trees | Reddit backend overload on deep threads | Wrap replace_more() in retry logic; limit recursion depth |
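For the retry advice in the last row of the table, a small generic wrapper is usually enough. with_retries below is my own helper, not part of PRAW; in real use you'd pass something like lambda: submission.comments.replace_more(limit=10):

```python
import time

def with_retries(fn, max_tries=3, base_delay=1.0, retry_on=(Exception,)):
    """Call fn(), retrying with exponential backoff on the given exceptions.
    Intended for flaky calls like submission.comments.replace_more()."""
    for attempt in range(max_tries):
        try:
            return fn()
        except retry_on:
            if attempt == max_tries - 1:
                raise  # out of retries: surface the original error
            time.sleep(base_delay * (2 ** attempt))

# Demo with a function that fails twice, then succeeds:
calls = {"n": 0}
def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("simulated 5xx")
    return "ok"

print(with_retries(flaky, max_tries=4, base_delay=0.01))  # ok
```

Narrow retry_on to the exceptions you actually expect (e.g. prawcore's server-error types) so genuine bugs still fail fast.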
Reddit Scraping Use Cases: What Can You Do with the Data?
The scraping is step one. Here's what actually moves the needle:
- Sales teams: Monitor subreddits like r/SaaS, r/smallbusiness, or r/Entrepreneur for "looking for a tool that does X" posts. Feed matches into lead lists or CRM workflows. Use Thunderbit's scheduled scraper for daily monitoring.
- Marketing and content teams: Track brand mentions, analyze sentiment trends, and mine trending questions for content ideas. Combine Reddit exports with Google Sheets for team collaboration.
- Ecommerce and operations: Monitor competitor product discussions for recurring complaints. Subreddits like r/BuyItForLife and vertical-specific communities are goldmines for product feedback.
- Researchers and analysts: Build NLP datasets — academic papers in 2024 built Reddit-derived datasets for sentiment and emotion classification. A PRAW-collected corpus is reproducible and citable in peer review.
If you want to go deeper on related extraction workflows, we've covered them in detail on the Thunderbit blog.
Wrapping Up
Reddit scraping in 2025 looks nothing like it did two years ago. The 2023 API changes killed Pushshift, shut down beloved third-party apps, and introduced paid tiers.
But the free tier is alive and well for personal and academic use, and there are more ways to get the data than ever.
Here's the one-line summary for each method:

- Method 1 (PRAW): the full-featured choice for deep comment trees and research-grade projects
- Method 2 (.json endpoint): zero-setup grabs when you need data in the next ten minutes
- Method 3 (BeautifulSoup): hands-on HTML parsing for learning and custom fields
- Method 4 (Thunderbit): no-code extraction straight to Excel, Google Sheets, Airtable, or Notion
Python veteran or spreadsheet-by-lunchtime person — one of these four methods will get you there. If you'd rather skip the code entirely, you can install Thunderbit and see how it handles Reddit in a couple of clicks. And if you want to keep sharpening your Python scraping skills, bookmark this guide — I'll keep it updated as Reddit's landscape continues to evolve.
For more on web scraping approaches, check out the other scraping guides on the Thunderbit blog.
FAQs
Is it legal to scrape Reddit with Python?
Reddit's Terms of Service prohibit commercial scraping without written approval. The free OAuth tier is available for personal, non-commercial, and academic use. The restriction is method-agnostic — it applies whether you use the API, the .json endpoint, or HTML scraping. Always check Reddit's current terms before scraping at scale.
Does PRAW still work after Reddit's 2023 API changes?
Yes. PRAW 7.8.1 (October 2024) is actively maintained and stays within the free tier's rate limits automatically. The 2023 pricing changes mainly affected high-volume and commercial API usage, not typical PRAW scraping patterns.
Can I scrape Reddit without an API key?
Yes — the .json endpoint and BeautifulSoup HTML parsing both work without API keys. Thunderbit also requires no API key. All three methods are still bound by Reddit's Terms of Service for commercial use.
How do I scrape Reddit comments, not just posts?
With PRAW, use submission.comments.replace_more(limit=10) followed by submission.comments.list() to flatten the nested comment tree into a list. With Thunderbit, use subpage scraping to automatically enrich a post-listing scrape with comment data from each thread.
What's the fastest way to scrape Reddit without coding?
The Thunderbit Chrome extension lets you scrape Reddit posts and comments in two clicks and export directly to Excel, Google Sheets, Airtable, or Notion — no Python, no API key, no setup required.