Tags: Reddit, API, JSON, Python, Pandas, Data Analysis, Web Scraping

Reddit API and JSON for AI TRAINING

Amit Divekar

The Trick That Took Me Way Too Long to Find

I spent an embarrassing amount of time looking for Reddit's "official" data API before I stumbled onto this. Just append .json to almost any Reddit URL and you get back raw JSON. That's it. No OAuth dance, no API key, no developer account approval. Just slap .json on the end of a subreddit URL and Reddit hands you the same structured data the site itself uses to render pages.

So visiting https://www.reddit.com/r/india/.json gives you something like this:

*(Screenshot: Reddit JSON API response for /r/india)*

You get titles, authors, scores, comment counts, timestamps, URLs - basically everything you'd want for training data or analysis. The structure is consistent across subreddits, which makes it actually pleasant to work with once you know where to dig.
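To make that structure concrete, here's a heavily trimmed sketch of the response shape, with field values invented for illustration (real responses carry many more fields per post):

```python
# Trimmed-down sketch of what a subreddit .json endpoint returns.
# Field values here are made up; only the structure matters.
sample = {
    "kind": "Listing",
    "data": {
        "after": "t3_abc123",   # pagination cursor for the next page
        "children": [
            {
                "kind": "t3",   # t3 = link/post
                "data": {
                    "title": "Example post title",
                    "author": "some_user",
                    "score": 1234,
                    "num_comments": 56,
                    "created_utc": 1700000000.0,
                    "url": "https://example.com/article",
                },
            },
        ],
    },
}

# The posts themselves live two levels down:
posts = [child["data"] for child in sample["data"]["children"]]
print(posts[0]["title"])  # Example post title
```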

Useful GitHub Resources

I've poked around a fair number of repos while building Reddit scrapers, and a few are genuinely worth bookmarking.

Reddit API Documentation Wiki

The reddit-archive/reddit Wiki on GitHub has solid documentation on the JSON schema - specifically useful when you're trying to figure out what's in a comment object vs a post object, since they share some field names but diverge in annoying ways.
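One quick way to tell those objects apart without memorizing schemas is Reddit's kind prefix: t1 objects are comments and t3 objects are link posts (the wiki documents the full list). A minimal sketch, with invented sample objects:

```python
def describe(thing):
    """Return a short label for a Reddit 'thing' based on its kind prefix."""
    kinds = {"t1": "comment", "t3": "post"}
    return kinds.get(thing["kind"], "other")

# Invented minimal examples of the two shapes
post = {"kind": "t3", "data": {"title": "A post", "num_comments": 3}}
comment = {"kind": "t1", "data": {"body": "A comment", "parent_id": "t3_abc"}}

print(describe(post))     # post
print(describe(comment))  # comment

# Posts carry 'title', comments carry 'body' - overlapping but
# diverging field sets, which is what the wiki helps untangle.
```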

Parsers and Scrapers Worth Knowing

alumbreras/reddit_parser handles large JSON dumps from Pushshift archives and writes them into SQLite, which is handy if you're working with historical data at scale. RaheesAhmed/Reddit-Scraper and Mohamedsaleh14/Reddit_Scrapper both take the live API route instead, pulling current data for whatever purpose you need.

Wrappers for Different Languages

If you'd rather not deal with raw JSON at all, there are wrappers for common languages. The renoki-co/reddit-json-api PHP library handles public subreddit data including pagination and sorting. On the Python side, PRAW (Python Reddit API Wrapper) is the de facto standard - it abstracts away the JSON entirely, though I'd still recommend understanding the raw format before leaning on it too heavily.

Exporting Your Own Data

karlicoss/rexport is one I found useful specifically for pulling your own account's saved posts and comments as JSON files. Handy if you want to do something with your own Reddit history.

Getting This Working with Pandas

The first time I tried pd.read_json() directly on a Reddit URL, it blew up immediately. Then I tried requests.get() without any headers and got a wall of 429 errors. Reddit blocks the default Python user-agent pretty aggressively. There are two things you actually need to get this working: a proper User-Agent header, and some code to flatten Reddit's deeply nested JSON before Pandas can make sense of it.

Here's the version that actually works.

Working Pandas Code

```python
import pandas as pd
import requests

# 1. Define the target Reddit JSON URL
url = "https://www.reddit.com/r/python.json"

# 2. Add a User-Agent header (Reddit blocks default Python/Pandas agents)
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)'}

# 3. Fetch the data using requests
response = requests.get(url, headers=headers)
data = response.json()

# 4. Extract the list of posts (located under 'data' -> 'children')
# We use json_normalize to flatten the nested 'data' field of each post
posts = [child['data'] for child in data['data']['children']]
df = pd.json_normalize(posts)

# 5. Select only the columns you actually care about
# (.copy() avoids Pandas' SettingWithCopyWarning on the next line)
df_clean = df[['title', 'author', 'score', 'num_comments', 'url', 'created_utc']].copy()

# Convert the timestamp to a readable date
df_clean['created_utc'] = pd.to_datetime(df_clean['created_utc'], unit='s')

print(df_clean.head())
```

What's Actually Happening Here

The User-Agent header is non-negotiable. Without it you'll get a 429 Too Many Requests immediately - Reddit has been blocking the default Python/requests agent for years and they're not going to stop.

The nested structure is the other hurdle. Reddit's JSON buries everything inside data.children, and each child has its own data key with the actual post fields. If you try pd.read_json() directly or call pd.DataFrame(data) without extracting the right level, you'll get a single-column mess or a KeyError. json_normalize is the right tool here - it flattens nested dicts into dot-separated columns - and pulling out each child's data dict first means you end up with clean column names like title instead of data.title.
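You can see the difference the extraction step makes with a tiny invented payload:

```python
import pandas as pd

# Minimal stand-in for one Reddit listing child
children = [{'kind': 't3', 'data': {'title': 'Hello', 'score': 42}}]

# Normalizing the children directly keeps the 'data.' prefix
df_raw = pd.json_normalize(children)
print(sorted(df_raw.columns))    # ['data.score', 'data.title', 'kind']

# Extracting each child's 'data' dict first gives clean column names
df_posts = pd.json_normalize([c['data'] for c in children])
print(sorted(df_posts.columns))  # ['score', 'title']
```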

The posts themselves live at data['data']['children'], and each one's actual content is under the data key inside that - which is exactly what the list comprehension in step 4 is pulling out.

A Few Tips That Save Time

- To get top-ranked posts instead of new ones, change the URL to https://www.reddit.com/r/subreddit/top.json.
- By default you only get 25 posts per request - appending ?limit=100 bumps that to the maximum Reddit allows in a single call.
- For comments on a specific thread, use the post's comment URL with .json appended (e.g., .../comments/id.json). The response comes back as a list of two objects: the first is the post itself, and the second is the comment tree. I got tripped up by that the first time and kept trying to parse it like a single object.
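That two-object shape trips people up, so here's a sketch of unpacking it, using a trimmed, invented stand-in for the response rather than a live call:

```python
# Stand-in for what .../comments/<id>.json returns:
# a list of two Listings - the post, then the comment tree.
response = [
    {"kind": "Listing", "data": {"children": [
        {"kind": "t3", "data": {"title": "The post", "num_comments": 2}},
    ]}},
    {"kind": "Listing", "data": {"children": [
        {"kind": "t1", "data": {"body": "First comment"}},
        {"kind": "t1", "data": {"body": "Second comment"}},
    ]}},
]

post_listing, comment_listing = response
post = post_listing["data"]["children"][0]["data"]
comments = [c["data"]["body"] for c in comment_listing["data"]["children"]]

print(post["title"])  # The post
print(comments)       # ['First comment', 'Second comment']
```

Real comment trees nest further (each comment carries a replies field that is itself a Listing), so for deep threads you'd want a recursive walk, but the top-level unpacking always looks like this.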

Wrapping Up

You don't need OAuth, you don't need a developer account, and you don't need any special libraries to start pulling Reddit data. The .json trick works on subreddits, user profiles, search results, comment threads - basically anywhere you have a Reddit URL. For AI training datasets or exploratory analysis, it's genuinely one of the fastest data collection pipelines I've put together.


Connect With Me

I'm @amitdevx on GitHub if you want to see what I'm building, and you can find me on LinkedIn too. Always happy to talk Python or data stuff.