Reddit Front Page (2018)

Over the past few months I've noticed I get a lot less enjoyment out of browsing Reddit. There wasn't any clear reason, just a general feeling that every hour I spent there was an hour wasted. It didn't use to be that way, I think – while there were always some forums there filled with noise, it also contained fresh analyses and insightful commentary and regularly surfaced them to the front page (/r/all).

Filtering With RES

My first attempt to fix things was installing Reddit Enhancement Suite, a Chrome extension that implements (among other things) the ability to hide subreddits from view. When I noticed that particular noisy subreddits were taking up too much of the page, I added them to my RES blacklist. Unfortunately, RES is client-side and can't easily request more links on heavily filtered pages. I noticed that the front page would sometimes be empty, because every link on it came from a subreddit that I didn't want to see.

Next I tried using Reddit's built-in subreddit filtering, a feature added in November 2016 to handle just this use case. After quickly hitting their limit of 99 hidden subreddits, I moved things around between their filter and RES to optimize the number of filtered links per page view. Reddit would block the most popular of the noisy subreddits, and RES could take the long tail.
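As a rough illustration of that split, here is a minimal Python 3 sketch: rank the unwanted subreddits by how often they show up, hand the first 99 to Reddit's filter, and leave the rest to RES. The function name `split_blacklist` and the sample counts are made up for this example; only the limit of 99 comes from the text above.

```python
from collections import Counter

SERVER_FILTER_LIMIT = 99  # Reddit's cap on hidden subreddits, per the text above


def split_blacklist(noisy_counts, limit=SERVER_FILTER_LIMIT):
    """Split unwanted subreddits into (server_side, client_side) lists.

    noisy_counts maps subreddit name -> number of front-page posts seen.
    The most frequent offenders go to the server-side filter; the long
    tail is left for a client-side filter such as RES.
    """
    # Sort by descending count, breaking ties alphabetically.
    ranked = [
        name
        for name, _ in sorted(noisy_counts.items(), key=lambda kv: (-kv[1], kv[0]))
    ]
    return ranked[:limit], ranked[limit:]


# Hypothetical counts, for illustration only.
counts = Counter({"funny": 5, "memes": 4, "dankmemes": 4, "osugame": 1})
server_side, client_side = split_blacklist(counts, limit=2)
print(server_side)  # ['funny', 'dankmemes']
print(client_side)  # ['memes', 'osugame']
```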

Collecting More Data

Filtering didn't seem to be very effective, and I was still regularly seeing pages containing only noise, or nothing at all. Was the problem lack of data? I wrote a quick-n-dirty[1] scraper that would hit Reddit's API, saving the current top 2000 posts to JSON files for analysis:

import datetime
import json
import os.path
import time
import urllib
import urllib2

timestamp = datetime.datetime.utcnow().strftime("%Y-%m-%dT%H:%M:%SZ")
os.mkdir(timestamp)

after = None
for request_num in range(20):
    out_filename = os.path.join(timestamp, "%02d.json" % (request_num + 1,))
    print "[%s]" % (out_filename)
    params = {"limit": "100"}
    if after is not None:
        params["after"] = after
    req = urllib2.Request(
        url = "https://reddit.com/r/all/.json?" + urllib.urlencode(params),
        headers = {
            # https://github.com/reddit-archive/reddit/wiki/API#rules
            "user-agent": "darwin:com.john-millikin.redditpopularity:v1 (by /u/jmillikin)",
        },
    )
    response_fp = urllib2.urlopen(req)
    response = json.load(response_fp)
    with open(out_filename, "wb") as fp:
        json.dump(response, fp, indent=2)
    time.sleep(2)
    after = response["data"]["children"][-1]["data"]["name"]

Then I extracted the most interesting fields into an SQLite database for easier querying:

import glob
import json
import sqlite3
import sys

# https://www.sqlite.org/datatype3.html
db = sqlite3.connect(sys.argv[1].strip("/") + ".sqlite")
db.execute("""
CREATE TABLE posts (
  name text,
  created_utc integer,
  subreddit text,
  score integer,
  num_comments integer,
  domain text,
  title text,
  url text
);
""")
for filename in glob.glob(sys.argv[1] + "/*.json"):
    with open(filename, "rb") as fp:
        response = json.load(fp)
    for list_item in response["data"]["children"]:
        post = list_item["data"]
        db_row = [
            post["name"],
            int(post["created_utc"]),
            post["subreddit"],
            int(post["score"]),
            int(post["num_comments"]),
            post["domain"],
            post["title"],
            post["url"],
        ]
        insert_sql = "INSERT INTO posts VALUES (%s)" % (", ".join("?" for _ in db_row),)
        db.execute(insert_sql, db_row)

db.commit()
db.close()

Analysis

By Subreddit

OK, we've got a snapshot of the top 2000 posts and can refresh it at will. Which subreddits should I filter out server-side to minimize noise on /r/all?

sqlite> SELECT COUNT(DISTINCT subreddit) FROM posts;
1608
sqlite> .mode column
sqlite> .headers on
sqlite> .width 20 10
sqlite> SELECT subreddit, COUNT(*) AS count FROM posts
   ...> GROUP BY subreddit ORDER BY count DESC, subreddit
   ...> LIMIT 20;
subreddit             count
--------------------  ----------
aww                   5
funny                 5
gaming                5
gifs                  5
pics                  5
politics              5
AskReddit             4
BlackPeopleTwitter    4
CrappyDesign          4
FortNiteBR            4
PrequelMemes          4
Rainbow6              4
dankmemes             4
leagueoflegends       4
memes                 4
nba                   4
oddlysatisfying       4
soccer                4
todayilearned         4
trees                 4

This result was pretty surprising to me. I had expected to see power law numbers, with "default" subreddits like /r/funny having an order of magnitude more posts on /r/all than the average. But it looks like the front-page algorithm optimizes for maximum subreddit variety, featuring over 1600 unique subreddits within the top 2000 posts. With a limit of 99 subreddits in the server-side filter, there's just no practical way to hide noise based on subreddit name.
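To put a number on how little a capped filter can do, the Python 3 sketch below computes the share of posts covered by the N most common subreddits. It assumes a `posts` table with a `subreddit` column like the one created earlier; the helper name `top_n_coverage` and the toy rows are invented for this example.

```python
import sqlite3


def top_n_coverage(db, n):
    """Return the fraction of posts that belong to the n most common subreddits."""
    total = db.execute("SELECT COUNT(*) FROM posts").fetchone()[0]
    covered = db.execute(
        """
        SELECT COALESCE(SUM(n_posts), 0) FROM (
            SELECT COUNT(*) AS n_posts FROM posts
            GROUP BY subreddit ORDER BY n_posts DESC LIMIT ?
        )
        """,
        (n,),
    ).fetchone()[0]
    return covered / total if total else 0.0


# Toy data: 3 subreddits with 2 posts each, 4 subreddits with 1 post each.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE posts (name text, subreddit text)")
toy_subreddits = ["a", "a", "b", "b", "c", "c", "d", "e", "f", "g"]
rows = [("t3_%d" % i, sub) for i, sub in enumerate(toy_subreddits)]
db.executemany("INSERT INTO posts VALUES (?, ?)", rows)

print(top_n_coverage(db, 3))  # 0.6
```

In the snapshot above no subreddit has more than 5 posts, so even a perfectly chosen 99-entry filter could hide at most 99 × 5 = 495 of the 2000 posts, just under 25%.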

By Domain

Here's where that power law showed up. Take a look at which domains the top 2000 posts are pointing at:

sqlite> SELECT domain, COUNT(*) AS count FROM posts
   ...> GROUP BY domain ORDER BY count DESC, domain
   ...> LIMIT 20;
domain                count
--------------------  ----------
i.redd.it             925
i.imgur.com           344
gfycat.com            137
imgur.com             103
v.redd.it             56
twitter.com           32
youtube.com           29
reddit.com            8
streamable.com        8
youtu.be              8
cdna.artstation.com   7
cdnb.artstation.com   4
clips.twitch.tv       4
media.giphy.com       4
self.AskReddit        4
self.leagueoflegends  4
streamja.com          4
78.media.tumblr.com   3
cdn.discordapp.com    3
inquisitr.com         3

1565 images! Out of the top 2000 posts on the world's biggest internet forum, over 75% of them are just pictures[2]!

/r/all posts per domain as of 2018-03-11 00:40:30 UTC
(Bar chart of the ten most common domains: i.redd.it 925, i.imgur.com 344, gfycat.com 137, imgur.com 103, v.redd.it 56, youtube.com 37, twitter.com 32, cdn*.artstation.com 11, reddit.com 8, streamable.com 8.)

And the dropoff is incredible -- the #10 domain on /r/all has 0.4% of the posts!
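The arithmetic behind those figures can be checked in a few lines of Python 3. The counts are copied from the domain table above; treating exactly these five hosts as "images" follows the text, and the variable names are only for illustration.

```python
# Post counts copied from the domain table above.
TOTAL_POSTS = 2000
image_hosts = {
    "i.redd.it": 925,
    "i.imgur.com": 344,
    "gfycat.com": 137,
    "imgur.com": 103,
    "v.redd.it": 56,
}

image_posts = sum(image_hosts.values())
image_percent = 100 * image_posts / TOTAL_POSTS
print(image_posts)                 # 1565
print("%.2f%%" % image_percent)    # 78.25%

# Ratio between the most common domain (925 posts) and a domain with 8 posts.
dropoff = image_hosts["i.redd.it"] / 8
print("%.1f" % dropoff)            # 115.6
```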

Filtering

Without Images

What happens if we kick out all the image hosts?

sqlite> SELECT COUNT(*) FROM posts;
2000
sqlite> DELETE FROM posts WHERE url LIKE "%.jpg" OR url LIKE "%.gif" OR domain IN ('i.redd.it', 'i.imgur.com', 'gfycat.com', 'giant.gfycat.com', 'imgur.com', 'v.redd.it', 'm.imgur.com', 'i.gyazo.com', 'cdna.artstation.com', 'cdnb.artstation.com', 'flickr.com') OR domain LIKE "%.media.tumblr.com";
sqlite> SELECT COUNT(*) FROM posts;
388

sqlite> SELECT domain, COUNT(*) AS count FROM posts GROUP BY domain ORDER BY count DESC, domain LIMIT 30;
domain                          count
------------------------------  ----------
twitter.com                     32
youtube.com                     29
reddit.com                      8
streamable.com                  8
youtu.be                        8
clips.twitch.tv                 4
self.AskReddit                  4
self.leagueoflegends            4
streamja.com                    4
inquisitr.com                   3
nytimes.com                     3
self.Jokes                      3
self.Showerthoughts             3
thehill.com                     3
variety.com                     3
businessinsider.com             2
dailycaller.com                 2
dailymail.co.uk                 2
en.wikipedia.org                2
epicgames.com                   2
newsweek.com                    2
rawstory.com                    2
salon.com                       2
self.AskOuija                   2
self.Brawlstars                 2
self.CFB                        2
self.Competitiveoverwatch       2
self.DestinyTheGame             2
self.LifeProTips                2
self.WritingPrompts             2

We have less than a quarter of the original dataset, but the link quality is higher. I see some newspapers in the list now, and the two most popular domains together are only about 16% of the links.
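The same image test could run before anything reaches the database. The Python 3 sketch below mirrors the `DELETE` predicate above; the function name `is_image_post` is made up, and the domain list is copied from that statement rather than being a complete list of image hosts.

```python
# Domains copied from the DELETE statement above.
IMAGE_DOMAINS = frozenset([
    "i.redd.it", "i.imgur.com", "gfycat.com", "giant.gfycat.com",
    "imgur.com", "v.redd.it", "m.imgur.com", "i.gyazo.com",
    "cdna.artstation.com", "cdnb.artstation.com", "flickr.com",
])


def is_image_post(url, domain):
    """Return True if the post would be removed by the image filter."""
    # SQLite's LIKE ignores ASCII case by default, so compare lowercased text
    # for the suffix checks; IN is an exact match, as in the SQL.
    url_lower = url.lower()
    return (
        url_lower.endswith(".jpg")
        or url_lower.endswith(".gif")
        or domain in IMAGE_DOMAINS
        or domain.lower().endswith(".media.tumblr.com")
    )


print(is_image_post("https://example.com/cat.JPG", "example.com"))    # True
print(is_image_post("https://www.nytimes.com/story", "nytimes.com"))  # False
```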

Without Self-Posts

Many of the remaining high-scoring posts are "self-posts", text posted directly to Reddit by users (basically a comment). Let's look more closely to see if they might be interesting:

sqlite> SELECT score, subreddit, title FROM posts WHERE domain LIKE "self.%" LIMIT 20;
score                 subreddit             title
--------------------  --------------------  ----------------------------------------------------------------------------------------------------
47354                 Showerthoughts        Being a blacksmith must have been a real pantydropper back in the day seeing how Smith is the most c
6008                  atheism               “Religion is what keeps the poor from murdering the rich.” ―Napoleon Bonaparte
23625                 AskReddit             What should people stop buying?
6289                  garlicoin             If this post gets 20,000 upvotes, I will give 5 random commenters 1000 GRLC.
15396                 Jokes                 A priest and a rabbi were sitting next to each other on an airplane.
1919                  leagueoflegends       MLG has wiped their entire LoL archive channel. This means many important pre-LCS VODS no longer exi
5852                  WritingPrompts        [WP] One evening, a portal to hell opens at the foot of your bed. A demon strides through, rips off
2671                  CrazyIdeas            I'm starting a charity to raise awareness of pyramid schemes. Donate $100 to register as a fundraise
2087                  ireland               IRELAND ARE 6 NATIONS CHAMPIONS UPVOTE PARTY!!!!
5334                  askscience            Am I using muscles to keep my eyelids open or to keep them closed or both?
5320                  confession            I got married tonight and it was the worst, most stressful day of my life.
1773                  nintendo              Happy March 10th aka MAR10 aka Mario Day!
2924                  personalfinance       A “subscription” box charged me for 4 of their $107 boxes without my consent and won’t refund
1125                  IAmA                  [AMA REQUEST] A Designer For Expensive Brands Like Gucci, Louis Vuitton, Etc
7414                  dadjokes              My teenage daughter came home from school and she was blazing mad. “We had sex education today, da
3886                  AskReddit             What is something everyone knows, but no one wants to admit?
1412                  YouShouldKnow         YSK that by looking up "3.11" on yahoo.co.jp, 10 cents will be donated to the East Japan Earthquake
493                   Competitiveoverwatch  Uber: "You know what Hydration is called in the sky, Matt?"
847                   leagueoflegends       Clutch Gaming vs. Echo Fox / NA LCS 2018 Spring - Week 8 / Post-Match Discussion
494                   canada                CBC reporting Doug Ford has won PC Leadership in Ontario by the slimmest of margins. Christine Ellio

Not really. There are two good links here (awareness of a disaster-relief charity and a breaking political story), but it's 90% noise. Let's delete them for now, and consider re-adding with strict filtering in the future.

sqlite> DELETE FROM posts WHERE domain LIKE "self.%";
sqlite> SELECT COUNT(*) FROM posts;
223

Without Sports or Video Games

I'm assuming that anybody who cares enough about disc golf (etc.) to click its posts is already subscribed directly. Let's delete posts from any subreddits that are obviously for a specific game (physical or virtual). In theory, Reddit could support this directly on their servers with a simple tagging system for subreddits.

sqlite> DELETE FROM posts WHERE subreddit IN ('49ers', 'Artifact', 'Barca', 'BattleRite', 'CollegeBasketball', 'Destiny', 'DetroitRedWings', 'DotA2', 'FantasyPL', 'FortNiteBR', 'GlobalOffensive', 'GreenBayPackers', 'LiverpoolFC', 'LonghornNation', 'MkeBucks', 'NUFC', 'NYYankees', 'NintendoSwitch', 'PS4', 'SquaredCircle', 'Steam', 'StreetFighter', 'aoe2', 'baseball', 'canucks', 'chelseafc', 'civbattleroyale', 'cowboys', 'detroitlions', 'discgolf', 'eagles', 'gamernews', 'hockey', 'lakers', 'minnesotatwins', 'nba', 'osugame', 'reddevils', 'smashbros', 'soccer', 'speedrun', 'sports', 'starcraft', 'thelastofus', 'torontoraptors', 'xboxone');
sqlite> SELECT COUNT(*) FROM posts;
169
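A tagging system like the one suggested above might look roughly like this Python 3 sketch. Everything here is hypothetical: Reddit exposes no such tags as far as this post knows, and the tag names, the `SUBREDDIT_TAGS` mapping, and `visible_posts` exist only to show the idea of hiding a category instead of listing 46 subreddits by hand.

```python
# Hypothetical tag assignments, for illustration only.
SUBREDDIT_TAGS = {
    "nba": {"sports"},
    "soccer": {"sports"},
    "FortNiteBR": {"video games"},
    "speedrun": {"video games"},
    "science": set(),
}


def visible_posts(posts, hidden_tags, tags=SUBREDDIT_TAGS):
    """Keep posts whose subreddit carries none of the hidden tags.

    Subreddits without any tag entry are treated as untagged and kept.
    """
    hidden = set(hidden_tags)
    return [p for p in posts if not (tags.get(p["subreddit"], set()) & hidden)]


posts = [
    {"subreddit": "nba", "title": "toy sports post"},
    {"subreddit": "science", "title": "toy science post"},
    {"subreddit": "speedrun", "title": "toy game post"},
]
kept = visible_posts(posts, hidden_tags=["sports", "video games"])
print([p["subreddit"] for p in kept])  # ['science']
```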

Ranking

Reddit's Default Ranking

Here's what the front page would look like, using the above filters with Reddit's current ranking algorithm:

sqlite> .width 7 12 30 22 120
sqlite> SELECT score, num_comments, domain, subreddit, title FROM posts LIMIT 25;
score    num_comments  domain                          subreddit               title
-------  ------------  ------------------------------  ----------------------  --------------------------------------------------------------------------------------------------------------------
23083    1466          ultimateclassicrock.com         Music                   40 year old rock station in Chicago replaced by Christian radio at midnight last night. Signed off with Motley Crue’s
36976    1491          youtube.com                     todayilearned           TIL that before the Super Bowl XLI Halftime Show, the show coordinator asked Prince if he'd be alright performing in the
8255     286           businessinsider.com             Futurology              SpaceX rocket launches are getting boring — and that's an incredible success story for Elon Musk: “His aim: dramatic
24711    791           inquisitr.com                   technology              Senate Bill Meant To Punish Equifax Might Actually Reward It: Thanks to last-minute changes in legislation designed to d
11283    772           cbsnews.com                     politics                80 percent of mass shooters showed no interest in video games, researcher says
4621     278           haaretz.com                     worldnews               'Caved to religious pressure': Israeli army takes down viral Women's Day video empowering female soldiers
12958    599           fox13news.com                   FloridaMan              Florida woman jailed for 5 months because of a failed field drug test. The lab test took 7 months to come back, revealin
40663    1571          usatoday.com                    books                   Banning literature in prisons perpetuates a system that ignores inmate humanity
26701    343           dailymail.co.uk                 UpliftingNews           Cute video shows no-kill shelter putting old chairs to good use by letting rescue dogs curl up on them in their cages
59591    3772          seattletimes.com                news                    Costco says extra profit from tax cuts will be shared with employees
6080     285           bellinghamherald.com            nottheonion             A man found 54 human hands in the snow. Russia says they’re probably just trash
20262    903           indiewire.com                   television              Bill Hader’s ‘Massive Panic Attacks’ on ‘SNL’ Inspired His New HBO Series, ‘Barry’
2097     148           scontent-lht6-1.xx.fbcdn.net    batman                  And that's how you end the greatest live action superhero film of all time.
4092     75            web.archive.org                 savedyouaclick          Scientists warn of mysterious and deadly new epidemic called Disease X that could kill millions around the world | "Dise
2595     69            twitter.com                     TrumpCriticizesTrump    "I told Rex Tillerson, our wonderful Secretary of State, that he is wasting his time trying to negotiate with Little Roc
4403     314           youtube.com                     videos                  He is not using auto tune but a form of yodeling.
3885     88            youtu.be                        youtubehaiku            [Poetry] Rejected Theme Song from READY PLAYER ONE
6152     237           inquisitr.com                   AgainstHateSubreddits   Reddit’s Financial Ties To Jared Kushner’s Family Under Scrutiny Amid Inaction Against The_Donald Hate Speech
44691    880           aero.umd.edu                    science                 Scientists create nanowood, a new material that is as insulating as Styrofoam but lighter and 30 times stronger, doesn?
4252     364           nydailynews.com                 politics                FedEx won't ship items like stamps, coins or ashes — but they'll ship guns at a discount
2137     131           heroichollywood.com             Marvel                  Marvel's 'Black Panther' Joins The $1 Billion Box Office Club
6039     679           space.com                       space                   Trump Praises Commercial Space Industry at Cabinet Meeting
2277     535           clips.twitch.tv                 LivestreamFail          OWL referee or should I say "no fun police". DED game btw
1891     119           wect.com                        offbeat                 Cop who lied to Uber driver about it being "illegal to film police" gets reinstated, abruptly retires the next day.
2827     75            salon.com                       esist                   Is Donald Trump a cult leader? Expert says he “fits the stereotypical profile”

This is better than we started with, but even after all that bulk deletion we still have to contend with noise like /r/savedyouaclick (posting clickbait on purpose), /r/LivestreamFail (people I don't know doing things I will never care about), and /r/youtubehaiku (America's Funniest Home Videos for snake people).

By Score

What if we rank directly on voted score?

sqlite> SELECT score, num_comments, domain, subreddit, title FROM posts ORDER BY score DESC LIMIT 25;
score    num_comments  domain                          subreddit               title
-------  ------------  ------------------------------  ----------------------  --------------------------------------------------------------------------------------------------------------------
59591    3772          seattletimes.com                news                    Costco says extra profit from tax cuts will be shared with employees
44691    880           aero.umd.edu                    science                 Scientists create nanowood, a new material that is as insulating as Styrofoam but lighter and 30 times stronger, doesn?
40663    1571          usatoday.com                    books                   Banning literature in prisons perpetuates a system that ignores inmate humanity
36976    1491          youtube.com                     todayilearned           TIL that before the Super Bowl XLI Halftime Show, the show coordinator asked Prince if he'd be alright performing in the
30939    1845          youtube.com                     videos                  It's the weekend and you know what that means
26701    343           dailymail.co.uk                 UpliftingNews           Cute video shows no-kill shelter putting old chairs to good use by letting rescue dogs curl up on them in their cages
26312    2873          jpost.com                       politics                Putin: Jews might have been behind U.S. election interference
24711    791           inquisitr.com                   technology              Senate Bill Meant To Punish Equifax Might Actually Reward It: Thanks to last-minute changes in legislation designed to d
23083    1466          ultimateclassicrock.com         Music                   40 year old rock station in Chicago replaced by Christian radio at midnight last night. Signed off with Motley Crue’s
20262    903           indiewire.com                   television              Bill Hader’s ‘Massive Panic Attacks’ on ‘SNL’ Inspired His New HBO Series, ‘Barry’
12958    599           fox13news.com                   FloridaMan              Florida woman jailed for 5 months because of a failed field drug test. The lab test took 7 months to come back, revealin
11932    1418          timesofisrael.com               worldnews               Putin suggests ‘Jews with Russian citizenship’ behind US election interference
11283    772           cbsnews.com                     politics                80 percent of mass shooters showed no interest in video games, researcher says
8255     286           businessinsider.com             Futurology              SpaceX rocket launches are getting boring — and that's an incredible success story for Elon Musk: “His aim: dramatic
7490     80            np.reddit.com                   bestof                  Redditor mentions psychiatrist Dr. Tyler Black in a thread about gamer psychology and violence, Dr. Tyler Black shows up
6657     227           en.wikipedia.org                todayilearned           TIL of Major Digby Tatham-Warter, a British major who brought an umbrella into battle, using it to stop an armoured vehi
6341     740           youtu.be                        videos                  Girl goes on Dr. Phil and says she is pregnant with baby Jesus. Ultrasound reveals she is literally full of shit.
6152     237           inquisitr.com                   AgainstHateSubreddits   Reddit’s Financial Ties To Jared Kushner’s Family Under Scrutiny Amid Inaction Against The_Donald Hate Speech
6080     285           bellinghamherald.com            nottheonion             A man found 54 human hands in the snow. Russia says they’re probably just trash
6039     679           space.com                       space                   Trump Praises Commercial Space Industry at Cabinet Meeting
5254     784           scmp.com                        worldnews               Putin said he “couldn’t care less” if fellow Russian citizens sought to meddle in US election, insisting such effo
4790     176           jaha.ahajournals.org            science                 top cardiologists have better patient outcomes when they are away. Study of patient outcomes during Transcatheter Cardio
4621     278           haaretz.com                     worldnews               'Caved to religious pressure': Israeli army takes down viral Women's Day video empowering female soldiers
4403     314           youtube.com                     videos                  He is not using auto tune but a form of yodeling.
4252     364           nydailynews.com                 politics                FedEx won't ship items like stamps, coins or ashes — but they'll ship guns at a discount

This … is good! I would enjoy reading a Reddit front page that looked like this.
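The deletions above are destructive, so here is a minimal Python 3 sketch of the same idea as a single read-only query: skip self-posts, skip unwanted domains and subreddits, and sort by score. The helper `front_page` and the toy rows are invented for this example, and the sketch leaves out the `.jpg`/`.gif` URL checks used earlier.

```python
import sqlite3

QUERY = """
SELECT score, subreddit, title FROM posts
WHERE domain NOT LIKE 'self.%'
  AND domain NOT IN ({domains})
  AND subreddit NOT IN ({subreddits})
ORDER BY score DESC
LIMIT ?
"""


def front_page(db, hidden_domains, hidden_subreddits, limit=25):
    """Return the highest-scoring posts that survive the filters."""
    # Build one "?" placeholder per hidden value so the values stay parameters.
    sql = QUERY.format(
        domains=", ".join("?" for _ in hidden_domains),
        subreddits=", ".join("?" for _ in hidden_subreddits),
    )
    params = list(hidden_domains) + list(hidden_subreddits) + [limit]
    return db.execute(sql, params).fetchall()


# Toy rows, for illustration only.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE posts (subreddit text, score integer, domain text, title text)")
db.executemany("INSERT INTO posts VALUES (?, ?, ?, ?)", [
    ("pics", 90000, "i.redd.it", "toy image post"),
    ("news", 500, "example.com", "toy news story"),
    ("science", 800, "example.org", "toy science story"),
    ("AskReddit", 700, "self.AskReddit", "toy self-post"),
    ("nba", 600, "youtube.com", "toy sports clip"),
])

for row in front_page(db, ["i.redd.it"], ["nba"]):
    print(row)
# (800, 'science', 'toy science story')
# (500, 'news', 'toy news story')
```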

Conclusions

Reddit's server-side filtering options are not currently useful for /r/all, because their ranking algorithm intentionally optimizes for a small number of posts from many subreddits, but their filter has a hard capacity limit of 99 subreddits. Their filtering could be made much more effective by offering the ability to hide unwanted domains, hide self posts, and hide certain broad categories of subreddit that users can easily self-recognize (e.g. "video games", "sports", or "livestreamers").

Other Findings

/r/The_Donald

One interesting note is that of the top 2000 posts on /r/all, none came from the controversial right-wing political subreddit /r/The_Donald. This appears to be intentional: at the time of writing /r/The_Donald has several recent posts with scores in the 2000-8000 range, which is far above the minimum scores seen on /r/all:

sqlite> SELECT score, num_comments, domain, subreddit, title FROM posts ORDER BY score ASC LIMIT 10;
score    num_comments  domain                          subreddit               title
-------  ------------  ------------------------------  ----------------------  --------------------------------------------------------------------------------------------------------------------
55       15            pcper.com                       hardware                AMD FreeSync 2 for Xbox One S and Xbox One X
57       0             nytimes.com                     netneutrality           Washington Governor Signs First State Net Neutrality Bill
63       5             oregonlive.com                  oregon                  Bend woman gets 21 years for drugging kids so she could go tanning, do CrossFit
70       4             bitcoinafrica.io                BasicIncome             Universal Basic Income Experiment Launches in Kenya and Uganda Partly Funded by Bitcoin
71       40            youtube.com                     SugarPine7              Sexy nightmare.
75       41            reddit.com                      Drama                   /u/GallowBoob outs his sockpuppet as he justifies his pedophilia.
77       1             vstinner.github.io              Python                  How Victor Stinner fixed a very old GIL race condition in Python 3.7
79       12            dailymail.co.uk                 EnoughLibertarianSpam   Pro-gun poster girl is shot in the back by her four-year-old son
97       81            youtube.com                     eurovision              It's Benjamin Ingrosso with "Dance You Off" for Sweden!
98       228           baytoday.ca                     CanadaPolitics          Doug Ford wins PC leadership race in close vote

I am personally happy about this because I find candidate-specific political forums very noisy, but it's not clear how to reconcile this behavior with the Reddit administration's public claims of content neutrality.


  1. My first attempt used praw, but it requires a registered client ID and I didn't want to go through that just to get a few MB of JSON.

  2. You might object that pictures can be useful, but those aren't the kind Reddit upvotes. Currently #1 on the site, judged by the users to be more interesting than any other post, is a gif of a falling tree plus fake captions.
