Subreddit Algebra
Yesterday, FiveThirtyEight featured a fantastic article by Trevor Martin, a Ph.D. student in Computational Biology at Stanford University. Martin’s piece, Dissecting Trump’s Most Rabid Online Following, looked at the toxic communities surrounding Donald Trump, notably r/The_Donald, using a machine learning technique called latent semantic analysis. LSA compares sets of documents by the words and concepts they contain and shows how closely they are related. Martin used this process to find the overlap between different subreddits; two subreddits are more similar when more users comment in both. He then went further with what he calls “subreddit algebra”: by adding or subtracting subreddits, other related subreddits can be revealed. For example, r/nba + r/minnesota = r/timberwolves. If you’re interested in semantic vector math, there’s a fun Twitter bot that does this algebra several times per day.
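To make the algebra concrete, here is a minimal Python sketch of the idea (the FiveThirtyEight script itself is written in R). It assumes each subreddit is already represented as a vector; the three-dimensional vectors below are invented purely for illustration, and the helper names cosine, combine, and nearest are my own, not anything from the original code.

```python
import math

# Toy vectors, invented for illustration only. Real vectors would be
# built from commenter overlap between subreddits.
vectors = {
    "nba":          [0.9, 0.1, 0.0],
    "minnesota":    [0.1, 0.9, 0.0],
    "timberwolves": [0.7, 0.7, 0.1],
    "knitting":     [0.0, 0.1, 0.9],
}

def cosine(a, b):
    """Cosine similarity: 1.0 means same direction, 0.0 means unrelated."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def combine(plus, minus=()):
    """Add the 'plus' subreddit vectors and subtract the 'minus' ones."""
    dims = len(next(iter(vectors.values())))
    out = [0.0] * dims
    for name in plus:
        out = [o + v for o, v in zip(out, vectors[name])]
    for name in minus:
        out = [o - v for o, v in zip(out, vectors[name])]
    return out

def nearest(plus, minus=(), top=3):
    """Rank the remaining subreddits by similarity to the combined vector."""
    target = combine(plus, minus)
    skip = set(plus) | set(minus)
    scored = [(name, cosine(target, vec))
              for name, vec in vectors.items() if name not in skip]
    return sorted(scored, key=lambda item: item[1], reverse=True)[:top]

result = nearest(["nba", "minnesota"])
print(result)  # timberwolves ranks first with these toy vectors
```

Subtraction works the same way: nearest(["nba"], minus=["minnesota"]) ranks subreddits against the difference of the two vectors.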
As with all of its data stories, FiveThirtyEight makes the code freely available for readers to try out themselves. I thought it’d be interesting to take a peek at some subreddits that are a little closer to home (and a whole lot less racist and sexist). If you don’t want to run this yourself, feel free to skip to the results below.
The Setup
If you want to follow along, you’ll need some familiarity with the Google Cloud Platform since that’s where everything will be run. Specifically, you’ll be using their BigQuery service, which is a tool for working with massive datasets. You’ll also want to set up a bucket in Google Storage, because your outputs will be quite large and BigQuery doesn’t let you export them directly to your local file system. Finally, you’ll need some basic familiarity with the R language and an environment to run R scripts. RStudio is a great tool for this.
First, from your Google Cloud console, create a new project to contain the various tables you’ll be generating. Next, head over to BigQuery and create a new dataset under your project. You could call this something like ‘reddit’. This will hold your results. You’ll be querying against the fh-bigquery:reddit_comments dataset that is made available to you by default. Click on the Compose Query button and use this code from the fivethirtyeight GitHub repository. Change line 19 to the path of the dataset you just created.
Take the resulting dataset that this query generates and export it to the storage bucket you created. From there, you can download it as a CSV file.
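For a sense of what kind of data this step produces, here is a rough Python sketch of the underlying idea that two subreddits are more similar if users comment in both. It is not the FiveThirtyEight query, which runs as SQL in BigQuery and may filter or count differently; the comment records and the shared_commenters helper are hypothetical.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical (author, subreddit) comment records, invented for illustration.
comments = [
    ("alice", "IowaCity"), ("alice", "uiowa"), ("alice", "Homebrewing"),
    ("bob", "IowaCity"), ("bob", "Homebrewing"),
    ("carol", "uiowa"), ("carol", "IowaCity"),
    ("alice", "IowaCity"),  # a repeat comment; alice still counts once
]

def shared_commenters(records):
    """Count, for each pair of subreddits, how many authors commented in both."""
    subs_by_author = defaultdict(set)
    for author, subreddit in records:
        subs_by_author[author].add(subreddit)

    overlap = defaultdict(int)
    for subs in subs_by_author.values():
        # Sort so each pair is always stored in the same order.
        for a, b in combinations(sorted(subs), 2):
            overlap[(a, b)] += 1
    return dict(overlap)

overlap = shared_commenters(comments)
for pair, count in sorted(overlap.items()):
    print(pair, count)
```

Using a set per author means a prolific commenter counts once per subreddit, so the result measures shared people rather than comment volume.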
Now, in RStudio, load the vector analysis script from the repository. You’ll need to change the path to the CSV file on line 20 to your exported CSV. And, of course, change the various subreddits after line 59. Now the fun begins!
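If you’d rather poke at the exported CSV outside of R, the sketch below shows one way to turn a table of pairwise overlaps into one vector per subreddit using only Python’s standard library. The column names sub_a, sub_b, and shared_users are assumptions made up for this example, so check the header row of your own export. The R script may also weight or normalize the counts before comparing them, which this sketch does not attempt.

```python
import csv
import io
from collections import defaultdict

# Stand-in for the exported file. The column names are assumptions;
# replace them with whatever your CSV header actually contains.
sample_csv = """sub_a,sub_b,shared_users
IowaCity,uiowa,40
IowaCity,Homebrewing,25
uiowa,Cars,10
"""

def load_vectors(handle):
    """Build one overlap vector per subreddit from long-format pair counts."""
    rows = list(csv.DictReader(handle))
    columns = sorted({r["sub_a"] for r in rows} | {r["sub_b"] for r in rows})
    index = {name: i for i, name in enumerate(columns)}

    sub_vectors = defaultdict(lambda: [0.0] * len(columns))
    for r in rows:
        count = float(r["shared_users"])
        # Overlap is symmetric, so fill in both directions.
        sub_vectors[r["sub_a"]][index[r["sub_b"]]] = count
        sub_vectors[r["sub_b"]][index[r["sub_a"]]] = count
    return columns, dict(sub_vectors)

columns, sub_vectors = load_vectors(io.StringIO(sample_csv))
print(columns)
print(sub_vectors["IowaCity"])
```

For a real file, pass open("your_export.csv", newline="") in place of the io.StringIO object.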
The Results
The first obvious search is for similar subreddits to r/IowaCity. What kinds of things do Iowa City folks post about? The higher the number, the more related the subreddits are.
Cedarrapids 0.4627451
Madisonwi 0.4278260
Uiowa 0.4216467
Milwaukee 0.4069844
Homebrewing 0.3992629
Beer 0.3941419
Chicago 0.3916151
Indianapolis 0.3868063
Iowa 0.3850677
Smoking 0.3823774
Ok, not surprising. Surrounding cities plus beer drinking and smoking meats. Iowa City redditors are a chill bunch. What about the uiowa subreddit?
IowaCity 0.4216467
Mazdaspeed6 0.2913548
Swimming 0.2766708
Projectcar 0.2719264
Madisonwi 0.2699070
Cartalk 0.2696891
College 0.2646985
Cars 0.2642775
Civilengineering 0.2637309
Milwaukee 0.2634588
I’ll admit, there is a surprising amount of car discussion going on. Perhaps it’s less surprising when you see some of the cars downtown.
What happens when we take the uiowa out of Iowa City? IowaCity – uiowa =
PoGoIC 0.2447359
Smoking 0.2135908
Homebrewing 0.2053004
BBQ 0.2028280
Grilling 0.1997918
Sousvide 0.1983743
Wine 0.1961068
Cedarrapids 0.1937385
Bourbon 0.1917187
Spicy 0.1895046
Iowa City likes to grill out and drink. And play Pokémon Go. Let’s see what librarians are up to. From r/Libraries:
Librarians 0.6681721
Teachers 0.6463503
Knitting 0.6231567
Parenting 0.6165957
Weddingplanning 0.6118699
Genealogy 0.6118073
Wedding 0.6039990
Femalefashionadvice 0.6024974
Crochet 0.6010991
Vegetarian 0.5975182
Congratulations, librarians, on your marriage and children! And your new fiber arts project. What happens when we remove the wedding planning from librarians’ reddit posts?
Corruption 0.3048685
HistoryofIdeas 0.2961678
CornbreadLiberals 0.2932469
TrueProgressive 0.2924358
Scifi 0.2919506
Media 0.2833257
WarOnComcast 0.2797392
TechNewsToday 0.2789546
InCaseYouMissedIt 0.2789388
Obama 0.2774487
What other interesting algebra problems could we think up? Send me an email and I’ll try to post a few next week. After all, it’s Friday and I’m off to drink beer, grill some vegetarian food, and read sci-fi after I’m done parenting for the day. This weekend might be a good time to pick up knitting.