Over the past few days, I've had the pleasure of finally downloading /u/Stuck_In_the_Matrix's dataset posted to reddit: nearly all of the comments ever posted on Reddit. It takes about 12 hours to download the data via the torrent link, and about 16 hours to parse it into a MongoDB database. I didn't add any indexes at first, and ran into a few issues. Pro-tip: if you don't add indexes, you're going to have a bad time. Index the comment_id, the parent_id, the timestamp, and the user's name.
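As a rough sketch of that pro-tip, the helper below creates one single-field index per field on a pymongo-style collection. The field names (`id`, `parent_id`, `created_utc`, `author`) are my assumption about how the parsed comments are stored, and the database and collection names in the usage comment are hypothetical; swap in whatever your import actually produced.

```python
# Assumed field names for comment id, parent id, timestamp, and user name.
# Adjust these to match the documents in your own collection.
INDEXED_FIELDS = ["id", "parent_id", "created_utc", "author"]


def ensure_comment_indexes(collection, fields=INDEXED_FIELDS):
    """Create one single-field index per field and return the index names.

    `collection` is expected to behave like a pymongo Collection, i.e. it
    exposes create_index(key) and returns the name of the index.
    """
    return [collection.create_index(field) for field in fields]


# Example usage (assumes a local MongoDB; "reddit" and "comments" are
# hypothetical names):
#
#   from pymongo import MongoClient
#   comments = MongoClient()["reddit"]["comments"]
#   ensure_comment_indexes(comments)
```

Building the indexes before running any parent/child lookups is the point here; on a collection this size, each unindexed lookup is a full scan.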
Of the downstream data structures I've built, one is the edge collection: for each comment whose parent is of type t3_ (a link), generate a new edge for each of its children, and then repeat the same process for each child's own children. If an edge already exists, increment the interaction_count between the parent poster and the child poster by one. In this way, you can generate an edge list that is directed and weighted. In addition, edges are partitioned by subreddit: edges are only generated for one subreddit at a time, and only on request. So far, I have mapped out three subreddits that have historically been known, at least qualitatively, for problematic posts, users, and comments, and for generating controversy for the website: TheRedPill, KotakuInAction, and MensRights. While this is far from a solid take-away, I decided to start by just mapping out the intersection of these communities.
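The edge-building walk described above can be sketched in plain Python, without MongoDB. This is a minimal illustration, not the actual pipeline: it assumes each comment is a dict with `id`, `parent_id`, `author`, and `subreddit` keys, that a reply's `parent_id` is `"t1_"` plus the parent comment's `id`, and that edges point from the parent poster to the child poster. It also uses an explicit stack instead of recursion so deep threads don't hit Python's recursion limit.

```python
from collections import Counter, defaultdict


def build_edges(comments, subreddit):
    """Return {(parent_author, child_author): interaction_count} for one subreddit."""
    in_sub = [c for c in comments if c["subreddit"] == subreddit]

    # Group comments by the thing they reply to.
    children = defaultdict(list)
    for comment in in_sub:
        children[comment["parent_id"]].append(comment)

    edges = Counter()
    # Start from top-level comments, whose parent is a link (t3_ prefix).
    stack = [c for c in in_sub if c["parent_id"].startswith("t3_")]
    while stack:
        parent = stack.pop()
        for child in children.get("t1_" + parent["id"], []):
            # New edges start at 1; existing edges are incremented.
            edges[(parent["author"], child["author"])] += 1
            stack.append(child)
    return edges


# Toy data, made up for illustration only.
toy = [
    {"id": "a", "parent_id": "t3_x", "author": "alice", "subreddit": "S"},
    {"id": "b", "parent_id": "t1_a", "author": "bob", "subreddit": "S"},
    {"id": "c", "parent_id": "t1_b", "author": "alice", "subreddit": "S"},
]
toy_edges = build_edges(toy, "S")
```

Note that comments whose ancestors are missing from the data are never reached by this walk, so they contribute no edges.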
From the edge collection I created, I selected cases where a parent user had posted in at least two of these communities. I then selected cases where the users had posted at least ten times (an arbitrary cut-off just to cut down on how much mapping I'd have to do for an initial dive, though this should still in effect capture the core of the intersection of these networks). I then calculated the variance of posting for each person (i.e., how evenly spread their participation was across all three subreddits), and stored which subreddit was their modal subreddit. I colored nodes by modal subreddit, and then sized nodes by degree, betweenness, and variance. What follows are some quick take-aways from this initial look at the dataset with respect to these three highly controversial subreddits:
Nodes sized by In-Degree
Nodes sized by Betweenness
Nodes sized by Variance
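For reference, here is one way the per-user filtering, variance, and modal subreddit behind these graphs could be computed. It is a sketch under assumptions: I take "variance" to mean the population variance of a user's share of posts in each of the three subreddits (0 means perfectly even participation), and the input shape, a dict of per-subreddit post counts per user, is hypothetical.

```python
from statistics import pvariance

SUBREDDITS = ("TheRedPill", "KotakuInAction", "MensRights")


def participation_profile(counts, subreddits=SUBREDDITS):
    """Summarize one user's post counts across the given subreddits."""
    per_sub = [counts.get(s, 0) for s in subreddits]
    total = sum(per_sub)
    if total == 0:
        raise ValueError("user has no posts in the given subreddits")
    # Share of the user's posts in each subreddit; even spread gives variance 0.
    shares = [n / total for n in per_sub]
    # Modal subreddit: the one with the highest post count (ties go to the
    # first subreddit listed).
    modal = max(subreddits, key=lambda s: counts.get(s, 0))
    return {"total": total, "variance": pvariance(shares), "modal": modal}


def select_core_users(counts_by_user, min_posts=10, min_subs=2):
    """Keep users active in at least min_subs subreddits with min_posts posts."""
    selected = {}
    for user, counts in counts_by_user.items():
        active = sum(1 for s in SUBREDDITS if counts.get(s, 0) > 0)
        total = sum(counts.get(s, 0) for s in SUBREDDITS)
        if active >= min_subs and total >= min_posts:
            selected[user] = participation_profile(counts)
    return selected
```

Using shares rather than raw counts keeps heavy posters from dominating the variance; if you would rather size nodes by raw-count variance, pass the counts straight to `pvariance` instead.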
1. MensRights and TheRedPill are more tightly coupled than KotakuInAction: MensRights is grayish-white, TheRedPill is red, and KotakuInAction is green (each node is colored by its most-participated subreddit, i.e., the one with the highest post count in that node's post history). What results is a clearly stronger relationship between MensRights and TheRedPill than between either of them and KotakuInAction (though it should be noted that MensRights and TheRedPill are both roughly two to three times as large as KotakuInAction, so treating this relationship as meaningful right out of the gate is probably open to skepticism).
2. Intersections as shared cognitive spaces: Graphs like this can capture the degree to which two communities are the same. Much like similarity measures such as the Jaccard Index capture the degree to which two nodes are similar, the sum of Jaccard Indices is likely a good way to interpret the degree to which two communities overlap and, in one interpretation, share a cognitive similarity. On one extreme, a community-level Jaccard Index of zero would indicate no shared members and would lend itself well to an interpretation where the two communities do not interact. On the other extreme, a Jaccard Index of one at the community level would indicate that the two communities are essentially the same, and that the combined community probably shares the same views as either of the single communities. Of course this is entangled with what defines a community, and also likely violates some statistical assumptions about the independence of ties. Still, I think that the general qualitative take-away holds for now, and the details lie in a more rigorous analysis.
3. KotakuInAction is its own beast: It started much later than the other two (MensRights was established in March 2008(!), TheRedPill in October 2012, and KotakuInAction in August 2014) and is much smaller, but has created much more controversy for the site. Judging by the membership counts against the user counts alone, the three are independent in their adoption function. That KotakuInAction is so thinly tied to the other two is interesting, though.
4. Modal subreddits for an account need not be nested in their modal communities: This would represent people moving towards new communities, people who are only barely more active in one particular community, and maybe a few other cases I'm not thinking of right now. Either way, historical in-flows, or migratory patterns at a high level, would be useful for uncovering how people move around and the degree to which that tells us a story.
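To make the community-level Jaccard Index from point 2 concrete, here is a minimal sketch that treats each community as a set of member usernames and computes the size of the intersection over the size of the union for every pair. This is just one simple reading of "community-level" overlap (the summed node-level version would need the graph itself), and the membership sets in the example are made-up placeholders, not results from the dataset.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard Index of two collections: |A & B| / |A | B| (0.0 if both empty)."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        return 0.0
    return len(a & b) / len(union)


def pairwise_jaccard(members_by_community):
    """Jaccard Index for every pair of communities, keyed by sorted name pair."""
    return {
        (x, y): jaccard(members_by_community[x], members_by_community[y])
        for x, y in combinations(sorted(members_by_community), 2)
    }


# Placeholder membership sets for illustration only.
toy_members = {
    "MensRights": {"u1", "u2", "u3"},
    "TheRedPill": {"u2", "u3", "u4"},
    "KotakuInAction": {"u5"},
}
toy_overlap = pairwise_jaccard(toy_members)
```

A value of 0 means no shared members and a value of 1 means identical membership, matching the two extremes described above.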