Basically, you give createDataset the reddit object, the subreddits (in list or generator form), a start and end date, a base name for the database, and a fine scale, which I’ll get to in a moment. Choose too many, and you’ll quickly start getting subreddit-specific words, such as names, which will trivialize the problem. But are there more subtle differences between subreddits that can be used to group them in meaningful ways as well? So we can cluster the subreddits cleanly, but what defines these clusters? The algorithm clustered the data into 7 nicely separated clusters, as displayed in the images below. As a first-pass analysis of these data, I calculated the Euclidean distance between the normalized word distributions for each pair of subreddits, resulting in a matrix of pairwise distances.
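To make the distance calculation concrete, here is a minimal sketch of a pairwise Euclidean distance matrix. The subreddit names and the three-word vectors are made up for illustration, and `distance_matrix` is a hypothetical helper, not the code used for the analysis.

```python
import math

def distance_matrix(distributions):
    """Return sorted names and the matrix of pairwise Euclidean distances.

    `distributions` maps a subreddit name to its normalized word vector;
    all vectors must use the same word order and length.
    """
    names = sorted(distributions)
    matrix = [
        [math.dist(distributions[a], distributions[b]) for b in names]
        for a in names
    ]
    return names, matrix

# Toy, made-up vectors (one entry per word), not real data.
toy = {
    "subA": [1.2, 2.0, 0.0],
    "subB": [0.8, 0.0, 2.0],
    "subC": [1.2, 2.0, 0.0],
}
names, matrix = distance_matrix(toy)
for name, row in zip(names, matrix):
    print(name, [round(d, 3) for d in row])
```

Identical distributions get a distance of 0, and the matrix is symmetric, which is why only the off-diagonal structure is interesting.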
Some serve as learning resources for those new to a field, while others are places for debates among experts.
Well, say we want to get all the posts from a particular subreddit. PRAW has many other methods to grab specific submissions, comments, users, etc. Each row in comments represents a single comment in a post.
To answer the question of whether users in different subreddits write in distinguishable ways, I analyzed the frequency of words used in the comments of each subreddit. I’ve also collected a data set of almost all the posts, along with their top comments, from the top subreddits from March. In a lot of ways, these make intuitive sense.
I suppose how we write says a lot about us. There are subreddits for any and every topic one can think of, and redditors know that subreddits quickly take on dynamic personalities.
Once you have a generator or list of subreddit objects and your praw object, call createDataset to start downloading comments and posts into a sqlite3 database. So, for example, it’s easy to grab posts from the last hour, day, week, month, or year, but challenging to grab posts from the month before last, or even the last month except for today.
A value of 1 indicates that the word has the same frequency as the mean frequency for that word.
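As a rough illustration of that normalization, the sketch below divides each word's frequency in a subreddit by the mean frequency of that word across subreddits. This is my reading of the sentence above, not necessarily the exact procedure used; the toy comments, the tiny vocabulary, and both helper functions are hypothetical.

```python
from collections import Counter

def relative_frequencies(comments, vocabulary):
    """Share of vocabulary-word occurrences taken up by each word."""
    counts = Counter(
        word
        for comment in comments
        for word in comment.lower().split()
        if word in vocabulary
    )
    total = sum(counts.values())
    return {w: (counts[w] / total if total else 0.0) for w in vocabulary}

def normalize_by_mean(freqs_by_subreddit, vocabulary):
    """Divide each word's frequency by its mean frequency across subreddits."""
    n = len(freqs_by_subreddit)
    normalized = {name: {} for name in freqs_by_subreddit}
    for w in vocabulary:
        mean = sum(f[w] for f in freqs_by_subreddit.values()) / n
        for name, f in freqs_by_subreddit.items():
            normalized[name][w] = f[w] / mean if mean else 0.0
    return normalized

# Toy, made-up comments; real input would come from the comments table.
toy = {
    "subA": ["i think you are right", "you and i agree"],
    "subB": ["they said it", "i saw it"],
}
vocab = ["i", "you", "it"]
freqs = {name: relative_frequencies(c, vocab) for name, c in toy.items()}
normalized = normalize_by_mean(freqs, vocab)
print(normalized)
```

Under this scheme a value above 1 means the word is over-represented in that subreddit relative to the average, and a value below 1 means it is under-represented.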
Not only did subreddits cluster in a reasonable fashion according to topics, many of the clusters can be defined by differences in just a few individual words, with pronouns having a disproportionate influence.
As a general overview, we can look at the contribution of each word to each of the principal components. There are some obvious answers. How much information does it take to categorize a subreddit?
You can get this database here. One really nice feature of affinity propagation is that, as opposed to k-means clustering, it doesn’t require you to estimate the number of clusters beforehand. Some are incredibly supportive, while others quickly become havens for trolls. Choosing the right number of words to analyze is a bit of a balance.
The columns contain the postID, postTitle, postBody (text if a self-post, url if a link), postScore (as of when it was downloaded), subredditName, and subredditID. The other clusters are defined by more subtle patterns, and are less dominated by individual words.
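For anyone querying the database, the post columns listed above could be laid out roughly as follows. The column names come from the description; the table name `posts`, the column types, and the sample row are assumptions made for this sketch, so check the actual schema before relying on it.

```python
import sqlite3

# In-memory database so the sketch does not touch any real file.
conn = sqlite3.connect(":memory:")

# Assumed table name and column types; only the column names are from the post.
conn.execute(
    """
    CREATE TABLE posts (
        postID        TEXT PRIMARY KEY,
        postTitle     TEXT,
        postBody      TEXT,     -- text if a self-post, url if a link
        postScore     INTEGER,  -- score as of download time
        subredditName TEXT,
        subredditID   TEXT
    )
    """
)

# Made-up row purely for illustration.
conn.execute(
    "INSERT INTO posts VALUES (?, ?, ?, ?, ?, ?)",
    ("abc123", "Example title", "Example self-post text", 42, "examplesub", "t5_xxxxx"),
)
conn.commit()

rows = conn.execute(
    "SELECT postTitle, postScore FROM posts WHERE subredditName = ?",
    ("examplesub",),
).fetchall()
print(rows)
```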
I then used affinity propagation, a clustering algorithm based on message passing, to cluster the data in the first 3 principal components. However, finding posts within a specific time range is much trickier.
Some subreddits are known for vigorous discussion, while others simply represent a constantly updated collection of entertaining content. Cooler colors signify more similar subreddits, hotter colors subreddits that are more different. I eventually figured out that the reddit search engine accepts timestamp queries with the date provided in the Unix time format.
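Converting a date range into Unix time is straightforward with the standard library. The sketch below also splits the range into smaller windows, which is only my guess at how one might page through a long range; the dates, the six-hour step, and both helper functions are hypothetical, and the search query syntax itself is deliberately left out.

```python
from datetime import datetime, timezone

def to_unix(year, month, day):
    """Unix timestamp (seconds) for midnight UTC on the given date."""
    return int(datetime(year, month, day, tzinfo=timezone.utc).timestamp())

def time_windows(start, end, step):
    """Yield (lo, hi) pairs covering [start, end) in chunks of `step` seconds."""
    lo = start
    while lo < end:
        hi = min(lo + step, end)
        yield lo, hi
        lo = hi

# Arbitrary example dates, not the range used for the data set.
start = to_unix(2020, 1, 1)
end = to_unix(2020, 1, 2)
windows = list(time_windows(start, end, 6 * 3600))
for lo, hi in windows:
    print(lo, hi)
```

Each (lo, hi) pair could then be passed to whatever timestamp query the search endpoint accepts.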
First, there are a few bastions of blue off the diagonal. If you want that, I recommend visiting their docs.
This isn’t intended as a tutorial for PRAW. Unfortunately, we’ve yet to find a great way to visualize a high-dimensional space, so I used principal components analysis (PCA), one of the most basic forms of dimensionality reduction, to allow us to better visualize the data.