WallStreetBets (WSBs) had been slowly growing in popularity since its inception in late January 2012. Then the GME saga of early 2021 hit, which saw WSBs’ user base more than triple to roughly 9.5 million “Degenerates” in little over a month. As the popularity of WSBs has increased, some have turned to the subreddit for investing advice. Although basing decisions solely on WSBs posts is a terrible idea, considering it as part of a wider analysis can be extremely valuable.
Whilst you could manually analyse posts, automating the process can save a lot of time. There are two main ways of getting data from Reddit: using the official API, and scraping. There are pros and cons to both methods, the most notable being the limits on the number of API calls you can make, and the comparatively complicated approach needed to properly utilise scraping. Although this article focuses on using the API, if you choose to scrape instead, make sure that you scrape “Old Reddit” rather than the newer homepage.
In short, the code outlined in this article downloads the most recent posts from the WSBs Reddit page, and loads all of the comments associated with those posts. Next, tickers are extracted from both comments and posts, and the sentiment of each is determined. The most dominant tickers are noted, and the sum sentiment associated with each post is stored. This information is then outputted to a series of CSVs.
Before we dive into aspects of the code, note that this isn’t a comprehensive coding tutorial; the full code can be found on my GitHub. Rather, I’m going to outline the main aspects of the code in the hope it will assist you in modifying it to fit your use case; it could even be applied to an entirely different subreddit.
Working with the Reddit API
On the whole, the Reddit API is very easy to use. To get all of the data we need, we can use simple GET requests.
We need to make two separate calls to the API: the first to get the most recent posts on the WSBs subreddit, and the second to get the comments (or ‘replies’) made to each of those posts. All of the API endpoints are outlined in the API documentation.
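The original gist isn’t reproduced here, but a minimal sketch of the three helpers might look like the following. The subreddit name, version string, and user-agent are placeholders you should adapt to your own app:

```python
import requests


def request_data(url):
    # Reddit asks for a descriptive, unique user-agent; generic ones risk a block
    headers = {"user-agent": "python:wsb-sentiment:v0.1 (by /u/your_username)"}
    response = requests.get(url, headers=headers)
    response.raise_for_status()
    return response.json()


def get_recent_posts(subreddit="wallstreetbets", count=100):
    # The /new listing returns up to 100 of the subreddit's latest posts
    url = f"https://www.reddit.com/r/{subreddit}/new.json?limit={count}"
    return request_data(url)


def get_post_comments(post_id, subreddit="wallstreetbets"):
    # The /comments endpoint returns the post plus (almost) all of its comments
    url = f"https://www.reddit.com/r/{subreddit}/comments/{post_id}.json"
    return request_data(url)
```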
In the above example there are three functions: request_data(), get_recent_posts(), and get_post_comments().
request_data() is used to download the content of the URL endpoint using the Python Requests library. Pay particular attention to the user-agent portion of the headers variable. Reddit ask that users don’t use generic user-agents (such as ‘Mozilla/5.0’) when accessing the API, and threaten to block individuals who do. Instead, you should use a custom user-agent taking the form
<platform>:<app ID>:<version string> (by /u/<reddit username>)
get_recent_posts() formulates a URL to grab the most recent posts from the subreddit. The endpoint will return between 1 and 100 results, depending on the user’s requirements (specified using the count variable). The URL is then fed to request_data() in the final line, and the result returned as JSON.
get_post_comments() is similar to get_recent_posts() in that it constructs a URL which when used in conjunction with request_data() allows us to retrieve data from the Reddit API as JSON. The URL constructed in get_post_comments() returns (almost) all of the comments associated with that post.
You’ll likely have noticed the presence of “almost” in the previous paragraph. Whilst the comment endpoint returns many comments, when Reddit determines there are too many to return at once, it holds back a portion. These must then be accessed using a different POST endpoint. As we’re only interested in gaining a broad understanding of sentiment, and Reddit prioritises popular comments, these additional comments are ignored.
Extracting Information from Reddit API Results (JSON)
The JSON returned from both the posts and comments endpoints is nested, which means we need to loop over the data. However, extracting all of the important information from the comments JSON is much more challenging than utilising the data on posts.
To work with the posts data we only need a single function, and an additional loop.
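The gist itself isn’t embedded here, but a sketch of the parsing step might look like this. The fields kept by parse_post() are illustrative (the real code may retain different ones), and the API call is factored out so the example is self-contained:

```python
def parse_post(post_json):
    # Extract only the elements we're interested in, keyed by the post's ID
    data = post_json["data"]
    return {
        "id": data["id"],
        "title": data["title"],
        "selftext": data["selftext"],
        "score": data["score"],
    }


def parse_posts(listing_json):
    # The listing returned by the posts endpoint nests posts under data -> children
    return [parse_post(child) for child in listing_json["data"]["children"]]
```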
In this code gist you can see that we make a call to get_recent_posts(), and then process subsections of that JSON using the parse_post() function. The parse_post() function extracts only the elements we are interested in, and returns the values (along with a key) as a Python dictionary.
As previously stated, handling the comments is more complicated. Rather than using one function to parse the data, we rely on a set of functions forming a class.
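That class isn’t shown inline, but a simplified sketch using the same method names could look like the following. The exact fields and call order are assumptions; the recursion over “replies” is the important part:

```python
class CommentParser:
    """Flattens the nested comment JSON into a list of comment dictionaries."""

    def __init__(self):
        self.comments = []

    def get_comment_info(self, comments_json):
        # The comments endpoint returns [post listing, comment listing]
        for child in comments_json[1]["data"]["children"]:
            if child["kind"] == "t1":  # "t1" marks a comment; skip "more" stubs
                self.manage_reply(child)
        return self.comments

    def convert_comment_dict(self, comment_data):
        # Keep only the items we're interested in, keyed by the comment's ID
        return {
            "id": comment_data.get("id"),
            "body": comment_data.get("body"),
            "score": comment_data.get("score"),
        }

    def manage_reply(self, child):
        # Record this comment, then descend into any replies it has
        data = child["data"]
        self.comments.append(self.convert_comment_dict(data))
        self.extract_relevant(data.get("replies"))

    def extract_relevant(self, replies):
        # "replies" is an empty string when there are none, else a nested listing
        if not replies:
            return
        for child in replies["data"]["children"]:
            if child["kind"] == "t1":
                self.manage_reply(child)
```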
Hopefully it’s apparent from the code what’s going on. The main steps are as follows: get_comment_info() is called, and provided with the JSON returned from the call to get_post_comments() in our RedditAPI class. This data is then worked through. The next step is performed by convert_comment_dict(), which extracts the items from the JSON that we’re interested in and returns them as a dictionary, similar to our parse_post() function. manage_reply() is then called (which itself calls extract_relevant()) to grab the relevant portions from the potentially deeply nested JSON. All of this means that no matter how long the chain of comments, we acknowledge every reply.
Finding Tickers in WallStreetBets JSON Data
Tickers are important to our application as they tell us which stocks are being discussed. When considered alongside our assessment of sentiment (outlined shortly) they can advise us which stocks to read up on.
Before we can do anything we need a list of tickers to work with. Fortunately, Nasdaq provide CSVs containing tickers for all of the major American exchanges; simply download the CSVs for the exchanges you’re interested in.
Once the CSVs are downloaded to a local directory we need to load a single column from each of them into our code — the rest can be discarded. The following Ticker class handles all aspects of ticker identification, from CSV loading to entry matching.
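A possible sketch of that class is below. The “Symbol” column name matches Nasdaq’s screener export but may differ in your download, and the banned_tags list here is heavily abbreviated for illustration:

```python
import csv


class Ticker:
    # Words that double as valid tickers; excluded to cut false positives.
    # This list is far shorter than a real one would need to be.
    banned_tags = ["YOLO", "GOOD", "ALL", "ARE", "FOR", "NOW", "ONE"]

    def __init__(self, csv_paths):
        self.tickers = self.load_tickers(csv_paths)

    def load_tickers(self, csv_paths):
        # Load only the ticker column from each Nasdaq CSV; discard the rest
        tickers = []
        for path in csv_paths:
            with open(path, newline="") as f:
                for row in csv.DictReader(f):
                    symbol = row["Symbol"].strip()
                    if symbol and symbol not in self.banned_tags:
                        tickers.append(symbol)
        return tickers

    def check_for_tickers(self, text):
        # Compare case-insensitively and strip "$" and punctuation, since
        # WSBs posters rarely follow tidy conventions
        found = []
        for word in text.split():
            word = word.strip("$.,!?:;()").upper()
            if word in self.tickers and word not in found:
                found.append(word)
        return found
```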
load_tickers() is used to load in our CSV files, and convert them to a Python list.
To match and extract any tickers in a given string we call check_for_tickers(), providing both a list of tickers (sourced from the Nasdaq CSVs) and the string we want to check.
Whilst most of the code is self-explanatory, the banned_tags list deserves a comment. Owing to their short length and alphabetical nature, tickers can sometimes take the form of words or acronyms. Take YOLO, which is both an acronym and a cannabis ETF. GOOD is even a ticker representing Gladstone Commercial Corporation. I’m sure there are some readers thinking we could rely on symbols preceding the ticker (e.g. ‘$’) or simply ignore any suspected tickers if they’re not uppercase — perhaps even a combination of the two. My only retort to that argument is “have you ever been on WSBs?”.
Getting the Sentiment of Posts and Comments on WallStreetBets using NLTK
Natural Language Toolkit (NLTK) is “a suite of libraries and programs for symbolic and statistical natural language processing”. By using some of its built-in functions we can rapidly analyse the sentiment of messages.
Before we look at the code, it’s worth noting that the model included in the code is not optimised for use on WSBs, or on financial information generally. Rather, the model was designed to assess the sentiment of Tweets. If you’re looking to improve on the vanilla code posted on GitHub, then altering the sentiment analysis portion would be a valuable place to start.
When the user inputs a string to the NLTK model, a “positive” or “negative” result is returned. Whilst this may seem rudimentary, we can use it to determine whether a discussion is positive or negative, guiding our further reading. Although modified, this code is based on Daityari’s example on the DigitalOcean Community pages. For more information on coding specifics, I highly recommend checking their article out.
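The repository trains a Naive Bayes classifier on NLTK’s twitter_samples corpus, following that tutorial. The toy sketch below demonstrates the same mechanics with a handful of hand-written WSBs-flavoured examples; the training sentences and the get_sentiment() helper are purely illustrative, not the repository’s code:

```python
from nltk.classify import NaiveBayesClassifier


def features(text):
    # Simple bag-of-words features: token presence only
    return {word.lower(): True for word in text.split()}


# A tiny hand-written training set; the real code trains on NLTK's
# twitter_samples corpus, so treat these labels as purely illustrative
train = [
    (features("to the moon diamond hands buy"), "positive"),
    (features("calls printing huge gains"), "positive"),
    (features("puts tanking bagholder sell"), "negative"),
    (features("lost everything worst crash"), "negative"),
]
classifier = NaiveBayesClassifier.train(train)


def get_sentiment(text):
    # Returns "positive" or "negative", mirroring the model described above
    return classifier.classify(features(text))
```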
Multiple CSVs are generated as outputs. One CSV lists information on posts, whereas the others outline the comments associated with each post.
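A hypothetical sketch of the output step using Python’s csv module; the column names are illustrative rather than the repository’s actual schema:

```python
import csv


def write_posts_csv(posts, path="posts.csv"):
    # One row per post: the tickers found and the summed sentiment
    fieldnames = ["id", "title", "tickers", "sentiment"]
    with open(path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(posts)
```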
Last Few Thoughts…
Hopefully the ease with which Reddit posts can be downloaded, and their sentiment extracted, is now apparent. The full code can be found on my GitHub account. Just remember that the code is not production ready, and I can’t make any guarantees about its accuracy!
I’d love to hear your thoughts, and any suggestions for improving this basic code. There’s clearly room to hook a variation of this code up to a database, and to perform more intensive NLP using a more appropriate training dataset with TensorFlow/PyTorch.