Dataset of Historical Reddit Archive

Overview

The Reddit Archive Dataset was collected and compiled in the month of August 2023. This dataset includes both reddit submissions (Posts) and comments on the interval (2015 - 2022, with plans to stage 2005-2014). The database is hosted on Amazon Web Services and is accessible via AWS Athena and queryable using PrestoSQL. Please e-mail the ASC IT HelpDesk for more information about gaining access via Pennkey SSO.

Getting Started 🤲

Go into aws, access athena, start querying...

Limitations 🤔

Querying is not free, and there is a nominal cost associated with each query. The ASC IT Department sets thresholds for amount of data queried per day, which totals to $5 queried OR 1TB of data scanned. If these limitations are a burden, please do not hesitate to reach out and we can review the usecase for an appropriate solution.

Tips & Tricks 🪄

When querying, it is best to use the partitioned columns in the `WHERE` clause. This will greatly reduce the search space and speed up the query. Notably, the data on AWS S3 is multi-partitioned along `YEAR` and `MONTH` integer values.

For more information on multi-partitioning in AWS Athena, please see here: Athena Multi-Partioning of Data

Legality & Terms of Usage 👩‍⚖️

Please refer to Reddit's Content Policy and the Reddit Data API Terms for more information on the legality of using this data. On consultation with UPenn OIS, it was determined that the data is not sensitive given it's public social media content and therefore can be used for private and well-defined research purposes. The data cannot under any circumstances be used for personal or private gain.

Metadata

Below are schematas for both the `comments` and `submissions` tables. Please click the respective dropdowns to view the SQL table setup & schema

Athena Table Schemas for Comments:

Using sql keyword `Describe` we see the schema for table parquet_reddit_comments


              >>> `DESCRIBE parquet_reddit_comments` 

              created_utc         	timestamp
     	                    
              author_flair_css_class	string
    	                    
              score               	int 
     	                    
              subreddit           	string
    	                    
              link_id             	string 

              subreddit_id        	string  
             	                    
              controversiality    	int   
               	                    
              body                	string   
            	                    
              retrieved_on        	date    
             	                    
              distinguished       	string    
           	                    
              gilded              	int      
            	                    
              id                  	string    
           	                    
              parent_id           	string    
           	                    
              edited              	string    
           	                    
              author_flair_text   	string    
           	                    
              author              	string   
            	                    
              day                 	int    
              	                    
              year                	int    
              	                    
              month               	int    
              	                    
              

              # Partition Information	 	  

              # col_name            	data_type           	comment              

                
              year                	int    

              month               	int

Athena Table Schemas for Submissions:

Using sql keyword `Describe` we see the schema for table parquet_reddit_submissions


            >>> `DESCRIBE parquet_reddit_submissions` 

            author              	string              	                    

            created_utc         	timestamp           	                   
 
            edited              	string              	                   
 
            parent_id           	string              	                   
 
            id                  	string              	                   
 
            gilded              	int                 	                   
 
            distinguished       	string              	                   
 
            retrieved_on        	date                	                   
 
            title               	string              	                   
 
            controversiality    	int                 	                   
 
            subreddit_id        	string              	                   
 
            link_id             	string              	                   
 
            sticked             	string              	                   
 
            subreddit           	string              	                   
 
            ups                 	int                 	                   
 
            score               	int                 	                   
 
            day                 	int                 	                   
 
            year                	int                 	                   
 
            month               	int                 	                   
 
            

            # Partition Information	 	
 
            # col_name            	data_type           	comment 
            
                
            year                	int                 	              
      
            month               	int

Sanity Checking 😵‍💫

Given that this dataset is quite large and provided from a public academic torrent, there is no way to verify the integrity of the data by the ASC IT Department. The service is provided as-is and provides no guarantees on the accuracy of the data or warranty of fitness.

Public Reference from Reddit

For reference, Reddit provides a year-end-review for the calendar year 2020 which is publicly accessible here: Reddits-2020-year-in-review

Reddit states in the review they had 303.4 million posts & 2 billion comments

Our Database suggests we have approximately 326.7 million posts and 2.1 billion comments. It's unclear if Reddit's public year end review is undercounting however as a sanity check these results are reassuring.

ASC Reference - Sample Query for Unique Values in 2020:

SELECT COUNT(*) as total_rows
              FROM ascdata.parquet_reddit_submissions
              WHERE year = 2020;

SELECT COUNT(*) as total_rows
              FROM ascdata.parquet_reddit_comments
              WHERE year = 2020;

Sample Queries 🧑‍💻

TBD THIS SECTION NEEDS WORK! - 03/21/2024 Etienne P Jacquot

The provided query retrieves data from the "parquet_reddit_comment_test" table. It selects the "created_utc" and "body" columns where the text in the "body" column contains the word "database" (case-insensitive) and the corresponding "year" value is greater than or equal to 2015. The query returns a maximum of 10 results meeting these criteria.
SELECT "created_utc","body" FROM "parquet_reddit_comment_test" WHERE lower(body) LIKE '%database%' AND "year" >= 2015 limit 10;
The provided query retrieves data from the "parquet_reddit_submissions" table within the "ascdata" database. It selects all columns where the "author" column matches the exact string "Lotushope", the "year" column is equal to 2022, and the "id" column contains the substring "rxsuj4". The query returns a maximum of 10 results meeting these criteria.
SELECT * FROM "ascdata"."parquet_reddit_submissions" WHERE author like 'Lotushope' AND year=2022 and id like '%rxsuj4%' limit 10;
The provided query retrieves data from the "parquet_reddit_comments" table within the "ascdata" database. It selects all columns where the "year" column is equal to 2022 and the "link_id" column matches the exact string "t3_wp7j78". The query returns a maximum of 10 results meeting these criteria.
SELECT * FROM "ascdata"."parquet_reddit_comments" WHERE year=2022 AND link_id like 't3_wp7j78' limit 10

Substring counting across time with substr `database`
Below This SQL query analyzes the "parquet_reddit_comments" table in the "ascdata" database. Focused on the year 2020, it counts comments containing the term "database" (case-insensitive) and groups the count by month. The top 10 months by comment count are displayed.
SELECT "month",count(*) AS agg FROM "ascdata"."parquet_reddit_comments" WHERE lower(body) LIKE '%database%' and "year" = 2020 GROUP BY "month" limit 10

Subbreddit Data Comments
Below This SQL query operates on the "parquet_reddit_comments" table in the "ascdata" database. It targets the year 2020, counting comments for each "subreddit" and labeling the count as "agg." The results are grouped by "subreddit" and sorted in ascending order based on the "agg" count.
select "subreddit", count(*) as agg FROM "ascdata"."parquet_reddit_comments" where "year" = 2020 group by "subreddit" order by agg asc