Data science projects start with quality data. You’ve heard it before: Garbage in, garbage out.
Fortunately, there are tons of free datasets online, covering everything from government and economic data to narrower topics like MLB stats and video game sales.
Whether you’re putting together a data science project to land a job, or you just want to brush up on your SQL or data analyst skills, we’ve highlighted some of our favorite sources of free data, as well as interesting datasets you can use for your next project. Here’s where to look for free datasets:
Start with These Public Data Sources
You can collect data from just about everywhere, from Wikipedia, to your own personal Facebook data. But if you don’t know what kind of data you are looking for, these dataset search engines and data repositories are the place to start:
1. Google Dataset Search
Google’s dataset search engine is super useful for finding datasets in a particular niche. This is a great starting point for both paid and free datasets from top sources around the web. Other useful Google sources: Google Trends and Google’s Public Data Directory.
2. Data.gov
Find all of the U.S. government’s free and open datasets on Data.gov. This is a rich source for public economic data – like housing, wages and inflation – as well as education, health, agriculture and census data. With more than 300,000 datasets available, you can’t miss this one.
Jay, co-founder of Interview Query, used this dataset to land his first data science job and to get 10+ emails from interested companies.
3. FiveThirtyEight
FiveThirtyEight might be best known for its data journalism. Fortunately, the site also makes most of the data it uses in its reporting open to the public. This is a great source for a wide range of data, with a particular focus on politics, sports and culture.
4. Kaggle
Kaggle is one of the most popular communities for data scientists, and the site’s user-published datasets are great for self-guided ML or analysis projects. You’ll find a wide range of data, from movie reviews to customer sales data, and fortunately most have some of the preprocessing done. This is a great source of data for sentiment analysis projects, as well as data analysis and visualization projects.
5. UCI Machine Learning Repository
Check out the University of California Irvine’s repository, which features nearly 500 public datasets. This is a great source for clean, ready-to-model data, and there’s data in a wide range of niches, from a dataset of chickenpox cases, to bank marketing data.
6. GitHub’s Awesome-Public-Datasets
This regularly updated library of datasets is a great place to start. All the data is organized by category, with options like machine learning and software, and you’ll find quick links to sources.
7. Amazon Web Services Open Data Registry
Amazon’s registry provides public access to data from a range of organizations, from the 1000 Genomes Project, to NASA. You’ll also find helpful usage examples for many of the sets, as well as project links for various organizations and groups.
8. Pew Internet
The Pew Research Center’s data repository has a major focus on culture and media. In particular, you’ll find lots of data sets and surveys covering media consumption, social media use and demographic trends like this 2018 Twitter Survey.
9. data.world
data.world calls itself a “collaborative data community,” and the site has built a dedicated audience of data scientists, who have collaborated on projects like social bot detection and data journalism. You’ll find datasets in a range of categories, from crime to Twitter.
10. COVID datasets
There’s a plethora of regularly updated COVID public data available online. Some of the best sources include: CDC COVID Data Tracker and Our World In Data. For more niche projects, try the Coronavirus Tweets Database, featuring more than 1 billion Tweets, as well as The Marshall Project’s COVID cases in prisons datasets.
Interesting Datasets for Data Science Projects
Looking for a specific type of data for your project? We’re surfacing some of the most interesting datasets to use in a wide range of data science projects, from data analysis and visualization, to machine learning and data cleaning.
Datasets for Machine Learning
Whether you want to work with predictions or classification, these datasets are super interesting and they’re great for machine learning projects. The data is relatively clean and lends itself nicely to machine learning, e.g., with plenty of variables that can help predict the target column.
- Stroke Prediction Dataset - Build a stroke prediction model with this handy dataset. The CSV contains patient information - like gender, age, pre-existing conditions, and smoking status - which can help you build a model.
- Divorce Predictors Dataset - This dataset from the UCI Machine Learning Repository contains survey data from married couples. Use the data to identify predictive indicators of divorce or to build a divorce prediction model.
- January Flight Delay Prediction Dataset - With data from more than 400,000 flights in January 2019 and January 2020, this data from the Bureau of Transportation Statistics is great for building a model of winter flight delays.
- Twitter User Gender Classification - Can you predict gender from a Twitter user's profile and tweets? Build models to answer that question with this dataset, which contains info on more than 20,000 Twitter users.
- Mushroom Classification Dataset - This classic dataset from UCI is a great source for a classification data science project. One great project idea: Build a model to identify which features best indicate a poisonous mushroom.
- Credit Card Approval Prediction Dataset - This is a great dataset for a financial prediction model. Use the data to understand if an applicant is "good" or "bad."
- Water Quality Dataset - Use water quality metrics from nearly 4,000 bodies of water to predict whether the water is safe for consumption or not.
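Before fitting a model on any of the datasets above, it can help to measure a naive baseline that the model has to beat. The sketch below is a minimal, standard-library illustration and is not tied to any one dataset: the rows and the column names (`age`, `smokes`, `stroke`) are invented for the example and are not taken from the Stroke Prediction Dataset itself.

```python
import random
from collections import Counter

# Hypothetical rows standing in for a loaded CSV. The values are invented
# purely for illustration and do not come from any real dataset.
ROWS = [
    {"age": 67, "smokes": "yes", "stroke": 1},
    {"age": 45, "smokes": "no", "stroke": 0},
    {"age": 52, "smokes": "no", "stroke": 0},
    {"age": 71, "smokes": "yes", "stroke": 1},
    {"age": 38, "smokes": "no", "stroke": 0},
    {"age": 60, "smokes": "yes", "stroke": 0},
    {"age": 29, "smokes": "no", "stroke": 0},
    {"age": 80, "smokes": "no", "stroke": 1},
]


def train_test_split(rows, test_fraction=0.25, seed=0):
    """Shuffle a copy of the rows and split it into (train, test)."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, int(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


def majority_class(rows, target):
    """Return the most common value of the target column."""
    return Counter(row[target] for row in rows).most_common(1)[0][0]


def accuracy(rows, target, prediction):
    """Fraction of rows whose target equals one constant prediction."""
    return sum(row[target] == prediction for row in rows) / len(rows)


train, test = train_test_split(ROWS)
baseline = majority_class(train, "stroke")
score = accuracy(test, "stroke", baseline)
print(f"baseline always predicts {baseline}; test accuracy = {score:.2f}")
```

A trained model is only interesting if it scores clearly above this constant-prediction baseline on the held-out rows. On imbalanced targets the baseline can look deceptively strong, which is a useful warning in itself.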
Datasets for Data Visualization
Build data visualizations with these helpful datasets. We looked for data that had potential for interesting visualizations, as well as datasets that weren’t too messy or overly complex.
- Twitter Edge Nodes Dataset - With more than 11 million nodes and 85 million edges, this dataset is useful for building graphical relationship models of Twitter users.
- Hotel Booking Demand Data - A great dataset for visualizing hotel bookings. You'll be able to build visualizations that answer questions like: When's the best time of year to book? and How long is the optimal stay length to receive the best rate?
- Amazon Top 50 Bestselling Books 2009-2019 - Design visualizations that show top authors, best-selling titles, and review ratings for the best-selling books on Amazon.
- COVID Jobs Impact & Hiring Data - Visualize the impact COVID is having on hiring with this dataset from the Amazon Open Data Registry. It features regularly updated hiring data from 3+ million jobs organizations.
- Latest Polls from FiveThirtyEight - If you're interested in political visualizations, FiveThirtyEight is one of the best data sources. Its updated polling data is great for visualizing averages and polling movements.
- U.S. International Trade in Goods and Services 1960-Present - Build charts to visualize the United States' international trade, including top imports, top exports and annual trade balances.
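For a graph-style dataset like the Twitter one in the list above, a relationship visualization usually starts from an adjacency structure and a few per-node counts. Here is a minimal sketch of that preparation step, assuming the data can be read as (follower, followed) pairs; that layout and the user names are assumptions for illustration, not a description of the actual file.

```python
from collections import Counter, defaultdict

# Hypothetical (follower, followed) pairs; the names are invented.
EDGES = [
    ("ana", "ben"),
    ("ben", "cai"),
    ("ana", "cai"),
    ("dee", "cai"),
]


def build_adjacency(edges):
    """Map each node to the set of nodes it points to."""
    adjacency = defaultdict(set)
    for source, target in edges:
        adjacency[source].add(target)
        adjacency[target]  # touching the key registers nodes with no outgoing edges
    return dict(adjacency)


def in_degrees(edges):
    """Count incoming edges per node (followers, under the assumed layout)."""
    return Counter(target for _, target in edges)


adjacency = build_adjacency(EDGES)
most_followed = in_degrees(EDGES).most_common(1)[0]
print(f"{len(adjacency)} nodes; most-followed node: {most_followed}")
```

The in-degree counts are a natural choice for node size in a network plot, and the adjacency map can be handed to whichever graph-drawing tool you prefer.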
Datasets for Exploratory Data Analysis
Say you want to take a big dataset and investigate. As you start to dive into the data, you can begin to discover patterns, trends and anomalies. These datasets are perfect for exploratory data analysis projects, because they contain lots of mostly clean data.
- Netflix Original Films & IMDB Scores - A super fun dataset to explore and great for beginners, this features all of the Netflix original movies up to June 1, 2020 and corresponding IMDb scores.
- Superstore Sales Dataset - Featuring 4 years of data from a superstore, this dataset is perfect for analyzing and identifying trends, as well as sales forecasting.
- Marketing Analytics Data - This dataset is comprised of mock marketing analytics data, used by master's in business analytics students. A great source for analysis and visualization.
- Animal Shelter Analytics Data - You'll find lots of interesting info here. This is a great dataset for surfacing actionable insights for animal shelters, including what factors led to successful outcomes for the animals.
- Why Americans Don't Vote - Non-Voter Data - Another FiveThirtyEight dataset, this one features survey data from non-voters in the U.S. A few project ideas: Identify key factors that result in non-voting, or build a voting likelihood model.
- Website Crawling Data - A sprawling dataset from Amazon, the Common Crawl corpus features crawling data from billions of websites. Check out the Example Projects page for ideas.
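A first pass at exploratory analysis often amounts to summarizing a column and checking for values that stand out. The following is a minimal sketch using only Python's standard library. The inline CSV, its column names (`month`, `sales`), and the 1.5 × IQR cutoff are assumptions chosen for illustration and are not drawn from any dataset listed above.

```python
import csv
import io
import statistics

# Made-up monthly figures standing in for a real CSV file.
SAMPLE_CSV = """month,sales
2020-01,120
2020-02,135
2020-03,128
2020-04,410
2020-05,131
2020-06,125
"""


def load_column(csv_text, column):
    """Read one numeric column from CSV text as a list of floats."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [float(row[column]) for row in reader]


def summarize(values):
    """Basic descriptive statistics for a numeric column."""
    return {
        "count": len(values),
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "stdev": statistics.stdev(values),
    }


def flag_outliers(values, k=1.5):
    """Return values outside [Q1 - k*IQR, Q3 + k*IQR] (Tukey's fences)."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]


sales = load_column(SAMPLE_CSV, "sales")
print(summarize(sales))
print("possible anomalies:", flag_outliers(sales))
```

The quartile-based rule is used here instead of a z-score because a single extreme value inflates the standard deviation on small samples, which can hide the very point you are looking for.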
Datasets for Natural Language Processing
There are plenty of large datasets that are great for sentiment analysis and natural language processing projects. Data like movie reviews, Tweets, Reddit comments and more are all great for these types of projects.
- Reddit Vaccine Myths Dataset - An interesting dataset for performing sentiment or text analysis, this features thousands of posts from the popular subreddit Vaccine Myths.
- Wikibooks Dataset - There are more than 270,000 book chapters in 12 languages in this dataset. It's perfect for performing a wide range of NLP tasks, like text parsing, text generation or semantic analysis.
- Spam Clickbait Headlines Catalogue - Featuring 3+ million headlines from the now-defunct tabloid The Examiner, this is a great place to start an NLP news analysis project.
- TripAdvisor Hotel Reviews - Explore thousands of hotel reviews from TripAdvisor and build semantic prediction or topic clustering models.
- A Million News Headlines - Another helpful medium source, this features headlines from nearly 20 years. It's a great dataset for performing LSA or LDA tasks.
- Disneyland Reviews Data - With more than 40,000 reviews from three Disneyland locations, this is a great data source for performing sentiment analysis.
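To make the sentiment analysis idea concrete, here is a minimal lexicon-based sketch: tokenize a review, count words from small positive and negative word lists, and label the text by the difference. The word lists and sample reviews are hypothetical and far too small for real use. A practical project would more likely train a model on labeled reviews, so treat this only as an illustration of the basic mechanics.

```python
import re
from collections import Counter

# Tiny hand-made word lists, invented for this example only.
POSITIVE = {"great", "love", "amazing", "clean", "friendly"}
NEGATIVE = {"bad", "dirty", "rude", "terrible", "crowded"}


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z']+", text.lower())


def sentiment_score(text):
    """Positive-word count minus negative-word count."""
    counts = Counter(tokenize(text))
    positive = sum(counts[word] for word in POSITIVE)
    negative = sum(counts[word] for word in NEGATIVE)
    return positive - negative


def label(text):
    """Map a score to 'positive', 'negative' or 'neutral'."""
    score = sentiment_score(text)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


for review in [
    "The staff were friendly and the room was clean",
    "Rude staff and a dirty room",
]:
    print(label(review), "|", review)
```

Word counting like this ignores negation ("not great") and context, which is exactly the gap that the larger review datasets above let you explore with learned models.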
Datasets for Computer Vision
Whether you're looking into image processing or speech recognition, these datasets will help you practice your deep learning skills. These are great image and audio datasets.
- VoxCeleb Speech Corpus - The VoxCeleb large-scale dataset features audio-visual data from 7,000 speakers. It's a great dataset for performing emotion recognition, speaker recognition or talking face synthesis.
- Face Mask Detection Database - There are about 900 images in this dataset of people wearing face masks. Build models to detect if someone is wearing a mask, not wearing a mask, or wearing a mask improperly.
- Unsplash Open Library - This rich visual-text dataset is loaded with helpful information. Use the photos for object detection. A bonus: There are also millions of keywords and other metadata you can use for EDA projects.
- CheXpert: Chest X-Rays from Stanford AIMI - This dataset from Stanford features 200,000+ chest radiographs. Build a model to detect pathologies and see how well your model performs against radiologists.
- Pokemon Images and Types - There are thousands of images of Pokemon characters in this dataset. Use the data to build a prediction model to determine the type of Pokemon based on the image.
- ImageNet Image Database - ImageNet is a classic image dataset from Stanford with more than 14 million images. This is one of the best for performing object recognition tasks.
Thanks for Reading!
Build your data science coding skills with our Data Science Course, with modules in machine learning, modeling, Python and SQL.