Data Sources!



If you are interested in getting started in analytics, the first thing you need is data, and lots of it. Fortunately, there are tons of open data sets available for public use! These sources are great for deriving insights on your own or for supplementing an existing data set. And when you interview for an analytics position, being able to point to past work you have done on public data is a huge plus.

This site is a massive trove of government data, much of which has been released in response to the Freedom of Information Act. There are data sets available for nearly every category imaginable. Some will fit in Excel; others are too large and must be downloaded as comma-separated or tab-separated files (CSV/TSV).
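When a download is too big for Excel, you can still get a first look at it by streaming it one row at a time. Here is a minimal sketch using only Python's standard library. The helper name `count_by_column` and the sample columns (`agency`, `year`, `amount`) are made up for illustration and do not come from any particular data set.

```python
import csv
import io
from collections import Counter


def count_by_column(lines, column, delimiter=","):
    """Stream a delimited file row by row and tally the values in one column."""
    reader = csv.DictReader(lines, delimiter=delimiter)
    counts = Counter()
    for row in reader:  # only one row is held in memory at a time
        counts[row[column]] += 1
    return counts


# Made-up sample standing in for a large download. With a real file you
# would pass open("your_file.tsv", newline="") instead of the StringIO.
sample_tsv = io.StringIO(
    "agency\tyear\tamount\n"
    "Parks\t2014\t100\n"
    "Transit\t2014\t250\n"
    "Parks\t2015\t120\n"
)
agency_counts = count_by_column(sample_tsv, "agency", delimiter="\t")
print(agency_counts.most_common())
```

For a CSV, leave `delimiter` at its default; for a TSV, pass `"\t"` as shown.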


This is primarily a supplemental database with information on every ZIP code in the United States and its territories. It also includes approximate latitude and longitude coordinates for geographic visualizations. If you need more precise geocoding of individual addresses, Google has an API (application programming interface) that lets you geocode 2,500 addresses daily for free. Larger-scale geocoding will tend to cost you some $$$.
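Supplementing your own data with a ZIP code table usually comes down to a simple lookup. The sketch below assumes the table has columns named `zip`, `latitude`, and `longitude`; those names, the helper functions, and the sample rows (with rough, illustrative coordinates) are all assumptions, so adjust them to match whatever file you actually download.

```python
import csv
import io


def load_zip_lookup(lines):
    """Build {zip: (lat, lon)} from a CSV with zip, latitude, longitude columns."""
    lookup = {}
    for row in csv.DictReader(lines):
        # Keep the ZIP code as text so leading zeros survive (e.g. "02134").
        lookup[row["zip"]] = (float(row["latitude"]), float(row["longitude"]))
    return lookup


def add_coordinates(records, lookup):
    """Return copies of the records with lat/lon attached (None when unknown)."""
    enriched = []
    for record in records:
        lat, lon = lookup.get(record["zip"], (None, None))
        enriched.append({**record, "latitude": lat, "longitude": lon})
    return enriched


# Illustrative rows only; a real ZIP code table has tens of thousands of rows.
zip_csv = io.StringIO(
    "zip,latitude,longitude\n"
    "02134,42.36,-71.13\n"
    "90210,34.10,-118.41\n"
)
customers = [
    {"name": "A", "zip": "02134"},
    {"name": "B", "zip": "99999"},  # not in the table
]
enriched = add_coordinates(customers, load_zip_lookup(zip_csv))
```

The one thing worth remembering here is to treat ZIP codes as text. If they get read as numbers, codes that start with zero lose their leading digit and stop matching.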


This is a site dedicated to machine learning data sets. The data is geared toward building "classification" models, and most data sets have a primary field that you can attempt to classify correctly.
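Before building any model on one of these data sets, it helps to separate the field you are trying to predict from everything else and check how well you would do by always guessing the most common class. This is only a sketch of that first step, not a real classifier. The column name `species`, the helper names, and the sample rows are made up for illustration.

```python
from collections import Counter


def split_features_target(rows, target):
    """Separate the field to predict (the target) from the remaining fields."""
    features = [{k: v for k, v in row.items() if k != target} for row in rows]
    labels = [row[target] for row in rows]
    return features, labels


def majority_baseline(labels):
    """Return the most common label and the accuracy of always guessing it."""
    label, count = Counter(labels).most_common(1)[0]
    return label, count / len(labels)


# Made-up rows; a real data set would be loaded from a downloaded file.
rows = [
    {"petal_length": 1.4, "species": "setosa"},
    {"petal_length": 1.5, "species": "setosa"},
    {"petal_length": 1.3, "species": "setosa"},
    {"petal_length": 4.7, "species": "versicolor"},
]
features, labels = split_features_target(rows, target="species")
guess, accuracy = majority_baseline(labels)
print(guess, accuracy)
```

Any model you train afterward should beat this baseline accuracy, otherwise it has not learned anything useful from the other fields.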



To illustrate how easy it is to access these data sets, I built this visualization of volcanic eruptions in Tableau, using Smithsonian data going back up to 10,000 years! The darker red the dot, the bigger the eruption:
