Mark A. Wolters

Some links to data sets and data repositories

2016-08-25 Original version.
2017-12-20 Moved information about software to a new page.

  1. Introduction
  2. Where you can publish data and get credit
  3. Lists of repositories
  4. Individual repositories
    1. Machine learning
    2. Satellite imagery
  5. Specific data sets


Introduction

In thinking about the role that data plays in statistical research in academia, I decided that most statistics papers can be described by one of the following three caricatures:

  1. Theory. Focused on fundamental results using theorem and proof. Data is not necessary; adding data would only sully the abstract mathematical purity of the results. Simulated data may be tolerated.

  2. Methodology. Concerned with advancing the state of the art in real-world data analysis tasks. To show that it's really state-of-the-art, a lot of simulation comparisons with other methods are needed. To show that it's useful, an analysis of a "real" data set is required.

  3. Application. The result of collaboration. There's a particular collection of data that somebody really wanted to use to get an answer to a question. Getting results in the subject-matter domain is more important than having novelty in the methods.

Data sets are important to someone who, like me, sees themself working on research of types 2 and 3. For work that is more methodologically oriented, finding good data can be hard. At the same time, demonstrating your method on an interesting and real(istic) data set can be a huge help in increasing the perceived relevance of what you're doing.

So what makes a "good" data set for methodological work? Of course it needs to be well suited to the type of model or analysis you're working with, but ideally it is also scientifically interesting, potentially important, and available in a ready-to-analyze state.

This last feature is probably the most important. Raw data from most sources is messy, disorganized, and full of errors and special cases. It takes a lot of effort to "clean" such data and process it to the point where it's suitable for analysis, an amount of effort that is usually prohibitive if you only need the data to demonstrate a new data analysis idea.

The current trend toward open sharing of data sets online is a great development. In many cases, the data sets took a great deal of time and effort to obtain. Hopefully, as time goes on, more mechanisms will arise to reward researchers for sharing these important products of their research.

Below are some data repositories, data sets, and related links I've come across online. It's a collection for my own benefit, but I thought that putting it here might increase the visibility of these resources a tiny bit. At initial writing the list isn't long. I will update it as time goes on.

Where you can publish data and get credit

There already seem to be a considerable number of "data journals" in various fields of study. For example, this blog post and this one list many of them. There is a Wikipedia article on data publishing. Some general-purpose publication targets are:

Lists of repositories

Individual repositories

Machine learning

Satellite imagery

Specific data sets