Wolters-data

2016-08-25	Original version.
2017-12-20	Moved information about software to a new page.

In thinking about the role that data plays in statistical research in academia, I decided that most statistics papers can be described by one of the following three caricatures:

1. Theory. Focused on fundamental results using theorem and proof. Data is not necessary; adding data would only sully the abstract mathematical purity of the results. Simulated data may be tolerated.
2. Methodology. Concerned about advancing the state of the art in real-world data analysis tasks. To show that it’s really state-of-the-art, a lot of simulation comparisons with other methods are needed. To show that it’s useful, an analysis of a “real” data set is required.
3. Application. The result of collaboration. There’s a particular collection of data that somebody really wanted to use to get an answer to a question. Getting results in the subject-matter domain is more important than having novelty in the methods.

Data sets are important to someone who, like me, sees themself working on research of types 2 and 3. For work that is more methodologically oriented, finding good data can be hard. At the same time, demonstrating your method on an interesting and real(istic) data set can be a huge help in increasing the perceived relevance of what you’re doing.

So what makes a “good” data set for methodological work? Of course it needs to be well suited to the type of model or analysis you’re working with, but ideally it is also scientifically interesting, potentially important, and available in a ready-to-analyze state.

This last feature is probably the most important. Raw data from most sources is messy, disorganized, and full of errors and special cases. It takes a lot of effort to “clean” such data and process it to the point where it’s suitable for analysis—an amount of effort that is usually prohibitive if you only need the data to demonstrate a new data analysis idea.

The current trend toward open sharing of data sets online is a great development. In many cases, the data sets took a great deal of time and effort to obtain. Hopefully as time goes on, more mechanisms will arise to reward researchers for sharing these important products of their research.

Below are some data repositories, data sets, and related links I’ve come across online. The collection is mainly for my own benefit, but I thought that putting it here might make these resources a little more visible. At initial writing the list isn’t long. I will update it as time goes on.

Where you can publish data and get credit

There are already a considerable number of “data journals” in various fields of study. For example, this blog post and this one list many of them. There is also a Wikipedia article on data publishing. Some general-purpose publication targets are:

Scientific Data. A Nature Group journal for publishing “data descriptor” articles that explain your data set. The actual data must be hosted in a third-party open-access repository.
PLOS ONE. Accepts articles describing software or databases.

Lists of repositories

re3data. “re3data.org is a global registry of research data repositories that covers research data repositories from different academic disciplines.” You can browse repositories by subject area.
Open Access Directory repository list. Open Access Directory is a wiki for all things related to open access scholarship.
List of repositories recommended by Scientific Data. A long list of repositories for data in natural, biological, and environmental sciences.
Elsevier Data Search. A search engine for finding data from academic papers. It seems like a good concept: you can preview the figures and tables from the paper before deciding to click through to the original data source. Searching is done by the scientific content/description of the data, not by its statistical characteristics.

Individual repositories

Machine learning

UCI Machine Learning Repository. A good place to look for data related to classification, clustering, regression, variable selection, and so on. Some of the data sets are large enough to be called “big data.”
Kaggle. Organized around data science competitions. Each competition centers on an interesting data set.

Satellite imagery

LAADS Web. One of many NASA portals for obtaining data from Earth-orbiting satellites. There is a learning curve involved in using their system and understanding the data. But see the next entry…
EOSDIS Worldview. A NASA website that allows you to browse satellite imagery in a more user-friendly way. You can view and download various calculated values derived from image data.

Specific data sets