CLLD – Cross-Linguistic Linked Data

Citing CLLD Databases and Reproducible Research

posted Monday, July 28, 2014 by Robert Forkel

ZENODO's integration with GitHub provides a service that brings the citability and usability of CLLD databases for reproducible research to a new level.

While creating and sharing research data has been common practice for a long time, there still does not seem to be an established culture for citing it. Yet this is all the more important, because citation is the currency of research: getting cited is how a researcher accumulates recognition.

The CLLD project approaches this problem pragmatically, by emulating the world of traditional publications; e.g. CLLD databases

  • are regarded as publications,
  • stress their similarity to known publication formats such as edited volumes or chapters in books,
  • recommend a particular citation format.

But electronic databases typically face the "moving target" problem: ideally, databases are updated, enlarged and generally maintained. So how can we make sure that a specific state of the database can be cited? Again, the traditional publication model offers examples of how to deal with this problem: lists of errata or new editions. And again, CLLD databases often use one or both of these approaches.

Unfortunately this only raises a new problem: how to keep older editions accessible? From a technical point of view, this is largely a user interface problem. It is not too hard to teach relational databases to store older states or some sort of history. But providing access to this data while funnelling all non-specific requests to the newest version is not trivial, as can be seen in discussions about the usability of revision control software.

Backups or data dumps are a very pragmatic solution to this problem. They represent a reasonable trade-off: the newest version stays easily accessible, while access to older versions is left to the technically skilled.

This is even more so for complex datasets like WALS, where no single data dump format satisfies all requirements:

  • portability (database-native SQL dumps typically aren't portable);
  • comprehensiveness (is there something like a BLOB in RDF?);
  • self-explanatoriness (is a WALS chapter description metadata or part of the data?);
  • ease of creation and of consumption.

Arguably, for complex datasets which are typically accessed through a web application, the application itself dictates how to interpret the data and thus forms part of the metadata of the dataset.

Since the source code of the web application for most CLLD databases is kept in git repositories on GitHub, it seems natural to store the data in the same repository. To make this easier, the clld toolkit provides functionality to dump a database as a set of related, valid CSV files, described by metadata in JSON table schema format. Such a data dump can then be created for each edition of the database and added to the repository.
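To illustrate what such a dump looks like, here is a minimal sketch in plain Python; this is not the clld toolkit's actual code, and the table, its columns and the file names are made up for the example:

    import csv
    import json

    # Hypothetical rows of a "language" table:
    rows = [
        {"id": "apw", "name": "Apurina", "latitude": -9.0, "longitude": -67.0},
        {"id": "eng", "name": "English", "latitude": 53.0, "longitude": -1.0},
    ]

    # Write the table as a valid CSV file:
    with open("language.csv", "w", newline="") as fp:
        writer = csv.DictWriter(fp, fieldnames=["id", "name", "latitude", "longitude"])
        writer.writeheader()
        writer.writerows(rows)

    # Describe the columns in JSON table schema notation, so consumers can
    # interpret the CSV file without guessing types:
    schema = {
        "fields": [
            {"name": "id", "type": "string"},
            {"name": "name", "type": "string"},
            {"name": "latitude", "type": "number"},
            {"name": "longitude", "type": "number"},
        ],
        "primaryKey": "id",
    }
    with open("language.csv-metadata.json", "w") as fp:
        json.dump(schema, fp, indent=2)

The point of the pairing is that the CSV file stays trivially consumable by any tool, while the JSON schema carries just enough structure (types, keys, relations between files) to reassemble the relational database.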

This approach has now become a lot more attractive through the integration of GitHub with ZENODO. ZENODO is a research data archive funded by the EU and operated by a team at CERN. The integration with GitHub means that whenever a repository on GitHub (which is registered with ZENODO) gets a new release, a snapshot of this repository is archived with ZENODO.

Archiving with ZENODO also means integration with the overarching ecosystem of scholarly publications:

  • A DOI is assigned to the archive, making citation easier and more uniform: 10.5281/zenodo.11040.
  • Metadata of the archive can be harvested by scholarly search engines.
  • ZENODO is listed in meta-archives and data portals like re3data.

To also tackle the ease-of-use problem of such archives from the data consumer's perspective, the clld toolkit again exploits the fact that such an archive contains both the dataset and the source code of the application serving it: Creating a fully functional local WALS Online clone (serving the data of a particular WALS edition) can be as easy as running a short shell script.
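The script itself depends on the specific application, but its first step, fetching the archived snapshot, might look roughly like the following Python sketch; the DOI above resolves to record 11040 on ZENODO, while the archive file name inside the record and the target directory are assumptions for illustration:

    import io
    import zipfile
    from urllib.request import urlopen

    # Hypothetical file name; check the ZENODO record page for the real one.
    URL = "https://zenodo.org/record/11040/files/wals-snapshot.zip"

    # Download the archived repository snapshot and unpack it locally:
    with urlopen(URL) as resp:
        zipfile.ZipFile(io.BytesIO(resp.read())).extractall("wals-clone")

    # The remaining steps (setting up a Python environment, installing the
    # application and loading the CSV dump into a local database) follow the
    # instructions shipped inside the unpacked repository.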

Of course, there is still enough complexity and detail for devils to hide in this approach. E.g. it may not be trivial to find a working Python 2.7 environment in which to run the above script 20 years from now. But with the steps we have taken now, long-term access to databases like ours can take advantage of technologies like virtualization in the cloud or containerization of software (like Docker); e.g. an archived EC2 AMI or Docker repository may solve the problem of creating a working Python 2.7 environment.

Since clld applications provide a uniform API to access the data, reproducible research based on CLLD databases may (in the near future) proceed as follows:

  1. Start up a particular version of a CLLD database on a virtual machine.
  2. Run your data collection routines accessing the database through the API.
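For step 2, a data collection routine could be as simple as the following Python sketch. The port 6543 is Pyramid's default development port, and the GeoJSON representation of a WALS feature is an assumption about the running application, not a documented contract:

    import json
    from urllib.request import urlopen

    # Hypothetical base URL of the locally running clone:
    BASE = "http://localhost:6543"

    # Request one WALS feature as GeoJSON (assumed representation):
    with urlopen(BASE + "/feature/1A.geojson") as resp:
        datapoints = json.load(resp)

    # In a GeoJSON feature collection, each feature carries its attributes
    # in "properties"; here that would be one language's value for the
    # WALS feature:
    for feature in datapoints["features"]:
        print(feature["properties"])

Because the clone serves a frozen edition of the data, running the same routine against the same archived snapshot should always yield the same results.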

And it does not take too much imagination to see the first step in this recipe (i.e. on-demand hosting of archived databases) as part of the offerings of a research library.