CLLD – Cross-Linguistic Linked Data

The Open Source analogy for research data curation

posted Tuesday, February 03, 2015 by Robert Forkel

We are obviously not the first to come up with the idea of using GitHub for collaborative data curation.

What GitHub and the pull request have done for open source software development is simply too good to lose out on when it comes to research data curation: They created a global community around a:

which lowers the level of expertise for potential contributors to what seems the optimal threshold to divide signal from noise.

Now, knowing how to use a version control system may not quite be the optimal threshold when trying to stimulate researchers to contribute to collaborative curation projects. But it should be safe to say that it’s on its way into the curriculum.

So what is our take on using GitHub for collaborative data curation? First of all, we see it as a way to formalize procedures that have always been part of data curation:

  • git provides a formal protocol for merging changes, which makes it the cornerstone of collaboration.
  • A byproduct of this protocol is the ability to identify and re-create different states of the data, thereby providing support for versioned data.
  • The pull request is a protocol for reviewed data updates.
  • The fork can be used to formalize the concept of maintenance transfer.
  • And releases can be seen as the equivalent of publications.

Note: While some advantages of using git hinge on line-based text formats for the data, preferably not cluttered with markup (csv, BibTeX, JSON if pretty-printed), most of the points hold for binary data as well. But again, missing out on the level of support provided by a mature system such as git would seem foolish.
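To illustrate why pretty-printing matters for line-based tools, here is a minimal sketch in Python. The record and the helper names (to_lines, changed_lines) are made up for illustration and are not taken from any of our repositories. It diffs two versions of a record, once serialized compactly and once pretty-printed with sorted keys:

```python
import difflib
import json


def to_lines(record, pretty):
    """Serialize a record to JSON and split the result into lines."""
    if pretty:
        text = json.dumps(record, indent=2, sort_keys=True)
    else:
        text = json.dumps(record, sort_keys=True)
    return text.splitlines()


def changed_lines(old, new, pretty):
    """Return only the lines a unified diff marks as removed or added."""
    diff = difflib.unified_diff(
        to_lines(old, pretty), to_lines(new, pretty), lineterm="", n=0)
    return [
        line for line in diff
        if line.startswith(("+", "-")) and not line.startswith(("+++", "---"))
    ]


# Hypothetical record, invented for this example only.
old = {"id": "abc1", "name": "Example", "latitude": 1.5, "longitude": 2.5}
new = dict(old, latitude=1.6)

compact = changed_lines(old, new, pretty=False)
pretty = changed_lines(old, new, pretty=True)

print("\n".join(compact))  # the whole record shows up as changed
print("\n".join(pretty))   # only the latitude line shows up as changed
```

In the compact form the diff repeats the entire record, while in the pretty-printed form it touches only the line holding the changed value, which is what makes a pull request on such data reviewable.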

But you can go further:

  • For the Tsammalex data we used GitHub’s Webhooks system, namely the integration with Travis CI, to implement data integrity checks which are run upon pushing new data. This provides a comfortable mechanism to make sure that only data releases passing these tests are deployed to the Tsammalex website.
  • Of course, the issue trackers that come with most hosted version control systems make a good list of errata.
  • GitHub keeps adding data visualization functionality (e.g. rendering maps that display GeoJSON data in a repository), thereby turning the data repository itself into a basic data browser.
  • And again, the webhook system allows integration with a service like Zenodo, which provides backup, archiving and a DOI for releases of a GitHub repository.
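As a rough idea of what a data integrity check run on every push might look like, here is a minimal sketch in Python. It is not the actual Tsammalex test suite. The column names (id, name), the rules (required values, unique ids) and the sample data are assumptions chosen for illustration.

```python
import csv
import io


def check_rows(rows, required=("id", "name")):
    """Return a list of human-readable problems found in the rows."""
    problems = []
    seen = set()
    # Data rows start on line 2 because line 1 holds the header.
    for number, row in enumerate(rows, start=2):
        for column in required:
            if not (row.get(column) or "").strip():
                problems.append(f"line {number}: missing {column}")
        identifier = row.get("id")
        if identifier:
            if identifier in seen:
                problems.append(f"line {number}: duplicate id {identifier}")
            seen.add(identifier)
    return problems


def check_csv(text):
    """Parse CSV text with a header row and run the row checks on it."""
    return check_rows(csv.DictReader(io.StringIO(text)))


# Hypothetical sample data: one duplicated id and one missing name.
SAMPLE = "id,name\na1,First\na1,Second\na2,\n"
problems = check_csv(SAMPLE)
for problem in problems:
    print(problem)
```

A continuous integration service would typically run a script like this against the data files in the repository and treat a non-empty list of problems as a failed build, so that a broken data release never reaches the deployment step.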

So, considering all this and trying to eat our own dog food, we are happy to make Glottolog (our flagship when it comes to integrating data from multiple sources) available for this type of collaboration: clld/glottolog-data.