The Open Source analogy for research data curation
We are obviously not the first ones to have come up with the idea of using GitHub for collaborative data curation.
What GitHub and the pull request have done for open source software development is simply
too good to lose out on when it comes to research data curation: They created a global community around a:
which lowers the level of expertise required of potential contributors to what seems the optimal threshold for separating signal from noise.
Now, knowing how to use a version control system may not quite be the optimal threshold when trying to
encourage researchers to contribute to collaborative curation projects. But it should be safe to say
it’s on its way into the curriculum.
So what is our take on using GitHub for collaborative data curation?
First of all, we see it as a way to formalize procedures which have always been part of data curation:
- git provides a formal protocol for merging changes, thus it’s the cornerstone of collaboration.
- A byproduct of this protocol is the ability to identify and re-create different states of the data, thereby providing support for versioned data.
- The pull request is a protocol for reviewed data updates.
- The fork can be used to formalize the concept of maintenance transfer.
- And releases can be seen as the equivalent of publications.
Note: While some advantages of using git hinge on using line-based text formats for the data,
preferably not cluttered with markup (CSV, JSON if pretty printed), most of the
points hold for binary data as well. But again, missing out on the level of support
provided by a mature system such as git would seem foolish.
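To illustrate why line-based formats matter here, the following is a minimal Python sketch, not taken from any of our repositories. The records and field names are made up, and `difflib` only stands in for the line-based comparison git performs. It serializes the same small dataset once as single-line JSON and once pretty printed, edits one value, and reports how many lines of the old file are touched by the change.

```python
import copy
import difflib
import json


def serialize(records, pretty):
    """Serialize records as JSON, either on a single line or one value per line."""
    if pretty:
        return json.dumps(records, indent=2, sort_keys=True)
    return json.dumps(records, sort_keys=True)


def diff_summary(old, new):
    """Return (lines of `old` touched by the change, total lines of `old`)."""
    old_lines, new_lines = old.splitlines(), new.splitlines()
    matcher = difflib.SequenceMatcher(None, old_lines, new_lines, autojunk=False)
    touched = sum(
        i2 - i1
        for tag, i1, i2, _j1, _j2 in matcher.get_opcodes()
        if tag != "equal"
    )
    return touched, len(old_lines)


# Hypothetical records, for illustration only.
records = [
    {"id": "a1", "name": "lion", "habitat": "savanna"},
    {"id": "a2", "name": "leopard", "habitat": "forest"},
    {"id": "a3", "name": "zebra", "habitat": "savanna"},
]
edited = copy.deepcopy(records)
edited[1]["name"] = "African leopard"  # a single-value correction

for pretty in (False, True):
    touched, total = diff_summary(
        serialize(records, pretty), serialize(edited, pretty)
    )
    print(f"pretty={pretty}: {touched} of {total} lines changed")
```

With single-line JSON the one and only line changes, so a reviewer has to spot the edit inside the whole file; with pretty-printed JSON the same edit touches one line out of 17, which is what makes a pull request reviewable.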
But you can go further:
- For the Tsammalex data we used GitHub’s webhook system, namely the integration with Travis CI, for data integrity checks
which are run upon pushing new data, providing a comfortable mechanism to make sure only data releases
passing these tests are deployed to the
- Of course, the issue trackers which come with most hosted version control systems make a good list of errata.
- GitHub adds more and more data visualization functionality (e.g. rendering maps displaying GeoJSON data in a repository), thereby turning the data repository itself
into a basic data browser.
- And again the webhook system allows integration with a service like Zenodo, which provides
backup, archiving and a DOI for releases of a GitHub repository.
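As an illustration of the kind of data integrity check mentioned in the list above, here is a minimal Python sketch. It is not the actual Tsammalex test suite: the required columns `id` and `name` and the function name `check_integrity` are assumptions made for this example. It checks a CSV file for missing columns, empty required values and duplicate identifiers.

```python
import csv
import io

# Hypothetical schema, assumed for this example only.
REQUIRED_COLUMNS = ("id", "name")


def check_integrity(csv_text):
    """Return a list of problems found; an empty list means the data passed."""
    reader = csv.DictReader(io.StringIO(csv_text))
    fieldnames = reader.fieldnames or []
    missing = [column for column in REQUIRED_COLUMNS if column not in fieldnames]
    if missing:
        return [f"missing columns: {', '.join(missing)}"]

    problems = []
    seen_ids = set()
    for row in reader:
        line = reader.line_num  # the header is line 1
        for column in REQUIRED_COLUMNS:
            # Short rows yield None for absent values, hence the `or ""`.
            if not (row[column] or "").strip():
                problems.append(f"line {line}: empty {column}")
        row_id = (row["id"] or "").strip()
        if row_id:
            if row_id in seen_ids:
                problems.append(f"line {line}: duplicate id {row_id!r}")
            seen_ids.add(row_id)
    return problems
```

A continuous integration job could run such a function over every data file and exit with a non-zero status when the returned list is not empty, which marks the build as failed and thereby blocks the release.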
So considering all this and trying to eat our own dogfood, we are happy to make
(our flagship when it comes to integrating data from multiple sources) available
for this type of collaboration: