CLLD – Cross-Linguistic Linked Data

The Second Year

posted Wednesday, January 21, 2015 by Robert Forkel

As it turns out, our predictions for the main work packages in 2014 were rather inaccurate, to say the least.

Instead, among the big themes of last year we find outreach and exploring collaborative data curation.


After exploring how CLLD might fit into the Digital Humanities in Passau, I was off to LREC in Reykjavík to talk about the LD in CLLD at the 3rd Workshop on Linked Data in Linguistics. I also registered for the workshop on Collaboration and Computing for Under-Resourced Languages in the Linked Open Data Era, which seemed like a good fit for a talk on CLLD as well, so this might be a direction to explore in the next two years.

The highlight of the year, though, was a more focussed event: the workshop on Language Comparison with Linguistic Databases (LanCLiD). With a successor workshop in Leipzig next year and a discussion group on the web, we hope to establish a group of interested researchers coherent enough to work on topics like interoperability of resources.

Collaborative data curation

A question that comes up about WALS Online every once in a while is how to contribute to the database. We do not really have a good answer to that yet (although the upcoming Journal of Cross-Linguistic Databases may turn out to be the answer). Tsammalex, a database we published as a beta version this year, shows an interesting alternative, at least technically: bringing the collaboration model of the open source software world to research data. We’ll explain what fork, release and pull request mean in this context in a separate post soon.

Of course, we are neither the first nor the only ones to harness GitHub for data curation. The curators of the PHOIBLE database have been doing this for some time now. Which brings us to the next topic:

New databases

The first new database published in 2014 was SAILS, a typological database on languages of South America (see also the release notes). Later in the year we finally published PHOIBLE, the world’s largest collection of phoneme inventories (see also the release notes).

So while we did work on lexical datasets in 2014, we did not publish any. Fortunately this changed earlier this week, when the ASJP dataset was finally published within the CLLD framework. And then there is the “open beta” version of Tsammalex, a multilingual lexical database on plants and animals that includes an image repository. Over the coming months we plan to expand its coverage substantially and hope to make it a useful cross-disciplinary resource.


The good thing about the failed predictions for 2014: they can be recycled and make good predictions for 2015!