CLLD – Cross-Linguistic Linked Data

The First Year

posted Friday, January 03, 2014 by Robert Forkel

After the first of four years of funding it is time to review what has been achieved in 2013 and to outline the next milestones of the project.

Setting up project infrastructure

Since software is an important result of the project, we had to look for a way to host code repositories. But for a project that wants to reach out to the wider communities of linguists and developers, we actually didn't have to look hard: if lowering barriers to contribution and collaboration is a goal, there's no way around GitHub.

So we set up an organization on GitHub. For each database that joins the project we create two repositories: one for the software serving the database and one for the data.

Hosting these repositories on GitHub also provides us with issue tracking, which is likewise split: one issue tracker for the software of a database (classical bug tracking) and one for issues with the data (comparable to errata).

All our web applications are prepared for easy, automatic deployment to a rather generic target platform. In particular, it must be possible to deploy an application serving one of our databases on a new virtual machine in the VMware cluster of the GWDG by running a single command.

The first datasets

Just in time for ALT 10 we managed to get the first two databases published with the new clld framework:

APiCS Online – the Atlas of Pidgin and Creole Language Structures Online.
While rather similar to WALS Online - one of the databases that served as a prototype for what should be possible with a cross-linguistic database framework - APiCS introduced some peculiarities, most notably multiple values for a single language/parameter pair, and thus the notion of a value set as a collection of these values.
Glottolog
Glottolog was the first legacy database re-implemented as a CLLD application. It is not really a typological database, but the abstractions of the clld framework turned out to be flexible enough to model its data as well.
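The value-set abstraction introduced with APiCS can be sketched as a toy data model. This is illustrative Python only; the class and field names, as well as the sample records, are my assumptions, not the actual clld database schema:

```python
from dataclasses import dataclass, field

# Illustrative sketch of the value-set abstraction; names and data
# are placeholders, not the actual clld schema.

@dataclass
class Value:
    name: str                # a feature value, e.g. a word-order type
    frequency: float = 1.0   # APiCS-style data records relative frequencies

@dataclass
class ValueSet:
    language: str            # language (or variety) identifier
    parameter: str           # feature/parameter identifier
    values: list = field(default_factory=list)  # one or more Values

# WALS-style data: exactly one value per language/parameter pair.
wals_vs = ValueSet("SomeLanguage", "SomeFeature", [Value("type A")])

# APiCS-style data: multiple values for the same pair, with frequencies.
apics_vs = ValueSet("SomeLanguage", "SomeFeature",
                    [Value("type A", 0.8), Value("type B", 0.2)])

assert len(wals_vs.values) == 1
assert len(apics_vs.values) == 2
```

The point of the abstraction is that the WALS case simply becomes a value set of size one, so both kinds of database fit the same model.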

Triggered by Google ending support for version 2 of its Maps API in November 2013, we had to migrate three already published databases to the clld framework:

eWAVE – the electronic World Atlas of Varieties of English
did not pose any particular problems, since its data model is a simplification of those of WALS and APiCS.
WALS Online – the World Atlas of Language Structures Online.
Since WALS was being re-implemented for the second time, its data and functionality were well known - but at the same time rather extensive. So we ended up implementing more "convenience" functionality than intended, in particular bringing back the ability to combine multiple features.
WOLD – the World Loanword Database
played the same role for lexicographical databases that WALS did for typological ones; i.e. it served as a prototype for the data model and functionality of a cross-linguistic lexicographical database framework. So porting it to the clld framework went as planned.

The last database published within the project in 2013 was AfBo: A world-wide survey of affix borrowing - a rather small database which still posed some data modelling problems, because its main research objects are language pairs rather than single languages.
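A pair-based model of this kind might be sketched as follows. This is hypothetical Python; the names and the sample record are placeholders, not AfBo's actual schema:

```python
from dataclasses import dataclass

# Hypothetical sketch: in a database like AfBo the unit of description
# is a donor/recipient language pair, not a single language.

@dataclass(frozen=True)
class LanguagePair:
    recipient: str   # language that borrowed the affix
    donor: str       # language the affix was borrowed from

@dataclass
class BorrowedAffix:
    pair: LanguagePair
    affix: str       # the borrowed form (placeholder below)
    function: str    # grammatical function of the affix

record = BorrowedAffix(
    LanguagePair("RecipientLang", "DonorLang"), "-affix", "plural")

assert record.pair.donor == "DonorLang"
```

Treating the pair itself as the keyed research object is what distinguishes this from the one-language-per-row model of the typological databases above.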

2014

One focus in 2014 will be on databases with lexicographical data. Migrating The Intercontinental Dictionary Series and the database of ASJP (the Automated Similarity Judgment Program) is already underway.

Another main topic is the database journals to be started in 2014. For both the typological database journal JCLD (Journal of cross-linguistic databases) and the dictionary journal, we are in the process of collecting suitable initial or seed publications.

The third work package we will tackle in 2014 is a CLLD portal, i.e. a place where you can find data and resources across all CLLD datasets.