CLLD – Cross-Linguistic Linked Data

clld 10.0.0 - 10 years of clld

posted Wednesday, January 18, 2023 by Robert Forkel

After 10 years of development it seems to be the right time for release 10.0.0 of the clld framework. And maybe it’s also time for some retrospection.

On the plus side, there’s quite a few clld apps out there, running mostly flawlessly and serving well known and well frequented datasets (such as Glottolog or WALS Online). But clld didn’t really gather a developer community, and only a small community of users of the software.

Overall, the most important role of clld might have been as a learning tool - for me in particular and the wider community at large.

So what is the lesson to be learned from 10 years of clld?

Despite appearing as a long-running, successful project, CLLD (the project) and clld the web app framework have taught me that

considering a web application the authoritative source for a dataset is very fragile (and actually not a good idea).

Multiple, important aspects contribute to this fragility.

Software maintenance:

Se vogliamo che tutto rimanga come è, bisogna che tutto cambi. (Il Gattopardo)

A running web application means deployed software, i.e. code that runs on a publicly accessible server. To just maintain the “status quo” (e.g. being able to read WALS’ chapter 1 by typing http://wals.info/chapter/1 into a browsers location bar) a lot of things needed changing:

  • clld migrated from
    • Python 2.7 to Python >=3.7
    • PostgreSQL 8 to PostgreSQL 14
    • Ubuntu 12.04 to Ubuntu 22.04
    • etc.
  • HTTPS everywhere!

Since we can’t keep up with everything (- in particular not with the crazy world of frontend web tech), cautiously surveying dependencies is a must.

And of course, there are already challenges ahead, e.g. migrating to SQLAlchemy 2.

Longterm perspective:

Cool URIs don’t change (W3C)

With WOLD having to change its domain I learned early in the project that web databases are often not cool (but the Internet Archive’s Wayback Machine is!).

So while paying a couple of euros to keep a domain may seem negligible when compared to the cost of porting thousands of lines of code to Python 3, it’s surprisingly(?) difficult in our academic institutional settings to make sure such payments are made (ideally without having to transfer domains all the time).

Data curation:

Cross-Linguistic Data … is code (Forkel 2015)

While a dataset like WALS Online had a long phase of collaborative data collection before its publication in 2008, it now has an even longer history of being a dataset in need of maintenance, i.e. versioning, releases, incorporating feedback.

Having to reconcile these requirements with WALS Online being the app at https://wals.info shaped my thinking about collaborative data curation and eventually embracing a platform like GitHub for data curation.

WALS Online - being as important and influential as it is - forced me to make the in-app curation model work for some time, but this only strenghtened my conclusion that CSV and GitHub wins.

Data publication:

data available upon reasonable request (various authors all the time)

So, assuming https://wals.info/feature/1A.tab is a “reasonable” request, clld web applications seem to have the data publication requirement covered.

clld even seems to have been ahead of its time when it comes to data citation:

The CLLD project approaches this problem very pragmatically, by emulating the world of traditional publications; e.g. CLLD databases are regarded as publications, stress similarity to known publication formats such as edited volumes or chapters in books, recommend a particular citation format.

The Tromsø recommendations for citation of research data in linguistics from 2019 went a step further, addressing also the issue of multiple versions of datasets.

But this creates a new problem: How to keep older editions accessible?

In 2014, I wrote:

Backups or data dumps are a very pragmatic solution to this problem. They represent the reasonable trade-off of making the newest version easily accessible, while leaving access to older versions to the technically skilled.

But I also pointed out that

for complex datasets which are typically accessed through a web application, the application itself dictates how to interpret the data, thus forms part of the metadata of the dataset. […] This is even more so for complex datasets like WALS, where there is no one data dump format, satisfying all requirements.

In 2018, we finally published our attempt at such a “data dump format, satisfying all requirements”: CLDF - Cross-Linguistic Data Formats, advancing data sharing and reuse in comparative linguistics.

In 2019 we went one step further, making CLDF not a dump format created by an app, but the authoritative format used to feed the PHOIBLE clld app from CLDF data.

Thus, with CLDF datasets published on Zenodo, a new kind of clld app became possible.

So, having learned a lot about the problems of web databases, I’m looking forward to a couple more years of making clld less relevant by focusing on CLDF :)