CLLD – Cross-Linguistic Linked Data

A new kind of clld app

posted Thursday, March 25, 2021 by Robert Forkel

We recently published TuLar - the Tupían Language Resources site - and the Hindu Kush Areal Typology site. Both are examples of a new kind of clld app - one that aggregates different kinds of linguistic data with an areal focus, rather than collecting data of the same kind (e.g. typological questionnaires) globally. While this kind of customization was always possible within the clld framework, it can now be done more efficiently and in a more principled way. In the following we describe how.

Both sites/projects follow a new data publication paradigm: The web application now “merely” serves as browsable, human user interface to the data - allowing users to “window shop”. The actual, citable data underlying the apps is now one (or multiple) CLDF datasets, archived with and longterm-accessible via DOI from Zenodo. This separation of concerns relieves the web app from being the single-point-of-failure when it comes to data accessibility. At the same time, uniformity of data access - across many datasets - is actually improved thanks to the CLDF specification.

The Hindu Kush Areal Typology dataset bundles two CLDF datasets in one record on Zenodo - curated following the usage example in the cldfbench paper by List and Forkel.

TuLaR goes one step further and aggregates data from multiple Zenodo records. This can be done transparently by bundling the relevant datasets in a dedicated Zenodo community tular, and querying Zenodo’s OAI-PMH interface. To make this easier, we use our own zenodoclient package. The short code snippet below will

  • query all datasets in the TuLaR community
  • sort them by version number
  • print the latest version for each dataset:
from zenodoclient.oai import Records

for (org, repos), recs in itertools.groupby(
    sorted(
        oai.Records('tular'),
        key=lambda r: (r.repos.org, r.repos.repos, r.version),
        reverse=True),
    lambda r: (r.repos.org, r.repos.repos),
):
    if org == 'tupian-language-resources':
        print(r.repos.repos, r.tag)

Both apps profit from the CLDF specification because it minimizes the amount of code necessary to ingest the data into the app’s database.

In conclusion, not only does this new data publication paradigm relieve the web app from duties it was badly positioned to fulfill, but it also allows the app to take on a new role: It can function much like an overlay journal, adding a layer of quality control, or serving as a filter on top of the ever-growing universe of CLDF datasets.