CLLD – Cross-Linguistic Linked Data

The Austronesian Comparative Dictionary

posted Friday, March 17, 2023 by Robert Forkel

Robert Blust’s The Austronesian Comparative Dictionary - web edition (ACD) has been online for a long time - and actually still is (again) at https://www.trussel2.com/ACD/ . However, after the unexpected passing of Steve Trussel, the site was not updated anymore and can now be regarded as a legacy resource.

But in 2021, Robert Blust reached out to the Department of Linguistic and Cultural Evolution at the MPI EVA in Leipzig, to find a new home for the ACD, and in particular one that would allow further updating. Thus, we set out to

  • port the old website to a clld application
  • fed from data in a CLDF dataset, created by scraping the old website.

While retrieving the data via scraping HTML pages was not exactly trivial, it still proved to be a viable way forward. Within 2021 we were able to incorporate the first (small) set of updates from Robert Blust. Then Robert Blust passed away in 2022. But to ensure a future for his legacy he had already designated Alexander D. Smith as his successor as editor of the ACD.

So now, the lifetime project of Robert Blust lives on as

  • CLDF dataset curated at https://github.com/lexibank/acd and as
  • clld “web edition” at https://acd.clld.org

ACD as CLDF dataset

The basis of the ACD are wordlists for more than 1,000 Austronesian languages. Thus, The ACD CLDF dataset is a Wordlist. But arguably the “meat” of the ACD are the cognate sets, i.e. the grouping of words into sets, posited to have descended from a common ancestral protoform. These relations between words can be mapped to CLDF cognates and cognate sets.

This simple data model is made more complex by the fact that the ACD contains multiple levels of reconstruction, with simple cognate sets being only the first level, to be grouped into deeper reconstructions.

The levels of reconstruction are based on an assumed family tree:

           ┌─Formosan
──PAN──────┤
           │          ┌─PWMP──── ──PPh
           └─PMP──────┤
                      │          ┌─PCMP
                      └─PCEMP────┤
                                 │          ┌─PSHWNG
                                 └─PEMP─────┤
                                            └─POC

So, for example, a cognate set with forms from both, Oceanic (OC) and South Halmahera-West New Guinea (SHWNG) languages, can be used to reconstruct a protoform in Proto-Eastern Malayo-Polynesian (PEMP).

Due to the size and complexity of the data, analysis is best done using CLDF SQL, i.e. converting the CLDF dataset to a SQLite database running

cldf createdb cldf/cldf-metadata.json acd.sqlite

and then querying this database.

The schema of the database looks as follows: db schema

Querying the forms making up an ACD cognate set (e.g. *gaway₁ tentacles of octopus, squid, jellyfish, etc.) can be done using SQL as below:

SELECT
    c.cldf_id AS CID,
    pf.ID AS PID,
    proto.cldf_form AS Protoform,
    l.`group` AS `Group`, 
    l.cldf_id AS LID, 
    l.cldf_name AS Language, 
    f.cldf_form AS Form, 
    p.cldf_name AS Gloss
FROM
    CognateTable AS c,
    FormTable AS f,
    LanguageTable AS l,
    ParameterTable AS p,
    `protoforms.csv` AS pf,
    FormTable AS proto
WHERE
    c.cldf_cognatesetReference = '29846' AND
    c.cldf_formReference = f.cldf_id AND
    f.cldf_languageReference = l.cldf_id AND
    f.cldf_parameterReference = p.cldf_id AND
    c.reconstruction_id = pf.id AND
    pf.form_id = proto.cldf_id
order by Protoform, `Group`;

Notes:

  • The lower-level cognate sets are modeled as rows in the custom table protoforms.csv.
  • We select from FormTable twice! Once to select the forms grouped into the cognate set, and once to select the reconstructed proto-form.

Running this query will result in a table listing13 forms, grouped into two intermediate reconstructions:

CID PID Protoform Group LID Language Form Gloss
244-ec9fd9ff6236a4a6db5937287cc2c357-1-1 6472 gaway₁ WMP 244 Bikol gáway poisonous tentacles of the jellyfish
350-8e24200c0d718c850ee43a594748c017-1-1 6472 gaway₁ WMP 350 Muna gawe-gawe fringe, appendage (on clothes, fish)
252-df393fba3977a564ccbfbfdd4f5b85c3-1-1 6472 gaway₁ WMP 252 Cebuano gawáy tentacle
330-a11f483868c262d56e963ef5502525f5-1-1 6473 kawe₁ OC 330 Maori kawekawe tentacles of a cuttlefish; tendrils of a creeper, fringe on a mat, etc.
620-d05a30438a7bd1531b424eaafdba0de7-1-1 6473 kawe₁ OC 620 Hawaiian ʔaweʔawe tentacles; runners, as on a vine
620-df393fba3977a564ccbfbfdd4f5b85c3-1-1 6473 kawe₁ OC 620 Hawaiian ʔawe tentacle
1086-aea6b8baa3bbcd5fab13609a56c5f227-1-1 6473 kawe₁ OC 1086 Chuukese óó tentacle of octopus or squid
833-903be7cdc3c3879b681a1972439c4859-1-1 6473 kawe₁ OC 833 Marshallese ko octopus tentacles; rays of the sun
292-514e95bbbbd0e46089ac5cfebb620d6e-1-1 6473 kawe₁ OC 292 Kapingamarangi gawe bilibili tentacle of octopus
18945-a1eb53ec754b0f40d1f0a301f0447447-1-1 6473 kawe₁ OC 18945 Sa’a ka-kawe tentacles of octopus; branching of the fingers of the human hand
434-cec8b062c1be873bc6798d97100b6d40-1-1 6473 kawe₁ OC 434 Tongan kave tentacle of cuttlefish or octopus
401-514e95bbbbd0e46089ac5cfebb620d6e-1-1 6473 kawe₁ OC 401 Samoan ʔave tentacle of octopus
345-90c79f7796b66e245dfa1a7af6a775ad-1-1 6473 kawe₁ OC 345 Motu gave feelers of octopus

Outlook

While converting the ACD data to CLDF might already lead to better re-usability for analysis, it was the essential first step towards making the data maintainable. So on this new basis we can grow and refine the ACD, making sure it keeps its place as essential resource in Austronesian Comparative Linguistics.