CLLD - Cross-Linguistic Linked Data

The Austronesian Comparative Dictionary

posted Friday, March 17, 2023 by Robert Forkel

Robert Blust’s The Austronesian Comparative Dictionary - web edition (ACD) has been online for a long time, and is in fact available again at https://www.trussel2.com/ACD/. However, after the unexpected passing of Steve Trussel, the site was no longer updated and can now be regarded as a legacy resource.

But in 2021, Robert Blust reached out to the Department of Linguistic and Cultural Evolution at the MPI EVA in Leipzig to find a new home for the ACD, in particular one that would allow further updating. Thus, we set out to port the old website to a clld application, fed from data in a CLDF dataset created by scraping the old website.

While retrieving the data by scraping HTML pages was not exactly trivial, it proved to be a viable way forward. Still in 2021, we were able to incorporate the first (small) set of updates from Robert Blust. Robert Blust passed away in 2022, but to ensure a future for his legacy he had already designated Alexander D. Smith as his successor as editor of the ACD.

So now, the lifelong project of Robert Blust lives on as a CLDF dataset curated at https://github.com/lexibank/acd and as a clld “web edition” at https://acd.clld.org

ACD as CLDF dataset

The basis of the ACD is a collection of wordlists for more than 1,000 Austronesian languages. Thus, the ACD CLDF dataset is a Wordlist. But arguably the “meat” of the ACD are the cognate sets, i.e. the groupings of words into sets posited to have descended from a common ancestral protoform. These relations between words can be mapped to CLDF cognates and cognate sets.

This simple data model is made more complex by the fact that the ACD contains multiple levels of reconstruction, with simple cognate sets being only the first level, to be grouped into deeper reconstructions.

The levels of reconstruction are based on an assumed family tree:

           ┌─Formosan
──PAN──────┤
           │          ┌─PWMP──── ──PPh
           └─PMP──────┤
                      │          ┌─PCMP
                      └─PCEMP────┤
                                 │          ┌─PSHWNG
                                 └─PEMP─────┤
                                            └─POC

So, for example, a cognate set with forms from both Oceanic (OC) and South Halmahera-West New Guinea (SHWNG) languages can be used to reconstruct a protoform in Proto-Eastern Malayo-Polynesian (PEMP).
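The relation between attested groups and reconstruction level can be illustrated with a small sketch. It encodes the tree above as child-to-parent links and looks up the lowest node that dominates all given nodes. The function names are made up for this illustration, the node labels are taken from the tree (so POC stands in for the Oceanic languages, PSHWNG for SHWNG), and this is not necessarily how reconstruction levels are assigned in the ACD itself.

```python
# Child -> parent links, transcribed from the assumed family tree above.
PARENT = {
    "Formosan": "PAN",
    "PMP": "PAN",
    "PWMP": "PMP",
    "PCEMP": "PMP",
    "PPh": "PWMP",
    "PCMP": "PCEMP",
    "PEMP": "PCEMP",
    "PSHWNG": "PEMP",
    "POC": "PEMP",
}


def ancestors(node):
    """Return the path from `node` up to the root, `node` included."""
    path = [node]
    while path[-1] in PARENT:
        path.append(PARENT[path[-1]])
    return path


def deepest_common_level(nodes):
    """Return the lowest node of the tree that dominates all `nodes`.

    Hypothetical helper for illustration only.
    """
    nodes = list(nodes)
    if not nodes:
        raise ValueError("at least one node is required")
    common = set(ancestors(nodes[0]))
    for node in nodes[1:]:
        common &= set(ancestors(node))
    # Walking upwards from the first node, the first shared node is the lowest one.
    for candidate in ancestors(nodes[0]):
        if candidate in common:
            return candidate
    raise ValueError("nodes do not share an ancestor in the tree")


print(deepest_common_level(["POC", "PSHWNG"]))  # PEMP
```

With forms from POC and PPh, the same lookup yields PMP, and as soon as a Formosan language is involved it yields PAN.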

Due to the size and complexity of the data, analysis is best done using CLDF SQL, i.e. by converting the CLDF dataset to a SQLite database by running

cldf createdb cldf/cldf-metadata.json acd.sqlite

and then querying this database.
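The resulting file can be queried with any SQLite client. As a minimal sketch, the following helper uses Python's standard sqlite3 module to list the tables in such a file together with their row counts. The function name is made up for this illustration, and the file name acd.sqlite is simply the one passed to the command above.

```python
import sqlite3


def table_row_counts(db_path):
    """Map each table name in the SQLite file at `db_path` to its row count."""
    conn = sqlite3.connect(db_path)
    try:
        names = [
            row[0]
            for row in conn.execute(
                "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
            )
        ]
        # Quote the names: tables derived from custom CSV files,
        # such as protoforms.csv, contain a dot.
        return {
            name: conn.execute(f'SELECT count(*) FROM "{name}"').fetchone()[0]
            for name in names
        }
    finally:
        conn.close()
```

Calling table_row_counts("acd.sqlite") after creating the database gives a quick overview of which tables exist and how large they are.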

The schema of the database looks as follows: [figure: db schema]

Querying the forms making up an ACD cognate set (e.g. *gaway₁ tentacles of octopus, squid, jellyfish, etc.) can be done using SQL as below:

SELECT
    c.cldf_id AS CID,
    pf.ID AS PID,
    proto.cldf_form AS Protoform,
    l.`group` AS `Group`, 
    l.cldf_id AS LID, 
    l.cldf_name AS Language, 
    f.cldf_form AS Form, 
    p.cldf_name AS Gloss
FROM
    CognateTable AS c,
    FormTable AS f,
    LanguageTable AS l,
    ParameterTable AS p,
    `protoforms.csv` AS pf,
    FormTable AS proto
WHERE
    c.cldf_cognatesetReference = '29846' AND
    c.cldf_formReference = f.cldf_id AND
    f.cldf_languageReference = l.cldf_id AND
    f.cldf_parameterReference = p.cldf_id AND
    c.reconstruction_id = pf.id AND
    pf.form_id = proto.cldf_id
ORDER BY Protoform, `Group`;

Notes:

The lower-level cognate sets are modeled as rows in the custom table protoforms.csv.
We select from FormTable twice! Once to select the forms grouped into the cognate set, and once to select the reconstructed proto-form.

Running this query results in a table listing 13 forms, grouped into two intermediate reconstructions:

CID	PID	Protoform	Group	LID	Language	Form	Gloss
244-ec9fd9ff6236a4a6db5937287cc2c357-1-1	6472	gaway₁	WMP	244	Bikol	gáway	poisonous tentacles of the jellyfish
350-8e24200c0d718c850ee43a594748c017-1-1	6472	gaway₁	WMP	350	Muna	gawe-gawe	fringe, appendage (on clothes, fish)
252-df393fba3977a564ccbfbfdd4f5b85c3-1-1	6472	gaway₁	WMP	252	Cebuano	gawáy	tentacle
330-a11f483868c262d56e963ef5502525f5-1-1	6473	kawe₁	OC	330	Maori	kawekawe	tentacles of a cuttlefish; tendrils of a creeper, fringe on a mat, etc.
620-d05a30438a7bd1531b424eaafdba0de7-1-1	6473	kawe₁	OC	620	Hawaiian	ʔaweʔawe	tentacles; runners, as on a vine
620-df393fba3977a564ccbfbfdd4f5b85c3-1-1	6473	kawe₁	OC	620	Hawaiian	ʔawe	tentacle
1086-aea6b8baa3bbcd5fab13609a56c5f227-1-1	6473	kawe₁	OC	1086	Chuukese	óó	tentacle of octopus or squid
833-903be7cdc3c3879b681a1972439c4859-1-1	6473	kawe₁	OC	833	Marshallese	ko	octopus tentacles; rays of the sun
292-514e95bbbbd0e46089ac5cfebb620d6e-1-1	6473	kawe₁	OC	292	Kapingamarangi	gawe bilibili	tentacle of octopus
18945-a1eb53ec754b0f40d1f0a301f0447447-1-1	6473	kawe₁	OC	18945	Sa’a	ka-kawe	tentacles of octopus; branching of the fingers of the human hand
434-cec8b062c1be873bc6798d97100b6d40-1-1	6473	kawe₁	OC	434	Tongan	kave	tentacle of cuttlefish or octopus
401-514e95bbbbd0e46089ac5cfebb620d6e-1-1	6473	kawe₁	OC	401	Samoan	ʔave	tentacle of octopus
345-90c79f7796b66e245dfa1a7af6a775ad-1-1	6473	kawe₁	OC	345	Motu	gave	feelers of octopus
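The double use of FormTable noted above can be tried out without the full dataset. The sketch below builds a heavily simplified in-memory stand-in for the database and runs the same joins, written with explicit JOIN clauses. It models only the columns needed for the joins, the form and cognate ids are made up, and only the cognate set id, the two reconstruction ids and the forms themselves are taken from the table above.

```python
import sqlite3

# Toy stand-in for acd.sqlite (assumption: reduced schema, made-up row ids).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE FormTable (cldf_id TEXT PRIMARY KEY, cldf_form TEXT);
CREATE TABLE `protoforms.csv` (id TEXT PRIMARY KEY, form_id TEXT);
CREATE TABLE CognateTable (
    cldf_id TEXT PRIMARY KEY,
    cldf_formReference TEXT,
    cldf_cognatesetReference TEXT,
    reconstruction_id TEXT
);
-- Reconstructed protoforms and attested forms live in the same table.
INSERT INTO FormTable VALUES
    ('proto-1', 'gaway₁'), ('proto-2', 'kawe₁'),
    ('f1', 'gáway'), ('f2', 'gawáy'), ('f3', 'kave');
INSERT INTO `protoforms.csv` VALUES ('6472', 'proto-1'), ('6473', 'proto-2');
INSERT INTO CognateTable VALUES
    ('c1', 'f1', '29846', '6472'),
    ('c2', 'f2', '29846', '6472'),
    ('c3', 'f3', '29846', '6473');
""")

rows = conn.execute("""
SELECT proto.cldf_form AS Protoform, f.cldf_form AS Form
FROM CognateTable AS c
JOIN FormTable AS f ON c.cldf_formReference = f.cldf_id       -- attested form
JOIN `protoforms.csv` AS pf ON c.reconstruction_id = pf.id
JOIN FormTable AS proto ON pf.form_id = proto.cldf_id         -- reconstruction
WHERE c.cldf_cognatesetReference = ?
""", ("29846",)).fetchall()

for protoform, form in rows:
    print(protoform, form)
```

The aliases f and proto refer to the same table, but each join condition picks a different row: f is reached through the cognate judgement, proto through the row in protoforms.csv.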

Outlook

While converting the ACD data to CLDF might already lead to better re-usability for analysis, it was above all the essential first step towards making the data maintainable. On this new basis we can grow and refine the ACD, making sure it keeps its place as an essential resource in Austronesian Comparative Linguistics.
