Robert Blust’s The Austronesian Comparative Dictionary - web edition (ACD) has been online for a long time - and actually still is (again) at https://www.trussel2.com/ACD/ . However, after the unexpected passing of Steve Trussel, the site was not updated anymore and can now be regarded as a legacy resource.
But in 2021, Robert Blust reached out to the Department of Linguistic and Cultural Evolution at the MPI EVA in Leipzig, to find a new home for the ACD, and in particular one that would allow further updating. Thus, we set out to
clld
applicationWhile retrieving the data via scraping HTML pages was not exactly trivial, it still proved to be a viable way forward. Within 2021 we were able to incorporate the first (small) set of updates from Robert Blust. Then Robert Blust passed away in 2022. But to ensure a future for his legacy he had already designated Alexander D. Smith as his successor as editor of the ACD.
So now, the lifetime project of Robert Blust lives on as
clld
“web edition” at https://acd.clld.orgThe basis of the ACD are wordlists for more than 1,000 Austronesian languages. Thus, The ACD CLDF dataset is a Wordlist. But arguably the “meat” of the ACD are the cognate sets, i.e. the grouping of words into sets, posited to have descended from a common ancestral protoform. These relations between words can be mapped to CLDF cognates and cognate sets.
This simple data model is made more complex by the fact that the ACD contains multiple levels of reconstruction, with simple cognate sets being only the first level, to be grouped into deeper reconstructions.
The levels of reconstruction are based on an assumed family tree:
┌─Formosan
──PAN──────┤
│ ┌─PWMP──── ──PPh
└─PMP──────┤
│ ┌─PCMP
└─PCEMP────┤
│ ┌─PSHWNG
└─PEMP─────┤
└─POC
So, for example, a cognate set with forms from both, Oceanic (OC) and South Halmahera-West New Guinea (SHWNG) languages, can be used to reconstruct a protoform in Proto-Eastern Malayo-Polynesian (PEMP).
Due to the size and complexity of the data, analysis is best done using CLDF SQL, i.e. converting the CLDF dataset to a SQLite database running
cldf createdb cldf/cldf-metadata.json acd.sqlite
and then querying this database.
The schema of the database looks as follows:
Querying the forms making up an ACD cognate set (e.g. *gaway₁ tentacles of octopus, squid, jellyfish, etc.) can be done using SQL as below:
SELECT
c.cldf_id AS CID,
pf.ID AS PID,
proto.cldf_form AS Protoform,
l.`group` AS `Group`,
l.cldf_id AS LID,
l.cldf_name AS Language,
f.cldf_form AS Form,
p.cldf_name AS Gloss
FROM
CognateTable AS c,
FormTable AS f,
LanguageTable AS l,
ParameterTable AS p,
`protoforms.csv` AS pf,
FormTable AS proto
WHERE
c.cldf_cognatesetReference = '29846' AND
c.cldf_formReference = f.cldf_id AND
f.cldf_languageReference = l.cldf_id AND
f.cldf_parameterReference = p.cldf_id AND
c.reconstruction_id = pf.id AND
pf.form_id = proto.cldf_id
order by Protoform, `Group`;
Notes:
protoforms.csv
.FormTable
twice! Once to select the forms grouped into the cognate set,
and once to select the reconstructed proto-form.Running this query will result in a table listing13 forms, grouped into two intermediate reconstructions:
CID | PID | Protoform | Group | LID | Language | Form | Gloss |
---|---|---|---|---|---|---|---|
244-ec9fd9ff6236a4a6db5937287cc2c357-1-1 | 6472 | gaway₁ | WMP | 244 | Bikol | gáway | poisonous tentacles of the jellyfish |
350-8e24200c0d718c850ee43a594748c017-1-1 | 6472 | gaway₁ | WMP | 350 | Muna | gawe-gawe | fringe, appendage (on clothes, fish) |
252-df393fba3977a564ccbfbfdd4f5b85c3-1-1 | 6472 | gaway₁ | WMP | 252 | Cebuano | gawáy | tentacle |
330-a11f483868c262d56e963ef5502525f5-1-1 | 6473 | kawe₁ | OC | 330 | Maori | kawekawe | tentacles of a cuttlefish; tendrils of a creeper, fringe on a mat, etc. |
620-d05a30438a7bd1531b424eaafdba0de7-1-1 | 6473 | kawe₁ | OC | 620 | Hawaiian | ʔaweʔawe | tentacles; runners, as on a vine |
620-df393fba3977a564ccbfbfdd4f5b85c3-1-1 | 6473 | kawe₁ | OC | 620 | Hawaiian | ʔawe | tentacle |
1086-aea6b8baa3bbcd5fab13609a56c5f227-1-1 | 6473 | kawe₁ | OC | 1086 | Chuukese | óó | tentacle of octopus or squid |
833-903be7cdc3c3879b681a1972439c4859-1-1 | 6473 | kawe₁ | OC | 833 | Marshallese | ko | octopus tentacles; rays of the sun |
292-514e95bbbbd0e46089ac5cfebb620d6e-1-1 | 6473 | kawe₁ | OC | 292 | Kapingamarangi | gawe bilibili | tentacle of octopus |
18945-a1eb53ec754b0f40d1f0a301f0447447-1-1 | 6473 | kawe₁ | OC | 18945 | Sa’a | ka-kawe | tentacles of octopus; branching of the fingers of the human hand |
434-cec8b062c1be873bc6798d97100b6d40-1-1 | 6473 | kawe₁ | OC | 434 | Tongan | kave | tentacle of cuttlefish or octopus |
401-514e95bbbbd0e46089ac5cfebb620d6e-1-1 | 6473 | kawe₁ | OC | 401 | Samoan | ʔave | tentacle of octopus |
345-90c79f7796b66e245dfa1a7af6a775ad-1-1 | 6473 | kawe₁ | OC | 345 | Motu | gave | feelers of octopus |
While converting the ACD data to CLDF might already lead to better re-usability for analysis, it was the essential first step towards making the data maintainable. So on this new basis we can grow and refine the ACD, making sure it keeps its place as essential resource in Austronesian Comparative Linguistics.