CLLD – Cross-Linguistic Linked Data

Retro-digitizing the "Tableaux phonétiques des patois suisses romands"

posted Thursday, September 24, 2020 by Robert Forkel

Exploiting CLDF and the enhanced CLDF support in clld 7 made collaboration on retro-digitizing Gauchat’s “Tableaux phonétiques des patois suisses romands” from 1925 easy. Read more in the preprint “A digital, retro-standardized edition of the Tableaux phonétiques des patois Suisses romands (TPPSR)” and below.

At the beginning of the CLLD project, publishing a database meant pushing a web application live. The application was then the sole result of a long process and the only interface for potential users of the data.

Thanks to CLDF in particular, we can now break the database-creation process down into well-defined steps, with well-specified artefacts along the way that are re-usable on their own.

In the case of TPPSR, Hans Geisler did the heavy lifting of

  • scanning the book (accessible and archived at https://archive.org/details/gauchat-et-al-1925-tppsr)
  • converting the data to text files according to the Unicode standard.

At this point,

  • the concept list used to collect the lexical data by Gauchat can be extracted and added to Concepticon,
  • orthography profiles according to the specification in Moran and Cysouw 2018 can be generated.

The “raw” digital data serves as input for cldfbench, our CLDF workbench. The orthography profiles are used to inform the automatic segmentation of the data.
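To illustrate how an orthography profile can inform segmentation, here is a minimal sketch in plain Python. It assumes a profile is simply a table mapping graphemes (possibly several characters long) to segments, and it applies a greedy longest-match strategy. The function names, the `?` marker for unknown characters and the tiny profile are all invented for this illustration; they are not taken from the TPPSR profiles, and this is not the code used in the actual workflow.

```python
def load_profile(rows):
    """Build a grapheme -> segment lookup from (grapheme, segment) pairs."""
    return {grapheme: segment_ for grapheme, segment_ in rows}


def segment(form, profile, unknown="?"):
    """Split `form` into space-separated segments by greedy longest match.

    At each position the longest grapheme listed in the profile wins;
    characters not covered by the profile are replaced by `unknown`.
    """
    longest = max(len(grapheme) for grapheme in profile)
    segments = []
    i = 0
    while i < len(form):
        # Try the longest possible chunk first, then shorter ones.
        for size in range(min(longest, len(form) - i), 0, -1):
            chunk = form[i:i + size]
            if chunk in profile:
                segments.append(profile[chunk])
                i += size
                break
        else:
            # No grapheme of any length matched at this position.
            segments.append(unknown)
            i += 1
    return " ".join(segments)


# Hypothetical profile rows, invented for this example.
PROFILE = load_profile([
    ("ch", "ʃ"),
    ("ou", "u"),
    ("a", "a"),
    ("o", "o"),
    ("t", "t"),
])

print(segment("chat", PROFILE))
print(segment("chou", PROFILE))
```

Because the multi-character graphemes `ch` and `ou` are tried before single characters, `chou` is split into two segments rather than four. Unknown characters are kept visible in the output, which makes gaps in a profile easy to spot during the iterations described below.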

Since the process of converting raw data to CLDF typically takes multiple iterations, it is best practice to keep input, output and processing code in a version-controlled repository: https://github.com/lexibank/tppsr

The released CLDF data, identified by a DOI, then serves as input for the clld app served at https://tppsr.clld.org/
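Since CLDF data consists of CSV tables described by JSON metadata, it can also be consumed without any web application. The following sketch groups word forms by concept using only the Python standard library. It assumes a forms table with the column names `ID`, `Language_ID`, `Parameter_ID`, `Form` and `Segments`, which are the usual defaults for a CLDF Wordlist; the authoritative names for a given dataset are declared in its metadata file. The rows and the function name `forms_by_concept` are invented for illustration and are not TPPSR data.

```python
import csv
import io
from collections import defaultdict

# Invented sample rows standing in for a forms table; not real TPPSR data.
FORMS_CSV = """ID,Language_ID,Parameter_ID,Form,Segments
1,site-a,water,ewe,e w e
2,site-b,water,iwe,i w e
3,site-a,hand,man,m a n
"""


def forms_by_concept(handle):
    """Group (language, segments) pairs by concept.

    `handle` is any file-like object yielding CSV text with the
    assumed column names Language_ID, Parameter_ID and Segments.
    """
    grouped = defaultdict(list)
    for row in csv.DictReader(handle):
        # Segments are stored as one space-separated string per form.
        grouped[row["Parameter_ID"]].append(
            (row["Language_ID"], row["Segments"].split())
        )
    return dict(grouped)


table = forms_by_concept(io.StringIO(FORMS_CSV))
for concept, forms in table.items():
    print(concept, forms)
```

For a real dataset, the `io.StringIO` object would be replaced by an opened forms file from the released data.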