CLLD – Cross-Linguistic Linked Data

csv support in clld applications

posted Monday, July 28, 2014 by Robert Forkel

With the clld framework it has always been easy to provide custom representations of the resources in a database. As of version 0.15 this mechanism is used in the core framework to provide csv and csv metadata representations for any datatable in a clld app.

With no less than a conference and a w3c working group dedicated to it, csv (a.k.a. comma-separated values) seems to have survived the attacks by XML, RDF, JSON, etc. quite well. Of course, most users of CLLD databases have known this all along. And obviously, tool support for csv has always been close to perfect.

So following best practices outlined in w3c's Model for Tabular Data and Metadata on the Web and CSV Lint's about page, there is now generic csv support built into the clld framework.

To understand what this looks like in practice, we can look at an example: the datatable listing languages in Glottolog. When the table's download selector is clicked, csv and csv.csvm are listed among the download formats. The same information is also available for automated processing as HTTP Link headers:

$ curl -I http://glottolog.org/glottolog/language
HTTP/1.1 200 OK
Date: Mon, 28 Jul 2014 12:03:05 GMT
Content-Type: text/html; charset=utf-8
Content-Length: 13598
Connection: keep-alive
Vary: Accept
Link: ; rel="describedby"; type="application/json"
Link: ; rel="alternate"; type="text/csv"
...
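
A client could pick up these representations by parsing the Link header values. What follows is only a minimal sketch using the Python standard library: parse_link_header is a hypothetical helper (not part of clld), it handles quoted parameter values only, and the example.org URLs in the sample headers are assumptions made for illustration, not taken from the response above.

```python
import re

# One link-value has the form: <target>; param="value"; param="value"
_LINK_RE = re.compile(r'<([^>]*)>((?:\s*;\s*[\w-]+="[^"]*")*)')
_PARAM_RE = re.compile(r'([\w-]+)="([^"]*)"')


def parse_link_header(value):
    """Parse one Link header value into a list of dicts (simplified)."""
    links = []
    for target, params in _LINK_RE.findall(value):
        link = {"url": target}
        link.update(dict(_PARAM_RE.findall(params)))
        links.append(link)
    return links


# Hypothetical header values; the target URLs are assumptions.
headers = [
    '<http://example.org/language.csv.csvm>; rel="describedby"; type="application/json"',
    '<http://example.org/language.csv>; rel="alternate"; type="text/csv"',
]
links = [link for header in headers for link in parse_link_header(header)]

# Select the csv representation by its media type.
csv_urls = [link["url"] for link in links if link.get("type") == "text/csv"]
```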

So let's look at the description:

$ curl http://glottolog.org/glottolog/language.csv.csvm
{
    "fields": [
        {
            "constraints": {
                "type": "http://www.w3.org/2001/XMLSchema#int",
                "unique": false
            },
            "name": "child_dialect_count"
        },
...
        {
            "constraints": {
                "type": "http://www.w3.org/2001/XMLSchema#string",
                "unique": false
            },
            "name": "classificationcomment"
        },
...

The data structure you see above describes the fields of the corresponding csv file in JSON Table Schema format, as understood by CSV Lint. So, feeding the two URLs for data and schema to CSV Lint, we can get a validation result: CSV Validation
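
Besides validation, such a description could also drive type conversion when reading the csv data. Here is a rough sketch, assuming only the two XML Schema datatypes visible in the excerpt; the CASTS mapping, the converters helper and the inline sample values are made up for illustration and are not part of clld.

```python
import csv
import io
import json

# Assumed mapping from XML Schema datatype URIs to Python callables.
CASTS = {
    "http://www.w3.org/2001/XMLSchema#int": int,
    "http://www.w3.org/2001/XMLSchema#string": str,
}


def converters(schema):
    """Map each field name in the schema to a conversion function."""
    return {
        field["name"]: CASTS.get(field["constraints"]["type"], str)
        for field in schema["fields"]
    }


# Inline stand-ins for the downloaded schema and data (made-up values).
schema = json.loads("""
{"fields": [
  {"constraints": {"type": "http://www.w3.org/2001/XMLSchema#int", "unique": false},
   "name": "child_dialect_count"},
  {"constraints": {"type": "http://www.w3.org/2001/XMLSchema#string", "unique": false},
   "name": "classificationcomment"}
]}
""")
data = io.StringIO("child_dialect_count,classificationcomment\n3,some comment\n,\n")

cast = converters(schema)
rows = [
    # Empty cells are kept as None instead of being converted.
    {name: cast[name](value) if value != "" else None for name, value in row.items()}
    for row in csv.DictReader(data)
]
```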

Using the tools of the excellent csvkit, you can list the field names of a csv file:

$ curl "http://glottolog.org/glottolog/language.csv" | csvcut -n
  1: child_dialect_count
  2: child_family_count
  3: child_language_count
  4: classificationcomment
  5: description
  6: family_pk
  7: father_pk
  8: globalclassificationcomment
  9: hid
 10: id
 11: jsondata
 12: latitude
 13: level
 14: longitude
 15: markup_description
 16: name
 17: pk
 18: status
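
If csvkit is not at hand, a similar listing can be produced with Python's csv module. This is just a sketch: field_names is a hypothetical helper, and the inline sample stands in for the downloaded file.

```python
import csv
import io


def field_names(fileobj):
    """Return (1-based index, name) pairs for the header row of a csv file."""
    header = next(csv.reader(fileobj))
    return list(enumerate(header, start=1))


# Made-up sample with the first three column names from the listing above.
sample = io.StringIO(
    "child_dialect_count,child_family_count,child_language_count\n0,0,0\n"
)
for index, name in field_names(sample):
    print(f"{index:3}: {name}")
```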

Since the exported csv always lists the rows as filtered and sorted in the datatable, you can use URL parameters to customize the contents of the csv export. So listing the glottocodes and names of Indo-European languages farther north than 55°, ordered by decreasing latitude, can be done as follows:

$ curl "http://glottolog.org/glottolog/language.csv?type=languages&sEcho=1&iSortingCols=1&iSortCol_0=6&sSortDir_0=desc&sSearch_2=Indo&sSearch_6=%3E+55" | csvcut -c 10,16
icel1247,Icelandic
jamt1238,Jamtska
faro1244,Faroese
kalo1256,Kalo Finnish Romani
balt1257,Baltic Romani
swed1254,Swedish
latv1249,Latvian
scot1245,Scottish Gaelic
scan1238,Scanian
hibe1235,Hiberno-Scottish Gaelic
scot1243,Scots
lith1251,Lithuanian
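
Rather than hand-encoding such a query string, it can be assembled programmatically. The sketch below rebuilds the URL used above with the Python standard library. The comments on the individual parameters are my reading of the example (the names appear to follow the datatable's server-side request convention) and should be treated as assumptions.

```python
from urllib.parse import urlencode

BASE_URL = "http://glottolog.org/glottolog/language.csv"

# A list of pairs keeps the parameters in the same order as in the example.
params = [
    ("type", "languages"),
    ("sEcho", 1),
    ("iSortingCols", 1),     # assumed: number of columns to sort by
    ("iSortCol_0", 6),       # assumed: sort on column 6 (latitude in this table)
    ("sSortDir_0", "desc"),  # assumed: descending order
    ("sSearch_2", "Indo"),   # assumed: filter on column 2
    ("sSearch_6", "> 55"),   # assumed: filter on column 6; encoded as %3E+55
]
url = BASE_URL + "?" + urlencode(params)
print(url)
```

The resulting URL can then be passed to curl as before or fetched with any HTTP client.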

Note that this functionality will become available in existing CLLD apps only gradually, as they are upgraded to recent versions of the clld framework.