CLLD – Cross-Linguistic Linked Data

Default Unicode Collation using pg_collkey

posted Friday, January 16, 2015 by Robert Forkel

A problem which recently popped up again, and which we eventually found a satisfying solution for is described as multilingual sorting in an article about the Oracle database system.

Now that article is about Oracle, but we use the open source database system PostgreSQL, and somewhat surprisingly (because typically PostgreSQL is a beacon when it comes to supporting standards), the state of multilingual sorting (or collation in database lingo) in PostgreSQL is sorry.

But the blog post linked above did hint at a possible solution: Use pg_collkey to interface with the ICU tools for proper unicode collation support.

Proper unicode collation support in our case means, being able to sort multilingual lexical data in such a way that all linguistically meaningful characters (clicks, IPA symbols, etc.) are respected. The algorithm which describes such a collation is called Unicode collation algorithm, and implemented in said ICU tools.

Unfortunately the status of ICU support in PostgreSQL hasn’t changed much since 2009. Thus, the pg_collkey workaround is still necessary, and turns out to be somewhat tricky.

So here’s a recipe to make pg_collkey work under Ubuntu 12.04 with PostgreSQL 9.1 and Ubuntu 14.04 with PostgreSQL 9.3:

  1. Install the distribution packages postgres-server-dev-9.x and libicu-dev.
  2. Download the pg_collkey package version 0.5.1 from here.
  3. Build the software …

Unfortunately, the Makefile provided in the pg_collkey package did not work on our target systems. To help the compiler find the PostgreSQL header files, we had to feed it the include directory explicitely - rather than relying on pg_config. With this tweak, all went well on Ubuntu 12.04. We had no such luck on Ubuntu 14.04, though, because the linker flags returned by

icu-config --ldflags

didn’t work. Re-using the linker flags returned by the same command on Ubuntu 12.04 for 14.04 did the trick, though, so here’s a Makefile template adapted to our needs (replace __pg_version__ with 9.1 or 9.3):

ICU_CFLAGS = `icu-config --cppflags-searchpath`
PG_INCLUDE_DIR = /usr/include/postgresql/__pg_version__/server
PG_PKG_LIB_DIR = `pg_config --pkglibdir`

collkey_icu.so: collkey_icu.o
        ld -shared -o collkey_icu.so collkey_icu.o -ldl -lm   -L/usr/lib/x86_64-linux-gnu -licui18n -licuuc -licudata  -ldl -lm

collkey_icu.o: collkey_icu.c
        gcc -Wall -fPIC $(ICU_CFLAGS) -I $(PG_INCLUDE_DIR) -o collkey_icu.o -c collkey_icu.c

clean:
        rm -f *.o *.so

install:
        install collkey_icu.so $(PG_PKG_LIB_DIR)

Using this makefile, running

make

and

make install

did work on both target systems.

Now, creating the collkey functions using the sql file provided in the package should work:

psql -d <dbname> -f collkey_icu.sql

In the clld package, we wrapped support for pg_collkey in two functions:

If you want to see the unicode collation algorithm in action, look at the sorting of names of species in Tsammalex (hint: clicks are sorted after latin letters)!