A problem which recently popped up again, and which we eventually found a satisfying solution for is described as multilingual sorting in an article about the Oracle database system.
Now that article is about Oracle, but we use the open source database system PostgreSQL, and somewhat surprisingly (because typically PostgreSQL is a beacon when it comes to supporting standards), the state of multilingual sorting (or collation in database lingo) in PostgreSQL is sorry.
But the blog post linked above did hint at a possible solution: Use
to interface with the ICU tools for proper unicode collation support.
Proper unicode collation support in our case means, being able to sort multilingual lexical data in such a way that all linguistically meaningful characters (clicks, IPA symbols, etc.) are respected. The algorithm which describes such a collation is called Unicode collation algorithm, and implemented in said ICU tools.
Unfortunately the status of ICU support in PostgreSQL hasn’t changed much since 2009.
pg_collkey workaround is still necessary, and turns out to be somewhat tricky.
So here’s a recipe to make
pg_collkey work under Ubuntu 12.04 with PostgreSQL 9.1 and
Ubuntu 14.04 with PostgreSQL 9.3:
pg_collkeypackage version 0.5.1 from here.
Unfortunately, the Makefile provided in the
pg_collkey package did not work on our target
systems. To help the compiler find the PostgreSQL header files, we had to feed it the
include directory explicitely - rather than relying on
pg_config. With this tweak, all went
well on Ubuntu 12.04. We had no such
luck on Ubuntu 14.04, though, because the linker flags returned by
didn’t work. Re-using the linker flags returned by the same command on Ubuntu 12.04 for
14.04 did the trick, though, so
here’s a Makefile template
adapted to our needs (replace
__pg_version__ with 9.1 or 9.3):
ICU_CFLAGS = `icu-config --cppflags-searchpath` PG_INCLUDE_DIR = /usr/include/postgresql/__pg_version__/server PG_PKG_LIB_DIR = `pg_config --pkglibdir` collkey_icu.so: collkey_icu.o ld -shared -o collkey_icu.so collkey_icu.o -ldl -lm -L/usr/lib/x86_64-linux-gnu -licui18n -licuuc -licudata -ldl -lm collkey_icu.o: collkey_icu.c gcc -Wall -fPIC $(ICU_CFLAGS) -I $(PG_INCLUDE_DIR) -o collkey_icu.o -c collkey_icu.c clean: rm -f *.o *.so install: install collkey_icu.so $(PG_PKG_LIB_DIR)
Using this makefile, running
did work on both target systems.
Now, creating the collkey functions using the sql file provided in the package should work:
psql -d <dbname> -f collkey_icu.sql
clld package, we wrapped support for
pg_collkey in two functions:
If you want to see the unicode collation algorithm in action, look at the sorting of names of species in Tsammalex (hint: clicks are sorted after latin letters)!