8.9. Character encoding issues

The 7-bit ASCII character set originally used by PCs turned out to be insufficient for languages other than English. Reference data may require characters that ASCII does not include, and the string sorting order may follow different rules as well. refdb supports national character sets as well as Unicode, which is essentially a superset of all national character sets. As a refdb user and administrator you will have to deal with character encoding issues at different levels.

8.9.1. Character encodings of databases

While it is possible to convert the data during import and export (see the following sections), it is still worthwhile to give some thought to the character encoding used by your reference databases. If possible, use an encoding that ensures a suitable string sorting order for your data. Choosing a proper encoding also avoids unnecessary character encoding conversions when importing or exporting data.
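To illustrate why the sorting order deserves attention, here is a minimal Python sketch. It is not refdb code, and the database engines apply their own collation rules. The author names and the helper fold_diacritics are made up for this example. It only shows that a plain code point comparison places accented capitals after "Z", whereas a crude accent-insensitive key keeps them next to their base letters.

```python
import unicodedata


def fold_diacritics(text: str) -> str:
    """Strip combining marks (hypothetical helper, a crude stand-in for a real collation)."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


# Made-up author names for illustration only.
authors = ["Zola", "Ärzte", "Adams", "Östlund", "Olsen"]

# Plain comparison by Unicode code point: accented capitals sort after "Z".
by_code_point = sorted(authors)

# Accent-insensitive key; the original name breaks ties deterministically.
by_folded_key = sorted(
    authors, key=lambda name: (fold_diacritics(name).casefold(), name)
)

print(by_code_point)
print(by_folded_key)
```

A real collation is language-specific (for example, some languages sort "Ö" after "Z" on purpose), so the folded key above is an approximation, not a recommendation.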

The available encodings are limited by your database engine:

SQLite

SQLite currently supports only two encodings, selected at compile time: ISO-8859-1 (the default) and UTF-8. If you install a binary package, it most likely uses ISO-8859-1.

MySQL

This database engine supports a fairly large number of encodings, but versions prior to 4.1 allow only one encoding per server instance. That is, all databases have to use the same character encoding. Please see the MySQL documentation for the growing list of supported encodings.

PostgreSQL

This database engine supports a variety of encodings as a per-database option. That is, each reference database may use a different encoding. Please see the PostgreSQL documentation for a current list of supported encodings.

8.9.2. Character encodings of imported data

We'll have to distinguish two different sorts of data:

RIS

This plain-text format does not have a built-in way to declare the character encoding of the data. Instead you have to use the -E option of the addref and updateref commands to specify the encoding if it is different from the default (ISO-8859-1).

Please note that the import filters med2ris.pl, en2ris.pl, and to a limited extent also marc2ris.pl support on-the-fly character encoding conversion.

risx and xnote

These are XML formats that can use the XML way of declaring the encoding. This is done in the XML declaration, which is the first line of an XML file. Due to a limitation of the parser used for importing XML data, only four encodings are accepted by refdb: UTF-8, UTF-16, ISO-8859-1, and US-ASCII. If your data use a different encoding, use the iconv command line utility (usually a part of the libiconv package) to convert your data to one of the accepted encodings.
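The two cases above can be sketched in Python. This is an illustration under stated assumptions, not part of refdb: the function names suggest_ris_encoding, declared_xml_encoding, and xml_to_utf8 are hypothetical, the RIS check is only a heuristic for choosing a value for the -E option, and the XML part mimics what an iconv conversion followed by a manual edit of the declaration would achieve.

```python
import re

# The four encodings listed above as accepted for risx and xnote input.
ACCEPTED_XML_ENCODINGS = {"UTF-8", "UTF-16", "ISO-8859-1", "US-ASCII"}

# Matches the encoding declared at the very start of the data. This only
# works if the declaration itself is stored in an ASCII-compatible encoding.
_XML_DECL = re.compile(
    rb'^<\?xml[^>]*?encoding\s*=\s*["\']([A-Za-z][A-Za-z0-9._-]*)["\']'
)


def suggest_ris_encoding(data: bytes, fallback: str = "ISO-8859-1") -> str:
    """Heuristic: report UTF-8 if the bytes decode as UTF-8, else the fallback."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return fallback
    return "UTF-8"


def declared_xml_encoding(data: bytes) -> str:
    """Return the declared encoding, or UTF-8 (the XML default) if none is found."""
    match = _XML_DECL.match(data)
    return match.group(1).decode("ascii").upper() if match else "UTF-8"


def xml_to_utf8(data: bytes) -> bytes:
    """Re-encode XML data as UTF-8 unless the declared encoding is already accepted."""
    encoding = declared_xml_encoding(data)
    if encoding in ACCEPTED_XML_ENCODINGS:
        return data
    text = data.decode(encoding)
    # Rewrite the declaration so that it matches the new byte encoding.
    text = re.sub(r'(encoding\s*=\s*["\'])[^"\']+', r"\g<1>UTF-8", text, count=1)
    return text.encode("utf-8")


if __name__ == "__main__":
    # Made-up sample data for illustration only.
    ris = "AU  - Müller, J.\r\n".encode("iso-8859-1")
    print(suggest_ris_encoding(ris))

    xml = '<?xml version="1.0" encoding="ISO-8859-15"?>\n<p>5 €</p>'.encode(
        "iso-8859-15"
    )
    print(xml_to_utf8(xml).decode("utf-8"))
```

Note that the RIS heuristic cannot tell ISO-8859-1 apart from other 8-bit encodings; if you know where the data came from, that knowledge is more reliable than any guess.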

8.9.3. Character encodings of exported data

By default, data are exported without a character conversion, i.e. the data will use whatever encoding the database uses. If you want the exported data in a different encoding, request it with the -E option. This option is accepted by the getref and getnote commands of refdbc as well as by the refdbib client. You may request any encoding that your local libiconv installation supports. man 3 iconv or man iconv_open should give an indication of which encodings are available.
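When requesting an export encoding, keep in mind that a narrower target encoding may not be able to represent every character stored in the database. The following Python sketch is an illustration only: the helper unencodable_chars is hypothetical, it uses Python's own codecs rather than libiconv, and it makes no statement about how refdb handles such characters. It lists the characters of a string that a given encoding cannot express.

```python
def unencodable_chars(text: str, encoding: str) -> list:
    """Return the distinct characters of text that encoding cannot represent."""
    missing = []
    for ch in text:
        try:
            ch.encode(encoding)
        except UnicodeEncodeError:
            if ch not in missing:
                missing.append(ch)
    return missing


if __name__ == "__main__":
    # Made-up sample title for illustration only.
    title = "Dvořák: Œuvres"
    print(unencodable_chars(title, "iso-8859-1"))
    print(unencodable_chars(title, "utf-8"))
```

Running such a check on a sample of your data before choosing an export encoding can show whether the target encoding is wide enough.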