Last changed 2004-07-06

Firebird INTL Architecture

Selecting Character Set and Collation

Preface

Disclaimer

My knowledge of the engine's internals is limited. I have (temporarily?) given up to understand it, let alone modify it.

Please note that I just started writing this document, so there are gaps and inconsistencies all around, I suppose

Feedback

If you want to discuss matters mentioned here, drop me a note at peter_jacobi AT gmx DOT net or, preferably bring it to the relevant mailing list: firebird-support for concrete support questions, firebird-architect for the ivory tower stuff and firebird-devel for code. I also created a firebird-i18n on Yahoogroups, but it is dormant and has next to no subscribers.

Introduction

Terminology

Character Set

For our introductary discussion it's sufficent to say, that a Character Set is the mapping between abstract characters and bytes stored in the computer. For example the character € EURO SIGN is stored as one byte valued 0x80 in the Windows CP1252 character set, as one byte valued 0xA4 in the ISO 8859-15 character, as two bytes valued 0x20 0xAC in the UTF-16BE character set and as three bytes value 0xE2 0x82 0xAC in the UTF-8 character set.

There is more fine grained vocabulary in use, with terms "Abstract Character Repertoire", "Coded Character Set", "Character Encoding Form", "Character Encoding Scheme", and "Transfer Encoding Syntax". You can learn more than you probably need or want from Unicode Technical Report #17 . Another good explanation is in te W3C Character Model WD .

Collation

A collation defines the ordering of character strings according to certain, typically culturally dependant rules. For any two strings it will give on and only of the results:

As if you haven't guessed it, there is something to be found on the Unicode site togive the whole picture:
Unicode Technical Standard #10

Narrow, MBCS, Wide - How do you like it?

All definition of technical quality here:

Narrow Character Sets
Character sets where all encoded charcters fit into one byte, e.g. ASCII, ISO 8859-1, DOS CP437, Win CP1252
Multi-Byte Character Sets
Character sets using a varying number of bytes per character, e.g. SJIS (Japanese), Big-5, GB18030. For all practical aspects also UTF-8, a secific (and widely used) character encoding scheme for Unicode.
Wide Character Sets
Character sets where all character take the same number of bytes, but more than one: Unicode UTF-32 or Unicode UTF-16 (when only used for BMP characters.

Defaults

There are two simple rule about Firebird's defaults for character set and collation:

  1. It doesn't look at your system's locale or ask you about it during installation. It's the same on any system.
  2. It doesn't help that much

So you have to manually set character set and collation for best results and therefore this small guide exists.

NONE doesn't mix well.

A most important warning: If you define any column in your database to use character set other than NONE, you better five any column a defined character set. Otherwise, whenever a non-ASCII character finds it way into the NONE column, most processing with this field will fail with transliteration error.

So if you have and all charset NONE database and have the need to add a column with defined character set, be prepared to act very carefully or do a complete migration.

Whenever you want to add "don't care about charset" column to a database which uses a defined character set everywhere, don't add it as CHARSET NONE, but use CHARSET OCTETS for binary data and CHARSET ASCII for ASCII data or just use the same character set you use everywhere else.

In Firebird 1.5.1 the transliteration rules for CHARSET NONE have been changed, so that the above mentioned error doesn't occur anymore, but you still have the underlying semantic problem, when mixing NONE and defined character sets. You are only given another possibility to shoot yourself into the foot.

The sad state of MBCS support

Theoretically, Firebird does support MBCS. UTF-8 (until recently only under the historic name UNICODE_FSS) is supported and all the importand MBCSs for Chinese, Japanese and Korean.

But peeking into the implementation, you'll note that for MBCS columns, always the maximal byte length is allocated. E.g. a CHAR (100) CHARACTER SET UNICODE_FSS takes 300 bytes. This in itself isn't that bad, and for on disk storage is nearly no problem, due to run length encoding. But the engine itself doesn't handle the column as a 100 character column which may take 100 to 300 bytes, but as a 300 byte column which may take 100 to 300 characters. Ouch. So you can happily store more than 100 characters in a CHAR (100) and engine's internal string operations may give rather surprising results.

I've put this also in the SF Bug Tracker:

Unfortunately it doesn't support voting for bugs...

Please note that some components try to compensate for this behaviour. Also exclusively using VARCHAR instead of CHAR does help.

The Choices

(1) NONE

Staying with the defaults, use character set NONE as database character set and as a connection character set.

Pro

Contra

Verdict

A good choice when you never expect non-ASCII letters in your database. Other non-ASCII characters (the occasional μ or non-ASCII punctation (“ ” † • —), are fine, as long as all connected systems agree on default character set.

(2) UTF-8 a.k.a. UNICODE_FSS

Pro

The usual pros of UTF-8 itself:

Contra

Verdict

With its current limitations, only useful in specific cases:
You want to store Unicode text but doesn't need any processing of it in the engine.

(3) UTF-16, UCS-2, and the like

It's internally used as intermediate step in charset conversions, but it's not surfaced in rdb$character_sets.

Pro

Contra

Verdict

Not ready yet.

(4) Narrow Character Sets (ISO 8859-* and others)

Usually done by creating the database with a default character set and also connecting with this character set. As you can't specify a database default collation, it has to be given for any column which should get culture correct sorting.

Pro

Contra

Verdict

This is the most useful solution when you are character set needs are covered by one 8-bit character set.

If you need some ASCII columns in your database, declare them ASCII, not NONE.

(5) MBCS (BIG-5, SJIS, KSC)

Pro

Contra

Verdict

This works good enough and you don't have other choices. To work around the MBCS mis-behaviour always use VARCHAR. If you need some ASCII columns in your database, declare them ASCII, not NONE.

(6) Mix of character sets, connect Unicode

Use in the database your favorite character sets, for efficency and availability of collations. Shield your application from character set complexity by connecting with UTF-8 (or UTF-16) character set.

Pro

Contra

Verdict

Abyssimal.

(7) Mix of character sets, connect NONE

This is the workaround version of (6). As Firebird's character set conversion doesn't cut, offload the the character set conversion to some middle layer or toolkit. To get this fully working you must connect using charset NONE and use Firebird 1.5.1, so that Firebird doesn't interfere with it owns conversions.

The Python interface KInterbaseDB works this way, and to a varying degree also the .NET provider and ODBC.

Pro

Contra

Verdict

Good, flexible solution. Needs FB1.5.1 to fully work.

Collation Coices

Collations and Character Sets

As collations should specify order and some misc operations (uppercasing to be exact) on characters, they shouldn't depend on character set. The question how to sort the strings Huette, Hütte and Hüte couldn't care less about the byte sequences used to represent these strings.

But the SQL standard and even more Firebird itself take quite another view. Without checking every possibility, I'm tempted to postulate that no two character set share a truly identical collation.

Codepoint Order collations

This is the default for every character set and simply means that the stored representations of the strings are compared byte by byte. So these colations clearly give different results for different charcter sets - and in the case of Latin script, the sort order of accented characters will be 'wrong'.

Multi-Level collations

FIXME

For compatibility with ancient programs:

Closing Remarks

FIXME

References

FIXME