My knowledge of the engine's internals is limited. I have (temporarily?) given up to understand it, let alone modify it.
Please note that I just started writing this document, so there are gaps and inconsistencies all around, I suppose
If you want to discuss matters mentioned here, drop me a note at peter_jacobi AT gmx DOT net or, preferably bring it to the relevant mailing list: firebird-support for concrete support questions, firebird-architect for the ivory tower stuff and firebird-devel for code. I also created a firebird-i18n on Yahoogroups, but it is dormant and has next to no subscribers.
For our introductary discussion it's sufficent to say, that a Character Set is the mapping between abstract characters and bytes stored in the computer. For example the character € EURO SIGN is stored as one byte valued 0x80 in the Windows CP1252 character set, as one byte valued 0xA4 in the ISO 8859-15 character, as two bytes valued 0x20 0xAC in the UTF-16BE character set and as three bytes value 0xE2 0x82 0xAC in the UTF-8 character set.
There is more fine grained vocabulary in use, with terms "Abstract Character Repertoire", "Coded Character Set", "Character Encoding Form", "Character Encoding Scheme", and "Transfer Encoding Syntax". You can learn more than you probably need or want from Unicode Technical Report #17 . Another good explanation is in te W3C Character Model WD .
A collation defines the ordering of character strings according to certain, typically culturally dependant rules. For any two strings it will give on and only of the results:
As if you haven't guessed it, there is something to
be found on the Unicode site togive the whole picture:
Unicode Technical Standard #10
All definition of technical quality here:
There are two simple rule about Firebird's defaults for character set and collation:
So you have to manually set character set and collation for best results and therefore this small guide exists.
A most important warning: If you define any column in your database to use character set other than NONE, you better five any column a defined character set. Otherwise, whenever a non-ASCII character finds it way into the NONE column, most processing with this field will fail with transliteration error.
So if you have and all charset NONE database and have the need to add a column with defined character set, be prepared to act very carefully or do a complete migration.
Whenever you want to add "don't care about charset" column to a database which uses a defined character set everywhere, don't add it as CHARSET NONE, but use CHARSET OCTETS for binary data and CHARSET ASCII for ASCII data or just use the same character set you use everywhere else.
In Firebird 1.5.1 the transliteration rules for CHARSET NONE have been changed, so that the above mentioned error doesn't occur anymore, but you still have the underlying semantic problem, when mixing NONE and defined character sets. You are only given another possibility to shoot yourself into the foot.
Theoretically, Firebird does support MBCS. UTF-8 (until recently only under the historic name UNICODE_FSS) is supported and all the importand MBCSs for Chinese, Japanese and Korean.
But peeking into the implementation, you'll note that for MBCS columns, always the maximal byte length is allocated. E.g. a CHAR (100) CHARACTER SET UNICODE_FSS takes 300 bytes. This in itself isn't that bad, and for on disk storage is nearly no problem, due to run length encoding. But the engine itself doesn't handle the column as a 100 character column which may take 100 to 300 bytes, but as a 300 byte column which may take 100 to 300 characters. Ouch. So you can happily store more than 100 characters in a CHAR (100) and engine's internal string operations may give rather surprising results.
I've put this also in the SF Bug Tracker:
Unfortunately it doesn't support voting for bugs...
Please note that some components try to compensate for this behaviour. Also exclusively using VARCHAR instead of CHAR does help.
Staying with the defaults, use character set NONE as database character set and as a connection character set.
A good choice when you never expect non-ASCII letters in your database. Other non-ASCII characters (the occasional μ or non-ASCII punctation (“ ” † • —), are fine, as long as all connected systems agree on default character set.
The usual pros of UTF-8 itself:
With its current limitations, only useful in specific cases:
You want to store Unicode text but doesn't need any processing of it in
the engine.
It's internally used as intermediate step in charset conversions, but it's not surfaced in rdb$character_sets.
Not ready yet.
Usually done by creating the database with a default character set and also connecting with this character set. As you can't specify a database default collation, it has to be given for any column which should get culture correct sorting.
This is the most useful solution when you are character set needs are covered by one 8-bit character set.
If you need some ASCII columns in your database, declare them ASCII, not NONE.
This works good enough and you don't have other choices. To work around the MBCS mis-behaviour always use VARCHAR. If you need some ASCII columns in your database, declare them ASCII, not NONE.
Use in the database your favorite character sets, for efficency and availability of collations. Shield your application from character set complexity by connecting with UTF-8 (or UTF-16) character set.
Abyssimal.
This is the workaround version of (6). As Firebird's character set conversion doesn't cut, offload the the character set conversion to some middle layer or toolkit. To get this fully working you must connect using charset NONE and use Firebird 1.5.1, so that Firebird doesn't interfere with it owns conversions.
The Python interface KInterbaseDB works this way, and to a varying degree also the .NET provider and ODBC.
Good, flexible solution. Needs FB1.5.1 to fully work.
As collations should specify order and some misc operations (uppercasing to be exact) on characters, they shouldn't depend on character set. The question how to sort the strings Huette, Hütte and Hüte couldn't care less about the byte sequences used to represent these strings.
But the SQL standard and even more Firebird itself take quite another view. Without checking every possibility, I'm tempted to postulate that no two character set share a truly identical collation.
This is the default for every character set and simply means that the stored representations of the strings are compared byte by byte. So these colations clearly give different results for different charcter sets - and in the case of Latin script, the sort order of accented characters will be 'wrong'.
FIXME
For compatibility with ancient programs:
FIXME
FIXME