Gem #144: A Bit of Bytes: Characters and Encoding Schemes

Let's get started...

This Gem starts with a problem. As a native French speaker, I often manipulate text files that contain accented letters (those accents, by the way, were often introduced as a shorthand to replace letters in words, to save paper when it was still an expensive commodity). Unfortunately, depending on how a file was created, my programs do not necessarily see the same bytes for the same text: the contents depend on the character set and the encoding of the file. And if I simply try to display those bytes on the screen (either in a text console or in a graphical window), the output might not read like what I initially entered.

Glyphs

At this point, let's introduce the notion of glyphs. These are the visual representations of characters. For instance, I want "e-acute" to look like an "e" with a small acute accent above it. This visual representation is the final goal in a lot of applications, since that's what the user wants to see. In other applications, however, the glyphs are irrelevant. For instance, a compiler does not care how characters are displayed on your screen. It needs to know how to split a sequence of characters into words, but that's about it. It assumes your console, where error messages are displayed, will display the same glyphs you had in your source file when given the same bytes as the source file itself.

A text file does not embed the description of what its representation looks like. Instead, it is composed of bytes, which are combined in certain ways (sometimes called character encoding schemes) to make up code points. These code points are then matched to a specific character using a character set. Finally, the font determines how the character should be represented as a glyph.

A character's exact representation (its glyph) really depends on the font you are using, since a "lower case a" might have widely different aspects that depend on the font. This is outside the scope of this Gem, though.

In general, your application is not concerned with the mapping of characters to glyphs via the font. This is all taken care of by either the text console, or the GUI toolkit you are using. Your application will often let the user choose her preferred font, and then make sure to pass valid characters. The toolkit does the complex work of representing the characters. For example, this work is the role of the Pango toolkit (accessible from GtkAda).

Character Sets

A repertoire is a set of generally related characters, for instance the alphabets used to spell English or Russian words.

A character set is a mapping from a repertoire to a set of integers called code points. A given character, as we shall see, might exist in several different character sets with different code points.

Most of the standard character sets (sometimes abbreviated as charsets) are specific to one language. For instance, there exist ISO-8859-1 (also known as Latin-1) and ISO-8859-15, which are used for West European languages; we also have ISO-8859-5 and KOI8-R, which are different, but both used for Russian; Windows introduced a number of code pages, which are in fact character sets specific to that platform; Japanese texts often use ISO-2022-JP, whereas Chinese has several standard sets.

Let's take the simplest of them all, the ASCII charset. Most developers are familiar with it. For instance, in this set the code point 65 is associated with the letter upper-case-A. This set includes 128 characters, 33 of which are control characters with no visual representation (code points 0 to 31, and 127). It contains no accented letters, but is adequate for representing English texts.

In a lot of Western European languages, like French, ASCII was not sufficient, so ISO-8859-1 was built on top of it. The first 128 characters are the same, so code point 65 is still upper-case-A. But it also adds 128 extra characters, for instance 233 is lower-case-e-with-acute. See the Wikipedia page on ISO-8859-1 for more details.

Another example is ISO-8859-5, for Russian text, which is incompatible with ISO-8859-1, although it is also based on ASCII. So 65 is still upper-case-A, but this time 233 is cyrillic-small-letter-shcha and lower-case-e-with-acute does not exist.

As a result, if an application is reading an ISO-8859-5 encoded file, but believes it is ISO-8859-1, it will display the wrong glyph for most of the Russian letters, making the text unreadable for the user.

In most applications (for instance, the GPS IDE), there is a way to specify which character set the application should expect the files to be encoded in by default, and a way to override the default encoding for specific files.

There exists one character set that includes all the characters found in all the other character sets (or at least is meant to): Unicode, whose repertoire and code points are kept identical to those of ISO/IEC 10646. It includes well over a hundred thousand characters (and more are added at each revision), while avoiding duplicates. For compatibility with a lot of existing applications, the first 256 code points are the same as in ISO-8859-1, so upper-case-A is still 65, and lower-case-e-with-acute is still 233. But now cyrillic-small-letter-shcha is 1097.
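To make the numbers above concrete, here is a minimal sketch of what "the same byte in two character sets" means. The two helper functions are hypothetical and written for this illustration only; the ISO-8859-5 one covers just ASCII and the lower-case Cyrillic range, whereas a real conversion relies on a complete table.

```ada
with Ada.Text_IO; use Ada.Text_IO;

procedure Same_Byte is

   --  Hypothetical helper: Unicode code point of a byte read as
   --  ISO-8859-1. The first 256 Unicode code points match ISO-8859-1,
   --  so the value is unchanged.
   function From_Latin_1 (B : Natural) return Natural is
   begin
      return B;
   end From_Latin_1;

   --  Hypothetical helper: Unicode code point of a byte read as
   --  ISO-8859-5. This sketch only knows ASCII and the lower-case
   --  Cyrillic letters (bytes 16#D0# .. 16#EF#, U+0430 .. U+044F).
   function From_ISO_8859_5 (B : Natural) return Natural is
   begin
      if B < 128 then
         return B;
      elsif B in 16#D0# .. 16#EF# then
         return B - 16#D0# + 16#0430#;
      else
         raise Constraint_Error with "byte not covered by this sketch";
      end if;
   end From_ISO_8859_5;

begin
   Put_Line ("233 read as ISO-8859-1:" & Natural'Image (From_Latin_1 (233)));
   Put_Line ("233 read as ISO-8859-5:" & Natural'Image (From_ISO_8859_5 (233)));
end Same_Byte;
```

The first line reports 233 (lower-case-e-with-acute) and the second 1097 (cyrillic-small-letter-shcha): one byte, two different characters.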

Nowadays, a lot of applications (and even programming languages) will systematically use Unicode internally. For instance, the GTK+ graphic toolkit only manipulates Unicode for internal strings, and so does Python 3.x. So whenever a file is read from disk by GPS, it is first converted from its native character set to Unicode, and then the rest of the application no longer has to care about character sets.

Given the size of Unicode, there are few (if any) fonts that can represent the whole set of characters, but that's not an issue in general since most applications do not need to represent Egyptian hieroglyphs...

Another major part of the Unicode standard is a set of tables describing the properties of the various characters: which ones should be considered as white space, how to convert from lower to upper case, which letters are part of words, etc. In existing applications this knowledge is often hard-coded, which is why switching an application to Unicode internally often involves major changes.

Character Encoding Schemes

We now know how to represent characters as a combination of code points and a character set. But we often need to store those characters in files, which only contain bytes. That seems relatively easy when the code point is less than 256, but becomes much less obvious for other code points, like the 1097 we saw earlier.

In practice, this issue is solved in a number of ways. Encoding schemes such as the Japanese ISO-2022-JP use a notion of plane shift: special bytes indicate that from now on the bytes should be interpreted differently, until the next plane shift. Decoding and encoding therefore require knowledge of the current state.

Unicode itself defines three different encoding schemes (with their variants), which are known as UTF-8, UTF-16, and UTF-32. The number indicates the size, in bits, of the basic unit used to encode characters; in UTF-8 and UTF-16 some characters need several such units. In UTF-32, each character occupies exactly four bytes, which allows the whole set of Unicode characters to be represented. Decoding and encoding are therefore trivial, but there is a major waste of space associated with UTF-32.

In UTF-16, most characters are encoded in two bytes, which is enough for nearly all characters used by modern languages. Other characters are for specific usage, like Egyptian hieroglyphs. For code points that do not fit in two bytes, Unicode reserves a range of 16-bit values (the surrogates) that are used in pairs: a high surrogate followed by a low surrogate encodes a single code point. This is reminiscent of the plane shifts we described earlier, except that a pair only affects the one character it encodes, so no state needs to be carried along. Thus, there is much less wasted space, but decoding and encoding become a bit more complex.
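The arithmetic behind a surrogate pair is short enough to sketch. The procedure below is only an illustration (its name and constants are chosen for this example); it splits the code point of the first Egyptian hieroglyph, U+13000, into its two 16-bit halves.

```ada
with Ada.Text_IO; use Ada.Text_IO;
with Interfaces;  use Interfaces;

procedure Surrogates is
   package U32_IO is new Ada.Text_IO.Modular_IO (Unsigned_32);

   --  U+13000 is the first Egyptian hieroglyph; it needs more than 16 bits
   Code_Point : constant Unsigned_32 := 16#13000#;

   --  Distance from the first code point above U+FFFF: a 20-bit value
   Offset : constant Unsigned_32 := Code_Point - 16#10000#;

   --  The top 10 bits go into the high surrogate, the bottom 10 bits
   --  into the low surrogate
   High : constant Unsigned_32 := 16#D800# + Shift_Right (Offset, 10);
   Low  : constant Unsigned_32 := 16#DC00# + (Offset and 16#3FF#);
begin
   Put ("high surrogate: ");
   U32_IO.Put (High, Width => 0, Base => 16);
   New_Line;

   Put ("low surrogate:  ");
   U32_IO.Put (Low, Width => 0, Base => 16);
   New_Line;
end Surrogates;
```

For U+13000 the offset is 16#3000#, so the pair is 16#D80C# followed by 16#DC00#. A decoder reverses the same two steps.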

The above two encoding schemes are not backward compatible: an application that was written before Unicode and that only knew about ASCII or ISO-8859-1 will not understand the input strings properly.

For this reason, and to save even more space, Unicode also defines the UTF-8 encoding. ASCII characters are still represented as before, using a single byte. Code points greater than 127 are encoded as a sequence of several bytes, and it is guaranteed that every byte in such a sequence has its high bit set, so none of them can be mistaken for an ASCII character.
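As a quick way to see these byte sequences, the following sketch prints the UTF-8 bytes of the three code points used throughout this Gem. It assumes an Ada 2012 compiler, since it relies on the standard Ada.Strings.UTF_Encoding packages; the procedure names are made up for the example.

```ada
with Ada.Text_IO; use Ada.Text_IO;
with Ada.Strings.UTF_Encoding.Wide_Wide_Strings;

procedure Show_UTF8 is
   use Ada.Strings.UTF_Encoding;

   --  Print, in decimal, the bytes of the UTF-8 form of one code point
   procedure Dump (Code_Point : Natural) is
      Bytes : constant UTF_8_String :=
        Wide_Wide_Strings.Encode
          (Wide_Wide_String'(1 => Wide_Wide_Character'Val (Code_Point)));
   begin
      Put ("code point" & Natural'Image (Code_Point) & " ->");
      for B of Bytes loop
         Put (Natural'Image (Character'Pos (B)));
      end loop;
      New_Line;
   end Dump;

begin
   Dump (65);    --  upper-case-A: a single byte, identical to ASCII
   Dump (233);   --  lower-case-e-with-acute: two bytes
   Dump (1097);  --  cyrillic-small-letter-shcha: two bytes
end Show_UTF8;
```

The bytes are 65 for the first call, 195 169 for the second and 209 137 for the third. Only the ASCII character yields a byte below 128.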

Properly manipulating a UTF-8 string requires the use of specialized routines, since moving forward one character means moving forward 1 to 4 bytes (the original design allowed sequences of up to 6 bytes, but the current standard stops at 4). However, a casual application can, for instance, skip to the next white space character as it did before by moving forward one byte at a time and stopping when it sees 32 (a space), 10 (a line feed) or 13 (a carriage return). This property can often be used by applications that do not need to represent the characters, like the example of the compiler we mentioned at the beginning.
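Here is a minimal sketch of such a casual scan, assuming Ada 2012 for the membership test. The sample text is spelled out byte by byte so that the encoding of the source file itself does not matter; all names are invented for the example.

```ada
with Ada.Text_IO; use Ada.Text_IO;

procedure Skip_To_Space is
   --  UTF-8 bytes for e-acute (C3 A9) and a-grave (C3 A0)
   E_Acute : constant String :=
     Character'Val (16#C3#) & Character'Val (16#A9#);
   A_Grave : constant String :=
     Character'Val (16#C3#) & Character'Val (16#A0#);

   --  Two French words separated by a space: 6 characters, 9 bytes
   Text : constant String := E_Acute & "t" & E_Acute & " l" & A_Grave;

   Index : Natural := Text'First;
begin
   --  Plain byte-by-byte scan, with no knowledge of UTF-8 at all.
   --  It cannot stop in the middle of an accented letter, because
   --  none of that letter's bytes has an ASCII value.
   while Index <= Text'Last
     and then Character'Pos (Text (Index)) not in 32 | 10 | 13
   loop
      Index := Index + 1;
   end loop;

   Put_Line ("white space found at byte" & Natural'Image (Index));
end Skip_To_Space;
```

The scan stops at byte 6, although the space is only the fourth character of the text. That difference is exactly what the specialized routines have to account for.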

Although the notions of character sets and character encoding schemes are orthogonal, often these notions are conflated. For instance, when someone mentions ISO-8859-1, it usually means the character set as well as its standard representation, where each character is represented as a single byte. Likewise, someone talking about UTF-8 will typically mean the Unicode character set together with the UTF-8 character encoding scheme.

Conversions

We now have almost all of the pieces in place, except for the conversion between character sets. In theory, it is enough to decode the input stream using the proper character encoding scheme, then find the mapping for the code points from the origin to the target character set, and finally use the target encoding scheme to represent the characters as bytes again.

When a character has no mapping into the target character set (for instance the e-acute in the Russian ISO-8859-5), the application needs to decide whether to raise an error, ignore the character, or find a transliteration (for example, using e' for e-acute).

This is obviously tedious, and requires the use of big lookup tables for all the character sets your application needs to support.
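For the one case the standard Ada library handles by itself, going from Unicode code points to ISO-8859-1, the policy choice can be sketched as follows. This is only an illustration with made-up names; it uses Ada.Characters.Conversions and picks '?' as the replacement character.

```ada
with Ada.Text_IO;                use Ada.Text_IO;
with Ada.Characters.Conversions; use Ada.Characters.Conversions;

procedure To_Latin_1 is
   --  Two Unicode code points: e-acute (233) exists in ISO-8859-1,
   --  Cyrillic shcha (1097) does not
   Source : constant Wide_Wide_String :=
     (Wide_Wide_Character'Val (233), Wide_Wide_Character'Val (1097));

   --  Policy "substitute": unmappable characters are replaced by '?'
   Target : constant String := To_String (Source, Substitute => '?');
begin
   --  Policy "raise an error" could start from this check instead
   if not Is_String (Source) then
      Put_Line ("some characters have no ISO-8859-1 equivalent");
   end if;

   --  Show the resulting bytes in decimal
   for C of Target loop
      Put (Natural'Image (Character'Pos (C)));
   end loop;
   New_Line;
end To_Latin_1;
```

The resulting bytes are 233 and 63, the latter being the code of '?'. Transliteration, the third policy, needs per-character tables and is best left to a dedicated library.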

On Unix systems, there exists a standard library, iconv, to do this conversion work on your behalf. The GNU project also provides such an open-source library for other systems.

We have recently added a binding to this library in the GNAT Components Collection (GNATCOLL.Iconv), making it even easier to use from Ada. For instance:

    with GNATCOLL.Iconv;   use GNATCOLL.Iconv;
    procedure Main is
       EAcute : constant Character := Character'Val (16#E9#);
       --  in ISO-8859-1

       Result : constant String := Iconv
          ("Some string " & EAcute,
           To_Code   => UTF8,
           From_Code => ISO_8859_1);
    begin
       null;
    end Main;

XML/Ada has also included such conversion tables for a while, but supports many fewer character sets. Check the Unicode.CCS.* packages.

As you can see above, we are reusing the String type, since, in Ada, a String is not associated with any specific character set or encoding scheme. In general, as we mentioned before, this is not an issue, since an application will use a single encoding internally (UTF-8 in most cases). Another approach is to use a Wide_String (16-bit characters) or a Wide_Wide_String (32-bit characters). The same comment as for UTF-16 and UTF-32 applies: these make character manipulation more convenient, but at the cost of wasted memory.
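The trade-off can be observed with the standard Ada 2012 packages alone. The sketch below (names invented for the example) decodes a short UTF-8 String into a Wide_Wide_String and compares the two lengths.

```ada
with Ada.Text_IO; use Ada.Text_IO;
with Ada.Strings.UTF_Encoding.Wide_Wide_Strings;

procedure Compare_Sizes is
   use Ada.Strings.UTF_Encoding;

   --  The word "cafe" with an e-acute, as UTF-8 bytes: the accented
   --  letter takes two bytes (C3 A9)
   UTF8_Text : constant UTF_8_String :=
     "caf" & Character'Val (16#C3#) & Character'Val (16#A9#);

   --  One array element per character after decoding
   Decoded : constant Wide_Wide_String :=
     Wide_Wide_Strings.Decode (UTF8_Text);
begin
   Put_Line ("bytes in the UTF-8 String:"
             & Natural'Image (UTF8_Text'Length));
   Put_Line ("characters once decoded:  "
             & Natural'Image (Decoded'Length));
end Compare_Sizes;
```

The String holds 5 bytes and the Wide_Wide_String 4 characters. Indexing the decoded form gives one character per element, but each element is wider than a byte.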

Manipulating UTF-8 and UTF-16 strings

The last piece of the puzzle, once we have a Unicode string in memory, is to find each character in it. This requires specialized subprograms, since the number of bytes is variable for each character.

XML/Ada's Unicode module includes such a set of subprograms in its Unicode.CES.* packages. In general, going forward is relatively easy and can be done efficiently, whereas going backward in a string is more complex and less efficient.

The GNAT run-time library also contains such packages, for instance GNAT.Encode_UTF8_String and GNAT.Decode_UTF8_String. In particular, the latter provides Decode_Wide_Character, Next_Wide_Character, and Prev_Wide_Character, to find all the characters in a string.
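As an illustration, here is a minimal sketch that walks through a UTF-8 string one character at a time. It assumes the Decode_Wide_Wide_Character procedure of GNAT.Decode_UTF8_String (the 32-bit counterpart of Decode_Wide_Character), which takes the string, an index that it advances past the decoded character, and the resulting character. Check the package spec shipped with your compiler before relying on it.

```ada
with Ada.Text_IO;             use Ada.Text_IO;
with GNAT.Decode_UTF8_String; use GNAT.Decode_UTF8_String;

procedure Walk_UTF8 is
   --  Three characters as UTF-8 bytes: e-acute (C3 A9), a space,
   --  and Cyrillic shcha (D1 89)
   Text : constant String :=
     Character'Val (16#C3#) & Character'Val (16#A9#) & ' '
     & Character'Val (16#D1#) & Character'Val (16#89#);

   Ptr  : Natural := Text'First;
   Char : Wide_Wide_Character;
begin
   while Ptr <= Text'Last loop
      --  Decodes the character starting at Ptr, then moves Ptr to
      --  the first byte of the next character
      Decode_Wide_Wide_Character (Text, Ptr, Char);
      Put_Line ("code point"
                & Natural'Image (Wide_Wide_Character'Pos (Char)));
   end loop;
end Walk_UTF8;
```

The five bytes encode the code points 233, 32 and 1097, so the loop body runs three times rather than five.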