File Unicode.html    Author McKeeman    Copyright © 2008

Unicode

Unicode, or ISO/IEC 10646, is ASCII on steroids. The objective is to be able to represent, especially on the WWW, all of the characters in use or ever used: CJK, Cherokee, Cuneiform, Mathematics, Music, and so on.

To get all the world's languages into computers, the 128 bit patterns of 7-bit ASCII are not enough. The Han (or CJK) languages by themselves take more than 50K patterns. So there are three additional coding schemes: one with 8-bit codes (UTF-8), one with 16-bit codes (UTF-16), and one with 32-bit codes (UTF-32). UTF-8 and UTF-16 are variable-length encodings; UTF-32 always uses one 32-bit code per character. Each of the schemes can represent every Unicode code point, from U+0000 through U+10FFFF (a little over a million values). Unicode has directional control (Hebrew is right-to-left, for example), a collating sequence, a class for alphabetic letters (there are a lot of them), and many other representational features.
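As a rough illustration of the three schemes, the sketch below uses Python's built-in codecs to count how many bytes each one spends on a few sample characters. The helper name `encoded_sizes` and the choice of samples are assumptions made for this example; the big-endian codec variants are used only so that no byte-order mark is added to the count.

```python
# Sample characters: ASCII, accented Latin, CJK, and Cuneiform.
SAMPLES = ["A", "\u00e9", "\u4e2d", "\U00012000"]

def encoded_sizes(ch):
    """Return the byte length of ch in UTF-8, UTF-16 and UTF-32 (big-endian, no BOM)."""
    return tuple(len(ch.encode(codec))
                 for codec in ("utf-8", "utf-16-be", "utf-32-be"))

for ch in SAMPLES:
    # Print the code point in U+XXXX form followed by the three sizes.
    print(f"U+{ord(ch):04X}", encoded_sizes(ch))
```

UTF-8 grows from one to four bytes across these samples, UTF-16 uses two bytes until the Cuneiform sign forces four, and UTF-32 stays at four bytes throughout.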

This is a big deal. National governments negotiate over who gets what blocks of characters. Unicode is the basis for browsers, Matlab, Java and XML among others.

ISO/IEC 10646 defines the characters and assigns each one a number (called a code point). The glyphs can be seen on various public websites; for some of the variety of characters, see this Unicode fan page.

Ken Thompson and Rob Pike proposed a trick to make ASCII a proper subset of Unicode. It was accepted by ISO as UTF-8. Integer values 0-127 are reserved for ASCII, and the rest of the code points are represented with more than one 8-bit byte. Here is the scheme (UTF-8):

  0bbbbbbb                                               7-bit ASCII
  110bbbbb 10bbbbbb                                      5+6         = 11 bits
  1110bbbb 10bbbbbb 10bbbbbb                             4+6+6       = 16 bits
  11110bbb 10bbbbbb 10bbbbbb 10bbbbbb                    3+6+6+6     = 21 bits
  111110bb 10bbbbbb 10bbbbbb 10bbbbbb 10bbbbbb           2+6+6+6+6   = 26 bits
  1111110b 10bbbbbb 10bbbbbb 10bbbbbb 10bbbbbb 10bbbbbb  1+6+6+6+6+6 = 31 bits
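The first four rows of the table can be sketched in Python as follows. This is a minimal illustration, not a production encoder: the function name `utf8_encode` is made up for this example, it stops at four bytes on the assumption that the input is a Unicode code point no larger than U+10FFFF, and it does not reject the surrogate range U+D800 to U+DFFF as a strict encoder would.

```python
def utf8_encode(cp):
    """Encode one code point (an int) as UTF-8 bytes, following the bit layouts above."""
    if cp < 0 or cp > 0x10FFFF:
        raise ValueError("code point out of range")
    if cp < 0x80:                        # 0bbbbbbb
        return bytes([cp])
    if cp < 0x800:                       # 110bbbbb 10bbbbbb
        return bytes([0xC0 | (cp >> 6),
                      0x80 | (cp & 0x3F)])
    if cp < 0x10000:                     # 1110bbbb 10bbbbbb 10bbbbbb
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    return bytes([0xF0 | (cp >> 18),     # 11110bbb 10bbbbbb 10bbbbbb 10bbbbbb
                  0x80 | ((cp >> 12) & 0x3F),
                  0x80 | ((cp >> 6) & 0x3F),
                  0x80 | (cp & 0x3F)])

# Cross-check against Python's own codec for a CJK character.
print(utf8_encode(0x4E2D) == "\u4e2d".encode("utf-8"))
```

Because each branch is tried in order of increasing length, the function always picks the shortest form that can hold the code point.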

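This scheme provides some ability to resynchronize after a transmission error, because continuation bytes have the unique prefix 10 and can be skipped until a new character starts. The agreement is to use the shortest code possible for each character. Since RFC 3629 (2003), UTF-8 is restricted to code points up to U+10FFFF, so only the first four forms above are used in practice. Code 0 is left for end-of-string as in C: a zero byte never appears inside a multi-byte character.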
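The resynchronization idea can be sketched in a few lines of Python. The function name `next_char_start` and the damaged sample stream are assumptions made for this illustration; the only rule taken from the scheme itself is that continuation bytes begin with the bits 10.

```python
def next_char_start(data, pos):
    """Return the index of the first byte at or after pos that is not a continuation byte."""
    while pos < len(data) and (data[pos] & 0xC0) == 0x80:
        pos += 1                     # 10bbbbbb: middle of a character, skip it
    return pos

stream = "\u4e2dab".encode("utf-8")  # a 3-byte CJK character followed by "ab"
damaged = stream[1:]                 # pretend the first byte was lost in transit

start = next_char_start(damaged, 0)
print(start, damaged[start:].decode("utf-8"))
```

Only the character whose lead byte was lost is dropped; decoding resumes cleanly at the next character boundary.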

See the Unicode home page for details. See the IBM ICU for software to support Unicode.


Created: Wednesday, May 16, 2001
Last modified: Wed May 7 10:33:31 EDT 2008