[rescue] OT: broken de-MIME-ers should be shot! ;-)

James Lockwood james at foonly.com
Thu Apr 18 09:44:24 CDT 2002


On Thu, 18 Apr 2002, Kurt Mosiejczuk wrote:

> Is UTF-8 useless for Asian countries?  And is UTF-8 something that is mainly
> pushed by Win2k?  If so, I take back my comment about better handling it =)

UTF-8 is a variable-size representation, 1 to 4 bytes per character.
"Base ASCII" occupies 1 byte, Latin-1 and most other non-Asian
alphabets occupy 2 bytes, and most Asian characters occupy 3 (anything
outside the Basic Multilingual Plane takes 4).  It is space-efficient if
you are dealing mostly with "western" alphabets, but requires additional
deblocking whenever the text is processed: every character must be
examined to determine how many bytes it occupies, so you can't just
index directly into a buffer to get the Nth character.
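
To make the deblocking cost concrete, here is a rough Python sketch (not
taken from any particular library; utf8_len, seq_len and nth_char are
names made up for illustration).  It assumes the buffer holds valid UTF-8
and simply walks it from the start, reading each lead byte to learn how
far to skip:

```python
# Illustrative sketch only; all helper names here are hypothetical.

def utf8_len(ch: str) -> int:
    """Number of bytes a single character occupies when encoded as UTF-8."""
    return len(ch.encode("utf-8"))

def seq_len(lead: int) -> int:
    """Length of a UTF-8 sequence, determined from its first (lead) byte."""
    if lead < 0x80:            # 0xxxxxxx: plain ASCII
        return 1
    if lead >> 5 == 0b110:     # 110xxxxx: 2-byte sequence
        return 2
    if lead >> 4 == 0b1110:    # 1110xxxx: 3-byte sequence
        return 3
    if lead >> 3 == 0b11110:   # 11110xxx: 4-byte sequence
        return 4
    raise ValueError("not a valid UTF-8 lead byte")

def nth_char(buf: bytes, n: int) -> str:
    """Return the n-th (0-based) character of a UTF-8 buffer.

    Assumes buf is valid UTF-8. Every earlier character has to be
    stepped over, so the cost grows with n.
    """
    pos = 0
    for _ in range(n):
        pos += seq_len(buf[pos])
    return buf[pos:pos + seq_len(buf[pos])].decode("utf-8")

sample = "Aé日z".encode("utf-8")   # 1 + 2 + 3 + 1 bytes
third = nth_char(sample, 2)        # has to skip "A" and "é" first
```

The point is only that nth_char has to loop; there is no way to compute
the byte offset of character N from N alone.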

UCS-2 is a fixed-size (2 bytes per character) representation.  (UTF-16
is identical for the Basic Multilingual Plane, but encodes anything
beyond it as a 4-byte surrogate pair, so it is only fixed-width if you
ignore those.)  In UCS-2 all characters occupy 2 bytes and indexing is
trivial.  However, space requirements double for western text.  This
means that the optimal technique for Unicode processing is frequently to
use UCS-2 internally (as is done with Java) to allow indexed text
manipulation, and to use UTF-8 as the backend storage in a database
(where field indexing can be done in bulk).
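
The tradeoff can be sketched in a few lines of Python (ucs2_nth_char and
storage_cost are hypothetical names, and the sketch assumes the text
stays inside the Basic Multilingual Plane, so that every character really
is one 2-byte unit):

```python
# Illustrative sketch only; assumes BMP-only text (no surrogate pairs).

def ucs2_nth_char(buf: bytes, n: int) -> str:
    """n-th (0-based) character of a big-endian 2-byte-per-char buffer.

    The byte offset is simply 2 * n, so no scanning is needed.
    """
    return buf[2 * n:2 * n + 2].decode("utf-16-be")

def storage_cost(text: str) -> tuple:
    """(UTF-8 size, 2-byte-per-char size) in bytes for the same text."""
    return len(text.encode("utf-8")), len(text.encode("utf-16-be"))

internal = "Aé日z".encode("utf-16-be")   # 4 characters, 8 bytes
third = ucs2_nth_char(internal, 2)       # direct lookup at byte offset 4
western = storage_cost("hello")          # plain ASCII text
```

For pure ASCII input the second number from storage_cost is twice the
first, which is the doubling mentioned above; the indexing function, on
the other hand, contains no loop at all.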

UTF-8 has been pushed by plenty of others besides Windows.  Java has been a
"paper" Unicode pusher (for the longest time you could manipulate Unicode
chars inside of programs but you couldn't actually do any useful I/O with
them) but has been getting better recently.  Most of the major database
engines now allow Unicode storage, either as part of the base package or
as an add-on.

I put the first production Sybase Unicode DB into service (according to
Sybase) in 1997.  It was an amazing success, unlike the flaming heap of
dog crap that was their Verity integration for full text search.

-James


