[rescue] UTF-8 [was T5220 update]

Mouse mouse at Rodents-Montreal.ORG
Wed Nov 1 21:22:17 CDT 2017


>> as if Unicode were some kind of God-given One True Character Set and
>> UTF-8 its One True Encoding.
> UTF-8 is a reasonable compromise in a world of mutually-incompatible
> human scripts.

In some respects.  Unicode sort-of is; UTF-8 is...well, it takes all of
Unicode's faults (like the CJK disaster) and adds some of its own.

> It'd be Really Nice if the characters were the same width, but that
> means weighing lots of 0-bytes in text versus freezing out anyone
> whose languages aren't expressible in the 8-bit Latin encodings.

Storage compactness is a completely spurious claim except for those
using mostly-ASCII characters.  UTF-8, as compared to a stream of
16-bit codepoints, does not save storage for anything except ASCII, and
it expands everything from U+0800 up, which is to say, everything except
Latin-1, Latin Extended A, Latin Extended B, Greek, Cyrillic, Armenian,
Hebrew, Arabic, Syriac, Thaana, and some stuff like IPA and modifiers.
As compared to a stream of 24-bit codepoints, UTF-8 compresses only the
stuff below U+0800 and expands everything above the BMP.
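
The size comparison above can be checked mechanically.  What follows is
a minimal sketch in Python, not anything from the original message: the
names utf8_len, versus_fixed and samples are made up for illustration,
and the sample codepoints are arbitrary picks, one per UTF-8 length
class.  The 16-bit column is only filled in for BMP codepoints, since a
plain 16-bit stream cannot hold anything above U+FFFF.

```python
def utf8_len(cp: int) -> int:
    """Bytes UTF-8 needs for one code point (length ranges per RFC 3629)."""
    if cp < 0x80:
        return 1
    if cp < 0x800:
        return 2
    if cp < 0x10000:
        return 3
    return 4


def versus_fixed(cp: int, width: int) -> str:
    """Say whether UTF-8 is smaller, same, or larger than `width` bytes."""
    n = utf8_len(cp)
    if n < width:
        return "smaller"
    if n > width:
        return "larger"
    return "same"


# Hypothetical sample code points, one per UTF-8 length class.
samples = [
    ("LATIN CAPITAL A", 0x0041),
    ("GREEK SMALL ALPHA", 0x03B1),
    ("CJK U+4E2D", 0x4E2D),
    ("U+1F600 (outside the BMP)", 0x1F600),
]

for name, cp in samples:
    row = [f"U+{cp:04X}", f"utf8={utf8_len(cp)}"]
    if cp <= 0xFFFF:
        # A plain 16-bit stream only covers the BMP.
        row.append(f"vs 16-bit: {versus_fixed(cp, 2)}")
    row.append(f"vs 24-bit: {versus_fixed(cp, 3)}")
    print(name, *row)
```

Only the ASCII sample comes out smaller than 16 bits, and only the
sample above the BMP comes out larger than 24 bits, which is the
pattern described above.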

In my opinion - and that's all it is, my opinion, and it's probably
worth about what you paid for it - UTF-8 is an abomination.  The
benefits of each character being the same size in memory far outweigh,
to me, the storage compaction UTF-8 provides for ASCII text (or, if you
use 24- instead of 16-bit codepoints, the handful of writing systems
outlined above).
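
The message does not spell out which benefits it has in mind; one
commonly cited one, assumed here purely for illustration, is indexing.
With fixed-width storage the n-th codepoint sits at a computable
offset, while with UTF-8 it has to be found by scanning.  A rough
sketch in Python follows; both function names are hypothetical, and
the UTF-8 scan assumes well-formed input.

```python
def nth_offset_fixed(n: int, width: int) -> int:
    """Byte offset of the n-th code point (0-based) in fixed-width storage."""
    return n * width


def nth_offset_utf8(data: bytes, n: int) -> int:
    """Byte offset of the n-th code point (0-based) in well-formed UTF-8.

    There is no way to jump straight there: walk from the start and
    count every byte that is not a continuation byte (10xxxxxx).
    """
    seen = -1
    for offset, byte in enumerate(data):
        if (byte & 0xC0) != 0x80:  # ASCII or a lead byte starts a code point
            seen += 1
            if seen == n:
                return offset
    raise IndexError(n)


# One code point from each UTF-8 length class: 1, 2, 3 and 4 bytes.
text = "a\u00e9\u4e2d\U0001F600"
data = text.encode("utf-8")

for i in range(len(text)):
    print(i, nth_offset_utf8(data, i), nth_offset_fixed(i, 4))
```

The fixed-width lookup is a single multiplication regardless of n; the
UTF-8 lookup touches every byte before the answer.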

> The old school of bickering code-pages can remind us how that goes.

See xkcd #927.  You can cover your eyes and wish as hard as you like,
but it won't make the installed base go away.

> For its faults, UTF-8 and Unicode are _FAR_ better than their
> predecessors.

Maybe they would have been if there were no installed base - though I
still consider variable-sized (in storage) characters an abomination.

The religion surrounding it is, if anything, worse.  It gives us
botches like the ssh definition, which is unimplementable as written on
every Unix variant I've run, from SunOS 3.x to NetBSD 5.2, and probably
a whole bunch of others.  (It's the one piece of nonconformance I'm
aware of in my own implementation.)

> There are plenty of email threads at my day job that would be
> inexpressible in the older encodings because of the ways that Big5,
> Shift-JIS, and CP-1252 collide.

Fine.  I have nothing against those who find it useful using it among
themselves, same as any other charset/encoding.  What bothers me is the
camp which apparently believes that UTF-8 is the One Right Encoding
(and Unicode the One Right Character Set) for all users, for all
purposes, and that anything that can't/doesn't handle it is obviously
broken and needs fixing.  At least, that's what their stance feels like
to me, based on their actions.

> Thompson and Pike were presenting talks on UTF-8 in the early-to-mid
> 1990s.

So?  I can't see that as relevant, unless your stance is something
like, UTF-8 is the best encoding of the best character set for all
users and purposes, so it is reasonable to expect everyone/everything
to support it as soon as it was introduced (modulo implementation
delay).

Perhaps that is your stance, in which case, I have the painful duty to
break it to you that it's not so.  There are lots of users and purposes
for which Unicode, never mind UTF-8, is a wrong answer, even today.
Many of them involve the sort of hardware and software this list
focuses on, hence my remark.

/~\ The ASCII				  Mouse
\ / Ribbon Campaign
 X  Against HTML		mouse at rodents-montreal.org
/ \ Email!	     7D C8 61 52 5D E7 2D 39  4E F1 31 3E E8 B3 27 4B

