[rescue] UTF-8 [was T5220 update]

Jonathan Patschke jp at celestrion.net
Thu Nov 2 09:14:11 CDT 2017


On Wed, 1 Nov 2017, Mouse wrote:

> Storage compactness is a completely spurious claim except for those
> using mostly-ASCII characters.  UTF-8, as compared to a stream of
> 16-bit codepoints, does not save storage for anything except ASCII,

This was a design goal, and I don't think it's as bad as all that.  Many
of the scripts using those wider characters have a greater information
density per glyph than the western scripts do.  Cyrillic gets shafted, and
that's probably somewhere between coincidence and politics.
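
A rough way to check the storage claim is to count bytes for a few short
strings. This is only a sketch: the sample words are illustrative, not a
corpus, and KOI8-R is included as an assumed example of the 8-bit encoding
Cyrillic text might otherwise have used.

```python
# Sample strings are illustrative only; real text will vary.
samples = {
    "english": "hello",
    "russian": "привет",
    "chinese": "你好",
}

def encoded_sizes(text):
    """Return (code points, UTF-8 bytes, UTF-16 bytes without a BOM)."""
    return (len(text), len(text.encode("utf-8")), len(text.encode("utf-16-le")))

for name, text in samples.items():
    print(name, encoded_sizes(text))

# Cyrillic fits in one byte per character in a legacy 8-bit encoding.
print("russian koi8-r bytes:", len(samples["russian"].encode("koi8-r")))
```

In this sample, Cyrillic costs two bytes per character in both UTF-8 and
UTF-16, against one byte in KOI8-R, while the Chinese sample costs three
bytes per character in UTF-8 against two in UTF-16.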

> In my opinion - and that's all it is, my opinion, and it's probably
> worth about what you paid for it - UTF-8 is an abomination.  The
> benefits of each character being the same size in memory far outweighs,
> to me, the storage compaction UTF-8 provides for ASCII text (or, if you
> use 24- instead of 16-bit codepoints, the handful of writing systems
> outlined above).

My instinct is to agree, but for applications where that matters (nearly
any in-memory processing), there's UTF-32/UCS-4.  You do the code-point
processing in bulk once instead of iteratively, and your in-core view of
the file has characters of the same size.  UTF-8's compactness is intended
for transfer and storage[0] primarily.
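
As a minimal sketch of "decode once, then work on fixed-width units": the
helper name to_utf32_units is made up for this example, and the sample
string is arbitrary.

```python
import struct

def to_utf32_units(utf8_bytes):
    """Decode UTF-8 once; return a tuple of 32-bit code point values."""
    raw = utf8_bytes.decode("utf-8").encode("utf-32-le")
    # Every code point occupies exactly four bytes in UTF-32.
    return struct.unpack("<%dI" % (len(raw) // 4), raw)

data = "naïve 你好".encode("utf-8")   # variable-width on disk or on the wire
units = to_utf32_units(data)          # fixed-width in core

# Indexing the n-th character is now constant-time arithmetic.
print(len(data), len(units), hex(units[2]), chr(units[6]))
```

The variable-width scan happens once, in the decode step; everything after
that sees one array slot per code point.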

Further, a system that defaults to single-byte storage ends arguments
about byte order; expand the bytes into core however you see fit, but
serialization with other systems won't depend on byte-order marking.
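
To illustrate the byte-order point, here is a small sketch using a single
character. The choice of U+00E9 is arbitrary.

```python
text = "\u00e9"  # LATIN SMALL LETTER E WITH ACUTE

# UTF-16 has two possible serializations of the same code point.
utf16_le = text.encode("utf-16-le")
utf16_be = text.encode("utf-16-be")

# UTF-8 is defined as a byte sequence, so there is only one.
utf8 = text.encode("utf-8")

# Reading little-endian data as big-endian silently yields a different
# code point (here one in the Private Use Area) instead of an error.
misread = utf16_le.decode("utf-16-be")

print(utf16_le, utf16_be, utf8, hex(ord(misread)))
```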

>> For its faults, UTF-8 and Unicode are _FAR_ better than their
>> predecessors.
>
> Maybe they would have been if there were no installed base - though I
> still consider variable-sized (in storage) characters an abomination.

The Big Win for the notion of variable-width characters, if we're talking
about installed bases, is that UTF-8 software can correctly process all
7-bit ASCII text--including control codes.  This is, by far, the single
largest set of legacy electronic textual data.

That facilitates support for wide characters being introduced into
software without a Flag Day when all characters need to be 24 or 32 bits
wide.
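
A quick sketch of why that works: every 7-bit ASCII byte, control codes
included, decodes to the same character under UTF-8, and no byte below 0x80
ever appears inside a multi-byte sequence. The function name and the sample
record are made up for this example.

```python
# All 128 ASCII code points, control codes included.
ascii_bytes = bytes(range(128))
same_text = ascii_bytes.decode("utf-8") == ascii_bytes.decode("ascii")

def ascii_bytes_inside_multibyte(text):
    """Count bytes < 0x80 that belong to non-ASCII characters in UTF-8."""
    count = 0
    for ch in text:
        encoded = ch.encode("utf-8")
        if len(encoded) > 1:
            count += sum(1 for b in encoded if b < 0x80)
    return count

# Byte-oriented code that splits on tab or newline keeps working.
record = "имя\t名前\n".encode("utf-8")
fields = record.rstrip(b"\n").split(b"\t")

print(same_text,
      ascii_bytes_inside_multibyte("привет 你好"),
      [f.decode("utf-8") for f in fields])
```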

>> Thompson and Pike were presenting talks on UTF-8 in the early-to-mid
>> 1990s.
>
> So?  I can't see that as relevant, unless your stance is something
> like, UTF-8 is the best encoding of the best character set for all
> users and purposes, so it is reasonable to expect everyone/everything
> to support it as soon as it was introduced (modulo implementation
> delay).

At 24 years on, that delay could involve conceiving the programmer who
would later implement UTF-8 support and sending him/her through
university.  "My system is more than a year or two old," would be a
perfectly valid excuse if Unicode were a passing fad with niche
applicability and a majority of the planet well-serviced by ASCII.

> Perhaps that is your stance, in which case, I have the painful duty to
> break it to you that it's not so.  There are lots of users and purposes
> for which Unicode, never mind UTF-8, is a wrong answer, even today.
> Many of them involve the sort of hardware and software this list
> focuses on, hence my remark.

The thing about the network is that something doesn't have to be the best
to be nigh-universal, which is how we got Unix to begin with.  There will
probably never be a best-in-all-cases-ever instance of any technology,
but there will usually be one that's pretty reasonable to support by
default.

That used to be ASCII.  These days, it really looks[1] to be Unicode, for
better or worse.  Looping all the way back to the start of this
divergence, if software needs ASCII, iconv is a much better input filter
than &= 127.
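
A small sketch of the difference, using Python's codecs as a stand-in for
iconv (an assumption for illustration; iconv's own substitution and
transliteration behavior depends on the implementation and options used).
The function names are made up for this example.

```python
def mask_high_bit(data):
    """The '&= 127' approach: clear bit 7 of every byte."""
    return bytes(b & 0x7F for b in data)

def to_ascii(data, source_encoding="utf-8"):
    """Decode properly, then mark each unrepresentable character with '?'."""
    return data.decode(source_encoding).encode("ascii", errors="replace")

data = "café".encode("utf-8")

print(mask_high_bit(data))  # each byte of the 'é' sequence becomes unrelated ASCII
print(to_ascii(data))       # the 'é' becomes a single visible placeholder
```

Masking turns one accented letter into two unrelated ASCII characters with
no sign that anything was lost; a real conversion at least leaves one
placeholder per character it could not represent.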


[0] Filesystem support for lz4 and similar compression schemes makes even
     this a hard claim to defend today, but in 1993 the relative
     processing overhead was much higher.
[1] I very likely have a bias in my perception as to how valuable a
     universal character set is due to most of my coworkers speaking
     English (or any Western language) as a second or third language.
-- 
Jonathan Patschke
Austin, TX
USA

