John Cowan wrote:
> Mike Champion scripsit:
> > Sure! The question is how to do something to make our
> > lives less unpleasant while The System plots forward.
> > Be patient, vote with our feet against crappy software
> > that can't handle Unicode decently, or try to hack up
> > something in the interim? The whole point of Unicode
> > encodings is to map conveniently enterable text onto
> > codepoints, and whatever the technical virtues or
> > flaws of Tim's strawman proposal, this seems like the
> > right layer to address it.
> Character naming isn't just a hack for 8-bit users; it's
> just as practical for someone using Unicode directly.
> The human issue of referencing characters over a huge
> codespace is just as great whatever the underlying encoding.
Sorry, I am still unconvinced.
It seems to me there is a confusion of layers here, between the displayable
form of a Unicode character and the bit pattern of its encoding.
What you and Tim are proposing is to define additional bit patterns (*) for
certain Unicode characters, which, when re-interpreted as (sequences of)
UTF-8 bit patterns, look like XML entity references.
Therefore at the very heart of your proposal is a re-interpretation trick of
bit patterns between UTF-8 on one side and UTF-8+names on the other side.
Indeed, if one uses UTF-8+names just as an encoding of Unicode (with no
re-interpretation trick), no human user will ever see those things.
All that humans will see is some displayable form of the NO-BREAK SPACE
character, which happened to be encoded as 0x26 0x6E 0x62 0x73 0x70 0x3B
rather than as 0xNN1 0xNN2 (the two bit patterns being equivalent).
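To make the layering concrete, here is a small Python sketch of the two bit patterns for the same code point. The "UTF-8+names" codec is of course hypothetical; I simulate its name form with the standard html module:

```python
# Both bit patterns below encode the same code point, U+00A0 NO-BREAK SPACE.
# Once decoded under its own encoding, neither leaves anything
# "human-readable" behind -- the user just sees the space character.
import html
import unicodedata

char = "\u00a0"
print(unicodedata.name(char))        # NO-BREAK SPACE

utf8_form = char.encode("utf-8")     # plain UTF-8 bit pattern: b'\xc2\xa0'
names_form = b"&nbsp;"               # hypothetical UTF-8+names: 0x26 6E 62 73 70 3B

# Decoding each form under its own encoding recovers the identical character:
assert utf8_form.decode("utf-8") == char
assert html.unescape(names_form.decode("ascii")) == char
```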
In other words, if the UTF-8+names encoding is used to go from Unicode code
points to bit patterns and vice versa (which is how an encoding is supposed
to be used), the whole point of defining human-readable alternatives is
defeated. For the human-readable alternatives to be useful, you need to
resort to reinterpretation of this encoding as if it were a different encoding.
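The reinterpretation trick can be sketched in one line of Python (again treating UTF-8+names as a hypothetical codec):

```python
# The name becomes visible to a human only when the UTF-8+names bytes are
# deliberately misread as plain UTF-8 (these bytes are ASCII-compatible).
names_form = b"&nbsp;"               # hypothetical UTF-8+names form of U+00A0

as_plain_utf8 = names_form.decode("utf-8")
print(as_plain_utf8)                 # the six characters "&nbsp;", not the space
assert as_plain_utf8 == "&nbsp;"
```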
That is, unless you want to modify Unicode itself, by introducing a macro
mechanism that would also affect the displayable form of the characters. In
other words, I wouldn't see any point in defining a macro mechanism at the
level of the encoding, because it is not reflected in the displayable form.
Who is the end-user of Unicode after all? I am sure it is the person that
sees the displayable form of the characters, not the person that uses
technical tricks such as a sister encoding to prevent the macros from being
expanded.
I am not actually proposing to add this macro functionality to Unicode, but
I am saying that there are two places where the initial problem can be
addressed: either at the XML level or at the Unicode level (which involves
the displayable form). Not at the encoding level.
(*) The byte sequence 0x26 0x6E 0x62 0x73 0x70 0x3B (the ASCII spelling of
"&nbsp;") would be such a bit pattern.
> John Cowan firstname.lastname@example.org www.reutershealth.com
> www.ccil.org/~cowan [R]eversing the apostolic precept to be all things to
> all men, I usually [before Darwin] defended the tenability of the received
> doctrines, when I had to do with the [evolution]ists; and stood up for the
> possibility of [evolution] among the orthodox--thereby, no doubt, increasing
> an already current, but quite undeserved, reputation for needless
> combativeness. --T. H. Huxley
The xml-dev list is sponsored by XML.org <http://www.xml.org>, an initiative
of OASIS <http://www.oasis-open.org>
The list archives are at http://lists.xml.org/archives/xml-dev/
To subscribe or unsubscribe from this list use the subscription manager.