OASIS Mailing List Archives



   RE: [xml-dev] UTF-8+names


James Clark wrote:
> But with +names you don't want to work at the encoding level. For
> example, if you have a ü in your text file, that will be two bytes in
> UTF-8+names, but you would want to work with it as a single character.
> To edit a UTF-8+names text file, you need to make your text editor
> treat it as if it were encoded in UTF-8. In other words, to make
> things work you have to edit it in the wrong encoding. This will be
> extremely confusing to users.

This is precisely what I meant when I wrote:

What you and Tim are proposing is to define additional bit patterns for certain Unicode characters, which, when re-interpreted as (sequences of) UTF-8 bit patterns, look like XML entity references.

Therefore at the very heart of your proposal is a re-interpretation trick of bit patterns between UTF-8 on one side and UTF-8+names on the other side.

Indeed, if one uses UTF-8+names just as an encoding of Unicode (with no re-interpretation trick), no human user will ever see those &nbsp; things. All that humans will see is some displayable form of the NO-BREAK SPACE character, which happened to be encoded as 0x26 0x6E 0x62 0x73 0x70 0x3B rather than as 0xNN1 0xNN2 (the two bit patterns being equivalent).

In other words, if the UTF-8+names encoding is used to go from Unicode code points to bit patterns and vice versa (which is how an encoding is supposed to be used), the whole point of defining human-readable alternatives is defeated.  For the human-readable alternatives to be useful, you need to resort to reinterpretation of this encoding as if it were a different encoding.
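The difference between using UTF-8+names as an encoding and re-interpreting it as UTF-8 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function decode_utf8_names and the one-entry NAMES table are hypothetical, invented for this example, and are not taken from the proposal text or from any library.

```python
import re

# Hypothetical one-entry name table, for illustration only; an actual
# UTF-8+names definition would cover whole entity sets.
NAMES = {"nbsp": "\u00a0"}

_NAME_REF = re.compile(r"&([A-Za-z][A-Za-z0-9]*);")


def decode_utf8_names(data: bytes) -> str:
    """Decode as UTF-8, then map known &name; byte patterns to characters.

    Unknown names are left untouched.
    """
    text = data.decode("utf-8")
    return _NAME_REF.sub(lambda m: NAMES.get(m.group(1), m.group(0)), text)


named = bytes([0x26, 0x6E, 0x62, 0x73, 0x70, 0x3B])  # the six-byte pattern
plain = "\u00a0".encode("utf-8")                      # the two-byte UTF-8 pattern

# Used as an encoding, both bit patterns yield the same single character,
# so nothing human-readable is ever shown.
assert decode_utf8_names(named) == decode_utf8_names(plain) == "\u00a0"

# Only when the same bytes are re-interpreted as plain UTF-8 does the
# human-readable form appear.
assert named.decode("utf-8") == "&nbsp;"
```

The readable name shows up only in the last assertion, that is, only when the bytes are decoded with an encoding other than the one they were written in.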

> 1. General publishing. This community wants the HTML entity 
> sets.  I think the problem here is a software/education 
> problem which is decreasing all the time.  Almost all modern 
> systems have fonts that can display almost all the characters 
> in these entity sets. The desktop environments that I'm 
> familiar with all offer a character map applet which is 
> sufficient (albeit not very efficient) for entry of 
> characters for which you have fonts. The quality of Unicode 
> support offered by standard text editors is improving all the time.
> CJK users have long dealt with the problem of how to enter 
> characters for which their keyboard has no key. CJK software 
> typically provides "input methods" to allow efficient, 
> user-friendly entry of such characters. This sort of 
> technology should be applied for entering Unicode characters. 
>  Input methods can easily leverage the standard Unicode 
> names, rather than having to invent and maintain a competing 
> set of shorter names.

This is what I meant when I said that the whole issue should probably be addressed at the software level, rather than by introducing a new encoding.
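As a rough sketch of what "at the software level" could mean, an input method can resolve the standard Unicode names directly and leave the encoding alone. The helper char_from_unicode_name below is a hypothetical stand-in for an input method's lookup step; the only real API it relies on is unicodedata.lookup from the Python standard library.

```python
import unicodedata


def char_from_unicode_name(name: str) -> str:
    """Return the character that has the given standard Unicode name.

    Hypothetical helper for illustration; raises KeyError for unknown names.
    """
    return unicodedata.lookup(name)


# The user types a name; the file receives the ordinary character, which
# is then stored in whatever encoding the file already uses.
assert char_from_unicode_name("NO-BREAK SPACE") == "\u00a0"
assert char_from_unicode_name("LATIN SMALL LETTER U WITH DIAERESIS") == "\u00fc"
```

Nothing about the file's encoding changes here; the names exist only in the editing tool.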

We seem to be in agreement on these two basic points.




Copyright 2001 XML.org. This site is hosted by OASIS