   Re: [xml-dev] Genx

Tim Bray scripsit:

> Almost.  How about we leave it as wchar_t, but *not* UTF-16, so a value  
> that's in a surrogate block is an error.  Then we change the name from  
> codePoint (which could be interpreted as meaning "UTF-16 Code Point") to  
> something more explicit like
> 
> numericValueCorrespondingToAUnicodeCharacterAsInUPlusFourHexDigitsIsThatClear
> 
> John Cowan has suggested that "codeUnit" might be a good name, I'd be  
> inclined to "uniChar", any other ideas?

I must have unintentionally misled you.  A "code point" is an integer
in the range 0-0x10FFFF; Unicode maps characters to code points.  "Code
units" are chunks o' bits:  UTF-8, UTF-16, and UTF-32 map code points to
8-bit code units, 16-bit code units, and 32-bit code units respectively.
"UTF-16 code point" is a contradiction in terms.
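To make the distinction concrete, here is a minimal sketch of one code
point being mapped to code units in each encoding form.  The function
names (is_encodable, to_utf8, to_utf16, to_utf32) are made up for
illustration and are not part of genx or any standard library.  Each
returns the number of code units written, or 0 if the value cannot be
encoded.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Code points run 0..0x10FFFF; values in the surrogate block
   D800..DFFF cannot be encoded by any of the three UTFs. */
static int is_encodable(uint32_t cp) {
    return cp <= 0x10FFFF && !(cp >= 0xD800 && cp <= 0xDFFF);
}

/* One code point -> 1 to 4 eight-bit code units. */
size_t to_utf8(uint32_t cp, uint8_t out[4]) {
    if (!is_encodable(cp)) return 0;
    if (cp < 0x80) {
        out[0] = (uint8_t)cp;
        return 1;
    }
    if (cp < 0x800) {
        out[0] = (uint8_t)(0xC0 | (cp >> 6));
        out[1] = (uint8_t)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {
        out[0] = (uint8_t)(0xE0 | (cp >> 12));
        out[1] = (uint8_t)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (uint8_t)(0x80 | (cp & 0x3F));
        return 3;
    }
    out[0] = (uint8_t)(0xF0 | (cp >> 18));
    out[1] = (uint8_t)(0x80 | ((cp >> 12) & 0x3F));
    out[2] = (uint8_t)(0x80 | ((cp >> 6) & 0x3F));
    out[3] = (uint8_t)(0x80 | (cp & 0x3F));
    return 4;
}

/* One code point -> 1 or 2 sixteen-bit code units. */
size_t to_utf16(uint32_t cp, uint16_t out[2]) {
    if (!is_encodable(cp)) return 0;
    if (cp < 0x10000) {
        out[0] = (uint16_t)cp;
        return 1;
    }
    cp -= 0x10000;                              /* 20 bits remain */
    out[0] = (uint16_t)(0xD800 | (cp >> 10));   /* high surrogate */
    out[1] = (uint16_t)(0xDC00 | (cp & 0x3FF)); /* low surrogate  */
    return 2;
}

/* One code point -> exactly 1 thirty-two-bit code unit. */
size_t to_utf32(uint32_t cp, uint32_t out[1]) {
    if (!is_encodable(cp)) return 0;
    out[0] = cp;
    return 1;
}
```

For the single code point U+10330 (a Gothic letter), this gives four
8-bit code units (F0 90 8C B0), two 16-bit code units (D800 DF30),
or one 32-bit code unit (00010330).  The code point is the same in all
three cases; only the code units differ.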

However, on reflection I think that the Right Thing is to use
wchar_t directly in the API, since the whole point of using it is for
compatibility with other wchar_t-aware routines, either standardized
or platform-specific.  There is no point in hiding it behind a type name.
(As I said, if your platform has 8-bit wchar_t's, you deserve to lose.)

> If someone wants to put a generic UTF-16 processor on top of genx, that  
> would be fine.  I don't see the demand for supporting it at the input  
> end of genx because the UTF-16 centric languages like Java and C# have  
> decent xml-writing software already. -Tim

C and C++ on the Windows platform *are* UTF-16 centric.  If you put
a Gothic character (which lies outside the BMP, so UTF-16 needs a
surrogate pair for it) into an L"..." string, for example, it will
produce a string which is three wchar_t's long on Windows, whereas on
Unix it will be two wchar_t's long (including the trailing 0 in both
cases).  As I said, the additional code for converting UTF-16 (as
opposed to UTF-32) into UTF-8 is very small, and can be conditionalized
on sizeof(wchar_t).
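A rough sketch of what that might look like follows.  It is not genx
code: next_code_point and wcs_to_utf8 are hypothetical names, and the
error convention (returning (size_t)-1) is an assumption made for the
example.  The only place the width of wchar_t matters is the
surrogate-pair branch; everything else is shared.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <wchar.h>

/* Hypothetical helper: read the next code point from a 0-terminated
   wchar_t string, advancing *p.  Returns -1 for an unpaired surrogate
   or a value above 0x10FFFF. */
static long next_code_point(const wchar_t **p) {
    uint32_t c = (uint32_t)*(*p)++;
    if (sizeof(wchar_t) == 2) {     /* compile-time constant condition */
        c &= 0xFFFF;
        if (c >= 0xD800 && c <= 0xDBFF) {
            /* High surrogate: a low surrogate must follow. */
            uint32_t lo = (uint32_t)**p & 0xFFFF;
            if (lo < 0xDC00 || lo > 0xDFFF) return -1;
            (*p)++;
            return (long)(0x10000 + ((c - 0xD800) << 10) + (lo - 0xDC00));
        }
    }
    if ((c >= 0xD800 && c <= 0xDFFF) || c > 0x10FFFF) return -1;
    return (long)c;
}

/* Hypothetical converter: writes UTF-8 into out (at most cap bytes, no
   terminator) and returns the byte count, or (size_t)-1 on bad input
   or insufficient space. */
size_t wcs_to_utf8(const wchar_t *in, unsigned char *out, size_t cap) {
    static const unsigned char lead[] = { 0, 0, 0xC0, 0xE0, 0xF0 };
    size_t n = 0;
    while (*in != 0) {
        long cp = next_code_point(&in);
        if (cp < 0) return (size_t)-1;
        size_t len = cp < 0x80 ? 1 : cp < 0x800 ? 2 : cp < 0x10000 ? 3 : 4;
        if (n + len > cap) return (size_t)-1;
        if (len == 1) {
            out[n++] = (unsigned char)cp;
        } else {
            /* Lead byte carries the top bits, then 6 bits per byte. */
            out[n++] = (unsigned char)(lead[len] | (cp >> (6 * (len - 1))));
            for (size_t i = len - 1; i-- > 0; )
                out[n++] = (unsigned char)(0x80 | ((cp >> (6 * i)) & 0x3F));
        }
    }
    return n;
}
```

With something along these lines, the Gothic example should come out as
the same four bytes (F0 90 8C B0) whether the input arrived as a
surrogate pair in 16-bit wchar_t's or as a single 32-bit wchar_t.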

-- 
As you read this, I don't want you to feel      John Cowan 
sorry for me, because, I believe everyone       jcowan@reutershealth.com
will die someday.    -- From a Nigerian-type    http://www.reutershealth.com
                        scam spam I got         http://www.ccil.org/~cowan


Copyright 2001 XML.org. This site is hosted by OASIS