On Jan 21, 2004, at 11:57 AM, jcowan@reutershealth.com wrote:
>> The 'codePoint' typedef may be problematic:
>>
>> // Unicode code points (4-byte int on most systems)
>> typedef wchar_t codePoint;
>>
>> ...
> I have argued privately that wchar_t is in fact the Right Thing here
> despite its variability in size (UTF-32 on Unix platforms, UTF-16 on
> Windows), because it makes genx compatible with both standardized and
> non-standardized facilities, most especially "..."L strings. Some
> conditional logic will be needed to interpret the input as UTF-16 or
> UTF-32, which can be based on sizeof(wchar_t). Hypothetical platforms
> where sizeof(wchar_t) == 1 can be neglected.
Almost. How about we leave it as wchar_t, but *not* UTF-16, so a value
that's in a surrogate block is an error. Then we change the name from
codePoint (which could be interpreted as meaning "UTF-16 Code Point") to
something more explicit like
numericValueCorrespondingToAUnicodeCharacterAsInUPlusFourHexDigitsIsThatClear
John Cowan has suggested that "codeUnit" might be a good name; I'd be
inclined to "uniChar". Any other ideas?
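To make the proposal concrete, here is a rough sketch of the check it implies. It assumes the "uniChar" name (not settled) and a helper called isValidUniChar, which is purely hypothetical and not part of genx. The upper bound of U+10FFFF is my addition; the proposal above only rules out surrogates.

```c
#include <wchar.h>

/* Hypothetical name; one of the candidates floated in this thread. */
typedef wchar_t uniChar;

/*
 * Hypothetical helper, not an existing genx function.
 * Returns 1 if c can stand for a single Unicode character, 0 otherwise.
 * Values in the surrogate block (U+D800..U+DFFF) are rejected rather
 * than being interpreted as halves of a UTF-16 pair.
 */
int isValidUniChar(uniChar c)
{
    /* Widen to an unsigned type so a negative wchar_t is also rejected. */
    unsigned long v = (unsigned long) c;

    if (v >= 0xD800UL && v <= 0xDFFFUL)
        return 0;               /* surrogate block: an error, not UTF-16 */
    if (v > 0x10FFFFUL)
        return 0;               /* beyond the Unicode range (assumption) */
    return 1;
}
```

On a platform where wchar_t is only 16 bits wide, the second test can never fire, and characters above U+FFFF simply cannot be passed in at all. That is the price of not treating the input as UTF-16.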
If someone wants to put a generic UTF-16 processor on top of genx, that
would be fine. I don't see the demand for supporting it at the input
end of genx because the UTF-16-centric languages like Java and C# have
decent XML-writing software already. -Tim