   Re: [xml-dev] ANN: Gorille 0.3

Elliotte Rusty Harold wrote:

> It could be worse, though. You could be using C, and trying to decode 
> UTF-8. :-)

?? It's about 10 lines of code, and has been written lots of
times now.  Last time I needed it I couldn't find one with the
exact buffer interface I needed so I coded it up from scratch
sometime in the course of an afternoon and it worked first time.
The spec is hardly unclear.  And it's a set of shift/mask
operations that are processor-friendly.  You need to use a
loop iterator rather than a for (i = 0; string[i]; i++) idiom,
big deal.
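
To make the shift/mask claim concrete, here is a minimal sketch of such a decoder. It is not the code from the post: the name utf8_decode and its buffer interface are hypothetical, and it assumes the now-usual limit of four bytes per character (up to U+10FFFF), rejecting overlong forms and surrogates.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: decode one character from buf[0..len).
 * Returns the number of bytes consumed (1 to 4) and stores the
 * code point in *out, or returns 0 on malformed or truncated input. */
static size_t utf8_decode(const unsigned char *buf, size_t len, uint32_t *out)
{
    /* Smallest code point that legitimately needs 2, 3, or 4 bytes. */
    static const uint32_t min_value[5] = {0, 0, 0x80, 0x800, 0x10000};
    size_t need, i;
    uint32_t cp;

    if (len == 0) return 0;

    /* The lead byte tells us the sequence length and carries the top bits. */
    if (buf[0] < 0x80)                { *out = buf[0]; return 1; }
    else if ((buf[0] & 0xE0) == 0xC0) { need = 2; cp = buf[0] & 0x1F; }
    else if ((buf[0] & 0xF0) == 0xE0) { need = 3; cp = buf[0] & 0x0F; }
    else if ((buf[0] & 0xF8) == 0xF0) { need = 4; cp = buf[0] & 0x07; }
    else return 0;                    /* stray continuation or invalid lead */

    if (len < need) return 0;         /* truncated sequence */

    /* Each continuation byte is 10xxxxxx and contributes six bits. */
    for (i = 1; i < need; i++) {
        if ((buf[i] & 0xC0) != 0x80) return 0;
        cp = (cp << 6) | (buf[i] & 0x3F);
    }

    if (cp < min_value[need]) return 0;          /* overlong encoding */
    if (cp > 0x10FFFF) return 0;                 /* beyond Unicode range */
    if (cp >= 0xD800 && cp <= 0xDFFF) return 0;  /* surrogate code point */

    *out = cp;
    return need;
}
```

A caller walks the buffer by adding the return value to its position, which is the "loop iterator" in place of the string[i] idiom.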

UTF-8 only really causes extra work when you want per-character
addressing into big strings, because then you need an indirect
table mapping character positions to byte offsets - the most
common case I can think of is maintaining on-screen render state.

But in most apps it's more common to point into text at a
few places (tags, word-starts, search matches) in which case
you needed that indirect array anyhow.
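
One way to picture the indirect table is as an array holding the byte offset at which each character starts. The sketch below is only an illustration: utf8_build_index is a hypothetical name, and it assumes the buffer has already been checked to be well-formed UTF-8.

```c
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical helper: build a table of byte offsets, one per character,
 * for buf[0..len). Assumes buf is already valid UTF-8.
 * Returns a malloc'd array (caller frees) and stores its length in *count,
 * or returns NULL if allocation fails. */
static size_t *utf8_build_index(const unsigned char *buf, size_t len,
                                size_t *count)
{
    size_t i, n = 0;
    size_t *index;

    /* First pass: every byte that is not a 10xxxxxx continuation byte
     * starts a new character. */
    for (i = 0; i < len; i++)
        if ((buf[i] & 0xC0) != 0x80)
            n++;

    /* Allocate at least one slot so an empty string still yields non-NULL. */
    index = malloc((n ? n : 1) * sizeof *index);
    if (index == NULL)
        return NULL;

    /* Second pass: record where each character begins. */
    n = 0;
    for (i = 0; i < len; i++)
        if ((buf[i] & 0xC0) != 0x80)
            index[n++] = i;

    *count = n;
    return index;
}
```

With the table built, character k starts at buf + index[k], so per-character addressing is a single array lookup. An application that only points at tags, word starts, or search matches can store those byte offsets directly and skip the full table.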

Conclusion: somewhat to my surprise, I find that for a lot
of C tasks, you can keep your text in UTF-8 and work with
it that way very efficiently.

Elliotte is right about the irritating fact that a Java
"char" isn't an XML character.  The nasty fact is that
I suspect many Java application programmers will end up
simply blowing off non-BMP text either through ignorance
or based on a decision that it's not cost-effective.  -Tim



Copyright 2001 XML.org. This site is hosted by OASIS