Elliotte Rusty Harold wrote:
> It could be worse, though. You could be using C, and trying to decode
> UTF-8. :-)
?? It's about 10 lines of code, and has been written lots of
times now. Last time I needed it I couldn't find one with the
exact buffer interface I needed so I coded it up from scratch
sometime in the course of an afternoon and it worked first time.
The spec is hardly unclear. And it's a set of shift/mask
operations that are processor-friendly. You need to use a
loop iterator rather than a for (i = 0; string[i]; i++) idiom.
UTF-8 only really causes extra work when you want per-character
addressing into big strings, because then you need an indirect
table - the most common case I can think of is maintaining
on-screen render state.
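
For illustration, here is a minimal sketch of the kind of shift/mask decoder and iterator-style loop described above. It is not the code referred to in this message; the names utf8_iter, utf8_begin, utf8_next, UTF8_END and UTF8_INVALID are hypothetical, and the buffer interface (pointer plus length) is an assumption.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical iterator over a UTF-8 buffer (illustrative names only). */
typedef struct {
    const unsigned char *p;   /* next byte to read */
    const unsigned char *end; /* one past the last byte */
} utf8_iter;

#define UTF8_END     (-1L)    /* no more input */
#define UTF8_INVALID (-2L)    /* malformed sequence */

static utf8_iter utf8_begin(const void *buf, size_t len) {
    utf8_iter it;
    assert(buf != NULL);
    it.p = (const unsigned char *)buf;
    it.end = it.p + len;
    return it;
}

/* Returns the next code point, UTF8_END, or UTF8_INVALID. */
static long utf8_next(utf8_iter *it) {
    /* Smallest code point that legitimately needs 1..4 bytes. */
    static const long min_for_len[] = {0, 0, 0x80, 0x800, 0x10000};
    unsigned char b;
    long cp;
    int extra, i;

    if (it->p >= it->end) return UTF8_END;
    b = *it->p++;
    if (b < 0x80) return b;                                 /* 0xxxxxxx */
    else if ((b & 0xE0) == 0xC0) { cp = b & 0x1F; extra = 1; } /* 110xxxxx */
    else if ((b & 0xF0) == 0xE0) { cp = b & 0x0F; extra = 2; } /* 1110xxxx */
    else if ((b & 0xF8) == 0xF0) { cp = b & 0x07; extra = 3; } /* 11110xxx */
    else return UTF8_INVALID;      /* stray continuation or illegal lead */

    for (i = 0; i < extra; i++) {
        /* Each continuation byte must look like 10xxxxxx. */
        if (it->p >= it->end || (*it->p & 0xC0) != 0x80) return UTF8_INVALID;
        cp = (cp << 6) | (*it->p++ & 0x3F);
    }
    if (cp < min_for_len[extra + 1]) return UTF8_INVALID;   /* overlong form */
    if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)) return UTF8_INVALID;
    return cp;
}
```

A caller would write something like `for (it = utf8_begin(s, n); (c = utf8_next(&it)) != UTF8_END; )` in place of the byte-indexed for loop. After UTF8_INVALID the iterator has already moved past the offending lead byte, so the loop can simply continue.
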
But in most apps it's more common to point into text at a
few places (tags, word-starts, search matches) in which case
you needed that indirect array anyhow.
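
As a rough sketch of such an indirect array, the function below records the byte offset at which each character starts, so that character number k can be found at buf + offsets[k]. The name utf8_index and its signature are hypothetical, and the sketch assumes the buffer already holds valid UTF-8.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical helper: stores the byte offset of each character start in
 * offsets[] (up to cap entries) and returns the total number of characters.
 * Any byte that is not a continuation byte (10xxxxxx) starts a character.
 * Passing offsets == NULL just counts characters.
 */
static size_t utf8_index(const unsigned char *buf, size_t len,
                         size_t *offsets, size_t cap) {
    size_t i, n = 0;
    assert(buf != NULL);
    for (i = 0; i < len; i++) {
        if ((buf[i] & 0xC0) != 0x80) {
            if (offsets != NULL && n < cap) offsets[n] = i;
            n++;
        }
    }
    return n;
}
```

The same table could just as well hold only the offsets an application cares about (tags, word starts, search matches) rather than every character.
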
Conclusion: somewhat to my surprise, I find that for a lot
of C tasks, you can keep your text in UTF-8 and work with
it that way very efficiently.
Elliotte is right about the irritating fact that a Java
"char" isn't an XML character. The nasty fact is that
I suspect many Java application programmers will end up
simply blowing off non-BMP text either through ignorance
or based on a decision that it's not cost-effective.
 -Tim