xml-dev - Re: [xml-dev] ANN: Gorille 0.3

To: Elliotte Rusty Harold <elharo@metalab.unc.edu>
Subject: Re: [xml-dev] ANN: Gorille 0.3
From: Tim Bray <tbray@textuality.com>
Date: Thu, 10 Jan 2002 12:22:32 -0800
Cc: xml-dev@lists.xml.org
References: <4.2.0.58.20020110131719.012c5f00@pop3.east.ora.com> <p04330108b8639049050c@[192.168.254.4]>
User-agent: Mozilla/5.0 (Windows; U; Windows NT 5.0; en-US; rv:0.9.7) Gecko/20011221

Elliotte Rusty Harold wrote:

> It could be worse, though. You could be using C, and trying to decode 
> UTF-8. :-)

?? It's about 10 lines of code, and has been written lots of
times now.  Last time I needed it I couldn't find one with the
exact buffer interface I needed so I coded it up from scratch
sometime in the course of an afternoon and it worked first time.
The spec is hardly unclear.  And it's a set of shift/mask
operations that are processor-friendly.  You need to use a
loop iterator rather than a for (i = 0; string[i]; i++) idiom,
big deal.
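
For anyone who wants to see roughly what those shift/mask operations look like, here is a minimal sketch. It is not the decoder described above: the names utf8_decode and utf8_count are made up for illustration, and it assumes the current four-byte limit on UTF-8 (code points up to U+10FFFF, no overlong forms, no surrogates) rather than the older six-byte definition.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: decode one UTF-8 sequence from buf[0..len).
   Stores the code point in *cp and returns the number of bytes consumed,
   or 0 if the input is malformed or truncated. */
size_t utf8_decode(const unsigned char *buf, size_t len, uint32_t *cp)
{
    /* Smallest code point that legitimately needs n bytes (rejects overlongs). */
    static const uint32_t min_cp[5] = {0, 0, 0x80, 0x800, 0x10000};
    size_t n, i;
    uint32_t c;

    if (len == 0) return 0;
    if (buf[0] < 0x80)                { n = 1; c = buf[0]; }
    else if ((buf[0] & 0xE0) == 0xC0) { n = 2; c = buf[0] & 0x1F; }
    else if ((buf[0] & 0xF0) == 0xE0) { n = 3; c = buf[0] & 0x0F; }
    else if ((buf[0] & 0xF8) == 0xF0) { n = 4; c = buf[0] & 0x07; }
    else return 0;                    /* stray continuation or invalid lead byte */

    if (n > len) return 0;            /* sequence runs past the end of the buffer */
    for (i = 1; i < n; i++) {
        if ((buf[i] & 0xC0) != 0x80) return 0;   /* must be 10xxxxxx */
        c = (c << 6) | (buf[i] & 0x3F);          /* shift in six payload bits */
    }
    if (c < min_cp[n] || c > 0x10FFFF || (c >= 0xD800 && c <= 0xDFFF))
        return 0;
    *cp = c;
    return n;
}

/* Hypothetical helper: count characters by advancing a byte position,
   instead of the for (i = 0; string[i]; i++) idiom.
   Returns -1 on malformed input. */
long utf8_count(const unsigned char *buf, size_t len)
{
    size_t pos = 0;
    long count = 0;
    uint32_t cp;

    while (pos < len) {
        size_t n = utf8_decode(buf + pos, len - pos, &cp);
        if (n == 0) return -1;
        pos += n;                     /* step by the sequence length, not by 1 */
        count++;
    }
    return count;
}
```

The caller passes an explicit length, so the sketch does not depend on NUL termination; a different buffer interface would change the signature but not the shift/mask core.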

UTF-8 only really causes extra work when you want per-character
addressing into big strings, because then you need an indirect
table - the most common case I can think of is maintaining
on-screen render state.

But in most apps it's more common to point into text at a
few places (tags, word-starts, search matches) in which case
you needed that indirect array anyhow.
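
One way such an indirect table might look, as a sketch only: an array of byte offsets, one per character start, so that character i is found at s + idx[i]. The name utf8_build_index is hypothetical, and the code assumes the buffer already holds valid UTF-8 (it only looks for bytes that are not 10xxxxxx continuation bytes).

```c
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical helper: build a table of byte offsets, one entry per
   character start in s[0..len). Assumes s is already valid UTF-8.
   Stores the number of characters in *count. The caller frees the result.
   Returns NULL if allocation fails. */
size_t *utf8_build_index(const unsigned char *s, size_t len, size_t *count)
{
    /* Worst case is one character per byte (pure ASCII). */
    size_t *idx = malloc((len ? len : 1) * sizeof *idx);
    size_t i, n = 0;

    if (idx == NULL) return NULL;
    for (i = 0; i < len; i++) {
        if ((s[i] & 0xC0) != 0x80)    /* not a continuation byte: a character starts here */
            idx[n++] = i;
    }
    *count = n;
    return idx;
}
```

The same array shape works when you only track a few positions (tags, word starts, search matches): you store byte offsets for just those points and never need a per-character table at all.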

Conclusion: somewhat to my surprise, I find that for a lot
of C tasks, you can keep your text in UTF-8 and work with
it that way very efficiently.

Elliotte is right about the irritating fact that a Java
"char" isn't an XML character.  The nasty fact is that
I suspect many Java application programmers will end up
simply blowing off non-BMP text either through ignorance
or based on a decision that it's not cost-effective.  -Tim
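
To make the Java "char" point concrete, here is a small sketch in C (to stay in one language) of how a code point maps onto 16-bit units under UTF-16. The name utf16_encode is hypothetical. Anything outside the BMP comes out as two units, a surrogate pair, which is why a single 16-bit value cannot stand for every XML character.

```c
#include <stdint.h>

/* Hypothetical helper: encode one code point as UTF-16.
   Returns the number of 16-bit units written to out (1 or 2),
   or 0 if cp is a surrogate or is above U+10FFFF. */
int utf16_encode(uint32_t cp, uint16_t out[2])
{
    if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
        return 0;
    if (cp < 0x10000) {               /* BMP: fits in one 16-bit unit */
        out[0] = (uint16_t)cp;
        return 1;
    }
    cp -= 0x10000;                    /* 20 bits remain, split 10 and 10 */
    out[0] = (uint16_t)(0xD800 | (cp >> 10));     /* high surrogate */
    out[1] = (uint16_t)(0xDC00 | (cp & 0x3FF));   /* low surrogate */
    return 2;
}
```

Code that treats each 16-bit unit as a character will see such a pair as two separate "characters", neither of which is valid on its own.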

Follow-Ups:
- Re: [xml-dev] ANN: Gorille 0.3
  - From: Uche Ogbuji <uche.ogbuji@fourthought.com>
- Re: [xml-dev] ANN: Gorille 0.3
  - From: Richard Tobin <richard@cogsci.ed.ac.uk>
- Re: [xml-dev] ANN: Gorille 0.3
  - From: Ronald Bourret <rpbourret@rpbourret.com>
- Re: [xml-dev] ANN: Gorille 0.3
  - From: "Jonathan Borden" <jborden@mediaone.net>
- Re: [xml-dev] ANN: Gorille 0.3
  - From: John Cowan <jcowan@reutershealth.com>

References:
- ANN: Gorille 0.3
  - From: "Simon St.Laurent" <simonstl@simonstl.com>
- Re: [xml-dev] ANN: Gorille 0.3
  - From: Elliotte Rusty Harold <elharo@metalab.unc.edu>
