xml-dev - Re: [xml-dev] ANN: Gorille 0.3

To: "Rick Jelliffe" <ricko@allette.com.au>,"John Cowan" <cowan@mercury.ccil.org>
Subject: Re: [xml-dev] ANN: Gorille 0.3
From: "Rob Lugt" <roblugt@elcel.com>
Date: Fri, 11 Jan 2002 11:55:16 -0000
Cc: <xml-dev@lists.xml.org>
References: <E16OzUl-0007ro-00@mercury.ccil.org>

John Cowan 'scripted it'
> > At least in UTF-8 you can just count bytes <0x80 to count characters.
> Make that 0xC0.

No, it's easy, but not quite that easy.

Unicode code points up to U+007F are represented as single bytes with the
same value, so counting bytes <0x80 gives you the number of US-ASCII
characters.  Characters above U+007F are represented with multiple bytes:
the first is >= 0xC0, and the trailing bytes are all in the range 0x80 to
0xBF.

So to count characters, one way is to count all bytes <= 0x7F or >= 0xC0,
that is, every byte that is not a trailing byte.
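
As a rough illustration of that counting rule, here is a minimal Python
sketch. The function name count_utf8_chars is made up for this example, and
it assumes the input is already well-formed UTF-8; it does no validation, so
malformed input will give a meaningless count.

```python
def count_utf8_chars(data: bytes) -> int:
    """Count code points in well-formed UTF-8 (hypothetical helper).

    Every character contributes exactly one byte that is either
    ASCII (<= 0x7F) or a lead byte (>= 0xC0); trailing bytes
    (0x80..0xBF) are skipped.
    """
    return sum(1 for b in data if b <= 0x7F or b >= 0xC0)


# "na\u00efve \u20ac" is 7 characters: five ASCII letters, a space,
# i-diaeresis (2 bytes in UTF-8) and the euro sign (3 bytes in UTF-8).
sample = "na\u00efve \u20ac".encode("utf-8")
print(len(sample))               # 10 bytes
print(count_utf8_chars(sample))  # 7 characters
```

Note that this counts Unicode code points, not what a user would perceive
as characters: a base letter followed by a combining mark counts as two.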

Regards
~Rob

--
Rob Lugt
ElCel Technology
http://www.elcel.com
