xml-dev - RE: Character classification

From: Istvan Cseri <istvanc@microsoft.com>
To: xml-dev@ic.ac.uk, 'Tim Bray' <tbray@textuality.com>
Date: Wed, 3 Sep 1997 15:24:48 -0700

For better speed I would suggest an alternative solution: use a quick
array lookup for characters below 256 and fall back to the more expensive
binary-search method only for characters at or above 256. It will do
wonders for your parser.

Istvan
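
A minimal sketch of the fast-path idea in the reply above, assuming the
slow path is a binary search over sorted character ranges as the quoted
message describes. The class name FastCharClasses, the method
isNameStartSlow, and the range table are hypothetical; the table is a
deliberately truncated, illustrative subset, not the full set of
name-start ranges from the XML spec.

```java
final class FastCharClasses {
    // Illustrative subset only: inclusive [LO, HI] ranges, sorted by LO.
    // A real table would be generated from the XML spec's character classes.
    private static final char[] NAME_START_LO = { ':', 'A', '_', 'a', 0x00C0, 0x00D8, 0x00F8 };
    private static final char[] NAME_START_HI = { ':', 'Z', '_', 'z', 0x00D6, 0x00F6, 0x0131 };

    // Fast path: one precomputed answer per character below 256.
    private static final boolean[] NAME_START_FAST = new boolean[256];

    static {
        // Fill the small table once by asking the slow method.
        for (int i = 0; i < 256; i++) {
            NAME_START_FAST[i] = isNameStartSlow((char) i);
        }
    }

    // Slow path: binary search over the sorted range table.
    static boolean isNameStartSlow(char c) {
        int lo = 0;
        int hi = NAME_START_LO.length - 1;
        while (lo <= hi) {
            int mid = (lo + hi) >>> 1;
            if (c < NAME_START_LO[mid]) {
                hi = mid - 1;
            } else if (c > NAME_START_HI[mid]) {
                lo = mid + 1;
            } else {
                return true; // c falls inside range number mid
            }
        }
        return false;
    }

    // Array lookup below 256, binary search for everything else.
    static boolean isNameStart(char c) {
        return c < 256 ? NAME_START_FAST[c] : isNameStartSlow(c);
    }
}
```

The 256-entry table costs little memory and answers most characters in
typical Latin-1 markup with a single array index; only the remaining
characters pay for the search.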

> ----------
> From: 	Tim Bray[SMTP:tbray@textuality.com]
> Reply To: 	Tim Bray
> Sent: 	Wednesday, September 03, 1997 12:51 PM
> To: 	xml-dev@ic.ac.uk
> Subject: 	Character classification
> 
> <<File: CharClasses.java.txt>>
> I've been working on making Lark really do Unicode.  JDK 1.1 is supposed
> to have, unlike 1.0, a usable input method; thus the problem is to check,
> when you're reading a GI or Attribute name, whether the characters are
> legal namestart/name characters.
> 
> It turns out to be quite a lot of work, so this is an offer to share.
> I wrote a program (based on Lark) that pulls the relevant character
> classes out of the XML spec, picks apart the markup, and writes another
> Java class that has some static arrays and offers two methods:
> 
> package textuality.lark;
> public class CharClasses
> {
>  public static boolean isNameC(char c)
>  public static boolean isNameStart(char c)
> }
> 
> It needs about 4k of tables (which it binary-searches); it might be faster
> with 128k of byte-addressable tables or 16K of bitmaps, neither of which
> would be hard to implement.
> 
> (a) is this a waste of time, i.e. are there Unicode library calls that
>     do it?
> (b) if not, has everyone else already done this?
> (c) if not, if I'm going to publish this, is the API above OK?
> 
> I've attached the current Java source file for those who find the 
> explanation above insufficiently clear.
> 
> Cheers, Tim Bray
> tbray@textuality.com http://www.textuality.com/ +1-604-708-9592
> 
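
A rough sketch of the bitmap option mentioned in the quoted message. One
plausible reading of the figures: a 16-bit char has 65536 values, so one
bit per character is 8192 bytes per class, and two classes (name start
and name character) come to 16K; one byte per character would be 64K per
class, or 128K for both. The class CharBitmap and its methods are
hypothetical names, not part of the posted CharClasses source.

```java
final class CharBitmap {
    // One bit per 16-bit character value: 65536 / 8 = 8192 bytes.
    private final byte[] bits = new byte[65536 / 8];

    // Mark every character in the inclusive range [lo, hi] as a member.
    void setRange(int lo, int hi) {
        for (int c = lo; c <= hi; c++) {
            bits[c >>> 3] |= (byte) (1 << (c & 7));
        }
    }

    // Constant-time membership test: pick the byte, then test the bit.
    boolean contains(char c) {
        return (bits[c >>> 3] & (1 << (c & 7))) != 0;
    }

    int sizeInBytes() {
        return bits.length;
    }
}
```

A caller would build one CharBitmap per character class from the same
ranges used for the binary search, then test membership with contains(c).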

xml-dev: A list for W3C XML Developers
Archived as: http://www.lists.ic.ac.uk/hypermail/xml-dev/
To unsubscribe, send to majordomo@ic.ac.uk the following message;
unsubscribe xml-dev
List coordinator, Henry Rzepa (rzepa@ic.ac.uk)