- From: nisse@lysator.liu.se (Niels Möller)
- To: Tim Bray <tbray@textuality.com>
- Date: 19 Nov 1999 10:52:57 +0100
Tim Bray <tbray@textuality.com> writes:
> At 09:57 AM 11/18/99 -0800, David Brownell wrote:
> >The technique I used in Sun's parser may be good for many folk to steal.
> >It involves using the standard Character.getType() method (which has
> >access to lots of Unicode tables, and in recent JVMs uses native code
> >to quickly access them) and then filtering that output by the rules in
> >the XML spec.
>
> Fine, but the Lark technique *doesn't* require storing any Unicode
> tables and thus uses an order of magnitude (probably) less memory; or
> am I missing something? -T.
Unicode tables are not *that* huge. I wrote some C code a few months
ago that associates 32 bits of character class information with each
Unicode character (of which 22 are used). That covers most
properties in the Unicode standard. I use a two-level lookup table:
the upper eight bits of the 16-bit character code index a primary
table (256 bytes or 256 pointers), and each entry in that table points
to one of 41 distinct subtables holding the character class
information for a block of 256 characters. So the tables add up to
slightly less than 42K. If you can make do with a subset of the
Unicode properties, say 4 bits, this shrinks to about 6K, probably
even less if the number of distinct subtables decreases as well.
Compared to the binary search implementation that you estimate at
between 3.5K and 35K, I don't think the tables are excessive.
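The size estimates above can be checked with a little arithmetic. The sketch below is only an illustration: the function name and parameters are hypothetical, and it assumes a primary table of one-byte indices and subtables packed with no padding.

```c
#include <assert.h>
#include <stddef.h>

/* Bytes used by a two-level table: one primary table with 256 entries,
   plus `subtables` distinct blocks of 256 characters each.
   bits_per_char:       property bits stored per character
   primary_entry_bytes: 1 for byte indices, sizeof(void *) for pointers
   (Hypothetical helper, not part of the code described in the post.) */
size_t two_level_table_bytes(size_t subtables, size_t bits_per_char,
                             size_t primary_entry_bytes)
{
    /* A 256-entry primary table cannot address more than 256 blocks. */
    assert(subtables <= 256);

    size_t primary = 256 * primary_entry_bytes;
    size_t secondary = subtables * 256 * bits_per_char / 8;
    return primary + secondary;
}
```

With 41 subtables this gives 42240 bytes (about 41.3K) for 32 bits per character and 5504 bytes (about 5.4K) for 4 bits per character, in line with the rough figures quoted above.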
I don't count the code size, as the lookup function is trivial:
    int has_property(int mask, unicode c)
    { return secondary[primary[c/256]][c%256] & mask; }
I think this is a standard way to implement Unicode character
properties. There might be cleverer schemes that use less memory
for equally fast lookups; I chose this one because it was easy to
generate the needed tables automatically.
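For anyone who wants to try the scheme, here is a minimal sketch of how such tables might be generated from a flat array of per-character property words, storing each distinct 256-character block only once. It is not the code described above: the `unicode` typedef (assumed to be 16 bits), `build_tables`, `num_subtables`, and the table layout are all assumptions made for illustration, and `has_property` here takes an unsigned mask and normalises its result to 0 or 1.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define BLOCK_SIZE 256   /* characters per subtable */
#define NUM_BLOCKS 256   /* 65536 / 256, assuming 16-bit character codes */

typedef uint16_t unicode;  /* assumed; the post does not define this type */

static uint8_t  primary[NUM_BLOCKS];               /* block -> subtable index */
static uint32_t secondary[NUM_BLOCKS][BLOCK_SIZE]; /* sized for the worst case */
static int      num_subtables;                     /* distinct blocks found */

/* Build both tables from a flat array of 65536 property words,
   sharing one subtable among all blocks with identical contents. */
void build_tables(const uint32_t *flat)
{
    assert(flat != NULL);
    num_subtables = 0;
    for (int b = 0; b < NUM_BLOCKS; b++) {
        const uint32_t *block = flat + (size_t)b * BLOCK_SIZE;
        int s;
        /* Linear search for an identical subtable stored earlier. */
        for (s = 0; s < num_subtables; s++)
            if (memcmp(secondary[s], block, sizeof secondary[s]) == 0)
                break;
        if (s == num_subtables) {          /* not seen before: store it */
            memcpy(secondary[s], block, sizeof secondary[s]);
            num_subtables++;
        }
        primary[b] = (uint8_t)s;           /* s is at most 255 */
    }
}

/* Two-level lookup: high byte selects the subtable, low byte the entry. */
int has_property(uint32_t mask, unicode c)
{
    return (secondary[primary[c / 256]][c % 256] & mask) != 0;
}
```

A real generator would write out only the first `num_subtables` subtables as static data; the full-size `secondary` array here just keeps the sketch short.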
Regards,
/Niels
xml-dev: A list for W3C XML Developers. To post, mailto:xml-dev@ic.ac.uk
Archived as: http://www.lists.ic.ac.uk/hypermail/xml-dev/ and on CD-ROM/ISBN 981-02-3594-1