[
Lists Home |
Date Index |
Thread Index
]
Tim Bray wrote,
> Rich Salz wrote:
> >> Only benchmarking/profiling can tell
> >
> > It's hard to beat this:
>
> Not that hard I'd say given that an XML processor has to recognize
> 'quot' and 'apos' and 'gt'.
[snip: linear time algorithm]
Ooh ... I think we can do a little better than that ;-)
How about a hand-coded multi-pattern backward DAWG automaton,
http://citeseer.nj.nec.com/246061.html
The trick is similar to Boyer-Moore (search backwards from the end of
the pattern, skip forward as far as possible on a mismatch) but
modified to handle multiple patterns. It should use roughly 1/4 of the
memory accesses and comparisons of a linear search.
Nb. as is traditional, this might not even compile, never mind execute
correctly.
char *p = src+3;
char *limit = p+size;
char c;
while(p < limit)
{
switch(*p) {
case ';':
if(*(p-1) == 't') {
char c = *(p-2);
if(c == 'g') {
if(*(p-3) == '&')
// found ">"
}
else if(c == 'l') {
if(*(p-3) == '&')
// found "<"
}
}
p += 4;
break;
case 'o':
if(p+2 >= limit)
goto done;
if(*(p+2) == ';') {
char c = *(p+1);
if(c == 's') {
if(*(p-1) == 'p' && *(p-2) == 'a' && *(p-3) == '&')
// found "'"
}
else if(c == 't') {
if(*(p-1) == 'u' && *(p-2) == 'q' && *(p-3) == '&')
// found """
}
}
p += 6;
break;
case 'p':
if(p+1 >= limit)
goto done;
if(*(p+1) != ';') {
// watch out for collision with "'"
base += 2;
break;
}
if(*(p-1) == 'm' && *(p-2) == 'a' && *(p-3) == '&')
// found "&"
base += 5;
break;
default:
base += 4;
}
}
done:
The state transition diagram is a little easier to follow, but there's
no way I'm going to try and do the ascii art here.
Now, what was it Knuth said? ;-)
Cheers,
Miles
|