OASIS Mailing List ArchivesView the OASIS mailing list archive below
or browse/search using MarkMail.

 


Help: OASIS Mailing Lists Help | MarkMail Help

 


 

   Re: [xml-dev] Internal entities removed from XML?

[ Lists Home | Date Index | Thread Index ]

Tim Bray wrote,
> Rich Salz wrote:
> >> Only benchmarking/profiling can tell
> >
> > It's hard to beat this:
>
> Not that hard I'd say given that an XML processor has to recognize
> 'quot' and 'apos' and 'gt'.
[snip: linear time algorithm]

Ooh ... I think we can do a little better than that ;-)

How about a hand-coded multi-pattern backward DAWG automaton,

  http://citeseer.nj.nec.com/246061.html

The trick is similar to Boyer-Moore (search backwards from the end of 
the pattern, skip forward as far as possible on a mismatch) but 
modified to handle multiple patterns. It should use roughly 1/4 of the 
memory accesses and comparisons of a linear search.

Nb. as is traditional, this might not even compile, never mind execute 
correctly.

  char *p = src+3;
  char *limit = p+size;
  char c;

  while(p < limit)
  {
    switch(*p) {
      case ';':
        if(*(p-1) == 't') {
          char c = *(p-2);
          if(c == 'g') {
            if(*(p-3) == '&')
              // found "&gt;"
          }
          else if(c == 'l') {
            if(*(p-3) == '&')
              // found "&lt;"
          }
        }

        p += 4;
        break;

      case 'o':
        if(p+2 >= limit)
          goto done;

        if(*(p+2) == ';') {
          char c = *(p+1);
          if(c == 's') {
            if(*(p-1) == 'p' && *(p-2) == 'a' && *(p-3) == '&')
              // found "&apos;"
          }
          else if(c == 't') {
            if(*(p-1) == 'u' && *(p-2) == 'q' && *(p-3) == '&')
              // found "&quot;"
          }
        }

        p += 6;
        break;

      case 'p':
        if(p+1 >= limit)
          goto done;

        if(*(p+1) != ';') {
          // watch out for collision with "&apos;"
          base += 2;
          break;
        }
        
        if(*(p-1) == 'm' && *(p-2) == 'a' && *(p-3) == '&')
          // found "&amp;"
        
        base += 5;
        break;

      default:
        base += 4;
    }
  }

  done:

The state transition diagram is a little easier to follow, but there's 
no way I'm going to try and do the ascii art here.

Now, what was it Knuth said? ;-)

Cheers,


Miles




 

News | XML in Industry | Calendar | XML Registry
Marketplace | Resources | MyXML.org | Sponsors | Privacy Statement

Copyright 2001 XML.org. This site is hosted by OASIS