OASIS Mailing List Archives

   Re: [xml-dev] Internal entities removed from XML?


From: "Rich Salz" <rsalz@datapower.com>
 
> > Well, assuming SAX-style parsing that is: just deliver entity expansions
> > as a separate characters() callback ... no copies or writes needed at
> > all.
> 
> The intent was to show in-place expansion can be way efficient.

Here is a version of Rich's C code that is exactly as fast if there
are no entity references, and no less space-efficient if there are entity
references. If we find a non-built-in reference, we replace the
& delimiter with the Unicode Object Replacement Character (U+FFFC).

Afterwards, "&" in the text is just a regular character and U+FFFC acts as
the "entity reference open" delimiter.

Entity expansion would happen lazily, by dereferencing the name
when it is needed: no tree structures are actually built. We defer
merging buffers until later: if "later" is a stream, then we never incur
the space cost of merging buffers or building trees. (If you are not using
wchar_t but, say, UTF-8, then you would substitute 0x1A or some other
appropriate unused control character, such as a flow-control character.)
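
To make the lazy step concrete, here is a minimal sketch of how a consumer might walk a buffer that has already been flagged with U+FFFC and deliver it as a series of runs, in the spirit of the separate characters() callbacks suggested earlier in the thread. Everything named here (stream_text, lookup_entity, the entities table, chars_fn and the demonstration sink) is hypothetical and only for illustration; a real parser would fill the table from the DTD.

```c
#include <assert.h>
#include <stddef.h>
#include <wchar.h>

#define ENTITY_OPEN ((wchar_t)0xFFFC)  /* marker left by the scan pass */

/* Hypothetical entity table, for illustration only. */
struct entity { const wchar_t *name; const wchar_t *value; };
static const struct entity entities[] = {
    { L"copy", L"(c)" },
    { L"co",   L"Example Corp" },
};

/* Look up a name of the given length; returns NULL if undeclared. */
static const wchar_t *lookup_entity(const wchar_t *name, size_t len)
{
    size_t i;
    for (i = 0; i < sizeof entities / sizeof entities[0]; i++)
        if (wcslen(entities[i].name) == len
            && wcsncmp(entities[i].name, name, len) == 0)
            return entities[i].value;
    return NULL;
}

/* Callback type standing in for a SAX-style characters() event. */
typedef void (*chars_fn)(const wchar_t *s, size_t n, void *ctx);

/* Demonstration sink: appends each run to a fixed buffer so the result
   can be inspected. A real consumer would write to its output stream. */
struct sink { wchar_t buf[64]; size_t len; };

static void append_chars(const wchar_t *s, size_t n, void *ctx)
{
    struct sink *out = ctx;
    assert(out->len + n < sizeof out->buf / sizeof out->buf[0]);
    wmemcpy(out->buf + out->len, s, n);
    out->len += n;
    out->buf[out->len] = L'\0';
}

/* Walk a flagged buffer and deliver it as a series of runs.
   Replacement text is passed straight from the table and is never
   copied into the source buffer. Returns 0 on success, -1 on an
   unterminated or undeclared reference. */
int stream_text(const wchar_t *text, size_t len, chars_fn emit, void *ctx)
{
    size_t run = 0, i = 0;
    while (i < len) {
        if (text[i] != ENTITY_OPEN) { i++; continue; }
        if (i > run) emit(text + run, i - run, ctx);   /* literal run */
        size_t name = ++i;                             /* start of name */
        while (i < len && text[i] != L';') i++;
        if (i == len) return -1;                       /* no closing ';' */
        const wchar_t *value = lookup_entity(text + name, i - name);
        if (value == NULL) return -1;                  /* undeclared */
        emit(value, wcslen(value), ctx);
        run = ++i;                                     /* skip the ';' */
    }
    if (len > run) emit(text + run, len - run, ctx);   /* trailing run */
    return 0;
}
```

The source buffer is only read, so the cost of a reference is one table lookup and one extra callback. Nested references inside replacement text are not handled in this sketch.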

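#include <stddef.h>   /* wchar_t */

/* Expands &lt; and &amp; in place and flags every other '&' with U+FFFC.
   size is the number of characters in buff; returns the new length. */
int expand_entities_in_text_node(wchar_t *buff, int size)
{
     wchar_t *start, *src;
     for (start = src = buff; --size >= 0; )
     {
         if ((*buff++ = *src++) == L'&')
         {
             /* size now counts the characters remaining after the '&' */
             if (size >= 3
             && src[0] == L'l' && src[1] == L't' && src[2] == L';')
                 buff[-1] = L'<', src += 3, size -= 3;
             else if (size >= 4
                  && src[0] == L'a' && src[1] == L'm' && src[2] == L'p'
                  && src[3] == L';')
                 src += 4, size -= 4;
             else buff[-1] = 0xFFFC;  /* flag this as an entity reference */
         }
     }
     return (int)(buff - start);
}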
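
For the UTF-8 case mentioned above, a byte-oriented variant might look like the sketch below. The name expand_entities_utf8 and the ENTITY_OPEN_BYTE constant are made up for this example. It relies on two facts: the byte 0x26 ('&') never occurs inside a UTF-8 multi-byte sequence, and 0x1A is not a legal character in an XML 1.0 document, so the marker cannot collide with real content.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define ENTITY_OPEN_BYTE 0x1A   /* SUB: not a legal character in XML 1.0 */

/* Byte-oriented variant for UTF-8 buffers. Expands &lt; and &amp; in
   place, flags any other '&' with 0x1A, and returns the new length. */
size_t expand_entities_utf8(char *buf, size_t size)
{
    char *dst = buf;
    const char *src = buf;
    const char *end = buf + size;

    assert(buf != NULL || size == 0);

    while (src < end) {
        char c = *src++;
        if (c == '&') {
            size_t left = (size_t)(end - src);   /* bytes after the '&' */
            if (left >= 3 && memcmp(src, "lt;", 3) == 0) {
                c = '<';
                src += 3;
            } else if (left >= 4 && memcmp(src, "amp;", 4) == 0) {
                src += 4;                        /* c is already '&' */
            } else {
                c = ENTITY_OPEN_BYTE;            /* flag for lazy lookup */
            }
        }
        *dst++ = c;   /* dst never overtakes src, so in-place is safe */
    }
    return (size_t)(dst - buf);
}
```

As with the wide-character version, a buffer with no references is copied onto itself one byte at a time and nothing is allocated.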

(As Tim mentioned, real code would also need to cope with the
other built-in references and with numeric character references, and there
is no error handling either.)
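
As a rough idea of what that extra handling could involve, here is a sketch of a helper that recognises the five predefined references and both forms of numeric character reference. The name decode_reference and its interface are assumptions made for this example, not part of the code above. It reports the code point as an unsigned long so that it does not depend on the width of wchar_t, and it does not check that the result is a legal XML character (for example, &#0; is accepted here).

```c
#include <assert.h>
#include <stddef.h>
#include <wchar.h>

/* Try to decode the reference whose text starts at src (just past '&').
   avail is the number of characters available at src.
   On success stores the code point in *cp and returns how many
   characters were consumed, including the ';'. Returns 0 if this is
   not a predefined or well-formed numeric reference. */
size_t decode_reference(const wchar_t *src, size_t avail, unsigned long *cp)
{
    static const struct { const wchar_t *name; unsigned long cp; } predefined[] = {
        { L"lt;", '<' }, { L"gt;", '>' }, { L"amp;", '&' },
        { L"apos;", '\'' }, { L"quot;", '"' },
    };
    size_t i, n;

    assert(src != NULL && cp != NULL);

    for (i = 0; i < sizeof predefined / sizeof predefined[0]; i++) {
        n = wcslen(predefined[i].name);
        if (avail >= n && wcsncmp(src, predefined[i].name, n) == 0) {
            *cp = predefined[i].cp;
            return n;
        }
    }

    if (avail >= 3 && src[0] == L'#') {
        unsigned base = 10;
        unsigned long value = 0;
        i = 1;
        if (src[i] == L'x') { base = 16; i++; }  /* XML allows only lowercase x */
        size_t first_digit = i;
        for (; i < avail && src[i] != L';'; i++) {
            unsigned digit;
            wchar_t c = src[i];
            if (c >= L'0' && c <= L'9') digit = (unsigned)(c - L'0');
            else if (base == 16 && c >= L'a' && c <= L'f') digit = (unsigned)(c - L'a') + 10;
            else if (base == 16 && c >= L'A' && c <= L'F') digit = (unsigned)(c - L'A') + 10;
            else return 0;                       /* not a digit in this base */
            value = value * base + digit;
            if (value > 0x10FFFF) return 0;      /* beyond the Unicode range */
        }
        if (i == avail || i == first_digit) return 0;  /* no ';' or no digits */
        *cp = value;
        return i + 1;
    }
    return 0;
}
```

A return value of 0 is where the scan loop would fall back to flagging the '&' with U+FFFC, or report an error if the text began with "#".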


Cheers
Rick Jelliffe




 


Copyright 2001 XML.org. This site is hosted by OASIS