From: "Rich Salz" <email@example.com>
> > Well, assuming SAX-style parsing that is: just deliver entity expansions
> > as a separate characters() callback ... no copies or writes needed at
> > all.
> The intent was to show in-place expansion can be way efficient.
Here is a version of Rich's C code that is exactly as fast if there
are no entity references, and no less space-efficient if there are entity
references. If we find a non-built-in reference, we replace the
& delimiter with the Unicode Object Replacement Character (U+FFFC).
Afterwards, "&" in text is just a regular character and U+FFFC means
the delimiter "entity reference open".
Entity expansion would happen lazily, by dereferencing the name
when it is needed: no tree structures are actually built. We defer
merging buffers until later: if "later" is a stream, then we never incur
the space cost of merging buffers or building trees. (If you are not using
wchar_t but, say, UTF-8, then you would use 0x1A or some other
appropriate unused control character, such as a flow-control character.)
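#include <stddef.h>  /* wchar_t */

/* Expands &lt; and &amp; in place and marks every other reference
   with U+FFFC. Returns the new length of the text in characters. */
int expand_entities_in_text_node(wchar_t* buff, int size)
{
    wchar_t *start, *src;
    for (start = src = buff; --size >= 0; )
        if ((*buff++ = *src++) == '&') {
            if (size >= 3
                && src[0] == 'l' && src[1] == 't' && src[2] == ';')
                buff[-1] = '<', src += 3, size -= 3;
            else if (size >= 4
                && src[0] == 'a' && src[1] == 'm' && src[2] == 'p'
                && src[3] == ';')
                src += 4, size -= 4;  /* the copied '&' is already correct */
            else buff[-1] = 0xFFFC; /* flag this as an entity reference */
        }
    return (int)(buff - start);
}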
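To make the lazy step concrete, here is a minimal sketch of a consumer that walks an already-marked buffer and hands out literal runs and entity replacement text as separate SAX-style characters() calls, so nothing is merged or copied into a new text buffer. The names deliver_text_node, characters_fn, lookup_entity, struct sink and collect, and the "corp" entity, are all made up for illustration; a real parser would look names up in its own entity declarations and would report errors.

```c
#include <assert.h>
#include <stddef.h>
#include <wchar.h>

#define ENTITY_OPEN 0xFFFC  /* marker written in place of '&' */

/* SAX-style text callback (hypothetical signature). */
typedef void (*characters_fn)(const wchar_t *s, size_t n, void *ctx);

/* Hypothetical lookup: returns the replacement text for an entity name,
   or NULL if the name is not declared. Only "corp" is known here. */
static const wchar_t *lookup_entity(const wchar_t *name, size_t n)
{
    if (n == 4 && wcsncmp(name, L"corp", 4) == 0)
        return L"Example Corp";
    return NULL;
}

/* Delivers a marked text node as a sequence of characters() calls. */
void deliver_text_node(const wchar_t *text, size_t len,
                       characters_fn characters, void *ctx)
{
    size_t run = 0;      /* start of the pending literal run */
    size_t i = 0, end;
    const wchar_t *value;

    assert(text != NULL && characters != NULL);
    while (i < len) {
        if (text[i] != ENTITY_OPEN) { i++; continue; }
        for (end = i + 1; end < len && text[end] != L';'; end++)
            ;
        if (end == len)
            break;       /* unterminated reference: passed through as-is */
        if (i > run)
            characters(text + run, i - run, ctx);
        value = lookup_entity(text + i + 1, end - i - 1);
        if (value != NULL)
            characters(value, wcslen(value), ctx);
        /* An undeclared name is silently dropped; real code would report it. */
        i = run = end + 1;
    }
    if (len > run)
        characters(text + run, len - run, ctx);
}

/* Example handler: appends each chunk to a fixed buffer so the result
   can be inspected. A streaming handler would write the chunk out instead. */
struct sink { wchar_t buf[128]; size_t len; };

void collect(const wchar_t *s, size_t n, void *ctx)
{
    struct sink *k = ctx;
    if (k->len + n >= sizeof k->buf / sizeof k->buf[0])
        return;          /* drop chunks that would overflow the demo buffer */
    wmemcpy(k->buf + k->len, s, n);
    k->len += n;
    k->buf[k->len] = L'\0';
}
```

The source buffer is only read, never rewritten, which is where the space saving comes from: the replacement text stays wherever the entity table keeps it.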
(As Tim mentioned, for real code we would also need to cope with the
other built-in references and numeric character references, and there
is no error handling either.)
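As a rough sketch of that missing piece, the helper below decodes the five references predefined by XML 1.0 plus decimal and hexadecimal character references. The name decode_reference and its calling convention are assumptions for illustration, not part of Rich's code. It also assumes wchar_t is wide enough to hold the decoded code point, and it does not check the value against XML's rules for legal characters.

```c
#include <assert.h>
#include <stddef.h>
#include <wchar.h>

/* The five references predefined by XML 1.0, written without the '&'. */
static const struct { const wchar_t *name; wchar_t value; } builtins[] = {
    { L"lt;", L'<' }, { L"gt;", L'>' }, { L"amp;", L'&' },
    { L"quot;", L'"' }, { L"apos;", L'\'' },
};

/* Tries to decode the reference whose text follows an '&'.
   src points just past the '&'; avail is the number of characters left.
   On success stores the character in *out and returns how many characters
   of src were consumed, including the ';'. Returns 0 if the text is
   neither a built-in nor a well-formed numeric reference. */
size_t decode_reference(const wchar_t *src, size_t avail, wchar_t *out)
{
    size_t i, n, first;
    unsigned long code = 0;
    int hex, d;

    assert(src != NULL && out != NULL);
    for (i = 0; i < sizeof builtins / sizeof builtins[0]; i++) {
        n = wcslen(builtins[i].name);
        if (avail >= n && wcsncmp(src, builtins[i].name, n) == 0) {
            *out = builtins[i].value;
            return n;
        }
    }
    if (avail < 3 || src[0] != L'#')
        return 0;
    hex = (src[1] == L'x');
    first = n = hex ? 2 : 1;          /* index of the first digit */
    for (; n < avail && src[n] != L';'; n++) {
        wchar_t c = src[n];
        if (c >= L'0' && c <= L'9') d = (int)(c - L'0');
        else if (hex && c >= L'a' && c <= L'f') d = (int)(c - L'a') + 10;
        else if (hex && c >= L'A' && c <= L'F') d = (int)(c - L'A') + 10;
        else return 0;                /* not a digit in this base */
        code = code * (hex ? 16 : 10) + (unsigned long)d;
        if (code > 0x10FFFF)
            return 0;                 /* beyond the Unicode range */
    }
    if (n == first || n == avail)
        return 0;                     /* no digits, or no closing ';' */
    *out = (wchar_t)code;             /* assumes wchar_t can hold the value */
    return n + 1;
}
```

In the loop above, a nonzero return would replace the two hand-written comparisons: store the decoded character in buff[-1], advance src by the returned count and reduce size by the same amount. A zero return falls through to the U+FFFC marker as before.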