- To: email@example.com
- Subject: Escaping XML? Flattening the parse tree?
- From: reinhard gantar <firstname.lastname@example.org>
- Date: Wed, 13 Aug 2003 21:42:16 +0200
- User-agent: Mozilla/5.0 (Macintosh; U; PPC Mac OS X Mach-O; en-US; rv:1.3) Gecko/20030312
XML might contain HTML that is not supposed to be processed
as XML. It should be treated "as is", i.e. as a string of
characters. Real-world content management systems based
on XML (import/export, like OpenCMS) store items like title,
introduction, bodytext, etc. and mark such tidbits up in XML.
The text itself (the content) might be adorned by
(x)html-markup like <em>, <b>, <h3> and the like. Despite
the fact that the CMS is interested in the XML-markup like
(e.g.) <introduction>, <bodytext> or <title> to squirrel it away in
some RDBMS, it is NOT interested in crunching the
potentially complex html-tree. The html-
markup travels from the editor of the document into the
database, from there thru some template-mechanism and
ends up "as such" in the browser in front of some eyeballs.
So why parse it?
In other words, in such a scenario a parser should stop parsing
within a certain branch of the tree, ignore any markup (especially
bad markup), and return a string of content instead of a tree.
Does XML offer some mechanism to tell a parser
that being between <introduction>
and </introduction> (for example) means she has reached a
leaf-node, so don't bother crunching "<em>massive</em>"
into ('em', None, ['massive'], None) and give me the string
as is instead?
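To make the complaint concrete, here is a minimal sketch of the behaviour in question. It assumes Python's xml.etree.ElementTree as the parser, which is only a stand-in chosen for illustration; the sample string is made up.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: an <introduction> whose content carries html-markup.
doc = "<introduction>A <em>massive</em> change</introduction>"
intro = ET.fromstring(doc)

# The parser splits the content into leading text, a child element,
# and the text trailing that child, instead of handing back one string.
leading = intro.text
children = [(child.tag, child.text, child.tail) for child in intro]

print(repr(leading))
print(children)
```

What I would like instead is a single string, "A <em>massive</em> change", with no child nodes at all.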
More precisely, is there some way to signal "no more parsing" in the DTD?
This looks like the way to go. Maybe I've missed it in my
tutorials, but I don't know how to do this. Such a DTD-statement
would signal the parser that <title> is as deep as it gets in
a document, xml-wise. Any markup within such a node is not
treated as a stray bullet, but as plain text:
Is there something like "<!LEAFNODE title>"?
The alternatives look disgusting:
0.) Parse and flatten the tree.
Using muscle for doing all the work and then using extra
muscle to undo it does not sound like a good way to go.
Besides that: A simple DTD for simple structures requires
a more or less complete html-DTD -- and XHTML at
that. Every unbalanced <p>, every <p> balanced by
<P>, will make the parser choke.
1.) Obfuscate source
Pre-process the source.
Replace html-markup with placeholders, e.g. <em> becomes
5813151 and </em> becomes 5813152. Post-process after
parsing to get the original html back. Is it sufficient to
say that I don't like that?
2.) Use markup for un-markup
<introduction><title><unmarkup>Why 1984 will not be like
This will drive people nuts who edit (otherwise simple)
xml-documents by hand.
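Alternatives 0.) and 1.) can be sketched roughly as follows. This is only an illustration under assumptions: Python's xml.etree.ElementTree stands in for the parser, the helper names (inner_markup, hide_html, restore_html), the short list of html tags, and the "@@n@@" placeholder format are all made up for the example.

```python
import re
import xml.etree.ElementTree as ET

# Alternative 0: parse everything, then serialize the subtree back to a string.
def inner_markup(element):
    parts = [element.text or ""]
    for child in element:
        # tostring() emits the child element followed by its tail text.
        parts.append(ET.tostring(child, encoding="unicode"))
    return "".join(parts)

good = "<introduction>A <em>massive</em> change</introduction>"
flattened = inner_markup(ET.fromstring(good))

# Sloppy html (<p> closed by <P>) is not well-formed, so the parser rejects it.
bad = "<introduction><p>sloppy</P></introduction>"
try:
    ET.fromstring(bad)
    rejected = False
except ET.ParseError:
    rejected = True

# Alternative 1: hide the html behind placeholders, parse, then restore it.
# Only a few tags are listed here; a real list would be much longer.
HTML_TAG = re.compile(r"</?(?:em|b|h3|p)\b[^>]*>", re.IGNORECASE)

def hide_html(source):
    saved = []
    def stash(match):
        saved.append(match.group(0))
        return f"@@{len(saved) - 1}@@"
    return HTML_TAG.sub(stash, source), saved

def restore_html(text, saved):
    return re.sub(r"@@(\d+)@@", lambda m: saved[int(m.group(1))], text)

hidden, saved = hide_html(bad)
restored = restore_html(ET.fromstring(hidden).text, saved)

print(flattened)
print(rejected)
print(restored)
```

Both variants get the string back, but the first only works for well-formed markup and the second needs to know every html tag in advance, which is why neither appeals to me.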
So? Any help is appreciated.