   Re: [xml-dev] Escaping XML? Flattening the parse tree?


Others will know better than I do, but my understanding and experience of
XSLT (using xsltproc) is that this is exactly what happens. If there's no DTD
(?) and no rule for a tag, its text content just gets copied into the output.

Which means you don't need to tell it to ignore tags; just don't tell it
what to do with them.


On Thu, 2003-08-14 at 05:42, reinhard gantar wrote:
> Dear XMLers,
> XML might contain HTML that is not supposed to be processed
> as XML. It should be treated "as is", i.e. as a string of
> characters. Real-world content management systems based
> on XML (import/export, like OpenCMS) store items like title,
> introduction, bodytext, etc. and mark such tidbits up in XML.
> The text itself (the content) might be adorned by
> (X)HTML markup like <em>, <b>, <h3> and the like. Despite
> the fact that the CMS is interested in the XML-markup like
> (e.g.) <introduction>, <bodytext> or <title> to squirrel it away in
> some RDBMS, it is NOT interested in crunching the
> potentially complex HTML tree. The HTML
> markup travels from the editor of the document into the
> database, from there through some template mechanism and
> ends up "as such" in the browser in front of some eyeballs.
> So why parse it?
> In other words, in such a scenario a parser should stop parsing
> within a certain branch of the tree, ignore any markup (especially
> bad markup), and return a string of content instead of a tree.
> Does XML facilitate some mechanism to tell a parser
> that being between <introduction>
> and </introduction> (for example) means she has reached a
> leaf-node, so don't bother crunching "<em>massive</em>"
> into ('em', None, ['massive'], None) and give me the string
> as is instead?
> More precisely, is there some way to signal "no more parsing" in the DTD?
> This looks like the way to go. Maybe I've missed it in my
> tutorials, but I don't know how to do this. Such a DTD-statement
> would signal the parser that <title> is as deep as it gets in
> a document, xml-wise. Any markup within such a node is not
> treated as a stray bullet, but plain text:
> Is there something like "<!LEAFNODE title>"?
> The alternatives look disgusting:
> 0.) Parse and flatten the tree.
> Using muscle for doing all the work and then using extra
> muscle to undo it does not sound like a good way to go.
> Besides that: A simple DTD for simple structures requires
> a more or less complete html-DTD -- and XHTML at
> that. Every unbalanced <p>, every <p> balanced by
> <P>, will make the parser choke.
> 1.) Obfuscate source
> Pre-process the source.
> Replace HTML markup with some
> placeholder, e.g. <em> with 5813151 and </em> with
> 5813152. Post-process after parsing to get the original
> HTML back. Is it sufficient to say that I don't like that?
> 2.) Use markup for un-markup
> <introduction><title><unmarkup>Why 1984 will not be like 
> <em>1984</em></unmarkup>...
> This will drive people nuts who edit (otherwise simple)
> XML documents by hand.
> So? Any help is appreciated.
> Kind regards
> Gantar
> -----------------------------------------------------------------
> The xml-dev list is sponsored by XML.org <http://www.xml.org>, an
> initiative of OASIS <http://www.oasis-open.org>
> The list archives are at http://lists.xml.org/archives/xml-dev/
> To subscribe or unsubscribe from this list use the subscription
> manager: <http://lists.xml.org/ob/adm.pl>
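
For what it's worth, alternative 0 from the quoted message ("parse and flatten the tree") can be sketched in a few lines of Python with the standard library's xml.etree.ElementTree. This is only a rough illustration, not anything OpenCMS or a particular parser does: the helper name inner_markup and the sample document are made up here. It also shows the limitation the original post points out, namely that unbalanced markup makes the parser give up before any flattening can happen.

```python
import xml.etree.ElementTree as ET

def inner_markup(elem):
    """Hypothetical helper: rebuild the content of elem as one string."""
    parts = [elem.text or ""]
    for child in elem:
        # tostring() emits the child element followed by its tail text
        parts.append(ET.tostring(child, encoding="unicode"))
    return "".join(parts)

# Made-up sample document, not taken from any real CMS export
doc = ET.fromstring(
    "<item>"
    "<title>Why 1984 will not be like <em>1984</em></title>"
    "<introduction>A <em>massive</em> change.</introduction>"
    "</item>"
)

# One flat string per top-level field, inline markup left in place
fields = {child.tag: inner_markup(child) for child in doc}

# Unbalanced markup is rejected before any flattening can happen
try:
    ET.fromstring("<introduction><p>no closing tag</introduction>")
    unbalanced_ok = True
except ET.ParseError:
    unbalanced_ok = False
```

The parser still builds the full tree for every field, so this does the work and then undoes it, exactly as the original post complains. It only helps when the embedded markup is already well-formed.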

