xml-dev - Re: Mixed content considered harmful...

From: Paul Prescod <paul@prescod.net>
To: XML Dev <xml-dev@ic.ac.uk>
Date: Tue, 11 May 1999 14:32:56 -0500

John Cowan wrote:
> 
> Can you sketch an algorithm that will convert SGML-style (or &-less
> SGML-style) content models involving #PCDATA into content models
> involving #PCDATA and #WS, where #WS is a data type that matches
> only white space, such that random white space around tags will be properly
> accounted for?

Thanks for asking.

I don't think that you would convert the content models. You leave the
content models alone and just change your matching algorithm slightly.

#PCDATA is a token that matches any character data. Given A,#PCDATA,B,
#PCDATA matches the longest stretch of character data between A and B. 
#WS matches a stretch of whitespace.

When you are parsing, you always try to match (all) characters against
#PCDATA. If that fails AND the characters are whitespace then you ignore
or suppress them. If it fails but the characters are NOT whitespace then
of course you have a validity error. 

Token               Text              Result

#PCDATA             "abc"              "abc"
#PCDATA             "   "              "   "
#PCDATA not allowed "abc"              ERROR
#PCDATA not allowed "   "              "ignorable:[   ]"
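
The rule tabulated above can be sketched in a few lines of Python. This is
only an illustration, not a real validator: the name match_text and the
pcdata_allowed flag are made up for the example, and "whitespace" is assumed
to mean the four XML white space characters (space, tab, CR, LF).

```python
XML_WHITESPACE = " \t\r\n"  # assumption: XML's definition of white space


def match_text(pcdata_allowed: bool, text: str) -> str:
    """Classify one run of character data at a given point in a content model.

    pcdata_allowed says whether the content model accepts #PCDATA here
    (a hypothetical flag; a real parser would derive it from its state).
    """
    if pcdata_allowed:
        # #PCDATA matches any character data, whitespace included.
        return text
    if all(ch in XML_WHITESPACE for ch in text):
        # No #PCDATA match, but only whitespace: ignore or suppress it.
        return f"ignorable:[{text}]"
    # No #PCDATA match and real character data: validity error.
    raise ValueError(f"validity error: character data {text!r} not allowed here")
```

Calling match_text(True, "   ") hands the whitespace back as data, while
match_text(False, "   ") marks it ignorable and match_text(False, "abc")
raises the validity error.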

---

The only danger is if you put datatype nodes beside each other or datatype
nodes beside PCDATA. Then you could have problems with ambiguity in the
formal-grammar sense of the word (which IS a real problem). We could
handle this by disallowing content models that allow datatypes to be
adjacent or by requiring schema processors to detect and report a possible
ambiguity based on the actual definitions of the datatype.
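
The first option (disallowing adjacency) can be sketched for the simplest
case. This is an illustration under strong assumptions: the content model is
a flat sequence of token names, any token starting with "#" counts as a
datatype or #PCDATA token, and #DATE is a hypothetical datatype. Choice
groups and repetition, which a real check would have to handle, are left out.

```python
from typing import List, Tuple


def adjacent_data_tokens(sequence: List[str]) -> List[Tuple[str, str]]:
    """Return each pair of neighbouring data tokens in a flat sequence model.

    Assumption: element names are plain strings such as "A", and data
    tokens (#PCDATA or a datatype) start with "#".
    """
    def is_data(token: str) -> bool:
        return token.startswith("#")

    # Compare every token with its right-hand neighbour.
    return [
        (left, right)
        for left, right in zip(sequence, sequence[1:])
        if is_data(left) and is_data(right)
    ]


# A,#PCDATA,B has no adjacent data tokens; A,#DATE,#PCDATA,B has one pair.
ok_model = ["A", "#PCDATA", "B"]
risky_model = ["A", "#DATE", "#PCDATA", "B"]
```

A schema processor taking this route would reject risky_model because
adjacent_data_tokens returns a non-empty list for it.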

-- 
 Paul Prescod  - ISOGEN Consulting Engineer speaking for only himself
 http://itrc.uwaterloo.ca/~papresco

Diplomatic term: "Emerging Markets"
Translation: Poor countries. The great euphemism of the Asian financial
             meltdown. Investors got much more excited when they thought
             they could invest in up-and-comers than when they heard they
             could invest in the Third World. (Brills Content, Apr. 1999)
