The issue regularly comes up when protocols use XML for marshalling and do
not really have control over the string values.
For instance, in WebDAV, custom properties are marshalled as text nodes:
<author xmlns="whatever">Joe User</author>
The value of the author property often is outside the control of the WebDAV
protocol handler (it may be extracted from the resource or from a backend
database). So if it encounters a value with characters outside the allowed
XML character set, it has several choices:
- drop the property (claim it wasn't found)
- drop the property (report that it exists but cannot be marshalled)
- "fix" the property (remove offending characters)
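As a rough sketch of how a protocol handler might detect the problem and apply the third choice, the following checks a value against the XML 1.0 Char production and strips whatever falls outside it. The function names is_marshallable and strip_illegal are made up for this illustration and are not part of any WebDAV implementation.

```python
import re

# Anything NOT covered by the XML 1.0 Char production:
#   #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
_ILLEGAL = re.compile(
    r"[^\t\n\r\x20-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]"
)


def is_marshallable(value: str) -> bool:
    """Return True if every character may appear in an XML 1.0 document."""
    return _ILLEGAL.search(value) is None


def strip_illegal(value: str) -> str:
    """The "fix" option: silently remove the offending characters."""
    return _ILLEGAL.sub("", value)
```

A handler choosing one of the two "drop" options would only need is_marshallable; strip_illegal loses information, which is exactly the drawback of "fixing" the value.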
Another option comes to mind: choose a format that's as close to PCDATA as
possible, but preserves the "other" characters. For instance, to marshal
ASCII 31 (#x1F), one could think of:
a) Joe<?ctrl 1F?>User
b) Joe<x1f xmlns="namespace-for-control-characters"/>User
To applications that aren't aware of this format, (a) would look like
"JoeUser" (and it would still be valid, if a DTD is involved). (b) uses
namespaces instead of PIs, but would probably create more problems.
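To make format (a) concrete, here is a minimal sketch of the sending side. It assumes the PI target "ctrl" and a two-digit uppercase hex payload as in the example above; the function name encode_ctrl and the decision to cover all of #x0-#x1F except tab, LF and CR are assumptions of this sketch, not an agreed format.

```python
import re
from xml.sax.saxutils import escape

# C0 control characters that XML 1.0 does not allow as character data
# (tab, LF and CR are legal and are left alone).
_CTRL = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f]")


def encode_ctrl(value: str) -> str:
    """Serialize value as element content, using <?ctrl XX?> for controls."""
    parts = []
    pos = 0
    for match in _CTRL.finditer(value):
        # Ordinary text between control characters still needs &, <, > escaped.
        parts.append(escape(value[pos:match.start()]))
        parts.append("<?ctrl %02X?>" % ord(match.group()))
        pos = match.end()
    parts.append(escape(value[pos:]))
    return "".join(parts)
```

With this, encode_ctrl("Joe\x1fUser") gives the string shown in (a), and a consumer that ignores processing instructions sees "JoeUser".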
I think the pressure to extend the XML character set to control characters
could be minimized if there were a "common" way to embed these
characters without changing XML. Maybe the SOAP WG, the WebDAV WG and other
affected parties should try to come up with a common proposal (which could
then be implemented as a SAX filter).
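A SAX filter for the receiving side could look roughly like the sketch below, written against Python's xml.sax module. It assumes the PI form from (a), i.e. target "ctrl" with a hex code point as data; the class names CtrlFilter and TextCollector and the helper decode_text are invented for this illustration.

```python
import io
import xml.sax
from xml.sax.handler import ContentHandler
from xml.sax.saxutils import XMLFilterBase


class CtrlFilter(XMLFilterBase):
    """Turn <?ctrl XX?> PIs back into the character they stand for."""

    def processingInstruction(self, target, data):
        if target == "ctrl":
            # Report the PI downstream as ordinary character data.
            # Malformed hex data raises ValueError in this sketch.
            self.characters(chr(int(data.strip(), 16)))
        else:
            super().processingInstruction(target, data)


class TextCollector(ContentHandler):
    """Downstream handler that just concatenates all character data."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def characters(self, content):
        self.chunks.append(content)


def decode_text(xml_bytes: bytes) -> str:
    """Parse a document through the filter and return its text content."""
    collector = TextCollector()
    reader = CtrlFilter(xml.sax.make_parser())
    reader.setContentHandler(collector)
    reader.parse(io.BytesIO(xml_bytes))
    return "".join(collector.chunks)
```

Because the filter sits between the parser and the application, handlers that already consume SAX events would not need to change; they simply see the control character arrive through characters().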
> -----Original Message-----
> From: Michael Rys [mailto:email@example.com]
> Sent: Wednesday, December 19, 2001 1:55 AM
> To: Tim Bray; firstname.lastname@example.org
> Subject: RE: [xml-dev] Some comments on the 1.1 draft
> Tim, with all due respect, allowing #x0-#x1F inside element and
> attribute content would tremendously help users of XML that use non-XML
> string sources for their data and map it into XML without losing
> fidelity and without having to base64-encode otherwise normal strings.
> Most of these applications do not care about the semantics of ETX or
> EOM, but just that they are being preserved over the XML serialization.
> Example applications are: SOAP, XML database serializations etc.
> Best regards
> > -----Original Message-----
> > From: Tim Bray [mailto:email@example.com]
> > Sent: Friday, December 14, 2001 17:03 PM
> > To: firstname.lastname@example.org
> > Subject: [xml-dev] Some comments on the 1.1 draft
> > I sent this to the public blueberry-coments address, but
> > thought some of them might usefully be discussed here. If
> > someone wants to start an argument about one or more of
> > these, please pull it out and give it a separate subject
> > line.
> > ========================================================
> > 1. The principle of decoupling the XML spec from successive
> > revisions of Unicode is the only sensible way forward.
> > 2. If no consensus can be built around the details of this
> > set of changes, it would be acceptable to declare defeat and
> > go on with XML 1.0 2nd ed as-is. This would be a regrettable
> > outcome but not fatal at a deep level.
> > 3. Issue 18: The costs of allowing #x1-#x1F appear to me to
> > exceed the benefits. Among other things, many of these
> > ASCII control chars, despite being several decades old, have
> > little consensus concerning their semantics, e.g. EOT and EOM
> > (#x3 and #x4). I think from the XML point of view these things
> > are actively pernicious; specifically the notion that semantics
> > are embedded in characters rather than being expressed by markup.
> > The case of "textual content that may contain such characters
> > (but typically does not)" is pretty non-convincing. In *many*
> > cases the occurrence of these characters is evidence of an error.
> > 4. Issue 21: The cost of allowing null bytes in XML content is
> > very high and the benefits hard to understand.
> > 5. I strongly feel that #x85 (NEXT LINE) should not be added to
> > the S production. The reason is a simple cost-benefit analysis;
> > the proportion of computing installations where this is an issue
> > is not large and is shrinking as a proportion of the
> > infrastructure. Supporting this change imposes significant
> > conversion costs on the rest of the world; the total global
> > net cost would be significantly less if the mainframe software
> > infrastructure took the necessary corrective measures to deal
> > with XML 1.0 as specified.
> > 6. I strongly feel, even more so than in the case of #x85,
> > that #x2028 is inappropriate for inclusion in S. Here are
> > some reasons:
> > - If LINE SEPARATOR is to be included, why not the many
> > other Unicode characters with spacing semantics? A
> > coherent explanation needs to be provided on this
> > point and I am unconvinced that one exists.
> > - This would be the only core XML syntax character that
> > can't fit in a byte. This would complicate several
> > automaton-driven parser construction strategies. One
> > of the key design goals of XML is to make programmers'
> > lives simpler, so this objection should have weight.
> > - "For completeness" is a really flimsy argument.
> > 7. In , #x37a is included, which is a combining
> > character and shouldn't be in NameStart
> > 8. In , #xf7 is included (division sign), but the
> > rest of the mathematical operators (starting at
> > #x2200) are excluded.
> > 9. The inclusion of a block #x202A-#x218F is kind
> > of puzzling... it starts in the middle of one of the
> > punctuation blocks, and the first few chars seem
> > really unsuitable. What's the intent... wanting to
> > include the currency symbols? This definitely
> > needs some explanation.
> > 10. There are some problems in the #x2800-#xD7FF block.
> > Do we really want CJK radicals (#x2e80...), compatibility
> > Jamo, ideographic description chars, and so on?
> > 11. Should that block end at #xD7AF or #xD7FF?
> > 12. [#xFDE0-#xFFEF] includes the private use area and lots
> > of compatibility characters which XML 1.0 actually
> > deprecates for use at all, let alone as names. This
> > is astounding and needs some defense. If this is OK,
> > why not throw in all the punctuation?
> > 13. What's wrong with ASCII digits as name start chars, given
> > that all sorts of other digits are going in?
> > 14. There really needs to be some deep discussion in this
> > document of why this alternative was chosen. When I
> > look at some of the wildly unlikely things that are
> > allowed to appear in names, the obvious question is:
> > Why not rely on the Unicode properties database? In
> > particular, this allows lots of Name characters that
> > are not in fact Unicode characters at all and probably
> > never will be.
> > 15. Issue 11:
> > I can see both sides of this question. My intuition is
> > that the computational cost of doing this is unacceptably
> > high for high-throughput applications of XML, but we need
> > some research to establish if this is the case. If it can
> > be done cheaply and compactly, it's probably a good idea.
> > -----------------------------------------------------------------
> > The xml-dev list is sponsored by XML.org <http://www.xml.org>, an
> > initiative of OASIS <http://www.oasis-open.org>
> > The list archives are at http://lists.xml.org/archives/xml-dev/
> > To subscribe or unsubscribe from this list use the subscription
> > manager: <http://lists.xml.org/ob/adm.pl>