XML.org: OASIS Mailing List Archives

What to escape when serializing XML


Hi all,

I'm doing some bug fixing in a piece of code that does XML serialization (sort 
of), and could use some help in determining which characters need to be 
escaped with character references. It's all in the realm of XML 1.0.

The code in question is not intended to conform to XSLT 2.0 and XQuery 1.0 
Serialization, but that spec is nevertheless informative. For example, section 
5, XML Output Method, reads:

<quote>
A consequence of this rule is that certain characters MUST be output as 
character references, to ensure that they survive the round trip through 
serialization and parsing. Specifically, CR, NEL and LINE SEPARATOR 
characters in text nodes MUST be output respectively as "&#xD;", "&#x85;", 
and "&#x2028;", or their equivalents; while CR, NL, TAB, NEL and LINE 
SEPARATOR characters in attribute nodes MUST be output respectively as 
"&#xD;", "&#xA;", "&#x9;", "&#x85;", and "&#x2028;", or their equivalents. In 
addition, the non-whitespace control characters #x1 through #x1F and #x7F 
through #x9F in text nodes and attribute nodes MUST be output as character 
references.

XML 1.0 did not permit an XML processor to normalize NEL or LINE SEPARATOR 
characters to a LINE FEED character. However, if a document entity that 
specifies version 1.1 invokes an external general parsed entity with no text 
declaration or a text declaration that specifies version 1.0, the external 
parsed entity is processed according to the rules of XML 1.1. For this 
reason, NEL and LINE SEPARATOR characters in text and attribute nodes must 
always be escaped using character references, regardless of the value of the 
version parameter.

XML 1.0 permitted control characters in the range #x7F through #x9F to appear 
as literal characters in an XML document, but XML 1.1 requires such 
characters, other than NEL, to be escaped as character references. An 
external general parsed entity with no text declaration or a text declaration 
that specifies a version pseudo-attribute with value 1.0 that is invoked by 
an XML 1.1 document entity must follow the rules of XML 1.1. Therefore, the 
non-whitespace control characters in the ranges #x1 through #x1F and #x7F 
through #x9F must always be escaped, regardless of the value of the version 
parameter.
</quote>

These paragraphs give a good hint of the complexity involved, but they are not 
very exact ("Specifically, CR, NEL ...").

Does anyone know, or know how to determine, exactly which characters need 
to be escaped? I could set my brain to work and read the XML spec from start 
to finish, but I could easily get something wrong.
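
To make the question concrete, one possible reading of the quoted rules for 
XML 1.0 output can be sketched in Python as below. This is only a sketch under 
stated assumptions, not a conforming serializer: the names escape_text and 
escape_attribute are hypothetical, attribute values are assumed to be 
delimited by double quotes, and ">" is always escaped in text (a simple way to 
avoid emitting "]]>"). Note that the XML 1.0 Char production excludes the 
control characters #x1 through #x1F other than TAB, LF and CR, so they cannot 
be written even as character references; the sketch rejects them instead.

```python
import re

# Anything outside the XML 1.0 Char production is rejected outright:
# Char ::= #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
_ILLEGAL = re.compile("[^\t\n\r\x20-\ud7ff\ue000-\ufffd\U00010000-\U0010ffff]")

# Text nodes: markup characters, CR, the #x7F-#x9F controls (this range
# includes NEL, #x85) and LINE SEPARATOR (#x2028).
_TEXT = re.compile("[&<>\r\x7f-\x9f\u2028]")

# Attribute nodes: the same set plus the quote delimiter, TAB and LF, which
# attribute-value normalization would otherwise turn into spaces.
_ATTR = re.compile("[&<>\"\t\n\r\x7f-\x9f\u2028]")

_NAMED = {"&": "&amp;", "<": "&lt;", ">": "&gt;", '"': "&quot;"}


def _replace(match):
    """Return a predefined entity if one exists, else a hex character reference."""
    ch = match.group(0)
    return _NAMED.get(ch) or "&#x{:X};".format(ord(ch))


def _check(value):
    """Raise ValueError if value contains a character XML 1.0 cannot represent."""
    bad = _ILLEGAL.search(value)
    if bad:
        raise ValueError(
            "U+{:04X} is not allowed in XML 1.0".format(ord(bad.group(0)))
        )


def escape_text(value):
    """Escape a string for use as element content (hypothetical helper)."""
    _check(value)
    return _TEXT.sub(_replace, value)


def escape_attribute(value):
    """Escape a string for use inside a double-quoted attribute value."""
    _check(value)
    return _ATTR.sub(_replace, value)
```

With this sketch, escape_text("a\rb") yields "a&#xD;b", while TAB and LF pass 
through unchanged in text; escape_attribute escapes TAB and LF as well, 
because they would not survive attribute-value normalization on parsing.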


Cheers,

		Frans



Copyright 1993-2007 XML.org. This site is hosted by OASIS