RE: XML Quiz - canonicalization

Hi Roger, parsing is always curious.

As you know, there are many XML parsers available. Here are two levers for XML parsing questions like these.

It would be interesting to find a state diagram for an XML parser. Rudimentary search didn’t turn one up for me.
We have found that canonicalization (c14n) is a valuable preprocessing step, with these benefits:

Normalizes syntax
Post Schema Validation Infoset
Allows diffing and version control
Allows removal of default values (according to schema or doctype) without loss of information
Enables consistent digital signature and encryption
Gives a more meaningful comparison of compaction possible using EXI or zip/gzip
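
As a rough illustration of the "normalizes syntax" and "allows diffing" points, here is a minimal Python sketch. It assumes Python 3.8 or later, whose standard library provides xml.etree.ElementTree.canonicalize; note that this function implements Canonical XML 2.0 rather than the 1.1 Recommendation cited below, and the sample documents are made up for illustration.

```python
from xml.etree.ElementTree import canonicalize

# Two serializations of the same element: different quoting,
# attribute order, whitespace inside the tag, and empty-element syntax.
doc_a = "<Group   class='someClass'\n   DEF=\"someDEF\"></Group>"
doc_b = '<Group DEF="someDEF" class="someClass"/>'

canon_a = canonicalize(doc_a)
canon_b = canonicalize(doc_b)

# After canonicalization the two byte-level variants compare equal,
# which is what makes diffing and signing meaningful.
print(canon_a == canon_b)
print(canon_a)
```

Standard canonicalization writes the empty element as a start-tag/end-tag pair, which is one of the defaults that the X3D canonical form quoted below chooses to override.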

References

[1] Canonical XML Version 1.1, W3C Recommendation 2 May 2008, https://www.w3.org/TR/xml-c14n11

[2] Extensible 3D (X3D) encodings, Part 3: Compressed binary encoding, 4 Concepts, 4.2.3 X3D canonical form

https://www.web3d.org/documents/specifications/19776-3/V3.3/Part03/concepts.html#X3DCanonicalForm

X3D is highly numeric/geometric. Here are the details of the C14N additions we found useful. (Although written for Fast Infoset compression, our next version will reference EXI.)

4.2.3 X3D canonical form

Conceptually, the X3D scene input to the Fast infoset encoder is an XML-encoded document with certain restrictions. X3D canonical form eliminates file ambiguities that have no impact on the 3D content but which otherwise would negatively impact security issues, compression or parsing performance.

X3D canonical form is based on Canonical XML (see 2.[XML-Canonicalization]) which specifically allows modification to the default XML canonicalization rules. This provides the ability to establish equivalence between differently formatted (but functionally identical) XML documents. This capability is required for the application of XML Encryption (see 2.[XML-Encryption]) or XML Signature (see 2.[XML-Signature]) syntax and processing techniques.

The following X3D canonicalization restrictions are applied to an X3D scene (or scene fragment) prior to encryption, signature or compression:

Whitespace rules:

Whitespace is defined as carriage-return, line-feed, space, tab, and comma characters. Whitespace separates all MF-type array values, including individual element values within MFString arrays.
All whitespace characters are converted to a normalized (single-occurrence) blank character with no leading or trailing whitespace.
All literal characters within an SFString value are retained verbatim.
All literal characters within MFString array values are retained verbatim.
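
The whitespace rules above might be sketched in Python as follows for numeric MF-type values. The function name normalize_mf_whitespace is hypothetical, and the sketch deliberately must not be applied to SFString or MFString values, whose literal characters are retained verbatim.

```python
import re

# Whitespace as defined above: carriage-return, line-feed, space, tab, comma.
_X3D_WHITESPACE_RUN = re.compile(r"[ \t\r\n,]+")

def normalize_mf_whitespace(value: str) -> str:
    """Collapse each whitespace run to one blank and trim both ends.

    Intended for numeric MF-type arrays only (an assumption of this
    sketch); string-valued fields must be left untouched.
    """
    return _X3D_WHITESPACE_RUN.sub(" ", value).strip(" ")

print(normalize_mf_whitespace(" 0 0 0,\n\t1 0 0 ,, 1 1 0 "))
```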

Double-quote and single-quote characters:

Individual MFString array values are bounded by "double-quote" characters, each separated by a single space.
The overall MFString attribute is contained within 'single-quote' characters.

EXAMPLE 1 <NavigationInfo type='"WALK" "EXAMINE" "ANY"'/>

Single-quote characters within an MFString value are replaced by the &apos; character entity.
Double-quote characters within an MFString value are replaced by the &quot; character entity (and escaped by a leading backslash "\" character).
XML &quot; character references that delimit individual strings in an MFString array are converted to double-quote characters.

EXAMPLE 1 <Text string=' "\"Hello, quotation marks\"" "Line 2 has no quotation marks" '/> displays the following two lines:

“Hello, quotation marks”
Line 2 has no quotation marks

A default or substitute DTD as specified in Annex A of ISO/IEC 19776-1 is included following the <?xml version="1.0" encoding="UTF-8"?> header in canonical form.

NOTE 1 The default DTD is not included in the final Compressed binary encoding, only substitute DTD values are compressed.

Default or substitute X3D schema (see Annex B of ISO/IEC 19776-1) attributes are included in the root <X3D> element.

NOTE 2 The default X3D Schema attributes are not included in the final Compressed binary encoding, only substitute X3D schema attribute values are compressed.

Floating point and double precision.

Floating point values are not converted to or from scientific notation; instead they retain their original form.

Numeric values using scientific notation use a lower-case 'e' character.
Leading plus signs are omitted, both for mantissa and exponent.

EXAMPLE 2 2.004e3 (equal to 2004.0).

Decompressed values shall be numerically equal to, but need not necessarily match the form of, the original values.
Excess leading zeros, tens place or higher, are omitted.
Trailing zeros are omitted in the mantissa.
If there are no digits following the decimal point, the decimal point shall be omitted.
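
Here is one way the textual floating-point rules above could be sketched in Python. The function name canonical_float_text and its regular expression are assumptions for illustration, not part of the specification. The sketch works on the text of a single value, never converts to or from scientific notation, and leaves exponent digits as written.

```python
import re

# sign, integer digits, optional fraction, optional exponent
_NUMBER = re.compile(r"^([+-]?)(\d*)(?:\.(\d*))?(?:[eE]([+-]?)(\d+))?$")

def canonical_float_text(text: str) -> str:
    """Apply the textual rules above to one numeric token."""
    m = _NUMBER.match(text.strip())
    if m is None or not (m.group(2) or m.group(3)):
        raise ValueError(f"not a decimal number: {text!r}")
    sign, int_part, frac_part, exp_sign, exp_digits = m.groups()

    # Leading plus signs are omitted; minus signs are kept.
    sign = "-" if sign == "-" else ""

    # Excess leading zeros (tens place or higher) are omitted.
    int_digits = int_part.lstrip("0")
    if int_part and not int_digits:
        int_digits = "0"

    # Trailing zeros are omitted in the mantissa.
    frac_digits = (frac_part or "").rstrip("0")
    if not int_digits and not frac_digits:
        int_digits = "0"

    # The decimal point is omitted when no digits follow it.
    result = sign + int_digits + ("." + frac_digits if frac_digits else "")

    if exp_digits is not None:
        # Lower-case 'e', and no leading plus sign on the exponent.
        result += "e" + ("-" if exp_sign == "-" else "") + exp_digits
    return result

print(canonical_float_text("+2.0040E+3"))
```

With this sketch, the input +2.0040E+3 comes out in the form shown in EXAMPLE 2.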

Attributes with empty values are eliminated. Attributes with default values are eliminated in order to reduce encoded file size. Default values can be determined from the X3D DTD (see Annex A of ISO/IEC 19776-1) or X3D schema (see Annex B of ISO/IEC 19776-1). This rule supersedes the XML canonicalization rule that all attribute values are provided.
Attribute-value pairs for DEF, USE and (non-default) containerField shall appear before other attributes, which then follow in alphabetic order. This ordering typically provides higher parsing performance during subsequent decoding. This step supersedes the XML canonicalization rule that all attribute values are provided in alphabetic order.
All MFNode content for a child field shall be provided in a contiguous block with no intermixed containerField usage.

EXAMPLE 3 The following code exhibits X3D not in this form:

      <Collision>

         <Shape containerField="children" />

         <Shape containerField="proxy" />

         <Shape containerField="children" />

      </Collision>

The proper child-element grouping for canonical form is:

      <Collision>

         <Shape containerField="proxy" />

         <Shape containerField="children" />

         <Shape containerField="children" />

      </Collision>
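
The attribute-ordering and child-grouping rules above can be sketched in Python as shown here. Both function names are hypothetical. The relative order among DEF, USE and containerField is assumed to follow the order in which the rule lists them, and "alphabetic" is assumed to mean code-point order. The second function only checks contiguity; it does not choose an order for the blocks.

```python
PRIORITY = ("DEF", "USE", "containerField")

def order_attributes(names):
    """Return attribute names with DEF, USE, containerField first,
    then all remaining names in code-point order."""
    def key(name):
        if name in PRIORITY:
            return (0, PRIORITY.index(name), "")
        return (1, 0, name)
    return sorted(names, key=key)

def is_grouped(container_fields):
    """True when each containerField value forms one contiguous block."""
    seen = set()
    previous = None
    for field in container_fields:
        if field != previous:
            if field in seen:
                return False  # the same field resumed after an interruption
            seen.add(field)
            previous = field
    return True

print(order_attributes(["class", "containerField", "DEF", "bboxSize"]))
print(is_grouped(["children", "proxy", "children"]))   # first form in EXAMPLE 3
print(is_grouped(["proxy", "children", "children"]))   # canonical grouping
```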

Comments are always preserved. Since default values for the Compressed binary encoding are lossless, comments are retained by default.
SFImage data is written in hexadecimal form separated by normalized whitespace.
SF/MFInt32 values will be converted to decimal form if in hexadecimal form.
Empty tag pairs consisting solely of a start-tag element and an end-tag element are replaced with a single empty-element tag.

EXAMPLE 4 The construct:

<Group DEF="someDEF" class="someClass"></Group>

is converted to:

<Group DEF="someDEF" class="someClass"/>

CDATA sections (typically, ECMAScript or shader source code) are not converted into character entities. This step supersedes the XML canonicalization rule that CDATA sections are replaced with their character content.
Except where specifically overridden by the preceding rules, apply the rules of XML canonicalization, summarized here:

comments may optionally be included
normalize line feeds
normalize attribute values
resolve character and parsed entity references
sort attributes lexicographically

X3D scenes in canonical form shall be well-formed, validated XML. This property is a prerequisite to subsequent XML-based compression techniques.

all the best, Don

Don Brutzman Naval Postgraduate School, Code USW/Br brutzman@nps.edu

Watkins 270, MOVES Institute, Monterey CA 93943-5000 USA +1.831.656.2149

X3D graphics, virtual worlds, Navy robotics https://faculty.nps.edu/brutzman

From: Roger L Costello <costello@mitre.org>
Sent: Thursday, January 27, 2022 3:59 PM
To: xml-dev@lists.xml.org
Subject: [xml-dev] XML Quiz

Assume the XML document has no CDATA sections, PIs, comments, or DOCTYPE.

1. You are shown just a slice of an XML document:

> some text (possibly whitespace) not containing the less than symbol </

That is, you see a greater-than symbol, some text, and then a less-than symbol followed by a forward slash. You are not shown the stuff before > nor the stuff after </

What is it? Does the slice signify an element: the part before > is the start tag, the part after </ is its end tag, and text is the content of the element?

2. You are shown another slice of an XML document:

> whitespace <C

C = letter of the alphabet, colon, or underscore.

Does that slice signify the end of one element and the start of another element: the part before > is an end tag, the C in <C is the first character of a start tag, and whitespace separates the end tag from the start tag?

3. Is an end tag always followed by a less-than symbol (possibly with whitespace separating them)?

Scroll down to see the answers …

1. You are shown just a slice of an XML document:

> some text (possibly whitespace) not containing the less than symbol </

That is, you see a greater-than symbol, some text, and then a less-than symbol followed by a forward slash. You are not shown the stuff before > nor the stuff after </

What is it? Does the slice signify an element: the part before > is the start tag, the part after </ is its end tag, and text is the content of the element?

Answer: It might signify an element (start tag, content, end tag), e.g., <greeting>Hello, world</greeting>

But it might not. It might signify an end tag followed by another end tag, e.g., </D> </A>

2. You are shown another slice of an XML document:

> whitespace <C

C = letter of the alphabet, colon, or underscore.

Answer: It might signify the end of one element and the start of another element (with some whitespace between them), e.g., </book> <magazine>

But it might not. It might signify an element embedded in another element (with some whitespace between them), e.g., <document> <paragraph>

3. Is an end tag always followed by a less-than symbol (possibly with whitespace separating them)?

Answer: Yes, with one exception: the end tag of the root element is not followed by a less-than symbol.
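
The three answers can be checked with a small Python sketch that lists the kinds of each pair of consecutive tags. The names TAG, tag_kind and consecutive_tag_kinds are made up for illustration. The regular expression is a simplification, not a conforming parser: besides the quiz's assumptions (no CDATA sections, PIs, comments, or DOCTYPE), it also assumes no attribute value contains < or >.

```python
import re

# Optional leading slash (end tag), a name, anything up to the
# closing >, and an optional trailing slash (empty-element tag).
TAG = re.compile(r"<(/?)[A-Za-z_:][^\s/>]*[^<>]*?(/?)>")

def tag_kind(match):
    if match.group(1):
        return "end"
    if match.group(2):
        return "empty"
    return "start"

def consecutive_tag_kinds(xml):
    """Return (kind of tag before '>', kind of the next tag) pairs."""
    kinds = [tag_kind(m) for m in TAG.finditer(xml)]
    return list(zip(kinds, kinds[1:]))

for pair in consecutive_tag_kinds("<A><D>x</D> </A>"):
    print(pair)
```

A slice ending in </ shows up both as (start, end) and as (end, end), which is the ambiguity in question 1. The root element's end tag is the last tag in the list and so never appears as the first member of a pair, which matches the exception in question 3.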