RE: Two hugely significant conversions that XML parsers do

Hi Folks,

I find it totally fascinating that XML parsers convert their input into a standard character encoding scheme (Unicode) and convert line endings to linefeed characters. Applications that operate (reason) on the parsed input know exactly what they are working on.

Wicked neat!

Do other data format specifications specify that their parsers perform similar conversions?

Do JSON parsers convert their input into a standard character encoding scheme (Unicode) and line endings to linefeed characters?

Do CSV parsers (Comma Separated Value parsers) convert their input into a standard character encoding scheme (Unicode) and line endings to linefeed characters?

Do YAML parsers (YAML Ain't Markup Language parsers) convert their input into a standard character encoding scheme (Unicode) and line endings to linefeed characters?

Do Protocol Buffer parsers convert their input into a standard character encoding scheme (Unicode) and line endings to linefeed characters?

Or, does XML stand apart from other text data formats in this regard?

/Roger

From: Roger L Costello <costello@mitre.org>
Sent: Thursday, April 15, 2021 11:25 AM
To: xml-dev@lists.xml.org
Subject: Two hugely significant conversions that XML parsers do

Hi Folks,

An XML parser does two hugely significant conversions.

Suppose we provide input to an XML parser. Here are the conversions that the parser does to the input:

1. The parser converts the characters in the input to Unicode.

2. The parser converts line endings in the input to a linefeed character (hex 0A).

What are the consequences of these conversions?

Answer: your applications can operate on the parsed input with the understanding that the characters are Unicode and the lines end with a linefeed character.
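
Here is a minimal sketch of the two conversions, assuming Python's standard-library xml.etree.ElementTree module (which sits on top of the Expat parser). The sample document, its ISO-8859-1 encoding, and the variable names are made up for illustration; other parsers and languages should behave the same way, but this only demonstrates one.

```python
import xml.etree.ElementTree as ET

# Hypothetical input: bytes encoded as ISO-8859-1 (0xE9 is "e with acute"),
# containing a Windows line ending (CR LF) and an old Mac line ending (bare CR).
raw = (
    b'<?xml version="1.0" encoding="ISO-8859-1"?>\r\n'
    b'<note>caf\xe9\r\nline two\rline three</note>'
)

root = ET.fromstring(raw)
text = root.text

# Conversion 1: the application receives Unicode characters, not raw bytes.
print(type(text).__name__)   # str
print("\u00e9" in text)      # True

# Conversion 2: every line ending has become a single linefeed (hex 0A).
print("\r" in text)          # False
print(text.split("\n"))      # ['café', 'line two', 'line three']
```

The application code after the call to ET.fromstring never has to ask which encoding or which line-ending convention the original bytes used.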

I like the term that Amy used: your applications can _reason_ about the parsed input with the understanding that the characters are Unicode and the lines end with a linefeed character.

/Roger