Re: [xml-dev] Line ending normalization


From: "G. Ken Holman" <gkholman@CraneSoftwrights.com>
To: xml-dev@lists.xml.org
Date: Mon, 04 May 2009 15:13:33 -0400

At 2009-05-04 12:14 -0400, Bob Kline wrote:
>I'm having a hard time finding the language in the 1.0 spec [1] 
>which would make it clear whether the line ending normalization 
>which XML processors must perform (more precisely, "must behave as 
>if it normalized all line breaks ...") happens before or after the 
>replacement of character entities.

A line-end sequence consists only of naked (literal) characters, not 
of parsed numeric character references.

>In other words, for the following document:
>
><a>x&#x000d;&#x000a;y</a>
>
>is the value returned by the XML parser for the text content of 
>element e "x\r\ny" or "x\ny"?

"x\r\ny" because that is what is in the element ... there are no line 
end sequences in the element.

>Could someone point to the language which would address this timing 
>question?

Here:

   http://www.w3.org/TR/2008/REC-xml-20081126/#sec-line-ends

   XML parsed entities are often stored in computer files which,
   for editing convenience, are organized into lines. These lines
   are typically separated by some combination of the characters
   CARRIAGE RETURN (#xD) and LINE FEED (#xA).

   To simplify the tasks of applications, the XML processor MUST
   behave as if it normalized all line breaks in external parsed
   entities (including the document entity) on input, before
   parsing, by translating both the two-character sequence #xD #xA
   and any #xD that is not followed by #xA to a single #xA character.

Note that the "#xA" and "#xD" bits of text are *not* parsed numeric 
character references; they are only prose references to characters.  
This notation is an unambiguous way of referring to the characters, 
but it is the naked characters that are being referred to.

Note the bit "before parsing" ... so the naked line-end characters in 
the entity get replaced by a naked #xA first, and only *then* are the 
numeric character references of your example parsed as content.  The 
#xD and #xA those references produce are therefore never normalized.
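
As a rough illustration of that ordering, here is a small Python 
sketch.  It is not a real XML processor and nothing in it comes from 
the specification: the function names are hypothetical, and it only 
handles numeric character references in a bare string of content.

```python
import re

# Matches numeric character references such as &#x000d; or &#10;
_CHAR_REF = re.compile(r"&#(?:x([0-9a-fA-F]+)|([0-9]+));")


def normalize_line_breaks(raw: str) -> str:
    """Translate #xD #xA pairs and any lone #xD to a single #xA."""
    return raw.replace("\r\n", "\n").replace("\r", "\n")


def expand_char_refs(text: str) -> str:
    """Replace each numeric character reference with its character."""
    def repl(match):
        if match.group(1) is not None:
            return chr(int(match.group(1), 16))  # hexadecimal form
        return chr(int(match.group(2)))          # decimal form
    return _CHAR_REF.sub(repl, text)


def content_seen_by_application(raw: str) -> str:
    # Order matters: normalization happens on input, before parsing,
    # so characters produced by references are never normalized.
    return expand_char_refs(normalize_line_breaks(raw))


print(repr(content_seen_by_application("x&#x000d;&#x000a;y")))
print(repr(content_seen_by_application("x\r\ny")))
```

With this ordering the references survive as "x\r\ny", while a 
literal carriage return and line feed in the input collapse to "x\ny".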

>And do the major XML parser implementations handle this issue consistently?

I haven't tripped over a problem with this in various 
implementations ... have you observed inconsistent 
behaviour?  Certainly the specification seems unambiguous.
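
One quick way to check a given implementation is to feed it both 
forms and compare.  The sketch below assumes Python's standard 
xml.etree.ElementTree module (which sits on top of Expat); it tells 
you about that one parser only, not about implementations in general.

```python
import xml.etree.ElementTree as ET

# Line end expressed as numeric character references: not normalized.
with_refs = ET.fromstring(b"<a>x&#x000d;&#x000a;y</a>").text

# Line end expressed as literal CR LF bytes: normalized before parsing.
with_literals = ET.fromstring(b"<a>x\r\ny</a>").text

print(repr(with_refs))      # expected per the spec: 'x\r\ny'
print(repr(with_literals))  # expected per the spec: 'x\ny'
```

The same two inputs can be run through any other parser of interest 
to see whether it agrees.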

I hope this helps.

. . . . . . . . . . Ken

--
XQuery/XSLT/XSL-FO hands-on training - Los Angeles, USA 2009-06-08
Crane Softwrights Ltd.          http://www.CraneSoftwrights.com/x/
Training tools: Comprehensive interactive XSLT/XPath 1.0/2.0 video
Video lesson:    http://www.youtube.com/watch?v=PrNjJCh7Ppg&fmt=18
Video overview:  http://www.youtube.com/watch?v=VTiodiij6gE&fmt=18
G. Ken Holman                 mailto:gkholman@CraneSoftwrights.com
Male Cancer Awareness Nov'07  http://www.CraneSoftwrights.com/x/bc
Legal business disclaimers:  http://www.CraneSoftwrights.com/legal

Follow-Ups:
- Re: [xml-dev] Line ending normalization
  - From: Bob Kline <bkline@rksystems.com>

References:
- Line ending normalization
  - From: Bob Kline <bkline@rksystems.com>
