From: "Amelia A.Lewis" <firstname.lastname@example.org>
> I asked about this, and was told that it's supposed to be normalized to
> LF before whitespace processing happens. At which point I asked why CR
> was part of the S production, and was given this hideous hack, using
> parameter entities, that allows one to force an un-normalized CR into
> attribute content.
XML 1.1 has this problem even more so because of the (important) restrictions
on direct representation of the C1 control characters (which lets you know
in many common cases whether you are about to corrupt your nice databases).
In particular, the XML 1.n specs could stand being clarified about whether the
productions refer to external entities, internal entities, or
the post-parse document. In the case of the CRs in data, it is because
a CR could end up in the infoset, not because it can appear directly in an
I think the specs should be recast in terms of a (notional) preprocessing filter
on external entities that
0) converts encodings
1) barfs if a non-allowed character is present, such as a C1
2) normalizes newlines
3) normalizes data (SHOULD)
and which then removes all these considerations from impinging on the
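To make the notional filter above a little more concrete, here is a rough Python sketch of the four steps. It is only an illustration under stated assumptions: the function name preprocess_entity and its encoding parameter are made up for this example, the disallowed-character check covers only the C1 range discussed here (a real check would cover the full XML 1.1 RestrictedChar set), and NFC is assumed as the normalization form for the SHOULD step.

```python
import re
import unicodedata

# C1 controls U+0080..U+009F, minus NEL (U+0085), which XML 1.1 treats as a line end.
# Simplification: the C0 controls that XML 1.1 also restricts are not checked here.
_DISALLOWED_C1 = re.compile("[\u0080-\u0084\u0086-\u009f]")

# XML 1.1 line ends: CR LF, CR NEL, lone CR, NEL, and LINE SEPARATOR all become LF.
_LINE_ENDS = re.compile("\r\n|\r\u0085|\r|\u0085|\u2028")


def preprocess_entity(raw: bytes, encoding: str = "utf-8") -> str:
    """Hypothetical preprocessing filter for one external entity."""
    # 0) convert encodings
    text = raw.decode(encoding)

    # 1) barf if a non-allowed character is present, such as a C1 control
    match = _DISALLOWED_C1.search(text)
    if match:
        raise ValueError(
            f"disallowed character U+{ord(match.group()):04X} at offset {match.start()}"
        )

    # 2) normalize newlines
    text = _LINE_ENDS.sub("\n", text)

    # 3) normalize data (SHOULD); NFC is an assumption of this sketch
    return unicodedata.normalize("NFC", text)
```

With something like this run over each external entity first, the grammar proper would never see a literal CR or a raw C1 control, only whatever arrives later through references.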
> Which struck me as a completely bizarre and useless
> form of backward compatibility with SGML (the reason, insofar as I
> understand it, to preserve the hackishness of this particular hack), but
> so it goes.
No, I don't think this comes from SGML. SGML has a completely different
approach to lines: it does not even have CR and LF, but instead brackets
every line inside Record Start (RS) and Record End (RE) characters or signals.
That has the dubious virtue of not really corresponding to any text format,
and the side effect of making it more challenging to implement SGML with
standard libraries that use the now-ubiquitous \n. (RS/RE will make more
sense to regex people, who may be more used to thinking in terms of boundaries
between characters as well as characters themselves. XML didn't need it.)
> Seriously strange corners of XML. CR cannot appear in content when the
> S production is applied, except if you pull some 'rageous nonsense to
> make it do so, at which point one really *wonders* why it ought to be
> considered a space at all.
Not strange at all. The range of characters allowed directly is different from the
range of characters allowed using references. There are characters that are
nice to have that are not nice to use (C1 controls), and there are characters that
are nice to use but not nice to have (CR in markup).
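A small Python check of the direct-versus-reference distinction, using the standard library's expat-based ElementTree parser (the element name a is just a placeholder). It only covers the CR case; expat does not implement XML 1.1, so the C1 side is not shown.

```python
import xml.etree.ElementTree as ET

# A literal CR LF in the entity is normalized to LF before the application sees it.
literal = ET.fromstring("<a>x\r\ny</a>")

# A CR written as a character reference bypasses line-end normalization.
referenced = ET.fromstring("<a>x&#13;y</a>")

print(repr(literal.text))     # 'x\ny'
print(repr(referenced.text))  # 'x\ry'
```

So the CR can reach the parsed data only through a reference, which is the case the S production has to allow for.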