Re: XML Blueberry
- From: Rick Jelliffe <ricko@allette.com.au>
- To: xml-dev@lists.xml.org
- Date: Sun, 24 Jun 2001 17:18:04 +0800
From: "John Cowan" <jcowan@reutershealth.com>
> All Unicode 3.1 code points, including the unassigned ones, are already
> part of the XML document character set. (The trivial exceptions are
> most of the C0 control characters, the surrogate space, and U+FFFE/FF.)
> The issue here is the implicit NAMECHAR and NMSTCHAR declarations,
> if I remember my SGML 8-letterisms correctly.
The XML Second Edition only references Unicode 3.0, not 3.1.
According to the Unicode.org site, Unicode 3.1 "adds a large number of coded
characters."
http://www.unicode.org/unicode/reports/tr27/
Of these, most are CJK Unified Ideographs Extension B. These are characters
which must be considered bad practice for use in markup, perhaps with some
exceptions. They are mostly characters which readers may easily find
confusing, being archaic, regional, variant, uncommon or non-interoperable.
When we only had DTDs, it was important that as full as possible a range of
characters be allowed in names, because enumerations (which are presented
to the end-user or initial creator) need to support the native language
well.
Now that we have XML Schema datatypes, the need for XML Names to
support any character is tempered by the need for good markup which
avoids obscure characters. The need, in the XML 1.0 design, for simple
processing rules (all token separation and whitespace can be detected
in the ASCII range) is also less important, now that the DPH (Desperate
Perl Hacker) has Unicode-aware Perl with XML modules available.
So two important use cases for XML are no longer so important. We don't
need to consider the DPH so much (which would tend to favour expanding
the whitespace rules) and we don't need to allow obscure characters in
enumerations (which would tend to be against extensions to the name rules.)
><flame>If XML had insisted that the
> One True Representation of line-end is LF, and XML processors were
> passing through every CR in character content and coughing on every CR
> in markup, don't you think the situation would have been changed P.D.Q.?
> Justice delayed is justice denied, but better than justice denied
> forever.</flame>
But the change in XML was an explicit simplification of SGML's rules,
where there are Record Start and Record End signals implied into the
document, and the entity management maps them from the incoming
conventions. I thought the reason this simplification was possible
was because of the requirements of sending XML over HTTP, which is
the single most important use-case for "SGML on the Web".
Where does HTTP fit into this?
" When in canonical form, media subtypes of the "text" type use CRLF as
the text line break. HTTP relaxes this requirement and allows the
transport of text media with plain CR or LF alone representing a line
break when it is done consistently for an entire entity-body. HTTP
applications MUST accept CRLF, bare CR, and bare LF as being
representative of a line break in text media received via HTTP. In
addition, if the text is represented in a character set that does not
use octets 13 and 10 for CR and LF respectively, as is the case for
some multi-byte character sets, HTTP allows the use of whatever octet
sequences are defined by that character set to represent the
equivalent of CR and LF for line breaks. This flexibility regarding
line breaks applies only to text media in the entity-body; a bare CR
or LF MUST NOT be substituted for CRLF within any of the HTTP control
structures (such as header fields and multipart boundaries)."
http://www.ietf.org/rfc/rfc2068.txt section 3.7.1 (RFC 2068 is referenced
by the XML media types RFC, http://www.ietf.org/rfc/rfc2376.txt, which is
in turn informatively referenced by XML 1.0 2e.)
Because XML is "SGML on the WWW", any requirements imposed by
HTTP must be weighted extremely highly (and, indeed, it would be a mistake
for XML to do anything counter to HTTP.)
So I don't think there is any need for anyone to explode in flames. XML's
rules are aimed at being consonant with HTTP 1.1, which says clearly that
the MIME rule for text/* is CRLF, but that HTTP allows relaxing of this. XML
supports HTTP's relaxing, and so allows a multiplicity of mappings.
What seems quite clear from that passage is that, due to requirements
inherited from HTTP, the responsibility for mapping from non-CRLF line
breaks to CRLF line breaks (as required by ) lies with the sending system,
not with the receiving XML processor.
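To make the sender-side mapping concrete, here is a minimal Python sketch of converting any mix of CRLF, bare CR and bare LF to the MIME canonical CRLF before transmission. The function name to_mime_canonical is hypothetical, and the sketch assumes the text is already decoded to a string; it is an illustration of the idea, not what any particular server does.

```python
import re

# CRLF is listed first so that a CR LF pair counts as ONE line break,
# not as a bare CR followed by a bare LF.
_LINE_BREAK = re.compile(r"\r\n|\r|\n")


def to_mime_canonical(text: str) -> str:
    """Rewrite every line break (CRLF, bare CR, bare LF) as CRLF."""
    return _LINE_BREAK.sub("\r\n", text)


if __name__ == "__main__":
    mixed = "<a>\n<b/>\r</a>\r\n"
    print(repr(to_mime_canonical(mixed)))
```

A character such as the IBM newline is not touched by this mapping, which is the point: if it is meant as a line break in text/*, something at the sending end has to know that and map it.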
Section 19.4.1 is also relevant:
"19.4.1 Conversion to Canonical Form
MIME requires that an Internet mail entity be converted to canonical
form prior to being transferred. Section 3.7.1 of this document
describes the forms allowed for subtypes of the "text" media type
when transmitted over HTTP. MIME requires that content with a type of
"text" represent line breaks as CRLF and forbids the use of CR or LF
outside of line break sequences. HTTP allows CRLF, bare CR, and bare
LF to indicate a line break within text content when a message is
transmitted over HTTP.
Where it is possible, a proxy or gateway from HTTP to a strict MIME
environment SHOULD translate all line breaks within the text media
types described in section 3.7.1 of this document to the MIME
canonical form of CRLF. Note, however, that this may be complicated
by the presence of a Content-Encoding and by the fact that HTTP
allows the use of some character sets which do not use octets 13 and
10 to represent CR and LF, as is the case for some multi-byte
character sets."
Note that the last sentence refers, I believe, to multi-byte encodings in
which an octet 13 or 10 can occur inside the encoding of some other
character, rather than to CR and LF being at other code points.
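A small Python sketch of that reading, offered as an illustration only: in a multi-byte encoding such as UTF-16, octets 13 and 10 can turn up inside the encoding of a character that is neither CR nor LF. The helper name contains_newline_octets and the choice of U+0D0A as the sample character are mine, not taken from the RFC.

```python
def contains_newline_octets(text: str, encoding: str) -> bool:
    """Return True if the encoded form of text contains octet 13 or octet 10."""
    data = text.encode(encoding)
    return 0x0D in data or 0x0A in data


if __name__ == "__main__":
    sample = "\u0d0a"  # a single character; it is neither CR nor LF
    # Big-endian UTF-16 encodes U+0D0A as the two octets 0x0D 0x0A.
    print(sample.encode("utf-16-be"))
    print(contains_newline_octets(sample, "utf-16-be"))
    print(contains_newline_octets(sample, "utf-8"))
```

A gateway that rewrote line breaks octet by octet, without decoding first, would corrupt such text, which seems to be the complication the RFC has in mind.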
In the MIME RFC, http://www.ietf.org/rfc/rfc2046.txt, we see:
"4.1.1. Representation of Line Breaks
The canonical form of any MIME "text" subtype MUST always represent a
line break as a CRLF sequence. Similarly, any occurrence of CRLF in
MIME "text" MUST represent a line break. Use of CR and LF outside of
line break sequences is also forbidden.
This rule applies regardless of format or character set or sets
involved.
NOTE: The proper interpretation of line breaks when a body is
displayed depends on the media type. In particular, while it is
appropriate to treat a line break as a transition to a new line when
displaying a "text/plain" body, this treatment is actually incorrect
for other subtypes of "text" like "text/enriched" [RFC-1896].
Similarly, whether or not line breaks should be added during display
operations is also a function of the media type. It should not be
necessary to add any line breaks to display "text/plain" correctly,
whereas proper display of "text/enriched" requires the appropriate
addition of line breaks.
NOTE: Some protocols defines a maximum line length. E.g. SMTP [RFC-
821] allows a maximum of 998 octets before the next CRLF sequence.
To be transported by such protocols, data which includes too long
segments without CRLF sequences must be encoded with a suitable
content-transfer-encoding."
So the IBM character MUST NOT be used as a replacement for CRLF,
as a line break. If it is serving as a replacement, it MUST be mapped at
the server end.
If it is acting as some different character that is not a newline, why are
we considering it?
Actually, I am going too far: it only means that any XML that is sent
representing newlines using the IBM character rather than CR and/or LF
must be sent as application/*. But it would be a bad design error to
introduce a class of XML documents that can only be sent as application/*,
I suspect.
> > 2) state that "XML processors may, at user option, if they detect the
> > IBM newline or any other visual white-space in markup, element content
> > or in an entity/XML declaration, replace the characters with LF, as a
> > matter of entity management."
>
> That is what Blueberry does, except that the "user option" is expressed
> in the document, not by some out-of-band means. This is plausible,
> since it is the document creator who knows whether NEL, or post-2.0 name
> characters, or both, are being used.
I thought the proposal was to allow NEL as a distinct character from CRLF
to also act as whitespace. This is different from replacing it with LF.
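The difference between the two approaches can be sketched in Python. This is only an illustration under my own assumptions: NEL is taken to be U+0085, and the function names and the toy tokenizer are hypothetical, not anything from the Blueberry proposal or from a real XML processor.

```python
NEL = "\u0085"               # assumed code point for the IBM newline (NEL)
XML_WHITESPACE = " \t\r\n"   # the whitespace characters of XML 1.0


def replace_nel_with_lf(text: str) -> str:
    """Entity-management approach: the application never sees NEL."""
    return text.replace(NEL, "\n")


def split_tokens(text: str, whitespace: str = XML_WHITESPACE) -> list:
    """Toy tokenizer: split text on any character listed in whitespace."""
    tokens, current = [], []
    for ch in text:
        if ch in whitespace:
            if current:
                tokens.append("".join(current))
                current = []
        else:
            current.append(ch)
    if current:
        tokens.append("".join(current))
    return tokens


if __name__ == "__main__":
    data = "a" + NEL + "b"
    # Replacement: NEL has become LF before anything else looks at it.
    print(repr(replace_nel_with_lf(data)))
    # XML 1.0 rules: NEL is an ordinary character, so this is one token.
    print(split_tokens(data))
    # Extended whitespace: NEL separates tokens but is still NEL in content.
    print(split_tokens(data, XML_WHITESPACE + NEL))
```

In the first case downstream code only ever deals with LF; in the last case every consumer of character content has to know that NEL may be present.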
Cheers
Rick Jelliffe