
Re: XML Blueberry



 From: "John Cowan" <jcowan@reutershealth.com>

> All Unicode 3.1 code points, including the unassigned ones, are already
> part of the XML document character set.  (The trivial exceptions are
> most of the C0 control characters, the surrogate space, and U+FFFE/FF.)
> The issue here is the implicit NAMECHAR and NMSTCHAR declarations,
> if I remember my SGML 8-letterisms correctly.

XML 1.0 Second Edition references only Unicode 3.0, not 3.1.

According to the Unicode.org site, Unicode 3.1 "adds a large number of coded
characters."
http://www.unicode.org/unicode/reports/tr27/

Of these, most are CJK Unified Ideographs Extension B.  These are
characters whose use in markup must be considered bad practice, perhaps
with some exceptions: they are mostly characters which readers may easily
find confusing, being archaic, regional, variant, uncommon or
non-interoperable.

When we only had DTDs, it was important that as full a range of characters
as possible be allowed in names, because enumerations (which are presented
to the end-user or initial creator) need to support the native language
well.

Now that we have XML Schema datatypes, the need for XML Names to
support any character is tempered by the need for good markup which
avoids obscure characters.  In the XML 1.0 design, the need for simple
processing rules (all token separation and whitespace can be detected in
the ASCII range; a small sketch below shows this) is also less important,
now that the DPH has Unicode-aware Perl with XML modules available.

So two important use cases for XML are no longer so important.  We don't
need to consider the DPH so much (which would tend to favour expanding
the whitespace rules) and we don't need to allow obscure characters in
enumerations (which would tend to argue against extending the name rules).
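
To make the "detectable in ASCII space" point concrete, here is a rough
Python sketch (mine, purely illustrative): every XML 1.0 delimiter and S
character is ASCII, and UTF-8 never uses bytes below 0x80 inside a
multi-byte sequence, so a byte-level scanner can find token boundaries
without decoding anything.

    XML_S      = {0x20, 0x09, 0x0D, 0x0A}        # the S production
    DELIMITERS = {ord(c) for c in "<>=/'\"?!"}

    def boundary_offsets(utf8_bytes):
        # offsets of whitespace and markup delimiters, found byte by byte
        return [i for i, b in enumerate(utf8_bytes)
                if b in XML_S or b in DELIMITERS]

    greek = "\u03ba\u03b5\u03af\u03bc\u03b5\u03bd\u03bf"   # non-ASCII data
    doc = ('<p xml:lang="el">' + greek + "</p>").encode("utf-8")
    print(boundary_offsets(doc))   # boundaries fall only on ASCII bytes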


><flame>If XML had insisted that the
> One True Representation of line-end is LF, and XML processors were
> passing through every CR in character content and coughing on every CR
> in markup, don't you think the situation would have been changed P.D.Q.?
> Justice delayed is justice denied, but better than justice denied
> forever.</flame>

But the change in XML was an explicit simplification of SGML's rules,
where Record Start and Record End signals are implied in the document,
and the entity manager maps them from the incoming conventions.  I thought
this simplification was possible because of the requirements of sending
XML over HTTP, which is the single most important use case for
"SGML on the Web".
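
The whole of XML 1.0's end-of-line handling amounts to something like this
Python sketch, which just shows the shape of the rule: the processor
normalizes CRLF and bare CR to a single LF before parsing.

    def xml10_normalize_line_ends(text):
        # XML 1.0 section 2.11: CRLF -> LF, bare CR -> LF
        return text.replace("\r\n", "\n").replace("\r", "\n")

    assert xml10_normalize_line_ends("a\r\nb\rc\n") == "a\nb\nc\n"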

Where does HTTP fit into this?

"   When in canonical form, media subtypes of the "text" type use CRLF as
   the text line break. HTTP relaxes this requirement and allows the
   transport of text media with plain CR or LF alone representing a line
   break when it is done consistently for an entire entity-body. HTTP
   applications MUST accept CRLF, bare CR, and bare LF as being
   representative of a line break in text media received via HTTP. In
   addition, if the text is represented in a character set that does not
   use octets 13 and 10 for CR and LF respectively, as is the case for
   some multi-byte character sets, HTTP allows the use of whatever octet
   sequences are defined by that character set to represent the
   equivalent of CR and LF for line breaks. This flexibility regarding
   line breaks applies only to text media in the entity-body; a bare CR
   or LF MUST NOT be substituted for CRLF within any of the HTTP control
   structures (such as header fields and multipart boundaries)."

http://www.ietf.org/rfc/rfc2068.txt  section 3.7.1 (which is referenced by
the XML media types RFC, http://www.ietf.org/rfc/rfc2376.txt, which in turn
is informatively referenced by XML 1.0 2e).

Because XML is "SGML on the WWW", any requirements imposed by HTTP must be
weighted very heavily (and, indeed, it would be a mistake for XML to do
anything counter to HTTP).

So I don't think there is any need for anyone to explode in flames.  XML's
rules are aimed at being consonant with HTTP 1.1, which says clearly that
the MIME rule for text/* is CRLF, but that HTTP allows relaxing of this.
XML supports HTTP's relaxation, and so allows a multiplicity of mappings.
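
In code, the receiver-side half of that relaxation is no more than this
(a Python sketch; it assumes the entity-body has already been decoded to
characters):

    import re

    def http_text_lines(entity_body):
        # RFC 2068 3.7.1: CRLF, bare CR and bare LF all mark line breaks
        return re.split(r"\r\n|\r|\n", entity_body)

    print(http_text_lines("one\r\ntwo\rthree\nfour"))
    # ['one', 'two', 'three', 'four']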

What seems quite clear from that passage is that, due to requirements
inherited from HTTP, the responsibility for mapping from non-CRLF line
breaks to the CRLF line breaks required by MIME's canonical form for
text/* lies with the sending system, not with the receiving XML processor.
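
In other words the sender's job, sketched in Python (illustrative only; it
assumes the document is already decoded and uses CR, LF or CRLF locally),
is simply:

    def to_mime_canonical(text):
        # collapse whatever the local convention is, then emit CRLF
        lf_only = text.replace("\r\n", "\n").replace("\r", "\n")
        return lf_only.replace("\n", "\r\n")

    assert to_mime_canonical("a\nb\rc\r\nd") == "a\r\nb\r\nc\r\nd"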

Section 19.4.1 is also relevant:
"19.4.1 Conversion to Canonical Form

   MIME requires that an Internet mail entity be converted to canonical
   form prior to being transferred.  Section 3.7.1 of this document
   describes the forms allowed for subtypes of the "text" media type
   when transmitted over HTTP. MIME requires that content with a type of
   "text" represent line breaks as CRLF and forbids the use of CR or LF
   outside of line break sequences.  HTTP allows CRLF, bare CR, and bare
   LF to indicate a line break within text content when a message is
   transmitted over HTTP.

   Where it is possible, a proxy or gateway from HTTP to a strict MIME
   environment SHOULD translate all line breaks within the text media
   types described in section 3.7.1 of this document to the MIME
   canonical form of CRLF. Note, however, that this may be complicated
   by the presence of a Content-Encoding and by the fact that HTTP
   allows the use of some character sets which do not use octets 13 and
   10 to represent CR and LF, as is the case for some multi-byte
   character sets."

Note that the last sentence refers, I believe, to multi-byte encodings
which happen to contain octets 13 or 10 within their byte sequences, rather
than to CR and LF being assigned to other code points.
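
A small Python illustration of that complication, using UTF-16 as the
example encoding (my example, not the RFC's): octets 13 and 10 can turn up
inside code units that are not CR or LF at all, so a gateway has to decode
before it translates line breaks.

    # U+420D (a CJK ideograph) encodes in UTF-16-LE as octets 0x0D 0x42
    data    = "A\u420dB\n".encode("utf-16-le")
    naive   = data.replace(b"\x0d", b"\x0d\x0a")   # byte-level CR -> CRLF
    decoded = data.decode("utf-16-le")
    safe    = decoded.replace("\n", "\r\n").encode("utf-16-le")
    print(naive == safe)   # False: the naive pass corrupted U+420D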

In the MIME RFC, http://www.ietf.org/rfc/rfc2046.txt, we see

"4.1.1.  Representation of Line Breaks

   The canonical form of any MIME "text" subtype MUST always represent a
   line break as a CRLF sequence.  Similarly, any occurrence of CRLF in
   MIME "text" MUST represent a line break.  Use of CR and LF outside of
   line break sequences is also forbidden.

   This rule applies regardless of format or character set or sets
   involved.

   NOTE: The proper interpretation of line breaks when a body is
   displayed depends on the media type. In particular, while it is
   appropriate to treat a line break as a transition to a new line when
   displaying a "text/plain" body, this treatment is actually incorrect
   for other subtypes of "text" like "text/enriched" [RFC-1896].
   Similarly, whether or not line breaks should be added during display
   operations is also a function of the media type. It should not be
   necessary to add any line breaks to display "text/plain" correctly,
   whereas proper display of "text/enriched" requires the appropriate
   addition of line breaks.

   NOTE: Some protocols defines a maximum line length.  E.g. SMTP [RFC-
   821] allows a maximum of 998 octets before the next CRLF sequence.
   To be transported by such protocols, data which includes too long
   segments without CRLF sequences must be encoded with a suitable
   content-transfer-encoding."

So the IBM character (NEL) MUST NOT be used as a replacement for CRLF as
a line break.  If it is serving as such a replacement, it MUST be mapped
at the server end.

If it is acting as some different character that is not a newline, why are
we considering it?

Actually, I am going too far: it only means that any XML that represents
newlines with the IBM character rather than CR and/or LF must be sent as
application/*.  But it would be a bad design error to introduce a class of
XML documents that can only be sent as application/*, I suspect.
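
So the server-side fix, if NEL really is standing in as a line break, is a
one-liner along these lines (a Python sketch; NEL here is U+0085, the "IBM
newline"):

    NEL = "\u0085"

    def map_nel_before_serving_as_text(xml_text):
        # put NEL-as-line-break into the MIME canonical form before sending
        return xml_text.replace(NEL, "\r\n")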

> >  2) state that "XML processors may, at user option, if they detect the
> >     IBM newline or any other visual white-space in markup, element
> >     content or in an entity/XML declaration, replace the characters
> >     with LF, as a matter of entity management."
>
> That is what Blueberry does, except that the "user option" is expressed
> in the document, not by some out-of-band means.  This is plausible,
> since it is the document creator who knows whether NEL, or post-2.0 name
> characters, or both, are being used.

I thought the proposal was to allow NEL as a distinct character from CRLF
to also act as whitespace. This is different from replacing it with LF.
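
The difference shows up in the data.  Roughly, in a Python sketch (NEL
again being U+0085): under one reading NEL is merely admitted wherever
whitespace is allowed and survives into the parsed result; under the other
it disappears into LF the way CR does today.

    NEL = "\u0085"

    def is_s_char_with_nel(ch):        # NEL as an extra whitespace character
        return ch in " \t\r\n" or ch == NEL

    def normalize_nel_to_lf(text):     # NEL replaced on input, like CR -> LF
        return text.replace(NEL, "\n")

    sample = "<doc" + NEL + "a='1'/>"
    print(is_s_char_with_nel(NEL))              # True, and the data keeps NEL
    print(NEL in normalize_nel_to_lf(sample))   # False, it was mapped away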

Cheers
Rick Jelliffe