
   Suggestion for an alternative XML 1.1


 I am preparing an alternative proposal for XML 1.1, and I would appreciate
any help from sympathetic people on this list.

I think it is more productive to have concrete alternative proposals rather
than merely raising issues.

The basic idea, following a suggestion attributed to James Clark, is
that we may as well put in some kind of layer to bring out character issues
in XML. Actually, I take the reverse idea: we pull out character issues, to
make a lightweight version of XML. Where the current draft is very wrong
is that it throws out the naming rules entirely, rather than shifting them to
where they are appropriate: as part of validation.

I suggest something along these lines:

   1) It is called XML 1.1, if needed.
   2) It converts NEL on input to #xA.
   3) No changes to XML whitespace rules.
   4) The definitions for WF and Valid XML be altered:
        i) WF XML is simplified: same as current WF, except that
           naming rules are not used to parse the data; instead,
           delimiters and whitespace are used. The data NEED NOT be
           normalized or checked for normalization.* Encoding errors
           SHOULD cause failure. Name errors NEED NOT be reported,
           except for the presence of control characters (as in the
           current Blueberry draft).
       ii) Valid XML is made stricter and future-proofed:
           same as current validity, except that normalization
           must be performed before comparing identifiers. The
           current naming list should be made advisory, and a formula
           for creating the specific list using the Unicode 3.* identifier
           properties should be drawn up: this way the XML 1.1 spec
           formally tracks Unicode. It should mention that, because
           the libraries on a particular system may still be on a
           previous version of Unicode, use of characters
           introduced into Unicode in the previous two or three years
           (i.e. in Unicode 3.1) as markup is deprecated as unsafe.
           So after two or three years, when libraries are presumed to
           have been updated, those novel characters are automatically
           undeprecated, and the Unicode Consortium can keep on
           upgrading Unicode 3.*. (I would say that if Unicode wants
           a 4.0 sometime, that would indicate some major change or
           consolidation that would require special attention, such as
           an erratum.) Encoding errors MUST cause failure. Incoming
           data MUST be checked for normalization or (preferably)
           normalized.*
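
As a rough illustration of items 2 and 4(ii), here is a minimal Python
sketch. The function names are hypothetical, and the choice of NFC is an
assumption: the proposal above does not say which normalization form is
meant. It only handles NEL itself, not the existing CR/LF rules.

```python
import unicodedata

NEL = "\u0085"  # NEXT LINE, used as a line end on some mainframe systems


def convert_nel(text: str) -> str:
    """Item 2: turn NEL into #xA on input (existing CR/LF handling not shown)."""
    return text.replace(NEL, "\n")


def check_or_normalize(text: str, repair: bool = True) -> str:
    """Item 4(ii): validation-level handling of incoming data.

    NFC is an assumed normalization form. With repair=True the data is
    normalized (the preferred option above); otherwise unnormalized
    input is rejected.
    """
    if unicodedata.is_normalized("NFC", text):  # Python 3.8+
        return text
    if repair:
        return unicodedata.normalize("NFC", text)
    raise ValueError("input is not normalized")


def same_identifier(a: str, b: str) -> bool:
    """Item 4(ii): normalize before comparing identifiers."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)
```

With this, an "e" followed by a combining acute accent compares equal to the
precomposed character when used in an identifier, which is the point of
normalizing before comparison.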


I think moving this way would:
   1) Provide the least disruptive way to satisfy the requirement
      for NEL.
   2) Track Unicode changes in a rational way, allowing use
      of the characters for people in controlled or regional environments
      (e.g. Japan), while spelling out the risks clearly and promoting
      a timetable for Unicode upgrades to be deployed.
   3) Simplify XML by not needing character tables or
      Unicode libraries for WF-checking. Obviously the current XML WG
      is keen on simplifying things that don't affect them (which may be
      taken as an accusation as much as an observation) or they wouldn't
      have made their current proposal, and the changes I suggest would
      make a real difference in parsing rates, especially for non-ASCII
      names.**
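
To make point 3 concrete, here is a minimal Python sketch of reading a name
using only whitespace and delimiters, with no character tables. The
delimiter set and the control-character ranges are assumptions made for
illustration; they are not taken from the Blueberry draft or from any spec
text.

```python
from typing import Tuple

WHITESPACE = " \t\r\n"       # XML 1.0 whitespace
DELIMITERS = "<>/=\"'&;?!"   # assumed set of markup delimiters


def is_control(ch: str) -> bool:
    """Assumed control ranges: C0, DEL and C1."""
    code = ord(ch)
    return code < 0x20 or 0x7F <= code <= 0x9F


def scan_name(text: str, pos: int) -> Tuple[str, int]:
    """Return (name, next_pos); the name ends at whitespace or a delimiter."""
    end = pos
    while end < len(text):
        ch = text[end]
        if ch in WHITESPACE or ch in DELIMITERS:
            break
        if is_control(ch):
            raise ValueError(f"control character U+{ord(ch):04X} in name")
        end += 1
    if end == pos:
        raise ValueError("empty name")
    return text[pos:end], end
```

For example, scan_name("<dépôt id='1'>", 1) returns ("dépôt", 6) without
ever asking whether "é" or "ô" is a legal name character; that question is
left to validation.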

One wrinkle that should be addressed is that we could then have documents
with no DTD that are "valid" or "invalid" (because of name-checking). If this
is anomalous, then a three-layer model could be introduced instead:
"well-formed", "strictly well-formed" and "valid". The strictly
well-formed layer would pretty much correspond to current WF, and WF would be
the lightweight WF I suggest above. Strictly WF is also a convenient
slot for namespace naming rules in an XML 2.0. XML Schemas etc. should
specify that they require an infoset from a "strictly WF" document.
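
A sketch of how the extra layer might look in Python follows. It is only an
approximation under stated assumptions: the name rule here is a stand-in
built on Python's str.isidentifier(), which follows the Unicode
XID_Start/XID_Continue properties of whatever Unicode version the
interpreter ships with. It is not the XML 1.0 name table and not the
formula proposed above, and both function names are hypothetical.

```python
from typing import Iterable


def is_strict_name(name: str) -> bool:
    """Stand-in name check for the 'strictly well-formed' layer."""
    if not name:
        return False
    first, rest = name[0], name[1:]
    # Start character: '_' or ':' or anything with XID_Start.
    if not (first in "_:" or first.isidentifier()):
        return False
    # Following characters: '.', '-', ':' or anything with XID_Continue.
    return all(ch in ".-:" or ("a" + ch).isidentifier() for ch in rest)


def wf_layer(names: Iterable[str]) -> str:
    """Report the strongest name-checking layer these names satisfy.

    Assumes the names already passed the lightweight WF scan.
    """
    if all(is_strict_name(n) for n in names):
        return "strictly well-formed"
    return "well-formed"
```

A document whose names include something like "1st" would stay at the
lightweight "well-formed" layer, while one using only names such as "doc" or
"dépôt" would also qualify as "strictly well-formed".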

Cheers
Rick Jelliffe

*  The reason it should not be an error to find unnormalized data
in the simple cases is that the normalization state of data coming into the
parser is dependent on the transcoder used, if any, and is outside the
control of lay programmers to repair. For validation, I cannot see why
the purported security risk in allowing normalization of data coming into
a generic XML parser (as distinct from a c14n-specific parser) should
outweigh the advantages of normalizing incoming data.

** Why am I proposing simplification, when I am often on the
ultra-conservative side? Well, it is simplification of parsing techniques
that brings out something that was designed into XML 1.0: that whitespace
and delimiters are all that is really needed for parsing. And we already
have a mode for debugging and QA of XML: validation. Name-checking
(and its vitally important side-effect, that transcoding is verified
by name-checking) can be made part of validation without sacrificing
much. We are not changing the language, just refactoring where checks
should occur in a way that better suits high-volume processing and
small devices.





 


Copyright 2001 XML.org. This site is hosted by OASIS