I am preparing an alternative proposal for XML 1.1, and I would appreciate
any help from sympathetic people on this list.
I think it is more productive to have concrete alternative proposals rather
than merely raising issues.
The basic idea of this, following the idea attributed to James Clark, is
that we may as well put in some kind of layer to bring out character issues
in XML. Actually, I take the reverse idea: we pull character issues out, to
make a lightweight version of XML. Where the current draft is very wrong
is that it throws out the naming rules entirely, rather than shifting them to
where they are appropriate: as part of validation.
I suggest something along these lines:
1) It is called XML 1.1, if needed.
2) It converts NEL (U+0085) on input to #xA.
3) No changes to XML whitespace rules.
4) The definitions for WF and Valid XML should be altered:
i) WF XML is simplified: same as current WF, except that
naming rules are not used to parse the data; instead,
delimiters and whitespace are used (see the sketch after
this list). The data NEED NOT be normalized or checked for
normalization.* Encoding errors SHOULD cause failure. Name
errors NEED NOT be reported, except for the presence of
control characters (as in the current Blueberry draft).
ii) Valid XML is made stricter and future-proofed:
same as current validity, except that normalization
must be performed before comparing identifiers. The
current naming list should be made advisory, and a formula
for creating the specific list using the Unicode 3.* identifier
properties should be drawn up: this way the XML 1.1 spec
formally tracks Unicode. It should also mention that, because
the libraries on a particular system may still be on a
previous version of Unicode, use of characters introduced
into Unicode in the previous two or three years (i.e. in
Unicode 3.1) as markup is deprecated as unsafe.
So after two or three years, when libraries are presumed to
have been updated, those novel characters are automatically
undeprecated, and the Unicode Consortium can keep on
upgrading Unicode 3.*. (I would say that if Unicode wants
a 4.0 sometime, that would indicate some major change or
consolidation requiring special attention, such as errata.)
Encoding errors MUST cause failure. Incoming
data MUST be checked for normalization, or (preferably)
normalized.*
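
To make point 4(i) concrete, here is a rough sketch (in Python, purely
as illustration) of what the lightweight WF layer boils down to: NEL is
folded to #xA on input, names are recovered from delimiters and
whitespace alone, and the only fatal name error is a control character.
The function names and the exact delimiter set are my own choices, not
proposed spec text.

NEL = "\u0085"                       # the C1 "next line" control
DELIMITERS = set("<>/=?!&;\"'[]()")  # assumed markup delimiter set
WHITESPACE = set(" \t\r\n")

def fold_line_ends(text):
    # Point 2: convert NEL on input to #xA before anything else.
    return text.replace(NEL, "\n")

def scan_name(text, pos):
    # Point 4(i): a name simply runs from pos to the next delimiter or
    # whitespace character.  No Unicode name tables are consulted.
    start = pos
    while (pos < len(text)
           and text[pos] not in DELIMITERS
           and text[pos] not in WHITESPACE):
        code = ord(text[pos])
        if code < 0x20 or 0x7F <= code <= 0x9F:
            raise ValueError("control character in name at offset %d" % pos)
        pos += 1
    if pos == start:
        raise ValueError("expected a name at offset %d" % pos)
    return text[start:pos], pos

Checking such a name against the (now advisory) naming rules, and
against the Unicode identifier properties, then happens only in a
validating processor.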
I think moving this way would:
1) Provide the least disruptive way to satisfy the requirement
for NEL.
2) Track Unicode changes in a rational way, allowing use
of the characters for people in controlled or regional environments
(e.g. Japan), while spelling out the risks clearly and promoting
a timetable for Unicode upgrades to be deployed.
3) Simplify XML by not needing character tables or
Unicode libraries for WF-checking. Obviously the current XML WG
is keen on simplifying things that don't affect them (which may be
taken as an accusation as much as an observation) or they wouldn't
have made their current proposal, and the changes I suggest would
make a real difference in parsing rates, especially for non-ASCII
names.**
One wrinkle that should be addressed is that we can then have documents with
no DTD that can be "valid" and "invalid" (because of name-checking). If this
is anomalous, then a three-layer model could be introduced instead:
"well-formed", "strictly well-formed" and "valid". The strictly
well-formed would pretty much correspond to current WF, and the WF would be
the lightweight WF I suggest above. The strictly WF is also a convenient
slot for namespace naming rules in an XML 2.0. XML Schemas etc. should
specify that they require an infoset from a "strictly WF" document.
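
If the three-layer model were adopted, the division of labour might look
roughly like this (the allocation of checks to layers is my reading of
the paragraph above, not settled text):

# Which checks belong to which conformance layer; each layer also
# implies everything in the layers below it.
CHECKS_BY_LAYER = {
    "well-formed": [            # the lightweight WF suggested above
        "delimiter and whitespace parsing",
        "no control characters in names",
    ],
    "strictly well-formed": [   # roughly today's WF
        "XML 1.0 naming rules",
        "namespace naming rules (in an XML 2.0)",
    ],
    "valid": [
        "DTD validation",
        "normalization checking (or repair) of incoming data",
        "advisory Unicode identifier rules for names",
    ],
}

The point is that nothing is dropped; each check just moves to the
cheapest layer that genuinely needs it.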
Cheers
Rick Jelliffe
* The reason it should not be an error to find unnormalized data
in the simple cases is that the normalization state of data coming into the
parser is dependent on the transcoder used, if any, and is out
of the control of lay programmers to repair. For validation, I cannot see why
the purported security risk in allowing normalization of data coming into
a generic XML parser (as distinct from a c14n-specific parser) should
outweigh the advantages of normalizing incoming data.
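
In code, the validation-time rule of point 4(ii) amounts to something
like the following sketch (whether NFC is the right form, and whether a
processor repairs or merely reports, are open choices here; the function
names are mine):

import unicodedata

def names_match(declared, referenced):
    # Point 4(ii): normalize both identifiers before comparing them, so
    # that precomposed and decomposed spellings of the same name match.
    return (unicodedata.normalize("NFC", declared)
            == unicodedata.normalize("NFC", referenced))

def check_or_repair(text, repair=True):
    # Incoming data MUST be checked for normalization or (preferably)
    # normalized; this does whichever the caller asks for.
    normalized = unicodedata.normalize("NFC", text)
    if normalized != text and not repair:
        raise ValueError("incoming data is not normalized")
    return normalized

A well-formedness-only parser, by contrast, never calls either function.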
** Why am I proposing simplification, when I am often on the
ultra-conservative side? Well, it is simplification of parsing
techniques that brings out something that was designed into
XML 1.0: that whitespace and delimiters are all
that really is needed for parsing. And we already have
a mode for debugging and QA of XML: validation. Name-checking
(and its vitally important side-effect, that transcoding is verified
by name-checking) can be made part of validation without sacrificing
much. We are not changing the language, just refactoring where checks
should occur in a way that better suits high-volume processing and
small devices.