Simon St.Laurent wrote:
> I'm sorry, Alaric, but this is the classic story that's done so much to
> pollute XML and turn what was once a pleasant simplification into an
> industrial-strength nightmare. That it's frequently told by people who
> believe it doesn't do anything to help it.
Ok;
> I wish you could have been at the Extreme Markup Languages conference
> when Jeni Tennison gave a presentation on the impact of typing on XSLT
> and XPath 2.0. As C. Sperberg-McQueen summarized it, "I was watching
> all these faces, all of them asking 'if Jeni Tennison can't deal with
> this, how am I ever going to?'"
I think that the difference in typing between XSLT/XPath 1 and 2 is more
about the fact that they ripped out the old XPath type system (it had
its integers and strings and booleans and stuff) to replace them with
XML Schema-compatible ones, than just *adding* typing.
They didn't add stuff - they CHANGED stuff! So the old stuff isn't there
any more!
One perspective on this could be that switching from XSLT 1 to XSLT 2 is
like turning on an option - the programmer chooses to do so if they want
to, and otherwise sticks with XSLT 1.0. But the unfortunate consequence
of the difference living in a *version number* rather than an *option
flag* is that everyone assumes that 2.0 must be inherently better than
1.0 :-(
> 1) We don't all get to choose. We don't all get to choose our tools,
> and even fewer of us get to choose the data we work with. As these
> things spread across the landscape, they become unavoidable.
>
> All of the tools I write for processing XML now support namespaces.
> That isn't because I think namespaces are a good idea - in fact, I think
> they were the first sign that the people running XML had no clue what
> they were doing. I support them because I have to, both to make my
> tools usable by others and because I have to deal with namespaced
> information. I create it myself sometimes, a habit I got into when
> using other people's tools.
Ok. To see whether this same pattern could cause problems with a typed
extension to SAX, I'm going to try to map the namespace issues onto it.
Namespaces became everyone's problem because all the important XML
vocabularies started using them extensively, right? And because of the
transfer syntax of namespaces - with the prefixes - a processor that
isn't namespace aware really can't make much sense of namespaced
elements and attributes, since they have an effectively arbitrary prefix
shoved on them most of the time, while non namespace aware applications
would be using literal string comparisons between element names and
constants such as "first-name" to see which element was which.
Namespaces still aren't a problem for applications dealing with XML
vocabularies that don't use namespaces - it's just that there are very
few of those.
Part of the problem is down to the syntax used for namespaces; perhaps
it would have been better if the Namespaces rec didn't introduce
prefixes, but instead worked along the lines of:
1) Attributes don't get namespaces, only elements do
2) The attribute xml-namespace="URI" means that the containing element
is in that namespace, and so are all its children unless another
declaration states otherwise
That way, a non namespace aware application would still be able to rely
on the first name element being called "first-name", making it more
backwards compatible at the cost of greater verbosity due to repeating
that namespace URI every time you switch namespaces (ugly in XSLT for
example...).
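In other words, something along these lines (a made-up example - the
element names and URIs are purely illustrative):

    <person xml-namespace="http://example.org/people">
      <first-name>Alaric</first-name>
      <!-- children inherit http://example.org/people... -->
      <address xml-namespace="http://example.org/postal">
        <!-- ...until another declaration, like this one, overrides it -->
        <street>...</street>
      </address>
    </person>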
This is clear in hindsight. I'm sure that if the Namespaces rec authors
had thought about backwards compatibility, however, they would have come
up with something similar. I presume, therefore, that they were not
particularly worried about non-namespace aware applications, for one
reason or another.
SO - learning from the mistakes made with Namespaces, what lessons can
we take into account when doing a feasibility study of a type-aware SAX?
"Really really think about what life will be like for people who don't
want to use your optional extension, *even when they border on systems
that DO*."
Now, since this is just an API extension, it will have zero effect on
the interchanged bits on the wire, so we needn't worry about issues
there. All we need to ensure is that applications that don't need the
extension remain completely free of any need to change if their SAX
parser starts providing the option. Luckily, the SAX people are smart -
they use URIs in strings to identify extensions in a way that avoids
these issues.
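In SAX terms that means something like the following (the feature URI
here is invented for illustration - it is not a real SAX feature): a
parser that has never heard of the feature simply refuses it, and
nothing else changes.

    import org.xml.sax.SAXException;
    import org.xml.sax.XMLReader;
    import org.xml.sax.helpers.XMLReaderFactory;

    public class EnableTypedContent {
        public static void main(String[] args) throws Exception {
            XMLReader reader = XMLReaderFactory.createXMLReader();
            try {
                // Hypothetical feature URI, purely for illustration.
                reader.setFeature(
                    "http://example.org/sax/features/typed-content", true);
            } catch (SAXException e) {
                // SAXNotRecognizedException / SAXNotSupportedException:
                // this parser doesn't do typed content, so we carry on
                // with plain old SAX.
            }
        }
    }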
Is there a danger that, as with namespaces, lots of important XML
vocabularies might start to depend on this SAX extension in such a way
that applications are forced to use it to work with them? That's more of
a potential issue, but the SAX extension just automatically handles
something that you're already doing manually anyway - parsing strings to
get dates/integers/whatnot. You can still do it manually if you wish,
meaning that the optional extension is not the only way to read in dates
and so on - so it can't become a dependency if it's trivially removable.
Lots of XML specifications already rely on *something* parsing integers,
since they represent integers in decimal in XML!
Perhaps the biggest danger is that there might be a slow creeping wave
of highly complex syntaxes used in XML content - like SVG path
expressions, XPaths and so on - and that everyone gravitates towards
writing parsers for these as part of typed SAX parsers. So after a
while, to parse XPath, you have the choice of:
1) Use a typed SAX parser, which will return you an abstract syntax tree
for the XPath expression - indirectly forcing you to use the typed SAX
parser for your whole document whether you like it or not!
2) Write your own XPath parser from scratch
Prevention of (1) is why I agree with the original poster's idea of
having an option to the SAX engine to make it return *both* the original
unmolested text *and* its attempt at parsing it. That way you can simply
not use the parsing part (and ideally stop it wasting effort parsing
content you'll ignore, in one of many possible ways) and keep accepting
the plain characters for most of your application, while using the
parsing part where you need it.
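As a sketch of what I mean (none of this is real SAX - the interface
name and callback are my invention), the extension could sit alongside
the normal characters() callback rather than replacing it:

    import org.xml.sax.SAXException;

    // Hypothetical extension interface, NOT part of SAX: the ordinary
    // characters() callback keeps delivering the raw, unmolested text,
    // while this optional callback carries the parser's attempt at a
    // typed interpretation of the same content, when it has one.
    public interface TypedContentHandler {
        void typedValue(Object value, String typeName) throws SAXException;
    }

An application that never registers such a handler never sees any of it.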
This is a strong argument FOR making this an extension to SAX rather
than a new API - if you had to switch to a totally new API for all of
your XML reading just to parse XPaths, changing bits of your code that
really needn't change, that would suck!
> 2) Communicating expectations is harder than communicating data. Good
> documentation and schemas can provide more information, but there's a
> lot of experience behind "loosely coupled" vs. "tightly bound",
> especially where participants are widely distributed.
Yep - that's why I suggested the API handle the lack of type
information, or the failure of type information to match what's in the
document, by falling back to the existing SAX behaviour, in order to
avoid this problem.
> 3) Bad ideas that start in one place frequently wander elsewhere. W3C
> XML Schema is probably the classic example of this. It's widely
> despised, even at conferences - like last summer's Applied XML show -
> where everyone claims to need that kind of tool. Nonetheless, it
> continues to make life difficult for people from Word users to data
> binding implementers to XSLT developers.
Yeah :-(
The problem here, of course, is the original badness of the idea
combined with the unfortunate fact that it was proposed by a voice of
authority.
However, good ideas from voices of authority ALSO tend to spread :-)
> I'm happy to see ASN.1 working to make itself more accessible to
> developers with different expectations, and I'm still happy to see ASN.1
> at work for people who actually want schema-first tightly-coupled
> development. I'm not happy to see ASN.1-flavored proposals for
> revamping XML APIs because they don't fit ASN.1 expectations. Building
> bridges between the two worlds is good, but there's definitely a limit.
Think about the usefulness of typed SAX beyond ASN.1, however - typed
SAX events could be generated from an XML document with reference to a
schema in the schema language of your choice.
> XML has suffered enough here from types that you might want to pack up
> that circus wagon and find another freak show where it'll be more
> welcome. Please don't tell that bogus story about types being a
> harmless option if you want me to take you seriously.
How has XML suffered from types? As I see it:
1) The official language for attaching types to XML sucks
2) This has had knock-on effects, such as the XPath/XSLT type system
changing to align with XML Schema
But types in XML are *still* an optional add-on, in ways that namespaces
aren't! You only *need* to write code that knows anything about XML
schema languages if you're writing a schema validator or an XSLT 2.0
engine, right? You can ignore references to schemas and xsi:type
attributes to your heart's content, and your application that reads XML
purchase orders and handles them will still be able to work, yes?
A non-type-aware application that encounters <numFingers
xsi:type="integer">010</numFingers> (or, equivelantly, without the
xsi:type and instead with a schemaLocation attribute pointing to a
schema saying the same thing) will either:
1) If it has no hardcoded knowledge about the element, just ignore it or
pass it through verbatim, as applicable - preserving the leading 0,
since it does not know of any interpretation rules concerning the
element content, so MUST NOT ATTEMPT TO BE CLEVER.
2) Have hardcoded knowledge from the programmer (who had a copy of the
specification for the vocabulary in front of them) that numFingers
contains a positive decimal integer, and treat the content as the number
ten; the xsi:type is just redundant extra information here.
3) Incorrectly (because it's broken) assume that the contents of
numFingers is an integer written *backwards* with the least significant
digit first, remove the 'insignificant' zero at the end, and so output
something like <numFingers xsi:type="integer">01</numFingers>.
Case (3) is what people who worry about type-aware systems stripping
away their 'apparently redundant' information and breaking things seem
to fear. However, only *obviously broken* code does this...
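Case (2), after all, is nothing more exotic than this sort of thing
(ordinary Java, element name taken from the example above):

    public class FingerCount {
        public static void main(String[] args) {
            // The vocabulary spec says numFingers holds a decimal integer,
            // so parse it as such: "010" comes out as ten, and nothing
            // here ever rewrites the original text.
            int numFingers = Integer.parseInt("010");
            System.out.println(numFingers);   // prints 10
        }
    }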
But getting back to the point - typed SAX.
Type-aware interpretation of XML is a fact of life as soon as you start
passing anything other than human-language text in XML. As soon as you
have something like version="1.0" lurking around, software is going to
start doing things like converting that to a pair of integers and
performing integer comparisons to see if this is a version it can support.
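For example (a minimal sketch - the choice of "only 1.0 is supported" is
just an assumption for illustration):

    public class VersionCheck {
        // The sort of ad-hoc typing that already happens everywhere:
        // version="1.0" gets split into a pair of integers and compared.
        public static boolean supported(String version) {
            String[] parts = version.split("\\.");   // "1.0" -> {"1", "0"}
            int major = Integer.parseInt(parts[0]);
            int minor = Integer.parseInt(parts[1]);
            return major == 1 && minor == 0;         // only 1.0 is known
        }
    }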
Typed schema languages (like XML Schema, not so much like DTDs) tend to
set out a library of types, and a way of assigning those types to parts
of an XML document, in an attempt to formalise this typing. Without such
schema languages, we would instead say "The version attribute contains
the version number", thus assigning a type informally. HTML is strongly
typed; some attributes must contain a valid URI, or an integer (width=
and so on). This is not, in itself, a problem.
The problems seem to have arisen in the area of the schema languages.
But typed SAX - although it would DEPEND on some external schema
language or something like xsi:type to get its type information in the
first place - would not introduce any dependency on that source of type
information into the application... and as I have visualised the
interface, it would 'fail safe' in the absence of a schema by just
reporting character data, thus not introducing a dependency on schemas
or whatnot into the documents it processed.
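On the consumer side that 'fail safe' looks something like this (again a
sketch; typedValue is the hypothetical callback from earlier, not real
SAX):

    import org.xml.sax.helpers.DefaultHandler;

    public class PurchaseOrderHandler extends DefaultHandler {
        private final StringBuilder text = new StringBuilder();

        // Plain SAX character data is always delivered, so with no schema
        // in play the application behaves exactly as it does today.
        public void characters(char[] ch, int start, int length) {
            text.append(ch, start, length);
        }

        // Hypothetical extension callback: only ever invoked when the
        // parser actually found type information; otherwise it stays
        // silent and the accumulated characters above are all we use.
        public void typedValue(Object value, String typeName) {
        }
    }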
So, I ask, what could go wrong? :-)
I might have missed something, some unforeseen consequence... but I
think the fundamental nature of this thing (automatically doing
something the programmer would otherwise do by hand, but only if
explicitly asked to, and giving up gracefully if it can't be done
automatically) means that it can't possibly cause a problem.
ABS