Simon St.Laurent wrote:
> I'm sorry, Alaric, but this is the classic story that's done so much to
> pollute XML and turn what was once a pleasant simplification into an
> industrial-strength nightmare. That it's frequently told by people who
> believe it doesn't do anything to help it.
Ok;
> I wish you could have been at the Extreme Markup Languages conference
> when Jeni Tennison gave a presentation on the impact of typing on XSLT
> and XPath 2.0. As C. Sperberg-McQueen summarized it, "I was watching
> all these faces, all of them asking 'if Jeni Tennison can't deal with
> this, how am I ever going to?'"
I think that the difference in typing between XSLT/XPath 1 and 2 is more
about the fact that they ripped out the old XPath type system (it had
its integers and strings and booleans and stuff) to replace them with
XML Schema-compatible ones, than just *adding* typing.
They didn't add stuff - they CHANGED stuff! So the old stuff isn't there
any more!
One perspective on this could be that switching from XSLT 1 to XSLT 2 is
like turning on an option - the programmer chooses to do so if they want
to, and otherwise sticks with XSLT 1.0. But the unfortunate consequence
of the difference living in a *version number* rather than an *option
flag* is that everyone assumes that 2.0 must be inherently better than
1.0 :-(
> 1) We don't all get to choose. We don't all get to choose our tools,
> and even fewer of us get to choose the data we work with. As these
> things spread across the landscape, they become unavoidable.
>
> All of the tools I write for processing XML now support namespaces.
> That isn't because I think namespaces are a good idea - in fact, I think
> they were the first sign that the people running XML had no clue what
> they were doing. I support them because I have to, both to make my
> tools usable by others and because I have to deal with namespaced
> information. I create it myself sometimes, a habit I got into when
> using other people's tools.
Ok. To see whether this same pattern could cause problems with a typed
extension to SAX, I'm going to try to map the namespace issues onto it.
Namespaces became everyone's problem because all the important XML
vocabularies started using them extensively, right? And because of the
transfer syntax of namespaces - with the prefixes - a processor that
isn't namespace aware really can't make much sense of namespaced
elements and attributes, since they have an effectively arbitrary prefix
shoved on them most of the time, while non namespace aware applications
would be using literal string comparisons between element names and
constants such as "first-name" to see which element was which.
Namespaces still aren't a problem for applications dealing with XML
vocabularies that don't use namespaces - it's just that there are very
few of those.
Part of the problem is down to the syntax used for namespaces; perhaps
it would have been better if the Namespaces rec didn't introduce
prefixes, but instead worked along the lines of:
1) Attributes don't get namespaces, only elements do
2) The attribute xml-namespace="URI" means that the containing element
is in that namespace, and so are all its children unless another
declaration states otherwise
That way, a non namespace aware application would still be able to rely
on the first name element being called "first-name", making it more
backwards compatible at the cost of greater verbosity due to repeating
that namespace URI every time you switch namespaces (ugly in XSLT for
example...).
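In other words, something along these lines (a made-up example - the
element names and URIs are purely illustrative):

    <person xml-namespace="http://example.org/people">
      <first-name>Alaric</first-name>
      <!-- children inherit http://example.org/people... -->
      <address xml-namespace="http://example.org/postal">
        <!-- ...until another declaration, like this one, overrides it -->
        <street>...</street>
      </address>
    </person>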
This is clear in hindsight. I'm sure that if the Namespaces rec authors
had thought about backwards compatibility, however, they would have come
up with something similar. I presume, therefore, that they were not
particularly worried about non-namespace aware applications, for one
reason or another.
SO - learning from the mistakes made with Namespaces, what lessons can
we take into account when doing a feasibility study of a type-aware SAX?
"Really really think about what life will be like for people who don't
want to use your optional extension, *even when they border on systems
that DO*."
Now, since this is just an API extension, it will have zero effect on
the interchanged bits on the wire, so we needn't worry about issues
there. All we need to ensure is that applications that don't need the
extension remain completely free of any need to change if their SAX
parser starts providing the option. Luckily, the SAX people are smart -
they use URIs in strings to identify extensions in a way that avoids
these issues.
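In SAX terms that means something like the following (the feature URI
here is invented for illustration - it is not a real SAX feature): a
parser that has never heard of the feature simply refuses it, and
nothing else changes.

    import org.xml.sax.SAXException;
    import org.xml.sax.XMLReader;
    import org.xml.sax.helpers.XMLReaderFactory;

    public class EnableTypedContent {
        public static void main(String[] args) throws Exception {
            XMLReader reader = XMLReaderFactory.createXMLReader();
            try {
                // Hypothetical feature URI, purely for illustration.
                reader.setFeature(
                    "http://example.org/sax/features/typed-content", true);
            } catch (SAXException e) {
                // SAXNotRecognizedException / SAXNotSupportedException:
                // this parser doesn't do typed content, so we carry on
                // with plain old SAX.
            }
        }
    }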
Is there a danger that, as with namespaces, lots of important XML
vocabularies might start to depend on this SAX extension in such a way
that applications are forced to use it to work with them? That's more of
a potential issue, but the SAX extension just automatically handles
something that you're already doing manually anyway - parsing strings to
get dates/integers/whatnot. You can still do it manually if you wish,
meaning that the optional extension is not the only way to read in dates
and so on - so it can't become a dependency if it's trivially removable.
Lots of XML specifications already rely on *something* parsing integers,
since they represent integers in decimal in XML!
Perhaps the biggest danger is that there might be a slow creeping wave
of highly complex syntaxes used in XML content - like SVG path
expressions, XPaths and so on - and that everyone gravitates towards
writing parsers for these as part of typed SAX parsers. So after a
while, to parse XPath, you have the choice of:
1) Use a typed SAX parser, which will return you an abstract syntax tree
for the XPath expression - indirectly forcing you to use the typed SAX
parser for your whole document whether you like it or not!
2) Write your own XPath parser from scratch
Prevention of (1) is why I agree with the original poster's idea of
having an option to the SAX engine to make it return *both* the original
unmolested text *and* its attempt at parsing it. That way you can simply
not use the parsing part (and ideally stop it wasting effort parsing
content you'll ignore, in one of many possible ways) and keep accepting
the plain characters for most of your application, while using the
parsing part where you need it.
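As a sketch of what I mean (none of this is real SAX - the interface
name and callback are my invention), the extension could sit alongside
the normal characters() callback rather than replacing it:

    import org.xml.sax.SAXException;

    // Hypothetical extension interface, NOT part of SAX: the ordinary
    // characters() callback keeps delivering the raw, unmolested text,
    // while this optional callback carries the parser's attempt at a
    // typed interpretation of the same content, when it has one.
    public interface TypedContentHandler {
        void typedValue(Object value, String typeName) throws SAXException;
    }

An application that never registers such a handler never sees any of it.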
This is a strong argument FOR making this an extension to SAX rather
than a new API - if you had to switch to a totally new API for all of
your XML reading just to parse XPaths, changing bits of your code that
really needn't change, that would suck!
> 2) Communicating expectations is harder than communicating data. Good
> documentation and schemas can provide more information, but there's a
> lot of experience behind "loosely coupled" vs. "tightly bound",
> especially where participants are widely distributed.
Yep - that's why I suggested the API handle the lack of type
information, or the failure of type information to match what's in the
document, by falling back to the existing SAX behaviour, in order to
avoid this problem.
> 3) Bad ideas that start in one place frequently wander elsewhere. W3C
> XML Schema is probably the classic example of this. It's widely
> despised, even at conferences - like last summer's Applied XML show -
> where everyone claims to need that kind of tool. Nonetheless, it
> continues to make life difficult for people from Word users to data
> binding implementers to XSLT developers.
Yeah :-(
The problem here, of course, is the original badness of the idea
combined with the unfortunate fact that it was proposed by a voice of
authority.
However, good ideas from voices of authority ALSO tend to spread :-)
> I'm happy to see ASN.1 working to make itself more accessible to
> developers with different expectations, and I'm still happy to see ASN.1
> at work for people who actually want schema-first tightly-coupled
> development. I'm not happy to see ASN.1-flavored proposals for
> revamping XML APIs because they don't fit ASN.1 expectations. Building
> bridges between the two worlds is good, but there's definitely a limit.
Think about the usefulness of typed SAX beyond ASN.1, however - typed
SAX events could be generated from an XML document with reference to a
schema in the schema language of your choice.
> XML has suffered enough here from types that you might want to pack up
> that circus wagon and find another freak show where it'll be more
> welcome. Please don't tell that bogus story about types being a
> harmless option if you want me to take you seriously.
How has XML suffered from types? As I see it:
1) The official language for attaching types to XML sucks
2) This has had knock-on effects, such as the XPath/XSLT type system
changing to align with XML Schema
But types in XML are *still* an optional add-on, in ways that namespaces
aren't! You only *need* to write code that knows anything about XML
schema languages if you're writing a schema validator or an XSLT 2.0
engine, right? You can ignore references to schemas and xsi:type
attributes to your heart's content, and your application that reads XML
purchase orders and handles them will still be able to work, yes?
A non-type-aware application that encounters <numFingers
xsi:type="integer">010</numFingers> (or, equivelantly, without the
xsi:type and instead with a schemaLocation attribute pointing to a
schema saying the same thing) will either:
1) If it has no hardcoded knowledge about the element, just ignore it or
pass it through verbatim, as applicable - preserving the leading 0,
since it does not know of any interpretation rules concerning the
element content, so MUST NOT ATTEMPT TO BE CLEVER.
2) Have hardcoded knowledge from the programmer (who had a copy of the
specification for the vocabulary in front of them) that numFingers
contains a positive decimal integer, and treat the content as the number
ten; the xsi:type is just redundant extra information here.
3) Incorrectly (because it's broken) assume that the contents of
numFingers is an integer written *backwards* with the least significant
digit first, remove the 'insignificant' zero at the end, and so output
something like <numFingers xsi:type="integer">01</numFingers>.
Case (3) is what people who worry about type-aware systems stripping
away their 'apparently redundant' information and breaking things seem
to fear. However, only *obviously broken* code does this...
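Case (2), after all, is nothing more exotic than this sort of thing
(ordinary Java, element name taken from the example above):

    public class FingerCount {
        public static void main(String[] args) {
            // The vocabulary spec says numFingers holds a decimal integer,
            // so parse it as such: "010" comes out as ten, and nothing
            // here ever rewrites the original text.
            int numFingers = Integer.parseInt("010");
            System.out.println(numFingers);   // prints 10
        }
    }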
But getting back to the point - typed SAX.
Type-aware interpretation of XML is a fact of life as soon as you start
passing anything other than human-language text in XML. As soon as you
have something like version="1.0" lurking around, software is going to
start doing things like converting that to a pair of integers and
performing integer comparisons to see if this is a version it can support.
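For example (a minimal sketch - the choice of "only 1.0 is supported" is
just an assumption for illustration):

    public class VersionCheck {
        // The sort of ad-hoc typing that already happens everywhere:
        // version="1.0" gets split into a pair of integers and compared.
        public static boolean supported(String version) {
            String[] parts = version.split("\\.");   // "1.0" -> {"1", "0"}
            int major = Integer.parseInt(parts[0]);
            int minor = Integer.parseInt(parts[1]);
            return major == 1 && minor == 0;         // only 1.0 is known
        }
    }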
Typed schema languages (like XML Schema, not so much like DTDs) tend to
set out a library of types, and a way of assigning those types to parts
of an XML document, in an attempt to formalise this typing. Without such
schema languages, we would instead say "The version attribute contains
the version number", thus assigning a type informally. HTML is strongly
typed; some attributes must contain a valid URI, or an integer (width=
and so on). This is not, in itself, a problem.
The problems seem to have arisen in the area of the schema languages.
But typed SAX - although it would DEPEND on some external schema
language or something like xsi:type to get its type information in the
first place - would not introduce any dependency on that source of type
information into the application... and as I have visualised the
interface, it would 'fail safe' in the absence of a schema by just
reporting character data, thus not introducing a dependency on schemas
or whatnot into the documents it processed.
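On the consumer side that 'fail safe' looks something like this (again a
sketch; typedValue is the hypothetical callback from earlier, not real
SAX):

    import org.xml.sax.helpers.DefaultHandler;

    public class PurchaseOrderHandler extends DefaultHandler {
        private final StringBuilder text = new StringBuilder();

        // Plain SAX character data is always delivered, so with no schema
        // in play the application behaves exactly as it does today.
        public void characters(char[] ch, int start, int length) {
            text.append(ch, start, length);
        }

        // Hypothetical extension callback: only ever invoked when the
        // parser actually found type information; otherwise it stays
        // silent and the accumulated characters above are all we use.
        public void typedValue(Object value, String typeName) {
        }
    }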
So, I ask, what could go wrong? :-)
I might have missed something, some unforeseen consequence... but I
think the fundamental nature of this thing (automatically doing
something the programmer would otherwise do by hand, but only if
explicitly asked to, and giving up gracefully if it can't be done
automatically) means that it can't possibly cause a problem.
ABS