Re: [xml-dev] Interoperability [long]

At 10:31 AM 15/11/01 +0000, Sean McGrath wrote:

I apologize in advance for excerpting lots of Sean's
excellent prose for the sake of context.

>(Warning: free format attempt at documenting the
>problems I've been through in the last two weeks
>just getting some simple XML into a browser, validated
>and in/out of some simple filtering programs, follows:)

BTW, "just getting some simple XML into a browser" is a 
dream that I've been talking up since 1997, but we ain't
there yet.  I'm impressed that you're getting this to work
at all, and also that you're not getting shot down because
people insist on using IE, which has effectively zero
XML+CSS support - really irritating since it has pretty
damn good HTML+CSS support.

Anyhow, it seems that a lot of the problems you're having 
boil down to: "Opera doesn't support XML 1.0 very well."  

>1) Round-tripping problems
>Most of my XML processing is XML to XML processing.
>A variety of nasty things happen to things like entity refs,
>encodings, comments, cdata secs etc. The usual stuff
>I get in a fluff about on this list.

This is really interesting stuff.  A bit more detail I
think would be helpful to all of us here.  I've never 
built an application that depends on preserving CDATA
sections or comments downstream.  I can see the problem -
if you want to ship stuff from author to author, I 
imagine a certain amount of XML software will produce
logically equivalent output while losing stuff that's
important to the authoring process.

E.g. Perl will, unless you go to some work to preserve it.
How about recent python?
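
To make the kind of loss described above concrete, here is
a minimal sketch using the ElementTree module from a
present-day Python standard library (an assumption; it is
not the toolchain discussed in this thread). The sample
document and the name round_trip are made up for
illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: one comment and one CDATA section that an
# author might care about keeping.
SOURCE = (
    "<doc>"
    "<!-- note to the next author -->"
    "<p><![CDATA[a < b && c]]></p>"
    "</doc>"
)


def round_trip(xml_text):
    """Parse and re-serialize using the parser's default settings."""
    root = ET.fromstring(xml_text)
    return ET.tostring(root, encoding="unicode")


result = round_trip(SOURCE)
```

With default settings the comment is dropped and the CDATA
section comes back as ordinary escaped text: logically
equivalent content, but not what the author typed.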

On encodings, see below.

>2) Display problems
>It's amazingly hard to get a good result rendering
>XML with CSS2.
>It's not that CSS2 isn't up to it, it is that
>things like attribute defaults, entity expansions etc. that you
>want to keep external to the instance go unnoticed by
>XML browsers that don't read the external DTD.
>This stuff is real important for things like qualified
>styles. You end up adding things to your instance that you
>would prefer to leave external just to get the content
>to display right.

For general-purpose browser applications, I have a 
hard time believing that they'll ever be willing to
rely on downloading external entities to pull together
a page display.  That was the whole reason that Netscape
& MS demanded (& got) the right for a processor to 
bypass external entities back in 1996.   The explanation 
at the time was that the multithread parsing techniques 
they have to use to get acceptable page display performance 
simply did not allow for the possibility of having to go 
out inline and get arbitrarily-deeply-nested recursive 
external entity structures.  So *for display only* I think 
we're kind of stuck with that one. 

For an *authoring* application it's clearly necessary 
to handle external entities including DTDs & other 

>3) Namespace problems
>Back in the SGML days with things like Panorama (based
>on Synex Viewport) it was possible to get tabular display
>of arbitrary markup. In Opera 5 for example, you get
>tabular display by using the table model from the
>"http://www.w3.org/TR/REC-html40" namespace.
>But Opera don't read no external DTD, so I cannot do this:
><!ATTLIST table
>             xmlns   CDATA               #FIXED "http://www.w3.org/TR/REC-html40">
>I must add the attribute to *every* instance of the table in my documents.
>Then my authors complain saying "what the f&*k is this polluting
>my table markup".

Wouldn't it be OK to add this on the way out to the browser,
so the authors don't have to see it?  And it's worth mentioning
that there were big problems with Synex and other packages
in fetching DTDs and suchlike over the net.  In fact the 
most successful such product, from EBT, took an XML-like
everything-in-the-instance view if I recall correctly.
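
A rough sketch of the "add it on the way out" idea, written
as a SAX filter with Python's standard xml.sax module. The
class name AddTableNamespace and the function for_browser
are invented for this example, and it assumes the parser is
running in its default non-namespace mode, where xmlns shows
up as an ordinary attribute.

```python
import io
import xml.sax
from xml.sax.saxutils import XMLFilterBase, XMLGenerator
from xml.sax.xmlreader import AttributesImpl

HTML_NS = "http://www.w3.org/TR/REC-html40"


class AddTableNamespace(XMLFilterBase):
    """Add the default-namespace declaration to every <table>."""

    def startElement(self, name, attrs):
        if name == "table" and "xmlns" not in attrs:
            patched = dict(attrs.items())
            patched["xmlns"] = HTML_NS
            attrs = AttributesImpl(patched)
        super().startElement(name, attrs)


def for_browser(xml_text):
    """Return a copy of the document with xmlns added to tables."""
    out = io.StringIO()
    reader = AddTableNamespace(xml.sax.make_parser())
    reader.setContentHandler(XMLGenerator(out, encoding="utf-8"))
    reader.parse(io.BytesIO(xml_text.encode("utf-8")))
    return out.getvalue()
```

The stored documents stay free of the attribute; only the
copy served to the browser carries it. Note that a filter
like this is itself a round trip, so it inherits the loss
problems from point 1.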

>Now although I want to get tables for editing/browsing I don't
>want to throw away DTD validation. DTDs don't support
>namespaces. Bummer. One solution is to fix the prefix
>in the instance like this:
>        xmlns:x="http://www.w3.org/TR/REC-html40"
>and in the DTD like this:
>          xmlns:x   CDATA               #FIXED "http://www.w3.org/TR/REC-html40"
>Now I can validate but have wired the prefix. Bummer. Could use
>parameter entities to avoid that but then I scare my para-techs  with
>a DTD that looks rather complicated with all those percents
>%allovertheplace; (I told them XML would be easy!)
>I could just abandon validation. Don't like that option. Would end
>up coding too much data-validation in business logic. Could
>jump for a complete namespace aware schema language.
>Don't like the sound of that. People way smarter than me
>are not even sure that XML Schema is implementable!

This is a real problem.  Validation is A Good Thing, although
unlike you I never do it at run-time, just at design and authoring 
time.  Namespaces are also A Good Thing [yes, I know some here 
disagree].  DTDs don't do enough - in particular don't handle 
namespaces well - and we need something better.  It's not clear 
we have it yet.

>Hey! I could add the FIXED attributes into the internal subset.
>Cannot find any documentation on what Opera might
>do with such an approach... I know for sure that my filter developers
>writing SAX filters that have handlers for startElement(), endElement(), pis()
>and characters() will be unhappy if I tell them they need
>to round-trip the stuff in the internal subset. In fact I can tell
>you now for sure that that stuff will just get lost. I can hear
>the screams from the content manager now...

If Opera doesn't respect the internal subset it's just broken and 
in fact doesn't handle XML.  And I think is in a minority of

Yes, you clearly have to do extra work to round-trip XML
while keeping intact information that's important to authors.
The DOM could have been a small fraction of its size if this
hadn't been a problem.

There's a tension between the large number of people sending
around data who are unwilling to pay the extra cost and
complexity required for authoring-support capabilities, and 
those who want to support authors and see this as part of the 
basic package.  

I think the answer is, easy things should be easy and complex
things should be possible.  Shipping around a structured
document object with all the supporting material required 
for further authoring is *not* a simple process in any 
reasonable sense. 

>4) Locating DTDs
>I want to put DTDs somewhere central. I don't want to lug them around
>each directory I have XML files in so that:
><!DOCTYPE foo SYSTEM "foo.dtd">
>I could use a full URI but then I need HTTP running locally or live with
>the hit of pulling this stuff across an unreliable network. Not good.

XML as written requires that you either use local URIs or 
rely on the network.  I tend to prefer the latter, but then
again, I don't fetch DTDs at runtime.

>Could use SOCAT but patchy support on the ground for this. So
>much for freely interchangeable tools.

There has *always* been patchy support for this.  There has
*never* been industry consensus at the implementation level
for how to handle PUBLIC identifiers.  If there had been, it
would have been in XML 1.0.

As I said, I think this is one of the big outstanding
irritants and I'm surprised we've never actually managed to
get some momentum behind one of the alternatives. 
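
For illustration only, here is a sketch of what a local
lookup for PUBLIC and SYSTEM identifiers could look like as
a SAX entity resolver in Python. This is not the SOCAT
catalog format, just a hypothetical in-memory table; the
class name, the identifiers and the paths are all made up.
Whether a given parser actually consults the resolver for
the external DTD subset depends on the parser and how it is
configured.

```python
from xml.sax.handler import EntityResolver


class LocalCatalogResolver(EntityResolver):
    """Map public or system identifiers to local files."""

    def __init__(self, public_map, system_map):
        self.public_map = dict(public_map)
        self.system_map = dict(system_map)

    def resolveEntity(self, publicId, systemId):
        # Prefer the public identifier when one is known.
        if publicId in self.public_map:
            return self.public_map[publicId]
        # Otherwise rewrite a known system identifier, or fall
        # back to the identifier exactly as written.
        return self.system_map.get(systemId, systemId)


# Hypothetical entries: one central copy of each DTD.
resolver = LocalCatalogResolver(
    public_map={"-//Example//DTD Foo 1.0//EN": "/usr/local/share/dtds/foo.dtd"},
    system_map={"http://example.org/dtds/foo.dtd": "/usr/local/share/dtds/foo.dtd"},
)
```

It would be installed with parser.setEntityResolver(resolver)
on a SAX parser. The lookup is the easy part; the hard part
is getting every tool in the chain to agree on one table
format.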

>5) Creating simple hypertext effects
>The ball has been dropped on linking for years. This is not XML's fault
>but it sure doesn't help creating simple viewers for XML, which
>then reflects badly on XML. 

Yep.  It is unforgivable that XLink, which shouldn't have been
hard to specify, took so many years to get out the door.  Not
an XML problem, a politics/people/process problem.

>6) Character encodings
>I want to ensure that my documents do not use characters
>outside the ISO-8859-1 range. But I don't want to
>use an iso-8859-1 encoding declaration because parsers
>are not required to support it.

Every parser I've ever seen supports 8859-1.  Is there a
single counterexample?  But <snicker> that doesn't help you 
though, because I can always put &#x20ac; (Euro) in my 8859-1 
text.  BTW Sean, how do you do Euros?  SGML had SDATA 
entities, but they had poor interoperability and flaky
product support.

Here's one area where SGML (kind of) wins.  You could in
theory limit the charset to 8859-1 in the SGML declaration.  
Mind you, I never heard of anyone ever doing this on a 
production basis... toolset problems?

I guess the modern schema datatypes kind of allow you 
to do this via the regexp tools?
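
Outside of any schema language, the same check can be
sketched with a regular expression applied after parsing,
so that character references such as &#x20ac; have already
been expanded. This uses ElementTree from a present-day
Python standard library (an assumption), and the function
name chars_outside_latin1 is invented for the example. It
looks only at character data and attribute values, not at
element or attribute names.

```python
import re
import xml.etree.ElementTree as ET

# Anything above U+00FF is outside the ISO-8859-1 repertoire.
NON_LATIN1 = re.compile(r"[^\x00-\xff]")


def chars_outside_latin1(xml_bytes):
    """Return the sorted set of characters beyond U+00FF."""
    root = ET.fromstring(xml_bytes)
    found = set()
    for elem in root.iter():
        pieces = [elem.text or "", elem.tail or ""]
        pieces.extend(elem.attrib.values())
        for piece in pieces:
            found.update(NON_LATIN1.findall(piece))
    return sorted(found)


# Hypothetical document: declared as 8859-1, yet a Euro sign
# arrives through a character reference.
SAMPLE = (
    b'<?xml version="1.0" encoding="iso-8859-1"?>'
    b'<price cur="&#x20ac;">caf\xe9 &#x20ac;5</price>'
)
offenders = chars_outside_latin1(SAMPLE)
```

The accented letter passes, while the Euro sign is reported
even though the encoding declaration says 8859-1.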

>Oh, BTW, Opera and lots of other tools out there that
>call themselves XML compliant, don't do Unicode. Worse, they
>silently don't do Unicode. You find these things out
>the hard way.

Then they're NOT XML TOOLS and this is NOT XML's FAULT.
BTW, the browsers actually do a pretty good job in my
experience.  Hey Sean, let's name some names and put
some pressure on the vendors.

>Call me a fuddy-duddy but simple stuff like this
>was simpler with the *complex* SGML standard
>than it is with the *simple* XML standard.

I'll certainly buy into the premise that SGML tools
tend to be heavily authoring-focused.  One reason is
that in large part, all that ever happened with SGML
was you authored it and then you printed it.  The great
virtue was you could still print it 10 years later...
try that with MS Office.

>To return to the original spark of this, I believe that a significant
>part of the problem is that XML's definition is just syntax
>and compliance with the syntax doesn't tell you a lot
>when it comes to tying components together into complete

You've pointed useful fingers at some gaps in our tool
repertoire, particularly in the authoring-support and
content-management spaces.  It's not obvious to me that 
a focus on structure rather than syntax would really 
be that important in fixing these problems.  

And I stand by my claim, based only on my personal
experience, that in heterogeneous distributed environments,
it's easier to agree on syntax than on data structures.  
And way more robust.  Clearly there are those who have 
different experiences. -Tim