Re: [xml-dev] MicroXML

From: Amelia A Lewis <amyzing@talsever.com>
To: xml-dev@lists.xml.org
Date: Tue, 14 Dec 2010 01:00:28 -0500

On Tue, 14 Dec 2010 11:35:31 +0700, James Clark wrote:
>> How do I tell whether it's safe to use my uXML parser instead of my
>> (heavier) XML 1.0 + Namespace in XML + XML:Base + XML:ID + whatever
>> parser?
> 
> Given that MicroXML is designed to be a subset, how could there be a
> reliable in-band mechanism to tell you?  Anything you might put in the

Well, if MicroXML hadn't ruled out the use of most of the available 
indicators, then certainly something like a PI would be possible.

> document, has to be legal XML 1.0, so it can't be a reliable indicator that
> it's MicroXML rather than XML 1.0. Similar problem with telling how to use
> MicroXML rather than HTML.  I don't think this is any different from
> problems we have today. How do you choose between an HTML, XML or an SGML

<html>  Not in a namespace?  It's HTML.  In the XHTML namespace?  
XHTML.  Not <html>?  XML.  There's potential confusion for XML vs SGML 
if there is no XML declaration and there is a doctype declaration 
containing at least a system ID.  Hmmm.  Well, the available BNF 
suggests that the SGML declaration is not optional, either.  
http://xml.coverpages.org/sgmlsyn/index.htm, and especially sgmlsyn.htm 
there.
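
As a rough illustration of the root-element test above, here is a minimal Python sketch.  The helper name guess_format and the regular expressions are my own assumptions, not anything from a spec; it only looks at a default xmlns attribute on the root start tag (no prefixed names), and it does not attempt the XML-vs-SGML distinction at all.

```python
import re

XHTML_NS = "http://www.w3.org/1999/xhtml"

# Items that may precede the root element: whitespace, the XML declaration
# or other PIs, comments, and a doctype declaration (optionally with an
# internal subset in square brackets).
_PROLOG = re.compile(
    r"(?:\s+|<\?.*?\?>|<!--.*?-->|<!DOCTYPE[^\[>]*(?:\[.*?\])?\s*>)*",
    re.S | re.I,
)
_ROOT = re.compile(r"<([^\s/>]+)([^>]*)>", re.S)
_XMLNS = re.compile(r"""\sxmlns\s*=\s*(["'])(.*?)\1""", re.S)


def guess_format(text):
    """Guess 'html', 'xhtml', or 'xml' from the root start tag (hypothetical helper)."""
    # _PROLOG can match the empty string, so match() never returns None here.
    m = _ROOT.match(text, _PROLOG.match(text).end())
    if m is None:
        return "unknown"
    name, attrs = m.group(1), m.group(2)
    if name.lower() != "html":
        return "xml"          # not <html>: treat as generic XML
    ns = _XMLNS.search(attrs)
    # <html> in the XHTML namespace is XHTML; otherwise plain HTML.
    return "xhtml" if ns and ns.group(2) == XHTML_NS else "html"
```

For example, guess_format('<doc/>') gives 'xml', while an <html> root carrying xmlns="http://www.w3.org/1999/xhtml" gives 'xhtml'.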

> parser? There's no reliable in-band mechanism.  In the end, you have to rely
> on out-of-band information.

Perhaps.  I've been involved (somewhat peripherally) in SGML-related 
code: a proprietary parser/validator capable of handling XML, SGML, and 
some other things.  For performance, we used standard XML processors; 
even in 2000/2001 (when it was a live product), instances of XML 
outnumbered instances of SGML by a significant factor.  (In particular 
environments the reverse was true, but they didn't mind adding XML 
parsing to SGML--whereas adding SGML support to XML lost the value of 
XML, most thought.)

I don't think MicroXML reaches that standard.

I'm not concerned about distinguishing (Micro)XML from HTML--or from 
images, or from other easily recognizable file types.  The question, 
which I think is important, is how to safely use a small, fast MicroXML 
parser--rather than starting to use it, throwing away the results, and 
falling back to an XML parser.

> Nonetheless I can think of some heuristics.  I suspect it is very unusual in
> XML to have a DOCTYPE declaration with neither an internal nor an external
> subset.  Thus such a DOCTYPE declaration (regardless of the DOCTYPE name)
> could be a good indicator.

All right.  This means that MicroXML cannot be embedded, unless the 
doctype declaration is stripped (or the Root Element Type validity 
constraint is to be ignored 
(http://www.w3.org/TR/REC-xml/#vc-roottype)).  The only potential for 
confusion is for HTML5 polyglot markup; almost any other use case for 
XML is going to include either a system id with URI or a public id 
(with FPI and URI).
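
To make the doctype heuristic concrete, here is a minimal Python sketch of the check as I understand it: a doctype declaration carrying only a name (no SYSTEM or PUBLIC identifier, no internal subset) is taken as a hint that a MicroXML parser may be safe.  The function names and the return labels are hypothetical, and this is a sniffing heuristic rather than a real parser: a "<!DOCTYPE" string inside a comment or CDATA section would fool it.

```python
import re

# A "bare" doctype declaration: just a name, with no external identifier
# (SYSTEM/PUBLIC) and no internal subset in square brackets.
_BARE_DOCTYPE = re.compile(r"<!DOCTYPE\s+([^\s>\[]+)\s*>")
_ANY_DOCTYPE = re.compile(r"<!DOCTYPE\b")


def looks_like_microxml(text):
    """Return True only when the first doctype declaration is bare."""
    found = _ANY_DOCTYPE.search(text)
    if found is None:
        return False  # no in-band indicator at all
    return _BARE_DOCTYPE.match(text, found.start()) is not None


def choose_parser(text):
    """Pick a parser label; default to the more liberal format (XML)."""
    return "microxml" if looks_like_microxml(text) else "xml"
```

With this sketch, '<!DOCTYPE doc><doc/>' selects "microxml", while a document with a system id, an internal subset, or no doctype declaration at all falls back to "xml".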

> I think the general policy has to be that if you don't have out of band
> information, then use the more liberal format (ie XML or HTML5 rather than
> MicroXML).

Oops.  Is MicroXML actually attractive enough to see significant takeup 
if the recommendation is that safe parsing in the absence of an 
out-of-band indicator is to use something else?

Ah, well.  It appears that this proposal is targeted 
primarily-nearly-exclusively toward bridging XML with HTML5?  Is that a 
fair characterization?  If so, I'll slide off and stop being annoying 
(I don't have any interest worth mentioning in the behavior of 
browsers).

I'd like to see a 'next generation'.  I'm starting to wonder if we 
haven't got at least a couple or three different use cases:

a) the confluence-with-the-browser case, where JSON and HTML5 are going 
to be mentioned and targeted, where removing namespaces is accepted as 
a near-given, but extensibility doesn't seem particularly important;

b) the xml-over-the-network case (including exempli gratia SOAP, but 
also less RPC-ish document/resource interactions), where the doctype 
decl has long been forbidden, and namespace improvements would be grand 
but no one can afford to throw out the distributed-authority baby with 
the prefix-mapping bathwater, and 'typing' (fsvo 'typing') is liable to 
be an issue;

c) the document/store case (likely including databases), where the 
entire prolog causes discomfort, and namespace simplification is 
regarded as unattainable utopianism, again with at least a part of the 
crowd concerned with 'typing'; this case may also include those doing 
extensive pipeline processing (or perhaps that's yet another use case).

I'm probably not outlining the groups well.  What all have in common, I 
think: elements good, attributes good; things that aren't elements or 
attributes bad (comments seem to be tolerated better than PIs or 
anything from the prolog).  I don't know that the browser-case 
antipathy to extensibility via a distributed authority can be 
reconciled with the much less drastic desire to address the various 
(and variously interpreted or understood) shortcomings of the 
namespaces specification for establishing a distributed authority for 
vocabularies.

Hmmmmm.  *shrug*

Amy!
-- 
Amelia A. Lewis                    amyzing {at} talsever.com
Yankees are compelled by some mysterious force to imitate Southern 
accents and they're so damn dumb they don't know the difference between
a Tennessee drawl and a Charleston clip.
                -- Rita Mae Brown, "Rubyfruit Jungle"

Follow-Ups:
- Versioning MicroXML (Was: MicroXML)
  - From: "Pete Cordell" <petexmldev@codalogic.com>
- Re: [xml-dev] MicroXML
  - From: James Clark <jjc@jclark.com>

References:
- MicroXML
  - From: James Clark <jjc@jclark.com>
- Re: [xml-dev] MicroXML
  - From: Amelia A Lewis <amyzing@talsever.com>
- Re: [xml-dev] MicroXML
  - From: James Clark <jjc@jclark.com>
