Re: [xml-dev] nextml

From: Michael Sokolov <sokolov@ifactory.com>
To: Amelia A Lewis <amyzing@talsever.com>
Date: Thu, 09 Dec 2010 00:37:03 -0500
I'm with you most of the way; can I slip in a couple more additions (and 
a few cavils)?

I think this tiny change would be new to this discussion (and easy to 
implement and backwards compatible since it doesn't affect 
well-formedness?): let's not mandate line-terminator normalization.  It 
hobbles one use case of interest to me: I want to be able to record the 
position of elements as byte offsets in an original source file and use 
those to extract well-formed fragments as text (think extracting 
snippets and highlighting in search results).  This can't be done 
reliably in a SAX or StAX handler if the parser alters text in a 
non-reversible manner: you can make a guess if you know what the 
original line endings were, but if they're mixed all bets are off.  
Currently one has to use HTML parsers for this.
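
To make the offset problem concrete, here is a minimal Python sketch. It is not tied to any real parser: `normalize_line_endings` is a hypothetical stand-in for the end-of-line handling XML 1.0 requires (CRLF and lone CR both become LF), and the two sample documents are made up for illustration.

```python
def normalize_line_endings(data: bytes) -> bytes:
    """Hypothetical stand-in for XML 1.0 end-of-line handling:
    CRLF and lone CR are both turned into LF."""
    return data.replace(b"\r\n", b"\n").replace(b"\r", b"\n")

# Two different source files, each with mixed line endings.
source_a = b"<doc>\r\n<p>one</p>\r\n<p>two</p>\n</doc>"
source_b = b"<doc>\n<p>one</p>\r\n<p>two</p>\r\n</doc>"

seen_a = normalize_line_endings(source_a)
seen_b = normalize_line_endings(source_b)

# After normalization the two inputs are indistinguishable...
print(seen_a == seen_b)

# ...yet the same element starts at a different byte offset in each original,
# and at a third offset in the normalized text the handler sees.
offset_a = source_a.index(b"<p>two</p>")
offset_b = source_b.index(b"<p>two</p>")
offset_seen = seen_a.index(b"<p>two</p>")
print(offset_a, offset_b, offset_seen)
```

Since both sources collapse to the same normalized bytes, nothing the handler receives can say which original offset is the right one.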

One more mini-addition: would it be possible to have parsers ignore the 
BOM at the start of a UTF-8 file?  Some editors seem to insist on 
creating them, they are allowed by the UTF-8 spec, and probably ought to 
be considered external to the actual file content.  Also, maybe if we're 
going to allow multiple root elements we could also allow whitespace in 
the prolog?  People often put it there, and it seems like something 
that could be tolerated easily enough.
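
Until parsers tolerate this themselves, a pre-processing step can approximate it. The following is only a sketch: `strip_bom_and_leading_space` is a hypothetical helper, and it assumes the whitespace in question is whatever precedes the XML declaration.

```python
import codecs
import xml.etree.ElementTree as ET

def strip_bom_and_leading_space(data: bytes) -> bytes:
    """Hypothetical helper: drop a UTF-8 BOM, then any whitespace
    that precedes the first markup in the file."""
    if data.startswith(codecs.BOM_UTF8):
        data = data[len(codecs.BOM_UTF8):]
    return data.lstrip(b" \t\r\n")

# A file as some editors save it: BOM, stray blank line, then the declaration.
raw = codecs.BOM_UTF8 + b'\n  <?xml version="1.0" encoding="UTF-8"?>\n<doc>hi</doc>'

cleaned = strip_bom_and_leading_space(raw)
root = ET.fromstring(cleaned)
print(root.tag, root.text)
```

Note that stripping bytes up front shifts every byte offset by the amount removed, so anything that records positions in the original file has to add that length back.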

Yeah, I disagree about entities (and therefore DTDs).  Let me try to 
explain why, briefly, and then I promise to stop whining about it. The 
problem w/DTDs (and entity decls defined in them) as I see it is they 
introduce a dependence on an external file.  I'd have to say that (aside 
from namespaces) the biggest single barrier to XML nirvana for me and 
my colleagues as we began this odyssey oh so many years ago was dealing 
with XML that declares a DTD that may or may not (really) be required, 
that isn't present, or is stored on someone else's server somewhere, or 
that has been supplied but doesn't get resolved by the parser in any 
obvious way (at least to a new user). Took me back to the days of 
chasing down C include files.  (There's also the antiquated non-XML 
syntax of DTDs, but that is only an esthetic issue for me: it's never 
actually caused me hours of hair-pulling). If entities were defined by 
the standard (and built in to parsers), or were required to be defined 
inline, that would remove my objections.  I think my desire would be to 
have a built-in set: this could be a large set, including the ISO ones, 
the HTML ones, whatever big grab bag we like, but not extensible. Do we 
really need a macro processing capability for anything else?  I know 
folks can use entities for including external files and fragments and so 
on, but in XML wouldn't something like XInclude be better?
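
As a rough illustration of what a fixed, built-in entity set could look like from the user's side, here is a Python sketch. It borrows the HTML 4 names from the standard library's `html.entities.name2codepoint` table as the "grab bag" (an assumption made for the example, not a proposal for the actual set) and rewrites them as numeric character references, so an ordinary parser can read the document with no DTD at all. `expand_builtin_entities` is a hypothetical name.

```python
import re
import xml.etree.ElementTree as ET
from html.entities import name2codepoint

# The five entities every XML parser already knows; leave these alone.
XML_PREDEFINED = {"lt", "gt", "amp", "apos", "quot"}
_ENTITY_REF = re.compile(r"&([A-Za-z][A-Za-z0-9]*);")

def expand_builtin_entities(text: str) -> str:
    """Hypothetical helper: replace names from a fixed table with
    numeric character references; unknown names are left untouched."""
    def replace(match):
        name = match.group(1)
        if name in XML_PREDEFINED or name not in name2codepoint:
            return match.group(0)
        return "&#%d;" % name2codepoint[name]
    return _ENTITY_REF.sub(replace, text)

doc = "<p>caf&eacute; &amp; cr&egrave;me&nbsp;br&ucirc;l&eacute;e</p>"
root = ET.fromstring(expand_builtin_entities(doc))
print(root.text)
```

The regular expression is deliberately naive: it would also rewrite entity-like text inside CDATA sections and comments, which a parser with the table built in would not do.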

On restriction to UTF-8 (UTF-16 if we insist, but really do folks store 
*files* as UTF-16?): is this really a problem for non-western 
languages?  My impression was that it encoded them fine.  I admit it's 
been many years since I did i18n for a living (back then it was all SJIS 
and EUC), but I would've thought CJK folks were much happier to have put 
that all behind them.
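
One quick way to check the encoding question is to round-trip some CJK text through UTF-8 and the older encodings and compare sizes. This is just a sketch using Python's standard codecs; the sample string is arbitrary.

```python
sample = "日本語のテキスト"  # arbitrary Japanese sample text, 8 characters

sizes = {}
for encoding in ("utf-8", "utf-16-le", "shift_jis", "euc_jp"):
    encoded = sample.encode(encoding)
    # Every encoding here must give back exactly the text we started with.
    assert encoded.decode(encoding) == sample
    sizes[encoding] = len(encoded)
    print(f"{encoding:10} {sizes[encoding]} bytes")
```

All four round-trip cleanly. UTF-8 spends three bytes on each of these characters where the others spend two, so the cost is size rather than fidelity.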

-Mike Sokolov

On 12/8/2010 11:27 PM, Amelia A Lewis wrote:
> Heylas!
>
> Well, I've read a bunch of interesting web pages and proposals.
>
> For me, anything that requires W3C to jump on board (in order to permit
> "<?xml version="!1.0" ?>") is ... *now* ... a non-starter.  I've
> participated in W3C working groups.  *Time*.
>
> Conversely, anything that is a "best practice" for XML 1.0 (all
> conforming documents can yield full information in current
> namespace-aware parsers) is also a non-starter.  *yawn*  I might
> encourage people to write documents that way, but I can't get excited.
>
> The "sweet spot," so far as I am concerned: a revision that can be
> supported in current processors via simple transformation, but that
> would parse, with information loss (or failure to load in
> namespace-required applications) in current processors and parsers.
>
> What fits?  Well, Michael Kay's namespace proposal, or a variant.  The
> critical bit: every element has a "fully qualified name" and
> potentially a contextually-defined abbreviated name.  Comment nesting
> sort-of qualifies (a clever transform could make the nested comments
> not match the 'comment' production; the problem is that without the
> transform a document with these comments would be ill-formed).  I've
> seen a number of "only UTF" comments, and I think that they're rather
> western-centric, so I'm thinking "no," there (if someone whose native
> language *isn't* west european proposes it, I might rethink).  Removing
> DTD?  Well, if it's tied to pre-defining a richer set of entities,
> perhaps--or provide a non-DTD entity definition mechanism (don't like
> entities?  So what?  I think they're valuable).  On the other hand, a
> less-horrible means of distributed authority for vocabularies would
> make namespaces a dead letter, and possibly revitalize DTDs (I can't
> help but think that RNG is a better solution, though).
>
> Remove mixed content?  No.  Provide "simple types"?  No.  No one can
> agree on them (I should publish DRVL, even though I haven't got the
> time to do the proof of concept implementation).  I knew folks on the
> original XML Schema Working Group; that spec is so difficult in part
> because it had to satisfy so many different interests.  Something like
> DRVL+FRVL might provide simple typing + extensibility, but then, it's
> possible that I'm simply overestimating a pet project.  Remove CDATA?
> Ambivalent.  It might help the parser writers, but who cares, at this
> point?  They already deal with it.  Add minimization (simplified
> end-tags)?  Moderately opposed; I understand from the grey-haired SGML
> types that this was a major, perhaps even primary source of bug reports
> and support requests.
>
> Keep namespaces and impose restrictions on where they can be defined?
> No.  First, it creates a distinction between documents and fragments
> that is going to produce tons of problems; second, the fundamental
> design of namespaces in xml is broken and acknowledging that is the
> first step to solving the multiple-vocabulary problem.  I suppose I
> won't be terribly interested in *any* nextml that doesn't take on the
> namespace morass effectively--mind you, *I* can use the damned things
> with a fair degree of facility.  I just can't get other people up to
> speed effectively, unless they're very strongly motivated.  That
> shouldn't be necessary, in my opinion.
>
> I offer this because we seem to be approaching the end of the "produce
> ideas" point, and are entering the "choose sides" phase.  Where I fall:
> if we can produce, in less than twelve months, a specification that
> allows an instance document to specify a small external transform that
> could then allow the document to be handled by current APIs, that can
> potentially provide enhanced results for new APIs, and that is easier
> to explain (apart from the opaque "put this PI right after the XML
> decl" bit) than current stuff, I'm on board.  If it's just best
> practices (no indicator required) ... meh.  If it's XML 1.0++ ... gods,
> spare me--prove something first.
>
> Amy!
Follow-Ups:
- Re: [xml-dev] nextml
  - From: Peter Flynn <peter@silmaril.ie>
- Re: [xml-dev] nextml
  - From: Uche Ogbuji <uche@ogbuji.net>
- Re: [xml-dev] nextml
  - From: Liam R E Quin <liam@w3.org>
References:
- nextml
  - From: Amelia A Lewis <amyzing@talsever.com>