Re: [xml-dev] nextml
- From: Michael Sokolov <firstname.lastname@example.org>
- To: Amelia A Lewis <email@example.com>
- Date: Thu, 09 Dec 2010 00:37:03 -0500
I'm with you most of the way; can I slip in a couple more additions (and
a few cavils)?
I think this tiny change would be new to this discussion (and easy to
implement and backwards compatible since it doesn't affect
well-formedness?): let's not mandate line-terminator normalization. It
hobbles one use case of interest to me: I want to be able to record the
position of elements as byte offsets in an original source file and use
those to extract well-formed fragments as text (think extracting
snippets and highlighting in search results). This can't be done
reliably in a SAX or StAX handler if the parser alters text in a
non-reversible manner: you can make a guess if you know what the
original line endings were, but if they're mixed all bets are off.
Currently one has to use HTML parsers for this.
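
To make the offset problem concrete, here is a minimal Python sketch. The helper name `normalize_line_ends` and the two sample documents are made up for illustration; the substitution follows the end-of-line handling described in XML 1.0 section 2.11 (CRLF and lone CR both become LF). Two different source files normalize to the same text, so offsets measured in the normalized text cannot be mapped back to the original bytes without extra knowledge.

```python
import re


def normalize_line_ends(text: str) -> str:
    """Mimic XML 1.0 section 2.11: CRLF and a lone CR both become LF."""
    return re.sub(r"\r\n|\r", "\n", text)


# Hypothetical sources: one uses CRLF throughout, the other mixes LF, CRLF and CR.
crlf_source = "<doc>\r\n<p>one</p>\r\n<p>two</p>\r\n</doc>"
mixed_source = "<doc>\n<p>one</p>\r\n<p>two</p>\r</doc>"

normalized = normalize_line_ends(crlf_source)

# After normalization the two sources are indistinguishable...
same_after_normalization = normalized == normalize_line_ends(mixed_source)

# ...yet the same element starts at a different offset in each original.
target = "<p>two</p>"
offsets = {
    "normalized": normalized.index(target),
    "crlf_source": crlf_source.index(target),
    "mixed_source": mixed_source.index(target),
}
print(same_after_normalization, offsets)
```

With uniform line endings one could correct the offset by counting the line breaks seen so far; with mixed endings that correction is a guess, which is the point made above.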
One more mini-addition: would it be possible to have parsers ignore the
BOM at the start of a UTF-8 file? Some editors seem to insist on
creating them, they are allowed by the UTF-8 spec, and probably ought to
be considered external to the actual file content. Also, maybe if we're
going to allow multiple root elements we could also allow whitespace in
the prolog? People often put it there, and it seems like something
that could be tolerated easily enough.
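
As a rough illustration of how cheap this tolerance would be, the sketch below strips a UTF-8 BOM and any leading whitespace before handing the bytes to a parser. It assumes that "whitespace in the prolog" means whitespace ahead of the XML declaration; the function name `strip_leading_noise` and the sample document are hypothetical, and this is a pre-processing workaround rather than a description of what any particular parser does.

```python
import codecs
import xml.etree.ElementTree as ET

# The four characters the XML grammar treats as whitespace.
XML_WHITESPACE = b" \t\r\n"


def strip_leading_noise(data: bytes) -> bytes:
    """Drop a UTF-8 BOM, then any whitespace, from the start of a document."""
    if data.startswith(codecs.BOM_UTF8):
        data = data[len(codecs.BOM_UTF8):]
    return data.lstrip(XML_WHITESPACE)


# Hypothetical input: BOM, stray whitespace, then an ordinary document.
raw = (
    codecs.BOM_UTF8
    + b"\r\n  "
    + b'<?xml version="1.0" encoding="UTF-8"?>\n<doc>text</doc>'
)

cleaned = strip_leading_noise(raw)
root = ET.fromstring(cleaned)
print(root.tag, root.text)
```

Note that stripping bytes up front shifts every later byte offset by the number of bytes removed, so a tool that records source positions would have to add that count back.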
Yeah, I disagree about entities (and therefore DTDs). Let me try to
explain why, briefly, and then I promise to stop whining about it. The
problem w/DTDs (and entity decls defined in them) as I see it is they
introduce a dependence on an external file. I'd have to say that (aside
from namespaces) the biggest single barrier to XML nirvana for me and
my colleagues as we began this odyssey oh so many years ago was dealing
with XML that declares a DTD that may or may not (really) be required,
that isn't present, or is stored on someone else's server somewhere, or
that has been supplied but doesn't get resolved by the parser in any
obvious way (at least to a new user). Took me back to the days of
chasing down C include files. (There's also the antiquated non-XML
syntax of DTDs, but that is only an esthetic issue for me: it's never
actually caused me hours of hair-pulling). If entities were defined by
the standard (and built in to parsers), or were required to be defined
inline, that would remove my objections. I think my desire would be to
have a built-in set: this could be a large set, including the ISO ones,
the HTML ones, whatever big grab bag we like, but not extensible. Do we
really need a macro processing capability for anything else? I know
folks can use entities for including external files and fragments and so
on, but in XML wouldn't something like XInclude be better?
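
A minimal sketch of what a fixed, non-extensible entity set could look like, using Python's standard `html.entities.name2codepoint` table (the HTML 4 names) as a stand-in for whatever grab bag would actually be chosen. The names `BUILTIN_ENTITIES` and `expand_builtin_entities` are invented for this example, and the function works on character data that has already been separated from markup; it is not a parser.

```python
import re
from html.entities import name2codepoint

# Stand-in for a standardized table: the HTML 4 names from the standard
# library, plus XML's own "apos", set explicitly in case the table lacks it.
BUILTIN_ENTITIES = {name: chr(cp) for name, cp in name2codepoint.items()}
BUILTIN_ENTITIES["apos"] = "'"

_ENTITY_REF = re.compile(r"&([A-Za-z][A-Za-z0-9]*);")


def expand_builtin_entities(text: str) -> str:
    """Expand named references from the fixed table; reject anything else."""

    def replace(match: re.Match) -> str:
        name = match.group(1)
        if name not in BUILTIN_ENTITIES:
            # No DTD lookup and no user-defined entities: unknown names are errors.
            raise ValueError(f"undefined entity: &{name};")
        return BUILTIN_ENTITIES[name]

    return _ENTITY_REF.sub(replace, text)


expanded = expand_builtin_entities("caf&eacute; &copy; 2010")
print(expanded)
```

Because the table ships with the processor, nothing has to be fetched or resolved at parse time, which is the dependency the paragraph above objects to.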
On restriction to UTF-8 (or UTF-16 if we insist, but do folks really
store *files* as UTF-16?): is this really a problem for non-western
languages? My impression was that it encoded them fine. I admit it's
been many years since I did i18n for a living (back then it was all SJIS
and EUC), but I would've thought CJK folks were much happier to have put
that all behind them.
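
For what it's worth, a quick Python check of the encoding question, using a sample Japanese string chosen arbitrarily for this sketch. It shows that the text round-trips through UTF-8 unchanged, and it also shows the usual cost: these characters take three bytes each in UTF-8 against two in UTF-16 and in the legacy Shift_JIS and EUC-JP encodings. Whether that size difference matters in practice is a separate question.

```python
# Arbitrary sample text (kanji, hiragana and katakana), eight characters long.
sample = "日本語のテキスト"

sizes = {}
for encoding in ("utf-8", "utf-16-le", "shift_jis", "euc_jp"):
    encoded = sample.encode(encoding)
    # Every encoding listed here must reproduce the original text exactly.
    assert encoded.decode(encoding) == sample
    sizes[encoding] = len(encoded)

for encoding, size in sizes.items():
    print(f"{encoding:10} {size} bytes")
```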
On 12/8/2010 11:27 PM, Amelia A Lewis wrote:
> Well, I've read a bunch of interesting web pages and proposals.
> For me, anything that requires W3C to jump on board (in order to permit
> "<?xml version="!1.0" ?>") is ... *now* ... a non-starter. I've
> participated in W3C working groups. *Time*.
> Conversely, anything that is a "best practice" for XML 1.0 (all
> conforming documents can yield full information in current
> namespace-aware parsers) is also a non-starter. *yawn* I might
> encourage people to write documents that way, but I can't get excited.
> The "sweet spot," so far as I am concerned: a revision that can be
> supported in current processors via simple transformation, but that
> would parse, with information loss (or failure to load in
> namespace-required applications) in current processors and parsers.
> What fits? Well, Michael Kay's namespace proposal, or a variant. The
> critical bit: every element has a "fully qualified name" and
> potentially a contextually-defined abbreviated name. Comment nesting
> sort-of qualifies (a clever transform could make the nested comments
> not match the 'comment' production; the problem is that without the
> transform a document with these comments would be ill-formed). I've
> seen a number of "only UTF" comments, and I think that they're rather
> western-centric, so I'm thinking "no," there (if someone whose native
> language *isn't* west european proposes it, I might rethink). Removing
> DTD? Well, if it's tied to pre-defining a richer set of entities,
> perhaps--or provide a non-DTD entity definition mechanism (don't like
> entities? So what? I think they're valuable). On the other hand, a
> less-horrible means of distributed authority for vocabularies would
> make namespaces a dead letter, and possibly revitalize DTDs (I can't
> help but think that RNG is a better solution, though).
> Remove mixed content? No. Provide "simple types"? No. No one can
> agree on them (I should publish DRVL, even though I haven't got the
> time to do the proof of concept implementation). I knew folks on the
> original XML Schema Working Group; that spec is so difficult in part
> because it had to satisfy so many different interests. Something like
> DRVL+FRVL might provide simple typing + extensibility, but then, it's
> possible that I'm simply overestimating a pet project. Remove CDATA?
> Ambivalent. It might help the parser writers, but who cares, at this
> point? They already deal with it. Add minimization (simplified
> end-tags)? Moderately opposed; I understand from the grey-haired SGML
> types that this was a major, perhaps even primary source of bug reports
> and support requests.
> Keep namespaces and impose restrictions on where they can be defined?
> No. First, it creates a distinction between documents and fragments
> that is going to produce tons of problems; second, the fundamental
> design of namespaces in xml is broken and acknowledging that is the
> first step to solving the multiple-vocabulary problem. I suppose I
> won't be terribly interested in *any* nextml that doesn't take on the
> namespace morass effectively--mind you, *I* can use the damned things
> with a fair degree of facility. I just can't get other people up to
> speed effectively, unless they're very strongly motivated. That
> shouldn't be necessary, in my opinion.
> I offer this because we seem to be approaching the end of the "produce
> ideas" point, and are entering the "choose sides" phase. Where I fall:
> if we can produce, in less than twelve months, a specification that
> allows an instance document to specify a small external transform that
> could then allow the document to be handled by current APIs, that can
> potentially provide enhanced results for new APIs, and that is easier
> to explain (apart from the opaque "put this PI right after the XML
> decl" bit) than current stuff, I'm on board. If it's just best
> practices (no indicator required) ... meh. If it's XML 1.0++ ... gods,
> spare me--prove something first.