Thanks for taking the time to think about it.
Amelia A Lewis wrote:
> On Thu, Mar 20, 2003 at 11:01:26AM -0800, Paul Prescod wrote:
> I don't quite understand how this is to work.
> The algorithm describing how one can understand the xml declaration *before*
> the encoding is known (decoding both the character encoding and the fact
> that this is XML at the same time) depends upon the magic-ness (as in
> /etc/magic) of the string "<?xml", which must appear at position 0, unless it
> is preceded by one of 0xFEFF 0xFFFE.
There is still a "known prefix". It is "<?". My back-of-the-envelope
thinking says that this is enough. My logic goes like this. Start here:
Most of the encodings discover the "base" encoding (ASCII-based,
EBCDIC-based, two-byte, four-byte, big or little endian) before they get
to the "xm" part of "<?xml". The ones that go the whole distance need all
four bytes only because they are four-byte encodings. As soon as you see
3C 3F ("<?"), you know that you're working with something ASCII-based.
That said, I don't
claim to know anything about EBCDIC or any really "out-there" encodings.
But if I can handle all of the UCS's, UTF's and ASCII-pluses I think
I've hit the 95/5 point easily.
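To make the back-of-the-envelope concrete, here is a rough sketch of that family detection, keyed on how "<?" comes out in each base encoding. The function name, return labels, and the exact table are mine, not anything normative:

```python
def sniff_family(first4: bytes) -> str:
    """Classify the base encoding family from the first four bytes of
    a stream assumed to begin with "<?" in some encoding."""
    if first4 == b"\x00\x00\x00\x3c":
        return "UCS-4, big-endian"
    if first4 == b"\x3c\x00\x00\x00":
        return "UCS-4, little-endian"
    if first4[:2] == b"\x00\x3c":
        return "UTF-16, big-endian (no BOM)"
    if first4[:2] == b"\x3c\x00":
        return "UTF-16, little-endian (no BOM)"
    if first4[:2] == b"\x3c\x3f":   # "<?" in any ASCII-based encoding
        return "ASCII-based 8-bit"
    if first4[:2] == b"\x4c\x6f":   # "<?" in EBCDIC
        return "EBCDIC-based"
    return "unknown"
```

Note the four-byte equality checks have to run before the two-byte prefix checks, since 3C 00 00 00 would otherwise be misread as little-endian UTF-16.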
> The ability to figure out the encoding is dependent upon the restriction of
> the identifier to a known set. XML parsers are, then, simply *verifying*
> that this is XML as they discover the encoding, not solving for two
> variables at once.
Similarly, XDH processors are verifying that they are dealing with XDH
data. XDH's first four bytes are _almost_ as regular as XML's. And, I
believe, regular enough.
> It seems to me that if you don't have a magic sequence, you have a much more
> difficult problem; you can't figure out whether this: <?kzy irefvba="1.0"
> rapbqvat="ebg13" ?> is XML or the "kzy" media type unless you already know
> the encoding; you can't learn the encoding unless you know what the media
> type is (so you can figure out that it's been rotated, in this case).
I don't think my solution will be able to handle truly bizarre encodings
(like rotated text) but I don't think XML does either. I could define a
Unicode encoding that makes "<?xml" look like EBCDIC or ASCII and yet is
not EBCDIC or ASCII.
<?xml version="1.0" encoding="funkazoid"?>
<Q-- In funkazoid, "Q" and "!" are swapped. -->
The underlying question, which is worth struggling with, is whether to
restrict the set of encodings to ones I know I can deal with or just let
the market handle weird ones. XML's gotten by surprisingly well with
being liberal. There is really not a big constituency out there for
exotic encodings anyway.
> You might be able to make something out of the existence of the "/" in the
> media type, but I have some doubts, because the length of the type
> designation is variable. You might be able to specify that the media types
> have to be ASCII, but that's awkward for the EBCDIC crowd, and quite
> possibly for others as well (quite a few encodings *do* use ASCII as the
> bottom 7 bits, after all, so perhaps it would be okay, as long as we don't
> mind marginalizing (further) the ones that *don't*).
I believe that the first two bytes will reliably be "4C 6F" in EBCDIC.
And we can use the BOM for the various UTF's and UCS's...and even handle
documents without the BOM as XML does.
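A sketch of what I have in mind for the BOM check, before falling back to the prefix sniffing. The table and names here are my own guess at it, not a spec:

```python
# BOM signatures, longest first so UTF-32LE (FF FE 00 00) is not
# misread as UTF-16LE (FF FE).
BOMS = [
    (b"\x00\x00\xfe\xff", "UTF-32BE"),
    (b"\xff\xfe\x00\x00", "UTF-32LE"),
    (b"\xfe\xff",         "UTF-16BE"),
    (b"\xff\xfe",         "UTF-16LE"),
    (b"\xef\xbb\xbf",     "UTF-8"),
]

def sniff_bom(data: bytes):
    """Return (encoding name, bytes to skip), or (None, 0) if there is
    no BOM and we must fall back to sniffing the "<?" prefix."""
    for bom, name in BOMS:
        if data.startswith(bom):
            return name, len(bom)
    return None, 0
```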
As an aside, given that the mapping from bytes to characters is
completely undefined, and not even required to be a proper N-bytes->char
mapping, a cynic could make a case that a GIF is XML in a very
compressed encoding (without even resorting to saying "well, really, it's
an infoset"). As long as there could be a program that translates the
bits into Unicode characters, it's "an encoding." But that really isn't
very interesting in practice. AFAIK, an encoding is just a function, and
some functions are very complicated.