Thanks for taking the time to think about it.
Amelia A Lewis wrote:
> On Thu, Mar 20, 2003 at 11:01:26AM -0800, Paul Prescod wrote:
> I don't quite understand how this is to work.
> The algorithm describing how one can understand the xml declaration *before*
> the encoding is known (decoding both the character encoding and the fact
> that this is XML at the same time) depends upon the magic-ness (as in
> /etc/magic) of the string "<?xml", which must appear at position 0, unless it
> is preceded by one of 0xFEFF 0xFFFE.
There is still a "known prefix". It is "<?". My back-of-the-envelope
thinking says that this is enough. My logic goes like this. Start here:
Most of the encodings discover the "base" encoding (ASCII-based,
EBCDIC-based, two-byte, four-byte, big or little endian) before they get
to the "xm" part of "<?xml". The ones that go the whole distance need all
four bytes only because they are four-byte encodings. As soon as you see
3C 3F ("<?"), you know that you're working with something ASCII-based.
That said, I don't
claim to know anything about EBCDIC or any really "out-there" encodings.
But if I can handle all of the UCS's, UTF's and ASCII-pluses I think
I've hit the 95/5 point easily.
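To make the back-of-the-envelope concrete, here is a rough sketch of that family detection, keyed on how "<?" comes out in each base encoding. The function name, return labels, and the exact table are mine, not anything normative:

```python
def sniff_family(first4: bytes) -> str:
    """Classify the base encoding family from the first four bytes of
    a stream assumed to begin with "<?" in some encoding."""
    if first4 == b"\x00\x00\x00\x3c":
        return "UCS-4, big-endian"
    if first4 == b"\x3c\x00\x00\x00":
        return "UCS-4, little-endian"
    if first4[:2] == b"\x00\x3c":
        return "UTF-16, big-endian (no BOM)"
    if first4[:2] == b"\x3c\x00":
        return "UTF-16, little-endian (no BOM)"
    if first4[:2] == b"\x3c\x3f":   # "<?" in any ASCII-based encoding
        return "ASCII-based 8-bit"
    if first4[:2] == b"\x4c\x6f":   # "<?" in EBCDIC
        return "EBCDIC-based"
    return "unknown"
```

Note the four-byte equality checks have to run before the two-byte prefix checks, since 3C 00 00 00 would otherwise be misread as little-endian UTF-16.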
> The ability to figure out the encoding is dependent upon the restriction of
> the identifier to a known set. XML parsers are, then, simply *verifying*
> that this is XML as they discover the encoding, not solving for two
> variables at once.
Similarly, XDH processors are verifying that they are dealing with XDH
data. XDH's first four bytes are _almost_ as regular as XML's. And, I
believe, regular enough.
> It seems to me that if you don't have a magic sequence, you have a much more
> difficult problem; you can't figure out whether this: <?kzy irefvba="1.0"
> rapbqvat="ebg13" ?> is XML or the "kzy" media type unless you already know
> the encoding; you can't learn the encoding unless you know what the media
> type is (so you can figure out that it's been rotated, in this case).
I don't think my solution will be able to handle truly bizarre encodings
(like rotated text) but I don't think XML does either. I could define a
Unicode encoding that makes "<?xml" look like EBCDIC or ASCII and yet is
not EBCDIC or ASCII.
<?xml version="1.0" encoding="funkazoid"?>
<Q-- In funkazoid, "Q" and "!" are swapped. -->
The underlying question, which is worth struggling with, is whether to
restrict the set of encodings to ones I know I can deal with or just let
the market handle weird ones. XML's gotten by surprisingly well with
being liberal. There is really not a big constituency out there for
exotic encodings anyway.
> You might be able to make something out of the existence of the "/" in the
> media type, but I have some doubts, because the length of the type
> designation is variable. You might be able to specify that the media types
> have to be ASCII, but that's awkward for the EBCDIC crowd, and quite
> possibly for others as well (quite a few encodings *do* use ASCII as the
> bottom 7 bits, after all, so perhaps it would be okay, as long as we don't
> mind marginalizing (further) the ones that *don't*).
I believe that the first two bytes will reliably be "4C 6F" in EBCDIC.
And we can use the BOM for the various UTF's and UCS's...and even handle
documents without the BOM as XML does.
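A sketch of what I have in mind for the BOM check, before falling back to the prefix sniffing. The table and names here are my own guess at it, not a spec:

```python
# BOM signatures, longest first so UTF-32LE (FF FE 00 00) is not
# misread as UTF-16LE (FF FE).
BOMS = [
    (b"\x00\x00\xfe\xff", "UTF-32BE"),
    (b"\xff\xfe\x00\x00", "UTF-32LE"),
    (b"\xfe\xff",         "UTF-16BE"),
    (b"\xff\xfe",         "UTF-16LE"),
    (b"\xef\xbb\xbf",     "UTF-8"),
]

def sniff_bom(data: bytes):
    """Return (encoding name, bytes to skip), or (None, 0) if there is
    no BOM and we must fall back to sniffing the "<?" prefix."""
    for bom, name in BOMS:
        if data.startswith(bom):
            return name, len(bom)
    return None, 0
```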
As an aside, given that the mapping from bytes to characters is
completely undefined, and not even required to be a proper N-bytes->char
mapping, a cynic could make a case that a GIF is XML in a very
compressed encoding (without even resorting to saying "well, really, it's
an infoset"). As long as there could be a program that translates the
bits into Unicode characters, it's "an encoding." But that really isn't
very interesting in practice. AFAIK, an encoding is just a function, and
some functions are very complicated.