John Cowan wrote:
> Alaric B Snell scripsit:
>
>
>>The ASN.1 'equivalent' of a normal XML parser would just need to support
>>BER, which is the current conventional "minimal" encoding. An ASN.1
>>toolkit that supported "BER, PER, CER, DER, XER, and probably LWER, OER,
>>and SER" would be more closely related to an XML parser that supported
>>US-ASCII, UTF-8, UTF-7, UTF-16, EBCDIC, ISO-8859-[1..15], Shift-JIS,
>>Baudot, etc...
>
> Hardly. Except for some feedback from the encoding declaration, which
> can be handled by a sniffer, charset decoding is a completely separate
> layer from parsing in XML. The differences between BER, PER, and XER
> parsing are so profound as to cause the three parsers to have essentially
> nothing in common.
Oh yes, there is certainly a bigger code difference between the
different ASN.1 encoding rulesets than between character encodings.
My point, though, was that the original poster claimed that the ASN.1
notion of multiple encodings is worse than the XML world of a single
encoding because it meant that the recipient might not have the right
decoder, requiring interactive negotiation mechanisms, and leaving you
in trouble if it's a situation where you can't interactively negotiate.
So I pointed out that the XML world is just as bad since you may not
have the required decoder. Some XML written in EBCDIC will look like
gibberish when viewed as ASCII :-)
For example, UTF-7:
+ADw?xml version+AD0AIg-1.0+ACI charset+AD0AIg-UTF-7+ACI?+AD4
+ADw-document+AD4
+ADw-title+AD4-Hello World+ADw-/title+AD4
+ADw-/document+AD4
If it weren't for the "?xml" which has survived in line 1, you could be
forgiven for mistaking it for something like RTF :-) And here's another,
which I dare only show as a hex dump because it contains 'binary' characters:
0000000 6f4c 94a7 4093 85a5 a299 9689 7e95 f17f
0000010 f04b 407f 8883 9981 85a2 7ea3 c97f d4c2
0000020 f0f1 f7f4 6f7f 256e 4c25 9684 a483 8594
0000030 a395 256e 4040 a34c a389 8593 c86e 9385
0000040 9693 e640 9996 8493 614c 89a3 93a3 6e85
0000050 4c25 8461 8396 94a4 9585 6ea3 0025
That's a charset called "IBM1047", an EBCDIC variant.
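For what it's worth, here's a quick sketch (Python, purely illustrative; it
assumes the UTF-7 sample is copied byte-for-byte as shown above) of turning
it back into readable XML with the standard codecs machinery - the EBCDIC
dump decodes the same way with the "cp1047" codec:

    utf7_sample = (
        b"+ADw?xml version+AD0AIg-1.0+ACI charset+AD0AIg-UTF-7+ACI?+AD4\n"
        b"+ADw-document+AD4\n"
        b"+ADw-title+AD4-Hello World+ADw-/title+AD4\n"
        b"+ADw-/document+AD4\n"
    )
    # Decoding recovers the familiar angle brackets:
    print(utf7_sample.decode("utf-7"))
    # <?xml version="1.0" charset="UTF-7"?>
    # <document>
    # <title>Hello World</title>
    # </document>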
Both of these are the IANA registered names of the charsets. As I read
the XML 1.0 spec, they are valid XML 1.0 documents (I've even declared
the charset name in the XML declaration), but according to:
http://www.w3.org/TR/REC-xml#charencoding
...a parser isn't required to be able to read them: "processors are, of
course, not required to support all IANA-registered encodings", and "It is
a fatal error when an XML processor encounters an entity with an encoding
that it is unable to process."
But the XML world hasn't exactly come tumbling down because of this, has
it? It's not as big a problem as you might think. Anybody sending XML
knows that if they are worried about it being understood in unguessable
circumstances they'd better stick with UTF-8, since XML parsers are
required to support it, and it will make at least partial sense wherever
US-ASCII is spoken, too.
Likewise, people in the ASN.1 world who want things to be generally
readable will have used BER in the past, and now they can even use XML
too! Progress, eh?
But yes - I know it's probably more effort to create an ASN.1 decoder
that supports every encoding ever developed than to create an XML
decoder that supports every encoding ever developed (although the IANA
list of encodings is *pretty* long...), but my point is that this isn't
really relevant; nobody ever BOTHERS to write a decoder that supports
every possible encoding. You support the commonly agreed baseline
encoding(s), and then support others if your closed-system niche
application requires it. If you're doing anything outside of a fixed
niche, then you try to stick to the baseline, for maximum interoperability.
> When using XER, is one constrained to a specific encoding?
If I remember correctly, when I was involved with discussions about
this, we were going with what the XML 1.0 spec says, in order to be
compatible with it; since an XER decoder is a compliant XML parser, it
has to support at least UTF-8 and UTF-16. IIRC, we may have been more
restrictive about output and mandated UTF-8, although I can think of
arguments against that (ideographic languages generally take more than
two bytes per character in UTF-8, so UTF-16 is more efficient there), so
I doubt that was approved.
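Just to put numbers on that size argument (a trivial illustration, nothing
from the XER spec itself):

    text = "日本語" * 100                  # some arbitrary ideographic payload
    print(len(text.encode("utf-8")))      # 900 bytes - three per character
    print(len(text.encode("utf-16-le")))  # 600 bytes - two per character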
> Also, I'm curious about which encoding-rules transformations one can
> perform without knowledge of the schema:
>
> BER to PER?
> XER to PER?
> BER to XER?
> XER to BER?
In general, none of them - PER carries none of the information that can be
found in the schema, since in the world it works in - where both ends know
the schema - sending information that's already available to both ends is
pointless. BER and XER both have the actual field boundaries in them, so
both can be converted into tree structures; but in XER you have no way of
knowing how to interpret the textual content, and in BER you are told how
to interpret the values (the type information is there) but you don't know
what their names are :-)
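To make that concrete, here's a hand-rolled sketch (Python; the Person
schema and the values are made up purely for illustration) of what each
encoding does and doesn't carry:

    # Hypothetical schema:  Person ::= SEQUENCE { name         UTF8String,
    #                                             emailAddress UTF8String }
    name  = b"Alaric"
    email = b"alaric@alaric-snell.com"

    def tlv(tag, value):
        # BER tag-length-value, short definite lengths (enough at this size)
        return bytes([tag, len(value)]) + value

    ber = tlv(0x30, tlv(0x0C, name) + tlv(0x0C, email))  # SEQUENCE of two UTF8Strings
    print(ber.hex(" "))
    # 30 21 0c 06 41 6c 61 72 69 63 0c 17 61 6c 61 72 69 63 40 ...
    # The universal tags (0x0c = UTF8String) and the boundaries are all there,
    # but the names "name" and "emailAddress" appear nowhere in the bytes.

    xer = ("<Person><name>%s</name><emailAddress>%s</emailAddress></Person>"
           % (name.decode(), email.decode()))
    print(xer)
    # The names are all there, but nothing says the text content is a
    # UTF8String rather than, say, a NumericString or a GeneralizedTime.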
However, there are a few caveats. For a start, there is an ASN.1 type
for the Infoset being produced, so arbitrary XML that can be parsed into
an Infoset could then be encoded in PER or BER. But this isn't actually
converting the abstract value itself into PER or BER - the result is not
"Here is a person with name Alaric and email address
alaric@alaric-snell.com", it's "Here is an element called Alaric with
two children, an element called Name with content Alaric, and an element
called EmailAddress with content alaric@alaric-snell.com".
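Put another way (a rough sketch with made-up structures, not any real
toolkit's API), the round trip through the generic Infoset type keeps only
the element tree, not the typed abstract value:

    # What the Infoset-as-ASN.1 route preserves: element names and children.
    infoset_view = ("Person", [
        ("Name", "Alaric"),
        ("EmailAddress", "alaric@alaric-snell.com"),
    ])

    # What encoding against the real schema starts from: the abstract value,
    # whose fields have known ASN.1 types rather than just text content.
    typed_value = {"name": "Alaric",
                   "emailAddress": "alaric@alaric-snell.com"}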
ABS