OASIS Mailing List Archives



   Re: [xml-dev] ASN.1 is an XML Schema Language (Fix those lists!)andBinar


John Cowan wrote:
> Alaric B Snell scripsit:
>>The ASN.1 'equivalent' of a normal XML parser would just need to support 
>>BER, which is the current conventional "minimal" encoding. An ASN.1 
>>toolkit that supported "BER, PER, CER, DER, XER, and probably LWER, OER, 
>>and SER" would be more closely related to an XML parser that supported 
>>US-ASCII, UTF-8, UTF-7, UTF-16, EBCDIC, ISO-8859-[1..15], Shift-JIS, 
>>Baudot, etc...
> Hardly.  Except for some feedback from the encoding declaration, which
> can be handled by a sniffer, charset decoding is a completely separate
> layer from parsing in XML.  The differences between BER, PER, and XER
> parsing are so profound as to cause the three parsers to have essentially
> nothing in common.

Oh yes, there is certainly a bigger code difference between the 
different ASN.1 encoding rulesets than between character encodings.

My point, though, was that the original poster claimed that the ASN.1 
notion of multiple encodings is worse than the XML world of a single 
encoding because it meant that the recipient might not have the right 
decoder, requiring interactive negotiation mechanisms, and leaving you 
in trouble if it's a situation where you can't interactively negotiate.

So I pointed out that the XML world is just as bad since you may not 
have the required decoder. Some XML written in EBCDIC will look like 
gibberish when viewed as ASCII :-)

For example, UTF-7:

+ADw?xml version+AD0AIg-1.0+ACI charset+AD0AIg-UTF-7+ACI?+AD4

   +ADw-title+AD4-Hello World+ADw-/title+AD4

If it wasn't for the "?xml" which has survived in line 1, you could be 
forgiven for mistaking it for something like RTF :-) And another I dare 
only represent as hex because it contains 'binary' characters:

0000000 6f4c 94a7 4093 85a5 a299 9689 7e95 f17f
0000010 f04b 407f 8883 9981 85a2 7ea3 c97f d4c2
0000020 f0f1 f7f4 6f7f 256e 4c25 9684 a483 8594
0000030 a395 256e 4040 a34c a389 8593 c86e 9385
0000040 9693 e640 9996 8493 614c 89a3 93a3 6e85
0000050 4c25 8461 8396 94a4 9585 6ea3 0025

That's a charset called "IBM1047", an EBCDIC variant.
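
For anyone who wants to check the two samples, here is a rough Python sketch that decodes them. Two assumptions to flag: the hex dump is read as od-style 16-bit little-endian words, and Python's "cp037" codec is used as a stand-in for IBM1047 (the two differ in a handful of code points, none of which appear in this sample). The helper name words_to_bytes is made up for illustration.

```python
import struct

# The two UTF-7 lines exactly as they appear in the message.
utf7_lines = [
    b"+ADw?xml version+AD0AIg-1.0+ACI charset+AD0AIg-UTF-7+ACI?+AD4",
    b"   +ADw-title+AD4-Hello World+ADw-/title+AD4",
]
utf7_text = [line.decode("utf-7") for line in utf7_lines]

# The hex dump as posted: an offset column followed by 16-bit words.
dump = """\
0000000 6f4c 94a7 4093 85a5 a299 9689 7e95 f17f
0000010 f04b 407f 8883 9981 85a2 7ea3 c97f d4c2
0000020 f0f1 f7f4 6f7f 256e 4c25 9684 a483 8594
0000030 a395 256e 4040 a34c a389 8593 c86e 9385
0000040 9693 e640 9996 8493 614c 89a3 93a3 6e85
0000050 4c25 8461 8396 94a4 9585 6ea3 0025
"""


def words_to_bytes(dump_text):
    """Rebuild the byte stream from 16-bit words (assumed little-endian)."""
    out = bytearray()
    for line in dump_text.splitlines():
        for word in line.split()[1:]:  # skip the offset column
            out += struct.pack("<H", int(word, 16))
    return bytes(out)


raw = words_to_bytes(dump)
# Drop the trailing pad byte, then decode. cp037 stands in for IBM1047 here.
ebcdic_text = raw.rstrip(b"\x00").decode("cp037")

print(utf7_text)
print(ebcdic_text)
```

Under those assumptions, both samples come out as an XML declaration followed by a title element containing "Hello World", which is about all a recipient without the right decoder would be missing.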

Both of these are the IANA registered names of the charsets. As I read 
the XML 1.0 spec, they are valid XML 1.0 documents (I've even declared 
the charset name in the XML declaration), but according to:


...a parser isn't required to be able to read them: "processors are, of 
course, not required to support all IANA-registered encodings" and "It is a 
fatal error when an XML processor encounters an entity with an encoding 
that it is unable to process."

But the XML world hasn't exactly come tumbling down because of this, has 
it? It's not as big a problem as you might think. Anybody sending XML 
knows that if they are worried about it being understood in unguessable 
circumstances they'd better stick with UTF-8, since XML parsers are 
required to support it, and it will make at least partial sense wherever 
US-ASCII is spoken, too.

Likewise, people in the ASN.1 world who want things to be generally 
readable will have used BER in the past, and now they can even use XML 
too! Progress, eh?

Yes, I know it's probably more effort to create an ASN.1 decoder 
that supports every encoding ever developed than to create an XML 
decoder that supports every encoding ever developed (although the IANA 
list of encodings is *pretty* long...), but my point is that this isn't 
really relevant; nobody ever BOTHERS to write a decoder that supports 
every possible encoding. You support the commonly agreed baseline 
encoding(s), and then support others if your closed-system niche 
application requires it. If you're doing anything outside of a fixed 
niche, then you try to stick to the baseline, for maximum interoperability.

> When using XER, is one constrained to a specific encoding?

If I remember correctly, when I was involved with discussions about 
this, we were going with what the XML 1.0 spec says, in order to be 
compatible with it; since an XER decoder is a compliant XML parser, it 
has to support at least UTF-8 and UTF-16. IIRC, we may have been more 
restrictive about output and mandated UTF-8, although I can think of 
arguments against that (ideographic languages generally take more than 
two bytes per character in UTF-8, so UTF-16 is more efficient there), so 
I doubt that was approved.
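
To put rough numbers on the UTF-8 versus UTF-16 size argument, here is a small Python sketch. The sample strings and the helper name encoded_sizes are my own picks for illustration, and UTF-16 is measured without a byte order mark.

```python
# Illustrative sample strings; any text in these scripts would do.
samples = {
    "English": "Hello World",
    "Japanese": "こんにちは世界",
    "Chinese": "你好世界",
}


def encoded_sizes(text):
    """Return (UTF-8 byte count, UTF-16 byte count without a BOM)."""
    return len(text.encode("utf-8")), len(text.encode("utf-16-le"))


for name, text in samples.items():
    utf8_len, utf16_len = encoded_sizes(text)
    print(f"{name}: {len(text)} chars, UTF-8 {utf8_len} bytes, UTF-16 {utf16_len} bytes")
```

The CJK samples take three bytes per character in UTF-8 against two in UTF-16, while the ASCII sample goes the other way (one byte against two), so which encoding wins depends on the mix of markup and text.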

> Also, I'm curious about which encoding-rules transformations one can
> perform without knowledge of the schema:
> BER to PER?
> XER to PER?
> BER to XER?
> XER to BER?

In general, none of them - PER contains no information that can be found 
in the schema, since in the world it works in - where both ends know the 
schema - sending information that's available already to both ends is 
pointless. BER and XER both have the actual field boundaries in them so 
both of them can be converted into tree structures, but in the XER, you 
have no way of knowing how to interpret the textual content, and in BER, 
you are told how to interpret them (the type information is there) but 
you don't know what their names are :-)
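
As a rough illustration of the BER half of that, here is a minimal tag-length-value walker in Python. It is only a sketch: the function name parse_ber is made up, it handles single-byte tags and definite lengths only, and the sample value is a hand-built SEQUENCE of two UTF8Strings rather than the output of any real toolkit.

```python
def parse_ber(data, offset=0, end=None):
    """Walk BER TLVs into (class, constructed, tag_number, value) tuples.

    Sketch only: low tag numbers (< 31) and definite lengths.
    """
    if end is None:
        end = len(data)
    nodes = []
    while offset < end:
        ident = data[offset]
        tag_class = ident >> 6            # 0 = universal
        constructed = bool(ident & 0x20)  # True if the value holds nested TLVs
        tag_number = ident & 0x1F
        if tag_number == 0x1F:
            raise ValueError("high tag numbers not handled in this sketch")
        offset += 1
        length = data[offset]
        offset += 1
        if length == 0x80:
            raise ValueError("indefinite length not handled in this sketch")
        if length & 0x80:                 # long form: next n bytes hold the length
            n = length & 0x7F
            length = int.from_bytes(data[offset:offset + n], "big")
            offset += n
        value_end = offset + length
        if constructed:
            value = parse_ber(data, offset, value_end)
        else:
            value = bytes(data[offset:value_end])
        nodes.append((tag_class, constructed, tag_number, value))
        offset = value_end
    return nodes


# Hand-built sample: SEQUENCE { UTF8String, UTF8String }
name = b"Alaric"
email = b"alaric@alaric-snell.com"
body = bytes([0x0C, len(name)]) + name + bytes([0x0C, len(email)]) + email
encoded = bytes([0x30, len(body)]) + body

tree = parse_ber(encoded)
print(tree)
```

The walker recovers a universal SEQUENCE (tag 16) holding two UTF8Strings (tag 12) and their contents, but nothing in the bytes says which one is the name and which is the email address. That part lives only in the schema.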

However, there are a few caveats. For a start, there is an ASN.1 type 
for the Infoset being produced, so arbitrary XML that can be parsed into 
an Infoset could then be encoded in PER or BER. But this isn't actually 
converting the abstract value itself into PER or BER - the result is not 
"Here is a person with name Alaric and email address 
alaric@alaric-snell.com", it's "Here is an element called Alaric with 
two children, an element called Name with content Alaric, and an element 
called EmailAddress with content alaric@alaric-snell.com".




Copyright 2001 XML.org. This site is hosted by OASIS