Elliotte Rusty Harold wrote:
> At 3:49 PM +1100 11/11/03, Rick Jelliffe wrote:
>
>> So the argument against literal C0 characters is that inband control
>> characters
>> are transmission artifacts that have no place literally in data. Use
>> references
>> to get the character but escape the control semantic.
>
> There's also the argument that C0 controls may accidentally control
> something. There are still a few old printers here and there that will
> break a page on a form feed. There might even be some gateways that
> use the C0's for other purposes.
I know of people still using serial terminals (perhaps with Xon/Xoff).
In Taiwan, one place that had computerized early still had terminals,
because their mainframes used character sets that terminal emulation
programs did not accept. They must wait until their mainframe
applications become obsolete before they can get rid of them. (But they
are not sending XML anyway.) Modems still sometimes use Xon/Xoff, but
because people run PPP etc. rather than sending files directly, control
characters in data are no longer a problem for serial comms, AFAIK.
If XML just uses application/ then I think there is no RFC problem
with literal controls.
Mislabelled UTF-16 encodings will always be detected in XML, because of
the presence of the 0x00 bytes and/or the BOM.*
In UTF-8, all the bytes for characters > U+007F are bytes > 0x7F, so
a mislabelled encoding will again be detected.
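
As a rough illustration of the two checks above, here is a minimal
Python sketch. The function names are hypothetical and this is only a
heuristic, not how any particular XML parser does its detection.

```python
import codecs


def looks_like_utf16(data: bytes) -> bool:
    """Heuristic: a UTF-16 BOM or any 0x00 byte suggests UTF-16."""
    if data.startswith((codecs.BOM_UTF16_BE, codecs.BOM_UTF16_LE)):
        return True
    # ASCII/Latin-1 characters (markup, spaces) encode with a 0x00 byte.
    return b"\x00" in data


def non_ascii_bytes_are_high(text: str) -> bool:
    """Check that every character above U+007F encodes in UTF-8
    to bytes that are all above 0x7F."""
    return all(
        b > 0x7F
        for ch in text
        if ord(ch) > 0x7F
        for b in ch.encode("utf-8")
    )
```

For example, looks_like_utf16("<a/>".encode("utf-16-le")) is true
because each ASCII character contributes a 0x00 byte, while the same
markup encoded as UTF-8 contains none.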
Cheers
Rick Jelliffe
* The only exception I can think of is if we have
- an external parsed entity in UTF-16
- with no encoding header, so that it defaults to UTF-8 or HTML's 8859-1
- which has only data and no markup
- and no ASCII/Latin-1 data, including spaces (these would cause a 0x00
byte),
- and whose UTF-16 bytes also correspond to valid UTF-8 or 8859-1
patterns.
Except for monkeys typing XML, data usually has meaning, so most potential
strings could never in fact appear, which may lessen this edge case anyway.
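
To make the edge case concrete, here is a small Python sketch under the
assumptions listed above (no BOM, no encoding header, data only). The
function name is hypothetical. It tests whether the big-endian UTF-16
bytes of a string contain no 0x00 byte and also happen to decode as
UTF-8.

```python
def ambiguous_without_label(text: str) -> bool:
    """True if the BOM-less UTF-16BE bytes of `text` contain no 0x00
    byte and are also a valid UTF-8 byte sequence."""
    raw = text.encode("utf-16-be")  # the "-be" codec writes no BOM
    if b"\x00" in raw:
        return False  # a zero byte would give UTF-16 away
    try:
        raw.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


# U+C3A9 (a Hangul syllable) is the byte pair C3 A9 in UTF-16BE,
# which is also the UTF-8 encoding of U+00E9.
sample = "\uC3A9"
misread = sample.encode("utf-16-be").decode("utf-8")
```

Here ambiguous_without_label(sample) is true and misread is the single
character U+00E9. Any byte sequence decodes under 8859-1, so for that
default the absence of 0x00 bytes is the only safeguard.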