Tim Bray wrote:
>
>
> Alessandro Triglia wrote:
>
> > As I understand, in UTF-8+names, an ampersand is represented as &&;
> > which means that, if UTF-8+names is used for XML, "normal" entity
> > references will look like:
> >
> > &&;myentity;
> >
> > and numeric character references will look like:
> >
> > &&;#12345;
>
> No. &&; represents an ampersand. Normally it wouldn't be used in text
> you were going to feed to an XML processor because XML processors don't
> like that. &amp; represents just "&amp;" because UTF-8+names doesn't
> assign a replacement. &uuml; represents a single u+umlaut character,
> inherited from HTML.

If my understanding is correct, UTF-8+names is just another encoding of
Unicode, like UTF-8 or UTF-16.

What an encoding (of Unicode) should do is define a mapping between Unicode
characters (code points) and bit/byte patterns. Your document implies that
AMPERSAND is encoded as the following sequence of 3 bytes:

    0x26 0x26 0x3B

(which, when interpreted as a UTF-8 encoding, looks like & & ;)

and (for example) the character NO-BREAK SPACE (160) is encoded as the
following sequence of 6 bytes:

    0x26 0x6E 0x62 0x73 0x70 0x3B

(which, when interpreted as a UTF-8 encoding, looks like & n b s p ;)

I don't see this as fundamentally different from what (say) UTF-8 does,
which encodes AMPERSAND as the single byte:

    0x26

and NO-BREAK SPACE as a sequence of two bytes:

    0xC2 0xA0

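To make the comparison concrete, here is a small Python sketch that prints the byte sequences discussed above. The plain UTF-8 bytes come from Python's own codec; the UTF-8+names forms are simply written out as I read them from the document, so the variable names and those two literals are my assumptions, not anything defined by the I-D.

```python
# Characters under discussion.
AMP = "\u0026"    # AMPERSAND
NBSP = "\u00a0"   # NO-BREAK SPACE (160)

# Plain UTF-8, produced by Python's built-in codec.
utf8_amp = AMP.encode("utf-8")
utf8_nbsp = NBSP.encode("utf-8")

# UTF-8+names forms as described in this message (assumed, written by hand).
names_amp = b"&&;"
names_nbsp = b"&nbsp;"

for label, data in [
    ("UTF-8        AMPERSAND", utf8_amp),
    ("UTF-8        NO-BREAK SPACE", utf8_nbsp),
    ("UTF-8+names  AMPERSAND", names_amp),
    ("UTF-8+names  NO-BREAK SPACE", names_nbsp),
]:
    # bytes.hex(sep) needs Python 3.8 or later.
    print(f"{label}: {data.hex(' ')}")
```
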
Now, I see that in XML 1.0, an entity reference or numeric character
reference is introduced by an AMPERSAND character. The actual bytes that
represent the AMPERSAND character depend on the encoding used, and may or
may not be a single 0x26 byte.

Since in UTF-8+names AMPERSAND is encoded as 0x26 0x26 0x3B, an entity
reference will be encoded as:

    0x26 0x26 0x3B followed by the bytes encoding the characters of the
    name plus a semicolon

which, when interpreted as a UTF-8 encoding, looks like

    & & ; m y e n t i t y ;

I have indeed noticed in the I-D that a sequence of bytes that looks like a
reference but is not recognized as a reference must be left as is by the
codec, byte by byte. Therefore I will be able to use, as you say:

    & m y e n t i t y ;

as an alternative to the full form:

    & & ; m y e n t i t y ;

if and only if no replacement is defined for & m y e n t i t y ; in
UTF-8+names and I know this.

However, if a replacement is defined for & m y e n t i t y ; in
UTF-8+names, I need to use the full form & & ; m y e n t i t y ; to
prevent the codec from replacing my entity reference with its own
replacement.

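The decoding behavior I am describing can be sketched roughly as follows. This is only my reading of the rules, not the I-D's algorithm: the REPLACEMENTS table is a hypothetical two-entry subset, decode_names is a name I made up, and the sketch works on text that has already been UTF-8 decoded rather than on raw bytes.

```python
import re

# Hypothetical subset of the name table; the real list is defined by the I-D.
REPLACEMENTS = {"nbsp": "\u00a0", "uuml": "\u00fc"}

# Matches either the escaped ampersand "&&;" or something shaped like "&name;".
_TOKEN = re.compile(r"&(?:&;|([A-Za-z][A-Za-z0-9]*);)")

def decode_names(text: str) -> str:
    """Replace "&&;" with "&" and known "&name;" sequences with their
    character; leave anything unrecognized exactly as it was."""
    def repl(match: re.Match) -> str:
        if match.group(0) == "&&;":
            return "&"
        # Unknown names fall through unchanged.
        return REPLACEMENTS.get(match.group(1), match.group(0))
    return _TOKEN.sub(repl, text)

# No replacement is defined for "myentity", so both forms give the same result.
print(decode_names("&myentity;"))      # &myentity;
print(decode_names("&&;myentity;"))    # &myentity;

# A replacement is defined for "nbsp", so only the full form survives.
print(repr(decode_names("&nbsp;")))    # '\xa0'
print(decode_names("&&;nbsp;"))        # &nbsp;
```

The last two lines show the case that worries me: an XML entity whose name collides with a replacement is only safe in the full form.
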
What would be the recommended behavior of a program generating a UTF-8+names
encoding from a string of Unicode characters? Whenever it encounters an
AMPERSAND character in the string, what byte(s) should it generate for it?

Should it look at the (XML 1.0) context to see if this ampersand is the
first character of an XML entity reference or numeric character reference,
and then generate a single 0x26 byte or the three bytes 0x26 0x26 0x3B
depending on the context, on whether it has encountered an XML entity name
that is identical to a replacement, and on whether the definition of that
XML entity is identical to the replacement?

This also means that the rules to be followed by the codec on encoding would
depend on its knowledge of XML 1.0 (one layer above it), which I don't see
as a desirable property of a codec.

Would you recommend this complex behavior, or the simple and safe behavior
of encoding all AMPERSANDs as 0x26 0x26 0x3B?

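For comparison, the simple and safe behavior is almost trivial to write down. A minimal sketch, assuming the encoder never emits names on its own and writes every other character as ordinary UTF-8 (encode_simple is a hypothetical name, not something from the I-D):

```python
def encode_simple(text: str) -> bytes:
    """Encode every AMPERSAND as 0x26 0x26 0x3B and everything else as
    plain UTF-8, without looking at any XML context."""
    return text.replace("&", "&&;").encode("utf-8")

print(encode_simple("&myentity;"))   # b'&&;myentity;'
print(encode_simple("&#12345;"))     # b'&&;#12345;'
print(encode_simple("a & b"))        # b'a &&; b'
```

The encoder needs no knowledge of XML 1.0 here; the cost is that every entity reference and numeric character reference takes the longer form shown earlier.
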
Alessandro
>
> --
> Cheers, Tim Bray (http://www.tbray.org/ongoing/)
>
>
>