David Carlisle wrote:
>
>
> > As I understand, in UTF-8+name, an ampersand is represented as &&;
> > which means that, if UTF-8+name is used for XML, "normal" entity
> > references will look like:
> >
> > &&;myentity;
>
> Not necessarily, &myentity; would also work so long as it
> wasn't one of the predefined names. If the entity isn't
> "known" then it expands to itself in the character encoding,
> leaving the entity to be expanded by the XML parser in the usual way.
I agree, but please see what I wrote in my previous email about a program
that produces a UTF-8+names encoding from a string of Unicode characters.
What do you think the recommended behavior of such a program should be
with respect to encoding AMPERSAND characters?
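To make the question concrete, here is a minimal sketch of the two encoder behaviors I can think of. It is not taken from the proposal: the names KNOWN_NAMES, decode, encode_always and encode_minimal are hypothetical, the three-entry table merely stands in for the real list of replacement names, and the only rules assumed are the ones discussed in this thread (&&; stands for an ampersand, known names expand, unknown names are left as they are).

```python
import re

# Hypothetical, tiny replacement table; the real list of names is not reproduced here.
KNOWN_NAMES = {"nbsp": "\u00a0", "eacute": "\u00e9", "rarr": "\u2192"}

# Matches "&&;" (the escaped ampersand) or "&name;".
_TOKEN = re.compile(r"&(&|[A-Za-z][A-Za-z0-9]*);")

def decode(text):
    """Expand &&; and known names; leave everything else untouched."""
    def repl(m):
        name = m.group(1)
        if name == "&":
            return "&"
        # Unknown names expand to themselves.
        return KNOWN_NAMES.get(name, m.group(0))
    return _TOKEN.sub(repl, text)

def encode_always(text):
    """Conservative encoder: every AMPERSAND becomes &&;."""
    return text.replace("&", "&&;")

def encode_minimal(text):
    """Escape an AMPERSAND only where the decoder would otherwise change it."""
    out = []
    for i, ch in enumerate(text):
        if ch == "&":
            m = _TOKEN.match(text, i)
            if m and (m.group(1) == "&" or m.group(1) in KNOWN_NAMES):
                out.append("&&;")
                continue
        out.append(ch)
    return "".join(out)
```

Under these assumptions, encode_always turns &myentity; into &&;myentity; while encode_minimal leaves it alone, and both round-trip through decode. The second is more readable but ties the encoder's output to the exact list of known names.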
>
> > and numeric character references will look like:
> >
> > &&;#12345;
>
> similarly only one & is needed here as well.
>
> > &lt;
> >
> > but this can be confusing because it would denote a **literal** <
> > character,
>
> No it's defined to have the definition in xhtml and mathml
> which is the definition given in the xml spec, double
> escaped, so it would expand to a character reference to a <
> character, not a literal <.
Yes, I noticed that I had missed this. Anyway, what you say above may mean
one of two different things:
1) &lt; is defined as a replacement name in UTF-8+names, which implies that
the bytes will be decoded into the characters & # 6 0 ; (following XML
1.0) and the XML processor will substitute the character < on parsing
those characters
2) &lt; is *not* defined as a replacement name in UTF-8+names, which implies
that the bytes will be decoded one by one into the characters & l t ; and
the XML processor will "include" the predefined entity lt and eventually
substitute the character <
Although the effect of (1) and (2) will be the same when parsing an XML
document, it will not be the same when decoding a sequence of bytes in a
non-XML context. I am not sure the document is clear on this. At any rate,
I don't think it would be a good idea to decode & l t ; into the
characters & # 6 0 ; because this sequence of characters is meaningless
outside of XML. So &lt; should really not be a defined replacement name
in UTF-8+names.
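The difference between the two readings can be shown with a small sketch. Everything here is an assumption made for illustration: decode, TABLE_1, TABLE_2 and xml_text are hypothetical names, and each table contains only the entry needed to model reading (1) or reading (2) of the lt name.

```python
import re
import xml.etree.ElementTree as ET

_NAME = re.compile(r"&([A-Za-z][A-Za-z0-9]*);")

def decode(text, table):
    """Expand names found in `table`; unknown names expand to themselves."""
    return _NAME.sub(lambda m: table.get(m.group(1), m.group(0)), text)

TABLE_1 = {"lt": "&#60;"}  # reading (1): lt is a replacement name
TABLE_2 = {}               # reading (2): lt is not a replacement name

raw = "a &lt; b"
text_1 = decode(raw, TABLE_1)  # what a non-XML consumer sees under (1)
text_2 = decode(raw, TABLE_2)  # what a non-XML consumer sees under (2)

def xml_text(s):
    """Parse the decoded characters as XML element content."""
    return ET.fromstring("<r>" + s + "</r>").text
```

After XML parsing, xml_text(text_1) and xml_text(text_2) are the same string, but text_1 and text_2 themselves differ, and only text_2 still matches what was written.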
I have a question about all the other entities defined in XHTML and MathML.
Do all of them resolve to actual characters, or do some of them resolve to
escaped references (like &lt; does)? If some entities resolve to escaped
character references, they need an XML context to work correctly, and
therefore should not be included among the defined replacements in
UTF-8+names (because a Unicode encoding should not rely on XML to work
correctly).
Alessandro
>
> > It is not very clear to me where UTF-8+name would be useful, as I
> > don't think it is useful in XML. Is it being proposed for use in
> > areas where, for some reason, XML cannot be used?
>
> No, its whole point is to allow the use of &rarr; or
> &eacute; _with_ XML but _without_ a DTD to allow for relax or
> xsd schema use, or just simply well formed fragments with no
> schema at all.
>
>
> some other people have suggested not using & as the delimiter
> but again that would break the main use case of this, the
> FFFFAQ question on xsl-list asking why "& n b s p ;"
> generates an error in xsl.
>
> David
>