   RE: [xml-dev] UTF-8+names


David Carlisle wrote: 
> > As I understand, in UTF-8+name, an ampersand is represented 
> as  &&;  
> > which means that, if UTF-8+name is used for XML, "normal" entity 
> > references will look like:
> > 
> > 	&&;myentity;
> Not necessarily, &myentity; would also work so long as it 
> wasn't one of the predefined names. If the entity isn't 
> "known" then it expands to itself in the character encoding, 
> leaving the entity to be expanded by the XML parser in the usual way.
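
[Editorial sketch] The decoding rule described above (a known name expands to its character, an unknown name expands to itself, and  &&;  yields an ampersand) might be sketched as follows. This is only an illustration based on what is said in this thread: the three-entry NAMES table, the TOKEN pattern and the function name decode_names are assumptions made for the example, not taken from the UTF-8+names document.

```python
import re

# Hypothetical, tiny name table; the real table would be far larger.
NAMES = {
    "eacute": "\u00e9",
    "rightarrow": "\u2192",
    "nbsp": "\u00a0",
}

# "&&;" is the escaped ampersand; "&name;" is a candidate replacement name.
# Assumption: names consist of ASCII letters and digits only.
TOKEN = re.compile(r"&&;|&([A-Za-z][A-Za-z0-9]*);")


def decode_names(data: bytes) -> str:
    """Decode UTF-8 bytes, then expand known names in a single pass."""
    text = data.decode("utf-8")

    def repl(match: re.Match) -> str:
        if match.group(0) == "&&;":
            return "&"
        # Unknown names are left untouched for a later XML parser.
        return NAMES.get(match.group(1), match.group(0))

    return TOKEN.sub(repl, text)


print(decode_names(b"caf&eacute;"))      # known name is replaced
print(decode_names(b"&myentity;"))       # unknown name expands to itself
print(decode_names(b"&&;myentity;"))     # escaped ampersand, same result
print(decode_names(b"&#12345;"))         # numeric reference is left alone
```

Because the substitution is a single pass, the ampersand produced by  &&;  is never rescanned, so  &&;eacute;  decodes to the seven characters  & e a c u t e ;  rather than to an accented letter.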

I agree, but please see what I wrote in my previous email about a program
that is to produce a UTF-8+names encoding from a string of Unicode
characters.  What do you think the recommended behavior of such a program
should be with respect to encoding AMPERSAND characters?
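
[Editorial sketch] To make the question concrete, here are two policies such an encoder could follow. Neither is presented as what the UTF-8+names document recommends; the KNOWN_NAMES set and both function names are hypothetical, chosen only for this example.

```python
import re

# Hypothetical subset of the defined replacement names.
KNOWN_NAMES = {"eacute", "rightarrow", "nbsp"}


def encode_always(text: str) -> bytes:
    """Policy A: write every AMPERSAND as "&&;", with no table lookup."""
    return text.replace("&", "&&;").encode("utf-8")


# An ampersand is ambiguous only if it is followed by "&;" or by "name;".
_AMBIGUOUS = re.compile(r"&(?=&;|([A-Za-z][A-Za-z0-9]*);)")


def encode_minimal(text: str) -> bytes:
    """Policy B: escape an AMPERSAND only where a decoder would misread it."""

    def repl(match: re.Match) -> str:
        name = match.group(1)  # None when the lookahead matched "&;"
        if name is None or name in KNOWN_NAMES:
            return "&&;"
        return "&"

    return _AMBIGUOUS.sub(repl, text).encode("utf-8")


sample = "AT&T &eacute; &myentity;"
print(encode_always(sample))
print(encode_minimal(sample))
```

Policy A is simple but turns every entity reference into the  &&;myentity;  form; policy B keeps  &myentity;  readable but ties the encoder's output to the exact contents of the name table.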

> > and numeric character references will look like:
> > 
> > 	&&;#12345;
> Similarly, only one & is needed here as well.
> > 	&lt;
> > 
> > but this can be confusing because it would denote a **literal** < 
> > character,
> No, it's defined to have the definition in xhtml and mathml 
> which is the definition given in the xml spec, double 
> escaped, so it would expand to a character reference to a < 
> character, not a literal <.

Yes, I noticed that I had missed this.  Anyway, what you say above may mean
one of two different things:

1) &lt; is defined as a replacement name in UTF-8+names, which implies that
the bytes will be decoded into the characters  & # 6 0 ;  (following XML
1.0) and the XML processor will substitute the character  <  on parsing
those characters

2) &lt; is *not* defined as a replacement name in UTF-8+names, which implies
that the bytes will be decoded one by one into the characters  & l t ;   and
the XML processor will "include" the predefined entity lt and eventually
substitute the character  <

Although the effect of (1) and (2) will be the same when parsing an XML
document, it will not be the same when decoding a sequence of bytes in a
non-XML context.  I am not sure the document is clear on this.  At any rate,
I don't think it would be a good idea to decode   & l t ;  into the
characters   & # 6 0 ;   because this sequence of characters is meaningless
outside of XML.  So  &lt;  should really not be a defined replacement name
in UTF-8+names.
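
[Editorial sketch] The difference between interpretations (1) and (2) can be shown with a small experiment. The decoder below is a simplified stand-in for a UTF-8+names decoder, and the two name tables are hypothetical; only the XML parsing step uses a real library (Python's xml.etree.ElementTree).

```python
import re
import xml.etree.ElementTree as ET

TOKEN = re.compile(r"&&;|&([A-Za-z][A-Za-z0-9]*);")


def decode_names(data: bytes, names: dict) -> str:
    """Expand "&&;" and any name found in `names`; leave the rest alone."""

    def repl(match: re.Match) -> str:
        if match.group(0) == "&&;":
            return "&"
        return names.get(match.group(1), match.group(0))

    return TOKEN.sub(repl, data.decode("utf-8"))


WITH_LT = {"lt": "&#60;"}  # interpretation (1): lt is a replacement name
WITHOUT_LT = {}            # interpretation (2): lt is not in the table

raw = b"<p>a &lt; b</p>"
as_1 = decode_names(raw, WITH_LT)
as_2 = decode_names(raw, WITHOUT_LT)

# Outside XML, the two decoded character strings differ.
print(as_1)
print(as_2)

# Inside XML, the parser turns both into the same text content.
text_1 = ET.fromstring(as_1).text
text_2 = ET.fromstring(as_2).text
print(text_1 == text_2)
```

Under (1) a non-XML consumer receives the characters  & # 6 0 ;  which mean nothing to it; under (2) it receives  & l t ;  unchanged, exactly as it would from plain UTF-8.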

I have a question about all the other entities defined in XHTML and MathML.
Do all of them resolve to actual characters, or do some of them resolve to
escaped references (like &lt; does)?  If some entities resolve to escaped
character references, they need an XML context to work correctly, and
therefore should not be included among the defined replacements in
UTF-8+names (because a Unicode encoding should not rely on XML to work).


> > It is not very clear to me where UTF-8+name would be useful, as I 
> > don't think it is useful in XML.  Is it being proposed for use in 
> > areas where, for some reason, XML cannot be used?
> No, its whole point is to allow the use of &rightarrow; or 
> &eacute; _with_ XML but _without_ a DTD to allow for relax or 
> xsd schema use, or just simply well formed fragments with no 
> schema at all.
> Some other people have suggested not using & as the delimiter 
> but again that would break the main use case of this, the 
> FFFFAQ question on xsl-list asking why "& n b s p ;" 
> generates an error in xsl.
> David


Copyright 2001 XML.org. This site is hosted by OASIS