xml-dev - RE: CDATA by any other name... (was The raw and the cooked)

From: "Rick Jelliffe" <ricko@allette.com.au>
To: "XML Dev" <xml-dev@ic.ac.uk>
Date: Wed, 4 Nov 1998 06:10:32 +1100

> From: John Cowan

> Rick Jelliffe wrote:
>
> > (An optimistic view of ISO10646: there are dozens of new Han ideographs
> > created every day, apart from other scripts.)
>
> True but irrelevant, since no specifiable character set can hold these.

Not so. The additions are usually composed of standard radicals and
combinations. There are various projects around (such as C.C.Hsieh's in
Taiwan) to figure out encodings to "spell" Han ideographs by component
radicals. This would allow any number of characters and even variant forms.
But this is not in ISO 10646 yet.

I guess the point is that John thinks that if an XML system can produce
characters which a recipient system cannot process, because the recipient
does not use ISO 10646, that is not something that CDATA sections should be
used to address. I think his reason is that he cannot see it in the spec.
Dave M thinks that xml:lang is appropriate. My point about CDATA elements
was that there is no standard mechanism to lock CDATA marked sections. I
think a lot of people now think that any non-ISO10646 system is for losers
anyway (except for whatever character set they use, probably).

> .. the repertoire of a language is
> a sticky wicket.  In the domain of "xml:lang='en-US'", am I to be
> forbidden to write "naïve" or "coöperate"?  How about "résumé" or
> "Québéc"?

The primary purpose of xml:lang, as far as I am concerned, should be to
convey the information lost by ISO 10646 unification. Where the Japanese and
Chinese glyphs (or Polish and Russian) for a unified character differ,
transcoding and unifying the characters into ISO 10646 can, I think, lose
information unless the xml:lang attribute is set. After that, xml:lang can
be used to label text for the purposes of variant character selection, and
after that for marking up the natural language.
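
To make the unification point concrete, here is a minimal sketch of how a
receiving application might recover the language that applies to each run of
text, so that it can pick a glyph variant. The document, the function name
text_with_lang, and the choice of U+76F4 as a sample unified ideograph are
all assumptions made for illustration; they are not taken from the spec or
from any particular tool.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes xml:lang under the reserved XML namespace.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# Hypothetical document: the same code point appears under two languages.
DOC = """<doc xml:lang="en">
  <p>The same code point, two expected glyph shapes:</p>
  <p xml:lang="ja">\u76f4</p>
  <p xml:lang="zh-TW">\u76f4</p>
</doc>"""

def text_with_lang(elem, inherited=None):
    """Yield (language, text) pairs, inheriting xml:lang from ancestors."""
    lang = elem.get(XML_LANG, inherited)
    if elem.text and elem.text.strip():
        yield lang, elem.text.strip()
    for child in elem:
        yield from text_with_lang(child, lang)
        # Tail text sits after the child, so it belongs to this element.
        if child.tail and child.tail.strip():
            yield lang, child.tail.strip()

pairs = list(text_with_lang(ET.fromstring(DOC)))
for lang, text in pairs:
    print(lang, text)
```

The two ideograph runs compare equal as character data; only the xml:lang
value tells a renderer which national glyph form was intended.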

But I am not trying to fix the repertoire of a language (TEI WSD can declare
it, though). I am just thinking about how to constrain XML documents so that
they will not contain characters which will break non-ISO10646 target
systems.
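
As a rough illustration of the kind of constraint I mean, the following
sketch lists the characters in a document's text content that a given target
encoding cannot represent. The function name unencodable_chars and the
sample document are hypothetical, and the check covers element text only,
not attribute values or names; it is one possible approach under those
assumptions, not a standard mechanism.

```python
import xml.etree.ElementTree as ET

def unencodable_chars(xml_text, target_encoding):
    """Return the sorted characters in element text that the target
    encoding cannot represent."""
    root = ET.fromstring(xml_text)
    bad = set()
    for chunk in root.itertext():
        for ch in chunk:
            try:
                ch.encode(target_encoding)
            except UnicodeEncodeError:
                bad.add(ch)
    return sorted(bad)

# Hypothetical sample: one accented Latin letter and one Han ideograph.
sample = "<doc>na\u00efve \u76f4</doc>"

latin1_problems = unencodable_chars(sample, "latin-1")
ascii_problems = unencodable_chars(sample, "ascii")
print(latin1_problems, ascii_problems)
```

A producer could run such a check against the character set of the target
system (for example "shift_jis") before sending, and refuse or escape the
document if the list is not empty.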

Rick Jelliffe
