Tim Bray wrote:
> John Cowan wrote:
> > >Why not Unicode.org? It could create short name "aliases"
> > >of the long name descriptions.
> > Are you really prepared to create short names
No -- that's why I suggested Unicode.org. :)
> > (other than ones involving hex digits) for all 95,156
> > characters in Unicode 3.2? Or even if we leave out the Han
> > and Hangul characters, the 13,791 characters that are left?
> > It is a biiiiiiiiiiiiiiiiiiiig job.
> Yes, but it sure would be nice if it were done. If this were done,
> I think that a lot of people would be willing to focus support
> on this and nothing else. I wonder how much could be automated?
> Hmm... -Tim
None of the Latin, Greek, or Math used in today's markup should, IMO, be
automated. Those should come from the XHTML, DocBook, and MathML
traditions, as "unified" by David C. & Co.
As to the rest, the writing groups are, well, different -- especially as
to case, letters, characters, vowel signs, intent, etc. A few random
samples from UnicodeData.txt:
BOX DRAWINGS RIGHT LIGHT AND LEFT VERTICAL HEAVY
RECYCLING SYMBOL FOR TYPE-4 PLASTICS
UPWARDS HARPOON WITH BARB LEFT BESIDE DOWNWARDS HARPOON WITH
CYRILLIC CAPITAL LETTER GHE WITH UPTURN
ARABIC LETTER DAL WITH DOT BELOW AND SMALL TAH
ARABIC LIGATURE FEH WITH KHAH WITH MEEM INITIAL FORM
SINHALA LETTER MAHAAPRAANA PAYANNA
SINHALA VOWEL SIGN KOMBUVA HAA AELA-PILLA
TIBETAN MARK GTER YIG MGO -UM GTER TSHEG MA
TIBETAN SUBJOINED LETTER CHA
TIBETAN VOWEL SIGN REVERSED II
TIBETAN SIGN NYI ZLA NAA DA
HANGUL CHOSEONG CEONGCHIEUMSSANGCIEUC
HANGUL JUNGSEONG SSANGARAEA
HANGUL LETTER KAPYEOUNSSANGPIEUP
PARENTHESIZED HANGUL MIEUM A
You might possibly automate *some* of it group by group. A lot of them
don't seem to yield very well to "entification", automatic or otherwise.
:) And the alternative underscore trick could cause too many to end it
all with an
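
To make the "underscore trick" concrete, here is a minimal sketch in Python of what automating it might look like. It assumes a hypothetical alias scheme (lowercase the UnicodeData.txt name and turn every run of spaces or hyphens into one underscore); the function names underscore_alias and entity_declaration are made up for illustration, and nothing here is an agreed-upon or Unicode.org convention. It relies only on the standard unicodedata module to look a character up by name.

```python
import re
import unicodedata


def underscore_alias(name: str) -> str:
    """Hypothetical scheme: lowercase the Unicode name and collapse every
    run of non-alphanumeric characters (spaces, hyphens) into one underscore."""
    return re.sub(r"[^a-z0-9]+", "_", name.lower()).strip("_")


def entity_declaration(name: str) -> str:
    """Build a DTD entity declaration for the character with this Unicode name.
    unicodedata.lookup raises KeyError if the name is unknown."""
    ch = unicodedata.lookup(name)
    return f'<!ENTITY {underscore_alias(name)} "&#x{ord(ch):04X};">'


if __name__ == "__main__":
    # A few of the sample names quoted above.
    samples = [
        "CYRILLIC CAPITAL LETTER GHE WITH UPTURN",
        "SINHALA LETTER MAHAAPRAANA PAYANNA",
        "TIBETAN SUBJOINED LETTER CHA",
    ]
    for name in samples:
        print(entity_declaration(name))
```

The mechanical part is easy; the catch is that the resulting names are as long as the originals, which is why this probably only helps group by group, if at all.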
It's probably best to start with a single unified Western set from
XHTML, DocBook, and MathML that people can bring in -- *if they desire*
-- and ten years or so from now, we'll rarely need it (or any other
entified Unicode) anyway.