OASIS Mailing List Archives



   Re: [xml-dev] Microsoft FUD on binary XML...


Alaric B Snell wrote:

> So languages like 
> Arabic, which are alphabet-based but not very compact in UTF-8 due to 
> being composed of high-numbered characters (although I'm not sure how 
> high so don't know if they would mainly be 2 or 3 bytes or whatever), 

The 2-byte scripts are Latin (including IPA but excluding ASCII), Greek,
Cyrillic, Armenian, Hebrew, Arabic, Syriac, and Thaana.  N'Ko is not
yet encoded but will also probably fall into this range.  All of these
scripts have a small number of characters.

All other modern-use scripts are 3-byte, as are the archaic scripts
Ogham, Runic, and Tagalog (the Tagalog language is now written in the
Latin script).  A few other archaic scripts will probably be encoded in
this range.

All 4-byte scripts are archaic, except that some modern Chinese characters
appear in this range.  The modern-use scripts Blissymbols and Sutton
Signwriting are not yet encoded but will fall into this range,
because of the large number of characters required for each.
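
As a quick check of the ranges above, here is a minimal Python sketch that
asks the built-in UTF-8 codec how many bytes one sample character from
several scripts takes. The sample characters and the helper name utf8_len
are illustrative choices, not part of the original message.

```python
def utf8_len(ch: str) -> int:
    """Return the number of bytes UTF-8 uses for a single character."""
    return len(ch.encode("utf-8"))


# One illustrative character per script (chosen for this sketch).
samples = {
    "ASCII letter a": "a",
    "Latin-1 e-acute": "\u00e9",
    "Greek alpha": "\u03b1",
    "Cyrillic zhe": "\u0436",
    "Hebrew alef": "\u05d0",
    "Arabic beh": "\u0628",
    "Devanagari ka": "\u0915",
    "Ogham beith": "\u1681",
    "CJK U+4E2D": "\u4e2d",
    "Gothic ahsa": "\U00010330",
    "CJK Extension B U+20000": "\U00020000",
}

for label, ch in samples.items():
    print(f"U+{ord(ch):04X}  {utf8_len(ch)} byte(s)  {label}")
```

The boundaries come straight from the UTF-8 definition: code points up to
U+007F take 1 byte, up to U+07FF take 2, up to U+FFFF take 3, and
everything above that takes 4.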

> would be better served by an encoding that mainly uses a shiftable 
> window with single-byte characters, I guess.

That's what SCSU is all about.
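
To make the window idea concrete, here is a toy Python sketch. It is NOT
a conforming SCSU encoder: the function name windowed_size, the one-byte
window-selection cost, and the three-byte escape cost are all assumptions
made for illustration. It only estimates how many bytes a text would need
if ASCII and one 128-character window were each encoded in a single byte.

```python
def windowed_size(text: str, window_base: int) -> int:
    """Estimate the byte count under a toy single-window scheme (not real SCSU)."""
    size = 1  # assumption: one byte up front to select the window
    for ch in text:
        cp = ord(ch)
        if cp < 0x80 or window_base <= cp < window_base + 0x80:
            size += 1  # ASCII or inside the 128-character window: one byte
        else:
            size += 3  # assumption: escape marker plus two more bytes
    return size


# "Hello world" in Arabic, written with escapes: 12 Arabic letters and a space.
text = "\u0645\u0631\u062d\u0628\u0627 \u0628\u0627\u0644\u0639\u0627\u0644\u0645"

print("UTF-8 bytes:   ", len(text.encode("utf-8")))    # 25
print("windowed bytes:", windowed_size(text, 0x0600))  # 14
```

With the window based at U+0600, every Arabic letter in the sample costs
one byte instead of the two that UTF-8 needs, which is the saving the
quoted suggestion is after.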

John Cowan                              jcowan@reutershealth.com
http://www.reutershealth.com            http://www.ccil.org/~cowan
Humpty Dump Dublin squeaks through his norse
                Humpty Dump Dublin hath a horrible vorse
But for all his kinks English / And his irismanx brogues
                Humpty Dump Dublin's grandada of all rogues.  --Cousin James



Copyright 2001 XML.org. This site is hosted by OASIS