Alaric B Snell scripsit:
> So languages like
> Arabic, which are alphabet-based but not very compact in UTF-8 due to
> being composed of high-numbered characters (although I'm not sure how
> high so don't know if they would mainly be 2 or 3 bytes or whatever),
The 2-byte scripts are Latin (including IPA but excluding ASCII), Greek,
Cyrillic, Armenian, Hebrew, Arabic, Syriac, and Thaana. N'Ko is not
yet encoded but will also probably fall into this range. All of these
scripts have a small number of characters.
All other modern-use scripts are 3-byte, as are the archaic scripts
Ogham, Runic, and Tagalog (the Tagalog language is now written in the
Latin script). A few other archaic scripts will probably be encoded in
All 4-byte scripts are archaic, except that some modern Chinese characters
appear in this range. The modern-use scripts Blissymbols and Sutton
SignWriting are not yet encoded but will fall into this range,
because of the large number of characters required for each.
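
For anyone who wants to check the byte counts above, here is a minimal Python sketch. The sample characters and the helper name utf8_len are my own choices for illustration, not anything from the Unicode Standard; the ranges in the comments follow from the UTF-8 encoding rules.

```python
# Rough check of UTF-8 byte lengths per script.
# utf8_len and the sample characters are illustrative assumptions.

def utf8_len(ch: str) -> int:
    """Return the number of bytes UTF-8 needs for one character."""
    return len(ch.encode("utf-8"))

# UTF-8 length depends only on the code point:
#   U+0000..U+007F     1 byte
#   U+0080..U+07FF     2 bytes
#   U+0800..U+FFFF     3 bytes
#   U+10000..U+10FFFF  4 bytes
samples = {
    "ASCII A": "\u0041",
    "Latin e-acute": "\u00e9",
    "Greek alpha": "\u03b1",
    "Cyrillic zhe": "\u0436",
    "Hebrew alef": "\u05d0",
    "Arabic beh": "\u0628",
    "Thaana haa": "\u0780",
    "Devanagari ka": "\u0915",
    "Ogham beith": "\u1681",
    "Runic fehu": "\u16a0",
    "CJK U+4E2D": "\u4e2d",
    "Gothic ahsa": "\U00010330",
    "CJK U+20000": "\U00020000",
}

if __name__ == "__main__":
    for name, ch in samples.items():
        print(f"{name:15} U+{ord(ch):05X}  {utf8_len(ch)} byte(s)")
```

Arabic text therefore costs two bytes per letter in UTF-8, not three.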
> would be better served by an encoding that mainly uses a shiftable
> window with single-byte characters, I guess.
That's what SCSU is all about.
John Cowan email@example.com
Humpty Dump Dublin squeaks through his norse
Humpty Dump Dublin hath a horrible vorse
But for all his kinks English / And his irismanx brogues
Humpty Dump Dublin's grandada of all rogues. --Cousin James