   Fwd: Re: encoding converters?

  • From: "Simon St.Laurent" <simonstl@simonstl.com>
  • To: XML-Dev Mailing list <xml-dev@xml.org>
  • Date: Sat, 19 Feb 2000 18:30:32 -0500

Rick Jelliffe asked that I forward this to the list - it's yet more answers
on the encoding converter question.

>Date: Sun, 20 Feb 2000 06:32:25 +0800 (CST)
>From: Rick Jelliffe <ricko@gate.sinica.edu.tw>
>Subject: Re request on XML-DEV
>GLUE and XML-TCS Transcoding Utility Software
>I have made an XML-aware version of TCS. The diff package is available at
>the Chinese XML Now site. It implements "lossless" transcoding, which is
>what I talked about at the XML Conference we met at last year. It
>basically means that you should convert characters the target encoding
>cannot represent to NCRs (numeric character references).
>I can only provide diffs because Bell has not AFAIK made tcs
>available for redistribution, even though at least one version of Linux
>does include it. I don't think they care particularly, but without
>confirmation I cannot make up binaries or a unified source
>distribution, unfortunately. The people involved cannot be contacted; the
>project leader is Dennis Ritchie (i.e., UNIX and C) who undoubtedly has
>more pressing matters to attend to.
>*HOWEVER* at my site you will also see "The GLUE Project Transcoders"
>GLUE (= "GLUE Loses User's Encodings") is a transcoder library I wrote.
>It is specified using XML and converted to C.  At the moment, only the
>x->UTF-8 direction is available, but that seems to be all you want.
>I made it because the existing transcoders had problems: the GNU iconv
>ones required their new glibc; and so on. Since then, IBM has released
>their excellent C++ library ICU, but it too does not do lossless
>transcoding. Also, Java now generates an exception if a character is
>missing instead of just silently swallowing the character; these are steps
>in the right direction.
>The mapping tables at Unicode.org have the problem that many encodings are
>better mapped by algorithm rather than by a table. So I made an XML format
>that could express declaratively certain relationships in a way 
>that can be simply translated into code.  Also, many encodings have
>variants, which can be represented well in XML.
>GLUE home page is at:
>	http://www.ascc.net/xml/en/utf-8/glue.html
>GLUE handles the following encodings:
>                  ASCII 
>                        ISO 646de 
>                        ISO 646en 
>                        ISO 646es 
>                        ISO 646fr 
>                        ISO 646it 
>                        ISO 646sv 
>                  ISO 8859-1 (Latin 1)
>                        CP1252 variant (Windows "ANSI") 
>                  ISO 8859-2 (Latin 2)
>                        CP 1250 variant 
>                  ISO 8859-3 (Latin 3) 
>                  ISO 8859-4 (Latin 4) 
>                  ISO 8859-5 (Cyrillic) 
>                  ISO 8859-6 (Arabic) 
>                  ISO 8859-7 (Greek) 
>                  ISO 8859-8 (Hebrew) 
>                  ISO 8859-9 (Latin 5) 
>                  ISO 8859-10 (Latin 6) 
>                  ISO 8859-11 (Thai) 
>                  ISO 8859-13 (Latin 7) 
>                  ISO 8859-14 (Latin 8) 
>                  ISO 8859-15 (Latin 9) 
>                  MacRoman 
>                        MacRoman with Euro 
>                  UTF-8 
>                  UTF-16 (little endian) 
>                  UTF-16 (big endian) 
>                  Big5 (Chinese, including user-defined area) 
>                  VISCII (Vietnamese) 
>(Note: the variants have not been tested thoroughly. Check them to
>confirm. The current implementation does not support ISO 2022-based
>encodings well, nor non-Unicode encodings (i.e. the massive CCCII).)
>The xml-tcs home page is at
>	http://www.ascc.net/xml/en/utf-8/transcode-index.html
>xml-tcs can generate the following NCRs with single or double delimiting:
>                STRIP: no delimiter, 
>                UNKNOWN: put in unknown character indicator "?" or FFFD 
>                UNICODE: Unicode-style U+HHHH 
>                JAVA: Java-style \uHHHH 
>                JAVA_DD: Java-style \\uHHHH 
>                XML: XML-style &#xHHHH; 
>                XML_DD: XML-style &amp;#xHHHH; 
>                SPREAD1: Old SPREAD &U-HHHH; 
>                SPREAD1_DD: Old SPREAD &amp;U-HHHH; 
>                SPREAD2: New SPREAD &UHHHH; 
>                SPREAD2_DD: New SPREAD &amp;UHHHH; 
>                CSS1: CSS1 \HHHH 
>                CSS1_DD: CSS1 \\HHHH 
>                CSS2: CSS2 \00HHHH (space following is delimiter) 
>                CSS2_DD: CSS2 \\00HHHH (space following is delimiter) 
>                SGML: SGML-, HTML (< 4) and Netscape style 
>			decimal &#DDDDDD; 
>                SGML_DD: SGML-style &amp;#DDDDDD; 
>Rick Jelliffe
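
The "lossless" transcoding Rick describes can be illustrated with a minimal Python sketch. This is not GLUE or xml-tcs code: the function name `transcode_lossless`, the `NCR_STYLES` table, and the default target encoding are all assumptions made for illustration, and only a few of the styles listed above are modelled. The sketch assumes a stateless, ASCII-compatible target encoding and characters in the Basic Multilingual Plane.

```python
# Hypothetical style table modelled on a few of the xml-tcs NCR styles
# listed in the message; the names and templates here are illustrative.
NCR_STYLES = {
    "UNICODE": "U+{:04X}",       # Unicode-style U+HHHH
    "JAVA": "\\u{:04X}",         # Java-style \uHHHH (BMP characters only)
    "XML": "&#x{:04X};",         # XML-style hexadecimal reference
    "XML_DD": "&amp;#x{:04X};",  # XML-style with the ampersand escaped again
    "SGML": "&#{:d};",           # SGML-style decimal reference
}


def transcode_lossless(text, target="iso-8859-1", style="XML"):
    """Encode text into the target encoding, writing any character the
    target cannot represent as an NCR instead of dropping it."""
    template = NCR_STYLES[style]
    out = bytearray()
    for ch in text:
        try:
            # Characters the target encoding knows are encoded directly.
            out += ch.encode(target)
        except UnicodeEncodeError:
            # Everything else survives as an ASCII-only reference.
            out += template.format(ord(ch)).encode("ascii")
    return bytes(out)


if __name__ == "__main__":
    # U+00E9 exists in ISO 8859-1; U+4E2D does not and becomes an NCR.
    print(transcode_lossless("caf\u00e9 \u4e2d"))
    print(transcode_lossless("\u4e2d", "ascii", "SGML"))
```

For the decimal SGML style alone, Python's built-in `errors="xmlcharrefreplace"` handler for `str.encode` gives the same effect; the explicit loop is used here only so the reference style can be swapped.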
Simon St.Laurent
XML Elements of Style / XML: A Primer, 2nd Ed.
Building XML Applications
Inside XML DTDs: Scientific and Technical
Cookies / Sharing Bandwidth

This is xml-dev, the mailing list for XML developers.
To unsubscribe, mailto:majordomo@xml.org&BODY=unsubscribe%20xml-dev
List archives are available at http://xml.org/archives/xml-dev/threads.html


