From: <nizar.hirani@citicorp.com>
> Is the SAX Parser able to handle Kanji characters? Any help/pointers are
> appreciated.
The problem is probably that your document is in an encoding that uses
escape sequences. When the parser reads it as a different encoding (e.g. the
default, UTF-8), it correctly flags the ESC character (0x1B) as an error,
because ESC is not a legal XML character.
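As a rough illustration of that failure, here is a minimal sketch in Python. It uses Python's xml.sax (backed by expat) as a stand-in for whatever SAX parser you are using, and the sample text and the try_parse helper are my own assumptions, not anything from your setup.

```python
import xml.sax

# "漢字" (the word "kanji") encoded as ISO-2022-JP.
# The result starts and ends with escape sequences containing ESC (0x1B).
kanji_bytes = "漢字".encode("iso2022_jp")

# No encoding declaration, so the parser falls back to UTF-8.
doc = b"<doc>" + kanji_bytes + b"</doc>"


def try_parse(data):
    """Hypothetical helper: return None on success, else the parser's error message."""
    try:
        xml.sax.parseString(data, xml.sax.ContentHandler())
        return None
    except xml.sax.SAXParseException as exc:
        return exc.getMessage()


error = try_parse(doc)
print(kanji_bytes)  # shows the raw bytes, including the ESC sequences
print(error)        # the parser's complaint about the mislabelled data
```

The exact error message depends on the parser; the point is only that the ESC bytes make the document not well-formed when it is read as UTF-8.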
There are three main Japanese encodings in common use: ISO 2022, Shift JIS and
EUC. Each has several variants and extensions, and documents can also be in
Unicode encodings, which have variants of their own. It is a very good thing
that XML systems can often detect that your data has been mislabelled, isn't it!
Otherwise you could load wrongly decoded data into a database and corrupt it
without noticing.
Your text is probably encoded using ISO-2022-JP (JIS) encoding.
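If that guess is right, one possible workaround is to transcode the data to UTF-8 before handing it to the parser. The sketch below assumes the input really is ISO-2022-JP and carries no XML declaration naming another encoding; TextCollector and parse_iso2022jp are hypothetical names, and Python's xml.sax again stands in for your parser.

```python
import xml.sax


class TextCollector(xml.sax.ContentHandler):
    """Hypothetical handler that gathers all character data."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def characters(self, content):
        self.chunks.append(content)


def parse_iso2022jp(raw):
    """Decode bytes assumed to be ISO-2022-JP, re-encode as UTF-8, then parse."""
    text = raw.decode("iso2022_jp")       # fails loudly if the guess is wrong
    handler = TextCollector()
    xml.sax.parseString(text.encode("utf-8"), handler)
    return "".join(handler.chunks)


raw = "<doc>漢字</doc>".encode("iso2022_jp")
result = parse_iso2022jp(raw)
print(result)
```

The cleaner fix, where your parser supports the encoding, is to label the document correctly in its XML declaration instead of transcoding.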
If you work much with Far Eastern data, I recommend you read Ken
Lunde's "CJKV Information Processing" (Chinese, Japanese, Korean and
Vietnamese computing) from O'Reilly. It is an amazing book.
On the WWW see http://lfw.org/text/jp.html#iso2022
Cheers
Rick Jelliffe