Re: [xml-dev] Copying text (curly quotes) from Word into an XML document (UTF-8): what happens?
- From: "Pete Kirkham" <mach.elf@gmail.com>
- To: xml-dev@lists.xml.org
- Date: Mon, 3 Sep 2007 01:07:19 +0100
On 02/09/07, G. Ken Holman <gkholman@cranesoftwrights.com> wrote:
> >Notepad doesn't understand UTF-8 encoded files.
>
> False ... I just opened Notepad and wrote out a file using UTF-8 and
> opened it up again and it was preserved. An XML processor read the
> file and didn't complain about the encoding. I'm running XP.
If you save as UTF-8 from Notepad, it adds a BOM (EF BB BF), which lets
Notepad recognise the file as UTF-8 in future, but which isn't recognised
by some XML parsers, such as the default one shipped with Java 1.4
(Crimson). See http://lists.xml.org/archives/xml-dev/200106/msg00358.html
for discussion of whether XML should be changed to make such files legal
XML. Other editors often don't add the BOM when saving as UTF-8, and if
you open such files in Notepad it doesn't deduce that they are UTF-8
(which there isn't an easy way to do). So Notepad isn't able to
produce files which can be processed by some UTF-8 compliant
applications, including spec-compliant XML parsers, and is not able to
process UTF-8 encoded files created by some other applications. The
same applies to the UTF-8 encoding used by the .NET XML writer: it
adds a BOM, which confuses applications expecting UTF-8 encoded XML to
start with '<' or whitespace.
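For anyone who wants to work around the BOM issue described above, here is
a minimal sketch in Python, assuming the file contents are already in
memory as bytes. The function name strip_utf8_bom and the sample document
are my own inventions for illustration; only codecs.BOM_UTF8 and the
"utf-8-sig" codec come from the standard library.

```python
import codecs


def strip_utf8_bom(data: bytes) -> bytes:
    """Return data without a leading UTF-8 BOM (EF BB BF), if one is present."""
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):]
    return data


# Hypothetical sample: what Notepad would write for a tiny XML document
# containing curly quotes (U+201C and U+201D) saved as UTF-8.
with_bom = codecs.BOM_UTF8 + '<q>\u201chi\u201d</q>'.encode("utf-8")

clean = strip_utf8_bom(with_bom)

# After stripping, the first byte is '<', which is what a BOM-unaware
# parser expects to see.
first_byte = clean[:1]

# The "utf-8-sig" codec does the same job while decoding: it drops a
# leading BOM if there is one and otherwise behaves like plain UTF-8.
text = with_bom.decode("utf-8-sig")
```

Stripping the BOM before handing the bytes to a parser that rejects it
sidesteps the problem without touching the rest of the file.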
I got the codepoint wrong for the curly quotes.
Pete