OASIS Mailing List Archives

Re: [xml-dev] Copying text (curly quotes) from Word into an XML document (UTF-8): what happens?

On 02/09/07, G. Ken Holman <gkholman@cranesoftwrights.com> wrote:
> >Notepad doesn't understand UTF-8 encoded files.
> False ... I just opened Notepad and wrote out a file using UTF-8 and
> opened it up again and it was preserved.  An XML processor read the
> file and didn't complain about the encoding.  I'm running XP.
If you save as UTF-8 from Notepad, it adds a BOM (EF BB BF), which lets
Notepad recognise the file as UTF-8 in future but which isn't accepted
by some XML parsers, such as the default one shipped with Java 1.4
(Crimson). See http://lists.xml.org/archives/xml-dev/200106/msg00358.html
for discussion of whether XML should be changed to make such files legal
XML. Other editors often don't add the BOM when saving as UTF-8, and if
you open such a file in Notepad it doesn't deduce that it's UTF-8
(there isn't an easy way to do that). So Notepad can't produce files
that can be processed by some UTF-8 compliant applications, including
spec-compliant XML parsers, and it can't process UTF-8 encoded files
created by some other applications. The same applies to the UTF-8
encoding used by the .NET XML writer: it adds a BOM, which confuses
applications expecting UTF-8 encoded XML to start with '<' or
whitespace.
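
As a rough illustration of the two problems above, here is a minimal Python sketch. It assumes you have the raw bytes of the file in hand; the helper names strip_utf8_bom and looks_like_utf8 are made up for this example and don't come from any library. The first drops a leading EF BB BF so that a consumer expecting '<' as the first byte is happy. The second is only a heuristic for BOM-less input: it checks that the bytes decode as strict UTF-8, which pure ASCII and some legacy-encoded text will also pass.

```python
import codecs
import xml.etree.ElementTree as ET


def strip_utf8_bom(data: bytes) -> bytes:
    """Return data without a leading UTF-8 BOM (EF BB BF), if present."""
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):]
    return data


def looks_like_utf8(data: bytes) -> bool:
    """Heuristic only: True if the bytes decode as strict UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# Simulate what Notepad writes: a BOM followed by the XML text.
notepad_bytes = codecs.BOM_UTF8 + "<q>\u201cquoted\u201d</q>".encode("utf-8")

clean = strip_utf8_bom(notepad_bytes)  # now starts with b"<"
root = ET.fromstring(clean)            # parsed as UTF-8 by default

# The same text saved in Windows-1252 is not valid UTF-8.
legacy_bytes = b"<q>\x93quoted\x94</q>"
```

Stripping the BOM before handing the bytes to a parser sidesteps the question of whether that particular parser tolerates it.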

I got the codepoint wrong for the curly quotes.
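
For reference, a small Python sketch that prints the Unicode code points of the four curly quote characters along with their UTF-8 and Windows-1252 bytes. The describe helper is a made-up name for this example, and the sketch doesn't try to restate which value was given earlier in the thread.

```python
import unicodedata


def describe(ch: str) -> tuple:
    """Return (code point, Unicode name, UTF-8 bytes, cp1252 bytes)."""
    utf8_hex = " ".join(f"{b:02X}" for b in ch.encode("utf-8"))
    cp1252_hex = " ".join(f"{b:02X}" for b in ch.encode("cp1252"))
    return (f"U+{ord(ch):04X}", unicodedata.name(ch), utf8_hex, cp1252_hex)


# Left/right single and left/right double quotation marks.
for ch in "\u2018\u2019\u201c\u201d":
    print(*describe(ch), sep=" | ")
```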



Copyright 1993-2007 XML.org. This site is hosted by OASIS