John Cowan <cowan@ccil.org> wrote on 05/20/2011 06:59:04 PM:
> Mike Sokolov scripsit:
>
> > BOM in UTF-8 seems to cause problems with some XML parsers
> > (incl. Xerces 2.9.1). They seem to believe it is white space in the
> > prolog. To deal with this, we have had to insert a processor prior to
> > our parser which checks for BOM and strips it out.
>
> Support for the 8-BOM was not explicitly required until the XML 1.0
> Third Edition of 2004. Xerces 2.9.1 may be out of date.
What doesn't work? Xerces has known how to handle the UTF-8 BOM for much longer than that. All releases since 2003 [1] have supported it.
Note that you need to let the parser use its own encoding support, which means handing it the raw InputStream.
Don't pass in a UTF-8 Reader from the JDK. The JDK UTF-8 InputStreamReader [2] apparently doesn't recognize the BOM and perhaps never will.
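
To make the difference concrete, here is a minimal sketch using the JAXP DocumentBuilder API. It assumes a Xerces-based parser behind JAXP (the JDK default). The class name BomDemo and its helper methods are made up for illustration; they are not part of Xerces or the JDK.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical demo class; names are illustrative only.
final class BomDemo {

    // Builds a small UTF-8 document preceded by the three BOM bytes EF BB BF.
    static byte[] sampleWithBom() {
        byte[] xml = "<?xml version=\"1.0\" encoding=\"UTF-8\"?><root/>"
                .getBytes(StandardCharsets.UTF_8);
        byte[] out = new byte[xml.length + 3];
        out[0] = (byte) 0xEF;
        out[1] = (byte) 0xBB;
        out[2] = (byte) 0xBF;
        System.arraycopy(xml, 0, out, 3, xml.length);
        return out;
    }

    // Hands the parser a byte stream so it can detect the encoding
    // (and consume the BOM) itself.
    static Document parseFromStream(byte[] bytes) {
        try {
            DocumentBuilder builder =
                    DocumentBuilderFactory.newInstance().newDocumentBuilder();
            InputStream in = new ByteArrayInputStream(bytes);
            return builder.parse(new InputSource(in));
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }

    // Shows what a JDK UTF-8 Reader delivers first: the BOM is decoded
    // into U+FEFF and passed through as an ordinary character.
    static int firstCharFromReader(byte[] bytes) {
        try (Reader r = new InputStreamReader(
                new ByteArrayInputStream(bytes), StandardCharsets.UTF_8)) {
            return r.read();
        } catch (IOException e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        byte[] data = sampleWithBom();
        Document doc = parseFromStream(data);
        System.out.println("Root element: "
                + doc.getDocumentElement().getNodeName());
        System.out.println("First char from Reader: U+"
                + Integer.toHexString(firstCharFromReader(data)).toUpperCase());
    }
}
```

If you wrap the Reader in an InputSource instead, the parser gets U+FEFF as the first character of the document, which would explain a complaint about content in the prolog.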
> --
> XQuery Blueberry DOM John Cowan
> Entity parser dot-com cowan@ccil.org
> Abstract schemata http://www.ccil.org/~cowan
> XPointer errata
> Infoset Unicode BOM --Richard Tobin
[1] http://svn.apache.org/viewvc/xerces/java/trunk/src/org/apache/xerces/impl/XMLEntityManager.java?r1=318934&r2=318940&diff_format=h
[2] http://bugs.sun.com/view_bug.do?bug_id=4508058
Michael Glavassevich
XML Parser Development
IBM Toronto Lab
E-mail: mrglavas@ca.ibm.com
E-mail: mrglavas@apache.org