
RE: [xml-dev] To continue parsing after a fatal error.



Anoop,

I would suggest that you figure out exactly what kind of problem you have
before deciding how to fix it.

I had a variation of this problem, due to a simplistic export of contact
(names and addresses) data from a Windows app into an XML file.  The problem
in this case was caused by characters in the 8-bit range above 0x7F, i.e.
accented characters, Norwegian characters, etc., which are valid in the
source code page but are not 7-bit ASCII.  Since the source app uses the
Latin1 code page, whose code points map directly onto the first 256 Unicode
code points, I implemented the conversion as a Latin1-to-UTF-8 translation
filter in C:

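#include <stdio.h>

/* Filter: ISO-8859-1 (Latin1) on stdin -> UTF-8 on stdout. */
int main(void)
{
  int c;                              /* int, so EOF stays distinct from 0xFF */
  while( (c = getchar()) != EOF )
  {
    if( c > 0x7F )
      {
        putchar(0xC0 | (c >> 6));     /* lead byte 110000xx: top 2 bits */
        putchar(0x80 | (c & 0x3F));   /* continuation 10xxxxxx: low 6 bits */
      }
    else
      putchar(c);                     /* 7-bit ASCII passes through unchanged */
  }
  return 0;
}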
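
As a worked check of the byte arithmetic in the filter above, here is the
same mapping written as a standalone function.  This is only a sketch; the
name latin1_to_utf8 is made up for illustration and is not part of the
original filter.  For example, e-acute (0xE9) should come out as the two
bytes 0xC3 0xA9, and o-slash (0xF8) as 0xC3 0xB8.

```c
/* Encodes one ISO-8859-1 byte as UTF-8.  Writes 1 or 2 bytes to out
   and returns how many were written.  (Hypothetical helper name.) */
static int latin1_to_utf8(unsigned char c, unsigned char out[2])
{
    if (c < 0x80) {
        out[0] = c;                  /* 7-bit ASCII is unchanged */
        return 1;
    }
    out[0] = 0xC0 | (c >> 6);        /* lead byte carries the top 2 bits */
    out[1] = 0x80 | (c & 0x3F);      /* continuation carries the low 6 bits */
    return 2;
}
```

Every Latin1 byte above 0x7F becomes exactly two bytes, so the output can
be at most twice the size of the input.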

You also have to declare the encoding as 'UTF-8' in the XML declaration at
the top of the file, e.g. <?xml version="1.0" encoding="UTF-8"?>.

The original characters are recoverable by the reverse transform:

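#include <stdio.h>

/* Filter: UTF-8 on stdin -> ISO-8859-1 (Latin1) on stdout.
   Assumes the input contains only code points up to U+00FF. */
int main(void)
{
  int c, c2;
  while( (c = getchar()) != EOF )
  {
    if( c > 0x7F )
    {
      c2 = getchar();                 /* continuation byte */
      if( c2 == EOF )
        break;                        /* truncated sequence: stop */
      putchar(((c & 0x03) << 6) | (c2 & 0x3F));
    }
    else
      putchar(c);
  }
  return 0;
}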

Resources:
http://www.cl.cam.ac.uk/~mgk25/unicode.html
http://czyborra.com/charsets/codepages.html
http://czyborra.com/charsets/iso8859.html#ISO-8859-1

Jim

---------------------------
Jim Theriot
mailto:Jim.Theriot@posc.org
POSC -- Energy eStandards
9801 Westheimer, Suite 450
Houston TX USA 77042
+1 713 267 5109 : phone
+1 713 784 9219 : fax
---------------------------


-----Original Message-----
From: Joshua Allen [mailto:joshuaa@microsoft.com]
Sent: Tuesday, October 23, 2001 2:40 PM
To: Anoop A V; xml-dev@lists.xml.org
Cc: Julia Jia
Subject: RE: [xml-dev] To continue parsing after a fatal error.


This error should occur with any conforming XML processor.  It is quite
likely that the error is caused by a control character in the low ASCII
range.  The only way to avoid the problem is to clean up the XML on the
way in, before it is processed by MSXML.  And unfortunately I am not
aware of a way to do this without writing code to pipe the input stream
through a scrubber before passing it to MSXML.  Julia will know if there
are any code samples existing today (I doubt it).
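
A scrubber of this kind can be quite small.  The following is only a
sketch, assuming the input is in an ASCII-compatible encoding (UTF-8 or
ISO-8859-1, not UTF-16) and that the offending bytes are control
characters below 0x20, which XML 1.0 forbids except for tab, line feed and
carriage return.  The names xml_byte_ok and scrub are made up for
illustration; this is not an MSXML facility.

```c
#include <stddef.h>

/* Returns 1 if byte c may appear in XML 1.0 text, assuming an
   ASCII-compatible encoding such as UTF-8 or ISO-8859-1. */
static int xml_byte_ok(unsigned char c)
{
    return c >= 0x20 || c == 0x09 || c == 0x0A || c == 0x0D;
}

/* Copies n bytes from in to out, dropping disallowed control bytes.
   out must have room for n bytes.  Returns the number of bytes kept. */
static size_t scrub(const unsigned char *in, size_t n, unsigned char *out)
{
    size_t i, j = 0;
    for (i = 0; i < n; i++)
        if (xml_byte_ok(in[i]))
            out[j++] = in[i];
    return j;
}
```

For a file as large as 800 MB, one would presumably call scrub on
fixed-size chunks read with fread and pass each cleaned chunk on to the
parser, rather than loading the whole file.  Note that this silently drops
data, which may or may not be acceptable.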

Thanks,
Joshua




> -----Original Message-----
> From: Anoop A V [mailto:anoop_scorpio@hotmail.com]
> Sent: Tuesday, October 23, 2001 10:51 AM
> To: xml-dev@lists.xml.org
> Subject: [xml-dev] To continue parsing after a fatal error.
>
> Hi,
> I have an 800 MB file which I need to parse. When I do this using MSXML
> SAX parser, I get a fatal error with the message "Invalid character found
> in text content". And the parsing will be stopped. But I need to continue
> parsing the file even if an invalid character is met. I don't mind if that
> particular node(s) is skipped. But I need to parse the whole file. This
> file is not under my control, so there is no question of my being able to
> edit this file and remove the invalid characters. Can anybody help?
>
> Thanks.
> Anoop.
>
> -----------------------------------------------------------------
> The xml-dev list is sponsored by XML.org <http://www.xml.org>, an
> initiative of OASIS <http://www.oasis-open.org>
>
> The list archives are at http://lists.xml.org/archives/xml-dev/
>
> To subscribe or unsubscribe from this elist use the subscription
> manager: <http://lists.xml.org/ob/adm.pl>
