   Re: SAX: New Idea for Entity Resolution

  • From: Tyler Baker <tyler@infinet.com>
  • To: David Megginson <ak117@freenet.carleton.ca>
  • Date: Sun, 19 Apr 1998 21:30:39 -0400

David Megginson wrote:

> James Clark writes:
>
>  > You could just have a class that encapsulates a structure with three
>  > members:
>  >
>  > - a CharacterStream
>  > - a ByteStream
>  > - a String
>  >
>  > At least one of the CharacterStream and ByteStream must be non-null. If
>  > the ByteStream is non-null the String can specify the encoding.
>
> [Read on to the bottom for a large-ish design change.]
>
> This implies, then, the following three interfaces:
>
>   public interface ByteStream {
>     public abstract int read ()
>       throws SAXException;
>     public abstract int read (byte b[], int start, int count)
>       throws SAXException;
>   }
>
>   public interface CharacterStream {
>     public abstract int read ()
>       throws SAXException;
>     public abstract int read (char ch[], int start, int count)
>       throws SAXException;
>   }
>
>   public class InputSource {
>     // For each variable, imagine a get/set pair instead...
>     public ByteStream byteStream;
>     public CharacterStream characterStream;
>     public String encoding;
>   }
>
> The nice thing here is that all of these can live on separate systems
> in a distributed environment: the InputSource can be a C-program on a
> VAX, the CharacterStream can come from a Python program running under alpha
> Linux, and the parser can be running in Java on a Windows box.  There
> is no dependency on language- or system-specific features (except for
> java.lang.String, which should be able to map predictably to other
> languages).
>
> Now, why not take this a step further?
>
>   public class InputSource {
>     // For each variable, imagine a get/set pair instead...
>     public String publicId;
>     public String systemId;
>     public ByteStream byteStream;
>     public CharacterStream characterStream;
>     public String encoding;
>   }
>
> We'd have to define rules of precedence:
>
> 1) if there is a character stream, use it;
>
> 2) if there is no character stream but there is a byte stream, use the
>    byte stream;
>
> 3) if there is neither a character stream nor a byte stream but there
>    is a system identifier, open a connection to the system identifier;
>
> 4) if there is no character stream, byte stream, or system identifier,
>    throw an exception (or invoke the ErrorHandler).
>
> Now, we can get away with only one parse() method in
> org.xml.sax.Parser:
>
>   public abstract void parse (InputSource source)
>     throws Exception;
>
> It might still be useful to keep two separate methods in
> EntityResolver, though:
>
>   public interface EntityResolver
>   {
>     public String resolveSystemId (String publicId, String systemId)
>       throws SAXException;
>     public InputSource openEntity (String systemId)
>       throws Exception;
>   }
>
> Comments?
>
> All the best,
>
> David

This sounds like a great idea; however, I think that InputSource should be
immutable in general.  Instead of:

  public class InputSource {
    // For each variable, imagine a get/set pair instead...
    public String publicId;
    public String systemId;
    public ByteStream byteStream;
    public CharacterStream characterStream;
    public String encoding;
  }

I would suggest:

  public interface InputSource {
   String getPublicId();
   String getSystemId();
   ByteStream getByteStream();
   CharacterStream getCharacterStream();
   String getEncoding();
  }
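
To make the immutability concrete, here is a minimal sketch of a class that
implements the getter-only interface above.  The class name FixedInputSource is
hypothetical, and the two stream interfaces are cut down to a single read()
method with no checked exception purely to keep the example short; neither
simplification is part of the proposal.

```java
// Stand-ins for the stream interfaces proposed earlier in the thread
// (assumption: reduced to one method, exception clause omitted).
interface ByteStream { int read(); }
interface CharacterStream { int read(); }

// The getter-only interface from the message above.
interface InputSource {
    String getPublicId();
    String getSystemId();
    ByteStream getByteStream();
    CharacterStream getCharacterStream();
    String getEncoding();
}

// Hypothetical immutable implementation: every field is final and is
// assigned exactly once, in the constructor. There are no setters.
final class FixedInputSource implements InputSource {
    private final String publicId;
    private final String systemId;
    private final ByteStream byteStream;
    private final CharacterStream characterStream;
    private final String encoding;

    FixedInputSource(String publicId, String systemId, ByteStream byteStream,
                     CharacterStream characterStream, String encoding) {
        this.publicId = publicId;
        this.systemId = systemId;
        this.byteStream = byteStream;
        this.characterStream = characterStream;
        this.encoding = encoding;
    }

    @Override public String getPublicId() { return publicId; }
    @Override public String getSystemId() { return systemId; }
    @Override public ByteStream getByteStream() { return byteStream; }
    @Override public CharacterStream getCharacterStream() { return characterStream; }
    @Override public String getEncoding() { return encoding; }
}
```

With this shape the application decides once, at construction time, which
streams the parser gets, and the parser cannot swap them out afterwards.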

An input source should probably be immutable because it is the application that
fills in the blanks as to how the input should be retrieved.  The system ID may
not help the parser at all if the URL points to a location that the parser
cannot read on its own (for example, the underlying stream may be encrypted).
For that reason, your rules of precedence:

We'd have to define rules of precedence:

1) if there is a character stream, use it;

2) if there is no character stream but there is a byte stream, use the
   byte stream;

3) if there is neither a character stream nor a byte stream but there
   is a system identifier, open a connection to the system identifier;

4) if there is no character stream, byte stream, or system identifier,
   throw an exception (or invoke the ErrorHandler).

should be changed to something like:

We'd have to define rules of precedence:

1) if there is no character stream but there is a byte stream, use the
   byte stream;

2) if there is no byte stream but there is a character stream, use the
   character stream;

3) if there is both a character stream and a byte stream available, the
   parser may use the byte stream or the character stream, but not both
   at the same time (whichever suits the parser the best).

4) if there is neither a character stream nor a byte stream, throw an
   exception.
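
The four revised rules can be sketched as a small selection routine.  This is
only an illustration: StreamChooser, Choice, and the prefersBytes flag are
hypothetical names, the stream interfaces are cut down to one method, and
IllegalStateException stands in for whatever exception SAX would actually
define for rule 4.

```java
// Stand-ins for the stream interfaces proposed earlier in the thread.
interface ByteStream { int read(); }
interface CharacterStream { int read(); }

// Only the two getters that the rules depend on.
interface InputSource {
    ByteStream getByteStream();
    CharacterStream getCharacterStream();
}

// Hypothetical helper that a parser might use to apply the four rules.
final class StreamChooser {
    enum Choice { BYTES, CHARACTERS }

    // prefersBytes models "whichever suits the parser the best" in rule 3.
    static Choice choose(InputSource source, boolean prefersBytes) {
        boolean hasBytes = source.getByteStream() != null;
        boolean hasChars = source.getCharacterStream() != null;
        if (hasBytes && hasChars) {
            // Rule 3: both are available, the parser picks one, never both.
            return prefersBytes ? Choice.BYTES : Choice.CHARACTERS;
        }
        if (hasBytes) {
            return Choice.BYTES;        // Rule 1: only a byte stream.
        }
        if (hasChars) {
            return Choice.CHARACTERS;   // Rule 2: only a character stream.
        }
        // Rule 4: no stream at all. Note there is no fallback that opens
        // the system identifier.
        throw new IllegalStateException(
            "InputSource has neither a byte stream nor a character stream");
    }
}
```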

I don't believe the parser should try to open a connection using the system
identifier, as the system identifier alone says nothing about what steps to
take in order to retrieve the data as a stream, let alone how to secure
authorization to it in the first place.

In Java you have URLs and URLStreamHandlers, where the URL's protocol prefix is
used to look up its corresponding handler.  Though it is programmatically
convenient to just call URL.openStream(), there is no good way to control how a
specific URL's content is actually retrieved, other than by setting the system
properties that the standard URL handlers use for things like proxies or by
creating your own URLStreamHandlerFactory.  That content may need to be piped
through a variety of filters before it is in its raw form again.

I think it would be a mistake for SAX to inherit this flaw, which assumes that
the parser has access to the specified system identifier in any environment.
Instead, force the application to provide a suitable ByteStream and/or
CharacterStream for each InputSource provided.
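
As a rough sketch of what "the application provides the ByteStream" could look
like in Java, the adapter below wraps a java.io.InputStream that the
application has already opened.  InputStreamByteStream is a hypothetical name,
the ByteStream interface is reduced to its single-byte read(), and returning -1
at end of stream is an assumption borrowed from java.io rather than something
the proposal specifies.

```java
import java.io.IOException;
import java.io.InputStream;
import org.xml.sax.SAXException;

// Single-byte view of the ByteStream interface proposed earlier in the thread.
interface ByteStream {
    int read() throws SAXException;
}

// Hypothetical adapter: the application opens the stream itself (handling any
// authorization, decryption, or other filtering) and gives the parser only
// this view of the result.
final class InputStreamByteStream implements ByteStream {
    private final InputStream in;

    InputStreamByteStream(InputStream in) {
        this.in = in;
    }

    @Override
    public int read() throws SAXException {
        try {
            // Assumption: -1 signals end of stream, as in java.io.
            return in.read();
        } catch (IOException e) {
            // Report I/O failures through the exception type the interface declares.
            throw new SAXException(e);
        }
    }
}
```

Because the adapter accepts any InputStream, the application can stack its
filters first, for example
new InputStreamByteStream(new java.util.zip.GZIPInputStream(raw)) for a
compressed source, and the parser never needs to know how the bytes were
obtained.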

Tyler


xml-dev: A list for W3C XML Developers. To post, mailto:xml-dev@ic.ac.uk
Archived as: http://www.lists.ic.ac.uk/hypermail/xml-dev/
To (un)subscribe, mailto:majordomo@ic.ac.uk the following message;
(un)subscribe xml-dev
To subscribe to the digests, mailto:majordomo@ic.ac.uk the following message;
subscribe xml-dev-digest
List coordinator, Henry Rzepa (mailto:rzepa@ic.ac.uk)





 
