OASIS Mailing List Archives


   Re: Non-XML documents to XML Converter?

  • From: Paul Prescod <paul@prescod.net>
  • To: "xml-dev@ic.ac.uk" <xml-dev@ic.ac.uk>
  • Date: Tue, 18 May 1999 08:58:16 -0500

"Roger L. Costello" wrote:
> 
> Interestingly, while driving in this morning I realized that this is
> what an XSL processor does.  The only difference is that an XSL
> Processor has (1) hardcoded to use <...> as the delimiter.
> 
> I think that it would be interesting to make an XSL Processor more
> generic such that you could "plug in" a format description document.
> Thus, the XSL Processor could transform not just XML documents, but any
> kind of documents.  Comments?

From a formal languages point of view, your "format description document"
is a grammar, and grammar construction is not very easy. I mean, your
particular non-XML syntax is easy, but what about the C++ grammar? I don't
think that there is any grammar-based parsing tool that can both handle
the full generality of context-free languages and have high performance.
:(

Another way to approach it is to abandon the grammar and just embed the
parsing logic directly in some computer program. This is typically what
Perl, Python and OmniMark programmers do (though there are formal parser
packages for Perl and Python).

For your simple language, either mechanism would be easy. In fact, it looks
like about a fifteen-line Python program to me. Here's the start of one
that optimizes readability over performance:

import fileinput

# Read all input (files named on the command line, or stdin) as one string,
# then split it into records on the "//" delimiter.
data = "".join(fileinput.input())
records = data.split("//")

counter = 0
for record in records:
    counter = counter + 1
    # Fields within a record are separated by a single "/".
    parts = record.split("/")
    if parts[0] == "fruit":
        print("<message%s setid='%s'>" % (counter, parts[0]))
        ...
    elif parts[0] == "...":
        ...

You can see how the "parsing" logic is spread through the program. In this
case that doesn't matter much because the language is so simple.
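
For contrast, here is a rough sketch of keeping the "format description"
apart from the output logic. It assumes the format boils down to two
delimiters, which is only a guess from the fragment above. The names
parse_records and to_xml, the <record> and <field> element names, and the
sample data are all hypothetical, made up for illustration.

```python
from xml.sax.saxutils import escape

def parse_records(text, record_sep="//", field_sep="/"):
    """Split text into records, and each record into a list of fields.

    The two separators are the whole "format description" here (an
    assumption; a richer format would need a real grammar).
    """
    records = []
    for chunk in text.split(record_sep):
        chunk = chunk.strip()
        if chunk:  # skip empty chunks, e.g. after a trailing delimiter
            records.append(chunk.split(field_sep))
    return records

def to_xml(records):
    """Emit one <record> element per record and one <field> per field."""
    lines = []
    for fields in records:
        lines.append("<record>")
        for field in fields:
            # escape() protects &, < and > in the field text.
            lines.append("  <field>%s</field>" % escape(field))
        lines.append("</record>")
    return "\n".join(lines)

if __name__ == "__main__":
    # Hypothetical sample input, not taken from the original message.
    print(to_xml(parse_records("fruit/apple/red//fruit/pear/green//")))
```

Swapping in other delimiters (say record_sep=";" and field_sep=",")
changes the input format without touching to_xml, which is about as far as
this approach goes before you need a real grammar.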

As an aside: your document type is a little odd. I don't think it is
intuitive or convenient to give every message a unique generic identifier
("tagname"). The whole point of the generic identifier is that it should
identify a *genre* -- i.e. all messages, or all fruit messages, etc. On
the other hand, you've got something called "setid" which seems to me to
be the right place for an element-unique identifier -- but you seem to
have put the generic identifier there!
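
One possible reading of that suggestion, as a sketch only: name the element
after the kind of message and put the per-message counter in setid. The
helper start_tag is hypothetical, and it assumes the kind string is already
a legal XML element name.

```python
from xml.sax.saxutils import quoteattr

def start_tag(kind, counter):
    """Build a start tag named after the message kind, with a unique setid."""
    # Assumption: kind is a legal XML element name; it is not validated here.
    # quoteattr() escapes the value and wraps it in quotes.
    return "<%s setid=%s>" % (kind, quoteattr(str(counter)))

if __name__ == "__main__":
    print(start_tag("fruit", 1))
```

With this arrangement every fruit message shares one element type, and
setid is what tells two fruit messages apart.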

-- 
 Paul Prescod  - ISOGEN Consulting Engineer speaking for only himself
 http://itrc.uwaterloo.ca/~papresco

The dress code in Las Cruces New Mexico has been tightened [to] target 
Gothic clothing, such as dark trench coats. "It is not a witch hunt"
Superintendent Jesse L. Gozales said. "It is for the safety of the kids
in our schools."  - Associated Press, May 16 1999
