   Ugly XML processing looking for a generic XML solution (long)

  • From: Michel Rodriguez <mrodrigu@ieee.org>
  • To: xml-dev@lists.xml.org
  • Date: Fri, 15 Dec 2000 07:28:20 -0500 (EST)


I have a problem, and I would like to know if there is any way to solve
it in a generic, XML-clean way. I would be particularly interested in
knowing whether XSLT could be up to the task.

I am converting files from a word processor format (MIF) to XML. 

I use a 2-pass process: first go from MIF to a first XML file then from
this to the final XML file.

The first pass essentially consists of turning each character or
paragraph style in the original file into an element. The result is a
"flat" XML file: it lacks the "superstructure" of enclosing XML elements
(list items are tagged as such, but lists are not, for example).

The second pass then of course has to wrap the enclosing elements around
the low-level ones. And this is the one that's causing me problems.

I would like to use a mechanism similar to what Frame does with their own
"Conversion Tables": using a table to describe the content of wrapping
tags with a regexp-like syntax.

The example below uses a table like:

  officers : person[officer]+
  perslist : officers, person+

But my real software uses rules as complex as
  stdtitle : stddes*, stddesmo?, reaf?, stdcoll?, titlemod?, rev?, title+
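
To make the rule syntax concrete, here is a minimal sketch of how one line of
such a table could be split into its parts. The helper name parse_rule and the
hash keys are hypothetical and are not part of the script further down; the
sketch only assumes the "name : item, item, ..." shape shown above, where an
item is an element name, an optional [attribute value] and an optional
quantifier.

```perl
use strict;
use warnings;

# Hypothetical helper: split "name : item, item, ..." into the wrapper name
# and a list of items (element, optional attribute value, optional quantifier).
sub parse_rule {
    my ($line) = @_;
    my ($name, $content) = $line =~ /^\s*(\w+)\s*:\s*(.+?)\s*$/
        or die "not a rule: $line";
    my @items;
    foreach my $item (split /\s*,\s*/, $content) {
        $item =~ /^(\w+)(?:\[(\w+)\])?([*+?]?)$/
            or die "bad item '$item' in rule '$name'";
        push @items, { element => $1, attribute => $2, quantifier => $3 };
    }
    return ($name, \@items);
}

my ($name, $items) = parse_rule(
    'stdtitle : stddes*, stddesmo?, reaf?, stdcoll?, titlemod?, rev?, title+');
print "$name wraps: ",
      join(' ', map { "$_->{element}$_->{quantifier}" } @$items), "\n";
```

A structure like this would let a rule be checked for typos before any
regexp is built from it.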

  The initial XML is something like:

     <!-- stuff -->
     <!-- there are actually _NO_ comments in the file -->
     <!-- names have been changed to protect the identity of the victims-->
     <person er="officer">Ms Foo</person> 
     <person er="officer">Mr Bar</person> 
     <person>Mss Toto</person>
     <person>Dr Tata</person>
     <!-- other stuff -->

   And I want a result like:

     <!-- stuff -->
     <perslist>
       <officers>
         <person>Ms Foo</person>
         <person>Mr Bar</person>
       </officers>
       <person>Mss Toto</person>
       <person>Dr Tata</person>
     </perslist>
     <!-- other stuff -->

Now here is my problem: 

As you can see, it involves selecting a range of elements, defined by a
regexp-like (or grammar-like) expression, and wrapping it in an element.
The result might then be used by the next rule.

Although I am not an expert in XSLT, I believe there is no provision in it
to select such a range (I'd love to be wrong here!).

So the choice I seem to be facing is either writing a regexp engine on top
of an existing parser or using an existing regexp engine and writing a parser.

So of course I faked a parser and used the Perl regexp engine.
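
A possible middle road, which is not what the script below does, would be to
keep a real parser and run the regexp engine over the element names instead of
over the markup. The sketch below only illustrates that idea: the @siblings
list stands in for whatever a parser would return, and the helpers token and
find_ranges are hypothetical names. Each sibling becomes one token ending in
';', so a match offset can be converted back into a range of sibling indexes.

```perl
use strict;
use warnings;

# Assumption: a real XML parser has already produced this flat list of siblings.
my @siblings = (
    { name => 'person', er => 'officer', text => 'Ms Foo'   },
    { name => 'person', er => 'officer', text => 'Mr Bar'   },
    { name => 'person',                  text => 'Mss Toto' },
    { name => 'person',                  text => 'Dr Tata'  },
);

# Hypothetical helper: one token per element, "name" or "name[er value]".
sub token {
    my ($elt) = @_;
    return defined $elt->{er} ? $elt->{name} . '[' . $elt->{er} . ']'
                              : $elt->{name};
}

# Hypothetical helper: return [first, last] index pairs for every run of
# siblings whose tokens match $token_re.
sub find_ranges {
    my ($siblings, $token_re) = @_;
    my $tokens = join '', map { token($_) . ';' } @$siblings;
    my @ranges;
    while ($tokens =~ /(?<![^;])(?:$token_re)/g) {   # start on a token boundary
        my ($from, $to) = ($-[0], $+[0]);
        next if $from == $to;                        # skip empty matches
        # count the ';' terminators before the match and inside it
        my $first = () = substr($tokens, 0, $from)           =~ /;/g;
        my $count = () = substr($tokens, $from, $to - $from) =~ /;/g;
        push @ranges, [ $first, $first + $count - 1 ];
    }
    return @ranges;
}

# officers : person[officer]+
foreach my $range (find_ranges(\@siblings, qr/(?:person\[officer\];)+/)) {
    print "wrap siblings $range->[0]..$range->[1] in <officers>\n";
}
```

The regexp never sees text content, comments or CDATA sections this way, which
is the main attraction; the price is that a parser and a tree have to be kept
around.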

My question is: is there any XML tool (existing or future) that could help
me? Maybe XPointer? I seem to be doing something just slightly outside the
usual scope of XML tools, just outside enough that I can't use them at
all. Any ideas?

FYI here is the Perl solution I found. It is regexp based, so I consider
it quite unsafe, although as I am generating the XML myself I am sure no
comment, CDATA section or entity problem is going to trip me up.

It is generic: adding a rule simply amounts to adding a line in %wrap and
an item in @wrap. It gives me the full power of Perl regular expressions
(or at least more than enough of it) when I write the right-hand side of
the rule. It is just not XML based. Which I think is a Bad Thing (tm).

   use strict;
   use warnings;

   # the main table
   my %wrap= ( officers         => 'person[officer]+',
               perslist         => 'officers?, person+');

   # needed to apply the rules in the right order
   my @wrap=( 'officers', 'perslist');

   my %wrapper;          # stores subroutine that will do the replacement

   my $infile= \*STDIN;  # assumption: the first-pass XML arrives on STDIN

   local $/;             # no record separator...
   my $xml= <$infile>;   # ...so slurp the whole file in memory

   foreach my $tag (@wrap)
     { wrap( $xml, $tag, $wrap{$tag}); }           # easy!

   print $xml;                                     # spit it out

   # apply the transformation for a given rule
   # $_[0] is the document itself (an alias, not a copy): it is modified in place
   sub wrap
     { my( undef, $tag, $expr)= @_;
       # create the wrapper if needed
       $wrapper{$tag}||= make_wrapper( $expr, $tag);
       $wrapper{$tag}->( $_[0]);
     }

   # this is where the real work is done
   sub make_wrapper
     { my( $expr, $tag)= @_;
       my $att= '';
       my $subr;

       # figure out whether an attribute should be included
       if( $tag=~ /(\w+)\[(\w+)\]/)
         { $tag=$1; $att=$2; }

       # build the ugly regexp from the nicer syntax
       # no attribute given
       $expr=~ s{(\w+)\b(?![\[\]])}{(<$1.*?</$1>\\s*)}g;
       # attribute given
       $expr=~ s{(\w+)\[(\w+)\]}{(<$1 er="$2".*?</$1>\\s*)}g;
       # a comma only separates items: allow whitespace there
       $expr=~ s{,\s*}{\\s*}g;

       # build the wrapper subroutine, replacing the expression by the tag
       # (no /x modifier: the generated regexp contains a significant space)
       if( $att)
         { $subr= '$_[0]=~ s{('.$expr.')}' .
                  "{<$tag er=\"$att\">\n".'$1'."\n</$tag>}gs;";
         }
       else
         { $subr= '$_[0]=~ s{('.$expr.')}' .
                  "{<$tag>\n".'$1'."\n</$tag>}gs;";
         }
       # create a subroutine which will carry on the substitution
       my $code= eval "sub { $subr }"
         or die "cannot compile the wrapper for <$tag>: $@";
       return $code;
     }

   Ouf! That's all!
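
To see what the "ugly regexp" looks like for a given rule, the three
substitutions from make_wrapper can be isolated and their result printed.
This is only an inspection aid: the helper name rule_to_regexp is
hypothetical, and the substitutions are copied from the script above with the
escaping slightly simplified.

```perl
use strict;
use warnings;

# Hypothetical helper: turn the right-hand side of a rule into the regexp
# text that the wrapper would use.
sub rule_to_regexp {
    my ($expr) = @_;
    $expr =~ s{(\w+)\b(?![\[\]])}{(<$1.*?</$1>\\s*)}g;       # plain element
    $expr =~ s{(\w+)\[(\w+)\]}{(<$1 er="$2".*?</$1>\\s*)}g;  # element with attribute
    $expr =~ s{,\s*}{\\s*}g;                                 # item separator
    return $expr;
}

print rule_to_regexp('person[officer]+'), "\n";
# prints: (<person er="officer".*?</person>\s*)+
print rule_to_regexp('officers?, person+'), "\n";
# prints: (<officers.*?</officers>\s*)?\s*(<person.*?</person>\s*)+
```

Note that <person would also match the start of a longer element name, one
more reason to treat the regexp approach as unsafe on XML not generated by
the first pass.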

Michel Rodriguez
Perl & XML
Toulouse Perl Mongers: http://hfb.pm.org/~toulouse/

