- From: Michel Rodriguez <mrodrigu@ieee.org>
- To: xml-dev@lists.xml.org
- Date: Fri, 15 Dec 2000 07:28:20 -0500 (EST)
Hi,
I have a problem, and I would like to know if there is any way to solve
it in a generic, XML clean way. I would be particularly interested in
knowing whether XSLT could be up to the task.
I am converting files from a word processor format (MIF) to XML.
I use a 2-pass process: first go from MIF to a first XML file then from
this to the final XML file.
The first pass essentially consists of turning each character or
paragraph style in the original file into an element. The result is a
"flat" XML file: it lacks the "superstructure" of enclosing XML elements
(list items are tagged as such, but lists are not, for example).
The second pass is then of course to wrap the enclosing elements around the
low-level ones. And this is the one that's causing me problems.
I would like to use a mechanism similar to what Frame does with their own
"Conversion Tables": using a table to describe the content of wrapping
tags with a regexp-like syntax.
The example below uses a table like:
officers : person[officer]+
perslist : officers, person+
But my real software uses rules as complex as
stdtitle : stddes*, stddesmo?, reaf?, stdcoll?, titlemod?, rev?, title+
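To make the table syntax concrete, here is a minimal Perl sketch of how one rule line could be split into its parts. The function name parse_rule and the hash keys name, att and quant are hypothetical, and the grammar it accepts (comma-separated items, an optional [attribute], an optional ?, * or + quantifier) is only what the examples above suggest, not a full specification.

```perl
use strict;
use warnings;

# Split one table line, e.g. "officers : person[officer]+", into the wrapper
# tag and a list of items. Each item is a hash holding the element name, an
# optional attribute value and an optional quantifier (?, * or +).
# parse_rule and the hash keys are hypothetical names used for illustration.
sub parse_rule {
    my ($line) = @_;
    my ($tag, $content) = $line =~ /^\s*(\w+)\s*:\s*(.+?)\s*$/
        or die "cannot parse rule: $line\n";
    my @items;
    foreach my $item (split /\s*,\s*/, $content) {
        my ($name, $att, $quant) = $item =~ /^(\w+)(?:\[(\w+)\])?([?*+]?)$/
            or die "cannot parse item: $item\n";
        push @items, { name => $name, att => $att // '', quant => $quant };
    }
    return ($tag, \@items);
}

# Example: list the items of a shortened version of the stdtitle rule.
my ($tag, $items) = parse_rule('stdtitle : stddes*, stddesmo?, title+');
print "$tag wraps: ", join(' ', map { "$_->{name}$_->{quant}" } @$items), "\n";
```

Keeping the parsed rule as data rather than turning it straight into a regexp makes it possible to target either a text-based or a tree-based matcher later.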
The initial XML is something like:
<doc>
<!-- stuff -->
<!-- there are actually _NO_ comments in the file -->
<!-- names have been changed to protect the identity of the victims-->
<person er="officer">Ms Foo</person>
<person er="officer">Mr Bar</person>
<person>Mss Toto</person>
<person>Dr Tata</person>
<!-- other stuff -->
</doc>
And I want a result like:
<doc>
<!-- stuff -->
<perslist>
<officers>
<person>Ms Foo</person>
<person>Mr Bar</person>
</officers>
<person>Mss Toto</person>
<person>Dr Tata</person>
</perslist>
<!-- other stuff -->
</doc>
Now here is my problem:
As you can see it involves selecting a range of elements, defined by a
regexp (or grammar) like expression and wrapping it in an element. Then
the result might be used by the next rule.
Although I am not an expert in XSLT, I believe there is no provision in it
to select such a range (I'd love to be wrong here!).
So the choice I seem to be facing is either writing a regexp engine on top
of an existing parser or using an existing regexp engine and writing a parser.
So of course I faked a parser and used the Perl regexp engine.
My question is: is there any XML tool (existing or future) that could help
me? Maybe XPointer? I seem to be doing something just slightly outside the
usual scope of XML tools, just outside enough that I can't use them at
all. Any idea?
FYI here is the Perl solution I found. It is regexp based, so I consider it
quite unsafe, although as I am generating the XML myself I am sure no
comment, CDATA section or other entity problem is going to trip me up.
It is generic: adding a rule simply amounts to adding a line in %wrap and
an item in @wrap. It gives me the full power of Perl regular expressions
(or at least more than enough of it) when I write the right part of the
rule. It is just not XML based. Which I think is a Bad Thing (tm).
# the main table
my %wrap= ( officers => 'person[officer]+',
perslist => 'officers?, person+');
# needed to apply the rules in the right order
my @wrap=( 'officers', 'perslist');
my %wrapper; # stores subroutine that will do the replacement
local $/; # undefine the input record separator so one read gets everything
my $xml= <$infile>; # slurp the whole file in memory ($infile must already be open)
{ foreach my $tag (@wrap)
{ wrap( $xml, $tag, $wrap{$tag}); } # easy!
print $xml; # spit it out
}
# apply the transformation for a given rule
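sub wrap($$$)
{ my( undef, $tag, $expr)= @_;
# create the wrapper if needed
$wrapper{$tag}||= make_wrapper( $expr, $tag);
# call the wrapper on $_[0], the alias to the caller's $xml, so the
# substitution is not lost on a local copy
$wrapper{$tag}->( $_[0]);
}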
# this is where the real work is done
sub make_wrapper($$)
{ my( $expr, $tag)= @_;
my $att= '';
my $subr;
# figure out whether an attribute should be included
if( $tag=~ /(\w+)\[(\w+)\]/)
{ $tag=$1; $att=$2 };
# build the ugly regexp from the nicer syntax
# no attribute given
$expr=~ s{(\w+)\b(?![\[\]])}{(<$1.*?</$1>\\\s*)}g;
# attribute given
$expr=~ s{(\w+)\[(\w+)\]}{(<$1 er=\"$2\".*?</$1>\\\s*)}g;
$expr=~ s{,\s*} {\\\s*}g;
# build the wrapper subroutine, replacing the expression by the tag
if( $att)
{ $subr= "{ ".'$_[0]'."=~ s{($expr)}" .
"{<$tag er=\"$att\">\n".'$1'."\n</$tag>}gos;} ";
}
else
{ # no /x modifier: the generated pattern contains significant spaces
$subr= "{ ".'$_[0]'."=~ s{($expr)}" .
"{<$tag>\n".'$1'."\n</$tag>}gos;} ";
}
# create a subroutine which will carry on the substitution
return eval "sub { $subr }";
}
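
For comparison, the other option mentioned above (a regexp engine on top of a parser) could look roughly like the following sketch. It assumes the siblings have already been parsed into a list of hashes, encodes each one as a token of the form <name=att>, runs the Perl regexp engine on the token string, and maps the matches back to ranges of nodes. The names rule_to_regex and wrap_siblings, the token format and the node layout are all hypothetical; the sketch also assumes attribute values never contain < or >.

```perl
use strict;
use warnings;

# Turn a rule body such as 'officers?, person+' into a regexp over tokens.
# Each sibling is encoded as one token "<name=att>"; an item written without
# [att] accepts any attribute value. All names here are hypothetical.
sub rule_to_regex {
    my ($content) = @_;
    my $re = '';
    foreach my $item (split /\s*,\s*/, $content) {
        my ($name, $att, $quant) = $item =~ /^(\w+)(?:\[(\w+)\])?([?*+]?)$/
            or die "cannot parse item: $item\n";
        my $token = defined $att ? "<$name=$att>" : "<$name=[^>]*>";
        $re .= "(?:$token)$quant";
    }
    return qr/$re/;
}

# Wrap every run of siblings matching the rule in a new $tag node.
# $nodes is an array ref of hashes such as { name => 'person', er => 'officer' }.
sub wrap_siblings {
    my ($nodes, $tag, $content) = @_;
    my $re     = rule_to_regex($content);
    my $string = join '', map { "<$_->{name}=" . ($_->{er} // '') . ">" } @$nodes;
    my @runs;
    while ($string =~ /$re/g) {
        my ($from, $to) = ($-[0], $+[0]);         # character offsets of the match
        my $before  = substr($string, 0, $from);
        my $matched = substr($string, $from, $to - $from);
        my $first   = ($before  =~ tr/<//);       # number of tokens before the match
        my $count   = ($matched =~ tr/<//);       # number of tokens in the match
        push @runs, [$first, $count] if $count;   # ignore empty matches
    }
    # splice from the end so that earlier indices stay valid
    foreach my $run (reverse @runs) {
        my ($first, $count) = @$run;
        my @wrapped = splice @$nodes, $first, $count;
        splice @$nodes, $first, 0, { name => $tag, children => \@wrapped };
    }
    return $nodes;
}

# The four person elements of the example document, as a flat sibling list.
my @doc = (
    { name => 'person', er => 'officer', text => 'Ms Foo' },
    { name => 'person', er => 'officer', text => 'Mr Bar' },
    { name => 'person', text => 'Mss Toto' },
    { name => 'person', text => 'Dr Tata' },
);
wrap_siblings(\@doc, 'officers', 'person[officer]+');
wrap_siblings(\@doc, 'perslist', 'officers?, person+');
print join(' ', map { $_->{name} } @doc), "\n";
```

Because the regexp only ever sees element names and attribute values, comments, CDATA sections and text content cannot interfere with the match, at the cost of needing a parsed tree first.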
Ouf! That's all!
Michel Rodriguez
Perl & XML
http://www.xmltwig.com
Toulouse Perl Mongers: http://hfb.pm.org/~toulouse/