SHRVL: hierarchical, summarized SVRL may be useful for implying structures in horrid documents
SHRVL: Schematron Hierarchical Report View Language
It is possible to take an SVRL document and convert it into a hierarchical XML document. This can give us another
tool in our toolbelt for feature extraction and detecting implied structures.
See https://schematron.com/document/3472.html for the feature-extraction
approach, for what it is worth. I am wondering whether this makes life
easier for some kinds of processing, because it unwraps the @location
XPaths and gives the report more of the shape of the document itself
(like a kind of reverse Examplotron?).
We collate the SVRL and transpose the @location XPaths into element names with position attributes.
An example mockup of a SHRVL-ed SVRL document, made by running a
Schematron report on a large set of large HTML documents, follows:
<shrvl>
  <html>
    <body pos="2">
      <p pos="1" found="p-with-feasible-title" />
      <div pos="2" found="div-with-small-simple-table">
        <table pos="2" found="td-with-feasible-title"/>
      </div>
      <div pos="5" found="div-with-feasible-title" />
    </body>
  </html>
</shrvl>
where the p element at pos="1" is transposed from
<svrl:successful-report id="p2-r88-a88" role="p-with-feasible-title"
    location="/html/*[2][self::body]/*[1][self::p]" >
  <svrl:text>A p with the right words may be the main title</svrl:text>
</svrl:successful-report>
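The transposition described above can be sketched roughly as follows. This is only a minimal illustration in Python, not a reference implementation: the function names (parse_location, find_or_add, svrl_to_shrvl) are made up for this sketch, it assumes @location values use the /html/*[2][self::body] style of step shown here, and it assumes the found attribute is taken from @role (falling back to @id), which the mockup does not actually pin down.

```python
import re
import xml.etree.ElementTree as ET

SVRL_NS = "http://purl.oclc.org/dsdl/svrl"

# One location step of the form *[2][self::body].
STEP = re.compile(r"^\*\[(\d+)\]\[self::([^\]]+)\]$")
# Fallback for plain steps such as html or body[2].
PLAIN = re.compile(r"^([^\[]+)(?:\[(\d+)\])?$")


def parse_location(location):
    """Split an SVRL @location into (element name, position or None) pairs."""
    steps = []
    for raw in location.strip("/").split("/"):
        if not raw:
            continue
        m = STEP.match(raw)
        if m:
            steps.append((m.group(2), m.group(1)))
            continue
        m = PLAIN.match(raw)
        if m is None:
            raise ValueError(f"unsupported location step: {raw!r}")
        steps.append((m.group(1), m.group(2)))
    return steps


def find_or_add(parent, tag, pos):
    """Reuse the child with this name and position, or create it."""
    for child in parent:
        if child.tag == tag and child.get("pos") == pos:
            return child
    child = ET.SubElement(parent, tag)
    if pos is not None:
        child.set("pos", pos)
    return child


def svrl_to_shrvl(svrl_text):
    """Collate SVRL reports and asserts into one tree shaped like the document."""
    root = ET.Element("shrvl")
    svrl = ET.fromstring(svrl_text)
    for name in ("successful-report", "failed-assert"):
        for hit in svrl.iter(f"{{{SVRL_NS}}}{name}"):
            node = root
            for tag, pos in parse_location(hit.get("location", "")):
                node = find_or_add(node, tag, pos)
            # Assumption: the feature label comes from @role, else @id.
            label = hit.get("role") or hit.get("id") or name
            found = node.get("found")
            node.set("found", f"{found} {label}" if found else label)
    return root


SAMPLE_SVRL = """<svrl:schematron-output xmlns:svrl="http://purl.oclc.org/dsdl/svrl">
  <svrl:successful-report role="div-with-feasible-title"
      location="/html/*[2][self::body]/*[5][self::div]">
    <svrl:text>A div with the right words may be the main title</svrl:text>
  </svrl:successful-report>
</svrl:schematron-output>"""

if __name__ == "__main__":
    print(ET.tostring(svrl_to_shrvl(SAMPLE_SVRL), encoding="unicode"))
```

Children are appended in the order the reports arrive, so a real version would probably want to sort siblings by pos, and would need a proper XPath parser for locations that carry namespace predicates.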
We might then give this document to our XSLT transformer, along with the
original document, so that its features can guide the transformation of
the document. (For example, the XSLT could take the title from the first
candidate found, except that a table in the second position trumps a
starting p.)
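As a rough illustration of that kind of rule, here is a small Python sketch standing in for the XSLT. The function names are hypothetical, and two things are my assumptions rather than anything fixed above: a "candidate" is any element whose found attribute mentions feasible-title, and "a table in the second position" means a table candidate with pos="2".

```python
import xml.etree.ElementTree as ET

SHRVL = """<shrvl>
  <html>
    <body pos="2">
      <p pos="1" found="p-with-feasible-title"/>
      <div pos="2" found="div-with-small-simple-table">
        <table pos="2" found="td-with-feasible-title"/>
      </div>
      <div pos="5" found="div-with-feasible-title"/>
    </body>
  </html>
</shrvl>"""


def candidates_in_order(node, path=""):
    """Yield (element, positional XPath) for each title candidate, in document order."""
    for child in node:
        pos = child.get("pos")
        step = f"*[{pos}][self::{child.tag}]" if pos else child.tag
        child_path = f"{path}/{step}"
        # Assumption: any feature name containing "feasible-title" is a candidate.
        if "feasible-title" in child.get("found", ""):
            yield child, child_path
        yield from candidates_in_order(child, child_path)


def pick_title(shrvl_root):
    """Return the XPath of the chosen title element in the original document, or None."""
    candidates = list(candidates_in_order(shrvl_root))
    if not candidates:
        return None
    first, first_path = candidates[0]
    # The exception: a table in the second position trumps a starting p.
    if first.tag == "p" and first.get("pos") == "1":
        for node, path in candidates[1:]:
            if node.tag == "table" and node.get("pos") == "2":
                return path
    return first_path


if __name__ == "__main__":
    print(pick_title(ET.fromstring(SHRVL)))
```

The returned path points back into the original document, so the transformation can go straight to the chosen element instead of re-deriving the features itself.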
This may reduce spaghetti coding, and provide a clear intermediate
artifact that can be inspected to understand what is going on when some
new document turns up that fails the conversion.
It can also provide a way to do meta-grammars (architectural forms-ish),
where you have a RELAX NG (or Schematron) schema to validate the SHRVL document.
This can let you know whether your flat document has the features
expected for some transformation, without having to actually run that
heavyweight transformation.
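To make the idea concrete without writing a real schema, here is a plain Python stand-in for that pre-flight check. It is not RELAX NG or Schematron, just a sketch of the same test: the EXPECTED list and the missing_features function are hypothetical, and the feature names are borrowed from the mockup.

```python
import xml.etree.ElementTree as ET

# Hypothetical expectations for one transformation: an ElementTree path
# into the SHRVL document, plus a feature name that must appear in @found
# on at least one element matched by that path.
EXPECTED = [
    ("html/body/p[@pos='1']", "p-with-feasible-title"),
    ("html/body/div", "div-with-feasible-title"),
]


def missing_features(shrvl_root, expected=EXPECTED):
    """Return the (path, feature) pairs the SHRVL document does not satisfy."""
    missing = []
    for path, feature in expected:
        hits = shrvl_root.findall(path)
        if not any(feature in node.get("found", "").split() for node in hits):
            missing.append((path, feature))
    return missing


GOOD = """<shrvl><html><body pos="2">
  <p pos="1" found="p-with-feasible-title"/>
  <div pos="5" found="div-with-feasible-title"/>
</body></html></shrvl>"""

if __name__ == "__main__":
    problems = missing_features(ET.fromstring(GOOD))
    print("ready to transform" if not problems else problems)
```

An empty result means the document has the expected features and the heavyweight transformation is worth attempting; anything else says which feature was missing and where it was expected.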
Regards
Rick