OASIS Mailing List Archives

 



Re: A simple guy with a simple problem




Sean McGrath, er, "Bob" wrote:

> Hello, my name is Bob and I'm a programmer.

Hi Bob!

> I work for a B2B company. My task today is to process
> incoming XML documents that are known to be valid against the foo
> DTD and change all occurrences of the word "STUFF" to "stuff".
>
> I need to leave the documents otherwise unchanged in all material
> respects as they are going on to a third company in a B2B chain.

The requirements are a bit fuzzy: for example,
if the input contains "ASDFSTUFFQWERTY", does that
count as an occurrence of the word "STUFF"?
Also, the precise definition of "all material respects" 
is unclear (I gather that this is the real question).

At any rate, your best bet is to use sed (or the moral equivalent):

    sed -e 's/\<STUFF\>/stuff/g'

if STUFF should only be matched as a complete word, or

    sed -e 's/STUFF/stuff/g'

if the character sequence 'STUFF' should be matched anywhere.
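
For anyone who prefers a scripting language to sed, here is a rough
Python equivalent of the two commands above.  The function names and
the sample string are made up for illustration; the regular expression
uses \b, which behaves much like sed's \< and \> for this purpose.

```python
import re


def replace_word(text: str) -> str:
    """Replace STUFF only where it stands as a complete word."""
    return re.sub(r"\bSTUFF\b", "stuff", text)


def replace_anywhere(text: str) -> str:
    """Replace the character sequence STUFF wherever it appears."""
    return text.replace("STUFF", "stuff")


# Hypothetical sample input, not taken from a real document.
sample = "<lit>STUFF and ASDFSTUFFQWERTY</lit>"
print(replace_word(sample))
print(replace_anywhere(sample))
```

The first call leaves "ASDFSTUFFQWERTY" alone; the second rewrites
the "STUFF" inside it as well.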

This is guaranteed not to disturb any of the markup [*],
since fortunately the DTD:

>          <!ELEMENT foo (lit)*>
>          <!ELEMENT lit (#PCDATA)>
>          <!ATTLIST lit text CDATA "STUFF">

doesn't use "STUFF" as an element or attribute name.
If it did, you'd have a harder task.

[*] Actually, this is a lie: it will break if the document starts
with <!DOCTYPE foo SYSTEM "http://www.baz.com/STUFF/...">, or
if it uses an internal general entity named "stuff"
and another named "STUFF".  (If it *only* contains one
called "STUFF", it's unclear whether renaming this
to "stuff" constitutes an unacceptable material change.)
There may be other corner cases as well.
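
To make the entity corner case concrete, here is a small Python
sketch.  The document below is invented for illustration (it is not
one of Bob's real inputs); it declares both a "stuff" and a "STUFF"
entity and then applies a blind textual substitution.

```python
# Hypothetical input with two entities that differ only by case.
doc = (
    "<!DOCTYPE foo [\n"
    '  <!ENTITY stuff "lower">\n'
    '  <!ENTITY STUFF "upper">\n'
    "]>\n"
    "<foo><lit>&stuff; &STUFF; STUFF</lit></foo>"
)

# Blind substitution, equivalent to sed -e 's/STUFF/stuff/g'.
blind = doc.replace("STUFF", "stuff")
print(blind)

# The result declares "stuff" twice and no longer references "STUFF".
duplicate_declarations = blind.count("<!ENTITY stuff ")
print(duplicate_declarations)
```

After the substitution both references name the same entity, so the
text that used to come from "STUFF" is silently replaced by whichever
declaration the parser honors.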

To solve the harder task, there are three possible solutions:
(1) Use an off-the-shelf SAX parser, perform the substitution
on text events and attribute values, and reserialize the document;
the output will have the same Infoset as the input, modulo
the required changes.  This approach will only work if
the precise lexical structure is immaterial.  If the lexical
structure does matter, then I suggest (2): convince management
to change this requirement, then implement solution (1).
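
A minimal sketch of solution (1), assuming Python's standard xml.sax
module stands in for "an off-the-shelf SAX parser".  The names
StuffFilter and transform are my own, and the sample input in the
last lines is hypothetical.

```python
import io
import xml.sax
from xml.sax.saxutils import XMLFilterBase, XMLGenerator
from xml.sax.xmlreader import AttributesImpl


class StuffFilter(XMLFilterBase):
    """SAX filter that rewrites STUFF in attribute values and text."""

    def startElement(self, name, attrs):
        # Rewrite attribute values only; element and attribute names
        # pass through unchanged.
        new_attrs = {k: v.replace("STUFF", "stuff") for k, v in attrs.items()}
        super().startElement(name, AttributesImpl(new_attrs))

    def characters(self, content):
        # Caveat: a parser may deliver one run of text in several
        # chunks; a robust version would buffer adjacent chunks first.
        super().characters(content.replace("STUFF", "stuff"))


def transform(xml_text: str) -> str:
    """Parse xml_text, apply the filter, and reserialize the events."""
    out = io.StringIO()
    filt = StuffFilter(xml.sax.make_parser())
    filt.setContentHandler(XMLGenerator(out, encoding="utf-8"))
    filt.parse(io.StringIO(xml_text))
    return out.getvalue()


# Hypothetical input for illustration.
result = transform('<foo><lit text="STUFF">some STUFF here</lit></foo>')
print(result)
```

Note what gets lost: the serializer picks its own quoting and
whitespace inside tags, and this sketch does not carry comments or
the DOCTYPE declaration through at all.  That is exactly the
"precise lexical structure" caveat.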

If this isn't feasible, you'll need to (3) perform the
transformation and reserialization at the level of
XML lexical tokens.  The EXPAT parser has an internal
interface that reports individual XML lexemes; you could
base your program on that.  Or you could write your own
tokenizer; it's tedious but not difficult.
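
For a feel of what "write your own tokenizer" means, here is a
deliberately simplified Python sketch.  It is not how EXPAT does it,
and it is not a conforming XML tokenizer (for instance, it assumes
no "[" or ">" inside DOCTYPE literals).  The names TOKEN, ATTR_VALUE,
SUBST and rewrite are my own.  Everything except text, CDATA content
and attribute values is copied through unchanged.

```python
import re

# One alternative per lexeme kind.  Order matters: the special "<!"
# and "<?" forms are tried before ordinary tags.
TOKEN = re.compile(
    r"""
    (?P<comment><!--.*?-->)
  | (?P<cdata><!\[CDATA\[.*?\]\]>)
  | (?P<pi><\?.*?\?>)
  | (?P<doctype><!DOCTYPE[^\[>]*(?:\[.*?\]\s*)?>)
  | (?P<tag><(?:[^<>"']|"[^"]*"|'[^']*')+>)
  | (?P<text>[^<]+)
    """,
    re.DOTALL | re.VERBOSE,
)

# A quoted attribute value together with its leading "=".
ATTR_VALUE = re.compile(r"""=\s*("[^"]*"|'[^']*')""")

# Either an entity/character reference (left alone) or the target word.
SUBST = re.compile(r"&[^;&<]*;|STUFF")


def _subst(s: str) -> str:
    """Replace STUFF in s, skipping over &name; references."""
    return SUBST.sub(lambda m: "stuff" if m.group() == "STUFF" else m.group(), s)


def _fix(m: "re.Match[str]") -> str:
    kind, lexeme = m.lastgroup, m.group()
    if kind == "text":
        return _subst(lexeme)
    if kind == "cdata":
        # No references inside CDATA, so a plain replace is enough.
        return lexeme.replace("STUFF", "stuff")
    if kind == "tag":
        # Touch quoted attribute values only, never names.
        return ATTR_VALUE.sub(lambda a: _subst(a.group()), lexeme)
    return lexeme  # comments, PIs and the DOCTYPE pass through as-is


def rewrite(xml_text: str) -> str:
    """Return xml_text with STUFF replaced in text and attribute values."""
    return TOKEN.sub(_fix, xml_text)


print(rewrite("<lit  text = 'STUFF' >STUFF &STUFF;</lit>"))
```

Because unmatched input and untouched lexemes are copied verbatim,
odd spacing and quote styles survive, which is the whole point of
working at this level.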

Hope this helps,


--Joe