Re: [xml-dev] Tool converts records to XML

Roger L Costello <costello@mitre.org> writes:

> Michael Kay wrote:
>
>> the "Barnes & Noble" problem. The number #1 blunder 
>> when writing XML is not to bother escaping `<` and `&` 
>> if they happen to occur in your input.
>
> Ouch!
>
> You are right Michael.
>
> Upon reflection, I realized that there is an even nastier problem
> lurking than the problem of converting & and < in the input record
> data into &amp; and &lt; in the output XML.
>
> ...
>
> To implement the character conversions in AWK would be a monumental task.
>
> Eeeeeeek!
>
> Lesson Learned: Don't use AWK to convert records to XML.

Well, you may be right, and I believe many on this list share my
preference for performing such conversions in XSLT and/or XQuery, but I
have to say that the lesson you suggest seems a slightly broader
conclusion than is warranted by the experience you describe.

A couple points of detail:

  - Your downstream tools are likely to be somewhat happier if you
    convert the data to UTF-8 or UTF-16, but unless I am mistaken you
    are not in fact required to do so, in order to turn the data into
    XML.  XML does allow encoding declarations.

  - If you do want to convert the encoding it would surprise me a bit if
    awk had no constructs suitable for the work.  It would surprise me
    even more if a system with awk did not have the iconv utility for
    converting textual data from one encoding to another.

        iconv --from-code=WINDOWS-1252 --to-code=UTF-8 < myinput > output.utf8

  - Your note sounds as if you found it difficult to contemplate the
    horrifying task of escaping occurrences of & and < -- I don't see
    what you regard as so difficult.

    Of all the text formats I have worked with, XML is among the
    simplest as regards the number and nature of its rules, and
    especially the number of its magic characters.  It has two, count
    'em, two magic characters in ordinary textual content: ampersand and
    left angle bracket.  (Add the magic string ']]>' if for some reason
    you choose to generate CDATA marked sections.  Add the escaping of
    the delimiters if you are generating attribute values.)

    This contrasts favorably, in my experience, with the number of
    characters you need to take care to escape if you are generating
    other formats, for example TeX.  I have never seen the first cut at
    a TeX to XML conversion in which the programmer remembered to
    unescape all the escaped characters; I have never seen my first cut
    at a TeX document remember to escape all the characters that need
    escaping.

    You will perhaps be saying now that tab- and comma-delimited formats
    are simpler.  But even tab- or comma-delimited formats are likely to
    have at least two magic characters and maybe more: they need to
    escape at least their main delimiter (tab or comma), and then also
    to escape whatever mechanism is used to escape tab or comma: if some
    values are quoted, there will need to be ways to escape quotation
    marks within quoted values; if backslash escaping is used, backslash
    must also be escaped.  One reason I have come to despise CSV is that
    I have come across so many pieces of software which claim to accept
    CSV but whose authors have botched the parsing.  (One of them had no
    way to allow commas in strings.)
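
As a rough sketch of how little is involved, the escaping of text
content and attribute values might be written in awk along the
following lines. The function names and the records/record/field
element names are hypothetical, chosen only for illustration, and the
input is assumed to be tab-delimited with no escaping conventions of
its own.

```shell
#!/bin/sh
# Sketch only: function names and element names are hypothetical.

# Escape the two magic characters of XML text content.
# In awk's gsub(), a bare & in the replacement means "the matched text",
# so a literal ampersand is written \\& inside the awk string.
xml_escape_text() {
  awk '{ gsub(/&/, "\\&amp;"); gsub(/</, "\\&lt;"); print }'
}

# For attribute values delimited by double quotes, also escape the delimiter.
xml_escape_attr() {
  awk '{ gsub(/&/, "\\&amp;"); gsub(/</, "\\&lt;"); gsub(/"/, "\\&quot;"); print }'
}

# Turn tab-separated records on stdin into one <record> element per line.
# The ampersand is escaped first, so the & of an inserted &lt; is not
# escaped a second time.
records_to_xml() {
  awk -F '\t' '
    function esc(s) { gsub(/&/, "\\&amp;", s); gsub(/</, "\\&lt;", s); return s }
    BEGIN { print "<records>" }
    { printf "  <record>"
      for (i = 1; i <= NF; i++) printf "<field>%s</field>", esc($i)
      print "</record>" }
    END { print "</records>" }'
}
```

Assuming the input is in Windows-1252, as in the iconv example above,
the pieces could be chained together:

        iconv --from-code=WINDOWS-1252 --to-code=UTF-8 < myinput | records_to_xml > output.xml

The sketch writes no XML declaration, so its output needs to be UTF-8
(or UTF-16) to be read correctly; to keep the original encoding
instead, one would skip iconv and add a suitable encoding declaration
at the top of the output.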

I wonder if the lesson to be learned might more accurately be formulated
as "When writing any program, pay attention to the formal definitions of
your input and your output; if you don't, you are likely to produce
output that is not in the specified output format."  If you are
producing XML, you have at least the advantage that your downstream data
consumers are likely to tell you what's wrong, instead of accepting the
bad data and silently producing bad results.

-- 
C. M. Sperberg-McQueen
Black Mesa Technologies LLC
http://blackmesatech.com

