Re: [xml-dev] Tool converts records to XML
- From: "C. M. Sperberg-McQueen" <cmsmcq@blackmesatech.com>
- To: Roger L Costello <costello@mitre.org>
- Date: Wed, 16 Nov 2022 17:08:50 -0700
Roger L Costello <costello@mitre.org> writes:
> Michael Kay wrote:
>
>> the "Barnes & Noble" problem. The number one blunder
>> when writing XML is not to bother escaping `<` and `&`
>> if they happen to occur in your input.
>
> Ouch!
>
> You are right Michael.
>
> Upon reflection, I realized that there is an even nastier problem
> lurking than the problem of converting & and < in the input record
> data into &amp; and &lt; in the output XML.
>
> ...
>
> To implement the character conversions in AWK would be a monumental task.
>
> Eeeeeeek!
>
> Lesson Learned: Don't use AWK to convert records to XML.
Well, you may be right, and I believe many on this list share my
preference for performing such conversions in XSLT and/or XQuery, but I
have to say that the lesson you suggest seems a slightly broader
conclusion than is warranted by the experience you describe.
A couple of points of detail:
- Your downstream tools are likely to be somewhat happier if you
convert the data to UTF-8 or UTF-16, but unless I am mistaken you
are not in fact required to do so, in order to turn the data into
XML. XML does allow encoding declarations.
- If you do want to convert the encoding it would surprise me a bit if
awk had no constructs suitable for the work. It would surprise me
even more if a system with awk did not have the iconv utility for
converting textual data from one encoding to another.
iconv --from-code=WINDOWS-1252 --to-code=UTF-8 < myinput > output.utf8
- Your note sounds as if you found it difficult to contemplate the
horrifying task of escaping occurrences of & and < -- I don't see
what you regard as so difficult. (A minimal awk sketch of the
escaping follows this list of points.)
Of all the text formats I have worked with, XML is among the
simplest as regards the number and nature of its rules, and
especially the number of its magic characters. It has two, count
'em, two magic characters in ordinary textual content: ampersand and
left angle bracket. (Add the magic string ']]>' if for some reason
you choose to generate CDATA marked sections. Add the escaping of
the delimiters if you are generating attribute values.)
This contrasts favorably, in my experience, with the number of
characters you need to take care to escape if you are generating
other formats, for example TeX. I have never seen the first cut at
a TeX to XML conversion in which the programmer remembered to
unescape all the escaped characters; I have never seen my first cut
at a TeX document remember to escape all the characters that need
escaping.
You will perhaps be saying now that tab- and comma-delimited formats
are simpler. But even tab- or comma-delimited formats are likely to
have at least two magic characters and maybe more: they need to
escape at least their main delimiter (tab or comma), and then also
to escape whatever mechanism is used to escape tab or comma: if some
values are quoted, there will need to be ways to escape quotation
marks within quoted values; if backslash escaping is used, backslash
must also be escaped. One reason I have come to despise CSV is that
I have come across so many pieces of software which claim to accept
CSV but whose authors have botched the parsing. (One of them had no
way to allow commas in strings.)
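
To make the escaping point concrete, here is a minimal sketch in awk.
The tab-separated field layout, the element names (records, record,
field), the script name, and the assumption that the input has already
been converted to UTF-8 (say, by the iconv invocation above) are all
illustrative choices, not anything your record format dictates.

    # rec2xml.awk (hypothetical name): tab-separated records in, XML out
    BEGIN {
        FS = "\t"
        # If you keep the data in windows-1252 instead of converting it,
        # declare that encoding here rather than UTF-8.
        print "<?xml version=\"1.0\" encoding=\"UTF-8\"?>"
        print "<records>"
    }
    {
        printf "  <record>"
        for (i = 1; i <= NF; i++) {
            v = $i
            gsub(/&/, "\\&amp;", v)   # ampersand first, so the & of &lt; is not re-escaped
            gsub(/</, "\\&lt;", v)
            printf "<field>%s</field>", v
        }
        print "</record>"
    }
    END { print "</records>" }

One might run it as, say,

    iconv --from-code=WINDOWS-1252 --to-code=UTF-8 < myinput | awk -f rec2xml.awk > output.xml

The point is not that this particular script fits your records; it is
that the two gsub calls are the whole of the supposedly monumental
character conversion.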
I wonder if the lesson to be learned might more accurately be formulated
as "When writing any program, pay attention to the formal definitions of
your input and your output; if you don't, you are likely to produce
output that is not in the specified output format." If you are
producing XML, you have at least the advantage that your downstream data
consumers are likely to tell you what's wrong, instead of accepting the
bad data and silently producing bad results.
--
C. M. Sperberg-McQueen
Black Mesa Technologies LLC
http://blackmesatech.com