Roger L Costello <costello@mitre.org> writes:
> Michael Kay wrote:
>
>> the "Barnes & Noble" problem. The number #1 blunder
>> when writing XML is not to bother escaping `<` and `&`
>> if they happen to occur in your input.
>
> Ouch!
>
> You are right Michael.
>
> Upon reflection, I realized that there is an even nastier problem
> lurking than the problem of converting & and < in the input record
> data into &amp; and &lt; in the output XML.
>
> ...
>
> To implement the character conversions in AWK would be a monumental task.
>
> Eeeeeeek!
>
> Lesson Learned: Don't use AWK to convert records to XML.

Well, you may be right, and I believe many on this list share my
preference for performing such conversions in XSLT and/or XQuery, but I
have to say that the lesson you suggest seems a slightly broader
conclusion than is warranted by the experience you describe.

Agreed.

A couple points of detail:

- Your downstream tools are likely to be somewhat happier if you
  convert the data to UTF-8 or UTF-16, but unless I am mistaken you
  are not in fact required to do so, in order to turn the data into
  XML. XML does allow encoding declarations.

- If you do want to convert the encoding it would surprise me a bit if
  awk had no constructs suitable for the work. It would surprise me
  even more if a system with awk did not have the iconv utility for
  converting textual data from one encoding to another.

    iconv --from-code=WINDOWS-1252 --to-code=UTF-8 < myinput > output.utf8

awk does not need such a construct. iconv and awk are both part of the
Unix-like shell ecosystem, so iconv can pre-process or post-process an
awk conversion through an ordinary shell pipe.
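
To make the division of labour concrete, here is a minimal sketch of
such a pipeline. It assumes one record per input line; the function
names records_to_xml and convert_file and the element names <records>
and <record> are invented for illustration and do not come from the
original thread. iconv handles the encoding, and awk handles only the
escaping of `&`, `<` and `>`. It makes no attempt to deal with control
characters that XML does not allow.

```shell
#!/bin/sh
# Hypothetical helper: wrap each input line in a <record> element,
# escaping the characters that are reserved in XML character data.
records_to_xml() {
    awk '
    BEGIN {
        print "<?xml version=\"1.0\" encoding=\"UTF-8\"?>"
        print "<records>"
    }
    {
        # Escape & first so the entities added below are not re-escaped.
        # In an awk replacement string a bare & means "the matched text",
        # so a literal ampersand has to be written as \\&.
        gsub(/&/, "\\&amp;")
        gsub(/</, "\\&lt;")
        gsub(/>/, "\\&gt;")
        print "  <record>" $0 "</record>"
    }
    END { print "</records>" }
    '
}

# Hypothetical wrapper: $1 is the input file, $2 its source encoding.
# iconv pre-processes the bytes into UTF-8; awk never sees the old encoding.
convert_file() {
    iconv -f "$2" -t UTF-8 < "$1" | records_to_xml
}
```

With these definitions, something like

    convert_file myinput WINDOWS-1252 > output.xml

would do the whole job. If you would rather leave the data in its
original encoding, drop the iconv step and change the encoding named in
the XML declaration to match the input instead.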