RE: [xml-dev] Tool converts records to XML

Thanks Roger for sharing, effective simplicity is always interesting.  Here are some additional considerations.

 

  1. In the spirit that command-line options can add expressive power, it is worth noting that
  • <document> is a good default that could be renamed as a converter option, e.g.
  • <MyRootNodeName>

 

and similarly

  • <row> has a table-based connotation that might be better expressed as <element> for a contextual connotation, e.g.
  • <element>
  • <MyElementName>

 

  2. Such command-line rename options ought to be straightforward to implement, e.g.
  • toxml books.txt document=MyRootNodeName element=MyElementName
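
One way the rename options might be handled is sketched below. This is only an illustration, not part of Roger's posted tool: the function name toxml_named and the document=/element= operand syntax are assumptions taken from the example above. Because awk applies its own var=value operands only when it reaches them in the operand list, after BEGIN has already run, the sketch scans ARGV inside BEGIN and removes the options it recognizes.

```shell
# toxml_named: hypothetical variant of toxml that accepts
# document=NAME and element=NAME operands (both names are assumptions).
toxml_named() {
  awk '
  BEGIN {
      FS = "\t"
      root = "document"; row = "row"      # defaults used by the original tool
      for (i = 1; i < ARGC; i++) {
          if (ARGV[i] ~ /^document=/)     { root = substr(ARGV[i], 10); delete ARGV[i] }
          else if (ARGV[i] ~ /^element=/) { row  = substr(ARGV[i], 9);  delete ARGV[i] }
      }
      print "<" root ">"
  }
  NR == 1 { for (i = 1; i <= NF; i++) header[i] = $i; next }
  {
      print "<" row ">"
      for (i = 1; i <= NF; i++)
          print "<" header[i] ">" $i "</" header[i] ">"
      print "</" row ">"
  }
  END { print "</" root ">" }' "$@"
}
```

With this sketch, toxml_named books.txt document=MyRootNodeName element=MyElementName would wrap each record in <MyElementName> inside a <MyRootNodeName> root. The names are not checked for being legal XML names.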

 

  3. In the skateboarding spirit of “attributes are not a crime” the following is also an equivalent, terser representation.

 

<document>
    <element title="Unix Shell Programming" authors="Stephen G. Kochan, Patrick Wood" date="2019" isbn="0-872-32400-3" publisher="SAMS"/>
    <element title="Small, Sharp Software Tools" authors="Brian P. Hogan" date="2019" isbn="978-1-68050-296-1" publisher="The Pragmatic Programmers"/>
    <element title="The AWK Programming Language" authors="Alfred V. Aho, Brian W. Kernighan, Peter J. Weinberger" date="1988" isbn="0-201-07981-X" publisher="Addison-Wesley Publishing Company"/>
</document>
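
The attribute form above could be produced by a small change to the same kind of AWK program. The following is a minimal sketch under stated assumptions: toxml_attrs is a hypothetical name, the record element is called <element> as in the example, and every attribute value is written with double-quote delimiters after escaping &, < and ".

```shell
# toxml_attrs: hypothetical attribute-style variant of toxml.
toxml_attrs() {
  awk '
  # escape characters that are unsafe inside a double-quoted attribute value
  function esc(s) {
      gsub(/&/, "\\&amp;", s)     # must come first so later entities are not re-escaped
      gsub(/</, "\\&lt;", s)
      gsub(/"/, "\\&quot;", s)
      return s
  }
  BEGIN { FS = "\t"; print "<document>" }
  NR == 1 { for (i = 1; i <= NF; i++) header[i] = $i; next }
  {
      line = "    <element"
      for (i = 1; i <= NF; i++)
          line = line " " header[i] "=\"" esc($i) "\""
      print line "/>"
  }
  END { print "</document>" }' "$@"
}
```

Note that the attribute form only works when the header names are unique within a record, since XML does not allow an attribute to repeat on one element.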

 

  4. Of note: the escaping challenge for attribute values containing either a single-quote ' or a double-quote " character is usually easily handled by using the opposite character as the attribute delimiter.
  • title='My Favorite "Hello World" Program'

 

Although an author might have a single-quote or double-quote convention, that choice is explicitly not part of the information captured in the Post-Schema-Validation Infoset (PSVI).  During EXI endeavors we learned that user agents are allowed to switch between them as they see fit. So authors and authoring tools can have any encoding preference they want.

 

If an attribute value includes both single-quote ' and double-quote " characters, then some character escaping is inevitable and multiple forms of attribute expression are possible.

  • <title>Roger's Favorite "Hello World" Program</title>

and

  • title='Roger&apos;s Favorite "Hello World" Program'
  • title="Roger's Favorite &quot;Hello World&quot; Program"
  • title='Roger&apos;s Favorite &quot;Hello World&quot; Program'
  • title="Roger&apos;s Favorite &quot;Hello World&quot; Program"
  • along with other numeric-escape alternatives, numerous permutations

 

Once again, I believe that the PSVI does not distinguish between any of these representations; all are exactly equivalent once parsed by an XML processor.  So you can choose whatever convention you like.
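
Since any of these forms is acceptable, a converter is free to pick the least noisy one. The sketch below shows one possible policy, purely as an illustration: quote_attr is a hypothetical helper that reads one attribute value per line, prefers double-quote delimiters, switches to single-quote delimiters when the value contains a double quote but no apostrophe, and falls back to &quot; escaping only when both characters occur.

```shell
# quote_attr: hypothetical helper; reads one attribute value per line
# and prints it as a delimited, escaped XML attribute value.
quote_attr() {
  awk '
  BEGIN { sq = "\047"; dq = "\"" }        # \047 is the octal code for the apostrophe
  {
      v = $0
      gsub(/&/, "\\&amp;", v)             # & and < are always escaped
      gsub(/</, "\\&lt;", v)
      if (index(v, dq) == 0)      print dq v dq     # no double quote inside
      else if (index(v, sq) == 0) print sq v sq     # use the opposite delimiter
      else { gsub(/"/, "\\&quot;", v); print dq v dq }
  }'
}
```

Whichever branch is taken, an XML processor reports the same attribute value, so the policy is only a matter of readability.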

 

  5. Given these two equivalent forms for XML, it seems like a straight path to create corresponding equivalent representations in JSON and Turtle.
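
As a rough illustration of the JSON case only (Turtle is not attempted here), the same header-driven loop can emit an array of objects. This is a sketch under assumptions: tojson is a hypothetical name, every value is written as a JSON string, and only backslash and double quote are escaped, so control characters in the input would still need handling.

```shell
# tojson: hypothetical sibling of toxml that emits a JSON array of objects.
tojson() {
  awk '
  # backslash-escape the two characters this sketch handles: \ and "
  function esc(s,    out, i, c) {
      out = ""
      for (i = 1; i <= length(s); i++) {
          c = substr(s, i, 1)
          out = out ((c == "\\" || c == "\"") ? "\\" : "") c
      }
      return out
  }
  BEGIN { FS = "\t"; print "[" }
  NR == 1 { for (i = 1; i <= NF; i++) header[i] = $i; next }
  {
      obj = ""
      for (i = 1; i <= NF; i++)
          obj = obj (i > 1 ? ", " : "") "\"" esc(header[i]) "\": \"" esc($i) "\""
      # a comma is written before every object except the first
      printf "%s  {%s}", (NR > 2 ? ",\n" : ""), obj
  }
  END { print (NR > 1 ? "\n]" : "]") }' "$@"
}
```

Invocation would mirror the XML tool, for example tojson books.txt.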

 

  6. Converting semi-structured text such as the books.txt example into structured data is of course a common, tedious, and error-prone task.  So this is a worthy endeavor.

 

Establishing a good practice such as your toxml processor seems like an excellent processing-chain addition to XML, JSON, Semantic Web, and other Big Data workflows.

 

all the best, Don

--

Don Brutzman  Naval Postgraduate School, Code USW/Br        brutzman@nps.edu

Watkins 270,  MOVES Institute, Monterey CA 93943-5000 USA    +1.831.656.2149

X3D graphics, virtual worlds, Navy robotics https://faculty.nps.edu/brutzman

 

From: Hans-Juergen Rennau <hrennau@yahoo.de>
Sent: Tuesday, November 15, 2022 1:37 AM
To: xml-dev@lists.xml.org; Roger L Costello <costello@mitre.org>
Subject: Re: [xml-dev] Tool converts records to XML

 

Roger, I would find it interesting to compare an awk solution with an XQuery one, also considering aspects like clarity and extensibility. Especially interesting as the potential of XQuery for tool building is by and large ignored.

 

With kind regards,

Hans-Jürgen

 

PS. Example of an XQuery-based solution:

 

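(: $uri: location of the tab-delimited text file; $sep: field separator, tab by default :)
declare variable $uri external;
declare variable $sep external := '&#x9;';

<document>{
    let $lines := unparsed-text-lines($uri)
    (: the first line holds the column names :)
    let $names := $lines => head() => tokenize($sep)
    for $line in tail($lines) return
    <row>{
        (: wrap each field in an element named after its column :)
        for $field at $pos in tokenize($line, $sep) return
            element {$names[$pos]} {$field}
    }</row>
}</document>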

 

 

On Tuesday, November 15, 2022 at 00:10:03 CET, Roger L Costello <costello@mitre.org> wrote:

 

 

Hi Folks,

 

In the spirit of UNIX tool building .....

 

I created a simple tool that converts records of tab-delimited data into XML. For example, these records:

 

title    authors    date    isbn    publisher
Unix Shell Programming    Stephen G. Kochan, Patrick Wood    2019    0-872-32400-3    SAMS
Small, Sharp Software Tools    Brian P. Hogan    2019    978-1-68050-296-1    The Pragmatic Programmers
The AWK Programming Language    Alfred V. Aho, Brian W. Kernighan, Peter J. Weinberger    1988    0-201-07981-X    Addison-Wesley Publishing Company

 

are converted to this XML:

 

<document>
    <row>
        <title>Unix Shell Programming</title>
        <authors>Stephen G. Kochan, Patrick Wood</authors>
        <date>2019</date>
        <isbn>0-872-32400-3</isbn>
        <publisher>SAMS</publisher>
    </row>
    <row>
        <title>Small, Sharp Software Tools</title>
        <authors>Brian P. Hogan</authors>
        <date>2019</date>
        <isbn>978-1-68050-296-1</isbn>
        <publisher>The Pragmatic Programmers</publisher>
    </row>
    <row>
        <title>The AWK Programming Language</title>
        <authors>Alfred V. Aho, Brian W. Kernighan, Peter J. Weinberger</authors>
        <date>1988</date>
        <isbn>0-201-07981-X</isbn>
        <publisher>Addison-Wesley Publishing Company</publisher>
    </row>
</document>

 

Each record is wrapped in a <row>...</row> element. The fields in each record are wrapped in an element named by the header. The root element is <document>...</document>.

 

The tool may be invoked with a file, like this:

 

toxml books.txt

 

or from standard input, like this:

 

cat books.txt | toxml

 

The tool is a small AWK program, which I named "toxml":

---------------------------------------------------------

awk '
BEGIN   {   # field separator is tab (\t)
            # record separator is LF (\n)
            OFS = FS = "\t"
            RS = "\n"
            print "<document>"
        }
NR == 1 {   # store column header names in an array
            for (i = 1; i <= NF; i++)
                header[i] = $i
        }
NR != 1 {   # create a <row>...</row> element for the line
            # surround field $i with a start/end tag named header[i]
            print "<row>"
            for (i = 1; i <= NF; i++)
                print "<" header[i] ">" $i "</" header[i] ">"
            print "</row>"
        }
END     { print "</document>" }' "$@"

---------------------------------------------------------
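
One caveat with the program as posted is that field values are copied into the output verbatim, so a value containing & or < would make the result ill-formed. A possible hardening is sketched below; it is an assumption-laden illustration rather than a change Roger proposed. The name toxml_escaped and the helper esc are hypothetical, and header names are still trusted to be legal XML element names.

```shell
# toxml_escaped: hypothetical variant of toxml that escapes element content.
toxml_escaped() {
  awk '
  # replace the characters that have special meaning in element content
  function esc(s) {
      gsub(/&/, "\\&amp;", s)     # first, so the entities added below stay intact
      gsub(/</, "\\&lt;", s)
      gsub(/>/, "\\&gt;", s)
      return s
  }
  BEGIN { FS = "\t"; print "<document>" }
  NR == 1 { for (i = 1; i <= NF; i++) header[i] = $i; next }
  {
      print "<row>"
      for (i = 1; i <= NF; i++)
          print "<" header[i] ">" esc($i) "</" header[i] ">"
      print "</row>"
  }
  END { print "</document>" }' "$@"
}
```

It is invoked the same way as the original, either toxml_escaped books.txt or cat books.txt | toxml_escaped.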

 

 

 

 

 

_______________________________________________________________________

 

XML-DEV is a publicly archived, unmoderated list hosted by OASIS

to support XML implementation and development. To minimize

spam in the archives, you must subscribe before posting.

 

[Un]Subscribe/change address: http://www.oasis-open.org/mlmanage/

 


Copyright 1993-2007 XML.org. This site is hosted by OASIS