Re: Fwd: Fw: [xml-dev] Imagine

On Mon, Feb 14, 2022 at 6:05 PM Rick Jelliffe <rjelliffe@allette.com.au> wrote:

In Australia, one of our most important public databases is the Pharmaceutical Benefits Scheme.
For every medication the government chooses, it uses its purchasing power to negotiate prices with vendors, which pharmacists apply with some leeway. If your condition meets the requirements for a listed drug, then when you buy it at the chemist you pay a discounted price and get a substantial rebate later. A drug "on the PBS" will cost maybe 1/4 of what it would if prescribed "off the PBS", and in the case of pioneering drugs can be thousands of dollars cheaper per prescription. Almost no-one is unable to afford drugs.

So it is basically an elaborate price catalog with medical notes, symptoms, formulae, brands, vendors, warnings, indications, limitations. Most prescription drugs go through the PBS, so the PBS has been absolutely central to our society for two generations now. And a lot of money flows through it.

For the last decade or so, PBS had a public data dump format in XML specifically designed to be RDF friendly: some RDF namespace attributes etc.

However, as far as I know, only one organization ever used the RDF capability, and I am not sure if it was experimental.

But most users found the constant indirection impossible to fathom for non-RDF use. (I think users wanted as complete a main tree as possible, with obvious relationships and tables where possible. Nothing flat.)

An update was made that used GUIDs for all IDs instead of partitioning IDs by object. That simplified things a lot, reduced jumps and collected more fields together. A good improvement, I thought, but it still did not get a warm reception.

But my understanding is that the Department of Health are looking at dropping the public RDF entirely, and indeed maybe the public XML (not the internal publishing XMLs though.) I am not in the loop, and have not checked the situation in 2022, but that is my understanding of the ideas being floated. (One option was to pretty much expose the original relational data tables by simple web service queries on each table, returning rows as JSON.)

So this might be an example of trying to have RDF-friendly XML which some might think backfired. (Of course, perhaps users who shredded the XML back into relational tables were helped by the RDF-isms. But it seems the slimmest of arguments.) Not only was the RDF not what users were interested in, it may have contributed to non-idiomatic XML that was too fiddly to use.

Yes, providing data in Open formats is about allowing potential ecosystems, not any guarantee that any new ecosystem will thrive, so I am not saying it was a bad call. I am not criticizing the XML or the RDF implementation.

I guess my point is that, in some cases, you ignore XML idiom at your peril. If you just dump relational tables into XML as tables when the database has a natural tree lurking in it, you are complicating the processing of the document as XML, in which case why not use CSV (or JSON)? I don't know why the same would not be so for triples.
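
To make the contrast concrete, here is a minimal Python sketch. It is not the PBS format: the element names (`dump`, `row`, `drugRef`, `drug`, `brand`) and the data are invented for illustration. It reads the same relationship once from a table-style dump, where an ID join is needed, and once from a tree-shaped document, where the relationship is simply containment.

```python
import xml.etree.ElementTree as ET

# Hypothetical table-style dump: every relationship is an ID reference.
FLAT = """
<dump>
  <drugs><row id="d1" name="Examplamine"/></drugs>
  <brands>
    <row id="b1" drugRef="d1" name="BrandA"/>
    <row id="b2" drugRef="d1" name="BrandB"/>
  </brands>
</dump>
"""

# Hypothetical tree-shaped version of the same data.
NESTED = """
<drugs>
  <drug name="Examplamine">
    <brand name="BrandA"/>
    <brand name="BrandB"/>
  </drug>
</drugs>
"""

def brands_from_flat(xml_text):
    """Join brand rows to drug rows by ID, as a relational consumer would."""
    root = ET.fromstring(xml_text)
    names = {row.get("id"): row.get("name") for row in root.find("drugs")}
    result = {}
    for row in root.find("brands"):
        result.setdefault(names[row.get("drugRef")], []).append(row.get("name"))
    return result

def brands_from_nested(xml_text):
    """Read the same relationship directly from element containment."""
    root = ET.fromstring(xml_text)
    return {drug.get("name"): [b.get("name") for b in drug.findall("brand")]
            for drug in root.findall("drug")}

flat_result = brands_from_flat(FLAT)
nested_result = brands_from_nested(NESTED)
print(flat_result == nested_result)
```

With one relationship the join is trivial; the complaint above is about what happens when every field sits behind such a lookup.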

Cheers,
Rick

On Tue, 15 Feb 2022, 9:13 am Webb Roberts, <webb@webbroberts.com> wrote:
During my time working on NIEM (the National Information Exchange Model), we kept integration of XML and RDF as a core tenet. The goal was to ensure that XML data and schemas in the NIEM ecosystem represented RDF data. There were several major pieces to this:

- We defined RDF resource identifiers for each XML qualified name. This gave us RDF names for types, elements, and attributes in XML schemas and data. (see https://niem.github.io/NIEM-NDR/v5.0/niem-ndr.html#section_5.6.1)
- We defined a mapping from data that uses NIEM to RDF. Instance documents are RDF datasets. Element and attribute occurrences are RDF properties. Most elements are subject-predicate-object triples. Some elements are RDF quads. Attribute and element values are RDF literals. (see https://niem.github.io/NIEM-NDR/v5.0/niem-ndr.html#section_5.6.3)
- We defined a mapping from XML schemas using NIEM to RDF schema. Complex types are RDF types. XSD type derivation reflects rdfs:subClassOf. Element and attribute declarations are RDF properties. Instance data has corresponding RDF types. (see https://niem.github.io/NIEM-NDR/v5.0/niem-ndr.html#section_5.6.4 & 5.6.5)
- We maintained rules about how XML and XML Schema were used that let us maintain the relationship between XML and RDF.
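
The flavour of the first two points can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the NDR rules themselves: the IRI rule used here (namespace name, then `#` unless the namespace already ends in one, then the local name) is a simplification, and the namespace `http://example.org/ns`, the element names and the blank-node label `_:person1` are invented. The sections linked above are the authority.

```python
import xml.etree.ElementTree as ET

def qname_to_iri(clark_name):
    """Turn ElementTree's '{namespace}local' tag into a single IRI.

    Assumed rule: namespace + '#' (unless already present) + local name.
    Only namespaced names are handled.
    """
    namespace, local = clark_name[1:].rsplit("}", 1)
    separator = "" if namespace.endswith("#") else "#"
    return namespace + separator + local

def element_to_triples(element, subject):
    """Emit (subject, predicate, literal) for each simple child element."""
    triples = []
    for child in element:
        text = (child.text or "").strip()
        if len(child) == 0 and text:
            triples.append((subject, qname_to_iri(child.tag), text))
    return triples

# Hypothetical instance document.
DOC = """
<ex:Person xmlns:ex="http://example.org/ns">
  <ex:PersonGivenName>Ada</ex:PersonGivenName>
  <ex:PersonSurName>Lovelace</ex:PersonSurName>
</ex:Person>
"""

person_triples = element_to_triples(ET.fromstring(DOC), "_:person1")
for triple in person_triples:
    print(triple)
```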

One consequence of the XML+XSD mapping to RDF is that it was very straightforward to use JSON-LD as a standard JSON representation for NIEM data, rather than construct a new mapping between XML and JSON. (see https://reference.niem.gov/niem/specification/json/5.0/niem-json-spec-5.0.html)
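
As a rough illustration of why an existing RDF mapping makes JSON-LD a natural fit, the Python sketch below reads a small JSON-LD-style object and lists the triples it stands for. The namespace `http://example.org/ns#`, the property names and the person IRI are invented, and only prefix expansion of compact IRIs is handled. It is not a JSON-LD processor and does not follow the NIEM JSON specification.

```python
import json

# Hypothetical JSON-LD-style document with one prefix in its context.
DOCUMENT = json.loads("""
{
  "@context": {"ex": "http://example.org/ns#"},
  "@id": "http://example.org/people/1",
  "ex:PersonGivenName": "Ada",
  "ex:PersonSurName": "Lovelace"
}
""")

def expand(term, context):
    """Expand a compact IRI such as 'ex:Name' using the prefixes in @context."""
    prefix, _, local = term.partition(":")
    return context[prefix] + local

def to_triples(document):
    """List (subject, predicate, literal) for each plain property of the object."""
    context = document["@context"]
    subject = document["@id"]
    return [(subject, expand(key, context), value)
            for key, value in document.items()
            if not key.startswith("@")]

jsonld_triples = to_triples(DOCUMENT)
for triple in jsonld_triples:
    print(triple)
```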

Webb Roberts
webb@webbroberts.com

On 2022-02-13, at 22:48, Dan Brickley <danbri@danbri.org> wrote:

As has already been pointed out - this might well seem dreamy on XML-DEV but in the RDF world it's pretty much what drew most of us to the technology.

Most RDF toolkits try to make it easy to consolidate information from various sources and formats into a common graph model. They will usually handle some subset of the more explicitly RDF-flavoured formats, e.g. RDF/XML, Turtle, RDFa, JSON-LD, N-Triples, TriG, ... etc. But there will also be an API that can be called programmatically, to create triples from anything you have programmatic access to. You'll find XML adaptors of various kinds (XSLT being the most obvious). Back in the day there were angsty debates about schema annotation for mapping to triples. For example see https://www.w3.org/2003/02/schema-annotation
http://cmsmcq.com/2002/schema-annotation.html#ab2b3b3b7
https://www.w3.org/2000/08/w3c-synd/
https://www.w3.org/TR/schema-arch/
https://www.w3.org/1999/04/WebData etc

... although those things never turned out to be as important and central as folks thought.

RDF folk spend much of their time moving all kinds of data into RDF graphs/triples. But so much of this grungy data cleaning work is necessarily custom, per-dataset, per-application, ... limiting the value of generic conversion tools.

At W3C we did have something called GRDDL that's also close to the picture outlined here - https://en.wikipedia.org/wiki/GRDDL - but in my experience it is rarely used.

Finally - there is a ton of non-imaginary RDF data out there. You might look at Schema.org - widely used for in-page markup, e.g. see http://webdatacommons.org/structureddata/schemaorgtables/ or for what we've been up to at Google, https://developers.google.com/search/docs/advanced/structured-data/search-gallery

Or at Wikidata, whose data is available in RDF dumps https://www.wikidata.org/wiki/Wikidata:RDF or at query.wikidata.org via SPARQL.

For the Oxygen entity in Wikidata, or rather their page about it, see https://www.wikidata.org/wiki/Q629

- chemical symbol (https://www.wikidata.org/wiki/Property:P246) = O
- atomic number (https://www.wikidata.org/wiki/Property:P1086) = 8
- mass (https://www.wikidata.org/wiki/Property:P2067) = 15.999 dalton
- electronegativity (https://www.wikidata.org/wiki/Property:P1108) = 3.44

If you look around at Wikidata you'll see that some of these factual claims are sourced.

So we can look into 15.999 being different to the 16 value in Hans-Jürgen's sketch. The sourcing given is:

Pure and Applied Chemistry
retrieved: 19 October 2020
title: Atomic weights of the elements 2013 (IUPAC Technical Report) (English)
DOI: 10.1515/PAC-2015-0305

Here is an example query that uses the Wikidata SPARQL query service to pull out answers - i.e. Oxygen - with an atomic number of 8.

SELECT ?chem ?chemLabel ?atomicNumber ?mass ?electronegativity ?anyprop ?anyval
WHERE {
  ?chem wdt:P246 ?chemSymbol ;
        wdt:P1086 ?atomicNumber ;
        wdt:P2067 ?mass ;
        wdt:P1108 ?electronegativity ;
        ?anyprop ?anyval .
  FILTER(?atomicNumber = 8)
  # Helps get the label in your language; if not available, then in English
  SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en". }
}

You can run it here: https://w.wiki/4q52
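
The same kind of query can also be sent from a program. The Python sketch below is only an outline under stated assumptions: it uses the public endpoint `https://query.wikidata.org/sparql` with the `query` and `format=json` parameters, sends a shortened variant of the query above, and the `User-Agent` string is a placeholder you would replace with your own. Building the request is separated from sending it so the first part can be inspected without network access.

```python
import json
import urllib.parse
import urllib.request

ENDPOINT = "https://query.wikidata.org/sparql"

# Shortened variant of the query above (fewer selected properties).
QUERY = """
SELECT ?chem ?chemLabel ?mass WHERE {
  ?chem wdt:P246 ?chemSymbol ; wdt:P1086 ?atomicNumber ; wdt:P2067 ?mass .
  FILTER(?atomicNumber = 8)
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}
"""

def build_request(query):
    """Build a GET request asking for JSON results."""
    params = urllib.parse.urlencode({"query": query, "format": "json"})
    return urllib.request.Request(
        ENDPOINT + "?" + params,
        # Placeholder identification string; an assumption, not a required value.
        headers={"User-Agent": "xml-dev-example/0.1 (placeholder contact)"},
    )

def run_query(query):
    """Send the request and return the list of result bindings (needs network)."""
    with urllib.request.urlopen(build_request(query), timeout=30) as response:
        return json.load(response)["results"]["bindings"]

request = build_request(QUERY)
print(request.full_url[:60] + "...")
```

Calling `run_query(QUERY)` would return one dictionary per result row, with each selected variable mapped to an object holding its `value`.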

And back on the original theme about mapping, it is also worth knowing about the CONSTRUCT mechanism in SPARQL.

We can take the above query and write CONSTRUCT queries that emit triples in a different shape or vocabulary.

This is a SPARQL query that takes what's in Wikidata for all Chemicals and emits triples along the lines sketched initially:

PREFIX foo: <https://foo.example.org/>
CONSTRUCT {
  ?chem foo:symbol ?chemSymbol .   # P246 value such as "O"; ?chemLabel would be the element's name
  ?chem foo:numberOfElectrons ?atomicNumber .
  ?chem foo:atomicMass ?mass .
  ?chem foo:electronegativity ?electronegativity .
  ?chem foo:discoTime ?discoTime .
  # other properties from https://www.wikidata.org/wiki/Q629 here
} WHERE {
  ?chem wdt:P246 ?chemSymbol ;
        wdt:P1086 ?atomicNumber ;
        wdt:P2067 ?mass ;
        wdt:P1108 ?electronegativity .
  OPTIONAL { ?chem wdt:P575 ?discoTime . }
  # FILTER(?atomicNumber = 8)   # commented out to get everything

  SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en". }
}
ORDER BY ?discoTime # this might be pointless for CONSTRUCT queries

Try it here: https://w.wiki/4q57
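
For readers new to CONSTRUCT, one way to picture it is as a triple template that is filled in once per solution of the WHERE clause, with any template triple that mentions an unbound variable left out. The Python sketch below imitates just that step. The solution rows are hand-written stand-ins rather than query results, and `ex:elementX` with its values is invented.

```python
# Template triples, mirroring part of the CONSTRUCT block above.
TEMPLATE = [
    ("?chem", "foo:symbol", "?chemSymbol"),
    ("?chem", "foo:numberOfElectrons", "?atomicNumber"),
    ("?chem", "foo:discoTime", "?discoTime"),
]

# Stand-in solution rows; the second row and its values are hypothetical.
SOLUTIONS = [
    {"?chem": "wd:Q629", "?chemSymbol": "O", "?atomicNumber": "8"},
    {"?chem": "ex:elementX", "?chemSymbol": "Xx", "?atomicNumber": "999",
     "?discoTime": "2000-01-01"},
]

def construct(template, solutions):
    """Instantiate the template for every row, skipping unbound variables."""
    graph = set()
    for row in solutions:
        for pattern in template:
            triple = tuple(row.get(term) if term.startswith("?") else term
                           for term in pattern)
            if None not in triple:  # an unbound variable drops the triple
                graph.add(triple)
    return graph

constructed = construct(TEMPLATE, SOLUTIONS)
for triple in sorted(constructed):
    print(*triple)
```

The first row has no `?discoTime`, which corresponds to the OPTIONAL pattern not matching: that row simply contributes no `foo:discoTime` triple.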

Hope this helps...

Dan

On Wed, 2 Feb 2022 at 11:31, Chet Ensign <chet.ensign@oasis-open.org> wrote:
I am posting this on behalf of Mr. Hans-Jürgen Rennau while we debug a problem with his emails being posted to the list.

---

Assume four datasets: an XML document, a JSON document, a CSV file and an HTML document (authored near the north pole, in the rain forest, in Athens and in the Antarctic, respectively).

Imagine a standard which enables you to define the mapping of a document node to a set of RDF triples.

Remember that all documents (XML, JSON, CSV, HTML) can be parsed into document nodes (for example see [1]).

Assume that the RDF graphs obtained from our documents contain the following triples:
foo:oxygen foo:symbol "O"
foo:oxygen foo:numberOfElectrons "8"
foo:oxygen foo:atomicMass "16"
foo:oxygen foo:electronegativity "3.5"

each one found in a different one of the four RDF graphs.

Then we have integrated information: we now know four things about oxygen, each contributed by a different data source using a different data format. Of course it would be easy to serialize the integrated information into XML, or JSON, or CSV, or HTML or any other format (employing Inuit or any other natural language).
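
The integration step can be sketched in Python with the standard library alone. Note what this is and is not: the imagined standard would describe the mappings declaratively, whereas here each mapping is a small hand-written function, and the four tiny documents and their field names are invented for the example. The point it shows is only that triples sharing the subject `foo:oxygen` merge by plain set union.

```python
import csv
import io
import json
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

# Four hypothetical source documents, one fact each.
XML_DOC = '<element name="oxygen"><symbol>O</symbol></element>'
JSON_DOC = '{"name": "oxygen", "numberOfElectrons": "8"}'
CSV_DOC = "name,atomicMass\noxygen,16\n"
HTML_DOC = '<p data-name="oxygen" data-electronegativity="3.5">Oxygen</p>'

def from_xml(text):
    root = ET.fromstring(text)
    return {("foo:" + root.get("name"), "foo:symbol", root.findtext("symbol"))}

def from_json(text):
    record = json.loads(text)
    return {("foo:" + record["name"], "foo:numberOfElectrons",
             record["numberOfElectrons"])}

def from_csv(text):
    return {("foo:" + row["name"], "foo:atomicMass", row["atomicMass"])
            for row in csv.DictReader(io.StringIO(text))}

class AttributeCollector(HTMLParser):
    """Collect one triple per start tag carrying the expected data attributes."""
    def __init__(self):
        super().__init__()
        self.triples = set()

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if "data-name" in attributes and "data-electronegativity" in attributes:
            self.triples.add(("foo:" + attributes["data-name"],
                              "foo:electronegativity",
                              attributes["data-electronegativity"]))

def from_html(text):
    parser = AttributeCollector()
    parser.feed(text)
    return parser.triples

# Union of the four graphs: the shared subject identifier does the integration.
merged = from_xml(XML_DOC) | from_json(JSON_DOC) | from_csv(CSV_DOC) | from_html(HTML_DOC)
for triple in sorted(merged):
    print(*triple)
```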

+ + + - - -

But I suppose you think this is an idle dream. Perhaps you think that the imagined standard would not be feasible to create or to use, or you question the practicality of leveraging RDF IRIs for identifying resources and properties in more than a few specific cases.

Unfortunately I agree that it is an idle dream. Only, the reason I see is a different one: I am convinced that the imagined standard is not too difficult to create and to use, and I do not question the practicality of using RDF IRIs in many fields, including natural science, pharmacology, health care, finance, many verticals and economic interaction. The reason I see is that it seems impossible to find minds with a deep interest in both XML technology and semantic technology. if - then - but.

With kind regards,
Hans-Jürgen Rennau

[1]
https://www.w3.org/TR/xpath-functions-31/#func-doc
https://docs.basex.org/wiki/JSON_Module#json:doc
https://docs.basex.org/wiki/CSV_Module#csv:doc
https://docs.basex.org/wiki/HTML_Module#html:doc

--
Chet Ensign
Chief Technical Community Steward
OASIS Open

+1 201-341-1393
chet.ensign@oasis-open.org
www.oasis-open.org