XML.org
OASIS Mailing List Archives

Shouldn't XML be expressive enough to model E-R constructs directly?

I am not sure if the following makes sense. For the gist, skip to the Conjecture paragraphs. (Please don't be distracted by details when I say something that is not quite right in the numbered sections: they are examples, not assertions.)

1. Typed nodes, untyped endpoints, untyped edges.
SGML's ESIS (Element Structure Information Set, akin to SAX) was a list of reported events ("information") from parsing a document. So we can view it as a chain of nodes where each node has two unlabelled edges, except the first and last, which have only one.
  • There are 3 types of node: data (characters), tag, declaration. 
  • There are seven types of tag: start-document, end-document, internal entity reference, external entity reference, start-tag, end-tag, processing-instruction.
    • These may have various properties: e.g., for a start-tag, generic identifier and attributes
  • There are five types of declaration: doctype, entity, element, attlist, notation.
    • These may have various properties: name, value, etc. E.g., for an internal entity, a name and text.
  • The tag types are partially ordered: declarations must come before all element tags, and attribute "tags" must come immediately after an element start-tag.
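
A minimal sketch of this linear view, assuming Python's standard `xml.sax` module as a stand-in for an ESIS-style event reporter (it is SAX, not ESIS, and reports no declarations here). The class name `EventList` and the tuple layout are illustrative choices only:

```python
import xml.sax


class EventList(xml.sax.ContentHandler):
    """Collect parse events as a flat list: a chain of typed nodes."""

    def __init__(self):
        super().__init__()
        self.events = []

    def startDocument(self):
        self.events.append(("start-document",))

    def endDocument(self):
        self.events.append(("end-document",))

    def startElement(self, name, attrs):
        # A start-tag carries its generic identifier and its attributes.
        self.events.append(("start-tag", name, dict(attrs.items())))

    def endElement(self, name):
        self.events.append(("end-tag", name))

    def characters(self, content):
        # The parser may deliver one text run in several chunks.
        self.events.append(("data", content))

    def processingInstruction(self, target, data):
        self.events.append(("processing-instruction", target, data))


def parse_events(xml_text):
    handler = EventList()
    xml.sax.parseString(xml_text.encode("utf-8"), handler)
    return handler.events


events = parse_events('<buttons type="wooden">Knopf</buttons>')
for event in events:
    print(event)
```

Each event only "knows" its neighbours by position in the list, which is the sense in which the edges are untyped and unlabelled.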
2. Typed nodes, typed endpoints, untyped edges
The XPath 1 Data Model has seven types of nodes: document, element, text, attribute, comment, processing-instruction, namespace.
  • Element, attribute and processing-instructions are labeled: the element name, attribute name, and processing-instruction target.
    • There is one document node with an edge to one element node.  An attribute node is a name-value pair with a property to indicate if the node is an ID. 
  • The edges are unlabelled and untyped, but their endpoints have labels called the "axis": child/parent, preceding/following, attribute/parent, namespace/parent.
    • An element node has only one parent end-point, one preceding endpoint and one following endpoint. All edges have an element at one or both ends. 
    • Child/parent edges have an ordering property. (This might be considered a label.)
  • The network is a tree, rooted at the document node.
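
A rough illustration of the endpoint labels, assuming Python's `xml.etree.ElementTree`. ElementTree is not a full XPath 1.0 implementation and stores no parent pointers, so the parent and sibling endpoints are derived here by hand; the sample document is made up:

```python
import xml.etree.ElementTree as ET

doc = ET.fromstring(
    '<order id="o1"><item>thread</item><item>buttons</item></order>'
)

# child endpoint: the ordered list of element children
children = list(doc)

# parent endpoint: derived by walking the tree once
parent_of = {child: parent for parent in doc.iter() for child in parent}

# preceding/following endpoints between siblings, from the child ordering
first, second = children
following_of = {first: second}
preceding_of = {second: first}

# attribute endpoint: name-value pairs hanging off the element
print(doc.attrib)
print([c.text for c in children])
print(parent_of[first].tag)
```

The edges themselves carry nothing; all the information sits on the nodes and on which endpoint (axis) you traverse from.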
3. Untyped nodes, typed edges, untyped endpoints
This is a kind of view that a text editor might have (ignoring markup declarations), allowing top-down navigation into various files. It is a tree of anonymous nodes plus a named set of resources/files/objects called entities.
  • An entity is a text entity or non-text entity.
  • Nodes are anonymous and untyped
    • Each node is itself a sequence of text ranges in text entities and an ordering property
    • Each node therefore represents a text run in the contents of an element, with entity references dereferenced but not merged, rather than the element itself.
  • There are six types of edges: data, element, attribute, comment, processing instruction, ID/IDREF.
    • Element, attribute, processing instruction and ID/IDREF edges have names.
    • Edges are directed from, to.
    • Edges may also have text ranges (e.g. for the tags, if their location is useful to the application)
    • Apart from the root node, all nodes have one element edge connected by the "to" endpoint.
    • Except for elements, each node connects to only one edge.
    • Element, data, attribute, comment and processing-instruction edges form an ordered, directed tree.
    • ID/IDREF edges are named by the IDREF value. Consequently, the "from" endpoint connects to a node that has an attribute edge leading to a node whose value contains that identifier (i.e., an IDREF or IDREFS attribute), and the "to" endpoint connects to a node that has an attribute edge leading to a node whose value is that identifier (i.e., an ID attribute).
  • The document is therefore a directed, possibly cyclic rooted graph, where the nodes are themselves trees of possibly overlapping ranges of text in entities.
4. Attribute-labelled nodes with attribute-labelled edges.

An example is Chen's Entity-Relationship Model: it has three types of nodes: entity, relationship and attribute.
  • Nodes are labelled: the entity name, relationship name, attribute name.
  • Edges have no type or label, but their end-points have a cardinality property: for end-points of edges involving attributes, this may not be greater than 1. 
  • All nodes have at least one edge. A node cannot have an edge to the same type of node, and an attribute node only has one edge (i.e., there are "entity attributes" and "relationship attributes".) 
  • One attribute connected to an entity can be the "primary key".
  • The network is a graph. Edges and nodes do not have any ordering property.
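
The constraints above can be sketched as a small unordered graph. This is only an illustration under assumptions: the class names, the `violations` helper, and the sample entities ("buttons", "text", a "contains" relationship) are hypothetical, and the primary-key property is left out:

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Node:
    kind: str  # "entity", "relationship" or "attribute"
    name: str


@dataclass(frozen=True)
class Edge:
    a: Node
    b: Node
    max_a: Optional[int] = 1  # cardinality at endpoint a; None means "many"
    max_b: Optional[int] = 1  # cardinality at endpoint b


def _above_one(cardinality):
    return cardinality is None or cardinality > 1


def violations(nodes, edges):
    """Return the constraints from the list above that the graph breaks."""
    problems = []
    degree = {n: 0 for n in nodes}
    for e in edges:
        degree[e.a] += 1
        degree[e.b] += 1
        if e.a.kind == e.b.kind:
            problems.append(f"edge joins two {e.a.kind} nodes")
        if "attribute" in (e.a.kind, e.b.kind) and (
            _above_one(e.max_a) or _above_one(e.max_b)
        ):
            problems.append("attribute edge with cardinality above 1")
    for n, d in degree.items():
        if d == 0:
            problems.append(f"node without edges: {n.name}")
        elif n.kind == "attribute" and d != 1:
            problems.append(f"attribute with {d} edges: {n.name}")
    return problems


buttons = Node("entity", "buttons")
text = Node("entity", "text")
contains = Node("relationship", "contains")
button_type = Node("attribute", "type")  # an entity attribute
lang = Node("attribute", "lang")         # a relationship attribute

nodes = [buttons, text, contains, button_type, lang]
edges = [
    Edge(buttons, contains),
    Edge(contains, text, 1, None),
    Edge(buttons, button_type),
    Edge(contains, lang),
]
print(violations(nodes, edges))
```

Note that `type` hangs off an entity while `lang` hangs off a relationship: both kinds of attachment exist side by side in the same model.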

Conjecture: XML cannot represent #4

XML's strength is that it can be viewed as node-labelled or edge-labelled networks in all sorts of ways.

However, while we can have attributes that attach to nodes or attributes that attach to edges, there is no way in XML itself to attach attributes to both nodes and edges, which is what the E-R model wants. You have to go outside XML to some higher layer that may give a view of the document in which some attributes that come in on nodes (or edges) are associated with edges (or nodes). This impacts every part of the XML ecosystem, which also needs to somehow convey this out-of-band, non-schema information where it exists.

For example, in a start-tag `<buttons type="wooden" xml:lang="de">...` the @type is an attribute of buttons, but the @xml:lang is an attribute of the element's contents, which are in German. In E-R terms, one should be an entity attribute and the other a relationship attribute. (Please don't quibble about this example, please focus on what I am trying to communicate, not on the details of what @xml:lang applies to in its specification, etc.: I know what some of you are like :-) )

So what people often do, to get "pure" modeling, is something like `<buttons type="wooden"><data xml:lang="de">...` to get an attachment. But this does not in fact let you know which attributes belong to edges and which to nodes; it just creates another edge and node. You need to go outside.
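
A small check of this point, assuming Python's `xml.etree.ElementTree` as a representative parser and "Knopf" as made-up content. In both markups every attribute arrives as a plain name-value pair on some element, with nothing to say whether it describes the node or its contents:

```python
import xml.etree.ElementTree as ET

# ElementTree reports xml:lang under the reserved XML namespace URI.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

flat = ET.fromstring('<buttons type="wooden" xml:lang="de">Knopf</buttons>')
nested = ET.fromstring(
    '<buttons type="wooden"><data xml:lang="de">Knopf</data></buttons>'
)

# Flat form: both attributes sit in the same undifferentiated mapping.
print(flat.attrib)

# Nested form: the attribute has moved, but it is still just an
# attribute of an element (the new <data> node).
print(nested.attrib, nested[0].attrib)
```

Any distinction between "attribute of buttons" and "attribute of the contents" has to come from knowledge held outside the parsed document.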

How could this be resolved? My preference would be a specialized = delimiter, like =@, in an upgrade to XML, for attributes that apply to the contents (attributes on the node if edges are normally used, and vice versa): `<buttons type="wooden" xml:lang=@"de">...`

But in XML 1.n, it could be done by some naming convention: for example, a namespace prefix starting with "data_" on any attribute that should attach to the contents between the tags, not to the tag itself.
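
A minimal sketch of how a processing layer might apply such a naming convention, assuming Python's `xml.sax` with its default (non-namespace-aware) mode so that prefixes stay visible. The prefix `data_i18n`, its namespace URI, and the class name `SplitAttributes` are hypothetical:

```python
import xml.sax


class SplitAttributes(xml.sax.ContentHandler):
    """Sort each element's attributes into 'on the tag' and 'on the contents'."""

    def __init__(self):
        super().__init__()
        self.result = []

    def startElement(self, name, attrs):
        on_tag, on_contents = {}, {}
        for qname, value in attrs.items():
            prefix, sep, local = qname.partition(":")
            if prefix == "xmlns":
                continue  # namespace declarations are not modelled here
            if sep and prefix.startswith("data_"):
                on_contents[local] = value  # convention: applies to contents
            else:
                on_tag[qname] = value       # ordinary attribute of the tag
        self.result.append((name, on_tag, on_contents))


source = (
    '<buttons xmlns:data_i18n="http://example.org/hypothetical" '
    'type="wooden" data_i18n:lang="de">Knopf</buttons>'
)
handler = SplitAttributes()
xml.sax.parseString(source.encode("utf-8"), handler)
print(handler.result)
```

The split still lives in the application rather than in XML itself; the convention only makes the out-of-band knowledge mechanical to apply.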


Regards
Rick



Copyright 1993-2007 XML.org. This site is hosted by OASIS