   Re: [xml-dev] Re: Divorcing Data Model and Syntax: was Re: [xml-dev] her


Hi Patrick,

>>A LMNL processor generates a LMNL data model (layers, ranges and
>>annotations) in whatever way it likes. I'm not sure what you mean
>>about places in the data and data associated with those places, but if
>>I rephrase to:
>>  "Most LMNL processors will build a LMNL data model by taking a
>>   sequence of characters (a string) and deriving a structure from
>>   that string. This structure will usually be based on the presence
>>   of 'markup' within the string, whether explicit (such as XML tags)
>>   or implicit (such as spaces between words). The LMNL processor may
>>   also associate extra information with particular pieces of the
>>   string."
> Ultimately, whether my informal statement or your more formal one,
> the LMNL processor converts a serialization syntax based upon a data
> model into the LMNL data model. Is that a correct statement?

I don't want to answer yes or no to that because honestly it depends
on what you mean. Certainly a LMNL processor *can* be used in that
way, yes. Maybe the examples below will help clarify...

>>Yes. The reason that I don't think that this is a limitation is that
>>any particular layer within the LMNL data model can hold ranges and
>>annotations that represent any other kind of data model that can be
>>derived from a text document.
> Well, that is the rub isn't it? A string based data model cannot
> represent anything that is not a string. Divorcing the data model
> and serialization syntax is not limited to application to things
> that can be represented as strings. Anything that can be addressed
> from or by a serialization syntax can have a different data model
> imposed on it at the time of processing. Our examples, to be sure,
> have been texts, after all we (Matt & I) are both involved in
> biblical studies so it is what you would expect. ;-) Little demand
> for galactic coordinate spaces and the like in biblical studies. ;-)

Right -- the LMNL data model is limited to text documents. I think
that's a reasonable limitation given that we're trying to do
text-based markup, but I accept that it's a limitation.

The ARA model, on which LMNL is based, is *not* limited to text
documents, I believe -- it has the same generality as the grove
paradigm (from what I gather, before all the grove-heads start laying
into me).

>>Right. In the LMNL approach to the same problem, we might take a
>>sequence of characters (a text layer) and derive ranges over those
>>characters to create a syntactic layer, and then derive ranges over
>>those ranges to create a higher-level layer that represents a
>>particular data model, for example an XML Infoset layer.
> It is the step to the creation of the representation you describe
> that leaves me quite curious. Why? Unless there is something I
> cannot represent in the original serialization syntax that is an
> issue, why even go there?
> Take the example of the dictionary I offered over the weekend. (For the 
> benefit of those who missed that discussion I repeat the example.)
> ***repeat of example***
> <entry><headWord>JITTs</headWord>
>        (typical OED entry back to early Sumerian usage)
> </entry>

OK. A full LMNL processor could represent this as a text layer: a
sequence of characters {'<', 'e', 'n', 't', 'r', 'y', ..., 'r', 'y',
'>'}.

The LMNL processor would then construct ranges over those characters,
perhaps based on regular expressions. Which ranges are constructed
depends on the data model that you're aiming for, but let's say that
the first set of constructors were ones that recognised those
constructs specified by the XML Recommendation.

For example, it would recognise the sequence of characters {'<', 'e',
'n', 't', 'r', 'y', '>'} and construct a range over that sequence
whose name would be "xml:STag". It would recognise the sequence {'e',
'n', 't', 'r', 'y'} and construct a range over that sequence whose
name would be "xml:Name", and so on.
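
To make that step concrete, here's a rough sketch in Python (my choice
of illustration language; the (name, start, end) representation of a
range and the simplified regexes are my assumptions, nothing like a
full XML parser):

```python
import re

# Text layer: just a sequence of characters.
text = '<entry><headWord>JITTs</headWord></entry>'

# Syntax layer: ranges over the text layer, each a (name, start, end)
# triple of character offsets, constructed with simplified regexes.
ranges = []
for m in re.finditer(r'<(\w+)>', text):        # start-tags
    ranges.append(('xml:STag', m.start(), m.end()))
    ranges.append(('xml:Name', m.start(1), m.end(1)))
for m in re.finditer(r'</(\w+)>', text):       # end-tags
    ranges.append(('xml:ETag', m.start(), m.end()))
    ranges.append(('xml:Name', m.start(1), m.end(1)))

for name, start, end in sorted(ranges, key=lambda r: r[1]):
    print(name, start, end, repr(text[start:end]))
```

So the range named "xml:STag" covers offsets 0-7 ('<entry>'), the
first "xml:Name" covers 1-6 ('entry'), and so on.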

These ranges form a "layer" that overlays the "text layer". I'll call
this the "syntax" layer, but you could give it any label you like.

The LMNL processor could then construct another layer that contains
ranges that range over the ranges in the syntax layer. In this layer,
some of the ranges from the syntax layer could be interpreted to
create ranges called "info:element", while others were ignored. The
blob that you want to ignore could be ignored at this level.

The outcome is a layer that contains ranges that together represent an
XML Infoset. That layer can then be serialised as an XML document.
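
The range-over-ranges step can be sketched in the same spirit (Python
again; the pairing logic and the "info:element" label are my
illustrative assumptions, not anything LMNL specifies):

```python
# Higher layer: ranges derived from the ranges in the syntax layer.
# Each xml:ETag is paired with the most recent unmatched xml:STag to
# derive an element range; other syntax ranges are simply ignored.
def derive_elements(syntax_ranges):
    stack, elements = [], []
    for name, start, end in sorted(syntax_ranges, key=lambda r: r[1]):
        if name == 'xml:STag':
            stack.append(start)
        elif name == 'xml:ETag':
            elements.append(('info:element', stack.pop(), end))
    return elements

# Syntax-layer ranges for the text '<a><b>hi</b></a>':
syntax = [('xml:STag', 0, 3), ('xml:STag', 3, 6),
          ('xml:ETag', 8, 12), ('xml:ETag', 12, 16)]
print(derive_elements(syntax))
# [('info:element', 3, 12), ('info:element', 0, 16)]
```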

The reason for the LMNL syntax is that we want applications to be able
to pass around this whole data model -- the text layer, syntax layer
and the final XML Infoset layer and all the characters and ranges and
annotations that they contain -- all the analysis of the document.
That's not something that you want in this case, which is fair enough
-- you can just serialise the XML Infoset layer and be done with it.

Similarly if I started from another document:

  7 October 2002

and constructed layers over that, picking out the ranges of the year,
month and day in one layer, and interpreting the combination of those
three as being a date in another, then I could serialise that final
layer using a standard serialisation specific for dates, such as ISO
8601:

  2002-10-07

But it might also be useful to be able to pass around the full
results of the analysis. That's where the LMNL syntax comes into play,
because the data model as a whole can be represented using LMNL, for
example as:

  [!layer name="type" base="#default"]
    [day}7{day] [month [num}10{]}October{month] [year}2002{year]
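
That date pipeline can be sketched end-to-end (Python once more; here
the layer-building is collapsed into one regex match, and the month
table and ISO 8601 output are my assumptions about the "standard
serialisation"):

```python
import re

# Text layer.
text = '7 October 2002'

# One layer picks out the day / month / year ranges; another
# interprets the combination of the three as a date.
m = re.match(r'(\d{1,2}) (\w+) (\d{4})', text)
day, month_name, year = m.groups()
MONTHS = {'January': 1, 'February': 2, 'March': 3, 'April': 4,
          'May': 5, 'June': 6, 'July': 7, 'August': 8,
          'September': 9, 'October': 10, 'November': 11, 'December': 12}

# Serialise the final layer in the date-specific syntax.
iso = '%04d-%02d-%02d' % (int(year), MONTHS[month_name], int(day))
print(iso)  # 2002-10-07
```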

> Now as I understand your explanation, everything between <entry> to
> </entry> would be converted into the LMNL data model?

There are many ways in which a LMNL processor could interpret the
document. One way would be to create a layer representing an XML
Infoset from it. Another way would be to create a reified LMNL layer
from it. Whatever kind of layer, it could represent all or part of the
document.

Whichever way the document is interpreted, the LMNL data model acts as
a framework for its interpretation.

> In other words, does (or does not) the LMNL data model recognize all
> the ranges in XML text being read into the LMNL data model?

If you took the document:

  <entry><headWord>JITTs</headWord>
         (typical OED entry back to early Sumerian usage)
  </entry>

and converted it, without filtering, into a reified LMNL layer in the
standard way (where the "standard way" is a way that I've yet to write
up, but basically maps elements to [lr:range] and attributes to
[lr:annotation] ranges), then that reified LMNL layer would contain
everything from the XML document. If you then serialised that reified
LMNL layer in LMNL syntax, it would look like:

  [entry}[headWord}JITTs{headWord]
         (typical OED entry back to early Sumerian usage)
  {entry]

> If all I need do is avoid interpreting markup that will simply bulk
> up my DOM tree, why do I need LMNL? All the markup is still present
> should I assert another tree, this one selecting only the entry I
> want, and I process all the markup found just like any other XML
> fragment.

If all you want to do is avoid interpreting parts of your XML document
that will simply bulk up your DOM tree, then writing a SAXFilter to
weed out the stuff you're not interested in is quite sufficient.
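
For instance, a minimal filter along those lines might look like this
(using Python's xml.sax; the <etym> element being weeded out, and its
content, are hypothetical):

```python
import xml.sax
from xml.sax.saxutils import XMLFilterBase, XMLGenerator
from io import StringIO

class DropElements(XMLFilterBase):
    """SAX filter that suppresses named elements and their content."""
    def __init__(self, parent, drop):
        super().__init__(parent)
        self._drop = drop
        self._depth = 0          # > 0 while inside a dropped element

    def startElement(self, name, attrs):
        if self._depth or name in self._drop:
            self._depth += 1
        else:
            super().startElement(name, attrs)

    def endElement(self, name):
        if self._depth:
            self._depth -= 1
        else:
            super().endElement(name)

    def characters(self, content):
        if not self._depth:
            super().characters(content)

out = StringIO()
reader = DropElements(xml.sax.make_parser(), drop={'etym'})
reader.setContentHandler(XMLGenerator(out))
reader.parse(StringIO('<entry><headWord>JITTs</headWord>'
                      '<etym>from early Sumerian</etym></entry>'))
print(out.getvalue())
```

The filtered stream keeps <entry> and <headWord> but never reports the
<etym> element or its text to the downstream handler.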

The full LMNL processing gubbins that I've described above is just a
way of *thinking* about the problem of extracting meaning from text,
not necessarily the way to *do* it. An API onto a particular syntax
can act as a shortcut through the layers. For example, the SAX API is
a shortcut through the "syntax layer" that I described above -- rather
than using some generic processing based on regular expressions to
create STag and ETag ranges, a SAX parser reports the ranges as it
finds them.

Similarly, LMNOP (the LMNL parser) shortcuts the tedious stuff about
creating a text layer, a syntax layer and so on, and just describes,
using SAL, the ranges that are present on the reified LMNL layer.

Honestly, I think that JITTs and the LMNL processing model that I
described above are doing exactly the same thing -- forming a bridge
from a syntax to a data model -- but we're thinking about it in
different ways -- you as a single process that takes you from the
syntax to the data model, me as a sequence of processes that
eventually get you there.

In any case, the discussion is really helpful for me -- I find the
relationship between the reified LMNL layer and the LMNL data model,
and especially the consequences for the APIs that I'm putting
together, hard to wrap my head around at times. This discussion is
really helping me to clarify that in my own mind, even if I'm not
succeeding in clarifying it in anyone else's :)



Jeni Tennison



Copyright 2001 XML.org. This site is hosted by OASIS