OASIS Mailing List Archives

 


   XML vs. Other Data Formats



B Tommie Usdin wrote:
>
> At 11:30 AM -0700 6/8/06, <juanrgonzaleza@canonicalscience.com> wrote:
> ...
>
>> For instance, we began to encode data in XML and after of some
>> experiences decided to abandon the format. Therefore in the next
>> statistics we will not be in the 40% as now ;-)
>
> That's really interesting. Can you tell us why you decided to abandon
> XML? And what data format better meets your needs? And why? (This sort
> of user experience is very valuable.)
>
> -- Tommie

We attempted to base our data and applications entirely on XML technology.
We ran into many difficulties in correctly implementing available
specifications such as XHTML 1.1, MathML 2.0, XSL-FO, and others. We also
found further difficulties with CML, STMML, and UnitsML, among others.

We then turned to our own XML-based markup language: CanonML. It was
composed of several modules, e.g. CanonMath for mathematics, which would
improve on MathML in some aspects, CanonTexT instead of XHTML (2.0), etcetera.

[http://canonicalscience.blogspot.com/2006/02/choosing-notationsyntax-for-canonmath.html]

This would rely on XSLT for transformations on the browser side and on CSS
(instead of XSL-FO, because it is not supported).

But more problems arose!

I finally decided that, regardless of how much time and money had been
spent on the XML approach, it would never work in the way I had in mind.
In fact, it is really striking to see that even giants such as Elsevier are
not following all XML standards because of problems; e.g. they are using an
in-house modification of W3C MathML and complementing it with their own
Elsevier CEP markup.

The next logical step was the search for alternative technologies, but
which one? SGML? TeX/LaTeX? Liminal? LAMN? YAML? Other?

None of them satisfied all our needs, and since we were ready to break with
the XML world, we could break further still by adding requirements that
were initially outside the technological program at the Center. So I
decided we could begin from zero, rewriting all layers.

I chose S(cheme)XML as a good starting point, but modified it to adapt it to
our particular needs. CanonML is thus reborn, now being not an XML
application but a Canon(ical) Meta (formaL) Language. Actually, I am also
reusing CSS.

Requirements:

1) Dick Formal Language + Keizer vectors. This lets us unify our previous
physicochemical scientific approach with the humanities world.

2) Data optimization similar to CSV-like approaches. This is especially
useful in large datuments: 7 Gb or bigger.

3) Encoding of non-hierarchical structures in a more powerful way than
liminal or GODDAG, without the added parsing difficulties associated with
them. For example, the Lewis structure for HF would be represented as

[H}[F} e e {H] e e e e e e {F]

<H/<F/ e e /H> e e e e e e /F>
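To make the overlap concrete, here is a minimal Python sketch of how the
first notation might be read. It assumes (the post does not confirm this)
that "[X}" opens a range named X, "{X]" closes it, and every other token is
content; parse_ranges is a hypothetical helper, not part of CanonML.

```python
import re

# Assumed reading of the notation (not confirmed by the post):
#   "[X}" opens a range named X, "{X]" closes it, anything else is content.
TOKEN = re.compile(r"\[(\w+)\}|\{(\w+)\]|(\S+)")

def parse_ranges(text):
    """Return {range name: content tokens seen while that range was open}."""
    open_ranges = []   # names currently open, in opening order
    contents = {}      # name -> collected content tokens
    for opened, closed, token in TOKEN.findall(text):
        if opened:
            open_ranges.append(opened)
            contents[opened] = []
        elif closed:
            # Ranges may close in any order, so this is not a stack pop.
            open_ranges.remove(closed)
        else:
            for name in open_ranges:
                contents[name].append(token)
    if open_ranges:
        raise ValueError(f"unclosed ranges: {open_ranges}")
    return contents

lewis_hf = parse_ranges("[H}[F} e e {H] e e e e e e {F]")
print({name: len(tokens) for name, tokens in lewis_hf.items()})
```

Under that reading the H range holds 2 electrons and the F range holds 8,
with the shared pair belonging to both. A strictly nested tree cannot say
this without duplicating the pair or adding linking conventions.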

4) Mathematical sophistication. Whereas in theory any mathematical
structure could be implemented in XML markup (e.g. OpenMath), in practice
there are problems related to presentation and also to human authoring.

MathML is already so verbose that it cannot be authored by hand, and tools
are generating ugly code. For example, after 10 years of MathML, people
have still been unable to encode (ds)^2. Take as an illustration Distler's
MUSINGS blog, which is claimed to be the most technologically advanced blog
on the planet. I do not need the most advanced technology if it cannot
encode something as simple as (ds)^2 in full.

The only real possibility for a first-class encoding of
scientific-mathematical content is human authoring of formulae, which
cannot be achieved in an XML syntax. This is one of the reasons for the
popularity of TeX/LaTeX/AMSTeX in academic communities.

It may also be remarked that nobody has yet achieved TeX's mathematical
typesetting quality in the SGML or XML worlds.

5) No double data format: elements plus attributes.

Also, elimination of any limitations on attributes (hierarchies, any
content), as has been done in liminal or in ConciseXML.

6) Better internationalization and extensibility. Elimination of
limitations on tag names. It is interesting that internationalization is
only achieved in text, not in markup. I can write Spanish text in an XML
document but cannot use Spanish words for the markup of a document: <niño>
or <cigüeña> are not permitted, for example. The Spanish version of English
<section> is <sección>, and I can see some Spanish documents writing
<seccion> because of XML's own limitations.

Any other limitations on markup are eliminated. For example, I can write
<water> or <cicloheptatrieno> but not <1,5-ciclooctadieno>. This is to be
avoided.
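The restriction on names such as <1,5-ciclooctadieno> can be checked with
any conforming parser. The sketch below uses the XML parser bundled with
Python; is_accepted_tag is a hypothetical helper name, and the test only
shows what this one parser accepts as an element name.

```python
import xml.etree.ElementTree as ET

def is_accepted_tag(name):
    """Report whether the parser accepts <name/> as a complete document."""
    try:
        ET.fromstring(f"<{name}/>")
        return True
    except ET.ParseError:
        return False

for name in ["water", "cicloheptatrieno", "1,5-ciclooctadieno"]:
    print(name, is_accepted_tag(name))
```

The third name is rejected because an XML name may not start with a digit
and may not contain a comma.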

7) Multiple markup. How many times do all of us need to name something in
two ways at the same time?

Normally one finds <B><I>A</I></B>, and so people proposed <BI>A</BI>
years ago (used today on some forum boards). One finds <pre><code>
sequences today, and so <blockcode> was proposed for the upcoming XHTML 2.0.

But that obliges us to add new tags in new specifications instead of
reusing available ones. In CanonML this reuse is achieved via the
multimarkup model.

8) Elimination of end tags, or making them optional at least, somewhat as
in liminal, SXML, ConciseXML, and others.

9) Unification of language.

In the XML world one finds: XML, XPath, CSS, Relax NG, SVG... We use the
same syntax for everything. Whereas XSL-FO is an "XMLized" version of CSS,
XPath uses a non-XML syntax due to XML limitations.

10) Consistency.

Each module would be interoperable. Much of the XML technology developed by
the W3C is not consistent and interoperable, with many groups reinventing
the wheel or competing with one another, making life very difficult for
end users.

It is very shocking for outsiders that one group was using HTML links,
another XLinks, and a third XHTML links. The open discussion between CSS
and XSL-FO members, with rude criticism in some cases, is surprising. It is
astonishing that <tag>a </tag> was equal to <tag>a</tag> in some XML
applications but not in the original XML specification.
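The white-space point can be seen directly with a parser. This short sketch
uses Python's bundled XML parser; the normalization in the last line is an
assumption about what an application layered on top might choose to do, not
something the XML specification requires.

```python
import xml.etree.ElementTree as ET

with_space = ET.fromstring("<tag>a </tag>").text
without_space = ET.fromstring("<tag>a</tag>").text

print(repr(with_space), repr(without_space))
# The parser hands the trailing space through unchanged.
print(with_space == without_space)
# An application may still decide to treat the two as equal.
print(with_space.strip() == without_space)
```

So whether the two elements are "equal" depends on the layer that consumes
the parsed text, which is the inconsistency described above.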

It is hard to believe that the wheel is being reinvented yet again. For
example, if I want to change <tag>content</tag> to a bold font, I would use
different approaches depending on whether I am following CSS, XSL-FO, or
MathML; the problem is not that we can choose our favourite; the problem is
that we are forced to use whatever is implemented in browsers for each case
(i.e. using all three on some occasions).

11) Simplification

Elimination of all unneeded complexity and redundancy. For instance, the
first non-XML version of CanonML still included the tag/entity pairing.
For example

<para>
   My favourite greek letter is &beta; because elegance
</para>

was encoded as

[::para
   My favourite greek letter is \beta because elegance
]

now it is done like

[::para
   My favourite greek letter is ::beta because elegance
]

since, a few days ago, we took a LISP-TeX functional approach in CanonML,
dropping the inheritance and last residues of the initial XML-based
CanonML. Apparently T. Bray also wants to eliminate most entities in a
future XML 2.0.
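As a rough illustration of the change from entities to the "::name" form,
here is a hypothetical Python converter. Only the "::beta" spelling comes
from the example above; the function name and the rule for what counts as
an entity name are assumptions made for this sketch.

```python
import re

# Hypothetical converter: the "::name" form is taken from the example above,
# but this function is only an illustration, not part of CanonML.
ENTITY = re.compile(r"&([A-Za-z][A-Za-z0-9]*);")

def entities_to_canonml(text):
    """Rewrite named XML entity references such as &beta; as ::beta."""
    return ENTITY.sub(r"::\1", text)

print(entities_to_canonml("My favourite greek letter is &beta; because elegance"))
```

The post does not say how a "::name" token is terminated when it is not
followed by white space, so the sketch leaves that case alone.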

In fact, we recently discussed a rendering problem with one of MathML's
predefined entities: &dd;. That problem had already been solved in the
original CanonMath language, which did not use the entity.

I am also worried about XML's handling of white space, and that is also
simplified.

All other complexities such as namespaces, Schemas, DOCTYPE, special
notation for empty tags, and others are also to be removed.

12) But with more power

E.g. eliminate the "--" limitation in comments. This has also been done in
liminal.
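The "--" limitation is easy to reproduce. The sketch below feeds two
comments to Python's bundled XML parser; parses is a hypothetical helper
name used only for this illustration.

```python
import xml.etree.ElementTree as ET

def parses(document):
    """Report whether the parser accepts the document as well-formed."""
    try:
        ET.fromstring(document)
        return True
    except ET.ParseError:
        return False

print(parses("<doc><!-- a single - hyphen is fine --></doc>"))
print(parses("<doc><!-- a double -- hyphen is not --></doc>"))
```

The second document is rejected, because XML forbids "--" inside a comment.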

Allow PIs inside PIs. This nesting is also allowed in some other approach,
but right now I do not remember which.

13) Increase readability

Many datuments cannot be automated, e.g. scientific or mathematical
papers; therefore improved readability may be welcome.

Increase the visual difference between the beginning and the end of marked
fragments.

In XML the difference is a single character, and this is far from optimal.
In liminal this is increased to a difference of two characters.

CanonML copies the excellent readability of SXML, TeX, and similar
approaches.

White space is also used to increase the readability of CanonML datuments.


14) The 13 points above may be a motivating collection of replies to your
original queries.



Juan R.

Center for CANONICAL |SCIENCE)







 


Copyright 2001 XML.org. This site is hosted by OASIS