Re: [xml-dev] RE: Schemaless XML?

"Schemaless" just means "we didn't bother working out the schema in advance".

Either something still has to have a priori knowledge of the data and how to make sense of it, or you have to analyze it to try to make sense of it in a way that enables whatever processing you're doing. That analysis might be automatable or it might not be--it depends on the data and what you're looking for.

For example, in a corpus of random XML documents I could analyze the tag names and try to find correlations that suggest common semantics (how many variants of "p", "para", and "paragraph" are there if I think this is documents-for-reading type content?), or I could look at element content to try to find values that look like dates or hyperlinks or references to media objects or whatever it is I'm interested in.
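Something along these lines would do the tag-name part (a rough, untested XQuery 3.0 sketch; "corpus" is just a placeholder for whatever collection URI your database exposes):

(: Census of element names across the corpus, most frequent first.
   A cluster of high-frequency names like "p", "para", and "paragraph"
   is a strong hint of documents-for-reading content. :)
for $e in collection("corpus")//*
group by $name := local-name($e)
order by count($e) descending
return <element name="{$name}" count="{count($e)}"/>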

The point is that what you look for and how you look for it depends entirely on what you're trying to do with the data and what you know about it up front.

There are different classes of problem and different solutions optimized for them.

I wouldn't use a noSQL database to manage data that is already highly regular and for which the processing is both well defined and needs to be optimized (e.g., tracking financial information or doing inventory control or whatever). By the same token, I wouldn't use a SQL database to manage data where the structure is not easily reducible to tables and the range of types of processing I might want to apply to it is unbounded and likely to change drastically over time.

I would say that we can see relational databases and similar highly-constrained approaches, to some degree, as optimization strategies that were necessary in the face of limited computing resources. As long as your scale is not extreme, those limitations largely don't apply today, so it's now practical to use less-constrained approaches to data storage, analysis, and retrieval.

If I take every XML document I have lying around and just chunk it into an XML database like eXist or BaseX or MarkLogic, I can then do lots of interesting things quickly, because I can write new queries against the data based on whatever understanding I have or can build through the analysis these tools let me perform. If some of these documents have associated grammars, that's useful but not in any way necessary.
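To make "new queries against the data" concrete, here is the kind of ad hoc query I mean, as a rough, untested sketch against a BaseX database named "mydocs" (created with something like CREATE DB mydocs /path/to/docs; the name and path are placeholders). It hunts for date-like content without caring what the elements are called or whether any grammar exists:

(: Find elements whose entire content looks like an ISO date,
   wherever they occur and whatever they happen to be named. :)
for $e in collection("mydocs")//*[matches(normalize-space(.), "^\d{4}-\d{2}-\d{2}$")]
group by $name := local-name($e)
return <date-like element="{$name}" hits="{count($e)}"/>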

But likewise there's no magic in these tools--if I put a big pile of crap in I still have a big pile of crap, it's just that I can decrapify it faster because the tools offer powerful analytic facilities (e.g., XQuery, big-data analysis tools like Hadoop, etc.). If I can store the results of my analysis back into the same system, it makes things easier (because it's all in one place) and lets me layer new stuff into my system. For example, if I can infer or impose relationships among data items it might be very useful to capture those relationships as RDF triples that I put back into a triple store. If the triple store happens to be the same system that contains the rest of my data then it makes adding those triples to my analysis processing easier.
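As a rough, untested sketch of that round trip (BaseX-flavored; db:add() is BaseX-specific and eXist or MarkLogic have their own equivalents, and the dct:references predicate is just an illustrative choice), inferred cross-references could be captured as RDF/XML and written back into the same database:

(: Turn inferred "this document references that resource" pairs into
   RDF/XML and store the result back alongside the source data so
   later queries can join against it. :)
let $triples :=
  <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
           xmlns:dct="http://purl.org/dc/terms/">{
    for $doc in collection("mydocs")
    for $target in distinct-values($doc//@href)
    return
      <rdf:Description rdf:about="{document-uri($doc)}">
        <dct:references rdf:resource="{$target}"/>
      </rdf:Description>
  }</rdf:RDF>
return db:add("mydocs", $triples, "analysis/references.rdf")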

Note that the use or non-use of XML in this scenario, and the use or non-use of grammars with that XML, is not very interesting by itself. XML has obvious advantages for certain classes of data and it has other important advantages as an open data format. But it has no particular magic in this context of dealing with large volumes of heterogeneous data.

If you're looking at technology that can provide the most value in the context of "I have lots of data in a variety of forms and I need to try to get my hands around it," then a system that can accept XML, RDF, JSON, or other formats that map easily onto those kinds of basic data representations is going to offer more value than one that only manages a single format, whether that single format is XML, triples, relational tables, or whatever.
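XQuery 3.1's standard fn:json-to-xml() is one existing example of that kind of mapping: it turns arbitrary JSON into a W3C-defined XML vocabulary, so JSON sources can be loaded and queried alongside native XML with the same tools. A trivial, untested sketch:

(: Map a JSON snippet into the XPath 3.1 function-namespace vocabulary
   (map, array, string, number, ...) and query it like any other XML. :)
let $json := '{"title": "Schemaless XML?", "date": "2016-10-12", "tags": ["xml", "nosql"]}'
let $x := json-to-xml($json)
return $x//*:string[@key = "date"]/string()   (: returns "2016-10-12" :)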

Cheers,

Eliot
--
Eliot Kimber
http://contrext.com
 


From: "Costello, Roger L." <costello@mitre.org>
Date: Wednesday, October 12, 2016 at 6:51 AM
To: "xml-dev@lists.xml.org" <xml-dev@lists.xml.org>
Subject: [xml-dev] RE: Schemaless XML?

Hi Folks,

 

Thank you very much for your excellent responses.

 

A popular kind of database these days is the schemaless noSQL database. My understanding of these databases is that once the data is stored in the database, the database dynamically analyzes the stored data and does aggregation and correlation. In other words, the database dynamically generates a schema (here I use the term “schema” in a very generic sense).

 

So if people can do productive things with schemaless databases (because a schema is dynamically generated), then people should be able to do productive things with schemaless XML (by dynamically generating a schema). Yes?

 

/Roger

 


