Schemaless XML?

Hi Folks,

Scenario: You are building an application that receives XML documents from various sources. The kinds of data in the XML documents are varied. The XML documents themselves are structured in various ways. Over time, new XML documents are received, containing new, unanticipated kinds of data.

How will your application handle such diversity?

One approach is to create an XML Schema (XSD) that models all the kinds of XML documents the application will receive. When a new kind of XML document needs to be processed, the XSD is updated. The disadvantage of this approach is that processing of the new XML documents is delayed while the XSD and the application are updated to handle the new data. The advantage of this approach is that the application knows exactly what the data is and can process it efficiently.

An alternative approach is for the application to go “schemaless.” The application performs machine learning on the data it receives. I’m not sure what “machine learning on the data” means. I suspect it means that an internal schema (in some form or another) is dynamically generated. If so, then the approach is not actually schemaless; rather, there is a dynamically generated schema. Do you agree? Is machine learning technology sufficiently advanced that it can classify and understand the data to the same degree as a carefully crafted schema and carefully crafted application code? Have you gone schemaless?
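
To make the “dynamically generated schema” idea concrete, here is a minimal sketch in Python. It is not machine learning, and it is not how any particular schemaless product works; it only assumes that an "internal schema" could be as simple as a record of which element names, attributes, and child elements have been observed so far. The names new_schema and observe are hypothetical, invented for this illustration.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict


def new_schema():
    # Hypothetical "internal schema": element name -> attribute names
    # and child element names observed for that element so far.
    return defaultdict(lambda: {"attributes": set(), "children": set()})


def observe(schema, xml_text):
    """Merge one document's structure into the schema.

    Returns the set of element names that had never been seen before.
    """
    root = ET.fromstring(xml_text)
    unseen = set()
    for elem in root.iter():  # visits the root and every descendant
        if elem.tag not in schema:
            unseen.add(elem.tag)
        entry = schema[elem.tag]
        entry["attributes"].update(elem.attrib)  # adds the attribute names
        entry["children"].update(child.tag for child in elem)
    return unseen


schema = new_schema()
observe(schema, '<book id="1"><title>XML</title></book>')

# A later document carries an unanticipated kind of data.
added = observe(
    schema,
    '<book id="2"><title>XSD</title><price currency="USD">10</price></book>',
)
print(sorted(added))  # ['price']
```

Even in this toy version the application ends up holding a schema of sorts; it is just one that grows as documents arrive rather than one written in advance. What the sketch does not do is understand what "price" means, which is the part my questions above are really about.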

/Roger