On 4/22/2018 8:02 PM, Patrick Durusau wrote:
Simon,
I shudder at "...it's just an extraction problem...."
You're supposed to. That shuddering can be the first step away from
years of thinking that we work with solid stuff rather than liquids
or gases or plasma.
Switching from one ontology to another must just be a mapping problem. ;-)
It has similar problems, both technical and cultural.
If those are both "...just..." type problems, why do you think data
scientists keep talking about transformation of data being 80% of what
they do?
In my experience, data scientists talk about data clean-up as 80% of
what they do. I suppose you could count that as transformation, but
it includes everything from badly structured to badly entered to
outright corrupted data. I haven't ever heard "the XML Schema
structured things exactly as we wanted them" from... anyone who
hadn't just created that schema.
I don't think that "oh my god someone let this schema be extended"
is likely a problem for data scientists unless they wish that more
of their data had used those extensions. Granted, it's possible to
create extensions that duplicate data elsewhere in the schema, and
have semi-duplicate data that doesn't match (I have seen it!), but
again, that's not unusual cleanup.
Transformation requires an understanding of both the target and source
formats. Or should I say understanding the semantics of both formats?
Sure, if it's well-formed XML, all manner of things are fairly trivial,
if you just knew which ones to do.
In the case of schema extensions, you can:
(a) ignore the content because it isn't relevant to you
(b) ask for help
(c) study use context, including other people's transformations
(d) guess
Most of the time, (a) or (b) takes care of it. (c) has not been
difficult in my (admittedly distant) experience.
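To make option (a) concrete, here is a minimal sketch of a consumer that keeps the elements it understands and sets aside anything from an extension, rather than failing on it. The namespaces, element names, and the function name are all invented for illustration; nothing here comes from a real schema. It uses only Python's standard xml.etree.ElementTree.

```python
import xml.etree.ElementTree as ET

# Hypothetical namespace and element names, invented for this sketch.
KNOWN_NS = "http://example.com/orders"
KNOWN_TAGS = {"order", "item", "quantity"}


def split_known_and_extensions(xml_text):
    """Return (known, extensions) for a well-formed XML document.

    known holds the local names of elements this consumer understands;
    extensions holds the full tags of everything else.
    """
    root = ET.fromstring(xml_text)
    known, extensions = [], []
    for elem in root.iter():
        # ElementTree reports namespaced tags as "{namespace}local".
        if elem.tag.startswith("{"):
            ns, local = elem.tag[1:].split("}", 1)
        else:
            ns, local = "", elem.tag
        if ns == KNOWN_NS and local in KNOWN_TAGS:
            known.append(local)
        else:
            # Option (a): not relevant to us, so set it aside and move on.
            extensions.append(elem.tag)
    return known, extensions


SAMPLE = """<order xmlns="http://example.com/orders"
       xmlns:x="http://example.com/vendor-extension">
  <item>widget</item>
  <quantity>3</quantity>
  <x:giftWrap>true</x:giftWrap>
</order>"""

known, extensions = split_known_and_extensions(SAMPLE)
print(known)
print(extensions)
```

The extensions list is also a cheap record of where a document went beyond the structures you expected, which is a starting point if you later need (b) or (c).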
There are also times - as Walter Perry has enjoyed reminding us over
the years - where we're more interested in where people jumped the
structures than we are in cases where they followed the rules.
Open schema models make it vastly easier to detect those changes.
In the interest of disclosure, I have seen any number of academic
projects that differ from other projects because they have special need
#1 or #2 or .... To be honest, not really. They typically are encoding
their texts to be different so it works with their tool set (which they
developed), etc. That may not be everyone's experience but it certainly
is mine in the humanities.
My experience is that everyone has something they want to do that
goes beyond the available tools. Either they shut up about it and
forget what they wanted, or they find a corner to allow it. As much
as I hate divs and spans in HTML, I know very very well why people
use them.
I'm not claiming my experience is universal and others may have
different stories to report.
Your experience is at least conventional. I just find those
conventions to be the wrong set.
There certainly are other ways to create vendor lock-in, such as writing
your own database software. (Or HR software, I understand the Pentagon
has some 6,000 such systems.)
It doesn't even take that. When XML was new, a lot of the
excitement around it came from businesses trying to get a leg up on
their competition by being ahead on the standards process. There
are still players in that game, but the stickiness of relationships,
the complexity of creating interfaces, the challenges of backwards
compatibility, and the power of brand loyalty seem vastly more
powerful.
You may be right: whether encouraged or not, bad behavior (lack of
interoperability) will occur. Still, the lack of interoperability
creates a lot of wasted time and effort.
I no longer consider vocabulary interoperability good behavior. I
haven't for a long time. I think syntactic interop has much greater
value, making it much easier to share tools. Those tool ecosystems
are finally reaching the point where we can flexibly create and
exchange information without having to stay in lock step.
Way back in 2003, I gave a rant at Extreme Markup Languages on these
issues, accompanied by Playmobil figures and a bit of Strauss. I took
a wrong turn in diving too deep on the value of specific syntactic
details, but I'm quite content in the overall point that shared
syntax is a blessing and shared semantics a curse.
XML was created in part as a reaction to HTML's fixed vocabulary.
I'm puzzled that HTML today seems to be grasping the need for
flexibility far better than the XML world - the dreaded div, span,
and class, JSON as needed, plus the slow refactoring of those pieces
into web components. We seem, though, to finally be reaching the
point where we can usefully exchange information and even interfaces
for working with information without endless negotiation over what
the structure must look like.
Enjoy the sunshine, endure the jetlag!
Thank you! I hope your gardening is going well!
Simon