RE: [xml-dev] Text Markup Part II
- From: "Cox, Bruce" <Bruce.Cox@USPTO.GOV>
- To: Rand McRanderson <therandshow@gmail.com>, "xml-dev@lists.xml.org" <xml-dev@lists.xml.org>
- Date: Mon, 23 Jan 2012 15:06:15 -0500
As I read the many fascinating responses to Roger's original query, it seems to me that they often conflate two largely distinct meanings of the term "markup." Markup in the editorial, proofreader, or documentalist sense isn't the same thing as XML markup. A scholar or editor adding marks to a text (handwritten or otherwise) is working at one abstract layer (maybe more than one, but just one for the sake of this argument), while a technician taking the results and adding XML markup is working at a different layer.
Henry Thompson got it exactly right, I think, when he said that the purpose of markup (XML markup, presumably) is "To make explicit for mechanical processing what is implicit-but-evident in the original." Editors and scholars, you could argue, do something very similar, but their audience is other humans, not machines, and humans are much harder to please. (I'm tempted to add "evident to persons" to Thompson's definition, but have so far resisted.)
As Jeremy Griffith pointed out, when automated typesetting was first used, the markup terminology was all about layout and typesetting, and was usually proprietary to the hardware. Early dedicated word processors were similar, such as the one I used here as late as 1989. The right kind of scholar might be able to make a case that the advent of general-purpose computers, especially personal computers, enabled general-purpose markup languages and gave a new dimension to editorial or scholarly markup. The quickening pace of technological change calls for markup sufficiently abstracted from the hardware to adapt easily to unanticipated uses. Personal computers certainly enabled the Internet, pushing HTML way beyond its original purpose and creating the demand for XML. The very possibility of creating one's own markup language was, I'm guessing, more than a little exhilarating.
When the Government Printing Office switched from hot-lead typesetting to photocomposition, US patents were encoded using a hardware-specific typesetting markup that we called Blue Book (sorry, I've forgotten what GPO called it, and I never did know what hardware they used). That was before my time, but from about 1970 through 2000, we converted Blue Book to Green Book, our proto-semantic markup (Dialog-esque tags in 80-column punched-card images) intended for populating search systems. In 2001, we moved to full semantic (SGML) markup (WIPO Standard ST.32), which was then converted to photocomposition markup for printed patents. Since 2004, we've been using XML and a standardized vocabulary (WIPO Standard ST.36, and soon ST.96). Page images, which have replaced printing, are created from PostScript files generated from the XML with proprietary technology and published in PDF wrappers.
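For those who haven't seen patent XML, here is a minimal sketch of the kind of bibliographic fragment an ST.36-style grant carries. The element names follow the general pattern of our grant DTDs, but the fragment itself is illustrative, not quoted from any actual document:

    <us-patent-grant lang="en" dtd-version="v4.2">
      <us-bibliographic-data-grant>
        <publication-reference>
          <document-id>
            <country>US</country>
            <doc-number>9999999</doc-number>
            <kind>B2</kind>
          </document-id>
        </publication-reference>
        <invention-title>Hypothetical widget</invention-title>
      </us-bibliographic-data-grant>
    </us-patent-grant>

Every one of those distinctions was already present, implicitly, in the printed patent; the XML just makes them available to machines.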
Patent examiners annotate what applicants submit. These annotations, similar to the scholarly markup that some of you have discussed, are private to the examiner, even after the contents of a file wrapper are made public. As we convert file content to text, we have to capture these annotations, preserve their context (including their exact location or reference point in the content), and present them to the examiner on demand, but not to the general public or even to privileged viewers of the file wrapper. It's an interesting design challenge for the XML, probably already solved for other businesses with similar requirements. I'd welcome any pointers to commercial solutions.
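One approach that suggests itself is standoff markup: keep the annotations in a separate document that points back into the content, so access control can be applied to the annotation file independently of the file wrapper. A hypothetical sketch, with all names invented for illustration:

    <!-- content file: public once the file wrapper is opened -->
    <description>
      <p id="p0042">The applicant claims a method of ...</p>
    </description>

    <!-- annotation file: stored and secured separately, examiner-only -->
    <annotations audience="examiner">
      <note target="#p0042" start="4" end="13"
            author="examiner-1234" date="2012-01-23">
        Compare claim 1 of the earlier reference.
      </note>
    </annotations>

Whether the pointer is an ID reference, an XPath, or character offsets is a tradeoff between robustness and precision when the content is revised.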
Bruce B Cox
USPTO/OCIO/AED/SAED
571-272-9004
-----Original Message-----
From: Rand McRanderson [mailto:therandshow@gmail.com]
Sent: Friday, January 20, 2012 14:04
To: xml-dev@lists.xml.org
Subject: Re: [xml-dev] Text Markup Part II
I think a lot of it comes down to a faster-moving technology cycle, but also a less stable technology work environment. More companies want to incorporate new technologies faster, and many companies want to be able to switch vendors easily and get new engineers up to speed faster.
Let me back up one step. I think we intuitively divide documents into chunks and sections and pieces and learn to identify where the important parts are. In addition, in most big companies, even free text ends up following a writing standard that spells out how things should be laid out, so that these pieces go in one place and those pieces in another.
However, the intuitive understanding of free text, or even of free text written to a company standard, is usually tribal knowledge. Moreover, that knowledge is hard to move between computer systems. This doesn't matter when you have a very stable work environment where people have been working together for years and vendors have figured out how to process the free text and integrate it into their systems.
But the working environment for engineers is not as stable as it used to be, partly because technology moves faster nowadays, but also because tech companies are not as stable as they used to be. This has meant a lot of training for new engineers or, even worse when new project managers or new vendors come in, retraining of the old staff.
Markup doesn't solve these problems, but it makes them more tractable by allowing machine translation of information, easier standardization of document structure, and organized rules that replace the intuitive understanding of where the important information is.
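As a concrete illustration (the element names here are entirely hypothetical), even a small DTD turns the "where things go" knowledge that a house style guide once described in prose into rules a machine can enforce:

    <!-- hypothetical DTD: a validator can now reject any report
         whose sections are missing or out of order -->
    <!ELEMENT report (summary, findings, recommendation)>
    <!ELEMENT summary (#PCDATA)>
    <!ELEMENT findings (finding+)>
    <!ELEMENT finding (#PCDATA)>
    <!ELEMENT recommendation (#PCDATA)>

A new engineer, or a new vendor's toolchain, gets the structure from the schema instead of from years of sitting next to the right people.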
Plus there's fashion: once a technology trend gets started, it takes on a life of its own, irrespective of your needs.
Sincerely,
John Thomas
On Fri, Jan 20, 2012 at 7:14 AM, Michael Hopwood <michael@editeur.org> wrote:
>>>> What I'm trying to get at is the fundamental rationale behind what appear to be two extremes that are not necessarily compatible, and why we ended up with only the latter.
>
> Actually, and for a very long time before it became "cool" ;) to talk about this, it's really only been the latter, except that the "markup" for the rest of the document is implicit. Adding a "root" tag for the whole document simply formalises and makes machine-readable (although you could have done this in a variety of other ways, like filename extensions) what "documentalists" of all kinds have been doing for a very long time: (more or less formally) identifying integral units of documentation.
>
> A MARC21 serialisation has message headers to separate different catalogue records in the stream - those are short documents, generally, although they can potentially get very long. And every element, as well as the whole thing, is marked up.
>