Re: [xml-dev] Paper with an order of magnitude speed increase for parsing JSON!

From: "Liam R. E. Quin" <liam@fromoldbooks.org>
To: Rick Jelliffe <rjelliffe@allette.com.au>, xml-dev <xml-dev@lists.xml.org>
Date: Wed, 02 Jan 2019 12:29:33 -0500

On Wed, 2019-01-02 at 22:00 +1100, Rick Jelliffe wrote:
> [...]
> has the old answer of preprocessing files through grep (etc) to find
> candidates now respectable again?

The tradeoff for XML is generally that reading the file twice (once to
work out whether you want to parse it, and once to parse it) is likely
to be slower than simply parsing it in many cases, depending on the
data structures you build. But there are things that can change this:

* a persistent index, e.g. a full-text database, can sometimes answer
the question of which XML file(s) to load without having to load them:
the index can be much smaller than the files, and/or can exploit Zipf's
Law to look at only a fraction of the index. But if you're going to do
this, why not use what i call a fast-forest store, perhaps with an
XQuery interface?

* On a multi-CPU system, if you have millions of tiny XML files, a
thread that pre-reads the files will make parsing go much faster, as
the next file will usually be in the disk cache (at one time my text
retrieval system took advantage of this, but i had to remove it years
ago to support Microsoft Windows: it's very platform-specific).

* Modern server storage in some cases is faster than main memory, or as
fast as the bus speed. So a disk cache is pointless, and scanning files
might be cheap. But this storage is expensive, so using it for a
database index may make more sense.
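
A rough sketch of the pre-reading thread from the second point, in
Python. This is only an illustration, not the code my text retrieval
system used; the names prefetch and parse_all are made up for the
example, and it assumes the files are small enough that reading each
one in full is reasonable. Whether it helps at all depends on the
platform and on the storage.

```python
import threading
import xml.etree.ElementTree as ET


def prefetch(paths, chunk_size=1 << 20):
    """Read each file and discard the bytes, to warm the OS disk cache."""
    for path in paths:
        try:
            with open(path, "rb") as f:
                while f.read(chunk_size):
                    pass
        except OSError:
            pass  # leave error reporting to the parser


def parse_all(paths):
    """Parse every file while a background thread reads ahead."""
    paths = list(paths)
    reader = threading.Thread(target=prefetch, args=(paths,), daemon=True)
    reader.start()
    # The parser follows the same order as the reader, so with luck
    # each file is already cached by the time it is opened here.
    roots = [ET.parse(path).getroot() for path in paths]
    reader.join()
    return roots
```

Nothing here guarantees the reader stays ahead of the parser; if it
falls behind, both threads just end up reading the same files.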

So in the end you have to measure.
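
For what it's worth, a measurement could start from something like
the following Python sketch, which compares parsing every file against
a grep-style byte scan followed by parsing only the candidates. The
function names are invented for the example, and it assumes UTF-8
files and a search term that appears literally in the bytes (no entity
references, no markup in the middle of the term), so the scan is only
a candidate filter.

```python
import time
import xml.etree.ElementTree as ET


def matches(path, needle):
    """Parse the file and report whether any element text contains needle."""
    root = ET.parse(path).getroot()
    return any(needle in (el.text or "") for el in root.iter())


def parse_everything(paths, needle):
    """Baseline: one read per file, always a full parse."""
    return [p for p in paths if matches(p, needle)]


def prefilter_then_parse(paths, needle):
    """Scan the raw bytes first; parse only files that might match."""
    raw = needle.encode("utf-8")
    hits = []
    for p in paths:
        with open(p, "rb") as f:
            if raw not in f.read():  # first read: cheap candidate test
                continue
        if matches(p, needle):       # second read: full parse
            hits.append(p)
    return hits


def timed(fn, *args):
    """Return (result, elapsed seconds) for one call."""
    start = time.perf_counter()
    result = fn(*args)
    return result, time.perf_counter() - start
```

Run timed(parse_everything, paths, term) and
timed(prefilter_then_parse, paths, term) on your own files, more than
once, since the second run will usually find everything in the disk
cache.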

Reading the actual paper, 4,000 lines of C to parse JSON more quickly
seems a lot, especially when part of the motivation is that loading
into Hadoop is slow. But it's a research paper. See also
https://blog.cloudera.com/blog/2017/02/performance-comparing-of-different-file-formats-and-storage-engines-in-hadoop-file-system/

Big data analytics, in which large amounts of data might never be
parsed, is likely a very different beast from technical documentation
(say), where every file is probably read (from an XML db or from disk)
many times more often than it's written.

Best,

Liam

--
Liam Quin, https://www.holoweb.net/liam/cv/
Web slave for vintage clipart http://www.fromoldbooks.org/
Available for XML/Document/Information Architecture/
XSL/XQuery/Web/Text Processing/A11Y work & consulting.

References:
- Paper with an order of magnitude speed increase for parsing JSON!
  - From: Rick Jelliffe <rjelliffe@allette.com.au>
