- From: "Didier PH Martin" <martind@netfolder.com>
- To: "gopi" <gopi@aztecsoft.com>, "Jon Smirl" <jonsmirl@mediaone.net>, "Xml-Dev" <xml-dev@XML.ORG>
- Date: Tue, 28 Mar 2000 10:09:10 -0500
Hi,
I am following the discussion about the DOM and permanent infosets with great interest, and it has raised some questions for me too.
a) I saw a post saying that XT replied with an "Out of memory" error instead of processing a 128K document. I do not know the Java internals well enough to say much about it here, but enough to say that the error message is most likely caused either by some XT limitation or by the Java environment itself (i.e. the Java VM). Why? Simply because on a modern OS, memory is limited by disk space; as you know, we call this virtual memory. So, when the physical memory is insufficient to contain all the code and data, the additional code and data is stored on disk and swapped in and out.

From my own experiments in Didier's labs, I found that the main problem with information set based implementations is not the memory limitation per se (if the infoset engine is able to fully use the virtual memory) but more a question of clustering. If, for instance, a request for a particular node is made to the information set and the node in question is not currently in physical memory, this leads to a memory swap. So, the real problem is more: how can we implement information sets so that the swapping mechanism is minimized? Up to now, on the files I used, the swapping factor has been reasonable, but I never made a test with a 128K file. I only made tests with big permanent infosets, not transient ones. Also, my XSLT scripts tend to do local processing and do not pull nodes from the beginning or the end when the current processing is, for example, in the middle of the file. So my scripts do not lead to a lot of memory swapping, or the swapping occurs in a more or less linear fashion, without much random access outside the scope of the working set pages. In this case, a working set of pages is a collection of neighboring pages.
So, to stretch my experiments, I need bigger files and scripts different from my usual patterns. Does anyone have a big XML file and its corresponding XSLT script so that I can make new experiments with transient information sets that are not Java based? You'll help me make progress in pursuing the truth on this issue. If you have this kind of stuff and are able to share it (I can sign a non-disclosure agreement if you want), please e-mail me; I thank you in advance for this. I'll also be able to give the results back to the community and then discover what is:
a) language dependent
b) platform dependent (yes, the Java environments are different platforms - i.e. does the IBM VM's memory management behave like Microsoft's or Sun's?)
c) OS specific.
d) implementation dependent.
--------------------
b) About the permanent infosets. Again, I made several experiments in Didier's labs with very large collections and infoset permanency. In some ways, large collections are what you find today in relational databases. So, I implemented an information set with a particular structure. An element can contain other elements (so nothing new under the sun), but the element collection is managed differently. When the collection is large, it is implemented as a red-black tree. When the collection is small, it is implemented as a list. So, if an element is a parts-catalog element containing a big collection of parts, then that collection is implemented as a red-black tree. On the other hand, each part element contains only a few sub-elements, and that collection is implemented as a simple list. The final goal of this is to minimize the access time for a particular node (the element name is the key when a red-black tree is used). Thus, accessing a particular part element in a parts-catalog element takes only O(log N) search iterations to retrieve a sub-node (i.e. a part element). And retrieving an element contained in a part element is a linear search proportional to the number of items in the collection (which is small). So far, so good.
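Roughly, the idea looks like this in C++ (a simplified illustration, not the exact code from Didier's labs; std::multimap is a red-black tree in most standard library implementations, and the class names and the promotion threshold are just for the example):

#include <map>
#include <string>
#include <vector>

// Sketch: an element whose child collection is stored either as a plain
// list (small collections, linear scan) or as a red-black tree keyed by
// element name (large collections, O(log N) lookup).
struct Element {
    std::string name;

    // Small collections: simple list, searched linearly.
    std::vector<Element*> childList;

    // Large collections: std::multimap is typically a red-black tree,
    // so lookup by element name costs O(log N) comparisons.
    std::multimap<std::string, Element*> childTree;

    bool useTree = false;                         // switches once the list grows
    static constexpr std::size_t kThreshold = 64; // illustrative cut-over point

    void addChild(Element* child) {
        if (!useTree && childList.size() >= kThreshold) {
            // Promote the list to a tree once the collection gets large.
            for (Element* c : childList)
                childTree.insert({c->name, c});
            childList.clear();
            useTree = true;
        }
        if (useTree)
            childTree.insert({child->name, child});
        else
            childList.push_back(child);
    }

    Element* findChild(const std::string& key) {
        if (useTree) {
            auto it = childTree.find(key);        // O(log N)
            return it != childTree.end() ? it->second : nullptr;
        }
        for (Element* c : childList)              // O(N), N small
            if (c->name == key) return c;
        return nullptr;
    }
};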
The second dimension of this system is infoset node permanency. I used memory-mapped files or a memory-mapped object database for my experiments. In both cases, this means I am not restricted to the physical memory space, since the memory is mapped to a file. With the memory-mapped file, the amount of memory available was fixed, except on Windows 2000, where I was able to have a growable memory-mapped file. With the memory-mapped database, the database is growable.
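For those who have not played with memory-mapped files, the basic mechanism looks roughly like this (a simplified POSIX sketch, not my actual code; on Windows the equivalent calls are CreateFileMapping and MapViewOfFile, and the growable mapping I mentioned for Windows 2000 is not shown):

#include <cstddef>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

// Map an infoset file into the address space so nodes can be addressed
// as ordinary memory; the OS pages them in and out on demand.
void* mapInfoset(const char* path, std::size_t& length) {
    int fd = open(path, O_RDWR);
    if (fd < 0) return nullptr;

    struct stat st;
    if (fstat(fd, &st) != 0) { close(fd); return nullptr; }
    length = static_cast<std::size_t>(st.st_size);

    void* base = mmap(nullptr, length, PROT_READ | PROT_WRITE,
                      MAP_SHARED, fd, 0);
    close(fd);  // the mapping keeps its own reference to the file
    return base == MAP_FAILED ? nullptr : base;
}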
The issue I discovered with a permanent infoset is not the size per se (my test infoset reaches about 15 MB) but rather where in virtual memory each element is located. For instance, if I ran a script that requested random elements in the parts-catalog collection, I got a high level of swapping. If, however, the requests were for elements in roughly the same page neighborhood, I got reasonable swapping. I guess that to master the swapping monster I will have to look at the solutions found by the relational DB people.
Improvements: the big collections are managed by red-black trees, and this is not the most efficient way to manage huge indexes. I should probably move toward B+-tree collections and have the bucket size equal to the virtual memory page size (generally 4 KB). Another improvement is to master the allocation of objects so that they are clustered by index proximity. This last case is only applicable to red-black or B+-tree indexes.
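The bucket-size idea would look something like this (a rough sketch, not tested code; the field layout and the key type are just for illustration):

#include <cstddef>
#include <cstdint>

// Size a B+-tree node so it fills exactly one 4 KB virtual memory page;
// visiting a node then touches at most one page, and swapping stays
// proportional to tree height rather than to fan-out.
constexpr std::size_t kPageSize = 4096;

struct BPlusLeaf {
    std::uint32_t count;       // number of entries in use
    std::uint32_t nextLeaf;    // page number of the right sibling

    struct Entry {
        std::uint64_t key;     // e.g. a hash of the element name
        std::uint64_t nodeRef; // page/offset of the child infoset node
    };

    // Fill the rest of the page with entries.
    static constexpr std::size_t kHeader = 2 * sizeof(std::uint32_t);
    static constexpr std::size_t kFanout =
        (kPageSize - kHeader) / sizeof(Entry);
    Entry entries[kFanout];
};

static_assert(sizeof(BPlusLeaf) <= kPageSize,
              "a leaf bucket must fit in one virtual memory page");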
The coding is based on C++, with constructors modified to allocate memory in paged memory. This implies that the heap allocation mechanism was modified and that I did not use the default C++ heap allocation mechanisms.
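In simplified form, the allocation scheme looks roughly like this (a sketch only, not the code from Didier's labs; the arena stands in for the memory-mapped region, and deallocation and persistence of the bump pointer are left out):

#include <cstddef>
#include <new>

// Node objects override operator new so they are carved sequentially out
// of a memory-mapped arena instead of the default C++ heap. Nodes created
// together therefore land in the same pages, which helps clustering.
struct MappedArena {
    char*       base = nullptr;  // start of the mapped region
    std::size_t size = 0;        // total bytes available
    std::size_t used = 0;        // bump-pointer offset

    void* allocate(std::size_t n) {
        n = (n + 7) & ~std::size_t(7);            // keep 8-byte alignment
        if (used + n > size) throw std::bad_alloc();
        void* p = base + used;
        used += n;
        return p;
    }
};

// Assumed to be pointed at the mapped region (base and size) at startup.
MappedArena g_infosetArena;

struct InfosetNode {
    // Allocation goes to the mapped arena; delete is a no-op because the
    // whole arena lives (and persists) in the mapped file.
    static void* operator new(std::size_t n) {
        return g_infosetArena.allocate(n);
    }
    static void operator delete(void*) noexcept {}
};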
I hope this can help the community's reflection. Now it's time to go back to Didier's labs and do some work.
Cheers
Didier PH Martin
----------------------------------------------
Email: martind@netfolder.com
Conferences: Web Chicago (http://www.mfweb.com)
XML Europe (http://www.gca.org)
Book: XML Professional (http://www.wrox.com)
Column: Style Matters (http://www.xml.com)
Products: http://www.netfolder.com