[libxml-devel] memory consumption when finding inside of large document never goes away
Charlie Savage
cfis at savagexi.com
Sat Aug 16 17:30:02 EDT 2008
Hi Matt,
> I am running on OSX and RedHat. I am using the Node#find method with an
> XPath expression for the currently desired node in the default namespace
> of the document. The crashes stopped happening when I set my nodes
> variable to nil before calling GC.start. The memory does not spike too
> much if I call GC.start after every single Node#find but since parsing a
> single document into the required number of Ruby objects necessitates
> calling Node#find over a thousand times GC.start is really slowing
> things down.
Right, that is what you have to do (nodes = nil before GC.start). In my
view, this is a design flaw in Ruby's GC, but I didn't get very far when
I raised it on the Ruby core list. We can work around it, but I haven't
had a chance to do so. If you feel like writing some C code, I can
explain how I think the problem can be fixed so you avoid all the
manual GCs.
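
Until then, one way to cut the cost of the manual GCs is to run them
every N lookups instead of after every find. This is only a sketch:
find_with_periodic_gc and gc_every are names I made up, not part of the
bindings, and whether batching keeps memory low enough for your
documents is something you would have to measure.

```ruby
# Hypothetical helper (not part of libxml-ruby). It runs each XPath
# lookup, hands the result to the block, then drops the reference so
# that a later GC.start can reclaim it. GC runs once every `gc_every`
# lookups rather than after every single find.
def find_with_periodic_gc(node, xpaths, gc_every: 100)
  xpaths.each_with_index do |xpath, index|
    nodes = node.find(xpath)   # anything that responds to #find works
    yield xpath, nodes
    nodes = nil                # release the result before collecting
    GC.start if ((index + 1) % gc_every).zero?
  end
  nil
end
```

The block must not hold on to the result (or to nodes taken from it)
after it returns, otherwise setting nodes to nil achieves nothing.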
> From what I can tell calling Node#find on such a large document is
> causing Ruby to add extra object heaps which increases my memory usage
> in a way that the program does not recover from. This is unfortunate
> since I want to run multiple processes per box but each process is using
> several hundred megabytes of RAM after parsing a few large documents.
Well, the bindings generally only wrap an object when you access it. So
in theory, calling nodes = document.find should only add one Ruby object
(the result object). The code used to wrap every returned object, but
I'm pretty sure I changed it. To verify, the code is in the
xpath_object class.
Now if you then iterate over each returned node in the result, they will
of course get wrapped (i.e., a Ruby object is created for each libxml node).
> The SAX parser with empty callbacks can rip through the document in
> about 17ms which is very fast in my opinion. The speed problem arises
> when I try to do anything in the callbacks. The nature of the program
> and the structure of the XML requires me to do quite a few lookups in a
> series of hashes to determine the type of the current node and the type
> of each text element. When SAX parsing I have to hit the hashes more
> often since I don't have as much context information available as I do
> with a recursive depth first document walk with the document parser node
> objects. With the necessary code in the callbacks I was seeing parse
> times around 400ms which is about twice as slow as the document based
> approach.
Oh, I see. So it's all in the lookups.
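
If the missing context is what forces the extra lookups, one option is
to keep a stack of open element names in the handler and key a single
hash on the current path. A rough sketch, with an invented class name
and an invented types_by_path table; I have no idea whether it would
beat your document-based numbers:

```ruby
# Hypothetical SAX handler: tracks the path of open elements so each
# element's type is resolved with one hash lookup.
# With libxml-ruby you would also
#   include LibXML::XML::SaxParser::Callbacks
# so that callbacks not defined here fall back to no-ops.
class PathTrackingHandler
  attr_reader :values

  def initialize(types_by_path)
    @types_by_path = types_by_path  # e.g. { 'order/item/price' => :decimal }
    @stack = []
    @type = nil
    @values = []
  end

  def on_start_element(name, _attributes)
    @stack.push(name)
    @type = @types_by_path[@stack.join('/')]
  end

  # A parser may deliver one text node in several chunks; this sketch
  # records each chunk separately.
  def on_characters(chars)
    @values << [@type, chars] if @type
  end

  # Only leaf elements are typed here, so the type is cleared on close.
  def on_end_element(_name)
    @stack.pop
    @type = nil
  end
end
```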
> XMLReader looks very interesting from the API docs but I am not sure
> that I grok how to actually use it. I will keep searching for resources
> but if you know of any examples of usage out there I would love to read
> some code.
I think a couple of the tests (libxml/test) might help a bit. I can't
say I'm very familiar with that code either, but look for Python or
.NET examples (libxml apparently copied the API from .NET, going by the
libxml site).
Charlie