[libxml-devel] memory consumption when finding inside of large document never goes away

Charlie Savage cfis at savagexi.com
Sat Aug 16 17:30:02 EDT 2008


Hi Matt,

> I am running on OSX and RedHat.  I am using the Node#find method with an 
> XPath expression for the currently desired node in the default namespace 
> of the document.  The crashes stopped happening when I set my nodes 
> variable to nil before calling GC.start. The memory does not spike too 
> much if I call GC.start after every single Node#find but since parsing a 
> single document into the required number of Ruby objects necessitates 
> calling Node#find over a thousand times GC.start is really slowing 
> things down.

Right, that is what you have to do (nodes = nil before GC.start).  In my 
view, this is a design flaw in Ruby's GC, but I didn't get very far when 
I asked about it on the Ruby core list.  We can work around it, but I 
haven't had a chance to do it.  If you're feeling like writing some C 
code, I can explain how I think the problem can be fixed so you avoid 
all the manual GCs.
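
For reference, the workaround (nodes = nil before GC.start) can be sketched 
roughly as below.  This is only a sketch under assumptions: find_in_batches 
and gc_every are hypothetical names, not part of libxml-ruby, and forcing a 
collection every N queries rather than after every query is an untested 
guess at trading some peak memory for speed.  The node argument is anything 
that responds to find(xpath).

```ruby
# Hypothetical helper, not part of libxml-ruby.
# node     - any object that responds to find(xpath)
# xpaths   - list of XPath expressions to evaluate
# gc_every - assumed tuning knob: force a collection every N queries
def find_in_batches(node, xpaths, gc_every: 100)
  xpaths.each_with_index do |xpath, index|
    nodes = node.find(xpath)
    yield xpath, nodes
    # Drop the reference to the result before any collection runs.
    nodes = nil
    GC.start if ((index + 1) % gc_every).zero?
  end
end
```

The block should copy out whatever it needs and not hold on to the result 
object itself, otherwise setting nodes to nil has no effect.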

>  From what I can tell calling Node#find on such a large document is 
> causing Ruby to add extra object heaps which increases my memory usage 
> in a way that the program does not recover from.  This is unfortunate 
> since I want to run multiple processes per box but each process is using 
> several hundred megabytes of RAM after parsing a few large documents.

Well, the bindings generally only wrap an object when you access it.  So 
in theory, calling nodes = document.find should only add one Ruby object 
(the result object).  The code used to wrap every returned object, but 
I'm pretty sure I changed it.  To verify, the code is in the 
xpath_object class.

Now if you then iterate over each returned node in the result, they will 
of course get wrapped (i.e., a Ruby object is created for each libxml node).
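
To illustrate the idea of wrapping on access, here is a small pure-Ruby 
model.  It is only an illustration: WrappedNode and LazyResult are 
hypothetical classes and do not mirror the actual C code in the bindings. 
The array passed to LazyResult stands in for the underlying libxml node set.

```ruby
# Illustrative model only; these classes are hypothetical.
class WrappedNode
  @created = 0
  class << self
    attr_accessor :created   # counts how many wrappers were allocated
  end

  attr_reader :raw

  def initialize(raw)
    @raw = raw
    self.class.created += 1
  end
end

class LazyResult
  include Enumerable

  def initialize(raw_nodes)
    @raw_nodes = raw_nodes   # stands in for the libxml node set
  end

  def size
    @raw_nodes.size
  end

  # A Ruby wrapper is only allocated when a node is actually visited.
  def each
    return enum_for(:each) unless block_given?
    @raw_nodes.each { |raw| yield WrappedNode.new(raw) }
  end
end

result = LazyResult.new(%w[a b c])
puts WrappedNode.created          # no wrappers allocated by the query itself
result.each { |node| node.raw }
puts WrappedNode.created          # one wrapper per visited node
```

In this model the query costs one Ruby object, and the per-node cost is 
only paid by code that iterates over the result.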

> The SAX parser with empty callbacks can rip through the document in 
> about 17ms which is very fast in my opinion.  The speed problem arises 
> when I try to do anything in the callbacks.  The nature of the program 
> and the structure of the XML requires me to do quite a few lookups in a 
> series of hashes to determine the type of the current node and the type 
> of each text element.  When SAX parsing I have to hit the hashes more 
> often since I don't have as much context information available as I do 
> with a recursive depth first document walk with the document parser node 
> objects.  With the necessary code in the callbacks I was seeing parse 
> times around 400ms which is about twice as slow as the document based 
> approach.

Oh, I see.  So it's all in the lookups.
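
One possible way to cut down on repeated lookups in SAX callbacks is to 
keep a stack of the open elements' types, so the hash is consulted once per 
element and the text callback reuses the result.  This is a sketch under 
assumptions: TypedHandler, the type_for table, and the conversions are all 
hypothetical, and the callback names follow the style of libxml-ruby's SAX 
callbacks, so check them against the installed version.

```ruby
# Hypothetical handler; not taken from the program under discussion.
class TypedHandler
  attr_reader :values

  # type_for - assumed lookup table mapping element name to a type symbol
  def initialize(type_for)
    @type_for = type_for
    @stack = []    # types of the currently open elements
    @values = []
  end

  def on_start_element(name, _attributes)
    # One hash lookup per element; text callbacks reuse the result.
    @stack.push(@type_for.fetch(name, :unknown))
  end

  def on_characters(chars)
    type = @stack.last
    return if type.nil? || chars.strip.empty?
    @values << convert(type, chars)
  end

  def on_end_element(_name)
    @stack.pop
  end

  private

  def convert(type, chars)
    case type
    when :integer then Integer(chars.strip)
    when :float   then Float(chars.strip)
    else chars
    end
  end
end
```

Whether this helps depends on where the 400ms actually goes, so it would 
be worth profiling before and after.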

> XMLReader looks very interesting from the API docs but I am not sure 
> that I grok how to actually use it.  I will keep searching for resources 
> but if you know of any examples of usage out there I would love to read 
> some code.

I think there are a couple of tests (libxml/test) that might help a bit. 
Can't say I'm super familiar with that code either.  But perhaps look for 
Python or .NET examples (libxml supposedly copied the API from .NET, 
based on what I've read on the libxml site).

Charlie