[libxml-devel] memory consumption when finding inside of large document never goes away
Matthew Margolis
matt at mattmargolis.net
Sat Aug 16 17:16:45 EDT 2008
Charlie,
I am running on OSX and RedHat. I am using the Node#find method with an
XPath expression for the currently desired node in the default namespace of
the document. The crashes stopped happening when I set my nodes variable to
nil before calling GC.start. The memory does not spike too much if I call
GC.start after every single Node#find, but since parsing a single document
into the required number of Ruby objects necessitates calling Node#find
over a thousand times, GC.start is really slowing things down.
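A rough sketch of a middle ground between collecting after every lookup
and never collecting: run GC.start only once every N lookups, and have
each lookup hand back plain Ruby values rather than node references.
ThrottledGC and its interval are hypothetical names, not part of
libxml-ruby, and the array inside the block is only a stand-in for a
real Node#find result.

```ruby
# Hypothetical helper (not part of libxml-ruby): runs a full GC only
# once every `interval` lookups instead of after every single one.
class ThrottledGC
  attr_reader :lookups, :collections

  def initialize(interval)
    @interval = interval
    @lookups = 0
    @collections = 0
  end

  # Runs the caller's lookup block and returns whatever it extracted.
  def lookup
    result = yield
    @lookups += 1
    if (@lookups % @interval).zero?
      GC.start
      @collections += 1
    end
    result
  end
end

throttle = ThrottledGC.new(100)
250.times do |i|
  throttle.lookup do
    nodes = ["node #{i}"]      # stand-in for node.find(xpath)
    values = nodes.map(&:dup)  # copy out plain Ruby values
    nodes = nil                # drop the reference before any GC.start
    values
  end
end
```

Whether an interval of 100 keeps memory flat enough is something that
would have to be measured against the real documents.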
From what I can tell, calling Node#find on such a large document is causing
Ruby to add extra object heaps which increases my memory usage in a way that
the program does not recover from. This is unfortunate since I want to run
multiple processes per box but each process is using several hundred
megabytes of RAM after parsing a few large documents.
The SAX parser with empty callbacks can rip through the document in about
17ms which is very fast in my opinion. The speed problem arises when I try
to do anything in the callbacks. The nature of the program and the
structure of the XML require me to do quite a few lookups in a series of
hashes to determine the type of the current node and the type of each text
element. When SAX parsing I have to hit the hashes more often since I don't
have as much context information available as I do with a recursive depth
first document walk with the document parser node objects. With the
necessary code in the callbacks I was seeing parse times around 400ms which
is about twice as slow as the document based approach.
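A minimal sketch of one way a SAX handler could carry its own context:
keep a stack of element types so that each element costs one hash
lookup when it opens, and the text callbacks reuse the type on top of
the stack. The type table and class name are hypothetical. The callback
names are assumed to mirror those in libxml-ruby's
XML::SaxParser::Callbacks, and the handler is driven by hand here so
that the sketch runs without the library.

```ruby
# Hypothetical type table; the real program's hashes are not shown here.
ELEMENT_TYPES = {
  "order" => :record,
  "item"  => :record,
  "sku"   => :string,
  "qty"   => :integer
}.freeze

class ContextTrackingHandler
  attr_reader :values

  def initialize(types)
    @types = types
    @stack = []   # types of the currently open elements, innermost last
    @values = []
  end

  # One hash lookup per element, done when the element opens.
  def on_start_element(name, _attributes)
    @stack.push(@types[name])
  end

  # Text callbacks reuse the type on top of the stack; no lookup here.
  def on_characters(chars)
    text = chars.strip
    return if text.empty?

    case @stack.last
    when :integer then @values << Integer(text)
    when :string  then @values << text
    end
  end

  def on_end_element(_name)
    @stack.pop
  end
end

# Drive the callbacks by hand, in the order a SAX parser would for
# <order><item><sku>A-17</sku><qty>3</qty></item></order>.
handler = ContextTrackingHandler.new(ELEMENT_TYPES)
handler.on_start_element("order", {})
handler.on_start_element("item", {})
handler.on_start_element("sku", {})
handler.on_characters("A-17")
handler.on_end_element("sku")
handler.on_start_element("qty", {})
handler.on_characters("3")
handler.on_end_element("qty")
handler.on_end_element("item")
handler.on_end_element("order")
```

A real SAX parser may deliver one text node in several chunks, so a
production handler would buffer characters until the end-element
callback instead of converting each chunk on arrival.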
XMLReader looks very interesting from the API docs but I am not sure that I
grok how to actually use it. I will keep searching for resources but if you
know of any examples of usage out there I would love to read some code.
Thank you,
Matt Margolis
2008/8/16 Charlie Savage <cfis at savagexi.com>
> Hi Matt,
>
>> I am making the parsed ruby objects available to a Rails application and I
>> find that if I call GC.start when using the library with Rails that it takes
>> several seconds to garbage collect and sometimes crashes. If I call
>> GC.start in the loop when the program is running as a standalone process
>> then GC.start returns in a few dozen milliseconds.
>>
>
> What platform are you using? Can you run a debug version and get a stack
> trace so we can see what is going on? Are you using XPath? If so, make
> sure to free pointers to your XPath result objects and call GC.start before
> the associated documents get freed (see the rdocs for more info,
> document#find I think it is).
>
>> I wrote a SAX style parser using libxml-ruby that does not suffer from the
>> memory growth but it is about 30 times slower than the document based parser
>> so I am really trying to make the document based approach work.
>>
>
> Why do you suppose SAX is so much slower? It should be a lot faster since
> it doesn't build an in-memory tree.
>
> Any chance the XMLReader would work for you?
>
> Charlie
>
> _______________________________________________
> libxml-devel mailing list
> libxml-devel at rubyforge.org
> http://rubyforge.org/mailman/listinfo/libxml-devel
>