From rubikitch at ruby-lang.org Thu Jan 1 12:49:20 2009
From: rubikitch at ruby-lang.org (rubikitch at ruby-lang.org)
Date: Fri, 02 Jan 2009 02:49:20 +0900 (JST)
Subject: [Nokogiri-talk] Assertion failed
Message-ID: <20090102.024920.19010451.rubikitch@ruby-lang.org>

Hi,

Currently I'm developing AutoPagerize for w3m. It uses Nokogiri on Ruby 1.9.

This script causes an assertion failure and exits abnormally on both Ruby 1.8 and Ruby 1.9.

====
require 'nokogiri'
require 'open-uri'
require 'kconv'

url = "http://www.yahoo-search.jp/?id=300069&kw=a"
xpath = %{id("resultList")//li}
nokogiri = Nokogiri::HTML.parse(open(url).read.toutf8, nil, 'UTF-8')
tree = nokogiri.xpath(xpath)
====

====
$ ruby nokogiri-fatal.rb
ruby: xml_xpath_context.c:49: evaluate: Assertion `ctx->node' failed.
====

-- 
rubikitch
Blog: http://d.hatena.ne.jp/rubikitch/
Site: http://www.rubyist.net/~rubikitch/

From aaron.patterson at gmail.com Fri Jan 2 01:53:41 2009
From: aaron.patterson at gmail.com (Aaron Patterson)
Date: Thu, 1 Jan 2009 22:53:41 -0800
Subject: [Nokogiri-talk] Assertion failed
In-Reply-To: <20090102.024920.19010451.rubikitch@ruby-lang.org>
References: <20090102.024920.19010451.rubikitch@ruby-lang.org>
Message-ID: <6959e1680901012253p7f816645n5fa064bb79d314e2@mail.gmail.com>

On Thu, Jan 1, 2009 at 9:49 AM, wrote:
> Hi,
>
> Currently I'm developing AutoPagerize for w3m. It uses Nokogiri on Ruby 1.9.
>
> This script causes an assertion failure and exits abnormally on both Ruby 1.8 and Ruby 1.9.
>
> ====
> require 'nokogiri'
> require 'open-uri'
> require 'kconv'
>
> url = "http://www.yahoo-search.jp/?id=300069&kw=a"
> xpath = %{id("resultList")//li}
> nokogiri = Nokogiri::HTML.parse(open(url).read.toutf8, nil, 'UTF-8')
> tree = nokogiri.xpath(xpath)
> ====
>
> ====
> $ ruby nokogiri-fatal.rb
> ruby: xml_xpath_context.c:49: evaluate: Assertion `ctx->node' failed.
> ====

Interesting. What version of libxml are you using?
Nokogiri::LIBXML_VERSION should contain the libxml version.
-- 
Aaron Patterson
http://tenderlovemaking.com/

From rubikitch at ruby-lang.org Fri Jan 2 02:21:33 2009
From: rubikitch at ruby-lang.org (rubikitch at ruby-lang.org)
Date: Fri, 02 Jan 2009 16:21:33 +0900 (JST)
Subject: [Nokogiri-talk] Assertion failed
In-Reply-To: <6959e1680901012253p7f816645n5fa064bb79d314e2@mail.gmail.com>
References: <20090102.024920.19010451.rubikitch@ruby-lang.org> <6959e1680901012253p7f816645n5fa064bb79d314e2@mail.gmail.com>
Message-ID: <20090102.162133.31710204.rubikitch@ruby-lang.org>

From: "Aaron Patterson"
Subject: Re: [Nokogiri-talk] Assertion failed
Date: Thu, 1 Jan 2009 22:53:41 -0800

> Interesting. What version of libxml are you using?
> Nokogiri::LIBXML_VERSION should contain the libxml version.

$ ruby19 -rnokogiri -e 'p Nokogiri::LIBXML_VERSION'
"2.6.32"

-- 
rubikitch
Blog: http://d.hatena.ne.jp/rubikitch/
Site: http://www.rubyist.net/~rubikitch/

From aaron.patterson at gmail.com Fri Jan 2 02:28:06 2009
From: aaron.patterson at gmail.com (Aaron Patterson)
Date: Thu, 1 Jan 2009 23:28:06 -0800
Subject: [Nokogiri-talk] Assertion failed
In-Reply-To: <20090102.162133.31710204.rubikitch@ruby-lang.org>
References: <20090102.024920.19010451.rubikitch@ruby-lang.org> <6959e1680901012253p7f816645n5fa064bb79d314e2@mail.gmail.com> <20090102.162133.31710204.rubikitch@ruby-lang.org>
Message-ID: <6959e1680901012328o38595e76s1ef340d7437dcbb2@mail.gmail.com>

On Thu, Jan 1, 2009 at 11:21 PM, wrote:
> From: "Aaron Patterson"
> Subject: Re: [Nokogiri-talk] Assertion failed
> Date: Thu, 1 Jan 2009 22:53:41 -0800
>
>> Interesting. What version of libxml are you using?
>> Nokogiri::LIBXML_VERSION should contain the libxml version.
>
> $ ruby19 -rnokogiri -e 'p Nokogiri::LIBXML_VERSION'
> "2.6.32"

Thanks. I tried this out. Your xpath is invalid, but you should not get that error regardless.
I've filed a ticket here:

http://nokogiri.lighthouseapp.com/projects/19607-nokogiri/tickets/25-bad-xpath-causes-an-assertion-error

I will fix the problem as soon as possible. Thank you for reporting it!

-- 
Aaron Patterson
http://tenderlovemaking.com/

From mike.dalessio at gmail.com Fri Jan 2 16:36:03 2009
From: mike.dalessio at gmail.com (Mike Dalessio)
Date: Fri, 2 Jan 2009 16:36:03 -0500
Subject: [Nokogiri-talk] Assertion failed
In-Reply-To: <20090102.162133.31710204.rubikitch@ruby-lang.org>
References: <20090102.024920.19010451.rubikitch@ruby-lang.org> <6959e1680901012253p7f816645n5fa064bb79d314e2@mail.gmail.com> <20090102.162133.31710204.rubikitch@ruby-lang.org>
Message-ID: <618c07250901021336s74b936c1jcd003c0c0acf2082@mail.gmail.com>

Hi,

This issue is now fixed in the github master branch. You can take a look at the commit for details:

http://github.com/tenderlove/nokogiri/commit/764347e6dc34b425da1168807fe0aef431871304

This will be included in Nokogiri 1.1.1.

-mike

On Fri, Jan 2, 2009 at 2:21 AM, wrote:
> From: "Aaron Patterson"
> Subject: Re: [Nokogiri-talk] Assertion failed
> Date: Thu, 1 Jan 2009 22:53:41 -0800
>
> > Interesting. What version of libxml are you using?
> > Nokogiri::LIBXML_VERSION should contain the libxml version.
>
> $ ruby19 -rnokogiri -e 'p Nokogiri::LIBXML_VERSION'
> "2.6.32"
>
> --
> rubikitch
> Blog: http://d.hatena.ne.jp/rubikitch/
> Site: http://www.rubyist.net/~rubikitch/
> _______________________________________________
> Nokogiri-talk mailing list
> Nokogiri-talk at rubyforge.org
> http://rubyforge.org/mailman/listinfo/nokogiri-talk
>

-- 
mike dalessio
mike at csa.net
-------------- next part --------------
An HTML attachment was scrubbed...
URL: From rubikitch at ruby-lang.org Sun Jan 4 15:59:35 2009 From: rubikitch at ruby-lang.org (rubikitch at ruby-lang.org) Date: Mon, 05 Jan 2009 05:59:35 +0900 (JST) Subject: [Nokogiri-talk] Compatibility problem Message-ID: <20090105.055935.79177563.rubikitch@ruby-lang.org> Hi, Nokogiri is incompatible with Hpricot in some points. 1. Nokogiri::XML::Node#/ does not accept Symbol require 'hpricot' require 'nokogiri' def test1(klass) klass.parse("ok").at(:title).inner_html end test1 Hpricot # => "ok" test1 Nokogiri # => "ok" def test2(klass) klass.parse("ok")/:title end test2 Hpricot # => # "ok" }]> test2 Nokogiri # => # ~> /pkgs/ruby18/lib/ruby/site_ruby/1.8/nokogiri/css/generated_tokenizer.rb:48:in `initialize': can't convert Symbol into String (TypeError) # ~> from /pkgs/ruby18/lib/ruby/site_ruby/1.8/nokogiri/css/generated_tokenizer.rb:48:in `new' # ~> from /pkgs/ruby18/lib/ruby/site_ruby/1.8/nokogiri/css/generated_tokenizer.rb:48:in `scan_evaluate' # ~> from /pkgs/ruby18/lib/ruby/site_ruby/1.8/nokogiri/css/generated_tokenizer.rb:24:in `parse' # ~> from /pkgs/ruby18/lib/ruby/site_ruby/1.8/nokogiri/css.rb:13:in `parse' # ~> from /pkgs/ruby18/lib/ruby/site_ruby/1.8/nokogiri/xml/node.rb:48:in `/' # ~> from /pkgs/ruby18/lib/ruby/site_ruby/1.8/nokogiri/xml/node.rb:47:in `map' # ~> from /pkgs/ruby18/lib/ruby/site_ruby/1.8/nokogiri/xml/node.rb:47:in `/' # ~> from -:10:in `test2' # ~> from -:13 2. Nokogiri expects UTF-8 by default While Hpricot accepts any encodings, Nokogiri expects UTF-8. So parsing non-UTF-8 HTML may fail without optional third argument. Because WWW::Mechanize 0.9.0 uses Nokogiri as default parser, this spec may cause serious compatibility problem. 
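One way to sidestep the default is to convert the document to UTF-8 before handing it to the parser, much as the earlier script in this archive did with kconv's toutf8. A minimal sketch using String#encode, assuming Ruby 1.9 or later and assuming the source encoding is already known (for example from an HTTP header). The helper name to_utf8 and the sample markup are hypothetical.

```ruby
# Hypothetical helper: label raw bytes with their real encoding, then
# transcode them to UTF-8 so the parser never has to guess.
def to_utf8(raw_bytes, source_encoding)
  raw_bytes.dup.force_encoding(source_encoding).encode('UTF-8')
end

# Simulate a page fetched as raw EUC-JP bytes (escapes keep this file ASCII).
title = "\u65E5\u672C\u8A9E"
raw_html = "<h1>#{title}</h1>".encode('EUC-JP').b

utf8_html = to_utf8(raw_html, 'EUC-JP')
puts utf8_html.encoding          # UTF-8
puts utf8_html.include?(title)   # true
```

The converted string can then be passed to the parser with 'UTF-8' given explicitly as the encoding argument, so no guessing is involved on either side.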
via http://d.hatena.ne.jp/kitamomonga/20081229/ruby_difference_nokogiri_hpricot -- rubikitch Blog: http://d.hatena.ne.jp/rubikitch/ Site: http://www.rubyist.net/~rubikitch/ From tony at medioh.com Mon Jan 5 17:50:52 2009 From: tony at medioh.com (Tony Arcieri) Date: Mon, 5 Jan 2009 15:50:52 -0700 Subject: [Nokogiri-talk] Inspecting the namespace of a Nokogiri::XML::Node Message-ID: I have elements which are in an XML namespace, e.g. ... The resulting Nokogiri::XML::Node for these elements returns "bar" for the #name. Is there any way to inspect the namespace? Looking at the source I can't even figure out where Nokogiri::XML::Node#name or #path are defined. -- Tony Arcieri medioh.com -------------- next part -------------- An HTML attachment was scrubbed... URL: From mike.dalessio at gmail.com Mon Jan 5 18:44:11 2009 From: mike.dalessio at gmail.com (Mike Dalessio) Date: Mon, 5 Jan 2009 18:44:11 -0500 Subject: [Nokogiri-talk] Inspecting the namespace of a Nokogiri::XML::Node In-Reply-To: References: Message-ID: <618c07250901051544o74916cbj36cf0f7d64693e1a@mail.gmail.com> Current github master has XML::Node.namespace() which will return the namespace prefix if it exists. This should be in 1.1.1 when it comes out. On Mon, Jan 5, 2009 at 5:50 PM, Tony Arcieri wrote: > I have elements which are in an XML namespace, e.g. > > ... > > The resulting Nokogiri::XML::Node for these elements returns "bar" for the > #name. Is there any way to inspect the namespace? > > Looking at the source I can't even figure out where > Nokogiri::XML::Node#name or #path are defined. > > -- > Tony Arcieri > medioh.com > > _______________________________________________ > Nokogiri-talk mailing list > Nokogiri-talk at rubyforge.org > http://rubyforge.org/mailman/listinfo/nokogiri-talk > > -- mike dalessio mike at csa.net -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From mike.dalessio at gmail.com Mon Jan 5 21:02:09 2009 From: mike.dalessio at gmail.com (Mike Dalessio) Date: Mon, 5 Jan 2009 21:02:09 -0500 Subject: [Nokogiri-talk] Compatibility problem In-Reply-To: <20090105.055935.79177563.rubikitch@ruby-lang.org> References: <20090105.055935.79177563.rubikitch@ruby-lang.org> Message-ID: <618c07250901051802q6f179aa6sfd3643bef311ce3a@mail.gmail.com> Hi, Thanks for bringing these issues to our attention! I've just committed a fix to github master which addresses the symbol-parameter issue: http://github.com/tenderlove/nokogiri/commit/45b39702375d770a629a0ae97e2103e7de6facfa However, I'm not sure I understand the second issue. Can you submit a failing spec or test that might help me understand? Thanks, -mike On Sun, Jan 4, 2009 at 3:59 PM, wrote: > Hi, > > Nokogiri is incompatible with Hpricot in some points. > > 1. Nokogiri::XML::Node#/ does not accept Symbol > > require 'hpricot' > require 'nokogiri' > > def test1(klass) > klass.parse("ok").at(:title).inner_html > end > test1 Hpricot # => "ok" > test1 Nokogiri # => "ok" > def test2(klass) > klass.parse("ok")/:title > end > test2 Hpricot # => # "ok" }]> > test2 Nokogiri # => > # ~> > /pkgs/ruby18/lib/ruby/site_ruby/1.8/nokogiri/css/generated_tokenizer.rb:48:in > `initialize': can't convert Symbol into String (TypeError) > # ~> from > /pkgs/ruby18/lib/ruby/site_ruby/1.8/nokogiri/css/generated_tokenizer.rb:48:in > `new' > # ~> from > /pkgs/ruby18/lib/ruby/site_ruby/1.8/nokogiri/css/generated_tokenizer.rb:48:in > `scan_evaluate' > # ~> from > /pkgs/ruby18/lib/ruby/site_ruby/1.8/nokogiri/css/generated_tokenizer.rb:24:in > `parse' > # ~> from /pkgs/ruby18/lib/ruby/site_ruby/1.8/nokogiri/css.rb:13:in > `parse' > # ~> from /pkgs/ruby18/lib/ruby/site_ruby/1.8/nokogiri/xml/node.rb:48:in > `/' > # ~> from /pkgs/ruby18/lib/ruby/site_ruby/1.8/nokogiri/xml/node.rb:47:in > `map' > # ~> from /pkgs/ruby18/lib/ruby/site_ruby/1.8/nokogiri/xml/node.rb:47:in > `/' > # ~> from -:10:in `test2' > 
# ~> from -:13 > > > 2. Nokogiri expects UTF-8 by default > > While Hpricot accepts any encodings, Nokogiri expects UTF-8. > So parsing non-UTF-8 HTML may fail without optional third argument. > Because WWW::Mechanize 0.9.0 uses Nokogiri as default parser, > this spec may cause serious compatibility problem. > > via > http://d.hatena.ne.jp/kitamomonga/20081229/ruby_difference_nokogiri_hpricot > > -- > rubikitch > Blog: http://d.hatena.ne.jp/rubikitch/ > Site: http://www.rubyist.net/~rubikitch/ > _______________________________________________ > Nokogiri-talk mailing list > Nokogiri-talk at rubyforge.org > http://rubyforge.org/mailman/listinfo/nokogiri-talk > -- mike dalessio mike at csa.net -------------- next part -------------- An HTML attachment was scrubbed... URL: From rubikitch at ruby-lang.org Tue Jan 6 04:25:04 2009 From: rubikitch at ruby-lang.org (rubikitch at ruby-lang.org) Date: Tue, 06 Jan 2009 18:25:04 +0900 (JST) Subject: [Nokogiri-talk] Compatibility problem In-Reply-To: <618c07250901051802q6f179aa6sfd3643bef311ce3a@mail.gmail.com> References: <20090105.055935.79177563.rubikitch@ruby-lang.org> <618c07250901051802q6f179aa6sfd3643bef311ce3a@mail.gmail.com> Message-ID: <20090106.182504.78301548.rubikitch@ruby-lang.org> From: "Mike Dalessio" Subject: Re: [Nokogiri-talk] Compatibility problem Date: Mon, 5 Jan 2009 21:02:09 -0500 > However, I'm not sure I understand the second issue. Can you submit a > failing spec or test that might help me understand? > > 2. Nokogiri expects UTF-8 by default > > > > While Hpricot accepts any encodings, Nokogiri expects UTF-8. > > So parsing non-UTF-8 HTML may fail without optional third argument. > > Because WWW::Mechanize 0.9.0 uses Nokogiri as default parser, > > this spec may cause serious compatibility problem. via http://d.hatena.ne.jp/kitamomonga/20081226/ruby_nokogiri_is_only_for_utf8 require 'kconv' require 'hpricot' require 'nokogiri' $KCODE='e' # HTML includes EUC-JP string. h1_euc = "?????????" 
h1_utf8 = h1_euc.toutf8 euc_html = "

#{h1_euc}

" # Hpricot parses EUC-JP html correctly. doc = Hpricot.parse(euc_html) text = doc.search('h1').inner_text text == h1_euc # => true # Nokogiri parses EUC-JP html with 3rd argument correctly, # but libxml2 internally converts HTML into UTF-8. # So the result of inner_text is UTF-8 encoding. doc = Nokogiri.parse(euc_html, nil, 'EUC-JP') text = doc.search('h1').inner_text text == h1_utf8 # => true # Nokogiri parses EUC-JP html without 3rd argument WRONGLY! doc = Nokogiri.parse(euc_html, nil, nil) text = doc.search('h1').inner_text text == h1_utf8 # => false -- rubikitch Blog: http://d.hatena.ne.jp/rubikitch/ Site: http://www.rubyist.net/~rubikitch/ From rubikitch at ruby-lang.org Tue Jan 6 04:38:22 2009 From: rubikitch at ruby-lang.org (rubikitch at ruby-lang.org) Date: Tue, 06 Jan 2009 18:38:22 +0900 (JST) Subject: [Nokogiri-talk] superfluous s Message-ID: <20090106.183822.132097333.rubikitch@ruby-lang.org> Nokogiri emits superfluous s for HTML whose newline is \r\n. It seems to convert \r into . require 'nokogiri' require 'nkf' # convert \n into \r\n html = NKF.nkf("-e --msdos", <

test paragraph foo bar

EOH nokogiri = Nokogiri::HTML.parse(html) puts nokogiri.to_html nokogiri.at("p").to_html # => "

test paragraph \nfoo bar

" nokogiri.search("p").to_html # => "

test paragraph \nfoo bar

" # >> # >> # >>

test paragraph # >> foo bar

# >> -- rubikitch Blog: http://d.hatena.ne.jp/rubikitch/ Site: http://www.rubyist.net/~rubikitch/ From rubikitch at ruby-lang.org Tue Jan 6 04:55:17 2009 From: rubikitch at ruby-lang.org (rubikitch at ruby-lang.org) Date: Tue, 06 Jan 2009 18:55:17 +0900 (JST) Subject: [Nokogiri-talk] Nokogiri::HTML without DOCTYPE Message-ID: <20090106.185517.52936377.rubikitch@ruby-lang.org> Nokogiri::HTML treats HTML without DOCTYPE as HTML 4.0 Transitional. But it cannot handle id() function. require 'nokogiri' DOCTYPE = '' html = <
hoge
EOH Nokogiri::HTML.parse(html) # => Nokogiri::HTML.parse(html).xpath('id("hoge")').to_a # => [] Nokogiri::HTML.parse(DOCTYPE+html) # => Nokogiri::HTML.parse(DOCTYPE+html).xpath('id("hoge")').to_a # => [
hoge
] -- rubikitch Blog: http://d.hatena.ne.jp/rubikitch/ Site: http://www.rubyist.net/~rubikitch/ From mike.dalessio at gmail.com Tue Jan 6 08:08:31 2009 From: mike.dalessio at gmail.com (Mike Dalessio) Date: Tue, 6 Jan 2009 08:08:31 -0500 Subject: [Nokogiri-talk] Compatibility problem In-Reply-To: <20090106.182504.78301548.rubikitch@ruby-lang.org> References: <20090105.055935.79177563.rubikitch@ruby-lang.org> <618c07250901051802q6f179aa6sfd3643bef311ce3a@mail.gmail.com> <20090106.182504.78301548.rubikitch@ruby-lang.org> Message-ID: <618c07250901060508h60261e1y378626bb4f0887d8@mail.gmail.com> Interesting! When I run this program, I get true / false / true (not true / true / false as you've commented). Here are the versions I'm running: libxml: 2.6.31 ruby: 1.8.6 p111 nokogiri: 1.1.0 What versions are you running? 2009/1/6 > From: "Mike Dalessio" > Subject: Re: [Nokogiri-talk] Compatibility problem > Date: Mon, 5 Jan 2009 21:02:09 -0500 > > > However, I'm not sure I understand the second issue. Can you submit a > > failing spec or test that might help me understand? > > > 2. Nokogiri expects UTF-8 by default > > > > > > While Hpricot accepts any encodings, Nokogiri expects UTF-8. > > > So parsing non-UTF-8 HTML may fail without optional third argument. > > > Because WWW::Mechanize 0.9.0 uses Nokogiri as default parser, > > > this spec may cause serious compatibility problem. > > via > http://d.hatena.ne.jp/kitamomonga/20081226/ruby_nokogiri_is_only_for_utf8 > > require 'kconv' > require 'hpricot' > require 'nokogiri' > > $KCODE='e' > > # HTML includes EUC-JP string. > h1_euc = "?????????" > h1_utf8 = h1_euc.toutf8 > euc_html = "

#{h1_euc}

" > > # Hpricot parses EUC-JP html correctly. > doc = Hpricot.parse(euc_html) > text = doc.search('h1').inner_text > text == h1_euc # => true > > # Nokogiri parses EUC-JP html with 3rd argument correctly, > # but libxml2 internally converts HTML into UTF-8. > # So the result of inner_text is UTF-8 encoding. > doc = Nokogiri.parse(euc_html, nil, 'EUC-JP') > text = doc.search('h1').inner_text > text == h1_utf8 # => true > > # Nokogiri parses EUC-JP html without 3rd argument WRONGLY! > doc = Nokogiri.parse(euc_html, nil, nil) > text = doc.search('h1').inner_text > text == h1_utf8 # => false > > -- > rubikitch > Blog: http://d.hatena.ne.jp/rubikitch/ > Site: http://www.rubyist.net/~rubikitch/ > _______________________________________________ > Nokogiri-talk mailing list > Nokogiri-talk at rubyforge.org > http://rubyforge.org/mailman/listinfo/nokogiri-talk > -- mike dalessio mike at csa.net -------------- next part -------------- An HTML attachment was scrubbed... URL: From mike.dalessio at gmail.com Tue Jan 6 08:11:38 2009 From: mike.dalessio at gmail.com (Mike Dalessio) Date: Tue, 6 Jan 2009 08:11:38 -0500 Subject: [Nokogiri-talk] superfluous s In-Reply-To: <20090106.183822.132097333.rubikitch@ruby-lang.org> References: <20090106.183822.132097333.rubikitch@ruby-lang.org> Message-ID: <618c07250901060511i69d9478oba1c9a4f14e5ae82@mail.gmail.com> Hi, Again, this works fine for me using libxml version 2.6.31 and nokogiri 1.0.0. Are you running an older libxml version? -mike On Tue, Jan 6, 2009 at 4:38 AM, wrote: > Nokogiri emits superfluous s for HTML whose newline is \r\n. > It seems to convert \r into . > > require 'nokogiri' > require 'nkf' > > # convert \n into \r\n > html = NKF.nkf("-e --msdos", < >

test paragraph > foo bar

> > EOH > nokogiri = Nokogiri::HTML.parse(html) > puts nokogiri.to_html > nokogiri.at("p").to_html # => "

test paragraph \nfoo bar >

" > nokogiri.search("p").to_html # => "

test paragraph \nfoo bar >

" > # >> http://www.w3.org/TR/REC-html40/loose.dtd"> > # >> > # >>

test paragraph > # >> foo bar

> # >> > > -- > rubikitch > Blog: http://d.hatena.ne.jp/rubikitch/ > Site: http://www.rubyist.net/~rubikitch/ > _______________________________________________ > Nokogiri-talk mailing list > Nokogiri-talk at rubyforge.org > http://rubyforge.org/mailman/listinfo/nokogiri-talk > -- mike dalessio mike at csa.net -------------- next part -------------- An HTML attachment was scrubbed... URL: From aaron.patterson at gmail.com Tue Jan 6 11:07:16 2009 From: aaron.patterson at gmail.com (Aaron Patterson) Date: Tue, 6 Jan 2009 08:07:16 -0800 Subject: [Nokogiri-talk] superfluous s In-Reply-To: <618c07250901060511i69d9478oba1c9a4f14e5ae82@mail.gmail.com> References: <20090106.183822.132097333.rubikitch@ruby-lang.org> <618c07250901060511i69d9478oba1c9a4f14e5ae82@mail.gmail.com> Message-ID: <6959e1680901060807j1c63aa4bke255eab01dcbe900@mail.gmail.com> On Tue, Jan 6, 2009 at 5:11 AM, Mike Dalessio wrote: > Hi, > > Again, this works fine for me using libxml version 2.6.31 and nokogiri > 1.0.0. I'm able to reproduce this. Make sure you puts the last two lines. -- Aaron Patterson http://tenderlovemaking.com/ From aaron.patterson at gmail.com Tue Jan 6 11:14:39 2009 From: aaron.patterson at gmail.com (Aaron Patterson) Date: Tue, 6 Jan 2009 08:14:39 -0800 Subject: [Nokogiri-talk] Nokogiri::HTML without DOCTYPE In-Reply-To: <20090106.185517.52936377.rubikitch@ruby-lang.org> References: <20090106.185517.52936377.rubikitch@ruby-lang.org> Message-ID: <6959e1680901060814x34bd6602r51fac3a872769210@mail.gmail.com> On Tue, Jan 6, 2009 at 1:55 AM, wrote: > Nokogiri::HTML treats HTML without DOCTYPE as HTML 4.0 Transitional. > But it cannot handle id() function. > > require 'nokogiri' > > DOCTYPE = '' > html = < >
hoge
> > EOH > > Nokogiri::HTML.parse(html) # => > Nokogiri::HTML.parse(html).xpath('id("hoge")').to_a # => [] > Nokogiri::HTML.parse(DOCTYPE+html) # => > Nokogiri::HTML.parse(DOCTYPE+html).xpath('id("hoge")').to_a # => [
hoge
] Looks like a bug in libxml. I will report it to them. As a work around, I suggest this: Nokogiri::HTML.parse(html).xpath('//*[@id="hoge"]') -- Aaron Patterson http://tenderlovemaking.com/ From aaron.patterson at gmail.com Tue Jan 6 11:43:24 2009 From: aaron.patterson at gmail.com (Aaron Patterson) Date: Tue, 6 Jan 2009 08:43:24 -0800 Subject: [Nokogiri-talk] superfluous s In-Reply-To: <6959e1680901060807j1c63aa4bke255eab01dcbe900@mail.gmail.com> References: <20090106.183822.132097333.rubikitch@ruby-lang.org> <618c07250901060511i69d9478oba1c9a4f14e5ae82@mail.gmail.com> <6959e1680901060807j1c63aa4bke255eab01dcbe900@mail.gmail.com> Message-ID: <6959e1680901060843r41cd2782rc083b8155e30c94c@mail.gmail.com> On Tue, Jan 6, 2009 at 8:07 AM, Aaron Patterson wrote: > On Tue, Jan 6, 2009 at 5:11 AM, Mike Dalessio wrote: >> Hi, >> >> Again, this works fine for me using libxml version 2.6.31 and nokogiri >> 1.0.0. > > I'm able to reproduce this. Make sure you puts the last two lines. Fixed here: http://github.com/tenderlove/nokogiri/commit/8fe2b33b7983bfc2e438a12fce800d290454c92d -- Aaron Patterson http://tenderlovemaking.com/ From rubikitch at ruby-lang.org Wed Jan 7 11:00:42 2009 From: rubikitch at ruby-lang.org (rubikitch at ruby-lang.org) Date: Thu, 08 Jan 2009 01:00:42 +0900 (JST) Subject: [Nokogiri-talk] Compatibility problem In-Reply-To: <618c07250901060508h60261e1y378626bb4f0887d8@mail.gmail.com> References: <618c07250901051802q6f179aa6sfd3643bef311ce3a@mail.gmail.com> <20090106.182504.78301548.rubikitch@ruby-lang.org> <618c07250901060508h60261e1y378626bb4f0887d8@mail.gmail.com> Message-ID: <20090108.010042.238222913.rubikitch@ruby-lang.org> From: "Mike Dalessio" Subject: Re: [Nokogiri-talk] Compatibility problem Date: Tue, 6 Jan 2009 08:08:31 -0500 > Interesting! When I run this program, I get true / false / true (not true / > true / false as you've commented). 
> > Here are the versions I'm running: > > libxml: 2.6.31 > ruby: 1.8.6 p111 > nokogiri: 1.1.0 > > What versions are you running? I upgraded libxml2 now. Nokogiri::LIBXML_VERSION # => "2.6.32" Nokogiri::VERSION # => "1.1.0" RUBY_VERSION # => "1.8.7" RUBY_RELEASE_DATE # => "2008-12-29" I tried nokogiri with ruby 1.8.6p114, the result was same. -- rubikitch Blog: http://d.hatena.ne.jp/rubikitch/ Site: http://www.rubyist.net/~rubikitch/ From aaron.patterson at gmail.com Wed Jan 7 11:27:37 2009 From: aaron.patterson at gmail.com (Aaron Patterson) Date: Wed, 7 Jan 2009 08:27:37 -0800 Subject: [Nokogiri-talk] Compatibility problem In-Reply-To: <20090108.010042.238222913.rubikitch@ruby-lang.org> References: <618c07250901051802q6f179aa6sfd3643bef311ce3a@mail.gmail.com> <20090106.182504.78301548.rubikitch@ruby-lang.org> <618c07250901060508h60261e1y378626bb4f0887d8@mail.gmail.com> <20090108.010042.238222913.rubikitch@ruby-lang.org> Message-ID: <6959e1680901070827gfed8835q42d389c80df62f60@mail.gmail.com> On Wed, Jan 7, 2009 at 8:00 AM, wrote: > From: "Mike Dalessio" > Subject: Re: [Nokogiri-talk] Compatibility problem > Date: Tue, 6 Jan 2009 08:08:31 -0500 > >> Interesting! When I run this program, I get true / false / true (not true / >> true / false as you've commented). >> >> Here are the versions I'm running: >> >> libxml: 2.6.31 >> ruby: 1.8.6 p111 >> nokogiri: 1.1.0 >> >> What versions are you running? > > I upgraded libxml2 now. > > Nokogiri::LIBXML_VERSION # => "2.6.32" > Nokogiri::VERSION # => "1.1.0" > RUBY_VERSION # => "1.8.7" > RUBY_RELEASE_DATE # => "2008-12-29" > > I tried nokogiri with ruby 1.8.6p114, the result was same. rubikitch is probably using EUC-JP for his file encoding. I was able to make it break by forcing that encoding. 
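A quick way to see why the file encoding matters: the same text has different bytes in EUC-JP and in UTF-8, so a comparison only succeeds once both sides share an encoding. A small sketch, assuming Ruby 1.9 or later (where strings carry an encoding); the variable names and the sample text are illustrative only.

```ruby
utf8 = "\u65E5\u672C\u8A9E"       # three kanji as a UTF-8 string
euc  = utf8.encode('EUC-JP')      # same characters, EUC-JP bytes

puts utf8.bytesize                # 9
puts euc.bytesize                 # 6

# Comparing across encodings fails even though the characters match.
puts euc == utf8                  # false
puts euc.encode('UTF-8') == utf8  # true

# EUC-JP bytes mislabeled as UTF-8 are not even valid UTF-8.
mislabeled = euc.dup.force_encoding('UTF-8')
puts mislabeled.valid_encoding?   # false
```

Under Ruby 1.8 with $KCODE, as used in the script being discussed, strings are plain byte sequences, so the same mismatch shows up only as unequal bytes.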
This test should make it fail: http://gist.github.com/44322

-- 
Aaron Patterson
http://tenderlovemaking.com/

From voxxar at gmail.com Thu Jan 8 10:27:54 2009
From: voxxar at gmail.com (Patrik Sjöberg)
Date: Thu, 8 Jan 2009 16:27:54 +0100
Subject: [Nokogiri-talk] Extracting text with newlines instead of <br>
Message-ID:

Hi

I wanted to scrape a site and had a lot of trouble fetching plain text. I have googled for quite some time, but it seems no one out there has ever tried scraping text other than prettily formatted values in divs like <div id="key">value</div>

Using Node.text like (doc/"div").text fetches the text I want, but with <br> tags removed without being substituted with \n, so newlines are lost.
I solved the problem by first replacing all <br> nodes with text nodes containing "\n", but I'm wondering if there is some easier way or if this is the way to go?

code:

agent = WWW::Mechanize.new
page = agent.get('http://services.ltblekinge.se/webbis/')
page.search('#myRepBaby tr div div div').each do |tr|
  if tr.text =~ /Mamma/
    tr.search('br').each{ |br| br.replace(Nokogiri::XML::Text.new("\n", tr.document))}
    namediv = (tr/'div')
    name = namediv.text.strip
    namediv.remove
    text = Iconv.iconv("iso-8859-1", "utf-8", tr.text.strip)
    puts name
    puts text
    puts ""
  end
end

From aaron.patterson at gmail.com Thu Jan 8 11:47:10 2009
From: aaron.patterson at gmail.com (Aaron Patterson)
Date: Thu, 8 Jan 2009 08:47:10 -0800
Subject: [Nokogiri-talk] Extracting text with newlines instead of <br>
In-Reply-To:
References:
Message-ID: <6959e1680901080847r49c3e131h3c4f8ca35d9bd10f@mail.gmail.com>

On Thu, Jan 8, 2009 at 7:27 AM, Patrik Sjöberg wrote:
> Hi
>
> I wanted to scrape a site and had a lot of trouble fetching plain text. I
> have googled for quite some time, but it seems no one out there has ever tried
> scraping text other than prettily formatted values in divs like <div
> id="key">value</div>
>
> Using Node.text like (doc/"div").text fetches the text I want, but with
> <br> tags removed without being substituted with \n, so newlines are lost.
> I solved the problem by first replacing all <br> nodes with text nodes
> containing "\n", but I'm wondering if there is some easier way or if this is
> the way to go?

What you have done is probably the easiest way to go. The browser changes the <br>
tags to newlines, so if you want it formatted the same way the browser does, you'll have to do the replacement yourself. -- Aaron Patterson http://tenderlovemaking.com/ From smparkes at smparkes.net Thu Jan 8 22:03:20 2009 From: smparkes at smparkes.net (Steven Parkes) Date: Thu, 8 Jan 2009 19:03:20 -0800 Subject: [Nokogiri-talk] create an element with the same name as a method Message-ID: <000b01c9720b$f61b70d0$e2525270$@net> I'm trying to use builder to create my XML and I've run into something I can't figure out how to do. The issue is that I need to create an element with the same name as a method on my class. I can't figure out anyway to do this. Maybe add an element(string,&block) method, with, basically, the contents of the false if condition in method_missing? From quadfolius at gmail.com Fri Jan 9 10:45:56 2009 From: quadfolius at gmail.com (Kaid Wong) Date: Fri, 9 Jan 2009 23:45:56 +0800 Subject: [Nokogiri-talk] Checking for libxslt(non-root installation) failed Message-ID: <4ea3755c0901090745o4e1d35c3j701ca4bf472d1890@mail.gmail.com> Hi, I tried adding libxslt directories to HEAD_DIRS, but still failed the check. I don't have root privilege, so my libxslt installation sits in my home directory. ---------- Kaid -------------- next part -------------- An HTML attachment was scrubbed... URL: From voldor at gmail.com Fri Jan 9 12:11:46 2009 From: voldor at gmail.com (Antel) Date: Fri, 9 Jan 2009 17:11:46 +0000 Subject: [Nokogiri-talk] create an element with the same name as a method In-Reply-To: <000b01c9720b$f61b70d0$e2525270$@net> References: <000b01c9720b$f61b70d0$e2525270$@net> Message-ID: On Fri, Jan 9, 2009 at 3:03 AM, Steven Parkes wrote: > I'm trying to use builder to create my XML and I've run into something I > can't figure out how to do. > > The issue is that I need to create an element with the same name as a > method > on my class. I can't figure out anyway to do this. 
>
> Maybe add an element(string,&block) method, with, basically, the contents of
> the false if condition in method_missing?
>

I ran into the same issue:

processing_method "xpath"

The method processing_method can be anything in my script, and I think the same holds in yours. Here is my solution, based on code I found in a Ruby forum:

def method_missing(method, xpath)
  if self.inspect == "main"
    do_something_with_xpath
    # dynamically injecting a local variable into a local scope
    self.class.send(:attr_accessor, method)
    self.send("#{method}=", Doc::Results.get)
  else
    super
  end
end

title "//title"
puts title #> "The new website im building"

-------------- next part --------------
An HTML attachment was scrubbed...
URL: From mike.dalessio at gmail.com Tue Jan 13 08:20:09 2009 From: mike.dalessio at gmail.com (Mike Dalessio) Date: Tue, 13 Jan 2009 08:20:09 -0500 Subject: [Nokogiri-talk] Nokogiri doesn't produce the correct list as hpricot In-Reply-To: <4ba70b40901130035j2663e6b0tffe3e243b08e6cac@mail.gmail.com> References: <4ba70b40901130035j2663e6b0tffe3e243b08e6cac@mail.gmail.com> Message-ID: <618c07250901130520r61ff0818k170acfc25ee1ae08@mail.gmail.com> Hi, I noticed that, if you parse that page with Nokogiri and then output the parsed document, it is truncated halfway through book_1226's
tag. This might be libxml2 failing to parse the character set. Aaron, you know encoding better than I do. Any ideas? -mike On Tue, Jan 13, 2009 at 3:35 AM, Zhenjian YU wrote: > Hi, > > I use nokogiri to scrape some book items ('.//div[@class="book_list"]') > from a Chinese page http://read.dangdang.com/list_4 . > It seems that nokogiri doesn't handle this page very well. nokogiri only > got 16 book_list items, with the last item book_1226, which is wrong. > While hpricot can produce the correct 20 book_list items. > > Can anyone figure out what the problem is? > > Best Regards, > Castor > > -- > http://www.yobo.com > http://www.8sheng.com > > _______________________________________________ > Nokogiri-talk mailing list > Nokogiri-talk at rubyforge.org > http://rubyforge.org/mailman/listinfo/nokogiri-talk > > -- mike dalessio mike at csa.net -------------- next part -------------- An HTML attachment was scrubbed... URL: From zhenjian at gmail.com Tue Jan 13 09:20:38 2009 From: zhenjian at gmail.com (Zhenjian YU) Date: Tue, 13 Jan 2009 22:20:38 +0800 Subject: [Nokogiri-talk] Nokogiri doesn't produce the correct list as hpricot In-Reply-To: <618c07250901130520r61ff0818k170acfc25ee1ae08@mail.gmail.com> References: <4ba70b40901130035j2663e6b0tffe3e243b08e6cac@mail.gmail.com> <618c07250901130520r61ff0818k170acfc25ee1ae08@mail.gmail.com> Message-ID: <4ba70b40901130620j160f8856j789115afc13838d0@mail.gmail.com> The encoding of the web page is GB2312. Is there some encoding problem? Best Regards, Castor On Tue, Jan 13, 2009 at 9:20 PM, Mike Dalessio wrote: > Hi, > > I noticed that, if you parse that page with Nokogiri and then output the > parsed document, it is truncated halfway through book_1226's
class='autohr'> tag. This might be libxml2 failing to parse the character > set. > > Aaron, you know encoding better than I do. Any ideas? > > -mike > > On Tue, Jan 13, 2009 at 3:35 AM, Zhenjian YU wrote: > >> Hi, >> >> I use nokogiri to scrape some book items ('.//div[@class="book_list"]') >> from a Chinese page http://read.dangdang.com/list_4 . >> It seems that nokogiri doesn't handle this page very well. nokogiri only >> got 16 book_list items, with the last item book_1226, which is wrong. >> While hpricot can produce the correct 20 book_list items. >> >> Can anyone figure out what the problem is? >> >> Best Regards, >> Castor >> >> -- >> http://www.yobo.com >> http://www.8sheng.com >> >> _______________________________________________ >> Nokogiri-talk mailing list >> Nokogiri-talk at rubyforge.org >> http://rubyforge.org/mailman/listinfo/nokogiri-talk >> >> > > > -- > mike dalessio > mike at csa.net > > _______________________________________________ > Nokogiri-talk mailing list > Nokogiri-talk at rubyforge.org > http://rubyforge.org/mailman/listinfo/nokogiri-talk > > -- http://www.yobo.com http://www.8sheng.com -------------- next part -------------- An HTML attachment was scrubbed... URL: From zhenjian at gmail.com Tue Jan 13 09:58:17 2009 From: zhenjian at gmail.com (Zhenjian YU) Date: Tue, 13 Jan 2009 22:58:17 +0800 Subject: [Nokogiri-talk] Nokogiri doesn't produce the correct list as hpricot In-Reply-To: <4ba70b40901130620j160f8856j789115afc13838d0@mail.gmail.com> References: <4ba70b40901130035j2663e6b0tffe3e243b08e6cac@mail.gmail.com> <618c07250901130520r61ff0818k170acfc25ee1ae08@mail.gmail.com> <4ba70b40901130620j160f8856j789115afc13838d0@mail.gmail.com> Message-ID: <4ba70b40901130658l7a0052e1hd0ea94ab8f4daedd@mail.gmail.com> I tried to parse another page http://read.dangdang.com/content_571897 . The parsed document was truncated too. Best Regards, Castor On Tue, Jan 13, 2009 at 10:20 PM, Zhenjian YU wrote: > The encoding of the web page is GB2312. 
Is there some encoding problem? > > Best Regards, > Castor > > > On Tue, Jan 13, 2009 at 9:20 PM, Mike Dalessio wrote: > >> Hi, >> >> I noticed that, if you parse that page with Nokogiri and then output the >> parsed document, it is truncated halfway through book_1226's
> class='autohr'> tag. This might be libxml2 failing to parse the character >> set. >> >> Aaron, you know encoding better than I do. Any ideas? >> >> -mike >> >> On Tue, Jan 13, 2009 at 3:35 AM, Zhenjian YU wrote: >> >>> Hi, >>> >>> I use nokogiri to scrape some book items ('.//div[@class="book_list"]') >>> from a Chinese page http://read.dangdang.com/list_4 . >>> It seems that nokogiri doesn't handle this page very well. nokogiri only >>> got 16 book_list items, with the last item book_1226, which is wrong. >>> While hpricot can produce the correct 20 book_list items. >>> >>> Can anyone figure out what the problem is? >>> >>> Best Regards, >>> Castor >>> >>> -- >>> http://www.yobo.com >>> http://www.8sheng.com >>> >>> _______________________________________________ >>> Nokogiri-talk mailing list >>> Nokogiri-talk at rubyforge.org >>> http://rubyforge.org/mailman/listinfo/nokogiri-talk >>> >>> >> >> >> -- >> mike dalessio >> mike at csa.net >> >> _______________________________________________ >> Nokogiri-talk mailing list >> Nokogiri-talk at rubyforge.org >> http://rubyforge.org/mailman/listinfo/nokogiri-talk >> >> > > > -- > http://www.yobo.com > http://www.8sheng.com > -- http://www.yobo.com http://www.8sheng.com -------------- next part -------------- An HTML attachment was scrubbed... URL: From aaron.patterson at gmail.com Tue Jan 13 11:45:22 2009 From: aaron.patterson at gmail.com (Aaron Patterson) Date: Tue, 13 Jan 2009 08:45:22 -0800 Subject: [Nokogiri-talk] Nokogiri doesn't produce the correct list as hpricot In-Reply-To: <4ba70b40901130035j2663e6b0tffe3e243b08e6cac@mail.gmail.com> References: <4ba70b40901130035j2663e6b0tffe3e243b08e6cac@mail.gmail.com> Message-ID: <6959e1680901130845p7e71475k5ea50d345544f875@mail.gmail.com> On Tue, Jan 13, 2009 at 12:35 AM, Zhenjian YU wrote: > Hi, > > I use nokogiri to scrape some book items ('.//div[@class="book_list"]') from > a Chinese page http://read.dangdang.com/list_4 . 
> It seems that nokogiri doesn't handle this page very well. nokogiri only got
> 16 book_list items, with the last item book_1226, which is wrong.
> While hpricot can produce the correct 20 book_list items.
>
> Can anyone figure out what the problem is?

Specifying an encoding works:

  require 'rubygems'
  require 'nokogiri'
  require 'open-uri'

  doc = Nokogiri::HTML(open('http://read.dangdang.com/list_4'), nil, 'CP936')
  puts doc.xpath('//div[@class="book_list"]').length # => 20

Is there a good way to detect encoding in ruby? I'd like to make
nokogiri just deal with this stuff automatically.

--
Aaron Patterson
http://tenderlovemaking.com/

From rubikitch at ruby-lang.org Tue Jan 13 12:28:05 2009
From: rubikitch at ruby-lang.org (rubikitch at ruby-lang.org)
Date: Wed, 14 Jan 2009 02:28:05 +0900 (JST)
Subject: [Nokogiri-talk] Nokogiri doesn't produce the correct list as hpricot
In-Reply-To: <6959e1680901130845p7e71475k5ea50d345544f875@mail.gmail.com>
References: <4ba70b40901130035j2663e6b0tffe3e243b08e6cac@mail.gmail.com> <6959e1680901130845p7e71475k5ea50d345544f875@mail.gmail.com>
Message-ID: <20090114.022805.68093029.rubikitch@ruby-lang.org>

From: "Aaron Patterson"
Subject: Re: [Nokogiri-talk] Nokogiri doesn't produce the correct list as hpricot
Date: Tue, 13 Jan 2009 08:45:22 -0800

> Is there a good way to detect encoding in ruby? I'd like to make
> nokogiri just deal with this stuff automatically.

charset header?

  require 'rubygems'
  require 'nokogiri'
  require 'open-uri'

  open('http://read.dangdang.com/list_4') do |f|
    html = f.read
    f.charset # => "gb2312"
    doc = Nokogiri::HTML(html, nil, f.charset)
    doc.xpath('//div[@class="book_list"]').length # => 16
  end

But this result is incorrect. Bug?
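
One possible explanation for the difference between the two charset labels, sketched with the transcoder built into Ruby 1.9 and later rather than the libxml2/iconv path that Nokogiri actually goes through (so it only illustrates how the character sets relate, not Nokogiri's behaviour): GB2312 is a small set, while GBK (which CP936 closely follows) and GB18030 are supersets of it. A page labelled gb2312 that contains characters from the larger sets cannot be converted cleanly under the declared label. The sample character U+9F8D and the helper name `encodable?` are chosen here purely for illustration; they do not come from the page in question.

```ruby
# U+9F8D is a traditional Chinese character. GBK and GB18030 can
# represent it, but the older and smaller GB2312 set cannot.
# (Illustrative choice; not taken from the dangdang.com page.)
ch = "\u9F8D"

# Hypothetical helper: true if +str+ can be transcoded to +charset+.
def encodable?(str, charset)
  str.encode(charset)
  true
rescue Encoding::UndefinedConversionError
  false
end

results = %w[GB2312 GBK GB18030].map { |cs| [cs, encodable?(ch, cs)] }
results.each do |cs, ok|
  puts "#{cs}: #{ok ? 'ok' : 'cannot represent U+9F8D'}"
end
```

If the page really does contain such characters, declaring one of the larger sets when parsing would be the thing to try.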
-- rubikitch Blog: http://d.hatena.ne.jp/rubikitch/ Site: http://www.rubyist.net/~rubikitch/ From aaron.patterson at gmail.com Tue Jan 13 13:31:54 2009 From: aaron.patterson at gmail.com (Aaron Patterson) Date: Tue, 13 Jan 2009 10:31:54 -0800 Subject: [Nokogiri-talk] Nokogiri doesn't produce the correct list as hpricot In-Reply-To: <20090114.022805.68093029.rubikitch@ruby-lang.org> References: <4ba70b40901130035j2663e6b0tffe3e243b08e6cac@mail.gmail.com> <6959e1680901130845p7e71475k5ea50d345544f875@mail.gmail.com> <20090114.022805.68093029.rubikitch@ruby-lang.org> Message-ID: <6959e1680901131031i344206a8yc5583496c367f5ed@mail.gmail.com> On Tue, Jan 13, 2009 at 9:28 AM, wrote: > From: "Aaron Patterson" > Subject: Re: [Nokogiri-talk] Nokogiri doesn't produce the correct list as hpricot > Date: Tue, 13 Jan 2009 08:45:22 -0800 > >> Is there a good way to detect encoding in ruby? I'd like to make >> nokogiri just deal with this stuff automatically. > > charset header? > > require 'rubygems' > require 'nokogiri' > require 'open-uri' > > open('http://read.dangdang.com/list_4') do |f| > html = f.read > f.charset # => "gb2312" > doc = Nokogiri::HTML(html, nil, f.charset) > doc.xpath('//div[@class="book_list"]').length # => 16 > end > > But this result is incorrect. Bug? I'm not sure. It works with 'CP936' as the encoding. Perhaps they are synonymous? -- Aaron Patterson http://tenderlovemaking.com/ From andywatts at angrylapdog.com Tue Jan 13 14:45:38 2009 From: andywatts at angrylapdog.com (Andrew Watts-Curnow) Date: Tue, 13 Jan 2009 11:45:38 -0800 Subject: [Nokogiri-talk] XPath expressions other than location paths Message-ID: <6cf2f09c0901131145w3f618552s10a0d520c23d2632@mail.gmail.com> Hi, I'm curious about support for XPath expressions that return booleans, string or numbers. Here's a couple of quotes from the specs. > The primary syntactic construct in XPath is the expression. An expression > matches the production Expr . 
An > expression is evaluated to yield an object, which has one of the following > four basic types: > > - node-set (an unordered collection of nodes without duplicates) > - boolean (true or false) > - number (a floating-point number) > - string (a sequence of UCS characters) > > One important kind of expression is a location path. .... The result of > evaluating an expression that is a location path is the node-set containing > the nodes selected by the location path. Ultimately, I'm hoping to use XPaths like the following. count(//book) => 2 contains(//gems, 'nokogiri') => True substring-after(//h1, 'Nokogiri is') => 'subarashi' Firefox's document.evaluate handles these, but maybe that's the DOM XPath spec. Does libxml support expressions other than location paths? Would this make sense as an enhancement to nokogiri? Thanks Andy -------------- next part -------------- An HTML attachment was scrubbed... URL: From aaron.patterson at gmail.com Tue Jan 13 14:53:56 2009 From: aaron.patterson at gmail.com (Aaron Patterson) Date: Tue, 13 Jan 2009 11:53:56 -0800 Subject: [Nokogiri-talk] XPath expressions other than location paths In-Reply-To: <6cf2f09c0901131145w3f618552s10a0d520c23d2632@mail.gmail.com> References: <6cf2f09c0901131145w3f618552s10a0d520c23d2632@mail.gmail.com> Message-ID: <6959e1680901131153x230f701lca213fec698321a9@mail.gmail.com> On Tue, Jan 13, 2009 at 11:45 AM, Andrew Watts-Curnow wrote: > Hi, > > I'm curious about support for XPath expressions that return booleans, string > or numbers. > > Here's a couple of quotes from the specs. >> >> The primary syntactic construct in XPath is the expression. An expression >> matches the production Expr. 
An expression is evaluated to yield an object, >> which has one of the following four basic types: >> >> node-set (an unordered collection of nodes without duplicates) >> boolean (true or false) >> number (a floating-point number) >> string (a sequence of UCS characters) >> >> One important kind of expression is a location path. .... The result of >> evaluating an expression that is a location path is the node-set containing >> the nodes selected by the location path. > > Ultimately, I'm hoping to use XPaths like the following. > count(//book) => 2 > contains(//gems, 'nokogiri') => True > substring-after(//h1, 'Nokogiri is') => 'subarashi' > Firefox's document.evaluate handles these, but maybe that's the DOM XPath > spec. > > Does libxml support expressions other than location paths? > Would this make sense as an enhancement to nokogiri? I think it is possible, but I have a hard time justifying to myself why you would need it. For example: count(//book) could be doc.xpath('//book').length contains(//gems, 'nokogiri') could be doc.xpath('//gems').any? { |x| x.to_s =~ /nokogiri/ } substring-after(//h1, 'Nokogiri is') could be doc.xpath('//h1').to_s.gsub(/^Nokogiri is/, '') I'm willing to look in to returning something other than a node set more seriously if I could get a compelling example. -- Aaron Patterson http://tenderlovemaking.com/ From andywatts at angrylapdog.com Tue Jan 13 16:41:30 2009 From: andywatts at angrylapdog.com (Andrew Watts-Curnow) Date: Tue, 13 Jan 2009 13:41:30 -0800 Subject: [Nokogiri-talk] XPath expressions other than location paths In-Reply-To: <6959e1680901131153x230f701lca213fec698321a9@mail.gmail.com> References: <6cf2f09c0901131145w3f618552s10a0d520c23d2632@mail.gmail.com> <6959e1680901131153x230f701lca213fec698321a9@mail.gmail.com> Message-ID: <6cf2f09c0901131341n2c4d170ercb2049b49a5ecf9c@mail.gmail.com> Aaron, Thanks for the reply. 
I use my XPath expressions in different environments and therefore trying to keep them as portable as possible. For example, some of my XPaths will also be used by firefox's document.evaluate command. Supporting more of the spec and thereby XPath portability seems a pretty compelling.....but then I'm the guy that needs it. :) - Andy On Tue, Jan 13, 2009 at 11:53 AM, Aaron Patterson wrote: > On Tue, Jan 13, 2009 at 11:45 AM, Andrew Watts-Curnow > wrote: > > Hi, > > > > I'm curious about support for XPath expressions that return booleans, > string > > or numbers. > > > > Here's a couple of quotes from the specs. > >> > >> The primary syntactic construct in XPath is the expression. An > expression > >> matches the production Expr. An expression is evaluated to yield an > object, > >> which has one of the following four basic types: > >> > >> node-set (an unordered collection of nodes without duplicates) > >> boolean (true or false) > >> number (a floating-point number) > >> string (a sequence of UCS characters) > >> > >> One important kind of expression is a location path. .... The result of > >> evaluating an expression that is a location path is the node-set > containing > >> the nodes selected by the location path. > > > > Ultimately, I'm hoping to use XPaths like the following. > > count(//book) => 2 > > contains(//gems, 'nokogiri') => True > > substring-after(//h1, 'Nokogiri is') => 'subarashi' > > Firefox's document.evaluate handles these, but maybe that's the DOM XPath > > spec. > > > > Does libxml support expressions other than location paths? > > Would this make sense as an enhancement to nokogiri? > > I think it is possible, but I have a hard time justifying to myself > why you would need it. > > For example: > > count(//book) could be doc.xpath('//book').length > contains(//gems, 'nokogiri') could be doc.xpath('//gems').any? 
{ |x| > x.to_s =~ /nokogiri/ } > substring-after(//h1, 'Nokogiri is') could be > doc.xpath('//h1').to_s.gsub(/^Nokogiri is/, '') > > I'm willing to look in to returning something other than a node set > more seriously if I could get a compelling example. > > -- > Aaron Patterson > http://tenderlovemaking.com/ > _______________________________________________ > Nokogiri-talk mailing list > Nokogiri-talk at rubyforge.org > http://rubyforge.org/mailman/listinfo/nokogiri-talk > -------------- next part -------------- An HTML attachment was scrubbed... URL: From holtonma at gmail.com Tue Jan 13 18:27:19 2009 From: holtonma at gmail.com (Mark Holton) Date: Tue, 13 Jan 2009 15:27:19 -0800 Subject: [Nokogiri-talk] XPath expressions other than location paths In-Reply-To: <6959e1680901131153x230f701lca213fec698321a9@mail.gmail.com> References: <6cf2f09c0901131145w3f618552s10a0d520c23d2632@mail.gmail.com> <6959e1680901131153x230f701lca213fec698321a9@mail.gmail.com> Message-ID: <8bd9e8730901131527t301e2dd9tb48d2784acbce009@mail.gmail.com> I like 'the saw' simple and effective... not bloated or like a Swiss Army Knife. My two cents. :Mark On Tue, Jan 13, 2009 at 11:53 AM, Aaron Patterson wrote: > On Tue, Jan 13, 2009 at 11:45 AM, Andrew Watts-Curnow > wrote: > > Hi, > > > > I'm curious about support for XPath expressions that return booleans, > string > > or numbers. > > > > Here's a couple of quotes from the specs. > >> > >> The primary syntactic construct in XPath is the expression. An > expression > >> matches the production Expr. An expression is evaluated to yield an > object, > >> which has one of the following four basic types: > >> > >> node-set (an unordered collection of nodes without duplicates) > >> boolean (true or false) > >> number (a floating-point number) > >> string (a sequence of UCS characters) > >> > >> One important kind of expression is a location path. .... 
The result of > >> evaluating an expression that is a location path is the node-set > containing > >> the nodes selected by the location path. > > > > Ultimately, I'm hoping to use XPaths like the following. > > count(//book) => 2 > > contains(//gems, 'nokogiri') => True > > substring-after(//h1, 'Nokogiri is') => 'subarashi' > > Firefox's document.evaluate handles these, but maybe that's the DOM XPath > > spec. > > > > Does libxml support expressions other than location paths? > > Would this make sense as an enhancement to nokogiri? > > I think it is possible, but I have a hard time justifying to myself > why you would need it. > > For example: > > count(//book) could be doc.xpath('//book').length > contains(//gems, 'nokogiri') could be doc.xpath('//gems').any? { |x| > x.to_s =~ /nokogiri/ } > substring-after(//h1, 'Nokogiri is') could be > doc.xpath('//h1').to_s.gsub(/^Nokogiri is/, '') > > I'm willing to look in to returning something other than a node set > more seriously if I could get a compelling example. > > -- > Aaron Patterson > http://tenderlovemaking.com/ > _______________________________________________ > Nokogiri-talk mailing list > Nokogiri-talk at rubyforge.org > http://rubyforge.org/mailman/listinfo/nokogiri-talk > -------------- next part -------------- An HTML attachment was scrubbed... URL: From goodieboy at gmail.com Tue Jan 13 21:06:19 2009 From: goodieboy at gmail.com (Matt Mitchell) Date: Tue, 13 Jan 2009 21:06:19 -0500 Subject: [Nokogiri-talk] beginner questions Message-ID: Hi, I just started using NG (after experiencing some Hpricot pain) and am very happy with it so far. I'm using it to scrape large TEI files to index into Solr, and also chunking out bits of xml fragments to disk. I've got a few questions and well, I'll just cut to it :) is there a nice way to get the next/previous sibling element, not text? how can i get the absolute position of a node? (to the root) how can i get the relative position of a node? 
(to its parent)
how can i get the depth of a node?
is there a way to search using negation?
  - Hpricot uses a "filter" like //:not(blah) - but does not seem to work correctly!
does it support unicode/utf-8?

Thanks in advance! And thanks for the awesome library.

Matt

From zhenjian at gmail.com Wed Jan 14 00:54:49 2009
From: zhenjian at gmail.com (Zhenjian YU)
Date: Wed, 14 Jan 2009 13:54:49 +0800
Subject: [Nokogiri-talk] Nokogiri doesn't produce the correct list as hpricot
In-Reply-To: <6959e1680901131031i344206a8yc5583496c367f5ed@mail.gmail.com>
References: <4ba70b40901130035j2663e6b0tffe3e243b08e6cac@mail.gmail.com> <6959e1680901130845p7e71475k5ea50d345544f875@mail.gmail.com> <20090114.022805.68093029.rubikitch@ruby-lang.org> <6959e1680901131031i344206a8yc5583496c367f5ed@mail.gmail.com>
Message-ID: <4ba70b40901132154n36014e7dx8eade0613df3fd08@mail.gmail.com>

I found the problem.

libxml2 actually uses the header information to determine the encoding,
gb2312. The problem is that the web page contains some characters that
don't belong to GB2312. Modern browsers are robust enough to handle them
correctly, but libxml2 uses iconv for the encoding conversion, and iconv
breaks when it encounters these characters. This leads to a truncated
document in nokogiri.

Explicitly specifying the charset as GB18030 solves this problem. GB18030
is big enough to contain all possible Chinese characters.

On Wed, Jan 14, 2009 at 2:31 AM, Aaron Patterson wrote:

> On Tue, Jan 13, 2009 at 9:28 AM, wrote:
> > From: "Aaron Patterson"
> > Subject: Re: [Nokogiri-talk] Nokogiri doesn't produce the correct list as
> hpricot
> > Date: Tue, 13 Jan 2009 08:45:22 -0800
> >
> >> Is there a good way to detect encoding in ruby? I'd like to make
> >> nokogiri just deal with this stuff automatically.
> >
> > charset header?
> > > > require 'rubygems' > > require 'nokogiri' > > require 'open-uri' > > > > open('http://read.dangdang.com/list_4') do |f| > > html = f.read > > f.charset # => "gb2312" > > doc = Nokogiri::HTML(html, nil, f.charset) > > doc.xpath('//div[@class="book_list"]').length # => 16 > > end > > > > But this result is incorrect. Bug? > > I'm not sure. It works with 'CP936' as the encoding. Perhaps they > are synonymous? > > -- > Aaron Patterson > http://tenderlovemaking.com/ > _______________________________________________ > Nokogiri-talk mailing list > Nokogiri-talk at rubyforge.org > http://rubyforge.org/mailman/listinfo/nokogiri-talk > -- http://www.yobo.com http://www.8sheng.com -------------- next part -------------- An HTML attachment was scrubbed... URL: From rainerthiel at zoho.com Sat Jan 17 10:55:23 2009 From: rainerthiel at zoho.com (rainerthiel) Date: Sat, 17 Jan 2009 17:55:23 +0200 Subject: [Nokogiri-talk] XSLT v2.0 support Message-ID: <11ee54dbab1.5813744184432384679.-270860799764988955@zoho.com> Greetings, Warnings ;) 1. Newbie question 2. Windows user i have installed the gem (nokogiri-1.0.5-x86-mswin32-60) and run the tests. All ok. Now i am trying to transform an xml using a stylesheet that outputs to multiple result documents. The code snippet: style = Nokogiri::XSLT.parse(File.read(xslt_file)) fails as follows: compilation error: element result-document xsltStylePreCompute: unknown xsl:result-document Does Nokogiri not support XSLT 2.0? Regards Rainer Thiel -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From aaron.patterson at gmail.com Sat Jan 17 16:34:58 2009 From: aaron.patterson at gmail.com (Aaron Patterson) Date: Sat, 17 Jan 2009 13:34:58 -0800 Subject: [Nokogiri-talk] XSLT v2.0 support In-Reply-To: <11ee54dbab1.5813744184432384679.-270860799764988955@zoho.com> References: <11ee54dbab1.5813744184432384679.-270860799764988955@zoho.com> Message-ID: <6959e1680901171334j1b9330b7vfc154e928d2b0b12@mail.gmail.com> On Sat, Jan 17, 2009 at 7:55 AM, rainerthiel wrote: > Greetings, > > Warnings ;) > 1. Newbie question > 2. Windows user > > i have installed the gem (nokogiri-1.0.5-x86-mswin32-60) and run the tests. > All ok. > Now i am trying to transform an xml using a stylesheet that outputs to > multiple result documents. > > The code snippet: > > style = Nokogiri::XSLT.parse(File.read(xslt_file)) > > fails as follows: > > compilation error: element result-document > xsltStylePreCompute: unknown xsl:result-document > > Does Nokogiri not support XSLT 2.0? I'm not sure..... It supports whatever libxslt, and libexslt support. We added support for libexslt pretty recently, so you should upgrade nokogiri and try again. -- Aaron Patterson http://tenderlovemaking.com/ From mike.tracy at gmail.com Wed Jan 21 10:33:01 2009 From: mike.tracy at gmail.com (Michael Tracy) Date: Wed, 21 Jan 2009 09:33:01 -0600 Subject: [Nokogiri-talk] unexpected behavior with xpath search Message-ID: This is a pattern I use alot when testing webapps and Hpricot behaves as I would expect, Nokogiri does something different. The scenario: 1.Simple page with 3 tables and varying numbers of tr/td 2. search for table elements 3. foreach table element search for tr elements 4. foreach tr element search for td elements 5. return the inner_text of td elements as an array The result I expect is to get an array of 6 elements that are the inner_html of each td. Each inner search would be limited to the children of each table node. Am I doing something wrong here?
table1 tr1 td1 table1 tr1 td2
table1 tr2 td1
table1 tr3 td1
table2 tr1 td1 table2 tr1 td2
=> nil
irb(main):019:0> doc = Nokogiri::HTML(File.read("nokotest.html"));ret = []; doc.search("//table").each { |tab| tab.search("//tr").each { |tr| tr.search("//td").each { |td| ret << td.inner_text }}};nil; ret.size
=> 48
irb(main):020:0> doc = Hpricot.parse(File.read("nokotest.html"));ret = []; doc.search("//table").each { |tab| tab.search("//tr").each { |tr| tr.search("//td").each { |td| ret << td.inner_text }}};nil; ret.size
=> 6

--
---
Michael L. Tracy // matasano security
read us on the web: http://www.matasano.com/log

From aaron.patterson at gmail.com Wed Jan 21 11:32:30 2009
From: aaron.patterson at gmail.com (Aaron Patterson)
Date: Wed, 21 Jan 2009 08:32:30 -0800
Subject: [Nokogiri-talk] unexpected behavior with xpath search
In-Reply-To: References:
Message-ID: <6959e1680901210832y1350e23bwcf3df4f8be4b680e@mail.gmail.com>

On Wed, Jan 21, 2009 at 7:33 AM, Michael Tracy wrote:
> This is a pattern I use alot when testing webapps and Hpricot behaves as I
> would expect, Nokogiri does something different.
> The scenario:
> 1.Simple page with 3 tables and varying numbers of tr/td
> 2. search for table elements
> 3. foreach table element search for tr elements
> 4. foreach tr element search for td elements
> 5. return the inner_text of td elements as an array
> The result I expect is to get an array of 6 elements that are the inner_html
> of each td. Each inner search would be limited to the children of each
> table node.
> Am I doing something wrong here?

You're not doing anything wrong. Your XPath queries probably just
aren't expressing what you think they're expressing. :-)

A leading double slash indicates that you want to search everything
below the root of the document, no matter which node you call the search
on. Beginning your xpath query with a slash *always* indicates "do this
search from the root node". Relative searches start with a dot.
Try something like this: doc.search("//table").each { |tab| tab.search(".//tr").each { |tr| tr.search(".//td").each { |td| ret << td.inner_text }}};nil; ret.size Hpricot does not implement XPath spec correctly, so this may cause confusion. I hope my example helps! -- Aaron Patterson http://tenderlovemaking.com/ From mtracy at matasano.com Wed Jan 21 12:55:55 2009 From: mtracy at matasano.com (Michael Tracy) Date: Wed, 21 Jan 2009 11:55:55 -0600 Subject: [Nokogiri-talk] unexpected behavior with xpath search In-Reply-To: <6959e1680901210832y1350e23bwcf3df4f8be4b680e@mail.gmail.com> References: <6959e1680901210832y1350e23bwcf3df4f8be4b680e@mail.gmail.com> Message-ID: That's what I get for relying on Hpricot bugs! irb(main):036:0> doc = Nokogiri::HTML(File.read ("nokotest.html"));doc.search("//table/tr/td").map { |x| x.inner_html }.size => 6 irb(main):037:0> ret = [];doc.search("//table").each { |tab| tab.search(".//tr").each { |tr| tr.search(".//td").each { |td| ret << td.inner_text }}};nil; ret.size => 6 like a charm. Thanks, -mt On Jan 21, 2009, at 10:32 AM, Aaron Patterson wrote: > On Wed, Jan 21, 2009 at 7:33 AM, Michael Tracy > wrote: >> This is a pattern I use alot when testing webapps and Hpricot >> behaves as I >> would expect, Nokogiri does something different. >> The scenario: >> 1.Simple page with 3 tables and varying numbers of tr/td >> 2. search for table elements >> 3. foreach table element search for tr elements >> 4. foreach tr element search for td elements >> 5. return the inner_text of td elements as an array >> The result I expect is to get an array of 6 elements that are the >> inner_html >> of each td. Each inner search would be limited to the children of >> each >> table node. >> Am I doing something wrong here? > > You're not doing anything wrong. Your XPath queries probably just > aren't expressing what you think they're expressing. :-) > > Double slash indicates that you want to search relative to the root of > the document. 
Beginning your xpath query with a slash *always* > indicates "do this search from the root node". Relative searches > start with a dot. > > Try something like this: > > doc.search("//table").each { |tab| tab.search(".//tr").each { |tr| > tr.search(".//td").each { |td| ret << td.inner_text }}};nil; ret.size > > Hpricot does not implement XPath spec correctly, so this may cause > confusion. I hope my example helps! > > -- > Aaron Patterson > http://tenderlovemaking.com/ > _______________________________________________ > Nokogiri-talk mailing list > Nokogiri-talk at rubyforge.org > http://rubyforge.org/mailman/listinfo/nokogiri-talk -------------- next part -------------- A non-text attachment was scrubbed... Name: smime.p7s Type: application/pkcs7-signature Size: 2421 bytes Desc: not available URL: From mike.dalessio at gmail.com Thu Jan 22 09:09:23 2009 From: mike.dalessio at gmail.com (Mike Dalessio) Date: Thu, 22 Jan 2009 09:09:23 -0500 Subject: [Nokogiri-talk] beginner questions In-Reply-To: References: Message-ID: <618c07250901220609x1d89b13du559e9e3c83666148@mail.gmail.com> Hi Matt, It doesn't look like anyone's tried to answer your questions, and you've probably figured out the answers by now, but for posterity, I'll try to answer anyway! On Tue, Jan 13, 2009 at 9:06 PM, Matt Mitchell wrote: > Hi, > > I just started using NG (after experiencing some Hpricot pain) and am very > happy with it so far. I'm using it to scrape large TEI files to index into > Solr, and also chunking out bits of xml fragments to disk. I've got a few > questions and well, I'll just cut to it :) > > is there a nice way to get the next/previous sibling element, not text? node.next or node.previous > > how can i get the absolute position of a node? (to the root) try node.css_path or node.path > > how can i get the relative position of a node? 
(to its parent)

node.parent.children.to_a.index(node) will give you an offset into the
node's siblings, but that will include text nodes (including blank text
nodes), so it may not be what you actually want. can you give me a use case
for what you're trying to do?

>
> how can i get the depth of a node?

walk back up the node.parent chain until you get nil. (if you think this
should be in the API, make a case for it!)

>
> is there a way to search using negation?
> - Hpricot uses a "filter" like //:not(blah) - but does not seem to work
> correctly!

right, hpricot's implementation was inconsistent, and we couldn't come up
with a use case for supporting that operator. can you provide a use case?
i'm open to implementing something that will be useful, and that can't be
achieved using the existing nokogiri interface.

>
> does it support unicode/utf-8?

you bet. you may need to provide the encoding as an additional argument to
the parse method.

>
>
> Thanks in advance! And thanks for the awesome library.
>
> Matt

--
mike dalessio
mike at csa.net

From normelton at gmail.com Fri Jan 23 15:55:15 2009
From: normelton at gmail.com (Norman Elton)
Date: Fri, 23 Jan 2009 15:55:15 -0500
Subject: [Nokogiri-talk] Stream Parsing
Message-ID: <6b3a7f010901231255v2d93beebg3eba63dbbb2bccae@mail.gmail.com>

I've tried hooking Nokogiri up to a TCP socket to parse incoming XML
data. I'd like to parse data as it's coming in, but it looks like my
start_document method isn't invoked until the socket is closed.

Is there a way to parse data "on the fly"?

Thanks!
Norman From aaron.patterson at gmail.com Fri Jan 23 17:07:26 2009 From: aaron.patterson at gmail.com (Aaron Patterson) Date: Fri, 23 Jan 2009 14:07:26 -0800 Subject: [Nokogiri-talk] Stream Parsing In-Reply-To: <6b3a7f010901231255v2d93beebg3eba63dbbb2bccae@mail.gmail.com> References: <6b3a7f010901231255v2d93beebg3eba63dbbb2bccae@mail.gmail.com> Message-ID: <6959e1680901231407o6be69175p18344f227d852aa5@mail.gmail.com> On Fri, Jan 23, 2009 at 12:55 PM, Norman Elton wrote: > I've tried hooking Nokogiri up to a TCP socket to parse incoming XML > data. I'd like to parse data as it's coming in, but it looks like my > start_document method isn't invoked until the socket is closed. > > Is there a way to parse data "on the fly"? Interesting. It should be. Are you passing it an IO object? Can you provide some sample code? We do support IO streams in the SAX parser, so that *should* work. -- Aaron Patterson http://tenderlovemaking.com/ From normelton at gmail.com Fri Jan 23 21:05:11 2009 From: normelton at gmail.com (Norman Elton) Date: Fri, 23 Jan 2009 21:05:11 -0500 Subject: [Nokogiri-talk] Stream Parsing In-Reply-To: <6959e1680901231407o6be69175p18344f227d852aa5@mail.gmail.com> References: <6b3a7f010901231255v2d93beebg3eba63dbbb2bccae@mail.gmail.com> <6959e1680901231407o6be69175p18344f227d852aa5@mail.gmail.com> Message-ID: <6b3a7f010901231805v5fc9d6d1pacbcd4a9cce10fd0@mail.gmail.com> >> Is there a way to parse data "on the fly"? > > Interesting. It should be. Are you passing it an IO object? Can you > provide some sample code? We do support IO streams in the SAX parser, > so that *should* work. Yes, I'm passing an IO object. I've pasted a boiled-down test case online: http://pastie.org/369341 It basically defines a simple SAX handler that outputs a timestamped message for a variety of events. Then it forks itself into a read/writer, and the writer feeds XML tags into the reader. 
On my system, the writer delivers each message, but the reader does not
recognize any until the writer closes the pipe.

When I replace the nokogiri parser with just plain calls to
io_read.gets, it works fine.

Apologies if I'm missing something obvious here. Thanks for any help!

Norman

From rubikitch at ruby-lang.org Sat Jan 24 13:45:55 2009
From: rubikitch at ruby-lang.org (rubikitch at ruby-lang.org)
Date: Sun, 25 Jan 2009 03:45:55 +0900 (JST)
Subject: [Nokogiri-talk] HEADER_DIRS
Message-ID: <20090125.034555.78759419.rubikitch@ruby-lang.org>

Hi,

I installed the latest snapshot of libxml2 in /usr/local,
and there is an older libxml2 in /usr.
In this situation, nokogiri uses the older one.

diff --git INDEX:/ext/nokogiri/extconf.rb WORKDIR:/ext/nokogiri/extconf.rb
index 8247c05..e217cc2 100644
--- INDEX:/ext/nokogiri/extconf.rb
+++ WORKDIR:/ext/nokogiri/extconf.rb
@@ -41,8 +41,8 @@ else
   HEADER_DIRS = [
     File.join(INCLUDEDIR, "libxml2"),
     INCLUDEDIR,
+    '/usr/local/include/libxml2',
     '/usr/include/libxml2',
-    '/usr/local/include/libxml2'
   ]

   [

--
rubikitch
Blog: http://d.hatena.ne.jp/rubikitch/
Site: http://www.rubyist.net/~rubikitch/

From aaron.patterson at gmail.com Sat Jan 24 14:01:34 2009
From: aaron.patterson at gmail.com (Aaron Patterson)
Date: Sat, 24 Jan 2009 11:01:34 -0800
Subject: [Nokogiri-talk] HEADER_DIRS
In-Reply-To: <20090125.034555.78759419.rubikitch@ruby-lang.org>
References: <20090125.034555.78759419.rubikitch@ruby-lang.org>
Message-ID: <6959e1680901241101x1f722454vb6c7b2354455ba85@mail.gmail.com>

On Sat, Jan 24, 2009 at 10:45 AM, wrote:
> Hi,
>
> I installed the latest snapshot of libxml2 in /usr/local,
> and there is an older libxml2 in /usr.
> In this situation, nokogiri uses the older one.
>
> diff --git INDEX:/ext/nokogiri/extconf.rb WORKDIR:/ext/nokogiri/extconf.rb
> index 8247c05..e217cc2 100644
> --- INDEX:/ext/nokogiri/extconf.rb
> +++ WORKDIR:/ext/nokogiri/extconf.rb
> @@ -41,8 +41,8 @@ else
>    HEADER_DIRS = [
>      File.join(INCLUDEDIR, "libxml2"),
>      INCLUDEDIR,
> +    '/usr/local/include/libxml2',
>      '/usr/include/libxml2',
> -    '/usr/local/include/libxml2'
>    ]

Applied!

http://github.com/tenderlove/nokogiri/commit/3c271cd7ee65a3daf3849067858ce722fcacb5c0

--
Aaron Patterson
http://tenderlovemaking.com/

From mike.dalessio at gmail.com Mon Jan 26 09:14:56 2009
From: mike.dalessio at gmail.com (Mike Dalessio)
Date: Mon, 26 Jan 2009 09:14:56 -0500
Subject: [Nokogiri-talk] Stream Parsing
In-Reply-To: <6b3a7f010901231805v5fc9d6d1pacbcd4a9cce10fd0@mail.gmail.com>
References: <6b3a7f010901231255v2d93beebg3eba63dbbb2bccae@mail.gmail.com>
 <6959e1680901231407o6be69175p18344f227d852aa5@mail.gmail.com>
 <6b3a7f010901231805v5fc9d6d1pacbcd4a9cce10fd0@mail.gmail.com>
Message-ID: <618c07250901260614qd4d03bbu1fda9493bc1aaded@mail.gmail.com>

Interesting. Two comments:

1) IO.pipe doesn't seem to be the best way to reproduce this -- even
io_read.read blocks until io_write is closed, independent of Nokogiri or
libxml2.

2) I've rewritten your test using sockets, and I'm getting the same result.

Codes: http://pastie.org/371022

Aaron, I'm not sure what to make of this. it's actually xmlParseDocument()
that's hanging, and io_read_callback doesn't get invoked until the socket is
closed. Any ideas?

On Fri, Jan 23, 2009 at 9:05 PM, Norman Elton wrote:

> >> Is there a way to parse data "on the fly"?
> >
> > Interesting. It should be. Are you passing it an IO object? Can you
> > provide some sample code? We do support IO streams in the SAX parser,
> > so that *should* work.
>
> Yes, I'm passing an IO object.
> I've pasted a boiled-down test case online:
>
> http://pastie.org/369341
>
> It basically defines a simple SAX handler that outputs a timestamped
> message for a variety of events. Then it forks itself into a
> read/writer, and the writer feeds XML tags into the reader. On my
> system, the writer delivers each message, but the reader does not
> recognize any until the writer closes the pipe.
>
> When I replace the nokogiri parser with just plain calls to
> io_read.gets, it works fine.
>
> Apologies if I'm missing something obvious here. Thanks for any help!
>
> Norman

--
mike dalessio
mike at csa.net

From aaron.patterson at gmail.com Mon Jan 26 12:14:28 2009
From: aaron.patterson at gmail.com (Aaron Patterson)
Date: Mon, 26 Jan 2009 09:14:28 -0800
Subject: [Nokogiri-talk] Stream Parsing
In-Reply-To: <618c07250901260614qd4d03bbu1fda9493bc1aaded@mail.gmail.com>
References: <6b3a7f010901231255v2d93beebg3eba63dbbb2bccae@mail.gmail.com>
 <6959e1680901231407o6be69175p18344f227d852aa5@mail.gmail.com>
 <6b3a7f010901231805v5fc9d6d1pacbcd4a9cce10fd0@mail.gmail.com>
 <618c07250901260614qd4d03bbu1fda9493bc1aaded@mail.gmail.com>
Message-ID: <6959e1680901260914g9350763lafdbde1831f9991e@mail.gmail.com>

Good morning!

On Mon, Jan 26, 2009 at 6:14 AM, Mike Dalessio wrote:
> Interesting. Two comments:
>
> 1) IO.pipe doesn't seem to be the best way to reproduce this -- even
> io_read.read blocks until io_write is closed, independent of Nokogiri or
> libxml2.
>
> 2) I've rewritten your test using sockets, and I'm getting the same result.
>
> Codes: http://pastie.org/371022
>
> Aaron, I'm not sure what to make of this. it's actually xmlParseDocument()
> that's hanging, and io_read_callback doesn't get invoked until the socket is
> closed.
> Any ideas?

Woot. I have an answer. libxml buffers its input to 4k. No
parsing/callbacks happen until the buffer gets filled.

Note my puts on line 58. If you run this, you'll see the callbacks get
called before the document finishes writing.

http://gist.github.com/52867

--
Aaron Patterson
http://tenderlovemaking.com/

From normelton at gmail.com Mon Jan 26 14:42:06 2009
From: normelton at gmail.com (Norman Elton)
Date: Mon, 26 Jan 2009 14:42:06 -0500
Subject: [Nokogiri-talk] Stream Parsing
In-Reply-To: <6959e1680901260914g9350763lafdbde1831f9991e@mail.gmail.com>
References: <6b3a7f010901231255v2d93beebg3eba63dbbb2bccae@mail.gmail.com>
 <6959e1680901231407o6be69175p18344f227d852aa5@mail.gmail.com>
 <6b3a7f010901231805v5fc9d6d1pacbcd4a9cce10fd0@mail.gmail.com>
 <618c07250901260614qd4d03bbu1fda9493bc1aaded@mail.gmail.com>
 <6959e1680901260914g9350763lafdbde1831f9991e@mail.gmail.com>
Message-ID: <6b3a7f010901261142i21cbb9d6s3c7a37bcd4963b61@mail.gmail.com>

> Woot. I have an answer. libxml buffers its input to 4k. No
> parsing/callbacks happen until the buffer gets filled.

Aaron,

Is there any workaround? A way to tell libxml to not buffer? I'd
really like to move to Nokogiri, but need to be able to parse streams
without waiting for the socket to close (or the buffer to fill).

Thanks!!
Norman

From aaron.patterson at gmail.com Mon Jan 26 22:59:57 2009
From: aaron.patterson at gmail.com (Aaron Patterson)
Date: Mon, 26 Jan 2009 19:59:57 -0800
Subject: [Nokogiri-talk] Stream Parsing
In-Reply-To: <6b3a7f010901261142i21cbb9d6s3c7a37bcd4963b61@mail.gmail.com>
References: <6b3a7f010901231255v2d93beebg3eba63dbbb2bccae@mail.gmail.com>
 <6959e1680901231407o6be69175p18344f227d852aa5@mail.gmail.com>
 <6b3a7f010901231805v5fc9d6d1pacbcd4a9cce10fd0@mail.gmail.com>
 <618c07250901260614qd4d03bbu1fda9493bc1aaded@mail.gmail.com>
 <6959e1680901260914g9350763lafdbde1831f9991e@mail.gmail.com>
 <6b3a7f010901261142i21cbb9d6s3c7a37bcd4963b61@mail.gmail.com>
Message-ID: <6959e1680901261959t2b446794qd9115d79e27c6fb7@mail.gmail.com>

On Mon, Jan 26, 2009 at 11:42 AM, Norman Elton wrote:
>> Woot. I have an answer. libxml buffers its input to 4k. No
>> parsing/callbacks happen until the buffer gets filled.
>
> Aaron,
>
> Is there any workaround? A way to tell libxml to not buffer? I'd
> really like to move to Nokogiri, but need to be able to parse streams
> without waiting for the socket to close (or the buffer to fill).

How did I know this would be the next question? ;-)

As far as I can tell, we can't tell libxml not to buffer. Technically,
we could decrease the size of the buffer, but that looks very
dangerous. I can't seem to find anything in the API documentation that
would let us force libxml to make its callbacks.

Given this new info, do you have any ideas, Mike? I can't seem to find
any help in the libxml source.

Just out of curiosity, what size chunks are you parsing? I'm trying
to visualize a use case where a 4k buffer size is too large. If
you're dealing with small chunks, a push parser might be better. I am
very close to having the push parser work.
:-)

--
Aaron Patterson
http://tenderlovemaking.com/

From normelton at gmail.com Mon Jan 26 23:28:26 2009
From: normelton at gmail.com (Norman Elton)
Date: Mon, 26 Jan 2009 23:28:26 -0500
Subject: [Nokogiri-talk] Stream Parsing
In-Reply-To: <6959e1680901261959t2b446794qd9115d79e27c6fb7@mail.gmail.com>
References: <6b3a7f010901231255v2d93beebg3eba63dbbb2bccae@mail.gmail.com>
 <6959e1680901231407o6be69175p18344f227d852aa5@mail.gmail.com>
 <6b3a7f010901231805v5fc9d6d1pacbcd4a9cce10fd0@mail.gmail.com>
 <618c07250901260614qd4d03bbu1fda9493bc1aaded@mail.gmail.com>
 <6959e1680901260914g9350763lafdbde1831f9991e@mail.gmail.com>
 <6b3a7f010901261142i21cbb9d6s3c7a37bcd4963b61@mail.gmail.com>
 <6959e1680901261959t2b446794qd9115d79e27c6fb7@mail.gmail.com>
Message-ID: <6b3a7f010901262028v7135a112nb9d2f11e3bc3da12@mail.gmail.com>

> Just out of curiosity, what size chunks are you parsing? I'm trying
> to visualize a use case where a 4k buffer size is too large. If
> you're dealing with small chunks, a push parser might be better. I am
> very close to having the push parser work. :-)

If I can hook an IO up to a push parser, I'm pretty sure that's
exactly what I'd want.

The details of my application are a little non-standard. Basically, a
TCP socket exists between server and client. The RPC exchange consists
of two long-running XML documents, one in each direction. Both
machines exchange xml-decl's, followed by an opening tag. The client
then burps XML snippets to the server, which responds with a response
snippet. The document never actually "ends" until the session is
complete and the socket closed.

In order for this to work, each speaker needs to recognize each
other's communication without the document fully parsing or the socket
closing. It sounds like a push parser would do the trick :)

Thanks for your help!
Norman

From aaron.patterson at gmail.com Mon Jan 26 23:50:03 2009
From: aaron.patterson at gmail.com (Aaron Patterson)
Date: Mon, 26 Jan 2009 20:50:03 -0800
Subject: [Nokogiri-talk] Stream Parsing
In-Reply-To: <6b3a7f010901262028v7135a112nb9d2f11e3bc3da12@mail.gmail.com>
References: <6b3a7f010901231255v2d93beebg3eba63dbbb2bccae@mail.gmail.com>
 <6959e1680901231407o6be69175p18344f227d852aa5@mail.gmail.com>
 <6b3a7f010901231805v5fc9d6d1pacbcd4a9cce10fd0@mail.gmail.com>
 <618c07250901260614qd4d03bbu1fda9493bc1aaded@mail.gmail.com>
 <6959e1680901260914g9350763lafdbde1831f9991e@mail.gmail.com>
 <6b3a7f010901261142i21cbb9d6s3c7a37bcd4963b61@mail.gmail.com>
 <6959e1680901261959t2b446794qd9115d79e27c6fb7@mail.gmail.com>
 <6b3a7f010901262028v7135a112nb9d2f11e3bc3da12@mail.gmail.com>
Message-ID: <6959e1680901262050n52bb03a7md12dc12b8638a5d4@mail.gmail.com>

On Mon, Jan 26, 2009 at 8:28 PM, Norman Elton wrote:
>> Just out of curiosity, what size chunks are you parsing? I'm trying
>> to visualize a use case where a 4k buffer size is too large. If
>> you're dealing with small chunks, a push parser might be better. I am
>> very close to having the push parser work. :-)
>
> If I can hook an IO up to a push parser, I'm pretty sure that's
> exactly what I'd want.
>
> The details of my application are a little non-standard. Basically, a
> TCP socket exists between server and client. The RPC exchange consists
> of two long-running XML documents, one in each direction. Both
> machines exchange xml-decl's, followed by an opening tag. The client
> then burps XML snippets to the server, which responds with a response
> snippet. The document never actually "ends" until the session is
> complete and the socket closed.
>
> In order for this to work, each speaker needs to recognize each
> other's communication without the document fully parsing or the socket
> closing. It sounds like a push parser would do the trick :)

Gotcha. Sounds like you've got an XMPP type situation.
Yes, you want push parsing. Keep an eye on this ticket:

http://nokogiri.lighthouseapp.com/projects/19607/tickets/9-push-parsing-chunked-parsing

I think I'll have something tonight or tomorrow.

--
Aaron Patterson
http://tenderlovemaking.com/
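
The pipe behaviour discussed in this thread (io_read.gets returning promptly,
io_read.read blocking until io_write is closed) can be reproduced without
Nokogiri or libxml2. What follows is only a sketch, not code from either of
the pasted test cases: it uses nothing but Ruby's IO.pipe in a single process,
and the sample <message/> lines and the names first_line, pending and rest are
made up for illustration.

```ruby
# Single-process sketch: one end of a pipe is written to while the
# write end stays open, as it would during a long-running session.
io_read, io_write = IO.pipe

io_write.puts "<message id='1'/>"   # small write; the writer stays open

# gets returns as soon as one complete line is available.
first_line = io_read.gets

# A plain io_read.read (read to end-of-file) would block at this point,
# because the writer is still open. read_nonblock is used here only to
# show, without hanging, that no further data is available yet.
pending = begin
  io_read.read_nonblock(4096)
rescue IO::WaitReadable
  :would_block
end

io_write.puts "<message id='2'/>"
io_write.close                      # end-of-file becomes visible only now

rest = io_read.read                 # returns because the writer has closed
io_read.close

puts first_line.inspect
puts pending.inspect
puts rest.inspect
```

A reader that asks for a fixed-size block, for example io_read.read(4096),
waits in the same way for either that many bytes or end-of-file, which is
consistent with the 4k buffering described above. A push parser sidesteps the
problem because the caller hands over each chunk as it arrives instead of
letting the parser pull from the IO.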