From boss at airbladesoftware.com Thu Dec 7 09:21:04 2006
From: boss at airbladesoftware.com (Andrew Stewart)
Date: Thu, 7 Dec 2006 14:21:04 +0000
Subject: [Mechanize-users] Response To Form Submission Hanging
Message-ID:

Hello,

I am using Mechanize to post a form to a website. When I do this by hand in my browser the response takes about 35s to come back (it's a long page full of tables and graphics). When I do this with Mechanize, the server starts to respond and then appears to hang.

The obvious conclusion is that my code is wrong but I am reasonably sure that I haven't altered it since it was working earlier in the week (famous last words!). And I don't know how to observe exactly what my browser sends to the server to compare with my Mechanize log.

What's the best way to debug the problem? In fact, more importantly, what's the problem? :)

Here's some of the evidence I have gathered:

The log of the response:

D, [2006-12-07T14:04:29.213035 #374] DEBUG -- : response-header: content-type => text/html; charset=utf-8
D, [2006-12-07T14:04:29.213337 #374] DEBUG -- : response-header: date => Thu, 07 Dec 2006 14:04:27 GMT
D, [2006-12-07T14:04:29.213465 #374] DEBUG -- : response-header: server => Apache
D, [2006-12-07T14:04:29.213561 #374] DEBUG -- : response-header: set-cookie => backend=web-2-06; path=/; expires=Fri, 08-Dec-2006 14:04:28 GMT
D, [2006-12-07T14:04:29.213657 #374] DEBUG -- : response-header: transfer-encoding => chunked
D, [2006-12-07T14:04:29.214291 #374] DEBUG -- : saved cookie: backend=web-2-06

(Let me know if it would help to see the log of the request.)

The stack trace produced when I interrupt my code after several minutes:

^C/usr/local/lib/ruby/1.8/net/protocol.rb:133:in `sysread': Interrupt
        from /usr/local/lib/ruby/1.8/net/protocol.rb:133:in `rbuf_fill'
        from /usr/local/lib/ruby/1.8/timeout.rb:56:in `timeout'
        from /usr/local/lib/ruby/1.8/timeout.rb:76:in `timeout'
        from /usr/local/lib/ruby/1.8/net/protocol.rb:132:in `rbuf_fill'
        from /usr/local/lib/ruby/1.8/net/protocol.rb:86:in `read'
        from /usr/local/lib/ruby/1.8/net/http.rb:2200:in `read_chunked'
        from /usr/local/lib/ruby/1.8/net/http.rb:2175:in `read_body_0'
        from /usr/local/lib/ruby/1.8/net/http.rb:2141:in `read_body'
         ... 7 levels...
        from /usr/local/lib/ruby/gems/1.8/gems/mechanize-0.6.3/lib/mechanize.rb:265:in `post_form'
        from /usr/local/lib/ruby/gems/1.8/gems/mechanize-0.6.3/lib/mechanize.rb:201:in `submit'
        from lib/grabber.rb:49
        from lib/grabber.rb:28

And some of my code:

...
21  agent.read_timeout = 120  # clutching at straws
...
45  form = page.forms[1]
46  form.checkboxes.each { |c| c.checked = true }
47  options.merge('stage' => 'refine').each { |k,v| form.add_field!(k, v) }
48  puts "about to submit at #{Time.now}"
49  page = agent.submit(form)
50  puts "submitted at #{Time.now}"
...

The 'about to submit...' statement is written out to the console. The 'submitted at...' line isn't. I've left it running for up to 5 minutes.

Any help would be much appreciated.

Thanks and regards,
Andy Stewart

From schapht at gmail.com Thu Dec 7 10:20:57 2006
From: schapht at gmail.com (Mat Schaffer)
Date: Thu, 7 Dec 2006 10:20:57 -0500
Subject: [Mechanize-users] Response To Form Submission Hanging
In-Reply-To:
References:
Message-ID: <652E835D-E017-4A96-BA08-50BF199D80AC@gmail.com>

On Dec 7, 2006, at 9:21 AM, Andrew Stewart wrote:
> I am using Mechanize to post a form to a website. When I do this by
> hand in my browser the response takes about 35s to come back (it's a
> long page full of tables and graphics).
> When I do this with
> Mechanize, the server starts to respond and then appears to hang.

I don't have any specific ideas based off of your logs. But this problem reminded me that I've seen Net::HTTP (specifically mongrel and webrick, but they both use it) do some non-standard things and trip up browsers. Maybe you're seeing something similar?

Personally I would use Charles to debug it to see if there are any marked differences between what the browser is doing and what mechanize is doing. It's shareware, so you can only run it for like 30 minutes at a time. But I find that's long enough to debug most issues.

http://www.xk72.com/charles/

Good luck! Maybe Aaron will have some more ideas for you.
-Mat

From boss at airbladesoftware.com Fri Dec 8 04:46:13 2006
From: boss at airbladesoftware.com (Andrew Stewart)
Date: Fri, 8 Dec 2006 09:46:13 +0000
Subject: [Mechanize-users] Response To Form Submission Hanging
In-Reply-To: <652E835D-E017-4A96-BA08-50BF199D80AC@gmail.com>
References: <652E835D-E017-4A96-BA08-50BF199D80AC@gmail.com>
Message-ID: <3775AAD3-F208-4353-955F-D1212EBE4391@airbladesoftware.com>

Mat,

> I don't have any specific ideas based off of your logs. But this
> problem reminded me that I've seen Net::HTTP (specifically mongrel
> and webrick, but they both use it) do some non-standard things and
> trip up browsers. Maybe you're seeing something similar?

Intriguing....

> Personally I would use Charles to debug it to see if there are any
> marked differences between what the browser is doing and what
> mechanize is doing. It's shareware, so you can only run it for like
> 30 minutes at a time. But I find that's long enough to debug most
> issues.

Thanks for the pointer to Charles -- I didn't know of it and it looks useful.

> Good luck! Maybe Aaron will have some more ideas for you.

Let's hope!

Thanks and regards,
Andy Stewart

From aaron_patterson at speakeasy.net Fri Dec 8 18:20:37 2006
From: aaron_patterson at speakeasy.net (Aaron Patterson)
Date: Fri, 8 Dec 2006 15:20:37 -0800
Subject: [Mechanize-users] Response To Form Submission Hanging
In-Reply-To:
References:
Message-ID: <20061208232037.GA26010@eviladmins.lan>

Hi Andrew,

On Thu, Dec 07, 2006 at 02:21:04PM +0000, Andrew Stewart wrote:
> Hello,
>
> I am using Mechanize to post a form to a website. When I do this by
> hand in my browser the response takes about 35s to come back (it's a
> long page full of tables and graphics). When I do this with
> Mechanize, the server starts to respond and then appears to hang.

How long exactly is the page? In the mega-bytes?

>
> The obvious conclusion is that my code is wrong but I am reasonably
> sure that I haven't altered it since it was working earlier in the
> week (famous last words!). And I don't know how to observe exactly
> what my browser sends to the server to compare with my Mechanize log.
>
> What's the best way to debug the problem? In fact, more importantly,
> what's the problem? :)
>
> Here's some of the evidence I have gathered:
>
> The log of the response:
>
> D, [2006-12-07T14:04:29.213035 #374] DEBUG -- : response-header: content-type => text/html; charset=utf-8
> D, [2006-12-07T14:04:29.213337 #374] DEBUG -- : response-header: date => Thu, 07 Dec 2006 14:04:27 GMT
> D, [2006-12-07T14:04:29.213465 #374] DEBUG -- : response-header: server => Apache
> D, [2006-12-07T14:04:29.213561 #374] DEBUG -- : response-header: set-cookie => backend=web-2-06; path=/; expires=Fri, 08-Dec-2006 14:04:28 GMT
> D, [2006-12-07T14:04:29.213657 #374] DEBUG -- : response-header: transfer-encoding => chunked
> D, [2006-12-07T14:04:29.214291 #374] DEBUG -- : saved cookie: backend=web-2-06
>
> (Let me know if it would help to see the log of the request.)
>
> The stack trace produced when I interrupt my code after several minutes:
>
> ^C/usr/local/lib/ruby/1.8/net/protocol.rb:133:in `sysread': Interrupt
>         from /usr/local/lib/ruby/1.8/net/protocol.rb:133:in `rbuf_fill'
>         from /usr/local/lib/ruby/1.8/timeout.rb:56:in `timeout'
>         from /usr/local/lib/ruby/1.8/timeout.rb:76:in `timeout'
>         from /usr/local/lib/ruby/1.8/net/protocol.rb:132:in `rbuf_fill'
>         from /usr/local/lib/ruby/1.8/net/protocol.rb:86:in `read'
>         from /usr/local/lib/ruby/1.8/net/http.rb:2200:in `read_chunked'
>         from /usr/local/lib/ruby/1.8/net/http.rb:2175:in `read_body_0'
>         from /usr/local/lib/ruby/1.8/net/http.rb:2141:in `read_body'
>          ... 7 levels...
>         from /usr/local/lib/ruby/gems/1.8/gems/mechanize-0.6.3/lib/mechanize.rb:265:in `post_form'
>         from /usr/local/lib/ruby/gems/1.8/gems/mechanize-0.6.3/lib/mechanize.rb:201:in `submit'
>         from lib/grabber.rb:49
>         from lib/grabber.rb:28

I've seen stack traces similar to this when you get read timeouts. If the server is slow, that may be your problem.....

Also, there is sort of a bug in ruby net/http where large file will take a long time to download and max out your CPU. Try adding the following snippet and see if that helps. Note that it will only work for ruby versions > 1.8.2.

class Net::InternetMessageIO #:nodoc:
  alias :old_rbuf_fill :rbuf_fill
  def rbuf_fill
    begin
      @rbuf << @io.read_nonblock(65536)
    rescue Errno::EWOULDBLOCK
      if IO.select([@io], nil, nil, @read_timeout)
        @rbuf << @io.read_nonblock(65536)
      else
        raise Timeout::TimeoutError
      end
    end
  end
end

If you can send a short sample that will reproduce the problem, I can help more.

--
Aaron Patterson
http://tenderlovemaking.com/

From christopher.mcmahon at gmail.com Sat Dec 9 13:21:51 2006
From: christopher.mcmahon at gmail.com (Chris McMahon)
Date: Sat, 9 Dec 2006 10:21:51 -0800
Subject: [Mechanize-users] manipulate headers?
In-Reply-To: <72799cd70612091001g3b0b5086ra11cfe59f6ab586e@mail.gmail.com>
References: <72799cd70612091001g3b0b5086ra11cfe59f6ab586e@mail.gmail.com>
Message-ID: <72799cd70612091021l39510209wb78955d55d5063b1@mail.gmail.com>

Hi...

Here's a working Perl script that I want to be able to do in Ruby:

use WWW::Mechanize;
my $url = "http://host/tt?name=chris";
my $mech = WWW::Mechanize->new();
$mech->add_header( Referer => "http://chrismcmahonsblog.blogspot.com" );
$mech->add_header( Cookie => "messageid=170118; memberid=1007");
$mech->get($url)

so the header values for Referer and Cookie are passed with the HTTP GET.

There seems to be an add_field method in mechanize.rb:

request.add_field('Referer', cur_page.uri.to_s)

and rdoc (http://mechanize.rubyforge.org/) indicates an add_field value for Net::HTTPHeader, but the example is funny, because the example says "add_header", not "add_field".

In either case, none of the "add_*" statements in the script below work, each yields an "undefined method" error.

require 'mechanize'
mech = WWW::Mechanize.new
#request.add_header('Referer', "http://chrismcmahonsblog.blogspot.com" )
#mech.add_header('Referer', "http://chrismcmahonsblog.blogspot.com" )
#mech.add_field('Cookie', "messageid=170118; memberid=1007")
puts mech.get("http://host/tt?name=chris").inspect

can anyone tell me how to add Referer and Cookie headers to an HTTP GET request?

From boss at airbladesoftware.com Mon Dec 11 05:15:40 2006
From: boss at airbladesoftware.com (Andrew Stewart)
Date: Mon, 11 Dec 2006 10:15:40 +0000
Subject: [Mechanize-users] Response To Form Submission Hanging
In-Reply-To: <20061208232037.GA26010@eviladmins.lan>
References: <20061208232037.GA26010@eviladmins.lan>
Message-ID: <33CD149A-E33B-4D12-853B-CDFE4F34C830@airbladesoftware.com>

Hi Aaron,

On 8 Dec 2006, at 23:20, Aaron Patterson wrote:

> How long exactly is the page? In the mega-bytes?

Just under 3.2 MB.

> I've seen stack traces similar to this when you get read timeouts. If
> the server is slow, that may be your problem.....

I wouldn't call the server fast but it seems good enough. When I do what the script does myself in the browser, the page comes back in under a minute.

> Also, there is sort of a bug in ruby net/http where large file will
> take a long time to download and max out your CPU. Try adding the
> following snippet and see if that helps. Note that it will only work
> for ruby versions > 1.8.2.

Thanks for the snippet. I'm on Ruby 1.8.4 and unfortunately the snippet didn't appear to make any difference.

> If you can send a short sample that will reproduce the problem, I can
> help more.

I've pasted my code below. The part which hangs is the final line using Mechanize: "page = agent.submit(form)".

----
require 'rubygems'
require 'mechanize'
require 'logger'

class Net::InternetMessageIO #:nodoc:
  alias :old_rbuf_fill :rbuf_fill
  def rbuf_fill
    begin
      @rbuf << @io.read_nonblock(65536)
    rescue Errno::EWOULDBLOCK
      if IO.select([@io], nil, nil, @read_timeout)
        @rbuf << @io.read_nonblock(65536)
      else
        raise Timeout::TimeoutError
      end
    end
  end
end

class SangerParser < WWW::Mechanize::Page
  def initialize(uri = nil, response = nil, body = nil, code = nil)
    # Ditch any runs of table data which contain only and at least 3 "<td>" or "</td>"s
    body.gsub!(/(<\/?td>){3,}/, '')
    # Ditch any title attributes which equal nothing at all
    body.gsub!(/ title=>/i, '>')
    super(uri, response, body, code)
  end
end

logger = Logger.new('mech.log')
logger.level = Logger::DEBUG

agent = WWW::Mechanize.new { |a| a.log = logger }
agent.pluggable_parser.html = SangerParser
agent.user_agent_alias = 'Mac Safari'
agent.read_timeout = 120  # Does this make any difference?

# Tissue selection screen
url = 'http://www.sanger.ac.uk/cgi-bin/genetics/CGP/genotyping/lohmap'
page = agent.get(url)

# Just choose one tissue for now
#tissues = page.forms[1].fields.name('site_1').options[1..-1].map { |o| o.value }
tissues = %w( Colorectal )

tissues.each { |tissue|

  # chrom: chromosome to display
  # site_1: tissue to display
  # freqval: number of consecutive homozygous markers to colour (1-6)
  # failmark: extend homozygous markup (yes = 1, no = 0)
  # sort: sort the cell lines (by loss [high to low] = 1, by cell line [A-Z] = 0)
  options = { 'site_1'   => tissue,
              'chrom'    => '1',
              'freqval'  => '1',
              'failmark' => '0',
              'sort'     => '1' }

  # Post directly without going via the tissue selection form
  page = agent.post(url, options.merge('stage' => 'display'))

  # LOH data selection screen
  form = page.forms[1]
  form.checkboxes.each { |c| c.checked = true }
  options.merge('stage' => 'refine').each { |k, v| form.add_field!(k, v) }
  puts "about to submit at #{Time.now}"
  page = agent.submit(form)
  puts "submitted at #{Time.now}"

  # Write out results page to disk
  open("data/#{tissue.downcase}.html", 'w') { |f| f << page.body }
}
----

Thanks for your help,
Andy Stewart

From aaron_patterson at speakeasy.net Tue Dec 12 22:43:32 2006
From: aaron_patterson at speakeasy.net (Aaron Patterson)
Date: Tue, 12 Dec 2006 19:43:32 -0800
Subject: [Mechanize-users] Response To Form Submission Hanging
In-Reply-To: <33CD149A-E33B-4D12-853B-CDFE4F34C830@airbladesoftware.com>
References: <20061208232037.GA26010@eviladmins.lan> <33CD149A-E33B-4D12-853B-CDFE4F34C830@airbladesoftware.com>
Message-ID: <20061213034332.GA2456@eviladmins.lan>

On Mon, Dec 11, 2006 at 10:15:40AM +0000, Andrew Stewart wrote:
> Hi Aaron,
>
> On 8 Dec 2006, at 23:20, Aaron Patterson wrote:
>
> > How long exactly is the page? In the mega-bytes?
>
> Just under 3.2 MB.
>
> > I've seen stack traces similar to this when you get read timeouts. If
> > the server is slow, that may be your problem.....
>
> I wouldn't call the server fast but it seems good enough. When I do
> what the script does myself in the browser, the page comes back in
> under a minute.
>
> > Also, there is sort of a bug in ruby net/http where large file will
> > take a long time to download and max out your CPU. Try adding the
> > following snippet and see if that helps. Note that it will only work
> > for ruby versions > 1.8.2.
>
> Thanks for the snippet. I'm on Ruby 1.8.4 and unfortunately the
> snippet didn't appear to make any difference.
>
> > If you can send a short sample that will reproduce the problem, I can
> > help more.

[snip]

I think I've solved the problem. I added some extra debugging to mechanize and found that the server was returning too much data! I think it was caught in some sort of infinite loop. Mechanize downloaded over 100 megs of data before the socket read timed out.

Basically the problem was sending the "sort" argument to the server. When I removed that, things seemed to work OK. I found using LiveHTTPHeaders for Firefox that the form you submit isn't supposed to send a "sort" argument.

I noticed that the html returned is not clean, so I modified your pluggable parser to use Tidy. This let Hpricot parse everything properly and the form object seemed to be populated correctly.

Anyway, here is my version of your script (you'll need to verify it is returning the correct data from the form):

require 'rubygems'
require 'mechanize'
require 'logger'
require 'tidy'

# Change this depending on your system
Tidy.path = '/usr/lib/libtidy.dylib'

class SangerParser < WWW::Mechanize::Page
  def initialize(uri = nil, response = nil, body = nil, code = nil)
    xml = Tidy.open() do |tidy|
      tidy.options.output_xml = true
      xml = tidy.clean(body)
      xml
    end
    super(uri, response, xml, code)
  end
end

logger = Logger.new('mech.log')
logger.level = Logger::DEBUG

agent = WWW::Mechanize.new { |a| a.log = logger }
agent.pluggable_parser.html = SangerParser
agent.user_agent_alias = 'Mac Safari'
agent.read_timeout = 120  # Does this make any difference?

# Tissue selection screen
url = 'http://www.sanger.ac.uk/cgi-bin/genetics/CGP/genotyping/lohmap'
page = agent.get(url)

# Just choose one tissue for now
tissues = page.forms[1].fields.name('site_1').options[1..-1].map { |o| o.value }
tissues = %w( Colorectal )

tissues.each { |tissue|

  # chrom: chromosome to display
  # site_1: tissue to display
  # freqval: number of consecutive homozygous markers to colour (1-6)
  # failmark: extend homozygous markup (yes = 1, no = 0)
  # sort: sort the cell lines (by loss [high to low] = 1, by cell line [A-Z] = 0)
  options = { 'site_1'   => tissue,
              'chrom'    => '1',
              'freqval'  => '1',
              'failmark' => '0',
              'sort'     => '1' }

  # Post directly without going via the tissue selection form
  page = agent.post(url, options.merge('stage' => 'display'))

  # LOH data selection screen
  form = page.forms[1]
  form.checkboxes.each { |c| c.checked = true }
  puts "about to submit at #{Time.now}"
  page = agent.submit(form, form.buttons.first)
  puts "submitted at #{Time.now}"

  # Write out results page to disk
  open("data/#{tissue.downcase}.html", 'w') { |f| f << page.body }
}

--
Aaron Patterson
http://tenderlovemaking.com/

From boss at airbladesoftware.com Wed Dec 13 12:12:21 2006
From: boss at airbladesoftware.com (Andrew Stewart)
Date: Wed, 13 Dec 2006 17:12:21 +0000
Subject: [Mechanize-users] Response To Form Submission Hanging
In-Reply-To: <20061213034332.GA2456@eviladmins.lan>
References: <20061208232037.GA26010@eviladmins.lan> <33CD149A-E33B-4D12-853B-CDFE4F34C830@airbladesoftware.com> <20061213034332.GA2456@eviladmins.lan>
Message-ID:

Hi Aaron,

On 13 Dec 2006, at 03:43, Aaron Patterson wrote:
> I think I've solved the problem. I added some extra debugging to
> mechanize and found that the server was returning too much data! I
> think it was caught in some sort of infinite loop. Mechanize
> downloaded
> over 100 megs of data before the socket read timed out.
>
> Basically the problem was sending the "sort" argument to the server.
> When I removed that, things seemed to work OK. I found using
> LiveHTTPHeaders for Firefox that the form you submit isn't supposed to
> send a "sort" argument.

It works perfectly now. Thank you!

I never thought that the "sort" argument was causing the problem. Well spotted.

> I noticed that the html returned is not clean, so I modified your
> pluggable
> parser to use Tidy. This let Hpricot parse everything properly and
> the
> form object seemed to be populated correctly.

Using Tidy sounds much better than laboriously and erroneously doing my own tidying. And as you say, it allows the form object to be populated correctly. Super!

> Anyway, here is my version of your script (you'll need to verify it is
> returning the correct data from the form):

I've checked the data and it's 100% accurate.

Thanks again for your help -- I am really grateful. And I've learned a few things along the way too :)

Regards,
Andy Stewart

From zach at zachbaker.com Sat Dec 30 14:32:06 2006
From: zach at zachbaker.com (Zach Baker)
Date: Sat, 30 Dec 2006 11:32:06 -0800
Subject: [Mechanize-users] Change I needed to make in to_absolute_uri for unescaped URL separator characters
Message-ID: <4596BEB6.8070000@zachbaker.com>

Mechanize is great!
It's better than anything I was expecting to be out there, and the syntax is really nice. I had a bit of a problem though -- following URLs with commas. The code in to_absolute_uri works great for spaces, but some pages I was working on had URLs with unescaped commas that URI rejected when I tried to click() on them. So I changed the first statement in Mechanize#to_absolute_uri to:

url = URI.parse(
        URI.unescape(
          Util.html_unescape(url.to_s.strip)).gsub(/[ ,]/){|m| '%%%X' % m[0]}
      ) unless url.is_a? URI

I don't know what a more complete solution for this bug is, but this at least lets me follow the URLs I need to, so here it is if anyone else needs to use it.

--
Zach.

From zach at zachbaker.com Sun Dec 31 13:28:49 2006
From: zach at zachbaker.com (Zach Baker)
Date: Sun, 31 Dec 2006 10:28:49 -0800
Subject: [Mechanize-users] Retrying requests
Message-ID: <45980161.1080708@zachbaker.com>

I use this method to retry my requests if there are retrieval problems.

def with_retries(num_retries = 4)
  begin
    yield
  rescue Errno::ECONNRESET, Errno::ECONNABORTED, Errno::EHOSTUNREACH,
         Errno::ECONNREFUSED, Errno::ETIMEDOUT, Timeout::Error,
         WWW::Mechanize::ResponseCodeError
    num_retries -= 1
    retry unless num_retries < 0
    raise
  end
end

So I can try three times to get a page:

with_retries(2){page.click()}

before the outer method raises an exception. But I'm not really sure if I'm catching all the right exceptions. Anyone know?

--
Zach.
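
The retry helper in the last message can be sketched using only the Ruby standard library. This is a minimal sketch and not part of Mechanize: the names TRANSIENT_ERRORS, base_delay and errors are hypothetical, introduced here for illustration, and the exception list is an assumption about which failures are worth retrying rather than a definitive answer to the question above. It keeps the same counting as the original (with_retries(2) makes at most three attempts) and adds a short, growing pause between attempts.

```ruby
require 'socket'   # defines SocketError
require 'timeout'  # defines Timeout::Error

# Hypothetical default list of transient network errors (an assumption,
# not an exhaustive or authoritative set).
TRANSIENT_ERRORS = [
  Errno::ECONNRESET, Errno::ECONNABORTED, Errno::EHOSTUNREACH,
  Errno::ECONNREFUSED, Errno::ETIMEDOUT, Errno::EPIPE,
  Timeout::Error, EOFError, SocketError
].freeze

# Runs the block, retrying up to num_retries times when one of `errors`
# is raised. Sleeps base_delay * attempt_number seconds before each retry.
# Yields the current attempt number (1, 2, ...) to the block.
def with_retries(num_retries = 4, base_delay = 0.5, errors = TRANSIENT_ERRORS)
  attempts = 0
  begin
    attempts += 1
    yield attempts
  rescue *errors
    # Out of retries: re-raise the error that was just rescued.
    raise if attempts > num_retries
    sleep(base_delay * attempts)
    retry
  end
end
```

Library-specific errors, such as the WWW::Mechanize::ResponseCodeError rescued in the original method, can be appended to the list passed as errors; whether an HTTP error response should be retried at all depends on the status code and is left to the caller.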