From ericmilford at gmail.com Sat May 2 21:00:55 2009
From: ericmilford at gmail.com (Eric Milford)
Date: Sat, 2 May 2009 21:00:55 -0400
Subject: [Mechanize-users] Page successfully returned but without forms
Message-ID: <320C130F-70BB-4EB3-B4A7-F823CEC53ED6@gmail.com>

Has anyone seen an instance before where a page is returned without its
corresponding forms, etc. available to the page object? The page I am
accessing does in fact have forms, though page.forms is empty.

What's causing this and is there a way to get around it?

Thanks

From ericmilford at gmail.com Sat May 2 21:03:16 2009
From: ericmilford at gmail.com (Eric Milford)
Date: Sat, 2 May 2009 21:03:16 -0400
Subject: [Mechanize-users] Mechanize & ASP/ASPX pages
Message-ID: <516BB0CF-8AA9-4950-8C64-9F13822CC279@gmail.com>

Can anyone provide any tips/tricks for clicking through ASP/ASPX pages.
From what I can remember about the .NET world, you move between pages by
doing a javascript postback to a form. That said, I can mimic the
javascript action and submit the form, but for some reason the .NET
application always errors.

What's the secret! :-) I've managed to get one previous ASP page working
fine, but that took some time.

Thanks.

From astarr at wiredquote.com Tue May 5 05:26:32 2009
From: astarr at wiredquote.com (Aaron Starr)
Date: Tue, 5 May 2009 02:26:32 -0700
Subject: [Mechanize-users] Only partially reading a page!
Message-ID: <669cc1ca0905050226k22080d03rac3e1ce89c8a98a6@mail.gmail.com>

I am trying to get a page which includes a form, but the form is
missing from the WWW::Mechanize::Page object. I retrieve it via:

    page = web_agent.submit(a_different_form)

For debugging this problem, I then immediately write the resulting
page to two different logs:

    File.open('big.html','wb') { |f| f.write(page.body) }
    File.open('little.html','wb') { |f| f.write(page.root.to_html) }

The results of these two methods (page.body vs.
page.root.to_html) are dramatically different, with most of the page
missing from the page.root.to_html version.

In other words, the form appears in page.body, but not in
page.root.to_html.

Furthermore, page.body seems to have valid html, because I can do this
from irb:

    f = File.open('big.html','rb') { |f| f.read }
    page = WWW::Mechanize::Page.new(nil, {'content-type'=>'text/html'}, f, 200)

And that page works just fine -- the form is there.

Any idea why the page retrieved from web_agent.submit, which
apparently has the same body as the page created by hand, would
nevertheless have two different element lists?

Many, many thanks in advance for whatever guidance you can give.

Aaron

From astarr at wiredquote.com Wed May 6 02:27:37 2009
From: astarr at wiredquote.com (Aaron Starr)
Date: Tue, 5 May 2009 23:27:37 -0700
Subject: [Mechanize-users] Only partially reading a page!
Message-ID: <669cc1ca0905052327h4540ba7am9f558143f485d92e@mail.gmail.com>

Or, here's the quick and exciting version of the same question.

On a particular page, I consistently get this weird result:

    page.root.to_html !=
    WWW::Mechanize::Page.new(nil,{'content-type'=>'text/html'}, page.body,
    200).root.to_html

Isn't that weird? And, exciting? Any idea how something like that
could happen? Anyone? Say, Aaron Patterson?

Aaron

(Slow, breath-takingly boring version of the question:)
>
> I am trying to get a page which includes a form, but the form is
> missing from the WWW::Mechanize::Page object. I retrieve it via:
>
>     page = web_agent.submit(a_different_form)
>
> For debugging this problem, I then immediately write the resulting
> page to two different logs:
>
>     File.open('big.html','wb') { |f| f.write(page.body) }
>     File.open('little.html','wb') { |f| f.write(page.root.to_html) }
>
> The results of these two methods (page.body vs. page.root.to_html) are
> dramatically different, with most of the page missing from the
> page.root.to_html version.
>
> In other words, the form appears in page.body, but not in
> page.root.to_html.
>
> Furthermore, page.body seems to have valid html, because I can do this
> from irb:
>
>     f = File.open('big.html','rb') { |f| f.read }
>     page = WWW::Mechanize::Page.new(nil, {'content-type'=>'text/html'}, f, 200)
>
> And that page works just fine -- the form is there.
>
> Any idea why the page retrieved from web_agent.submit, which
> apparently has the same body as the page created by hand, would
> nevertheless have two different element lists?
>
> Many, many thanks in advance for whatever guidance you can give.
>
> Aaron

From mat.schaffer at gmail.com Wed May 6 08:44:58 2009
From: mat.schaffer at gmail.com (Mat Schaffer)
Date: Wed, 6 May 2009 08:44:58 -0400
Subject: [Mechanize-users] Only partially reading a page!
In-Reply-To: <669cc1ca0905052327h4540ba7am9f558143f485d92e@mail.gmail.com>
References: <669cc1ca0905052327h4540ba7am9f558143f485d92e@mail.gmail.com>
Message-ID:

I'm offline so can't verify now, but this sounds like the problem that
keeps coming up on this list lately:

Quoting Anthony F:

Setting the html_parser to the Nokogiri or Hpricot object (rather than
the default Nokogiri::HTML) object worked for me, like so:

    WWW::Mechanize.html_parser = Nokogiri
or
    WWW::Mechanize.html_parser = Hpricot

Hope that helps. I'll keep this in mind next time I have some downtime
and see if I can get a patch together for Aaron. Have you tried cloning
and installing his version from github? It might already have this
issue fixed.

-Mat

On May 6, 2009, at 2:27 AM, Aaron Starr wrote:

> Or, here's the quick and exciting version of the same question.
>
> On a particular page, I consistently get this weird result:
>
>     page.root.to_html !=
>     WWW::Mechanize::Page.new(nil,{'content-type'=>'text/html'}, page.body,
>     200).root.to_html
>
> Isn't that weird? And, exciting? Any idea how something like that
> could happen? Anyone? Say, Aaron Patterson?
>
> Aaron
> _______________________________________________
> Mechanize-users mailing list
> Mechanize-users at rubyforge.org
> http://rubyforge.org/mailman/listinfo/mechanize-users

From astarr at wiredquote.com Wed May 6 14:26:37 2009
From: astarr at wiredquote.com (Aaron Starr)
Date: Wed, 6 May 2009 11:26:37 -0700
Subject: [Mechanize-users] Only partially reading a page!
In-Reply-To:
References: <669cc1ca0905052327h4540ba7am9f558143f485d92e@mail.gmail.com>
Message-ID: <669cc1ca0905061126g7d0716b7ya0e805620f375bd@mail.gmail.com>

Mat,

Thank you so much for your response.
I had temporarily worked around the issue by automatically taking each
page as it's returned and building a new page from it, passing the old
page's body. And that's how sausage gets made, ladies and gentlemen.

I'll try the "WWW::Mechanize.html_parser = Nokogiri" temporary
work-around -- that seems about eight-hundred times more clever.

Sincere thanks,

Aaron

On Wed, May 6, 2009 at 5:44 AM, Mat Schaffer wrote:

> I'm offline so can't verify now, but this sounds like the problem that
> keeps coming up on this list lately:
>
> Quoting Anthony F:
>
> Setting the html_parser to the Nokogiri or Hpricot object (rather than the
> default Nokogiri::HTML) object worked for me, like so:
>
>     WWW::Mechanize.html_parser = Nokogiri
> or
>     WWW::Mechanize.html_parser = Hpricot
>
> Hope that helps. I'll keep this in mind next time I have some downtime and
> see if I can get a patch together for Aaron. Have you tried cloning and
> installing his version from github? It might already have this issue fixed.
>
> -Mat

From chrisamiller at gmail.com Thu May 7 21:01:17 2009
From: chrisamiller at gmail.com (Chris Miller)
Date: Thu, 7 May 2009 20:01:17 -0500
Subject: [Mechanize-users] Malformed HTML
Message-ID: <61ca89240905071801v4823d7e3w41a9bb666a72f177@mail.gmail.com>

I'm using Mechanize to parse an extraordinarily malformed html page.

After submitting a form like so:

    page = mech.submit(dform)

The result I get back is truncated.
I suspect that it's because the source HTML looks like this: yadda yadda

some text

My 'page' variable contains only the data that occurs before the
second tag.

Am I right in suspecting that this is the cause of my problems? Are
there any work-arounds that will enable me to grab all of the text,
even if it can't be parsed sanely?

Thanks,

Chris Miller
chrisamiller at gmail.com

From astarr at wiredquote.com Thu May 7 21:31:12 2009
From: astarr at wiredquote.com (Aaron Starr)
Date: Thu, 7 May 2009 18:31:12 -0700
Subject: [Mechanize-users] Malformed HTML
In-Reply-To: <61ca89240905071801v4823d7e3w41a9bb666a72f177@mail.gmail.com>
References: <61ca89240905071801v4823d7e3w41a9bb666a72f177@mail.gmail.com>
Message-ID: <669cc1ca0905071831k47b7d687sd3701caecf734e28@mail.gmail.com>

page.body should have the raw, original
text-that-kind-of-reminds-us-of-html, does it not?

On Thu, May 7, 2009 at 6:01 PM, Chris Miller wrote:

> I'm using Mechanize to parse an extraordinarily malformed html page.
>
> After submitting a form like so:
>     page = mech.submit(dform)
>
> The result I get back is truncated. I suspect that it's because the
> source HTML looks like this:
>
> yadda yadda
> some text
>
> My 'page' variable contains only the data that occurs before the
> second tag.
>
> Am I right in suspecting that this is the cause of my problems? Are
> there any work-arounds that will enable me to grab all of the text,
> even if it can't be parsed sanely?
>
> Thanks,
>
> Chris Miller
> chrisamiller at gmail.com

From chrisamiller at gmail.com Fri May 8 12:03:41 2009
From: chrisamiller at gmail.com (Chris Miller)
Date: Fri, 8 May 2009 11:03:41 -0500
Subject: [Mechanize-users] Malformed HTML
In-Reply-To: <669cc1ca0905071831k47b7d687sd3701caecf734e28@mail.gmail.com>
References: <61ca89240905071801v4823d7e3w41a9bb666a72f177@mail.gmail.com> <669cc1ca0905071831k47b7d687sd3701caecf734e28@mail.gmail.com>
Message-ID: <61ca89240905080903v26910c06nd7b9b1ea828aae8c@mail.gmail.com>

That solves my problem - I somehow missed that in the docs. Thanks
for your help.

-Chris

That helps somewhat. Now, when I parse a local copy of the html page,
I get everything by using the page.body command. The problem is, that
when I try to retrieve the data from the server, the page.body command
cuts off at the original point - right before the second tag.

The html file in question can be found here:
http://chrisamiller.com/temp.html

Hrmm... maybe some of the page is being written out by javascript. If
that's the case, mechanize won't be able to deal with it, right?

-Chris

On Thu, May 7, 2009 at 8:31 PM, Aaron Starr wrote:
> page.body should have the raw, original
> text-that-kind-of-reminds-us-of-html, does it not?

From lightfault at gmail.com Sat May 9 14:09:22 2009
From: lightfault at gmail.com (So and so)
Date: Sat, 9 May 2009 21:09:22 +0300
Subject: [Mechanize-users] Fetching information with regex ?
Message-ID:

Hey,

I tried running this source http://pastebin.ca/1417565

Yet, I get none(links.length is zero, while there're results), does
mechanize support regex ? If not - how is it possible to fetch specific
wildcard items ?

Thanks in advance.

From apoc at sixserv.org Tue May 26 23:59:42 2009
From: apoc at sixserv.org (apoc)
Date: Wed, 27 May 2009 05:59:42 +0200
Subject: [Mechanize-users] Problems with custom Referer
Message-ID: <4A1CBAAE.3000304@sixserv.org>

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Hi @ all,

I'm new to this list, so sorry if this was already asked or if this is
a stupid question..

I struggled a lot about how to set a custom Header, the manual says:

| get(options, parameters = [], referer = nil)
| Fetches the URL passed in and returns a page.
I assume that the third parameter will set a custom Referer, I try this:

    require 'mechanize'
    agent = WWW::Mechanize.new
    page = agent.get('http://apoc.sixserv.org/requestinfo/?referer', nil,
      'http://google.com/this/is/a/custom/referer')
    puts page.body

but this does not work, then ages later I try:

    page = agent.get(:url => 'http://apoc.sixserv.org/requestinfo/?referer',
      :referer => 'http://google.com/this/is/a/custom/referer')
    puts page.body

this works, why the first one not? does referer in the documentation
means something else?

bye
apoc

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.9 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iEYEARECAAYFAkocuq0ACgkQWlhozqFVuMsgsACfV1SpcZwP7P7hrFX6gNt0pbhu
nwwAn1D/whCX0OAiTkVjkD+zU5zaaOre
=eQ+W
-----END PGP SIGNATURE-----

From astarr at wiredquote.com Fri May 29 00:20:05 2009
From: astarr at wiredquote.com (Aaron Starr)
Date: Thu, 28 May 2009 21:20:05 -0700
Subject: [Mechanize-users] Problem with end-of-file solved
Message-ID: <669cc1ca0905282120w15a4f33dhd9e7b2172dd2cf8e@mail.gmail.com>

Hello,

I've been struggling with a scraper that was getting an exception,
"end of file reached".

The solution for me: The exception was happening after agent.get(uri),
when the server returned a 302 redirect. The "Location" header in the
redirect response looked like this:

    httpS://blah.blahdy-blah.com/blahblahblah/blah.html

For reasons that are not clear at all to me, the protocol of "httpS"
was throwing mechanize into a bit of a tail-spin. To test that
adjusting the location to "https://..." would fix the problem, I made
this ugly little test method:

    def get_blah(uri)
      save_redirect_ok = @web_agent.redirect_ok
      @web_agent.redirect_ok = false
      begin
        pg = @web_agent.get(uri)
        while ["301","302"].include?(pg.code)
          uri = pg.response['location']
          uri = uri.gsub(/^httpS:/, 'https:') # <--- TAH DAH! Fixed it!
          @web_agent.log.info("Redirecting to #{uri}") if @web_agent.log
          pg = @web_agent.get(uri)
        end
        @page = pg
      ensure
        @web_agent.redirect_ok = save_redirect_ok
      end
    end

So, in case anyone else comes across a mysterious end-of-file
condition, you might check that your redirects are valid URLs, and deal
with it if they're not.

Cheers,

Aaron

From voxxar at gmail.com Sun May 31 17:49:39 2009
From: voxxar at gmail.com (Patrik Sjöberg)
Date: Sun, 31 May 2009 23:49:39 +0200
Subject: [Mechanize-users] .search('xpath')
In-Reply-To: <67d0dc9a0905311444tfe5cff1kba2a95128e6406f4@mail.gmail.com>
References: <67d0dc9a0905311444tfe5cff1kba2a95128e6406f4@mail.gmail.com>
Message-ID: <51ED2346-9F92-4A6F-BD1D-95E2CFCE1D67@gmail.com>

Firefox inserts tags to the DOM and firebug gets the modified code.
If you print the code mechanize gets I bet there are no tbody tags, so
try removing tbody from the xpath.

On 31 maj 2009, at 23.44, Tim Stinnett wrote:

> I am trying to scrape a web page using xpath to search with.
>
> page.search("//tr/td").each do |my_tag|
> This works fine, but there are about 1 million tags that fit this on
> the page I am looking at.
>
> If I put the full xpath as reported by Firebug I get 0 matches.
> Where oh where am I messing this up.
>
> page.search("/html/body/div[2]/table/tbody/tr/td/table/tbody/tr/td[3]/table[2]/tbody/tr/td/table/tbody/tr/td/div/table/tbody/tr/td/table/tbody/tr/td/table/tbody/tr/td/table/tbody/tr/td/table[2]/tbody/tr[2]/td").each do |my_tag|
>
> Maybe this should be asked in a Xpath forum but here I am.
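
The advice in the last message (drop the tbody steps from the XPath that Firebug reports) can be sketched as a small string helper. This is only a sketch under assumptions: strip_tbody_steps is a hypothetical name, not part of Mechanize or Nokogiri, and it assumes the HTML the server actually sends contains no tbody elements at all, which is worth checking in page.body first.

```ruby
# Hypothetical helper (not part of Mechanize or Nokogiri): remove the
# "tbody" location steps that show up in a Firebug-reported XPath
# because Firefox adds tbody elements to its DOM.
# Assumption: the raw HTML fetched by the scraper has no tbody tags.
def strip_tbody_steps(xpath)
  # Drop "/tbody" or "/tbody[n]" only when it is a complete step,
  # i.e. followed by another "/" or by the end of the expression.
  xpath.gsub(%r{/tbody(?:\[\d+\])?(?=/|\z)}, '')
end

# Shortened example path in the style of the one quoted above.
firebug_xpath = '/html/body/div[2]/table/tbody/tr/td/table[2]/tbody/tr[2]/td'
puts strip_tbody_steps(firebug_xpath)
```

With a page object like the one in the thread, the cleaned path would then be handed to the search call, for example page.search(strip_tbody_steps(firebug_xpath)). If some tables in the source do contain tbody, the stripped path will miss them, so a shorter relative XPath may turn out to be more robust than a full absolute one.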