From gsmoraes2 at gmail.com Mon Nov 12 09:33:39 2007
From: gsmoraes2 at gmail.com (gmoraes)
Date: Mon, 12 Nov 2007 12:33:39 -0200
Subject: [Mechanize-users] Weird error downloading a gzip'ed file
Message-ID:

Hi all,

I've been using mechanize for a while and it rocks. Docs are pretty clear and so far I've been able to do it on my own.
However, I'm stuck in a weird situation in a script to download my contact list from hotmail.
I've used Firebug to check all urls, and tested it by hand while logged in via browser.
Even in the script everything works well until the last 'agent.get_file', which gets stuck with a weird error:

------ snip ------
$ ruby msn-scrap.rb
#
"http://by124w.bay124.mail.live.com/mail/GetContacts.aspx"
Err: unexpected end of file
Trace:
/usr/lib/ruby/1.8/mechanize.rb:372:in `read'
/usr/lib/ruby/1.8/mechanize.rb:372:in `fetch_page'
/usr/lib/ruby/1.8/net/http.rb:1050:in `request'
/usr/lib/ruby/1.8/net/http.rb:2133:in `reading_body'
/usr/lib/ruby/1.8/net/http.rb:1049:in `request'
/usr/lib/ruby/1.8/mechanize.rb:345:in `fetch_page'
/usr/lib/ruby/1.8/net/http.rb:543:in `start'
/usr/lib/ruby/1.8/mechanize.rb:339:in `fetch_page'
/usr/lib/ruby/1.8/mechanize.rb:139:in `get'
/usr/lib/ruby/1.8/mechanize.rb:146:in `get_file'
msn-scrap.rb:32
----- snip ------

mech.log important part:

D, [2007-11-12T12:22:35.925521 #24540] DEBUG -- : request-header: referer => http://by124w.bay124.mail.live.com/mail/TodayLight.aspx?&n=1573603203&gs=true
D, [2007-11-12T12:22:36.589708 #24540] DEBUG -- : response-header: cache-control => private,max-age=86400
D, [2007-11-12T12:22:36.589853 #24540] DEBUG -- : response-header: vary => Accept-Encoding
D, [2007-11-12T12:22:36.589934 #24540] DEBUG -- : response-header: connection => keep-alive
D, [2007-11-12T12:22:36.590012 #24540] DEBUG -- : response-header: expires => Wed, 01 Jan 1997 12:00:00 GMT, Wed, 01 Jan 1997 12:00:00 GMT
D, [2007-11-12T12:22:36.590089 #24540] DEBUG -- : response-header: p3p => CP="BUS CUR CONo FIN IVDo ONL OUR PHY SAMo TELo"
D, [2007-11-12T12:22:36.590166 #24540] DEBUG -- : response-header: date => Mon, 12 Nov 2007 14:28:34 GMT
D, [2007-11-12T12:22:36.590241 #24540] DEBUG -- : response-header: xxn => W4
D, [2007-11-12T12:22:36.590344 #24540] DEBUG -- : response-header: content-type => text/csv
D, [2007-11-12T12:22:36.590430 #24540] DEBUG -- : response-header: msnserver => H: BAY124-W4 V: 12.0.1190.927 D: 2007-09-27T23:27:08
D, [2007-11-12T12:22:36.590509 #24540] DEBUG -- : response-header: content-encoding => gzip
D, [2007-11-12T12:22:36.590586 #24540] DEBUG -- : response-header: content-disposition => attachment; filename="WLMContacts.csv"
D, [2007-11-12T12:22:36.590663 #24540] DEBUG -- : response-header: server => Microsoft-IIS/6.0
D, [2007-11-12T12:22:36.590738 #24540] DEBUG -- : response-header: content-length => 4285
D, [2007-11-12T12:22:36.591732 #24540] DEBUG -- : gunzip body

I've tried some ugly hacks, as altering headers and so on (BTW, how do I change request-headers w/o inheriting from www::mechanize ?), but no result.

Am I doing something wrong ? Seems to me that the server encodes the file (Firebug shows it too), but mechanize receives a weird error while trying to fetch it. Any ideas ?

I did another contact scrap for gmail and it worked wonders. There is a post of mine at http://zenmachine.wordpress.com where I show how to use firebug and mechanize to find the right URLs.

Best regards, and keep up the excellent work.
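One way to narrow down an "unexpected end of file" at the gunzip step is to inspect the raw body outside mechanize. The following is only a minimal sketch under stated assumptions, not mechanize's own code: `diagnose_gzip_body` and `gzip_string` are hypothetical helpers, the CSV data is made up, and it assumes the raw, still-compressed response bytes and the Content-Length value have been obtained by some other means. A body that is shorter than the advertised length is one situation in which Ruby's gzip reader can fail this way; other causes are possible.

```ruby
require 'zlib'
require 'stringio'

# Hypothetical helper (not part of mechanize): checks a raw response body
# that the server labelled "content-encoding: gzip".
def diagnose_gzip_body(raw, content_length)
  report = {
    # A gzip stream starts with the magic bytes 0x1f 0x8b.
    :magic_ok  => raw.getbyte(0) == 0x1f && raw.getbyte(1) == 0x8b,
    # Fewer bytes than Content-Length suggests the body was cut short.
    :length_ok => raw.bytesize == content_length,
    :decoded   => nil,
    :error     => nil
  }
  begin
    report[:decoded] = Zlib::GzipReader.new(StringIO.new(raw)).read
  rescue Zlib::Error => e
    # Zlib::GzipFile::Error is a subclass of Zlib::Error.
    report[:error] = e.message
  end
  report
end

# Hypothetical helper: gzip a string in memory so the sketch is self-contained.
def gzip_string(text)
  io = StringIO.new(String.new)
  gz = Zlib::GzipWriter.new(io)
  gz.write(text)
  gz.close
  io.string
end

full      = gzip_string("name,email\nAlice,alice@example.com\n")
truncated = full.byteslice(0, full.bytesize / 2)

p diagnose_gzip_body(full, full.bytesize)       # decodes, no error
p diagnose_gzip_body(truncated, full.bytesize)  # length mismatch, error set
```

If `:length_ok` is false for the real response, the problem is more likely in how the body was read than in the gzip data itself.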
---- msn-scrap.rb----
#!/usr/bin/env ruby
# download msn contacts

require 'rubygems'
require 'mechanize'
require 'logger'

begin
  agent = WWW::Mechanize.new { |a| a.log = Logger.new("mech.log") }
  agent.user_agent_alias = "Windows IE 6"

  page = agent.get("https://login.live.com/login.srf")

  form = page.forms.name("f1").first
  form.login = 'user'
  form.passwd = 'pass'

  page = agent.submit(form)

  pageContact = agent.get('http://g.live.com/1MBAMen-us/sc_mail')
  p pageContact.uri

  baseURL=pageContact.uri.host

  contactURL='http://'+baseURL+'/mail/GetContacts.aspx'
  p contactURL

  page = agent.get_file(contactURL)

  p page

  if (page.code == '200')
    puts "saving contacts.csv"
    page.save_as('contacts_msn.csv')
  else
    puts "error downloading contacts"
  end

rescue
  puts "Err: "+$!
  puts "Trace:"
  $@.each {|tl|
    puts tl
  }
end

--
More cowbell, please !

From mikemondragon at gmail.com Wed Nov 14 03:10:09 2007
From: mikemondragon at gmail.com (Mike Mondragon)
Date: Wed, 14 Nov 2007 00:10:09 -0800
Subject: [Mechanize-users] Scraping AOL Webmail to login and fetch contacts?
In-Reply-To: <967d3b9a0710101501v6b6a1c16qf4a2d5ee82e323b3@mail.gmail.com>
References: <967d3b9a0710101501v6b6a1c16qf4a2d5ee82e323b3@mail.gmail.com>
Message-ID: <967d3b9a0711140010x13ca2df3i1bc7533e0efe5ae5@mail.gmail.com>

On Oct 10, 2007 2:01 PM, Mike Mondragon wrote:
> I'm helping with a gem that is going to be published under the
> contentfree project on rubyforge
> (http://rubyforge.org/projects/contentfree/).
>
> The gem is called "blackbook" and basically it will go and fetch your
> contacts from the major webmail providers. So far Gmail, Yahoo!, and
> MSN have been completed.
>
> We are trying to finish up with fetching contacts from AOL Webmail.
> However it's a bit more difficult because of the javascript-like
> validation AOL has built into their sign-in service.
>
> The only resource I've found that talks about the correct strategy to
> sign-in to AOL via a scraping tool is here:
> http://apsquared.net/blog/2007/04/30/scraping-aol-webmail-for-contacts/
>
> However we've not been able to recreate their experience with
> mechanize. Any suggestions or experience would be appreciated.
> Blackbook will be released onto rubyforge once we've completed AOL
> Webmail integration.
>
> Thanks
> Mike
>
> --
> Mike Mondragon
> Work> http://sas.quat.ch/
> Blog> http://blog.mondragon.cc/
> Small URLs> http://hurl.it/
>

Dave Myron paid a bounty to Marton Fabo to find a fix so that the upcoming blackbook Gem could scrape contacts properly from AOL webmail. Marton found a fix.

From the fix I was putting together a patch to submit to Mechanize but I ran into the following failing tests.

First, here's the test that I wrote in test/tc_mech.rb that shows the broken behavior for the AOL webmail login

test code as pretty pastie:
http://pastie.caboo.se/private/q0xbdwhhhjamskjqm4niq

def test_to_absolute_uri
  def @agent.public_to_absolute_uri(url)
    to_absolute_uri(url)
  end

  url = "http://localhost/?arg=val&jank=AAA%3D"
  assert_equal URI.parse(url), @agent.public_to_absolute_uri(url)

  # pattern of odd URL created by javascript validator in AOL webmail login
  # where to_absolute_uri strips out the last '=' encoded as %3D
  url = "http://localhost/?arg=val&jank=AAA%3D%3D"
  assert_equal URI.parse(url), @agent.public_to_absolute_uri(url)
end

After I apply Marton's fix test_to_absolute_uri passes but now the "test_link_with_unusual_characters" test in test/tc_links.rb fails.

Here is Marton's fix applied to to_absolute_uri in lib/mechanize.rb as pretty pastie:
http://pastie.caboo.se/private/cb5ara4rlnh9fxe8jl1dea

Any suggestions how to proceed?
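To make the expectation in the test above concrete, here is a small standalone sketch that uses only Ruby's standard `uri` library, with no mechanize involved. It shows that `URI.parse` leaves both `%3D` escapes in the query untouched and that the decoded value ends in `==`. It makes no claim about what either version of to_absolute_uri does internally; the pasties are the reference for that.

```ruby
require 'uri'

url = "http://localhost/?arg=val&jank=AAA%3D%3D"
uri = URI.parse(url)

# Parsing alone does not alter the percent-escapes in the query.
puts uri.query        # prints arg=val&jank=AAA%3D%3D
puts uri.to_s == url  # prints true

# Decoding shows what the escapes stand for: a value ending in "==".
params = URI.decode_www_form(uri.query)
p params              # prints [["arg", "val"], ["jank", "AAA=="]]

# Re-encoding the decoded pairs gives back the same query string,
# so a decode/encode round trip should not lose the trailing "=".
puts URI.encode_www_form(params)  # prints arg=val&jank=AAA%3D%3D
```

Any rewrite of the URL that returns fewer `%3D` escapes than it was given changes the parameter value the server receives.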
Which solution is more valid, the existing to_absolute_uri method in mechanize.rb or the fix that Marton has found for to_absolute_uri?

We can open up Mechanize in the Blackbook Gem to utilize Marton's solution since it solves a problem specific to Blackbook's interaction with AOL webmail via Mechanize. If Marton's fix is more valid we would like to contribute it to Mechanize to help the community out.

Thanks
Mike

From mikemondragon at gmail.com Wed Nov 14 03:18:19 2007
From: mikemondragon at gmail.com (Mike Mondragon)
Date: Wed, 14 Nov 2007 00:18:19 -0800
Subject: [Mechanize-users] Weird error downloading a gzip'ed file
In-Reply-To:
References:
Message-ID: <967d3b9a0711140018n7d32c17as55aab156115cb23d@mail.gmail.com>

On Nov 12, 2007 6:33 AM, gmoraes wrote:
> Even in the script everything works well until the last 'agent.get_file',
> which gets stuck with a weird error:
>
> ------ snip ------
> $ ruby msn-scrap.rb
> # URL:http://by124w.bay124.mail.live.com/mail/TodayLight.aspx?&n=1573603203&gs=true
> "http://by124w.bay124.mail.live.com/mail/GetContacts.aspx"
> Err: unexpected end of file
> [...]

gmoraes

Even though the "Scraping AOL Webmail to login and fetch contacts?" thread is about scraping contacts from AOL it might prove helpful to your problem with Hotmail. However, we do have Hotmail solved. The Blackbook Gem will be released shortly and it scrapes contacts from GMail, Hotmail, AOL, Yahoo! and returns them in a convenient interface that your application can utilize. I'll post a note to the Mechanize list when Blackbook is released.
Thanks
Mike

--
Mike Mondragon
Work> http://sas.quat.ch/
Blog> http://blog.mondragon.cc/
Small URLs> http://hurl.it/

From ehudros at gmail.com Wed Nov 14 14:40:35 2007
From: ehudros at gmail.com (Ehud Rosenberg)
Date: Wed, 14 Nov 2007 21:40:35 +0200
Subject: [Mechanize-users] Hpricot & mechanize fail to parse page after redirect1q
Message-ID:

Hi everyone,

My quest with mechanize/Hpricot continues :)
Something extremely strange happened today - some simple working code broke down, and I can't figure out why.
I am trying to access a piratebay.org search page, which does a redirect to a relative url like this:

original link: http://thepiratebay.org/s/?page=0&orderby=3&q=football+manager+2008&searchTitle=on
redirects to: /search/football manager 2008/0/3/0

Now, this all worked dandily up till yesterday. The page was redirected fine, and mechanize even handled the cookie that was sent back from the site.
But today, I am getting this strange error:

"URI::InvalidURIError: bad URI(is not URI?): /search/football manager/2008/0/3/0"

Mechanize gives the following message:

"NoMethodError: You have a nil object when you didn't expect it! You might have expected an instance of Array. The error occurred while evaluating nil.last"
from C:/Web/ruby/lib/ruby/gems/1.8/gems/mechanize-0.6.10/lib/mechanize.rb:402:in `to_absolute_uri'

I have tested this on 2 different machines, and they both break down. Can someone please give it a go and see if they can figure it out? I would be very very thankful :)

Thanks,
Ehud

PS - I am using hpricot 0.6, and the redirected page is parsed correctly when accessed directly.
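The redirect target quoted above contains literal spaces, and Ruby's standard `URI.parse` rejects such strings, which is consistent with the InvalidURIError shown. The following is only a sketch of one possible workaround, assuming the Location value can be intercepted before it is parsed. `resolve_redirect` is a hypothetical helper, not part of mechanize's API, and it escapes spaces only.

```ruby
require 'uri'

# Hypothetical helper (not mechanize's API): resolve a Location header value
# against the URL that was requested, tolerating unescaped spaces.
def resolve_redirect(base_url, location)
  # Assumption: only literal spaces need escaping; nothing else is rewritten.
  escaped = location.gsub(' ', '%20')
  URI.join(base_url, escaped)
end

base     = "http://thepiratebay.org/s/?page=0&orderby=3&q=football+manager+2008&searchTitle=on"
location = "/search/football manager 2008/0/3/0"

# The raw value is rejected by the standard parser.
begin
  URI.parse(location)
rescue URI::InvalidURIError => e
  puts "raw Location rejected: #{e.message}"
end

puts resolve_redirect(base, location)
# prints http://thepiratebay.org/search/football%20manager%202008/0/3/0
```

Escaping before joining keeps the rest of the path intact, so the resolved URL can be fetched like any other absolute link.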