From andrenth at gmail.com  Mon Feb  9 17:27:02 2009
From: andrenth at gmail.com (Andre Nathan)
Date: Mon, 9 Feb 2009 20:27:02 -0200
Subject: [Mechanize-users] Help scraping a site
Message-ID: <53e651e90902091427t4800da5bnd482f16c1af59358@mail.gmail.com>

Hello

I'm trying to scrape www.claroideias.com.br. My idea was to build a
Shoes interface to their SMS service, with the backend done in
mechanize. I managed to get to the very last stage of submitting the
response to the captcha form, but I always get a session error which I
don't understand. The form is validated using JavaScript, and these
are some of its return values:

  Código: 0 - sucessPost
  Código: 5 - invalidPwd
  Código: 8 - invalidSessionPwd
  Código: 9 - invalidSessionCtn

I always get the "invalidSessionPwd" error, no matter what is written
in the captcha form, so I believe this is a misuse of mechanize on my
part. When using Firefox, if I write the wrong captcha answer, the
error returned is "invalidPwd". If I delete my cookies before
submitting the form, I get "invalidSessionCtn". I tried running tcpdump
and comparing the data from Firefox and mechanize, but couldn't find any
difference that could be related to session errors.

The code follows. If someone could give it a try, it would be awesome.

require 'rubygems'
require 'mechanize'
require 'fileutils'   # FileUtils.mkdir_p
require 'net/http'    # Net::HTTP.start
require 'uri'         # URI.escape

#
# Use as:
#
# data = {
#   :src_name  => 'XXXXXX',
#   :src_code  => 'XX',
#   :src_phone => 'XXXXXXXX',
#   :dst_code  => 'XX',
#   :dst_phone => 'XXXXXXXX',
#   :message   => "XXXX"
# }
#
# scraper = Scraper.new(data, "out.dir")
# captcha = scraper.fetch_captcha  # display to the user, etc.
# scraper.submit_answer('xxxx')    # the answer doesn't really matter,
#                                  # always gets error 8
#

class Scraper
  SMS_MAX_LEN = 136

  def initialize(data, out_dir)
    @data = data
    @agent = WWW::Mechanize.new
    @dir = out_dir
    @captcha_page = nil
  end

  def fetch_captcha
    path, id, ref = submit_state
    captcha_page = submit_sms(path, id, ref)
    fetch_image(captcha_page)
  end

  def submit_answer(answer)
    raise 'Not in captcha page!' if @captcha_page.nil?
    form = @captcha_page.form_with(:action => 'DeliveryServlet')
    form.fields.first.name
    form[form.fields.first.name] = answer
    @agent.get(:url => 'http://clarotorpedoweb.claro.com.br/ClaroTorpedoWeb/scripts/clarotorpedo.jsp',
               :referer => 'http://clarotorpedoweb.claro.com.br/ClaroTorpedoWeb/pwdForm.jsp')
    p form.submit
  end

  private
  def submit_state
    url = 'http://www.claroideias.com.br'
    page = @agent.get(url)
    form = page.form_with(:name => 'frmEntrada')
    act = form.action
    states = form.field_with(:name => 'links')
    val = states.options.find { |opt| opt.text == 'Rio de Janeiro' }.value
    path, id = val.split(/\|/)
    form['links'] = val
    form['txtUrl'] = path
    form['txtIDLocal'] = id
    form.submit
    ref = "#{url}#{act}?txtUrl=#{path}&txtIDLocal=#{id}&gclid=&links=#{val}"
    return path, id, URI.escape(ref)
  end

  private
  def submit_sms(path, id, ref)
    url = URI.escape("http://www.claroideias.com.br#{path}&idlocal=#{id}")
    page = @agent.get(:url => url, :referer => ref)
    form = page.form_with(:name => 'Main')
    form['ddd_para'] = @data[:dst_code]
    form['telefone_para'] = @data[:dst_phone]
    form['ddd_de'] = @data[:src_code]
    form['telefone_de'] = @data[:src_phone]
    form['nome_de'] = @data[:src_name]
    form['msg'] = @data[:message]
    form['caract'] = sms_max_len
    return form.submit
  end

  private
  def fetch_image(page)
    image = nil
    page.search('img[@src]').each do |img|
      if img['src'] =~ /^\/ClaroTorpedoWeb\//
        image = img['src']
        break
      end
    end
    raise "Couldn't find captcha" if image.nil?
    FileUtils.mkdir_p(@dir)
    file = File.join(@dir, "#{File.basename(image)}.jpg")
    Net::HTTP.start(page.uri.host) do |http|
      resp = http.get(image)
      File.open(file, "wb") { |f| f.write(resp.body) }
    end
    @captcha_page = page
    return file
  end

  def sms_max_len
    len = SMS_MAX_LEN - @data[:src_name].size - @data[:message].size
    raise 'Message too long' if len < 0
    return len
  end
end

Thanks in advance,
Andre

From mat.schaffer at gmail.com  Tue Feb 10 09:10:24 2009
From: mat.schaffer at gmail.com (Mat Schaffer)
Date: Tue, 10 Feb 2009 09:10:24 -0500
Subject: [Mechanize-users] Help scraping a site
In-Reply-To: <53e651e90902091427t4800da5bnd482f16c1af59358@mail.gmail.com>
References: <53e651e90902091427t4800da5bnd482f16c1af59358@mail.gmail.com>
Message-ID: <6A4307D0-F104-496D-9537-225DAF82AA83@gmail.com>

It might help to try to find the manual case that produces the
'invalidSessionPwd' error. It sounds like maybe the session variable
is getting lost. But how the session is tracked varies from app to app.

-Mat

From samira.khansha at gmail.com  Mon Feb 23 09:36:46 2009
From: samira.khansha at gmail.com (samira khansha)
Date: Mon, 23 Feb 2009 18:06:46 +0330
Subject: [Mechanize-users] Error getting url
Message-ID: <599defe0902230636t27566600u3ac2d57c39ad7293@mail.gmail.com>

I'm using WWW::Mechanize to download a list of URLs. When the program
reaches a URL that doesn't exist, it stops fetching the rest of the
URLs. Could you please tell me how I can solve this problem?

use strict;
use WWW::Mechanize;
use File::Basename;
use HTML::Parser;
use Crypt::SSLeay;
#use LWP::Debug qw(+);

open F, "urlList.txt" or die "I can not open it:$!";
my $mech = WWW::Mechanize->new( autocheck => 1 );
my @array = <F>;
foreach my $link (@array) {
    my $url = $link;
    $link =~ s/\///g;
    my $filename = $link;
    $mech->get( $url );
    $mech->save_content( $filename );

    print $filename."\n";
}

Best regards,
Samira Khonsha.

From aaron.patterson at gmail.com  Mon Feb 23 11:29:48 2009
From: aaron.patterson at gmail.com (Aaron Patterson)
Date: Mon, 23 Feb 2009 08:29:48 -0800
Subject: [Mechanize-users] Error getting url
In-Reply-To: <599defe0902230636t27566600u3ac2d57c39ad7293@mail.gmail.com>
References: <599defe0902230636t27566600u3ac2d57c39ad7293@mail.gmail.com>
Message-ID: <6959e1680902230829mb162acame5c823eec743cdcb@mail.gmail.com>

I think you've got the wrong list. This is the Ruby mechanize list,
and you want the Perl mechanize list.

-- 
Aaron Patterson
http://tenderlovemaking.com/

From aaron at tenderlovemaking.com  Tue Feb 24 00:53:33 2009
From: aaron at tenderlovemaking.com (Aaron Patterson)
Date: Mon, 23 Feb 2009 21:53:33 -0800
Subject: [Mechanize-users] [ANN] mechanize 0.9.1 Released
Message-ID: <20090224055333.GA18517@Jordan.local>

mechanize version 0.9.1 has been released!

The Mechanize library is used for automating interaction with websites.
Mechanize automatically stores and sends cookies, follows redirects,
can follow links, and submit forms. Form fields can be populated and
submitted. Mechanize also keeps track of the sites that you have visited
as a history.

Changes:

### 0.9.1 2009/02/23

* New Features:
  * Encoding may be specified for a page: Page#encoding=

* Bug Fixes:
  * m17n fixes. ありがとう konn!
  * Fixed a problem with base tags. ありがとう Keisuke
  * HEAD requests do not record in the history
  * Default encoding to ISO-8859-1 instead of ASCII
  * Requests with URI instances should not be polluted RF #23472
  * Nonce count fixed for digest auth requests. Thanks Adrian Slapa!
  * Fixed a referer issue with requests using a uri. RF #23472
  * WAP content types will now be parsed
  * Rescued poorly formatted cookies. Thanks Kelley Reynolds!

-- 
Aaron Patterson
http://tenderlovemaking.com/

From arya.subhransu at gmail.com  Tue Feb 24 03:32:08 2009
From: arya.subhransu at gmail.com (subhransu behera)
Date: Tue, 24 Feb 2009 14:02:08 +0530
Subject: [Mechanize-users] problem scrapping ATnT site
Message-ID: <8f00add50902240032t1e8b76dekf39203b2ce0af520@mail.gmail.com>

Hi,

I am trying to download past call details from the AT&T site
in CSV format.

The page requires selecting a bill period and clicking a radio button.
Clicking the "Submit" link then downloads the call summary for
that period.

I tried to do it in mechanize in the following way, but it downloads
the source of the page instead of the actual CSV file.

# get the download page

page_download = agent.get "https://www.wireless.att.com/view/billPayDownloadDetail.doview?execdownloadPage=true"

# get the form for bill_period and select a bill period

bill_period_form = page_download.forms[2]
bill_period_form.field.options[2].select

# click on the csv radio button

download_format_form = page_download.forms[3]
download_format_form.radiobuttons[1].click

# click on the submit link that downloads the csv file.

download_file = agent.click page_download.search("a")[41]
download_file.save_as(".csv")

The problems I am facing with the above code are:

+ Nothing special happens after selecting a particular bill period from
  the select options.
+ It downloads the page source instead of the actual CSV file.

Can you suggest something? Am I missing something here?

Thanks,
Shubh

From whitethunder922 at yahoo.com  Tue Feb 24 10:18:09 2009
From: whitethunder922 at yahoo.com (Matt White)
Date: Tue, 24 Feb 2009 07:18:09 -0800 (PST)
Subject: [Mechanize-users] problem scrapping ATnT site
References: <8f00add50902240032t1e8b76dekf39203b2ce0af520@mail.gmail.com>
Message-ID: <190289.59284.qm@web53309.mail.re2.yahoo.com>

One thing to be aware of is that Mechanize doesn't interpret JavaScript.
If the page changes dynamically as you select things on it, Mechanize
will not see those changes. If this is the problem you are having, your
script will have to do whatever the JavaScript is doing to get
everything right.

Matt White

From reid.thompson at ateb.com  Tue Feb 24 10:19:22 2009
From: reid.thompson at ateb.com (Reid Thompson)
Date: Tue, 24 Feb 2009 10:19:22 -0500
Subject: [Mechanize-users] need guidance on following links to download files
Message-ID: <1235488762.32688.25.camel@raker>

The script below is a modified version of one I found via Google. I'm
trying to figure out what I'm missing in order to download the files
associated with the links.

require 'mechanize'

agent = WWW::Mechanize.new
pagent = WWW::Mechanize.new
agent.get("http://www.daytrotter.com/songs?offset=60/")
links = agent.page.search('a')
# keep just the links ending in .mp3.link
hrefs = links.map { |m| m['href'] }.select { |u| u =~ /\.mp3\.link$/ }
#puts hrefs
#FileUtils.mkdir_p('daytrotter') # keep it neat
hrefs.each do |mfile|
  next if mfile.match(/^\/download/)
  #puts mfile
  filename = "#{mfile.split('/')[-1]}"
  filename.gsub!('.link', '')
  puts "Saving #{mfile} as #{filename}"
  agent.get(mfile).save_as(filename)
end

This results in output of the following format:

Saving http://daytrotter.com/file_download/76/TwoGallants_DaytrotterSession_2.mp3.link as TwoGallants_DaytrotterSession_2.mp3

I can't seem to get the final result to resolve to the actual file.
I'd appreciate any pointers.

Thanks,
reid

From arya.subhransu at gmail.com  Tue Feb 24 14:23:36 2009
From: arya.subhransu at gmail.com (subhransu behera)
Date: Wed, 25 Feb 2009 00:53:36 +0530
Subject: [Mechanize-users] problem scrapping ATnT site
In-Reply-To: <190289.59284.qm@web53309.mail.re2.yahoo.com>
References: <8f00add50902240032t1e8b76dekf39203b2ce0af520@mail.gmail.com>
	<190289.59284.qm@web53309.mail.re2.yahoo.com>
Message-ID: <8f00add50902241123r403fc219ua5f30a9110b6e615@mail.gmail.com>

Hi Matt,

I did exactly what you suggested, and now it works as expected.
Thanks a ton buddy!

Regards,
Shubh

-- 
Innovator, Pune - India
Phone : (+91)-98605-59976
Blog : http://sbehera.livejournal.com/

From barjunk at attglobal.net  Tue Feb 24 17:24:53 2009
From: barjunk at attglobal.net (barsalou)
Date: Tue, 24 Feb 2009 13:24:53 -0900
Subject: [Mechanize-users] Mechanize, history and memory
Message-ID: <20090224132453.4yf9cr4so4ckw84g@lcgalaska.com>

I recently wrote a script to read a web page over and over.

I ran into an issue where the script would stop for a seemingly
unknown reason.

It turns out the "browser history" was continually growing.

The answer of course is to set agent.max_history to some lower number,
in my case one.

Have you ever considered implementing a warning, or changing the default
for max_history to something that won't eat up memory?

Maybe a note in GUIDE.txt?

I haven't tested 0.9.1 yet, so you may have changed the default, but
the docs for 0.9.1 don't seem to be very specific about that.

I'll provide a patch, but wanted to see which way you'd want to go.

Mike B.
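
To illustrate the memory growth described in the message above, here is
a minimal sketch in plain Ruby. It is not Mechanize's implementation:
BoundedHistory is a hypothetical class made up for this example, and the
page bodies are stand-in strings. It only shows why an uncapped history
keeps every page alive while a cap of one keeps just the latest. With
Mechanize itself, the setting mentioned in the message is
agent.max_history = 1.

```ruby
# Hypothetical illustration only; BoundedHistory is not part of Mechanize.
class BoundedHistory
  def initialize(max_size = nil)
    @max_size = max_size # nil means "never discard anything"
    @pages = []
  end

  # Record a page, then drop the oldest entries if the cap is exceeded.
  def push(page)
    @pages << page
    @pages.shift while @max_size && @pages.size > @max_size
    self
  end

  def size
    @pages.size
  end

  def last
    @pages.last
  end
end

uncapped = BoundedHistory.new    # behaves like a history with no limit
capped   = BoundedHistory.new(1) # behaves like a history limited to one page

# Simulate reading the same page over and over.
1000.times do |i|
  body = "page #{i}"
  uncapped.push(body)
  capped.push(body)
end

puts "uncapped history holds #{uncapped.size} pages"
puts "capped history holds #{capped.size} page(s)"
```

In a long-running polling script the uncapped case is the one that
slowly exhausts memory, since every fetched page stays referenced.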