and which it looks like Mechanize doesn't get.
>>
>> I hope I haven't answered my own question!
>>
>> Regards
>>
>>
>>
>> _______________________________________________
>> Mechanize-users mailing list
>> Mechanize-users at rubyforge.org
>> http://rubyforge.org/mailman/listinfo/mechanize-users
From ross at roscommonhq.com Tue Mar 24 23:53:11 2009
From: ross at roscommonhq.com (Ross Cameron)
Date: Wed, 25 Mar 2009 14:53:11 +1100
Subject: [Mechanize-users] Capturing the result of submits
In-Reply-To: <967d3b9a0903242035t1c4313f0pe7552c264b7246ba@mail.gmail.com>
References: <49C96B79.7060103@roscommonhq.com>
Message-ID: <49C9AAA7.8080003@roscommonhq.com>
Mike
Most helpful. And a very elegant solution to the mechanize uri problem.
Regards
Ross
Mike Mondragon wrote:
> On Tue, Mar 24, 2009 at 6:17 PM, Mat Schaffer wrote:
>
>> If the page doesn't refresh then javascript is involved. Of course, that's
>> not to say you couldn't parse the javascript response in ruby and get the
>> information you're looking for. I've done it a lot with good results. I
>> actually scripted most of the major webmail systems with mechanize a few
>> years back and AOL's webmail was the only javascript nut I couldn't crack.
>>
>
> I think a lot of people came up against the problem with scraping AOL
> webmail. They had an edgecase for URL formatting that Mechanize was
> handling a bit differently than a real web browser. Here's the duck
> punch on WWW::Mechanize::to_absolute_uri that can be used to scrape on
> AOL webmail properly.
>
> http://github.com/contentfree/blackbook/blob/ca9d90ff1be576bdbb42a1c6b81940d81840ed9d/lib/blackbook/importer/page_scraper.rb
>
> Mike
>
>
>> -Mat
>>
>> On Mar 24, 2009, at 7:23 PM, Ross Cameron wrote:
>>
>>
>>> Hi
>>>
>>> I apologize up front if this is a dumb question because I guess Ajax and
>>> thus Javascript is involved.
>>>
>>> Is there any way to capture the result of a submit if the current page is
>>> modified as result of the submit?
>>>
>>> For example, a couple of input fields, a submit and the result turns up in
>>> a modified and which it looks like Mechanize doesn't get.
>>>
>>> I hope I haven't answered my own question!
>>>
>>> Regards
>>>
>>>
>>>
--
------------------------------------------------------------------------
Ross Cameron | Director
Roscommon Pty Ltd | ABN 85 099 499 840
p: +61 2 9016 4133
| m: +61 4 3312 9087
| f: +61 2 9420 4525
| w: www.roscommonhq.com
| AIM: rossppc
From mat.schaffer at gmail.com Wed Mar 25 07:58:21 2009
From: mat.schaffer at gmail.com (Mat Schaffer)
Date: Wed, 25 Mar 2009 07:58:21 -0400
Subject: [Mechanize-users] Capturing the result of submits
In-Reply-To: <967d3b9a0903242035t1c4313f0pe7552c264b7246ba@mail.gmail.com>
References: <49C96B79.7060103@roscommonhq.com>
<967d3b9a0903242035t1c4313f0pe7552c264b7246ba@mail.gmail.com>
Message-ID:
On Mar 24, 2009, at 11:35 PM, Mike Mondragon wrote:
> I think a lot of people came up against the problem with scraping AOL
> webmail. They had an edgecase for URL formatting that Mechanize was
> handling a bit differently than a real web browser. Here's the duck
> punch on WWW::Mechanize::to_absolute_uri that can be used to scrape on
> AOL webmail properly.
>
> http://github.com/contentfree/blackbook/blob/ca9d90ff1be576bdbb42a1c6b81940d81840ed9d/lib/blackbook/importer/page_scraper.rb
>
> Mike
ha! Nice one, man. Sadly the project I was doing it for is long gone,
but thanks for this lovely gem. I'll sure be bookmarking this for later!
-Mat
From mat.schaffer at gmail.com Wed Mar 25 08:03:03 2009
From: mat.schaffer at gmail.com (Mat Schaffer)
Date: Wed, 25 Mar 2009 08:03:03 -0400
Subject: [Mechanize-users] Capturing the result of submits
In-Reply-To: <49C9A521.7050609@roscommonhq.com>
References: <49C96B79.7060103@roscommonhq.com>
<49C9A521.7050609@roscommonhq.com>
Message-ID:
On Mar 24, 2009, at 11:29 PM, Ross Cameron wrote:
> Hi Matt
>
> Many thanks. I sort of went and solved it in the case of a form GET
> method by scripting the full path for the form action. This wasn't
> too difficult because the action url can be discovered by
> inspection. POST is somewhat more difficult but I assume there are
> ways of finding out what is passed and setting those.
>
> But what would be nicer, if you wouldn't mind, is pointing me in the
> right direction to get at the JavaScript response - not sure how to
> do that. That would nail it.
I often use Charles in these situations (http://www.charlesproxy.com/).
There are other options too, like TamperData or Fiddler for Windows, but
Charles feels a bit more organized/reliable, and usually the 30-minute
time limit is enough to get simple jobs done.
Once you've figured out the right request, the response can be
obtained from #body in mechanize like usual.
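
As a rough sketch of that last step, assuming the proxy showed the page making a form-encoded POST: the endpoint URL and field names below are made up for illustration, and Ruby's standard Net::HTTP stands in for Mechanize so the snippet has no gem dependencies. It only builds the request; sending it is left as a comment.

```ruby
require 'net/http'
require 'uri'

# Build the POST that the page's JavaScript would have sent.
# The URL and parameters are hypothetical; copy the real ones from the proxy log.
def build_ajax_post(url, params)
  uri = URI.parse(url)
  request = Net::HTTP::Post.new(uri.request_uri)
  request.set_form_data(params)                   # form-encodes the params into the body
  request['X-Requested-With'] = 'XMLHttpRequest'  # header many Ajax libraries add
  request
end

request = build_ajax_post('http://example.com/search', 'q' => 'ruby', 'page' => '1')
puts request.body

# To actually send it (needs network access):
#   uri = URI.parse('http://example.com/search')
#   response = Net::HTTP.start(uri.host, uri.port) { |http| http.request(request) }
#   response.body  # the raw text the JavaScript would have received
```

The response is often JSON or an HTML fragment rather than a full page, so it may need its own parsing step.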
-Mat
From eatme444 at hotmail.com Thu Mar 26 18:30:53 2009
From: eatme444 at hotmail.com (Anthony F)
Date: Thu, 26 Mar 2009 15:30:53 -0700
Subject: [Mechanize-users] Can't get this site to open
Message-ID:
The site is: http://www.bcbid.gov.bc.ca
It's a weird, complicated piece of crap full of frames and cookies and all sorts of god-awful javascript navigation. However, before I even get into that stuff I can't even get the site to open in Mechanize. Can anyone else get this working, or is it just me?
From mat.schaffer at gmail.com Thu Mar 26 23:23:15 2009
From: mat.schaffer at gmail.com (Mat Schaffer)
Date: Thu, 26 Mar 2009 23:23:15 -0400
Subject: [Mechanize-users] Can't get this site to open
In-Reply-To:
References:
Message-ID: <2D8FA06C-17DD-4424-B854-B455813268F6@gmail.com>
Loads for me, but it's also got a javascript redirect in there. You'll
have to do that yourself with something like
agent.click(page.links.first)
>> WWW::Mechanize.new.get('http://www.bcbid.gov.bc.ca').body
=> "\r\n\r\n\r\nRe-directing to BC Bid...\r\n
head>\r\n\r\nIf this page does not automatically re-direct you to BC
Bid®,
\r\nplease click here.\r\n\r\n"
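
If page.links comes back empty, one fallback is to pull the target URL straight out of the body string. This is only a sketch: the archive stripped the real markup, so the sample HTML below is a guess at what the redirect page looks like, and first_url is a hypothetical helper, not part of Mechanize.

```ruby
# Return the first absolute http(s) URL found in an HTML string, or nil.
# The match stops at whitespace, quotes, or angle brackets.
def first_url(html)
  html[%r{https?://[^\s'"<>]+}]
end

# Assumed markup; the real page may use a meta refresh or a different handler.
sample = %(<body onload="window.location='http://www.bcbid.gov.bc.ca/open.dll/welcome'">)
puts first_url(sample)
```

The result can then be handed to agent.get in place of agent.click(page.links.first).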
On Mar 26, 2009, at 6:30 PM, Anthony F wrote:
>
> The site is: http://www.bcbid.gov.bc.ca
>
> It's a weird, complicated piece of crap full of frames and cookies
> and all sorts of god-awful javascript navigation. However, before I
> even get into that stuff I can't even get the site to open in
> Mechanize. Can anyone else get this working, or is it just me?
>
From eatme444 at hotmail.com Fri Mar 27 04:02:49 2009
From: eatme444 at hotmail.com (Anthony F)
Date: Fri, 27 Mar 2009 01:02:49 -0700
Subject: [Mechanize-users] Can't get this site to open
Message-ID:
Interesting. I tried it with mechanize 0.8.5 and it seemed to work
fine. With 0.9.2 it opens the page, but doesn't seem to parse it
properly (i.e. frames => nil, link => nil, etc.). What version are
you using?
Mat Schaffer wrote:
> Loads for me, but it's also got a javascript redirect in there. You'll
> have to do that yourself with something like
> agent.click(page.links.first)
>
> >> WWW::Mechanize.new.get('http://www.bcbid.gov.bc.ca').body
> Re-directing to BC Bid...
> =>
> "\r\n\r\n\r\n\r\n\r\nhttp://www.bcbid.gov.bc.ca/open.dll/welcome'">\r\nIf
> this page does not automatically re-direct you to BC Bid?,
> \r\nplease http://www.bcbid.gov.bc.ca/open.dll/welcome">click
> here.\r\n\r\n"
From mat.schaffer at gmail.com Fri Mar 27 11:32:26 2009
From: mat.schaffer at gmail.com (Mat Schaffer)
Date: Fri, 27 Mar 2009 11:32:26 -0400
Subject: [Mechanize-users] Can't get this site to open
In-Reply-To:
References:
Message-ID: <4E8A088D-3F6A-4FD6-BB5D-78A9BE7CE2F7@gmail.com>
My previous was 0.9.0, but it works for me with 0.9.2 as well:
>> require 'mechanize'
=> true
>> WWW::Mechanize::VERSION
=> "0.9.2"
>> WWW::Mechanize.new.get('http://www.bcbid.gov.bc.ca').body
=> "\r\n\r\n\r\nRe-directing to BC Bid...\r\n
head>\r\n\r\nIf this page does not automatically re-direct you to BC
Bid®,
\r\nplease click here.\r\n\r\n"
I don't see any frames here. Do you maybe have a transparent web proxy
where you are? What does your response look like? You might want to
check using curl too.
-Mat
On Mar 27, 2009, at 4:02 AM, Anthony F wrote:
> Interesting. I tried it with mechanize 0.8.5 and it seemed to work
> fine. With 0.9.2 it opens the page, but doesn't seem to parse it
> properly (ie. frames => nil, link => nil, etc). What version are
> you using?
From eatme444 at hotmail.com Fri Mar 27 12:01:27 2009
From: eatme444 at hotmail.com (Anthony F)
Date: Fri, 27 Mar 2009 09:01:27 -0700
Subject: [Mechanize-users] Can't get this site to open
Message-ID:
I should be more clear. Your code also works for me as is in 0.9.2.
However, if I do this:
irb(main):001:0> require 'mechanize'
=> true
irb(main):002:0> WWW::Mechanize.new.get('http://www.bcbid.gov.bc.ca')
=> #}
{meta}
{title nil}
{iframes}
{frames}
{links}
{forms}>
The problem I'm having is title => nil, links => [], etc. I
can't actually do anything with the page other than get the body. And
when I say frames is empty I mean that when I try to parse the
redirected page (http://www.bcbid.gov.bc.ca/open.dll/welcome) it comes
up empty as well even though I can do a page.body successfully.
I had it working with 0.8.5 last night, but now that doesn't work
anymore either. I'm baffled.
From eatme444 at hotmail.com Fri Mar 27 12:07:42 2009
From: eatme444 at hotmail.com (Anthony F)
Date: Fri, 27 Mar 2009 09:07:42 -0700
Subject: [Mechanize-users] Can't get this site to open
Message-ID:
Errr... I take that back. I can still get it to work in 0.8.5, but not
0.9.2. If I understand correctly the parser changed between those
versions? That's probably the issue...
From eatme444 at hotmail.com Fri Mar 27 12:29:16 2009
From: eatme444 at hotmail.com (Anthony F)
Date: Fri, 27 Mar 2009 09:29:16 -0700
Subject: [Mechanize-users] Can't get this site to open
Message-ID:
YES!!! That was it. When I switch the parser to Hpricot all is well.
Thanks for the help, Mat!
Now I'm off to scrape this god-awful website...
From mat.schaffer at gmail.com Fri Mar 27 14:12:14 2009
From: mat.schaffer at gmail.com (Mat Schaffer)
Date: Fri, 27 Mar 2009 14:12:14 -0400
Subject: [Mechanize-users] Can't get this site to open
In-Reply-To:
References:
Message-ID:
Cool. I dunno if Aaron's on this or not, but it might be good to
figure out why nokogiri can't parse that page.
Here's a file captured with:
  File.open('response.html', 'w') { |f| f.print WWW::Mechanize.new.get('http://www.bcbid.gov.bc.ca').body }
I may play with it myself this weekend, but maybe Aaron will beat
me to it.
Thanks for finding the bug Anthony!
-Mat
On Mar 27, 2009, at 12:29 PM, Anthony F wrote:
> YES!!! That was it. When I switch the parser to Hpricot all is
> well. Thanks for the help, Mat!
>
> Now I'm off to scrape this god-awful website...
>
From mat.schaffer at gmail.com Fri Mar 27 14:31:25 2009
From: mat.schaffer at gmail.com (Mat Schaffer)
Date: Fri, 27 Mar 2009 14:31:25 -0400
Subject: [Mechanize-users] Can't get this site to open
In-Reply-To:
References:
Message-ID: <7693C5EE-65FB-47F2-A7BD-28EADE3A03DB@gmail.com>
Just noticed this:
Looks like there's a UTF-8 copyright symbol or something that might be
throwing things off. Especially because the server doesn't appear to
mark it as UTF-8 in the headers.
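
A quick way to check both halves of that guess, as a sketch only: the helper names are hypothetical, the sample header and bytes are invented rather than captured from the site, and the String encoding methods need Ruby 1.9 or later.

```ruby
# Extract the charset parameter from a Content-Type header value, or nil
# if the server did not declare one.
def declared_charset(content_type)
  content_type.to_s[/charset=["']?([^\s;"']+)/i, 1]
end

# Report whether a string of raw bytes is valid UTF-8.
def valid_utf8?(bytes)
  bytes.dup.force_encoding(Encoding::UTF_8).valid_encoding?
end

puts declared_charset('text/html').inspect    # header without a charset
puts valid_utf8?("BC Bid\xC2\xAE".b)          # (R) encoded as UTF-8 (two bytes)
puts valid_utf8?("BC Bid\xAE".b)              # (R) as a single Latin-1 byte
```

If the header declares no charset and the bytes are not valid UTF-8, transcoding the body from the likely source encoding before handing it to the parser may be worth trying.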
-Mat
On Mar 27, 2009, at 2:12 PM, Mat Schaffer wrote:
> Cool. I dunno if Aaron's on this or not, but it might be good to
> figure out why nokogiri can't parse that page.
>
> Here's a file captured with: File.open('response.html', 'w') { |f|
> f.print WWW::Mechanize.new.get('http://www.bcbid.gov.bc.ca').body }
>
> I may play with it myself this weekend, but maybe Aaron will beat
> me to it.
>
> Thanks for finding the bug Anthony!
> -Mat
From mr.danielaquino at gmail.com Mon Mar 30 20:05:01 2009
From: mr.danielaquino at gmail.com (Daniel Aquino)
Date: Mon, 30 Mar 2009 19:05:01 -0500
Subject: [Mechanize-users] HTTP Headers Only
Message-ID: <66f0f93e0903301705n1e585a68t9e7635178e889b20@mail.gmail.com>
Is there a way to request only the http headers?
I have a bot that connects to sites and spits out the html title but
for binary files I like it to just read the http headers to get the
file size etc... and not read in the entire binary!
Thanks!
From mat.schaffer at gmail.com Mon Mar 30 20:58:37 2009
From: mat.schaffer at gmail.com (Mat Schaffer)
Date: Mon, 30 Mar 2009 20:58:37 -0400
Subject: [Mechanize-users] HTTP Headers Only
In-Reply-To: <66f0f93e0903301705n1e585a68t9e7635178e889b20@mail.gmail.com>
References: <66f0f93e0903301705n1e585a68t9e7635178e889b20@mail.gmail.com>
Message-ID:
Seems like WWW::Mechanize#head would work:
http://mechanize.rubyforge.org/mechanize/WWW/Mechanize.html#M000183
-Mat
On Mar 30, 2009, at 8:05 PM, Daniel Aquino wrote:
> Is there a way to request only the http headers?
>
> I have a bot that connects to sites and spits out the html title but
> for binary files I like it to just read the http headers to get the
> file size etc... and not read in the entire binary!
>
> Thanks!
From aaron.patterson at gmail.com Mon Mar 30 23:18:37 2009
From: aaron.patterson at gmail.com (Aaron Patterson)
Date: Mon, 30 Mar 2009 20:18:37 -0700
Subject: [Mechanize-users] HTTP Headers Only
In-Reply-To:
References: <66f0f93e0903301705n1e585a68t9e7635178e889b20@mail.gmail.com>
Message-ID: <6959e1680903302018n7c418751wd176cf524bdaca9@mail.gmail.com>
On Mon, Mar 30, 2009 at 5:58 PM, Mat Schaffer wrote:
> Seems like WWW::Mechanize#head would work:
>
> http://mechanize.rubyforge.org/mechanize/WWW/Mechanize.html#M000183
Yes. A head request sounds appropriate.
--
Aaron Patterson
http://tenderlovemaking.com/
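
For anyone finding this thread later, here is a minimal sketch of the HEAD-request approach suggested above. It is not Mechanize code: it assumes Ruby's standard Net::HTTP library rather than WWW::Mechanize#head so that it runs without the gem, and the method names fetch_headers and summarize_headers are made up for illustration. A server may omit Content-Length, in which case the size comes back as nil.

```ruby
require 'net/http'
require 'uri'

# Send a HEAD request and return the response headers as a Hash.
# Net::HTTP lowercases the header names and wraps each value in an Array.
# (fetch_headers is a hypothetical helper name, not part of any library.)
def fetch_headers(url)
  uri = URI.parse(url)
  Net::HTTP.start(uri.host, uri.port, use_ssl: uri.scheme == 'https') do |http|
    http.head(uri.request_uri).to_hash
  end
end

# Reduce a headers Hash (shaped like the one fetch_headers returns) to the
# fields a title/size bot would care about. Missing headers become nil.
def summarize_headers(headers)
  type   = headers.fetch('content-type', []).first
  length = headers.fetch('content-length', []).first
  media  = type.to_s.split(';').first.to_s.strip.downcase
  {
    content_type: media.empty? ? nil : media,
    size_bytes:   (length.to_i if length && length.match?(/\A\d+\z/)),
    html:         media == 'text/html'
  }
end

# Example usage (makes a real network request, so it is left commented out;
# the URL is a placeholder):
#   info = summarize_headers(fetch_headers('http://example.com/file.zip'))
#   puts info[:size_bytes] unless info[:html]
```

Keeping the header parsing separate from the request makes it easy to decide, before any body is downloaded, whether a link is HTML worth fetching or a binary whose size is all you want.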
From mr.danielaquino at gmail.com Tue Mar 31 01:17:03 2009
From: mr.danielaquino at gmail.com (Daniel Aquino)
Date: Tue, 31 Mar 2009 01:17:03 -0400
Subject: [Mechanize-users] HTTP Headers Only
In-Reply-To: <6959e1680903302018n7c418751wd176cf524bdaca9@mail.gmail.com>
References: <66f0f93e0903301705n1e585a68t9e7635178e889b20@mail.gmail.com>
<6959e1680903302018n7c418751wd176cf524bdaca9@mail.gmail.com>
Message-ID: <66f0f93e0903302217g29f2cfe8r72c245c6a0e711f1@mail.gmail.com>
Yea I really searched around for this and couldn't figure out how to do it...
Thanks so much...
Also I think I remember reading something that the http server has to
support a head request.
is this true?
On Mon, Mar 30, 2009 at 11:18 PM, Aaron Patterson
wrote:
> On Mon, Mar 30, 2009 at 5:58 PM, Mat Schaffer wrote:
>> Seems like WWW::Mechanize#head would work:
>>
>> http://mechanize.rubyforge.org/mechanize/WWW/Mechanize.html#M000183
>
> Yes. A head request sounds appropriate.
>
> --
> Aaron Patterson
> http://tenderlovemaking.com/
From mr.danielaquino at gmail.com Tue Mar 31 01:41:17 2009
From: mr.danielaquino at gmail.com (Daniel Aquino)
Date: Tue, 31 Mar 2009 01:41:17 -0400
Subject: [Mechanize-users] HTTP Headers Only
In-Reply-To: <66f0f93e0903302217g29f2cfe8r72c245c6a0e711f1@mail.gmail.com>
References: <66f0f93e0903301705n1e585a68t9e7635178e889b20@mail.gmail.com>
<6959e1680903302018n7c418751wd176cf524bdaca9@mail.gmail.com>
<66f0f93e0903302217g29f2cfe8r72c245c6a0e711f1@mail.gmail.com>
Message-ID: <66f0f93e0903302241m27ea26aavcc2d7c3d38615912@mail.gmail.com>
Is there any way to limit the amount of data read from a link?
Perhaps use a filter to detect the <title> tag and abort the connection?
Or set a timeout on how long data should be read from the link?
I'm sure a malicious person could still easily feed in a massively
large file and cause the daemon to stick around reading it all...
And the only thing I'm interested in is the <title>.
Thanks!
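
A rough sketch of the "stop reading early" idea, given that the bot only wants the HTML title mentioned earlier in the thread. This is plain Ruby, not a Mechanize feature: extract_title and max_bytes are hypothetical names, and the chunks are assumed to come from some streaming source that yields the body piece by piece.

```ruby
# Scan chunks of an HTML body as they arrive and stop as soon as the
# <title> element is complete or max_bytes have been buffered.
# `chunks` is anything that responds to #each and yields Strings.
# (extract_title and max_bytes are made-up names for this sketch.)
def extract_title(chunks, max_bytes: 64 * 1024)
  buffer = +''
  chunks.each do |chunk|
    buffer << chunk
    match = buffer.match(%r{<title[^>]*>(.*?)</title>}im)
    # Returning here unwinds out of #each, so no further chunks are pulled.
    return match[1].strip if match
    # Give up once the size cap is reached, however large the body is.
    break if buffer.bytesize >= max_bytes
  end
  nil
end

# Example with an in-memory list of chunks:
#   extract_title(['<html><head><ti', 'tle>Hi</title>', '...'])
```

With Net::HTTP, the chunks could come from reading the response body with a block, and Net::HTTP#read_timeout= bounds how long a single read may stall. Neither is shown here, and how cleanly the connection is dropped on an early exit is worth checking against the library version in use.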
On Tue, Mar 31, 2009 at 1:17 AM, Daniel Aquino
wrote:
> Yea I really searched around for this and couldn't figure out how to do it...
>
> Thanks so much...
>
> Also I think I remember reading something that the http server has to
> support a head request.
>
> is this true?
>
> On Mon, Mar 30, 2009 at 11:18 PM, Aaron Patterson
> wrote:
>> On Mon, Mar 30, 2009 at 5:58 PM, Mat Schaffer wrote:
>>> Seems like WWW::Mechanize#head would work:
>>>
>>> http://mechanize.rubyforge.org/mechanize/WWW/Mechanize.html#M000183
>>
>> Yes. A head request sounds appropriate.
>>
>> --
>> Aaron Patterson
>> http://tenderlovemaking.com/
From aaron.patterson at gmail.com Tue Mar 31 01:50:03 2009
From: aaron.patterson at gmail.com (Aaron Patterson)
Date: Mon, 30 Mar 2009 22:50:03 -0700
Subject: [Mechanize-users] HTTP Headers Only
In-Reply-To: <66f0f93e0903302217g29f2cfe8r72c245c6a0e711f1@mail.gmail.com>
References: <66f0f93e0903301705n1e585a68t9e7635178e889b20@mail.gmail.com>
<6959e1680903302018n7c418751wd176cf524bdaca9@mail.gmail.com>
<66f0f93e0903302217g29f2cfe8r72c245c6a0e711f1@mail.gmail.com>
Message-ID: <6959e1680903302250l5778b344p214ae0ff6b35a4f1@mail.gmail.com>
On Mon, Mar 30, 2009 at 10:17 PM, Daniel Aquino
wrote:
> Yea I really searched around for this and couldn't figure out how to do it...
>
> Thanks so much...
>
> Also I think I remember reading something that the http server has to
> support a head request.
>
> is this true?
Yes, but most do. I don't think I've run into one that doesn't.
--
Aaron Patterson
http://tenderlovemaking.com/
From mr.danielaquino at gmail.com Tue Mar 31 05:47:39 2009
From: mr.danielaquino at gmail.com (Daniel Aquino)
Date: Tue, 31 Mar 2009 05:47:39 -0400
Subject: [Mechanize-users] HTTP Headers Only
In-Reply-To: <6959e1680903302250l5778b344p214ae0ff6b35a4f1@mail.gmail.com>
References: <66f0f93e0903301705n1e585a68t9e7635178e889b20@mail.gmail.com>
<6959e1680903302018n7c418751wd176cf524bdaca9@mail.gmail.com>
<66f0f93e0903302217g29f2cfe8r72c245c6a0e711f1@mail.gmail.com>
<6959e1680903302250l5778b344p214ae0ff6b35a4f1@mail.gmail.com>
Message-ID: <66f0f93e0903310247s63b50c0aod31e8763bfe05364@mail.gmail.com>
if I call agent.head and then call agent.get I end up with only head...
On Tue, Mar 31, 2009 at 1:50 AM, Aaron Patterson
wrote:
> On Mon, Mar 30, 2009 at 10:17 PM, Daniel Aquino
> wrote:
>> Yea I really searched around for this and couldn't figure out how to do it...
>>
>> Thanks so much...
>>
>> Also I think I remember reading something that the http server has to
>> support a head request.
>>
>> is this true?
>
> Yes, but most do. I don't think I've run into one that doesn't.
>
> --
> Aaron Patterson
> http://tenderlovemaking.com/
From eatme444 at hotmail.com Tue Mar 31 13:42:17 2009
From: eatme444 at hotmail.com (Anthony F)
Date: Tue, 31 Mar 2009 10:42:17 -0700
Subject: [Mechanize-users] Can't get this site to open
In-Reply-To: <7693C5EE-65FB-47F2-A7BD-28EADE3A03DB@gmail.com>
References:
<7693C5EE-65FB-47F2-A7BD-28EADE3A03DB@gmail.com>
Message-ID:
Just another update...
If I switch the html parser to Nokogiri instead of Nokogiri::HTML it seems to work as well. It turns out I need Nokogiri's extra XPath goodness to deal with this rat's nest, so that's a good thing.
From: mat.schaffer at gmail.com
To: mat.schaffer at gmail.com
Date: Fri, 27 Mar 2009 14:31:25 -0400
CC: mechanize-users at rubyforge.org
Subject: Re: [Mechanize-users] Can't get this site to open
Just noticed this:
Looks like there's a UTF-8 copyright symbol or something that might be throwing things off. Especially because the server doesn't appear to mark it as UTF-8 in the headers.
-Mat
On Mar 27, 2009, at 2:12 PM, Mat Schaffer wrote:
Cool. I dunno if Aaron's on this or not, but it might be good to figure out why Nokogiri can't parse that page.
Here's a file captured with: File.open('response.html', 'w') { |f| f.print WWW::Mechanize.new.get('http://www.bcbid.gov.bc.ca').body }
I may play with it myself this weekend, but maybe Aaron will beat me to it.
Thanks for finding the bug, Anthony!
-Mat
On Mar 27, 2009, at 12:29 PM, Anthony F wrote:
YES!!! That was it. When I switch the parser to Hpricot all is well. Thanks for the help, Mat!
Now I'm off to scrape this god-awful website...
A F wrote:
Errr... I take that back. I can still get it to work in 0.8.5, but not 0.9.2. If I understand correctly the parser changed between those versions? That's probably the issue...
Mat Schaffer wrote:
My previous was 0.9.0, but it works for me with 0.9.2 as well:
>> require 'mechanize'
=> true
>> WWW::Mechanize::VERSION
=> "0.9.2"
>> WWW::Mechanize.new.get('http://www.bcbid.gov.bc.ca').body
=> "\r\n\r\n\r\n\r\n\r\nhttp://www.bcbid.gov.bc.ca/open.dll/welcome'">\r\nIf this page does not automatically re-direct you to BC Bid?,
\r\nplease http://www.bcbid.gov.bc.ca/open.dll/welcome">click here.\r\n\r\n"
I don't see any frames here. Do you maybe have a transparent web proxy where you are? What does your response look like? You might want to check using curl too.
-Mat
On Mar 27, 2009, at 4:02 AM, Anthony F wrote:
Interesting. I tried it with mechanize 0.8.5 and it seemed to work fine. With 0.9.2 it opens the page, but doesn't seem to parse it properly (ie. frames => nil, link => nil, etc). What version are you using?
Mat Schaffer wrote:
Loads for me, but it's also got a javascript redirect in there. You'll have to do that yourself with something like agent.click(page.links.first)
>> WWW::Mechanize.new.get('http://www.bcbid.gov.bc.ca').body
=> "\r\n\r\n\r\n\r\n\r\nhttp://www.bcbid.gov.bc.ca/open.dll/welcome'">\r\nIf this page does not automatically re-direct you to BC Bid?,
\r\nplease http://www.bcbid.gov.bc.ca/open.dll/welcome">click here.\r\n\r\n"
On Mar 26, 2009, at 6:30 PM, Anthony F wrote:
The site is: http://www.bcbid.gov.bc.ca
It's a weird, complicated piece of crap full of frames and cookies and all sorts of god-awful javascript navigation. However, before I even get into that stuff I can't even get the site to open in Mechanize. Can anyone else get this working, or is it just me?