From mike at csa.net Thu Jun 12 10:31:36 2008
From: mike at csa.net (Mike Dalessio)
Date: Thu, 12 Jun 2008 10:31:36 -0400
Subject: [Mechanize-users] setting request headers via get()
In-Reply-To: <618c07250806120727j76b1eadem2d175bc05de4a1b5@mail.gmail.com>
References: <618c07250806120727j76b1eadem2d175bc05de4a1b5@mail.gmail.com>
Message-ID: <618c07250806120731m3fc0b169rc56efb2cf4bad218@mail.gmail.com>

Hey all,

Found an email thread from Jan 2007 discussing the inability to set request headers (like ETag and If-Modified-Since) through the API, and this is something that's bothering me a bit. Currently the "way" to do this is to subclass Mechanize and override set_headers(). That seems fine for headers that you'd like to send in every request or for classes of request, but is inconvenient for request-specific headers like ETag and If-Modified-Since.

I've got a branch up on github with a (fairly invasive, though totally backwards-compatible) patch to allow headers to be passed in through the get() call:

http://github.com/mdalessio/mechanize/commit/e7784de8326d2e4313dd4b2c2521a58b3ad52da3

Note that get(), fetch_page() and set_headers() all now have the ability to accept a hash of arguments as a parameter, allowing us to "stuff" extra parameters into each one while maintaining backwards compatibility (see tests).

Special headers specified by :etag and :if_modified_since are recognized, for whatever that's worth.

I realize this is kind of a fundamental change for such a simple feature, but it looks like there are some API changes planned for 0.8.0 anyway, so I figured I'd throw this out there and start a conversation about it.

Also, (because someone will probably point it out) I'm aware of the @conditional_requests attribute, which uses If-Modified-Since if the page is in history. However, this isn't practical for the classic If-Modified-Since case: when you're scraping/downloading a large file.
In this case, it's not optimal to keep the page in history (using memory) until I make my next request. I think it's probably more common to cache just the Last-Modified tag and start up a new agent for the next request.

Thanks for reading,
-mike

From aaron at tenderlovemaking.com Sat Jun 14 12:39:04 2008
From: aaron at tenderlovemaking.com (Aaron Patterson)
Date: Sat, 14 Jun 2008 09:39:04 -0700
Subject: [Mechanize-users] setting request headers via get()
In-Reply-To: <618c07250806120731m3fc0b169rc56efb2cf4bad218@mail.gmail.com>
References: <618c07250806120727j76b1eadem2d175bc05de4a1b5@mail.gmail.com> <618c07250806120731m3fc0b169rc56efb2cf4bad218@mail.gmail.com>
Message-ID: <20080614163904.GA25559@mac-mini.lan>

Hey Mike,

On Thu, Jun 12, 2008 at 10:31:36AM -0400, Mike Dalessio wrote:
> Hey all,
>
> Found an email thread from Jan 2007 discussing the inability to set request
> headers (like ETag and If-Modified-Since) through the API, and this is
> something that's bothering me a bit. Currently the "way" to do this is to
> subclass Mechanize and override set_headers(). That seems fine for headers
> that you'd like to send in every request or for classes of request, but is
> inconvenient for request-specific headers like ETag and If-Modified-Since.
>
> I've got a branch up on github with a (fairly invasive, though totally
> backwards-compatible) patch to allow headers to be passed in through the
> get() call:
>
> http://github.com/mdalessio/mechanize/commit/e7784de8326d2e4313dd4b2c2521a58b3ad52da3
>
> Note that get(), fetch_page() and set_headers() all now have the ability to
> accept a hash of arguments as a parameter, allowing us to "stuff" extra
> parameters into each one while maintaining backwards compatibility (see
> tests).
>
> Special headers specified by :etag and :if_modified_since are recognized,
> for whatever that's worth.
>
> I realize this is kind of a fundamental change for such a simple feature,
> but it looks like there are some API changes planned for 0.8.0 anyway, so I
> figured I'd throw this out there and start a conversation about it.
>
> Also, (because someone will probably point it out) I'm aware of the
> @conditional_requests attribute, which uses If-Modified-Since if the page is
> in history. However, this isn't practical for the classic If-Modified-Since
> case: when you're scraping/downloading a large file. In this case, it's not
> optimal to keep the page in history (using memory) until I make my next
> request. I think it's probably more common to cache just the Last-Modified
> tag and start up a new agent for the next request.

Sorry to get back to you so late, but I've been really busy! I like these changes, so I'm going to merge them to my master branch. I'd like to do a fairly intensive refactor for 0.8.0, but these changes look great!

-- 
Aaron Patterson
http://tenderlovemaking.com/
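
For readers following the thread, here is a minimal plain-Ruby sketch of the pattern Mike describes: a call that still accepts the old positional URL but can also take a hash carrying per-request headers. It is not the code from the patch or from Mechanize. The names normalize_get_args, build_headers and SPECIAL_HEADERS are hypothetical, and so are the option keys :url, :referer and :headers. Sending :etag as If-None-Match is an assumption based on standard HTTP conditional requests; the thread does not say how the patch maps it.

```ruby
require 'time' # provides Time#httpdate

# Hypothetical mapping from the "special" option keys named in the thread to
# HTTP request header names. Sending :etag as If-None-Match is an assumption
# (standard HTTP validators), not something the thread states.
SPECIAL_HEADERS = {
  :etag              => 'If-None-Match',
  :if_modified_since => 'If-Modified-Since'
}.freeze

# Turn a hash of per-request options into a hash of header name => value.
# Time-like values are formatted as HTTP dates; other keys pass through.
def build_headers(extra = {})
  extra.each_with_object({}) do |(key, value), headers|
    name  = SPECIAL_HEADERS.fetch(key, key.to_s)
    value = value.httpdate if value.respond_to?(:httpdate)
    headers[name] = value.to_s
  end
end

# Accept either the old positional form (a URL, optionally a referer) or a
# single options hash, so existing callers keep working unchanged.
def normalize_get_args(url_or_options, referer = nil)
  if url_or_options.is_a?(Hash)
    opts = url_or_options
    raise ArgumentError, 'url must be specified' unless opts[:url]
    { :url     => opts[:url],
      :referer => opts[:referer],
      :headers => build_headers(opts[:headers] || {}) }
  else
    { :url => url_or_options, :referer => referer, :headers => {} }
  end
end
```

With this shape, a caller that cached only the Last-Modified time from an earlier run could pass it to a fresh agent as { :url => ..., :headers => { :if_modified_since => cached_time } }, without keeping the large page in history.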