From normalperson at yhbt.net Sat Sep 5 17:50:45 2009
From: normalperson at yhbt.net (Eric Wong)
Date: Sat, 5 Sep 2009 14:50:45 -0700
Subject: [Mongrel-development] merging Unicorn HTTP parser back to Mongrel
Message-ID: <20090905215045.GB28829@dcvr.yhbt.net>

Hello,

(ok, this email got longer than expected, I now consider the most
important parts the first and last paragraphs of the last footnote).

The Unicorn HTTP parser is feature complete as far as I can tell and
supports things the Mongrel one does not. I would very much like to
see it used in places that Unicorn isn't suited for[1]. In fact, a
chunk of the new features are much better suited for a server with
better slow client handling like Mongrel.

The big roadblock to getting this back into Mongrel is the Java/JRuby
version of the parser Mongrel uses. Simply put, I don't do Java;
somebody else will have to port it. But I'll have to convince you
that these features are worth going into Mongrel, too :)

I could provide a standalone C parser that can be wrapped with FFI,
but I'm not sure if the performance would be acceptable. I'm fairly
certain that a pure-Ruby version with Ragel-generated code would not
provide acceptable performance anywhere; maybe a hand-coded one could,
but I'm not particularly excited about doing that...

The MRI-C parser should just work on Win32. Unlike the rest of
Unicorn, the HTTP parser remains portable to non-UNIX platforms and
thread-safe. There are no system-calls made directly through it (only
memory allocations through the Ruby C API).

New features that aren't in Mongrel are:

* HTTP/0.9 support - blame a network BOFH hell bent for hell on saving
  bytes with a health-checker config for this :)

  The HttpParser#headers? method has been added to determine if
  headers should be sent in the response (HTTP/0.9 didn't have
  response headers).

* "Transfer-Encoding: chunked" request decoding support

  I've been told mobile devices[2] do uploads like this (since they
  may lack the storage capacity to store large files). This will be
  useful to Mongrel since Mongrel can handle slow clients better
  (mobile devices).

  I also have a use case that goes like this:

    tar zc $BIG_DIRECTORY | curl -T- http://unicorn/path/to/upload

  This is designed to be slurp-resistant so clients cannot control
  memory usage of the server and DoS it even with huge chunk sizes.

* Trailers support (with Transfer-Encoding: chunked).

  I haven't run across applications that use this yet (Amazon S3
  maybe?) but one use case that I can foresee is generating a
  Content-MD5 trailer with the above "tar | curl" command.

* Multiline continuation headers - Pound sends them, I don't care for
  Pound but I figured I might as well do it just in case somebody else
  starts doing it...

* Absolute Request URI parsing - It was done with URI.parse
  originally, I figured I might as well do it in Ragel since it's part
  of RFC 2616. I think client-side proxies use it so maybe one day
  somebody can turn Mongrel or a derived server into a client-side
  HTTP proxy...

* Repeated headers handling - they're joined with commas now since
  Rack doesn't accept arrays in HTTP_* entries. I posted a standalone
  patch for this in <20090810001022.GA17572 at dcvr.yhbt.net>

* HttpParser#keepalive? method - the parser can tell you if it's safe
  to handle a keepalive request. Not used with Unicorn at the moment.

Chunk extensions are one thing that the parser currently just ignores;
this is because I've yet to see any use of them anywhere and Rack does
not mention them.

Parser Limits:

  Request body handling:

    Maximum Content-Length is the maximum value of off_t. I don't
    think this should be a problem for anyone as Ruby defaults to
    _FILE_OFFSET_BITS=64 on 32-bit arches. Mongrel does not have this
    limit in the parser, but since it buffers large uploads to a
    Tempfile, the limit always existed anyways.
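
To make the chunked request decoding described above more concrete,
here is a minimal pure-Ruby sketch. It is not the Ragel parser and not
Unicorn code: the dechunk helper and its bufsize parameter are
hypothetical names made up for illustration. It only tries to show the
"slurp-resistant" idea, namely that the payload is handed over in
bounded pieces no matter how large a chunk size the client announces.
Chunk extensions are skipped and trailers are read and discarded.

```ruby
require "stringio"

# Hypothetical helper, not part of Unicorn or Mongrel.
# Decodes a "Transfer-Encoding: chunked" body from +input+ and yields
# the payload in pieces of at most +bufsize+ bytes, so a huge announced
# chunk size never forces a huge allocation.
def dechunk(input, bufsize = 16 * 1024)
  loop do
    line = input.gets("\r\n") or raise EOFError, "missing chunk size"
    # Anything after ";" is a chunk extension; this sketch ignores it.
    size_field = line.chomp("\r\n").split(";", 2).first.to_s.strip
    unless size_field =~ /\A[0-9a-fA-F]+\z/
      raise ArgumentError, "bad chunk size"
    end
    remaining = size_field.hex
    break if remaining.zero? # a zero-sized chunk marks the end of the body
    while remaining > 0
      piece = input.read([remaining, bufsize].min) or
        raise EOFError, "short chunk"
      remaining -= piece.bytesize
      yield piece
    end
    input.read(2) == "\r\n" or raise ArgumentError, "missing CRLF after chunk"
  end
  # Skip trailer lines (if any) up to the terminating blank line.
  while (line = input.gets("\r\n"))
    break if line == "\r\n"
  end
end

body = StringIO.new("5\r\nhello\r\n6;ext=1\r\n world\r\n0\r\n\r\n")
decoded = String.new
dechunk(body, 4) { |piece| decoded << piece }
puts decoded # prints "hello world"
```

A real server would read from a socket and also enforce an upper bound
on the chunk size value itself; the sketch leaves that out.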

    Maximum chunk size is also the maximum value of off_t, which is
    usually a 64-bit long (since Ruby defaults to _FILE_OFFSET_BITS=64
    on my 32-bit boxes). I don't expect valid clients to send any
    values close to this limit, but that's just what it is.

  Headers:

    Mostly the same as Mongrel: all headers must fit into the same
    <=112K string object, which shouldn't be a problem for anything
    capable of running Ruby. Continuation lines can bypass the
    per-header size limit, but everything still stays under 112K which
    is a pretty large limit.

  Trailers:

    These can fit into another <=112K string. Space taken up during
    header processing doesn't affect Trailer processing, so you could
    end up with 224K of combined metadata.

You can get a full changelog since I branched from fauna/mongrel via:

  git log v0.0.0.. -- ext

Finally, the new API is documented via RDoc here:

  http://unicorn.bogomips.org/Unicorn/HttpParser.html

I don't consider the API set in stone, but I do consider the header
handling part a bit simpler/less error prone than the old one.

Disclaimer:

Due to the large amount of changes to the C/Ragel portions, another
security audit/pair-of-eyes would be nice. All use of Unicorn so far
has been on LANs with trusted clients or with nginx in front. While
I'm very comfortable with C and fairly comfortable with Ragel, I'm far
from infallible so close review from a second pair of eyes would be
greatly appreciated.

Future:

I'm also planning on porting this to Rubinius. I haven't had a chance
to look at it yet but the Mongrel/C one has already been ported so it
shouldn't be too hard (I only know/can stomach a small amount of C++,
though I suspect I won't even need it ...)

Footnotes:

[1] - Comet/long-polling/reverse HTTP, and sites that rely heavily on
      external services (including OpenID) are all badly suited for
      Unicorn.

[2] - As a side effect, Unicorn also uses a TeeInput class that allows
      the request body to be read in real-time within the Rack
      application (while "tee-ing" to a temporary file to provide
      rewindability). This also allows Mongrel Upload Progress to be
      implemented in the future in a Rack::Lint-compliant manner.

      The one weird thing about TeeInput is that:

        env["rack.input"].read(NR_BYTES)

      is not guaranteed to return NR_BYTES, only NR_BYTES at most. So
      every #read can provide "last block" semantics. Rack does not
      enforce this behavior, so it should be fine.

      This should not be a problem in practice since most read() and
      read()-like APIs provide no such guarantee even if implied when
      reading from "fast" devices like the filesystem. CGI apps that
      get a socket as stdin also get semantics similar to what apps
      under Unicorn get.

      I imagine this feature to be hugely useful for slow mobile
      clients that stream data slowly as it allows the server to start
      processing data as it is being uploaded.

-- 
Eric Wong

From normalperson at yhbt.net Sat Sep 12 19:57:29 2009
From: normalperson at yhbt.net (Eric Wong)
Date: Sat, 12 Sep 2009 16:57:29 -0700
Subject: [Mongrel-development] [RFC Mongrel2] simpler response API + updated HTTP parser
Message-ID: <20090912235729.GA9370@dcvr.yhbt.net>

Hi all,

I've pushed out some changes based on fauna/master[1] to
git://git.bogomips.org/ur-mongrel that include a good chunk of the
platform-independent stuff found in Unicorn.

The new HTTP parser is named "mongrel_http" to avoid loadtime
conflicts with the old one ("http11") but maintains the same class
name (Mongrel::HttpParser). This one even supports HTTP/0.9, so
"http11" wasn't an appropriate name for it :)

Problems:

I'm having some trouble with Rake+Echoe 3.2 with an "uninitialized
constant Platform" error but everything seems to work by hand without
Rake+Echoe.

I'm also getting some test failures under 1.9.1-p243 with the
semaphore/threading tests.
I haven't looked too hard at this current threading model, but my gut
feeling is that it's too complicated and a "dumber" model in mongrel
1.x *or* a fixed number of worker threads doing accept() is
sufficient...

One thing that may be cool is to support multiple
threading/concurrency models since 1.8/1.9/jruby/rubinius all
implement threads differently and we can also get Actors with
1.9/Rubinius.

shortlog and diffstat below:

Eric Wong (6):
      http_response: replace old API with simpler one
      http_response: drop old API compatibility
      remove HeaderOut class
      Add new HTTP/{0.9,1.0,1.1} parser
      Start using the new HTTP parser + TeeInput
      Remove unused Const::HTTP_STATUS_CODES hash

 Manifest                                     |   10 +-
 ext/mongrel_http/c_util.h                    |  107 ++++
 ext/mongrel_http/common_field_optimization.h |  111 ++++
 ext/mongrel_http/ext_help.h                  |   48 ++
 ext/mongrel_http/extconf.rb                  |    8 +
 ext/mongrel_http/global_variables.h          |   91 ++++
 ext/mongrel_http/mongrel_http.rl             |  708 ++++++++++++++++++++++++++
 ext/mongrel_http/mongrel_http_common.rl      |   74 +++
 lib/mongrel.rb                               |   64 +---
 lib/mongrel/const.rb                         |   46 +--
 lib/mongrel/header_out.rb                    |   34 --
 lib/mongrel/http_request.rb                  |  147 ++----
 lib/mongrel/http_response.rb                 |  202 ++------
 lib/mongrel/tee_input.rb                     |  144 ++++++
 test/unit/test_http_parser.rb                |  425 ++++++++++++++--
 test/unit/test_http_parser_ng.rb             |  307 +++++++++++
 test/unit/test_response.rb                   |   12 +-
 test/unit/test_server.rb                     |    3 +
 18 files changed, 2101 insertions(+), 440 deletions(-)

Full changelog:

commit 4e6ab7b7d608bd074107c6a1804401d8165062d4
Author: Eric Wong
Date:   Sat Sep 12 16:38:22 2009 -0700

    Remove unused Const::HTTP_STATUS_CODES hash

    It's no longer used when we generate responses, instead we just
    use the one found in Rack (which was originally "stolen" from us)
    so it's one less thing for us to maintain.

commit 46ca4a1c35b92109cedd59808908e7ad1d289abb
Author: Eric Wong
Date:   Sat Sep 12 10:40:30 2009 -0700

    Start using the new HTTP parser + TeeInput

    The new HTTP parser minimizes the amount of Ruby support code
    needed and the HttpRequest class has been changed to a single
    class method: HttpRequest.read

    As a result, this hooks up the TeeInput class into the request
    processing cycle. TeeInput lets us read the request body off the
    socket while the Rack application is being called (instead of
    being buffered before-hand) while providing rewindable semantics
    that the Rack spec requires.

commit c5a63522bc7e323c706609f7d99ed9f09fe9975d
Author: Eric Wong
Date:   Fri Sep 11 13:55:20 2009 -0700

    Add new HTTP/{0.9,1.0,1.1} parser

    This is descended from the Mongrel parser but modified to support:

    * chunked transfer-encoding
    * trailers after chunked request bodies
    * HTTP/0.9
    * absolute URI requests
    * multi-line headers with continuation lines
    * repeated headers (joined by commas)
    * #keepalive? boolean method
    * better integration with Rack

    This is not yet hooked into any existing parts of Mongrel, that is
    the next step.

commit 8c1c7bdd3c1767708f8507d5aef8ded03b6f1796
Author: Eric Wong
Date:   Fri Sep 11 13:16:25 2009 -0700

    remove HeaderOut class

    HttpResponse has been rewritten to just iterate through the
    headers Rack gives us in a GC-friendly way so we have no need for
    this any longer.

commit 392ea08624e39faec8d5e10ba04b21dfd9ca19a1
Author: Eric Wong
Date:   Fri Sep 11 12:58:42 2009 -0700

    http_response: drop old API compatibility

    Avoid needless overhead in allocating a HttpResponse object and
    instead just use a class method. This is alright with Rack
    applications since Rack specifies the response is already a tuple
    for writing. Of course the headers and body of the response can
    both be generated iteratively with #each.
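
As a rough illustration of the approach this commit message describes
(a class method that takes the Rack response tuple and iterates over
headers and body with #each), a minimal sketch could look like the
following. It is not the code from the commit: ResponseSketch, its
write method and the small REASONS table are hypothetical stand-ins,
and the header splitting assumes the Rack 1.x convention of joining
multiple values of one header with "\n".

```ruby
require "stringio"

# Hypothetical sketch, not the actual Mongrel::HttpResponse code.
module ResponseSketch
  # Tiny stand-in table; a real server would use a complete one.
  REASONS = { 200 => "OK", 404 => "Not Found" }.freeze

  # Writes a Rack-style [status, headers, body] tuple to +io+ without
  # allocating a response object per request.
  def self.write(io, rack_response)
    status, headers, body = rack_response
    status = status.to_i
    out = "HTTP/1.1 #{status} #{REASONS.fetch(status, 'Unknown')}\r\n"
    headers.each do |key, value|
      # Assumption: multiple values for one header are "\n"-separated.
      value.to_s.split("\n").each { |v| out << "#{key}: #{v}\r\n" }
    end
    out << "\r\n"
    io.write(out)
    # The body only needs to respond to #each and yield strings.
    body.each { |chunk| io.write(chunk) }
  ensure
    body.close if body.respond_to?(:close)
  end
end

io = StringIO.new
headers = { "Content-Type" => "text/plain", "Content-Length" => "6" }
ResponseSketch.write(io, [200, headers, ["hello\n"]])
puts io.string
```

Because the body is consumed with #each, an application can stream a
large response piece by piece instead of building it in memory first.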

commit 469a507133bd20034df485f03b6eb7b0e82080d6
Author: Eric Wong
Date:   Fri Sep 11 12:52:20 2009 -0700

    http_response: replace old API with simpler one

    The old API is completely dropped, a compatibility layer for the
    old one will be added as Rack middleware instead. This allows
    newly-written applications to go through fewer layers of
    abstraction.

git: git://git.bogomips.org/ur-mongrel
cgit: http://git.bogomips.org/cgit/ur-mongrel.git

[1] 9f9a9d488ed32a2891dc3dd7d50a17a16357042d

-- 
Eric Wong