From kweller at itcrucible.com Tue Jul 11 16:52:53 2006
From: kweller at itcrucible.com (Kevin Weller)
Date: Tue, 11 Jul 2006 14:52:53 -0600
Subject: [Boulder-Denver Ruby Group] Decent HTML Parser?
Message-ID: <44B40FA5.9090908@itcrucible.com>

Hello everyone! I'm new to this group, and looking forward to my first
meeting next week for some good professional and human connections.
Hopefully soon there will be some more details about exactly when and where.

Anybody have experience with a decent HTML parser for a Ruby
application? I've looked around, and so far everything I've found is
either unfinished, unstable, [relatively] undocumented, or just plain
ugly in terms of API.

I'd like a parser that can take a partial HTML file and return an
easily-traversable data structure, in the same order that the elements
appear in the file. I don't want or need a callback mechanism, only
something I can iterate and tree-search. Though I don't hold much hope
it will work, I will try using REXML on my text and see what it
produces...results to be posted here. Thanks in advance!

--
Kevin Weller
Information Technology Crucible
http://www.itcrucible.com

Confidentiality Note: This is a confidential message intended solely
for the recipient(s). This is neither a contract nor an order for goods
or services. If you have any questions regarding this transmittal,
please contact me at kweller at itcrucible.com. Thank you!

From stimits at comcast.net Tue Jul 11 20:07:13 2006
From: stimits at comcast.net (D. Stimits)
Date: Tue, 11 Jul 2006 18:07:13 -0600
Subject: [Boulder-Denver Ruby Group] Decent HTML Parser?
In-Reply-To: <44B40FA5.9090908@itcrucible.com>
References: <44B40FA5.9090908@itcrucible.com>
Message-ID: <44B43D31.1010902@comcast.net>

Kevin Weller wrote:

>Hello everyone! I'm new to this group, and looking forward to my first
>meeting next week for some good professional and human connections.
>Hopefully soon there will be some more details about exactly when and where.
>
>Anybody have experience with a decent HTML parser for a Ruby
>application? I've looked around, and so far everything I've found is
>either unfinished, unstable, [relatively] undocumented, or just plain
>ugly in terms of API.
>
>I'd like a parser that can take a partial HTML file and return an
>easily-traversable data structure, in the same order that the elements
>appear in the file. I don't want or need a callback mechanism, only
>something I can iterate and tree-search. Though I don't hold much hope
>it will work, I will try using REXML on my text and see what it
>produces...results to be posted here. Thanks in advance!
>
>
>

Don't know about Ruby parsers, but in general, HTML is not XML, you
can't parse it right unless you're lucky or someone writes their HTML
knowing you're going to parse it as XML. On the other hand, if you're
talking XHTML strict, you're in luck, it really is XML. Sadly, I don't
know anything about Ruby XML parsers (I've been using C/C++). If you
want to parse only XHTML, you can use anything with the DOM interface,
and it'll be the same interface (DOM level 1 or 2 are standards, level 3
I *think* is still in draft state). If it isn't XHTML strict, you should
probably give up on a true XML interface.

But I'm sort of interested in the same question, from a different angle.
I'll take a risk and rephrase it and hope it's what you really mean. How
many ways are there to access the DOM API of an XHTML strict document
using Ruby? Typically I'd expect this to be done in
JavaScript/ECMAScript from the web browser, but on occasion there is a
need to rework something like this from the server. XSLT could be
useful, but a simple DOM interface should have a much lower learning
curve. Are you trying to use this at server side? Do you use Rails?

As an aside, there is DOM interface availability for HTML, not just
XHTML...but it's always flaky trying to figure out what that structure
is in a thousand different web browsers that have their own idea of how
to implement it (and it certainly mimics XML, but isn't really).

D. Stimits, stimits AT comcast DOT net

From wbruce at gmail.com Tue Jul 11 23:46:12 2006
From: wbruce at gmail.com (Bruce Williams)
Date: Tue, 11 Jul 2006 21:46:12 -0600
Subject: [Boulder-Denver Ruby Group] Decent HTML Parser?
In-Reply-To: <44B40FA5.9090908@itcrucible.com>
References: <44B40FA5.9090908@itcrucible.com>
Message-ID: <4896b9210607112046v6b1d2d1dhfab22ab40d0baff5@mail.gmail.com>

On 7/11/06, Kevin Weller wrote:
> Hello everyone! I'm new to this group, and looking forward to my first
> meeting next week for some good professional and human connections.
> Hopefully soon there will be some more details about exactly when and where.
>
> Anybody have experience with a decent HTML parser for a Ruby
> application? I've looked around, and so far everything I've found is
> either unfinished, unstable, [relatively] undocumented, or just plain
> ugly in terms of API.
>
> I'd like a parser that can take a partial HTML file and return an
> easily-traversable data structure, in the same order that the elements
> appear in the file. I don't want or need a callback mechanism, only
> something I can iterate and tree-search. Though I don't hold much hope
> it will work, I will try using REXML on my text and see what it
> produces...results to be posted here. Thanks in advance!

Hi Kevin,

Welcome!

You may want to check out Hpricot, a recent release by whytheluckystiff
(based on Tanaka Akira's HTree and John Resig's jQuery, but with the
scanner recoded in C).

http://code.whytheluckystiff.net/hpricot/

It may not be mature enough for your needs, but it looks pretty neat
(some examples at
http://code.whytheluckystiff.net/hpricot/wiki/AnHpricotShowcase).

Cheers,
Bruce Williams
http://codefluency.com

From 2006 at joshcarter.com Wed Jul 12 09:28:30 2006
From: 2006 at joshcarter.com (Josh Carter)
Date: Wed, 12 Jul 2006 07:28:30 -0600
Subject: [Boulder-Denver Ruby Group] Decent HTML Parser?
In-Reply-To: <44B43D31.1010902@comcast.net>
References: <44B40FA5.9090908@itcrucible.com> <44B43D31.1010902@comcast.net>
Message-ID:

On Jul 11, 2006, at 6:07 PM, D. Stimits wrote:

> But I'm sort of interested in the same question, from a different angle.
> I'll take a risk and rephrase it and hope it's what you really mean. How
> many ways are there to access the DOM API of an XHTML strict document
> using Ruby?

There are definitely more ways than one. :)

REXML, which is included in Ruby's standard library, provides 4 APIs.
The tree parser is very Ruby-ish and should look familiar if you're
used to using a DOM API:

http://www.germane-software.com/software/rexml/docs/tutorial.html

I haven't used the others myself; with an API like
'doc.elements.each("*/section/item") { ... }' I've been spoiled rotten.
(Historically I've used expat in C/C++ projects, which is SAX.)

Of course, there are other XML parsing libraries entirely if REXML
isn't suitable.

Best regards,
Josh

From kweller at itcrucible.com Wed Jul 12 11:43:36 2006
From: kweller at itcrucible.com (Kevin Weller)
Date: Wed, 12 Jul 2006 09:43:36 -0600
Subject: [Boulder-Denver Ruby Group] Decent HTML Parser?
In-Reply-To:
References: <44B40FA5.9090908@itcrucible.com> <44B43D31.1010902@comcast.net>
Message-ID: <44B518A8.70907@itcrucible.com>

D. Stimits wrote:
> Don't know about Ruby parsers, but in general, HTML is not XML, you
> can't parse it right unless you're lucky or someone writes their HTML
> knowing you're going to parse it as XML. On the other hand, if you're
> talking XHTML strict, you're in luck, it really is XML. Sadly, I don't
> know anything about Ruby XML parsers (I've been using C/C++). If you
> want to parse only XHTML, you can use anything with the DOM interface,
> and it'll be the same interface (DOM level 1 or 2 are standards, level 3
> I *think* is still in draft state). If it isn't XHTML strict, you should
> probably give up on a true XML interface.
>

Thanks for the input. It's not XHTML, and I'm aware that traditional
HTML is not XML-compliant, but I have in the past used XML parsers in
other languages to analyze HTML documents, with some success. In this
case I have managed to parse the thing with ruby-htmltools, extract a
REXML document, then XPath my way to the specific content that I want.
I would have preferred to select just the text I want to parse, then
feed that to the parser, but htmltools won't accept anything less than
perfectly balanced HTML...perhaps it won't accept anything less than a
complete HTML document, but I have yet to test the null hypothesis on a
partial-but-balanced subdoc.

> But I'm sort of interested in the same question, from a different angle.
> I'll take a risk and rephrase it and hope it's what you really mean. How
> many ways are there to access the DOM API of an XHTML strict document
> using Ruby? Typically I'd expect this to be done in
> JavaScript/ECMAScript from the web browser, but on occasion there is a
> need to rework something like this from the server. XSLT could be
> useful, but a simple DOM interface should have a much lower learning
> curve. Are you trying to use this at server side? Do you use Rails?
>

There is no browser in this case...I'm working with static HTML files
that have been pre-generated by an external (4GL) application. I
basically need to convert specific elements within the HTML tables into
CSV files that can be uploaded to another application. There is no need
for an ongoing feed from one system to the other...this is a one-time
operation, so a simple batch conversion is all I'm after.

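For reference, the table-to-CSV step described above could be sketched
roughly as follows. This is only an illustration under stated
assumptions, not the code used in this thread: it assumes the table
fragment has already been tidied into well-formed markup (REXML rejects
ordinary unbalanced HTML), and the sample rows, the SAMPLE constant and
the table_to_csv helper are all made up for the example.

```ruby
require "rexml/document"
require "csv"

# Hypothetical sample data: a well-formed table fragment. Real-world
# HTML usually has to be cleaned into balanced markup before REXML
# will accept it.
SAMPLE = <<~HTML
  <table>
    <tr><th>Name</th><th>City</th></tr>
    <tr><td>Alice</td><td>Boulder</td></tr>
    <tr><td>Bob</td><td>Denver, CO</td></tr>
  </table>
HTML

# Hypothetical helper: turn every <tr> into one CSV row, keeping the
# cells in the order they appear in the document.
def table_to_csv(markup)
  doc = REXML::Document.new(markup)
  CSV.generate do |csv|
    REXML::XPath.each(doc, "//tr") do |row|
      # Only direct <th>/<td> children count as cells.
      cells = row.elements.to_a.select { |e| %w[th td].include?(e.name) }
      csv << cells.map { |cell| cell.text.to_s.strip }
    end
  end
end

puts table_to_csv(SAMPLE)
```

Letting the CSV library build each row means fields that contain commas
(such as "Denver, CO" in the sample) are quoted automatically, which
hand-joined strings would get wrong.
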
I suppose XSLT might do it, but the interpretation of the data in the
HTML can be a bit tricky (again, it's not XML, and there is some
variability in the placement of specific fields). So I suspect that
just writing an interpretive algorithm in Ruby would be easiest, though
admittedly I have [as yet] too little direct experience with XSLT to
know how relatively easy/hard it would be.

--
Kevin Weller
Information Technology Crucible
http://www.itcrucible.com

From mghaught at gmail.com Thu Jul 13 21:55:34 2006
From: mghaught at gmail.com (Marty Haught)
Date: Thu, 13 Jul 2006 19:55:34 -0600
Subject: [Boulder-Denver Ruby Group] Boulder-Denver Ruby Group - July 19th
Message-ID: <57f29e620607131855j2f381a7eo2e07e929880239d7@mail.gmail.com>

Hi Guys,

Sorry about the delay in getting out the details of our July meeting.
I've been really busy with many other things that have kept this email
from going out earlier.

We'll hold our next meeting on Wednesday, July 19th at a new location.
Collective Intellect has been hosting us since February and moved out
of the Mobius building to the Pearl Street mall area in Boulder.
They've invited us to have the meeting in their new space. We'll hold
the July meeting there, but we're still trying to figure out the best
location going forward. We greatly appreciate Collective Intellect's
continued support. The directions to the new location are listed below.

We'll be going with 'summer hours' for our July meeting, starting at
7pm. Chad Fowler will be presenting on Gems starting shortly after some
announcements. We'll also talk briefly about RubyConf, which will be
held in Denver this October.

I hope to see you there next Wednesday.

Cheers,
Marty

Directions:

Collective Intellect
1414 Pearl St., Suite 200
Boulder, CO 80302

It is located on the east end of the walking mall above a store called
"Baby Doll" on the south side. The Collective Intellect name is on the
door of a stairway leading up to the office.

URL to Google Maps: http://tinyurl.com/s4zng

From kweller at itcrucible.com Sun Jul 30 16:21:52 2006
From: kweller at itcrucible.com (Kevin Weller)
Date: Sun, 30 Jul 2006 14:21:52 -0600
Subject: [Boulder-Denver Ruby Group] Decent HTML Parser?
In-Reply-To: <44B518A8.70907@itcrucible.com>
References: <44B40FA5.9090908@itcrucible.com> <44B43D31.1010902@comcast.net>
 <44B518A8.70907@itcrucible.com>
Message-ID: <44CD14E0.40001@itcrucible.com>

An HTML attachment was scrubbed...
URL: http://rubyforge.org/pipermail/bdrg-members/attachments/20060730/f1dd2c75/attachment.html

From kweller at itcrucible.com Sun Jul 30 16:25:25 2006
From: kweller at itcrucible.com (Kevin Weller)
Date: Sun, 30 Jul 2006 14:25:25 -0600
Subject: [Boulder-Denver Ruby Group] Ruby Conference
In-Reply-To: <44CD14E0.40001@itcrucible.com>
References: <44B40FA5.9090908@itcrucible.com> <44B43D31.1010902@comcast.net>
 <44B518A8.70907@itcrucible.com> <44CD14E0.40001@itcrucible.com>
Message-ID: <44CD15B5.8010905@itcrucible.com>

How do I sign up for this upcoming Ruby conference I heard about at the
last meeting? Is there a need for any more volunteers?

Kevin Weller
Information Technology Crucible
www.itcrucible.com

From mghaught at gmail.com Sun Jul 30 19:17:05 2006
From: mghaught at gmail.com (Marty Haught)
Date: Sun, 30 Jul 2006 17:17:05 -0600
Subject: [Boulder-Denver Ruby Group] Ruby Conference
In-Reply-To: <44CD15B5.8010905@itcrucible.com>
References: <44B40FA5.9090908@itcrucible.com> <44B43D31.1010902@comcast.net>
 <44B518A8.70907@itcrucible.com> <44CD14E0.40001@itcrucible.com>
 <44CD15B5.8010905@itcrucible.com>
Message-ID: <57f29e620607301617t60851e39n5347a80d767f6019@mail.gmail.com>

On 7/30/06, Kevin Weller wrote:
> How do I sign up for this upcoming Ruby conference I heard about at the
> last meeting? Is there a need for any more volunteers?

Chad will announce on the list when registration opens as it has not
started yet. I suspect http://www.rubyconf.org/ will have a link to a
registration page once it's available. AFAIK, they have all the
volunteers they need.

Cheers,
Marty

From chad at chadfowler.com Mon Jul 31 00:11:17 2006
From: chad at chadfowler.com (Chad Fowler)
Date: Sun, 30 Jul 2006 22:11:17 -0600
Subject: [Boulder-Denver Ruby Group] Ruby Conference
In-Reply-To: <57f29e620607301617t60851e39n5347a80d767f6019@mail.gmail.com>
References: <44B40FA5.9090908@itcrucible.com> <44B43D31.1010902@comcast.net>
 <44B518A8.70907@itcrucible.com> <44CD14E0.40001@itcrucible.com>
 <44CD15B5.8010905@itcrucible.com>
 <57f29e620607301617t60851e39n5347a80d767f6019@mail.gmail.com>
Message-ID:

FYI: Registration will open on August 2nd. Stay alert! :)

Chad

On Jul 30, 2006, at 5:17 PM, Marty Haught wrote:

> On 7/30/06, Kevin Weller wrote:
>> How do I sign up for this upcoming Ruby conference I heard about at the
>> last meeting? Is there a need for any more volunteers?
>
> Chad will announce on the list when registration opens as it has not
> started yet. I suspect http://www.rubyconf.org/ will have a link to a
> registration page once it's available. AFAIK, they have all the
> volunteers they need.
>
> Cheers,
> Marty

From nshb at inimit.com Mon Jul 31 12:05:29 2006
From: nshb at inimit.com (Nathaniel Brown)
Date: Mon, 31 Jul 2006 09:05:29 -0700
Subject: [Boulder-Denver Ruby Group] Ruby Conference
In-Reply-To:
References: <44B40FA5.9090908@itcrucible.com> <44B43D31.1010902@comcast.net>
 <44B518A8.70907@itcrucible.com> <44CD14E0.40001@itcrucible.com>
 <44CD15B5.8010905@itcrucible.com>
 <57f29e620607301617t60851e39n5347a80d767f6019@mail.gmail.com>
Message-ID:

http://rubyconfmi.org as well :)

On 7/30/06, Chad Fowler wrote:
>
> FYI: Registration will open on August 2nd. Stay alert! :)
>
> Chad
>
>
> On Jul 30, 2006, at 5:17 PM, Marty Haught wrote:
>
> > On 7/30/06, Kevin Weller wrote:
> >> How do I sign up for this upcoming Ruby conference I heard about at the
> >> last meeting? Is there a need for any more volunteers?
> >
> > Chad will announce on the list when registration opens as it has not
> > started yet. I suspect http://www.rubyconf.org/ will have a link to a
> > registration page once it's available. AFAIK, they have all the
> > volunteers they need.
> >
> > Cheers,
> > Marty
>

--
Kind regards,

Nathaniel Brown
President & CEO
Inimit Innovations Inc. - http://inimit.com