[Wtr-general] HTML Pages with bad XML

John Castellucci johnc at testdev.net
Wed Jan 31 15:17:55 EST 2007


Howdy all,

I'm working on a short project where I am parsing a page that happens to
contain some nodes that cause REXML to die -- some specific examples are:

<page _extended="true" user:="user:" per="per" Views="Views" />
<gis@5r _extended="true" />
<j _extended="true" 221,546="221,546" />

The nodes with @, : and , all throw:

c:/ruby/lib/ruby/site_ruby/1.8/rexml/parsers/treeparser.rb:90:in `parse':
#<REXML::ParseException: malformed XML: missing tag start
(REXML::ParseException)


I've hacked in a workaround (see below) that massages the HTML source
before passing it to REXML, but then I have to search the resulting
REXML Document object myself for the nodes I am looking for (instead of
using the spiffy IE.elements_by_xpath).

Any tips on getting Watir to be happy with lousy XML source?

--john



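# Hack for the Watir::IE object: return an XML document that has been
# scrubbed of offending node names from the HTML source.
#
# Note: the substitutions below run over the whole source, so they also
# change page text and attribute values, not only node names.
require 'rexml/document'

module Watir
	class IE
		def xml_source
			# Wrap the serialized body in an XML declaration and a root element.
			xml = html_source(document.body, "<?xml version=\"1.0\" encoding=\"us-ascii\"?>\n<HTML>\n", " ")
			xml += "\n</HTML>\n"
			xml = xml.gsub(/&nbsp;/, '&#160;')  # &nbsp; is not a predefined XML entity
			xml = xml.gsub(/user:/, 'user')     # drop the trailing colon from the name
			xml = xml.gsub(/@/, '_')            # @ is not allowed in a name
			xml = xml.gsub(/,/, '')             # neither is a comma
			REXML::Document.new(xml)
		end
	end
end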
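
Because the gsub calls in the hack run over the whole page, they also
strip commas and @ signs out of ordinary text. A rough sketch of one way
around that is to scrub only what sits between < and >, then query the
result with REXML::XPath directly. scrub_tags is a made-up helper name
and is not part of Watir; the sample markup is a cut-down stand-in for
the real page, and the tag pattern assumes no attribute value contains
a > character.

```ruby
require 'rexml/document'

# Hypothetical helper (not part of Watir): apply the same substitutions
# as the hack, but only inside tags, so page text is left untouched.
def scrub_tags(source)
  source.gsub(/<[^>]*>/) do |tag|
    tag.gsub(/user:/, 'user').gsub(/@/, '_').gsub(/,/, '')
  end
end

# Cut-down stand-in for the problem page.
raw = <<'XML'
<HTML>
<page _extended="true" user:="user:" per="per" Views="Views" />
<gis@5r _extended="true" />
<p>Totals: 1,234 for a@b</p>
</HTML>
XML

doc  = REXML::Document.new(scrub_tags(raw))

# The renamed node can now be found by XPath on the REXML document.
node = REXML::XPath.first(doc, '//gis_5r')

# Text outside of tags keeps its comma and @ sign.
text = REXML::XPath.first(doc, '//p').text
```

This still means calling REXML::XPath by hand rather than going through
IE.elements_by_xpath, so it narrows the damage done by the scrubbing
without answering the original question.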


