[libxml-devel] Disabling substitution of UTF-8 chars with entities
Dan Janowski
danj at 3skel.com
Tue Nov 27 15:08:21 EST 2007
The handling of encoding is not coherent in the extension, as my last
patch on the topic illustrates. While I have no doubt that there are
issues to resolve, in this particular instance I do not get the
result you do.
Anyone wanting to look at the way encoding is handled is welcome to
make a recommendation.
Dan
On Nov 27, 2007, at 11:41, Paul Dlug wrote:
> There is a serious inconsistency when "round tripping" XML containing
> UTF-8 characters. If you output the document to a string after parsing,
> you get the UTF-8 back out; if you just grab a node and convert it to a
> string, the UTF-8 characters are replaced with entities:
>
> utf8test.rb:
>
> require 'xml/libxml'
>
> xml = <<XML
> <?xml version="1.0" encoding="UTF-8"?>
> <title>This is a UTF-8 pi: π</title>
> XML
>
> parser = XML::Parser.new
> parser.string = xml
>
> doc = parser.parse
>
> puts doc.to_s
> puts doc.root.to_s
>
>
> This outputs:
>
> <?xml version="1.0" encoding="UTF-8"?>
> <title>This is a UTF-8 pi: π</title>
> <title>This is a UTF-8 pi: &#x3C0;</title>
>
>
> I would think that the behavior of to_s by default would be to write
> the XML out as a string just as it was parsed. Another variant should
> be provided if character conversion is desirable.
>
>
> --Paul
> _______________________________________________
> libxml-devel mailing list
> libxml-devel at rubyforge.org
> http://rubyforge.org/mailman/listinfo/libxml-devel
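
Until the serialization itself is made consistent, one possible stopgap is to
post-process the string returned for a node. The sketch below is only an
illustration: decode_numeric_entities is a hypothetical helper, not part of
the libxml extension, and it assumes the node output contains numeric
character references of the kind shown in the report above. It uses only the
Ruby standard library and does not change how the extension writes XML.

```ruby
# Hypothetical helper (not part of libxml-ruby): turns numeric character
# references for non-ASCII code points back into literal UTF-8 characters.
def decode_numeric_entities(xml)
  xml.gsub(/&#(x[0-9A-Fa-f]+|[0-9]+);/) do |ref|
    digits = Regexp.last_match(1)
    codepoint = digits.start_with?('x') ? digits[1..-1].to_i(16) : digits.to_i
    # Leave ASCII references (such as &#38; for "&") untouched so the markup
    # stays well-formed; skip values outside the Unicode range as well.
    (0x80..0x10FFFF).cover?(codepoint) ? [codepoint].pack('U') : ref
  end
end

# Sample string modeled on the node output quoted above (an assumption,
# not captured from a real run of the extension).
node_output = '<title>This is a UTF-8 pi: &#x3C0;</title>'
puts decode_numeric_entities(node_output)
```

Named entities such as &amp; are left alone on purpose, since decoding them
would change the meaning of the markup rather than just its encoding.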