[libxml-devel] Disabling substitution of UTF-8 chars with entities

Dan Janowski danj at 3skel.com
Tue Nov 27 15:08:21 EST 2007


Encoding is not handled consistently in the extension, as my last
patch on the topic illustrates. While I have no doubt that there are
issues to resolve, I cannot reproduce your result in this particular
case.

Anyone wanting to look at the way encoding is handled is welcome to  
make a recommendation.

Dan

On Nov 27, 2007, at 11:41, Paul Dlug wrote:

> There is a serious inconsistency when "round tripping" XML containing
> UTF-8 characters. If you output the document to a string after parsing,
> you get the UTF-8 back out; if you just grab a node and convert it to a
> string, you get UTF-8 characters substituted with entities:
>
> utf8test.rb:
>
> require 'xml/libxml'
>
> xml = <<XML
> <?xml version="1.0" encoding="UTF-8"?>
> <title>This is a UTF-8 pi: π</title>
> XML
>
> parser = XML::Parser.new
> parser.string = xml
>
> doc = parser.parse
>
> puts doc.to_s
> puts doc.root.to_s
>
>
> This outputs:
>
> <?xml version="1.0" encoding="UTF-8"?>
> <title>This is a UTF-8 pi: π</title>
> <title>This is a UTF-8 pi: &#x3C0;</title>
>
>
> I would think that the behavior of to_s by default would be to write
> the XML out as a string just as it was parsed. Another variant should
> be provided if character conversion is desirable.
>
>
> --Paul
> _______________________________________________
> libxml-devel mailing list
> libxml-devel at rubyforge.org
> http://rubyforge.org/mailman/listinfo/libxml-devel
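
Until the two to_s paths agree, one possible stopgap for the behaviour
reported in the quoted message is to post-process the node string in
plain Ruby. The sketch below is only an illustration: the helper name
decode_non_ascii_refs is hypothetical and not part of libxml-ruby, and
it assumes the string holds ordinary element content (it does not skip
CDATA sections or comments). It turns numeric character references for
non-ASCII code points back into literal UTF-8 and leaves ASCII
references alone so that escaped markup characters stay escaped.

```ruby
# Hypothetical helper (not part of libxml-ruby): replace numeric character
# references for non-ASCII code points with the literal UTF-8 character.
def decode_non_ascii_refs(xml)
  xml.gsub(/&#(?:x([0-9a-fA-F]+)|([0-9]+));/) do |ref|
    # Hex form (&#x3C0;) is captured in $1, decimal form (&#960;) in $2.
    code = $1 ? $1.hex : $2.to_i
    # Keep ASCII references (e.g. &#x3C; for "<") and out-of-range values
    # untouched; only valid non-ASCII code points are converted.
    if code > 0x7F && code <= 0x10FFFF
      [code].pack('U')
    else
      ref
    end
  end
end

# The node string from the report, as produced by doc.root.to_s.
serialized = '<title>This is a UTF-8 pi: &#x3C0;</title>'
puts decode_non_ascii_refs(serialized)
```

With the string from the report as input, this prints the title element
with the literal character in place of the reference. It works around
the symptom only; it does not change how the extension serializes nodes.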


