[libxml-devel] Disabling substitution of UTF-8 chars with entities
Paul Dlug
paul at aps.org
Tue Nov 27 15:24:48 EST 2007
On Nov 27, 2007, at 3:08 PM, Dan Janowski wrote:
> The handling of encoding is not coherent in the extension, as my
> last patch on the topic illustrates. While I have no doubt that
> there are issues to resolve, in this particular instance I do not
> get the result you do.
>
> Anyone wanting to look at the way encoding is handled is welcome to
> make a recommendation.
I just ran a few more experiments. It seems I only get this on Mac OS
X; it works just fine on FreeBSD and Linux (Gentoo). I'll do some more
digging to see if I can identify the cause.
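In the meantime, a possible workaround on an affected platform is to
post-process the string returned by the node. This is only a minimal
sketch in plain Ruby: decode_numeric_entities is a hypothetical helper,
not part of libxml-ruby, and it assumes the substituted entities are
numeric character references such as &#x3C0;.

```ruby
# Hypothetical helper (not part of libxml-ruby): converts numeric character
# references for non-ASCII code points back into UTF-8 characters.
# References to ASCII characters (e.g. &#x26; or &#60;) are left untouched,
# because decoding them could break the markup.
def decode_numeric_entities(xml)
  xml.gsub(/&#(x[0-9a-fA-F]+|[0-9]+);/) do |ref|
    digits = Regexp.last_match(1)
    # A leading "x" marks a hexadecimal reference; otherwise it is decimal.
    code = digits[0, 1] == 'x' ? digits[1..-1].hex : digits.to_i
    # Only decode valid non-ASCII code points; keep everything else as-is.
    (code > 0x7F && code <= 0x10FFFF) ? [code].pack('U') : ref
  end
end

# Assumed shape of the node output described in the original report.
node_xml = '<title>This is a UTF-8 pi: &#x3C0;</title>'
puts decode_numeric_entities(node_xml)
```

With the test script, this would be applied as
decode_numeric_entities(doc.root.to_s). It does not address the
underlying inconsistency, it only hides it.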
--Paul
> On Nov 27, 2007, at 11:41, Paul Dlug wrote:
>
>> There is a serious inconsistency when "round tripping" XML containing
>> UTF-8 characters. If you output the document to a string after
>> parsing, you get the UTF-8 back out; if you just grab a node and
>> convert it to a string, you get UTF-8 characters substituted with
>> entities:
>>
>> utf8test.rb:
>>
>> require 'xml/libxml'
>>
>> xml = <<XML
>> <?xml version="1.0" encoding="UTF-8"?>
>> <title>This is a UTF-8 pi: π</title>
>> XML
>>
>> parser = XML::Parser.new
>> parser.string = xml
>>
>> doc = parser.parse
>>
>> puts doc.to_s
>> puts doc.root.to_s
>>
>>
>> This outputs:
>>
>> <?xml version="1.0" encoding="UTF-8"?>
>> <title>This is a UTF-8 pi: π</title>
>> <title>This is a UTF-8 pi: &#x3C0;</title>
>>
>>
>> I would think that the behavior of to_s by default would be to write
>> the XML out as a string just as it was parsed. Another variant should
>> be provided if character conversion is desirable.
>>
>>
>> --Paul
>> _______________________________________________
>> libxml-devel mailing list
>> libxml-devel at rubyforge.org
>> http://rubyforge.org/mailman/listinfo/libxml-devel