How to correctly parse utf-8 xml with ElementTree?


Problem description

I need help understanding why parsing my xml file* with xml.etree.ElementTree produces the following errors.

*My test xml file contains Arabic characters.

Task: open and parse the file utf8_file.xml.

My first attempt:

import codecs
import xml.etree.ElementTree as etree
with codecs.open('utf8_file.xml', 'r', encoding='utf-8') as utf8_file:
    xml_tree = etree.parse(utf8_file)

Result 1:

UnicodeEncodeError: 'ascii' codec can't encode characters in position 236-238: ordinal not in range(128)

My second attempt:

import codecs
import xml.etree.ElementTree as etree
with codecs.open('utf8_file.xml', 'r', encoding='utf-8') as utf8_file:
    xml_string = etree.tostring(utf8_file, encoding='utf-8', method='xml')
    xml_tree = etree.fromstring(xml_string)

Result 2:

AttributeError: 'file' object has no attribute 'getiterator'

Please explain the errors above and comment on a possible solution.

Recommended answer

Leave decoding the bytes to the parser; do not decode first:

import xml.etree.ElementTree as etree
# Open in binary mode so the parser receives the raw, undecoded bytes.
with open('utf8_file.xml', 'rb') as xml_file:
    xml_tree = etree.parse(xml_file)

An XML file must carry enough information in its first line (the XML declaration) for the parser to decode it. If that header is missing, the parser must assume UTF-8.
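As a small illustration of that point, the sketch below feeds raw bytes to the parser in three ways: UTF-8 with a declaration, UTF-8 without one, and ISO-8859-1 with a declaration. The sample strings and variable names are made up for this example and are not taken from the asker's file; it assumes Python 3.

```python
import xml.etree.ElementTree as etree

# Hypothetical sample text (Arabic "marhaba"), not the asker's data.
arabic = '\u0645\u0631\u062d\u0628\u0627'

# UTF-8 bytes with an explicit XML declaration.
declared_utf8 = (
    '<?xml version="1.0" encoding="utf-8"?><msg>%s</msg>' % arabic
).encode('utf-8')

# UTF-8 bytes with no declaration: the parser falls back to UTF-8.
bare_utf8 = ('<msg>%s</msg>' % arabic).encode('utf-8')

# ISO-8859-1 bytes: the declaration tells the parser how to decode them.
declared_latin1 = (
    '<?xml version="1.0" encoding="iso-8859-1"?><msg>caf\u00e9</msg>'
).encode('iso-8859-1')

# In every case the parser is handed bytes and does the decoding itself.
text_declared = etree.fromstring(declared_utf8).text
text_bare = etree.fromstring(bare_utf8).text
text_latin1 = etree.fromstring(declared_latin1).text
```

In all three cases the caller never decodes anything; the element text comes back as a regular string.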

Because this information lives in the XML header, it is the parser's responsibility to do all of the decoding.

Your first attempt failed because the file object handed the parser already-decoded Unicode values, and Python then tried to encode them again (with the default ascii codec) so that the parser could work on byte strings as it expected. The second attempt failed because etree.tostring() expects a parsed element as its first argument, not a file object.
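For comparison, here is a hedged sketch of what the second attempt seems to have been aiming for: parse first, then serialize the resulting element with etree.tostring() and parse those bytes again. The helper name round_trip, the temporary file, and the sample document are assumptions made for this example; it assumes Python 3.

```python
import os
import tempfile
import xml.etree.ElementTree as etree

# Hypothetical sample document standing in for the asker's file.
SAMPLE = (
    '<?xml version="1.0" encoding="utf-8"?>'
    '<greeting>\u0645\u0631\u062d\u0628\u0627</greeting>'
)


def round_trip(path):
    """Parse a file, serialize its root element, and parse the result again."""
    # Binary mode: the parser receives raw bytes and decodes them itself.
    with open(path, 'rb') as xml_file:
        tree = etree.parse(xml_file)
    root = tree.getroot()
    # tostring() takes an Element, not a file object; with encoding='utf-8'
    # it returns UTF-8 encoded bytes.
    xml_bytes = etree.tostring(root, encoding='utf-8', method='xml')
    # fromstring() accepts those bytes directly.
    return root, etree.fromstring(xml_bytes)


# Write the sample to a temporary file so the sketch is self-contained.
tmp_dir = tempfile.mkdtemp()
path = os.path.join(tmp_dir, 'utf8_file.xml')
with open(path, 'wb') as f:
    f.write(SAMPLE.encode('utf-8'))

original, reparsed = round_trip(path)
```

Both elements carry the same tag and text, which shows that no manual decoding step is needed anywhere in the chain.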
