Problem description
I am trying to get the whole content between an opening XML tag and its closing counterpart.
Getting the content in simple cases like title below is easy, but how can I get the whole content between the tags when mixed content is used and I want to preserve the inner tags?
<?xml version="1.0" encoding="UTF-8"?>
<review>
<title>Some testing stuff</title>
<text sometimes="attribute">Some text with <extradata>data</extradata> in it.
It spans <sometag>multiple lines: <tag>one</tag>, <tag>two</tag>
or more</sometag>.</text>
</review>
What I want is the content between the two text tags, including any inner tags: Some text with <extradata>data</extradata> in it. It spans <sometag>multiple lines: <tag>one</tag>, <tag>two</tag> or more</sometag>.
For now I use regular expressions, but it gets kind of messy and I don't like this approach. I lean towards an XML-parser-based solution. I looked at minidom, etree, lxml and BeautifulSoup but couldn't find a solution for this case (whole content, including inner tags).
Recommended answer
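from lxml import etree

# Bytes literal: on Python 3, lxml rejects str input that carries an encoding declaration.
t = etree.XML(
b"""<?xml version="1.0" encoding="UTF-8"?>
<review>
<title>Some testing stuff</title>
<text>Some text with <extradata>data</extradata> in it.</text>
</review>"""
)
# encoding="unicode" makes tostring return str (not bytes), so it can be joined with t.text.
(t.text + ''.join(etree.tostring(child, encoding="unicode") for child in t)).strip()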
The trick here is that t is iterable and, when iterated, yields its child elements. Because etree does not represent text as separate nodes, you also need to recover the text before the first child tag with t.text. Text that follows a child is kept on that child's tail, which etree.tostring includes by default.
In [50]: (t.text + ''.join(etree.tostring(child, encoding="unicode") for child in t)).strip()
Out[50]: '<title>Some testing stuff</title>\n<text>Some text with <extradata>data</extradata> in it.</text>'
Or:
In [6]: e = t.xpath('//text')[0]
In [7]: (e.text + ''.join(etree.tostring(child, encoding="unicode") for child in e)).strip()
Out[7]: 'Some text with <extradata>data</extradata> in it.'
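The same idea appears to carry over to the standard library's `xml.etree.ElementTree`, whose `tostring` also writes each element's tail. Below is a minimal sketch under that assumption, applied to the mixed-content document from the question. The helper name `inner_xml` is hypothetical, and the `or ""` guard is an addition for elements that start directly with a child tag, where `.text` is `None`.

```python
import xml.etree.ElementTree as ET


def inner_xml(element):
    """Return the markup between an element's opening and closing tags."""
    # element.text is None when the element starts with a child tag.
    parts = [element.text or ""]
    # ET.tostring includes each child's tail, i.e. the text that follows it.
    parts.extend(ET.tostring(child, encoding="unicode") for child in element)
    return "".join(parts)


# The asker's sample document (XML declaration omitted for brevity).
doc = """<review>
<title>Some testing stuff</title>
<text sometimes="attribute">Some text with <extradata>data</extradata> in it.
It spans <sometag>multiple lines: <tag>one</tag>, <tag>two</tag>
or more</sometag>.</text>
</review>"""

root = ET.fromstring(doc)
content = inner_xml(root.find("text"))
print(content)
```

Note that the result is re-serialized by the parser, so details such as entity escaping or empty-element syntax may differ from the original source text.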