How to get the opening and closing tags in Beautiful Soup from an HTML string?
Problem description
I am writing a Python script using Beautiful Soup, where I have to get the opening tag from a string containing some HTML code.
Here is my string:
string = "<p>...</p>"
I want to get <p> in a variable called opening_tag and </p> in a variable called closing_tag. I have searched the documentation but could not find a solution. Can anyone advise me on this?
Recommended answer
There is a way to do this with BeautifulSoup and a simple regular expression:
Put the paragraph in a BeautifulSoup object, e.g., soupParagraph.
Move the contents between the opening (<p>) and closing (</p>) tags to another BeautifulSoup object, e.g., soupInnerParagraph. (Because the contents are moved, they are not deleted.)
soupParagraph will then hold only the opening and closing tags.
Convert soupParagraph to HTML text and store the result in a string variable.
To get the opening tag, use a regular expression to remove the closing tag from that string variable.
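The steps above can be sketched for the <p> string from the question as follows. This is only an illustration under some assumptions: the helper name split_tags and the sample markup are hypothetical, the html.parser backend is assumed, and the first matching element is the one of interest.

```python
import re
from bs4 import BeautifulSoup


def split_tags(html, tag_name):
    """Hypothetical helper: return (opening_tag, closing_tag, inner_soup)."""
    soup = BeautifulSoup(html, "html.parser")
    tag = soup.find(tag_name)
    if tag is None:
        raise ValueError(f"no <{tag_name}> element found")

    # Move (not copy) every child into a separate soup object.
    # append() detaches the child from its old parent, so
    # tag.contents shrinks until it is empty.
    inner = BeautifulSoup("", "html.parser")
    while tag.contents:
        inner.append(tag.contents[0])

    # The element is now empty, so its text form is just the two tags.
    shell = tag.decode()

    # Locate the closing tag at the end of the string.
    closing_re = re.compile(rf"\s*(</{re.escape(tag.name)}\s*>)\s*$")
    match = closing_re.search(shell)
    if match is None:
        raise ValueError(f"closing </{tag.name}> tag not found")

    opening_tag = shell[:match.start()]
    closing_tag = match.group(1)
    return opening_tag, closing_tag, inner


if __name__ == "__main__":
    opening_tag, closing_tag, inner = split_tags(
        '<p class="intro">Hello <b>world</b></p>', "p"
    )
    print(opening_tag)
    print(closing_tag)
    print(inner)
```

Returning the moved contents alongside the two tags keeps the original markup recoverable, since nothing is discarded.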
In general, parsing HTML with a regular expression is problematic and usually best avoided. However, it may be reasonable here.
A closing tag is simple. It has no attributes, and a comment is not allowed within it.
Can I have attributes on closing tags?
HTML comment inside the opening tag of an element
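To illustrate why matching only the closing tag is the safer half of the problem, here is a small sketch that uses just the standard re module. The function name strip_closing_body and the sample string are hypothetical. The opening tag in the sample carries an attribute value containing ">", which would trip up a naive pattern for opening tags, while the closing tag has a fixed, attribute-free shape.

```python
import re

# A closing tag has no attributes, so a short pattern anchored at the
# end of the string is enough to find it.
CLOSING_BODY = re.compile(r"\s*</body\s*>\s*$")


def strip_closing_body(markup):
    """Hypothetical helper: drop the trailing </body> and return the rest."""
    opening, count = CLOSING_BODY.subn("", markup)
    if count != 1:
        raise ValueError("expected exactly one trailing </body> tag")
    return opening


if __name__ == "__main__":
    sample = '<body class="home" data-x="a>b"></body>\n'
    print(strip_closing_body(sample))  # <body class="home" data-x="a>b">
```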
This code gets the opening tag from a <body...> ... </body> section. The code has been tested.
import re
from bs4 import BeautifulSoup

# The variable "body" is a BeautifulSoup object that contains a <body> section.
bodyInnerHtml = BeautifulSoup("", "html.parser")
bodyContentsList = body.contents
for i in range(len(bodyContentsList)):
    # .append moves the HTML element from body to bodyInnerHtml,
    # so index 0 always refers to the next remaining child
    bodyInnerHtml.append(bodyContentsList[0])
# Convert the <body> opening and closing tags to HTML text format
bodyTags = body.decode(formatter="html")
# Extract the opening tag by removing the closing tag
regex = r"(\s*</body\s*>\s*$)"
substitution = ""
bodyOpeningTag, substitutionCount = re.subn(regex, substitution, bodyTags, count=0, flags=re.M)
if substitutionCount != 1:
    print("")
    print("ERROR. The expected HTML </body> tag was not found.")
