Dissecting HTML: The Art of Manipulating Elements with the BeautifulSoup Module

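In recent years, web crawlers have been widely used in fields such as data analysis and web development. Python is a popular programming language with strong support for web scraping. In Python, the BeautifulSoup module can be used to extract useful information from web pages; this article introduces some commonly used methods.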

Installing BeautifulSoup

First, install BeautifulSoup locally. Open a terminal or command prompt window and enter the following command:

pip install beautifulsoup4

Parser comparison

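The table below lists the main parsers available to the BeautifulSoup4 module in Python, along with how to use them and their advantages and disadvantages: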

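Parser	Usage	Advantages	Disadvantages
lxml	`BeautifulSoup(markup, 'lxml')`	Very fast and efficient	Requires installing the separate lxml package, which depends on C libraries
html.parser	`BeautifulSoup(markup, 'html.parser')`	Part of Python's standard library, so nothing extra to install	Not as fast as lxml, less lenient than html5lib
html5lib	`BeautifulSoup(markup, 'html5lib')`	Builds a complete HTML document model	Slow and memory-hungry; requires installing the separate html5lib package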

Note: the markup argument in the Usage column stands for the HTML document to be parsed.

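Note that when no parser is specified, BeautifulSoup4 picks the best parser installed in your environment, so the same code can behave differently on different machines. Naming the parser explicitly avoids this. In most cases, lxml and html.parser are the most commonly used parsers.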

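Choose a parser according to your needs. If speed is your main concern, lxml is a good choice. If convenience and compatibility come first, choose html.parser. If you need to parse incomplete or damaged HTML documents and build a complete document model, html5lib is a good choice.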
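
As a minimal sketch of the selection advice above, the helper below tries parsers in order of preference and falls back to the built-in html.parser when lxml or html5lib is not installed. The function name `make_soup` and the preference order are assumptions made for illustration; they are not part of BeautifulSoup.

```python
from bs4 import BeautifulSoup, FeatureNotFound

# Preference order is an assumption for this sketch: the fast parser first,
# the always-available standard-library parser last.
PREFERRED_PARSERS = ("lxml", "html5lib", "html.parser")


def make_soup(markup, parsers=PREFERRED_PARSERS):
    """Parse markup with the first parser in `parsers` that is installed."""
    for name in parsers:
        try:
            return BeautifulSoup(markup, name)
        except FeatureNotFound:
            # This parser is not installed; try the next one.
            continue
    raise RuntimeError("none of the requested parsers is installed")


# Deliberately unclosed tags: each parser repairs them, although the
# resulting tree can differ slightly from parser to parser.
soup = make_soup("<p>Hello <b>world")
print(soup.p.get_text())
```

Passing the parser name explicitly, as here, keeps the behaviour predictable on machines where different parsers happen to be installed.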

Kinds of objects

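The main kinds of objects in the BeautifulSoup4 module are the BeautifulSoup object, the Tag object, the NavigableString object, and the Comment object. Each is described below together with how it is used: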

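BeautifulSoup object:
- Description: a BeautifulSoup object represents the whole parsed document and provides many methods and attributes for traversing and searching its elements and content.
- Usage:
  - Creating a BeautifulSoup object: call BeautifulSoup(markup, parser), where markup is the document to parse and parser is the name of the parser.
  - Methods and attributes: use find(), find_all(), select() and similar methods to look up elements in the document, use prettify() to produce a nicely indented view of the document structure, and use the various attributes to access information about the document.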
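Tag object:
- Description: a Tag object represents a tag in an HTML or XML document. It holds the tag's name and attributes as well as the child tags and text inside it.
- Usage:
  - Creating a Tag object: usually obtained through find(), find_all() and similar methods; a new tag can also be created with new_tag().
  - Methods and attributes: use the name attribute to get the tag name, use ['attr'] or the attrs attribute to read the tag's attributes, use find() and find_all() to search for child tags inside the tag, use the text attribute to get the text inside the tag, and use append(), insert() and similar methods to insert new tags.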
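NavigableString object:
- Description: a NavigableString object represents a piece of text in the parsed document. It wraps a string and adds extra attributes and methods for navigating the tree.
- Usage:
  - Creating a NavigableString object: usually obtained by reading a tag's string attribute.
  - Methods and attributes: NavigableString is a subclass of str, so str() gives the plain string content and the usual string methods such as strip() and split() work on it; use replace_with() to replace the string inside the document.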
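Comment object:
- Description: a Comment object represents a comment in an HTML or XML document. It wraps the comment's content and is a special kind of NavigableString.
- Usage:
  - Creating a Comment object: usually obtained when reading comment content from the parsed document, for example through the string attribute of a tag whose only child is a comment.
  - Methods and attributes: the object itself holds the comment text, and replace_with() can be used to replace the comment.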

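That covers the main kinds of objects in the BeautifulSoup4 module and how to use them. Combining these objects and their methods makes it straightforward to process HTML/XML documents, extract the data you need, or modify the document.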
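
The four kinds of objects can be seen side by side in a short sketch. The markup string below is a made-up sample chosen only so that one tag contains both a text node and a comment; the variable names are illustrative.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment, NavigableString, Tag

# Made-up sample markup: one tag holding a text node and a comment.
markup = '<p class="intro">Hello<!-- a note --></p>'
soup = BeautifulSoup(markup, "html.parser")

p = soup.p                            # Tag
text_node, comment_node = p.contents  # NavigableString, Comment

print(type(soup).__name__)          # BeautifulSoup
print(p.name, p["class"], p.attrs)  # tag name and attributes
print(isinstance(text_node, NavigableString), str(text_node))
print(isinstance(comment_node, Comment), str(comment_node))

# Modify the tree: build a new tag, append it, and replace the text node.
link = soup.new_tag("a", href="/archives/")
link.string = "Archive"
p.append(link)
text_node.replace_with("Hi")
print(p)
```

Note that `p["class"]` is a list, because BeautifulSoup treats `class` as a multi-valued attribute in HTML.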

Usage in detail

Fetching a web page

Fetch the page with the requests module:

import requests
url = "http://www.cheneyblog.com/"
response = requests.get(url)
html = response.text

Here we use www.cheneyblog.com as an example site.

Parsing HTML

We can use the BeautifulSoup module to parse the HTML:

from bs4 import BeautifulSoup
soup = BeautifulSoup(html, "html.parser")

Finding a tag

We can look up a specific tag in the page, such as <title> or <p>:

title = soup.title

We can also get the tag's text:

title_text = title.text
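
Attribute-style access such as `soup.title` returns only the first matching tag, and returns None when the tag is absent. The sketch below illustrates this on an inline HTML string, which is an assumption made so that the example runs without a network request.

```python
from bs4 import BeautifulSoup

# Inline sample markup (an assumption for illustration).
html = (
    "<html><head><title>Example page</title></head>"
    "<body><p>First</p><p>Second</p></body></html>"
)
soup = BeautifulSoup(html, "html.parser")

title = soup.title            # first <title> tag, or None if absent
first_p = soup.find("p")      # same tag as soup.p: only the first match
missing = soup.find("table")  # no <table> in the markup, so None

# Guard against None before asking for the text.
title_text = title.get_text() if title is not None else ""
print(title_text)
print(first_p.get_text())
print(missing)
```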

Finding all matching tags

To get every occurrence of a particular tag on a page, use the find_all() method:

all_items = soup.find_all('a')
# <a href="/archives/">
#     <div class="headline">文章</div>
#     <div class="length-num">98</div>
# </a>

The text of one of these tags can be obtained as follows:

item_text = all_items[0].text
# 文章
# 98
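
find_all() returns a list-like result, so a common pattern is to loop over it and collect an attribute and the cleaned-up text of each tag. The following is a minimal sketch on inline markup; the first link mirrors the snippet above, while the second link is invented for illustration.

```python
from bs4 import BeautifulSoup

# The second <a> element is a made-up sample entry.
html = """
<a href="/archives/">
    <div class="headline">文章</div>
    <div class="length-num">98</div>
</a>
<a href="/tags/">
    <div class="headline">标签</div>
    <div class="length-num">12</div>
</a>
"""
soup = BeautifulSoup(html, "html.parser")

links = []
for a in soup.find_all("a"):
    links.append({
        "href": a.get("href"),  # None if the attribute is missing
        # Join the inner strings with a space and drop the whitespace
        # that comes from the indentation in the markup.
        "text": a.get_text(" ", strip=True),
    })

for link in links:
    print(link)

# limit caps the number of results returned.
first_only = soup.find_all("a", limit=1)
```

Compared with `.text`, calling `get_text(" ", strip=True)` avoids the newlines and indentation that would otherwise end up in the extracted string.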

Finding tags by class name

Sometimes the tags on a site have no ID or other unique identifier. In that case, all matching tags can be found by class name:

all_items = soup.find_all('ul', class_="menus_item_child")
print(all_items[0])

# <ul class="menus_item_child">
#     <li><a class="site-page child" href="/categories/HtmlCss"><i
#             class="fa-fw fa fa-area-chart"></i><span> Html/Css</span></a></li>
# </ul>

We can also search by other attributes, for example:

all_items = soup.find_all('ul', attrs={"class": "card-category-list", "id": "aside-cat-list"})
# <ul class="card-category-list" id="aside-cat-list">
#     <li class="card-category-list-item"><a class="card-category-list-link" href="/categories/Digitalize/"><span
#             class="card-category-list-name">Digitalize</span><span class="card-category-list-count">1</span></a>
#     </li>
#     <li class="card-category-list-item"><a class="card-category-list-link" href="/categories/Docker/"><span
#             class="card-category-list-name">Docker</span><span class="card-category-list-count">3</span></a></li>
#     <li class="card-category-list-item"><a class="card-category-list-link" href="/categories/Hexo/"><span
#             class="card-category-list-name">Hexo</span><span class="card-category-list-count">1</span></a></li>
#     <li class="card-category-list-item"><a class="card-category-list-link" href="/categories/HtmlCss/"><span
#             class="card-category-list-name">HtmlCss</span><span class="card-category-list-count">1</span></a></li>
#     <li class="card-category-list-item"><a class="card-category-list-link" href="/categories/Internet/"><span
#             class="card-category-list-name">Internet</span><span class="card-category-list-count">1</span></a></li>
#     <li class="card-category-list-item"><a class="card-category-list-link" href="/categories/Java/"><span
#             class="card-category-list-name">Java</span><span class="card-category-list-count">15</span></a></li>
#     <li class="card-category-list-item"><a class="card-category-list-link" href="/categories/Oracle/"><span
#             class="card-category-list-name">Oracle</span><span class="card-category-list-count">43</span></a></li>
#     <li class="card-category-list-item"><a class="card-category-list-link" href="/categories/Python/"><span
#             class="card-category-list-name">Python</span><span class="card-category-list-count">14</span></a></li>
# </ul>
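
The same lookups can be written in several equivalent ways: with the class_ keyword, with an attrs dictionary, with the id keyword, or with a CSS selector through select(). The sketch below compares them on a trimmed-down copy of the category list above, reduced to two entries for brevity.

```python
from bs4 import BeautifulSoup

# Trimmed-down sample based on the category list shown above.
html = """
<ul class="card-category-list" id="aside-cat-list">
    <li class="card-category-list-item"><a class="card-category-list-link" href="/categories/Docker/"><span
            class="card-category-list-name">Docker</span><span class="card-category-list-count">3</span></a></li>
    <li class="card-category-list-item"><a class="card-category-list-link" href="/categories/Python/"><span
            class="card-category-list-name">Python</span><span class="card-category-list-count">14</span></a></li>
</ul>
"""
soup = BeautifulSoup(html, "html.parser")

by_class = soup.find_all("span", class_="card-category-list-name")
by_attrs = soup.find_all("ul", attrs={"class": "card-category-list", "id": "aside-cat-list"})
by_id = soup.find(id="aside-cat-list")
by_css = soup.select("ul#aside-cat-list span.card-category-list-name")

names = [span.get_text() for span in by_css]

# Build a {category name: article count} mapping from each list item.
counts = {
    li.select_one(".card-category-list-name").get_text():
        int(li.select_one(".card-category-list-count").get_text())
    for li in soup.select("li.card-category-list-item")
}
print(names)
print(counts)
```

The keyword is spelled `class_` because `class` is a reserved word in Python.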

Worked example

Suppose we want to extract all news titles and links from a web page. The following example code does this with the BeautifulSoup module:

from bs4 import BeautifulSoup
import requests

header = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 "
                  "Safari/537.36 Edg/120.0.0.0"}

# Send the HTTP request and fetch the page content
response = requests.get('https://www.baidu.com', headers=header)

# Create the BeautifulSoup object
soup = BeautifulSoup(response.text, 'html.parser')

# Locate the list that holds all news titles and links
news_list = soup.find('ul', class_='s-hotsearch-content')
#
# <ul class="s-hotsearch-content" id="hotsearch-content-wrapper">
#     <li class="hotsearch-item odd" data-index="0"><a class="title-content c-link c-font-medium c-line-clamp1"
#                                                      href="https://www.baidu.com/s?wd=%E7%94%A8%E5%A5%BD%E2%80%9C%E6%94%B9%E9%9D%A9%E5%BC%80%E6%94%BE%E2%80%9D%E8%BF%99%E5%85%B3%E9%94%AE%E4%B8%80%E6%8B%9B&amp;sa=fyb_n_homepage&amp;rsv_dl=fyb_n_homepage&amp;from=super&amp;cl=3&amp;tn=baidutop10&amp;fr=top1000&amp;rsv_idx=2&amp;hisfilter=1"
#                                                      target="_blank">
#         <div class="title-content-noindex" style="display: none;"></div>
#         <i class="c-icon title-content-top-icon c-color-red c-gap-right-small" style="display: ;"></i><span
#             class="title-content-index c-index-single c-index-single-hot0" style="display: none;">0</span><span
#             class="title-content-title">用好“改革开放”这关键一招</span></a><span
#             class="title-content-mark ie-vertical c-text c-gap-left-small"></span></li>
#     <li class="hotsearch-item even" data-index="3"><a class="title-content c-link c-font-medium c-line-clamp1"
#                                                       href="https://www.baidu.com/s?wd=%E7%BE%8E%E4%B8%BD%E4%B9%A1%E6%9D%91+%E5%B9%B8%E7%A6%8F%E7%94%9F%E6%B4%BB&amp;sa=fyb_n_homepage&amp;rsv_dl=fyb_n_homepage&amp;from=super&amp;cl=3&amp;tn=baidutop10&amp;fr=top1000&amp;rsv_idx=2&amp;hisfilter=1"
#                                                       target="_blank">
#         <div class="title-content-noindex" style="display: none;"></div>
#         <i class="c-icon title-content-top-icon c-color-red c-gap-right-small" style="display: none;"></i><span
#             class="title-content-index c-index-single c-index-single-hot3" style="display: ;">3</span><span
#             class="title-content-title">美丽乡村 幸福生活</span></a><span
#             class="title-content-mark ie-vertical c-text c-gap-left-small"></span></li>
#     <li class="hotsearch-item odd" data-index="1"><a class="title-content c-link c-font-medium c-line-clamp1"
#                                                      href="https://www.baidu.com/s?wd=%E4%B8%BB%E6%8C%81%E4%BA%BA%E9%9F%B3%E4%B9%90%E4%BC%9A%E6%B1%82%E5%A9%9A%E8%A7%82%E4%BC%97%E9%BD%90%E5%96%8A%E9%80%80%E7%A5%A8&amp;sa=fyb_n_homepage&amp;rsv_dl=fyb_n_homepage&amp;from=super&amp;cl=3&amp;tn=baidutop10&amp;fr=top1000&amp;rsv_idx=2&amp;hisfilter=1"
#                                                      target="_blank">
#         <div class="title-content-noindex" style="display: none;"></div>
#         <i class="c-icon title-content-top-icon c-color-red c-gap-right-small" style="display: none;"></i><span
#             class="title-content-index c-index-single c-index-single-hot1" style="display: ;">1</span><span
#             class="title-content-title">主持人音乐会求婚观众齐喊退票</span></a><span
#             class="title-content-mark ie-vertical c-text c-gap-left-small"></span></li>
#     <li class="hotsearch-item even" data-index="4"><a class="title-content c-link c-font-medium c-line-clamp1"
#                                                       href="https://www.baidu.com/s?wd=%E5%A5%B3%E5%AD%A9%E7%94%A8%E7%A7%91%E7%9B%AE%E4%B8%89%E8%B7%B3%E7%BB%B3+%E8%8E%B7%E7%9C%81%E7%BA%A7%E6%AF%94%E8%B5%9B%E7%AC%AC1%E5%90%8D&amp;sa=fyb_n_homepage&amp;rsv_dl=fyb_n_homepage&amp;from=super&amp;cl=3&amp;tn=baidutop10&amp;fr=top1000&amp;rsv_idx=2&amp;hisfilter=1"
#                                                       target="_blank">
#         <div class="title-content-noindex" style="display: none;"></div>
#         <i class="c-icon title-content-top-icon c-color-red c-gap-right-small" style="display: none;"></i><span
#             class="title-content-index c-index-single c-index-single-hot4" style="display: ;">4</span><span
#             class="title-content-title">女孩用科目三跳绳 获省级比赛第1名</span></a><span
#             class="title-content-mark ie-vertical c-text c-gap-left-small"></span></li>
#     <li class="hotsearch-item odd" data-index="2"><a class="title-content c-link c-font-medium c-line-clamp1"
#                                                      href="https://www.baidu.com/s?wd=%E7%94%B7%E5%AD%90%E5%86%AC%E9%92%93%E5%A4%B1%E8%81%94+%E9%81%97%E4%BD%93%E5%9C%A8%E5%86%B0%E7%BC%9D%E4%B8%AD%E8%A2%AB%E5%8F%91%E7%8E%B0&amp;sa=fyb_n_homepage&amp;rsv_dl=fyb_n_homepage&amp;from=super&amp;cl=3&amp;tn=baidutop10&amp;fr=top1000&amp;rsv_idx=2&amp;hisfilter=1"
#                                                      target="_blank">
#         <div class="title-content-noindex" style="display: none;"></div>
#         <i class="c-icon title-content-top-icon c-color-red c-gap-right-small" style="display: none;"></i><span
#             class="title-content-index c-index-single c-index-single-hot2" style="display: ;">2</span><span
#             class="title-content-title">男子冬钓失联 遗体在冰缝中被发现</span></a><span
#             class="title-content-mark ie-vertical c-text c-gap-left-small"></span></li>
#     <li class="hotsearch-item even" data-index="5"><a class="title-content tag-width c-link c-font-medium c-line-clamp1"
#                                                       href="https://www.baidu.com/s?wd=%E6%A0%BC%E5%8A%9B&amp;sa=fyb_n_homepage&amp;rsv_dl=fyb_n_homepage&amp;from=super&amp;cl=3&amp;tn=baidutop10&amp;fr=top1000&amp;rsv_idx=2&amp;hisfilter=1"
#                                                       target="_blank">
#         <div class="title-content-noindex" style="display: ;"></div>
#         <i class="c-icon title-content-top-icon c-color-red c-gap-right-small" style="display: none;"></i><span
#             class="title-content-index c-index-single c-index-single-hot-100" style="display: none;">-100</span><span
#             class="title-content-title">格力电器年终好物推荐</span></a><span
#             class="title-content-mark ie-vertical c-text c-gap-left-small c-text-business">商</span></li>
# </ul>


news = []
# Collect each news title and link, then print them
for li in news_list.find_all('li'):
    title = li.find('span', class_='title-content-title').get_text()
    url = li.find('a').get('href')
    new_dict = {"title": title, "url": url}
    news.append(new_dict)

for new in news:
    print(new)
# {'title': '用好“改革开放”这关键一招', 'url': 'https://www.baidu.com/s?wd=%E7%94%A8%E5%A5%BD%E2%80%9C%E6%94%B9%E9%9D%A9%E5%BC%80%E6%94%BE%E2%80%9D%E8%BF%99%E5%85%B3%E9%94%AE%E4%B8%80%E6%8B%9B&sa=fyb_n_homepage&rsv_dl=fyb_n_homepage&from=super&cl=3&tn=baidutop10&fr=top1000&rsv_idx=2&hisfilter=1'}
# {'title': '美丽乡村 幸福生活', 'url': 'https://www.baidu.com/s?wd=%E7%BE%8E%E4%B8%BD%E4%B9%A1%E6%9D%91+%E5%B9%B8%E7%A6%8F%E7%94%9F%E6%B4%BB&sa=fyb_n_homepage&rsv_dl=fyb_n_homepage&from=super&cl=3&tn=baidutop10&fr=top1000&rsv_idx=2&hisfilter=1'}
# {'title': '男子冬钓失联 遗体在冰缝中被发现', 'url': 'https://www.baidu.com/s?wd=%E7%94%B7%E5%AD%90%E5%86%AC%E9%92%93%E5%A4%B1%E8%81%94+%E9%81%97%E4%BD%93%E5%9C%A8%E5%86%B0%E7%BC%9D%E4%B8%AD%E8%A2%AB%E5%8F%91%E7%8E%B0&sa=fyb_n_homepage&rsv_dl=fyb_n_homepage&from=super&cl=3&tn=baidutop10&fr=top1000&rsv_idx=2&hisfilter=1'}
# {'title': '李家超：基本法第23条明年内实施', 'url': 'https://www.baidu.com/s?wd=%E6%9D%8E%E5%AE%B6%E8%B6%85%EF%BC%9A%E5%9F%BA%E6%9C%AC%E6%B3%95%E7%AC%AC23%E6%9D%A1%E6%98%8E%E5%B9%B4%E5%86%85%E5%AE%9E%E6%96%BD&sa=fyb_n_homepage&rsv_dl=fyb_n_homepage&from=super&cl=3&tn=baidutop10&fr=top1000&rsv_idx=2&hisfilter=1'}
# {'title': '女孩用科目三跳绳 获省级比赛第1名', 'url': 'https://www.baidu.com/s?wd=%E5%A5%B3%E5%AD%A9%E7%94%A8%E7%A7%91%E7%9B%AE%E4%B8%89%E8%B7%B3%E7%BB%B3+%E8%8E%B7%E7%9C%81%E7%BA%A7%E6%AF%94%E8%B5%9B%E7%AC%AC1%E5%90%8D&sa=fyb_n_homepage&rsv_dl=fyb_n_homepage&from=super&cl=3&tn=baidutop10&fr=top1000&rsv_idx=2&hisfilter=1'}
# {'title': '山东两幼师出租房内遇害', 'url': 'https://www.baidu.com/s?wd=%E5%B1%B1%E4%B8%9C%E4%B8%A4%E5%B9%BC%E5%B8%88%E5%87%BA%E7%A7%9F%E6%88%BF%E5%86%85%E9%81%87%E5%AE%B3&sa=fyb_n_homepage&rsv_dl=fyb_n_homepage&from=super&cl=3&tn=baidutop10&fr=top1000&rsv_idx=2&hisfilter=1'}
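
Because find() returns None when nothing matches, the loop above raises an AttributeError if the page layout changes or a list item lacks a title or link. A more defensive variant might look like the sketch below. The function name `extract_news` and the inline `SAMPLE_HTML` are assumptions made for illustration: the sample is a heavily simplified imitation of the list structure shown above, with an invented link, so that the code runs without a network request.

```python
from bs4 import BeautifulSoup

# Simplified, partly invented sample that imitates the structure above.
SAMPLE_HTML = """
<ul class="s-hotsearch-content" id="hotsearch-content-wrapper">
    <li class="hotsearch-item odd" data-index="0"><a class="title-content"
        href="https://www.baidu.com/s?wd=example&amp;cl=3"><span
        class="title-content-title">美丽乡村 幸福生活</span></a></li>
    <li class="hotsearch-item even" data-index="1"><span
        class="title-content-mark"></span></li>
</ul>
"""


def extract_news(html):
    """Return a list of {"title", "url"} dicts, skipping incomplete items."""
    soup = BeautifulSoup(html, "html.parser")
    news_list = soup.find("ul", class_="s-hotsearch-content")
    if news_list is None:
        # The list is missing, e.g. the site served a different layout.
        return []

    news = []
    for li in news_list.find_all("li"):
        title_tag = li.find("span", class_="title-content-title")
        link_tag = li.find("a")
        if title_tag is None or link_tag is None:
            continue  # skip items without a title or a link
        news.append({
            "title": title_tag.get_text(strip=True),
            "url": link_tag.get("href"),
        })
    return news


for item in extract_news(SAMPLE_HTML):
    print(item)
```

With live pages, the same function could be given `response.text` from the request above; an empty result then signals that the expected list was not present in the response.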