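Learn BeautifulSoup with Knowledge Seeker (知識追尋者): If You Still Can't Learn It, Scold Me and I Won't Fight Back
1 Preface
Beautiful Soup is a Python library for extracting data from HTML or XML files. Its extraction capabilities are strong enough that Knowledge Seeker gave up using regular expressions to locate HTML nodes. Beautiful Soup can get the corresponding node directly by HTML tag name, or through function calls, which greatly improves developer productivity. If you finish this article and still can't use Beautiful Soup, not even all the gods and Buddhas can save you. If you find Knowledge Seeker's articles interesting, please follow and like, thanks.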
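2 Basic usage of Beautiful Soup
Beautiful Soup supports the following parsers:
Parser | Usage example |
---|---|
Python standard library | BeautifulSoup(markup, "html.parser") |
lxml HTML parser | BeautifulSoup(markup, "lxml") |
lxml XML parser | BeautifulSoup(markup, "xml") |
html5lib | BeautifulSoup(markup, "html5lib") |
For this article, readers can use either the Python standard library parser or the lxml HTML parser. Keep in mind that wherever the text below says it gets a tag, what it actually gets is a tag object.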
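As a minimal sketch of choosing between the parsers in the table, the snippet below tries lxml first and falls back to html.parser when lxml is not installed. The markup string and the parser_used variable are made up for illustration; FeatureNotFound is the exception Beautiful Soup raises when the requested parser is unavailable.

```python
from bs4 import BeautifulSoup, FeatureNotFound

# Hypothetical markup with unclosed tags, just for illustration
markup = "<p>Hello<b>world"

try:
    # lxml is a third-party package and may not be installed
    soup = BeautifulSoup(markup, "lxml")
    parser_used = "lxml"
except FeatureNotFound:
    # The standard-library parser is always available
    soup = BeautifulSoup(markup, "html.parser")
    parser_used = "html.parser"

print(parser_used, soup.b.string)
```

Either parser closes the unclosed b tag here, so soup.b.string is available in both cases.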
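A brief summary of the attributes:
Attribute | Meaning |
---|---|
soup.tag.name | Get the name of the tag |
soup.tag.string | Get the text content of the tag |
soup.tag | Get the first matching tag |
soup.tag.attrs | Get all attributes of the tag |
soup.tag.attrs['class'] | Get the value of the tag's class attribute |
soup.tag1.tag2 | Get the child tag tag2 |
soup.tag.contents | Get all direct child nodes of the tag as a list |
soup.tag.children | Get the direct child nodes; returns a generator |
soup.tag.descendants | Get all descendant nodes; returns a generator |
soup.tag.parent | Get the direct parent node |
soup.tag.parents | Get the ancestor nodes; returns a generator |
soup.tag.next_sibling | Get the next sibling node |
soup.tag.previous_sibling | Get the previous sibling node |
soup.tag.next_siblings | Get all following sibling nodes; returns a generator |
soup.tag.previous_siblings | Get all preceding sibling nodes; returns a generator |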
2.1 Formatting HTML
- Create a BeautifulSoup instance; the arguments are the HTML and html.parser
- Calling the prettify() method formats the HTML document
# -*- coding: utf-8 -*-
import requests
from bs4 import BeautifulSoup
html = """
<div class="filter-box d-flex align-items-center">
<form action="" id=seeOriginal>
<dl class="filter-sort-box d-flex align-items-center">
<dt>排序:</dt>
<dd><a href="javascript:void(0);" data-report-query="" class="btn-filter-sort active" target="_self">預設</a></dd>
<dd><a class="btn btn-sm rss" href="https://blog.csdn.net/youku1327/rss/list">
<svg class="icon" aria-hidden="true">
<use xlink:href="#csdnc-rss"></use>
</svg>RSS訂閱</a>
</dd>
</dl>"""
# Initialize soup
soup = BeautifulSoup(html,'html.parser')
print(soup.prettify())
The output is shown below. It looks tidy and the structure is clear, and the missing closing tags </form> and </div> have been filled in:
<div class="filter-box d-flex align-items-center">
<form action="" id="seeOriginal">
<dl class="filter-sort-box d-flex align-items-center">
<dt>
排序:
</dt>
<dd>
<a class="btn-filter-sort active" data-report-query="" href="javascript:void(0);" target="_self">
預設
</a>
</dd>
<dd>
<a class="btn btn-sm rss" href="https://blog.csdn.net/youku1327/rss/list">
<svg aria-hidden="true" class="icon">
<use xlink:href="#csdnc-rss">
</use>
</svg>
RSS訂閱
</a>
</dd>
</dl>
</form>
</div>
2.2 Getting a tag node
- Accessing soup.dt returns the first matching dt tag object directly;
# -*- coding: utf-8 -*-
import requests
from bs4 import BeautifulSoup
html = """
<div class="filter-box d-flex align-items-center">
<form action="" id=seeOriginal>
<dl class="filter-sort-box d-flex align-items-center">
<dt>排序:</dt>
<dd><a href="javascript:void(0);" data-report-query="" class="btn-filter-sort active" target="_self">預設</a></dd>
<dd><a class="btn btn-sm rss" href="https://blog.csdn.net/youku1327/rss/list">
<svg class="icon" aria-hidden="true">
<use xlink:href="#csdnc-rss"></use>
</svg>RSS訂閱</a>
</dd>
</dl>"""
# Initialize soup
soup = BeautifulSoup(html,'html.parser')
# Prints the node <dt>排序:</dt>
print(soup.dt)
2.3 Getting the node text
soup.dt.string returns the content contained in the dt tag;
# -*- coding: utf-8 -*-
import requests
from bs4 import BeautifulSoup
html = """
<div class="filter-box d-flex align-items-center">
<form action="" id=seeOriginal>
<dl class="filter-sort-box d-flex align-items-center">
<dt>排序:</dt>
<dd><a href="javascript:void(0);" data-report-query="" class="btn-filter-sort active" target="_self">預設</a></dd>
<dd><a class="btn btn-sm rss" href="https://blog.csdn.net/youku1327/rss/list">
<svg class="icon" aria-hidden="true">
<use xlink:href="#csdnc-rss"></use>
</svg>RSS訂閱</a>
</dd>
</dl>"""
# Initialize soup
soup = BeautifulSoup(html,'html.parser')
# Prints the text content 排序:
print(soup.dt.string)
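One caveat worth sketching, assuming the same sample HTML: .string only returns text when the tag has a single child. For a tag with several children it returns None, and get_text() is the usual way to collect the text instead. The variable names first and second are only for illustration.

```python
from bs4 import BeautifulSoup

html = """
<div class="filter-box d-flex align-items-center">
<form action="" id=seeOriginal>
<dl class="filter-sort-box d-flex align-items-center">
<dt>排序:</dt>
<dd><a href="javascript:void(0);" data-report-query="" class="btn-filter-sort active" target="_self">預設</a></dd>
<dd><a class="btn btn-sm rss" href="https://blog.csdn.net/youku1327/rss/list">
<svg class="icon" aria-hidden="true">
<use xlink:href="#csdnc-rss"></use>
</svg>RSS訂閱</a>
</dd>
</dl>"""
soup = BeautifulSoup(html, "html.parser")

# The sample contains exactly two a tags
first, second = soup.find_all("a")

# The first a tag has a single text child, so .string returns that text
print(first.string)

# The second a tag contains an svg tag plus text, so .string is None
print(second.string)

# get_text() joins all nested text; strip=True drops the surrounding whitespace
print(second.get_text(strip=True))
```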
2.4 Getting the node name
soup.dt.name returns the name of the dt tag directly;
# -*- coding: utf-8 -*-
import requests
from bs4 import BeautifulSoup
html = """
<div class="filter-box d-flex align-items-center">
<form action="" id=seeOriginal>
<dl class="filter-sort-box d-flex align-items-center">
<dt>排序:</dt>
<dd><a href="javascript:void(0);" data-report-query="" class="btn-filter-sort active" target="_self">預設</a></dd>
<dd><a class="btn btn-sm rss" href="https://blog.csdn.net/youku1327/rss/list">
<svg class="icon" aria-hidden="true">
<use xlink:href="#csdnc-rss"></use>
</svg>RSS訂閱</a>
</dd>
</dl>"""
# Initialize soup
soup = BeautifulSoup(html,'html.parser')
# Prints dt
print(soup.dt.name)
2.5 Getting the node object type
After getting a tag, passing it to the built-in type() function shows that its type is <class 'bs4.element.Tag'>
# -*- coding: utf-8 -*-
import requests
from bs4 import BeautifulSoup
html = """
<div class="filter-box d-flex align-items-center">
<form action="" id=seeOriginal>
<dl class="filter-sort-box d-flex align-items-center">
<dt>排序:</dt>
<dd><a href="javascript:void(0);" data-report-query="" class="btn-filter-sort active" target="_self">預設</a></dd>
<dd><a class="btn btn-sm rss" href="https://blog.csdn.net/youku1327/rss/list">
<svg class="icon" aria-hidden="true">
<use xlink:href="#csdnc-rss"></use>
</svg>RSS訂閱</a>
</dd>
</dl>"""
# Initialize soup
soup = BeautifulSoup(html,'html.parser')
dt = soup.dt
# <class 'bs4.element.Tag'>
print(type(dt))
2.6 Getting all attributes
soup.a.attrs returns all attributes of the first matching a tag;
# -*- coding: utf-8 -*-
import requests
from bs4 import BeautifulSoup
html = """
<div class="filter-box d-flex align-items-center">
<form action="" id=seeOriginal>
<dl class="filter-sort-box d-flex align-items-center">
<dt>排序:</dt>
<dd><a href="javascript:void(0);" data-report-query="" class="btn-filter-sort active" target="_self">預設</a></dd>
<dd><a class="btn btn-sm rss" href="https://blog.csdn.net/youku1327/rss/list">
<svg class="icon" aria-hidden="true">
<use xlink:href="#csdnc-rss"></use>
</svg>RSS訂閱</a>
</dd>
</dl>"""
# Initialize soup
soup = BeautifulSoup(html,'html.parser')
print(soup.a.attrs)
The output is all the attributes of the first matching a tag:
{'href': 'javascript:void(0);', 'data-report-query': '', 'class': ['btn-filter-sort', 'active'], 'target': '_self'}
2.7 Getting a specific attribute
soup.a.attrs['href'] returns the value of the href attribute of the first matching a tag
# -*- coding: utf-8 -*-
import requests
from bs4 import BeautifulSoup
html = """
<div class="filter-box d-flex align-items-center">
<form action="" id=seeOriginal>
<dl class="filter-sort-box d-flex align-items-center">
<dt>排序:</dt>
<dd><a href="javascript:void(0);" data-report-query="" class="btn-filter-sort active" target="_self">預設</a></dd>
<dd><a class="btn btn-sm rss" href="https://blog.csdn.net/youku1327/rss/list">
<svg class="icon" aria-hidden="true">
<use xlink:href="#csdnc-rss"></use>
</svg>RSS訂閱</a>
</dd>
</dl>"""
# Initialize soup
soup = BeautifulSoup(html,'html.parser')
# Prints javascript:void(0);
print(soup.a.attrs['href'])
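As a small sketch on the same sample HTML, a tag can also be indexed like a dictionary, and get() avoids a KeyError when the attribute is absent. The title attribute below is deliberately one that the sample does not contain.

```python
from bs4 import BeautifulSoup

html = """
<div class="filter-box d-flex align-items-center">
<form action="" id=seeOriginal>
<dl class="filter-sort-box d-flex align-items-center">
<dt>排序:</dt>
<dd><a href="javascript:void(0);" data-report-query="" class="btn-filter-sort active" target="_self">預設</a></dd>
<dd><a class="btn btn-sm rss" href="https://blog.csdn.net/youku1327/rss/list">
<svg class="icon" aria-hidden="true">
<use xlink:href="#csdnc-rss"></use>
</svg>RSS訂閱</a>
</dd>
</dl>"""
soup = BeautifulSoup(html, "html.parser")
a = soup.a

# tag['href'] is shorthand for tag.attrs['href']
href = a["href"]

# class is a multi-valued attribute, so it comes back as a list
classes = a["class"]

# get() returns None for a missing attribute instead of raising KeyError
missing = a.get("title")

print(href, classes, missing)
```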
2.8 Getting a child node
soup.form.dd returns the first dd tag under the form tag
# -*- coding: utf-8 -*-
import requests
from bs4 import BeautifulSoup
html = """
<div class="filter-box d-flex align-items-center">
<form action="" id=seeOriginal>
<dl class="filter-sort-box d-flex align-items-center">
<dt>排序:</dt>
<dd><a href="javascript:void(0);" data-report-query="" class="btn-filter-sort active" target="_self">預設</a></dd>
<dd><a class="btn btn-sm rss" href="https://blog.csdn.net/youku1327/rss/list">
<svg class="icon" aria-hidden="true">
<use xlink:href="#csdnc-rss"></use>
</svg>RSS訂閱</a>
</dd>
</dl>"""
# Initialize soup
soup = BeautifulSoup(html,'html.parser')
print(soup.form.dd)
Output:
<dd><a class="btn-filter-sort active" data-report-query="" href="javascript:void(0);" target="_self">預設</a></dd>
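2.9 Getting all direct child nodes
soup.form.contents returns all direct child nodes of form as a list; note that the list includes text nodes such as '\n', not only tags;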
# -*- coding: utf-8 -*-
import requests
from bs4 import BeautifulSoup
html = """
<div class="filter-box d-flex align-items-center">
<form action="" id=seeOriginal>
<dl class="filter-sort-box d-flex align-items-center">
<dt>排序:</dt>
<dd><a href="javascript:void(0);" data-report-query="" class="btn-filter-sort active" target="_self">預設</a></dd>
<dd><a class="btn btn-sm rss" href="https://blog.csdn.net/youku1327/rss/list">
<svg class="icon" aria-hidden="true">
<use xlink:href="#csdnc-rss"></use>
</svg>RSS訂閱</a>
</dd>
</dl>"""
# Initialize soup
soup = BeautifulSoup(html,'html.parser')
print(soup.form.contents)
Output:
['\n', <dl class="filter-sort-box d-flex align-items-center">
<dt>排序:</dt>
<dd><a class="btn-filter-sort active" data-report-query="" href="javascript:void(0);" target="_self">預設</a></dd>
<dd><a class="btn btn-sm rss" href="https://blog.csdn.net/youku1327/rss/list">
<svg aria-hidden="true" class="icon">
<use xlink:href="#csdnc-rss"></use>
</svg>RSS訂閱</a>
</dd>
</dl>]
2.10 Getting the direct child node generator
soup.svg.children returns a generator over all direct child nodes of svg;
# -*- coding: utf-8 -*-
import requests
from bs4 import BeautifulSoup
html = """
<div class="filter-box d-flex align-items-center">
<form action="" id=seeOriginal>
<dl class="filter-sort-box d-flex align-items-center">
<dt>排序:</dt>
<dd><a href="javascript:void(0);" data-report-query="" class="btn-filter-sort active" target="_self">預設</a></dd>
<dd><a class="btn btn-sm rss" href="https://blog.csdn.net/youku1327/rss/list">
<svg class="icon" aria-hidden="true">
<use xlink:href="#csdnc-rss"></use>
</svg>RSS訂閱</a>
</dd>
</dl>"""
# Initialize soup
soup = BeautifulSoup(html,'html.parser')
for index, child in enumerate(soup.svg.children):
    print(index, child)
Output:
0
1 <use xlink:href="#csdnc-rss"></use>
2
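Items 0 and 2 in the output above are newline text nodes. When only the tags are wanted, one possible approach, sketched here on the same sample HTML, is to keep the children that are Tag instances. The names tag_children and names are illustrative.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

html = """
<div class="filter-box d-flex align-items-center">
<form action="" id=seeOriginal>
<dl class="filter-sort-box d-flex align-items-center">
<dt>排序:</dt>
<dd><a href="javascript:void(0);" data-report-query="" class="btn-filter-sort active" target="_self">預設</a></dd>
<dd><a class="btn btn-sm rss" href="https://blog.csdn.net/youku1327/rss/list">
<svg class="icon" aria-hidden="true">
<use xlink:href="#csdnc-rss"></use>
</svg>RSS訂閱</a>
</dd>
</dl>"""
soup = BeautifulSoup(html, "html.parser")

# Keep only Tag objects; whitespace text nodes are NavigableString and are skipped
tag_children = [child for child in soup.dl.children if isinstance(child, Tag)]
names = [child.name for child in tag_children]

print(names)  # ['dt', 'dd', 'dd']
```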
2.11 Getting the generator of all descendant nodes
soup.dl.descendants returns all descendant nodes of the dl tag (not only the direct children);
# -*- coding: utf-8 -*-
import requests
from bs4 import BeautifulSoup
html = """
<div class="filter-box d-flex align-items-center">
<form action="" id=seeOriginal>
<dl class="filter-sort-box d-flex align-items-center">
<dt>排序:</dt>
<dd><a href="javascript:void(0);" data-report-query="" class="btn-filter-sort active" target="_self">預設</a></dd>
<dd><a class="btn btn-sm rss" href="https://blog.csdn.net/youku1327/rss/list">
<svg class="icon" aria-hidden="true">
<use xlink:href="#csdnc-rss"></use>
</svg>RSS訂閱</a>
</dd>
</dl>"""
# Initialize soup
soup = BeautifulSoup(html,'html.parser')
for index, child in enumerate(soup.dl.descendants):
    print(index, child)
Output:
0
1 <dt>排序:</dt>
2 排序:
3
4 <dd><a class="btn-filter-sort active" data-report-query="" href="javascript:void(0);" target="_self">預設</a></dd>
5 <a class="btn-filter-sort active" data-report-query="" href="javascript:void(0);" target="_self">預設</a>
6 預設
7
8 <dd><a class="btn btn-sm rss" href="https://blog.csdn.net/youku1327/rss/list">
<svg aria-hidden="true" class="icon">
<use xlink:href="#csdnc-rss"></use>
</svg>RSS訂閱</a>
</dd>
9 <a class="btn btn-sm rss" href="https://blog.csdn.net/youku1327/rss/list">
<svg aria-hidden="true" class="icon">
<use xlink:href="#csdnc-rss"></use>
</svg>RSS訂閱</a>
10
11 <svg aria-hidden="true" class="icon">
<use xlink:href="#csdnc-rss"></use>
</svg>
12
13 <use xlink:href="#csdnc-rss"></use>
14
15 RSS訂閱
16
17
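Much of the listing above is whitespace. If only the text is of interest, a sketch using the stripped_strings generator on the same sample HTML yields the non-empty text nodes with surrounding whitespace removed. The variable name texts is illustrative.

```python
from bs4 import BeautifulSoup

html = """
<div class="filter-box d-flex align-items-center">
<form action="" id=seeOriginal>
<dl class="filter-sort-box d-flex align-items-center">
<dt>排序:</dt>
<dd><a href="javascript:void(0);" data-report-query="" class="btn-filter-sort active" target="_self">預設</a></dd>
<dd><a class="btn btn-sm rss" href="https://blog.csdn.net/youku1327/rss/list">
<svg class="icon" aria-hidden="true">
<use xlink:href="#csdnc-rss"></use>
</svg>RSS訂閱</a>
</dd>
</dl>"""
soup = BeautifulSoup(html, "html.parser")

# stripped_strings walks every descendant text node and skips whitespace-only ones
texts = list(soup.dl.stripped_strings)

for text in texts:
    print(text)
```

On this sample the result is the three visible labels: the dt text and the text of the two links.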
2.12 Getting the direct parent node
soup.a.parent returns the parent tag object of the first matching a tag;
# -*- coding: utf-8 -*-
import requests
from bs4 import BeautifulSoup
html = """
<div class="filter-box d-flex align-items-center">
<form action="" id=seeOriginal>
<dl class="filter-sort-box d-flex align-items-center">
<dt>排序:</dt>
<dd><a href="javascript:void(0);" data-report-query="" class="btn-filter-sort active" target="_self">預設</a></dd>
<dd><a class="btn btn-sm rss" href="https://blog.csdn.net/youku1327/rss/list">
<svg class="icon" aria-hidden="true">
<use xlink:href="#csdnc-rss"></use>
</svg>RSS訂閱</a>
</dd>
</dl>"""
# Initialize soup
soup = BeautifulSoup(html,'html.parser')
print(soup.a.parent)
Output:
<dd><a class="btn-filter-sort active" data-report-query="" href="javascript:void(0);" target="_self">預設</a></dd>
2.13 Getting the ancestor node generator
soup.a.parents returns all parent nodes of the first matching a tag, that is, its ancestors, as a generator;
# -*- coding: utf-8 -*-
import requests
from bs4 import BeautifulSoup
html = """
<div class="filter-box d-flex align-items-center">
<form action="" id=seeOriginal>
<dl class="filter-sort-box d-flex align-items-center">
<dt>排序:</dt>
<dd><a href="javascript:void(0);" data-report-query="" class="btn-filter-sort active" target="_self">預設</a></dd>
<dd><a class="btn btn-sm rss" href="https://blog.csdn.net/youku1327/rss/list">
<svg class="icon" aria-hidden="true">
<use xlink:href="#csdnc-rss"></use>
</svg>RSS訂閱</a>
</dd>
</dl>"""
# Initialize soup
soup = BeautifulSoup(html,'html.parser')
for node in soup.a.parents:
    if node is None:
        print(node)
    else:
        print(node.name)
Output:
dd
dl
form
div
[document]
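When only one particular ancestor is needed, walking the whole generator is not required. As a sketch on the same sample HTML, find_parent() returns the nearest ancestor that matches; the variable names form and ancestor_names are illustrative.

```python
from bs4 import BeautifulSoup

html = """
<div class="filter-box d-flex align-items-center">
<form action="" id=seeOriginal>
<dl class="filter-sort-box d-flex align-items-center">
<dt>排序:</dt>
<dd><a href="javascript:void(0);" data-report-query="" class="btn-filter-sort active" target="_self">預設</a></dd>
<dd><a class="btn btn-sm rss" href="https://blog.csdn.net/youku1327/rss/list">
<svg class="icon" aria-hidden="true">
<use xlink:href="#csdnc-rss"></use>
</svg>RSS訂閱</a>
</dd>
</dl>"""
soup = BeautifulSoup(html, "html.parser")

# Nearest ancestor of the first a tag whose name is form
form = soup.a.find_parent("form")
print(form["id"])  # seeOriginal

# The same ancestors as the loop above, collected into a list of names
ancestor_names = [node.name for node in soup.a.parents]
print(ancestor_names)
```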
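2.14 Getting sibling nodes
Sibling nodes have a pitfall: they usually come back as whitespace, because the line break between two tags is itself a text node, so they are not covered in much detail here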
# -*- coding: utf-8 -*-
import requests
from bs4 import BeautifulSoup
html = """
<div class="filter-box d-flex align-items-center">
<form action="" id=seeOriginal>
<dl class="filter-sort-box d-flex align-items-center">
<dt>排序:</dt>
<dd><a href="javascript:void(0);" data-report-query="" class="btn-filter-sort active" target="_self">預設</a></dd>
<dd><a class="btn btn-sm rss" href="https://blog.csdn.net/youku1327/rss/list">
<svg class="icon" aria-hidden="true">
<use xlink:href="#csdnc-rss"></use>
</svg>RSS訂閱</a>
</dd>
</dl>"""
# Initialize soup
soup = BeautifulSoup(html,'html.parser')
print(soup.dt.next_sibling)
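The output is blank, because the next sibling of dt is the newline text node that follows it. The other sibling attributes are not demonstrated here; on this sample they return either whitespace or None;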
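One way around the whitespace pitfall, sketched on the same sample HTML, is to use find_next_sibling() and find_next_siblings(), which skip text nodes when a tag name is given. The variable names are illustrative.

```python
from bs4 import BeautifulSoup

html = """
<div class="filter-box d-flex align-items-center">
<form action="" id=seeOriginal>
<dl class="filter-sort-box d-flex align-items-center">
<dt>排序:</dt>
<dd><a href="javascript:void(0);" data-report-query="" class="btn-filter-sort active" target="_self">預設</a></dd>
<dd><a class="btn btn-sm rss" href="https://blog.csdn.net/youku1327/rss/list">
<svg class="icon" aria-hidden="true">
<use xlink:href="#csdnc-rss"></use>
</svg>RSS訂閱</a>
</dd>
</dl>"""
soup = BeautifulSoup(html, "html.parser")
dt = soup.dt

# next_sibling is the newline text node between </dt> and <dd>
raw_next = dt.next_sibling
print(repr(raw_next))  # '\n'

# find_next_sibling('dd') skips the text node and returns the first dd tag
next_tag = dt.find_next_sibling("dd")
print(next_tag.a.string)

# find_next_siblings('dd') returns every following dd sibling
all_dd = dt.find_next_siblings("dd")
print(len(all_dd))  # 2
```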
3 Searching the document
The second section only laid the groundwork; this section is the important part;
Function | Meaning |
---|---|
find_all(name=None, attrs={}, recursive=True, text=None, limit=None, **kwargs) | Find all matching nodes |
find(name=None, attrs={}, recursive=True, text=None, **kwargs) | Find the first matching node |
find_parent(name=None, attrs={}, **kwargs) | Return the nearest matching ancestor of the current node |
find_parents(name=None, attrs={}, **kwargs) | Return all matching ancestors of the current node |
find_next_sibling(name=None, attrs={}, text=None, **kwargs) | Return the first matching sibling after the current node |
find_next_siblings(name=None, attrs={}, text=None, **kwargs) | Return all matching siblings after the current node |
find_previous_sibling(name=None, attrs={}, text=None, **kwargs) | Return the first matching sibling before the current node |
find_previous_siblings(name=None, attrs={}, text=None, **kwargs) | Return all matching siblings before the current node |
find_next(name=None, attrs={}, text=None, **kwargs) | Return the first matching node after the current node |
find_all_next(name=None, attrs={}, text=None, limit=None, **kwargs) | Return all matching nodes after the current node |
find_previous(name=None, attrs={}, text=None, **kwargs) | Return the first matching node before the current node |
find_all_previous(name=None, attrs={}, text=None, limit=None, **kwargs) | Return all matching nodes before the current node |
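- name: the tag name to match
- attrs: search by the given attributes
- recursive: whether to search all descendants recursively, True by default; set it to False to search direct children only
- limit: limits the number of results returned
- **kwargs: lets you search by commonly used attributes, for example id='zszxz'
- text: filters on the text content
This section focuses on find_all. find works the same way as find_all, so once you learn one you can use the other;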
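The one practical difference is the return type. As a sketch on the sample HTML used throughout this article, find returns a single tag or None, while find_all always returns a list, which is empty when nothing matches. The table tag is deliberately one that the sample does not contain.

```python
from bs4 import BeautifulSoup

html = """
<div class="filter-box d-flex align-items-center">
<form action="" id=seeOriginal>
<dl class="filter-sort-box d-flex align-items-center">
<dt>排序:</dt>
<dd><a href="javascript:void(0);" data-report-query="" class="btn-filter-sort active" target="_self">預設</a></dd>
<dd><a class="btn btn-sm rss" href="https://blog.csdn.net/youku1327/rss/list">
<svg class="icon" aria-hidden="true">
<use xlink:href="#csdnc-rss"></use>
</svg>RSS訂閱</a>
</dd>
</dl>"""
soup = BeautifulSoup(html, "html.parser")

# find returns the first match as a single Tag
first_dd = soup.find("dd")

# find_all returns every match in a list
all_dd = soup.find_all("dd")

# With no match, find returns None and find_all returns an empty list
no_tag = soup.find("table")
no_tags = soup.find_all("table")

print(first_dd == all_dd[0], len(all_dd), no_tag, no_tags)
```

Because find can return None, it is worth checking the result before accessing attributes on it.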
3.1 name parameter example
soup.find_all(name='dd') gets all dd tag objects and returns them as a list;
# -*- coding: utf-8 -*-
import requests
from bs4 import BeautifulSoup
html = """
<div class="filter-box d-flex align-items-center">
<form action="" id=seeOriginal>
<dl class="filter-sort-box d-flex align-items-center">
<dt>排序:</dt>
<dd><a href="javascript:void(0);" data-report-query="" class="btn-filter-sort active" target="_self">預設</a></dd>
<dd><a class="btn btn-sm rss" href="https://blog.csdn.net/youku1327/rss/list">
<svg class="icon" aria-hidden="true">
<use xlink:href="#csdnc-rss"></use>
</svg>RSS訂閱</a>
</dd>
</dl>"""
# Initialize soup
soup = BeautifulSoup(html,'html.parser')
print(soup.find_all(name='dd'))
Output:
[<dd><a class="btn-filter-sort active" data-report-query="" href="javascript:void(0);" target="_self">預設</a></dd>, <dd><a class="btn btn-sm rss" href="https://blog.csdn.net/youku1327/rss/list">
<svg aria-hidden="true" class="icon">
<use xlink:href="#csdnc-rss"></use>
</svg>RSS訂閱</a>
</dd>]
Note: soup.find_all(name='dd') is equivalent to soup.find_all('dd');
3.2 attrs parameter example
soup.find_all(attrs={'id':'seeOriginal'}) gets all tag objects whose id attribute is seeOriginal
# -*- coding: utf-8 -*-
import requests
from bs4 import BeautifulSoup
html = """
<div class="filter-box d-flex align-items-center">
<form action="" id=seeOriginal>
<dl class="filter-sort-box d-flex align-items-center">
<dt>排序:</dt>
<dd><a href="javascript:void(0);" data-report-query="" class="btn-filter-sort active" target="_self">預設</a></dd>
<dd><a class="btn btn-sm rss" href="https://blog.csdn.net/youku1327/rss/list">
<svg class="icon" aria-hidden="true">
<use xlink:href="#csdnc-rss"></use>
</svg>RSS訂閱</a>
</dd>
</dl>"""
# Initialize soup
soup = BeautifulSoup(html,'html.parser')
print(soup.find_all(attrs={'id':'seeOriginal'}))
Output:
[<form action="" id="seeOriginal">
<dl class="filter-sort-box d-flex align-items-center">
<dt>排序:</dt>
<dd><a class="btn-filter-sort active" data-report-query="" href="javascript:void(0);" target="_self">預設</a></dd>
<dd><a class="btn btn-sm rss" href="https://blog.csdn.net/youku1327/rss/list">
<svg aria-hidden="true" class="icon">
<use xlink:href="#csdnc-rss"></use>
</svg>RSS訂閱</a>
</dd>
</dl></form>]
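3.3 recursive example
soup.find_all('dl',recursive=False)
searches only the direct children of soup for dl tags. The dl tag is nested inside div and form, so once recursive is set to False it is no longer found;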
# -*- coding: utf-8 -*-
import requests
from bs4 import BeautifulSoup
html = """
<div class="filter-box d-flex align-items-center">
<form action="" id=seeOriginal>
<dl class="filter-sort-box d-flex align-items-center">
<dt>排序:</dt>
<dd><a href="javascript:void(0);" data-report-query="" class="btn-filter-sort active" target="_self">預設</a></dd>
<dd><a class="btn btn-sm rss" href="https://blog.csdn.net/youku1327/rss/list">
<svg class="icon" aria-hidden="true">
<use xlink:href="#csdnc-rss"></use>
</svg>RSS訂閱</a>
</dd>
</dl>"""
# Initialize soup
soup = BeautifulSoup(html,'html.parser')
print(soup.find_all('dl',recursive=False))
The output is an empty list []
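To make the effect of recursive=False concrete, here is a sketch on the same sample HTML that starts the search from different nodes. The search only looks at the direct children of whichever node find_all is called on; the variable names are illustrative.

```python
from bs4 import BeautifulSoup

html = """
<div class="filter-box d-flex align-items-center">
<form action="" id=seeOriginal>
<dl class="filter-sort-box d-flex align-items-center">
<dt>排序:</dt>
<dd><a href="javascript:void(0);" data-report-query="" class="btn-filter-sort active" target="_self">預設</a></dd>
<dd><a class="btn btn-sm rss" href="https://blog.csdn.net/youku1327/rss/list">
<svg class="icon" aria-hidden="true">
<use xlink:href="#csdnc-rss"></use>
</svg>RSS訂閱</a>
</dd>
</dl>"""
soup = BeautifulSoup(html, "html.parser")

# div is a direct child of the document, so it is found
top_level = soup.find_all("div", recursive=False)

# dl is a direct child of form, so searching from form finds it
from_form = soup.form.find_all("dl", recursive=False)

# dl is not a direct child of the document, so this is empty
from_root = soup.find_all("dl", recursive=False)

print(len(top_level), len(from_form), len(from_root))  # 1 1 0
```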
3.4 limit example
soup.find_all('dd',limit=1)
limits the output to one result
# -*- coding: utf-8 -*-
import requests
from bs4 import BeautifulSoup
html = """
<div class="filter-box d-flex align-items-center">
<form action="" id=seeOriginal>
<dl class="filter-sort-box d-flex align-items-center">
<dt>排序:</dt>
<dd><a href="javascript:void(0);" data-report-query="" class="btn-filter-sort active" target="_self">預設</a></dd>
<dd><a class="btn btn-sm rss" href="https://blog.csdn.net/youku1327/rss/list">
<svg class="icon" aria-hidden="true">
<use xlink:href="#csdnc-rss"></use>
</svg>RSS訂閱</a>
</dd>
</dl>"""
# Initialize soup
soup = BeautifulSoup(html,'html.parser')
print(soup.find_all('dd',limit=1))
Output:
[<dd><a class="btn-filter-sort active" data-report-query="" href="javascript:void(0);" target="_self">預設</a></dd>]
3.5 kwargs example: attribute matching
soup.find_all(id='seeOriginal')
searches directly by the id attribute
# -*- coding: utf-8 -*-
import requests
from bs4 import BeautifulSoup
html = """
<div class="filter-box d-flex align-items-center">
<form action="" id=seeOriginal>
<dl class="filter-sort-box d-flex align-items-center">
<dt>排序:</dt>
<dd><a href="javascript:void(0);" data-report-query="" class="btn-filter-sort active" target="_self">預設</a></dd>
<dd><a class="btn btn-sm rss" href="https://blog.csdn.net/youku1327/rss/list">
<svg class="icon" aria-hidden="true">
<use xlink:href="#csdnc-rss"></use>
</svg>RSS訂閱</a>
</dd>
</dl>"""
# Initialize soup
soup = BeautifulSoup(html,'html.parser')
print(soup.find_all(id='seeOriginal'))
Output:
[<form action="" id="seeOriginal">
<dl class="filter-sort-box d-flex align-items-center">
<dt>排序:</dt>
<dd><a class="btn-filter-sort active" data-report-query="" href="javascript:void(0);" target="_self">預設</a></dd>
<dd><a class="btn btn-sm rss" href="https://blog.csdn.net/youku1327/rss/list">
<svg aria-hidden="true" class="icon">
<use xlink:href="#csdnc-rss"></use>
</svg>RSS訂閱</a>
</dd>
</dl></form>]
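3.6 kwargs example: regular expression matching
soup.find_all(href=re.compile("java.*?"))
matches tags whose href attribute matches the regular expression. Beautiful Soup applies the pattern with search(), so this finds any href that contains java; use ^java to require that the value starts with java;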
# -*- coding: utf-8 -*-
import re
import requests
from bs4 import BeautifulSoup
html = """
<div class="filter-box d-flex align-items-center">
<form action="" id=seeOriginal>
<dl class="filter-sort-box d-flex align-items-center">
<dt>排序:</dt>
<dd><a href="javascript:void(0);" data-report-query="" class="btn-filter-sort active" target="_self">預設</a></dd>
<dd><a class="btn btn-sm rss" href="https://blog.csdn.net/youku1327/rss/list">
<svg class="icon" aria-hidden="true">
<use xlink:href="#csdnc-rss"></use>
</svg>RSS訂閱</a>
</dd>
</dl>"""
# Initialize soup
soup = BeautifulSoup(html,'html.parser')
print(soup.find_all(href=re.compile("java.*?")))
Output:
[<a class="btn-filter-sort active" data-report-query="" href="javascript:void(0);" target="_self">預設</a>]
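The following sketch, on the same sample HTML, illustrates how an unanchored pattern differs from an anchored one when filtering on href. The patterns csdn, ^java and ^csdn are chosen only for illustration.

```python
import re

from bs4 import BeautifulSoup

html = """
<div class="filter-box d-flex align-items-center">
<form action="" id=seeOriginal>
<dl class="filter-sort-box d-flex align-items-center">
<dt>排序:</dt>
<dd><a href="javascript:void(0);" data-report-query="" class="btn-filter-sort active" target="_self">預設</a></dd>
<dd><a class="btn btn-sm rss" href="https://blog.csdn.net/youku1327/rss/list">
<svg class="icon" aria-hidden="true">
<use xlink:href="#csdnc-rss"></use>
</svg>RSS訂閱</a>
</dd>
</dl>"""
soup = BeautifulSoup(html, "html.parser")

# Unanchored pattern: matches because csdn appears somewhere inside the href
contains = soup.find_all(href=re.compile("csdn"))

# Anchored pattern: the href must start with java
starts_with = soup.find_all(href=re.compile(r"^java"))

# Anchored pattern that no href starts with, so nothing matches
no_match = soup.find_all(href=re.compile(r"^csdn"))

print(len(contains), len(starts_with), len(no_match))  # 1 1 0
```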
3.7 Searching by CSS class
soup.find_all("a", class_="btn")
finds a tags whose class attribute includes btn
# -*- coding: utf-8 -*-
import re
import requests
from bs4 import BeautifulSoup
html = """
<div class="filter-box d-flex align-items-center">
<form action="" id=seeOriginal>
<dl class="filter-sort-box d-flex align-items-center">
<dt>排序:</dt>
<dd><a href="javascript:void(0);" data-report-query="" class="btn-filter-sort active" target="_self">預設</a></dd>
<dd><a class="btn btn-sm rss" href="https://blog.csdn.net/youku1327/rss/list">
<svg class="icon" aria-hidden="true">
<use xlink:href="#csdnc-rss"></use>
</svg>RSS訂閱</a>
</dd>
</dl>"""
# Initialize soup
soup = BeautifulSoup(html,'html.parser')
print(soup.find_all("a", class_="btn"))
Output:
[<a class="btn btn-sm rss" href="https://blog.csdn.net/youku1327/rss/list">
<svg aria-hidden="true" class="icon">
<use xlink:href="#csdnc-rss"></use>
</svg>RSS訂閱</a>]
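Note that the first a tag, whose class is btn-filter-sort active, is not in the result above: class_ is compared against each whole class name, not against substrings. A short sketch on the same sample HTML, with illustrative variable names:

```python
from bs4 import BeautifulSoup

html = """
<div class="filter-box d-flex align-items-center">
<form action="" id=seeOriginal>
<dl class="filter-sort-box d-flex align-items-center">
<dt>排序:</dt>
<dd><a href="javascript:void(0);" data-report-query="" class="btn-filter-sort active" target="_self">預設</a></dd>
<dd><a class="btn btn-sm rss" href="https://blog.csdn.net/youku1327/rss/list">
<svg class="icon" aria-hidden="true">
<use xlink:href="#csdnc-rss"></use>
</svg>RSS訂閱</a>
</dd>
</dl>"""
soup = BeautifulSoup(html, "html.parser")

# Matches the first a tag: active is one of its class names
active = soup.find_all("a", class_="active")

# Matches the second a tag: btn-sm is one of its class names
small = soup.find_all("a", class_="btn-sm")

# No match: btn-filter is only a prefix of btn-filter-sort, not a class name
partial = soup.find_all("a", class_="btn-filter")

print(len(active), len(small), len(partial))  # 1 1 0
```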
4 CSS selectors
Beautiful Soup also supports searching with CSS selectors directly; commonly used examples are listed below;
# -*- coding: utf-8 -*-
import re
import requests
from bs4 import BeautifulSoup
html = """
<div class="filter-box d-flex align-items-center">
<form action="" id=seeOriginal>
<dl class="filter-sort-box d-flex align-items-center">
<dt>排序:</dt>
<dd><a href="javascript:void(0);" data-report-query="" class="btn-filter-sort active" target="_self">預設</a></dd>
<dd><a class="btn btn-sm rss" href="https://blog.csdn.net/youku1327/rss/list">
<svg class="icon" aria-hidden="true">
<use xlink:href="#csdnc-rss"></use>
</svg>RSS訂閱</a>
</dd>
</dl>"""
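# Initialize soup
soup = BeautifulSoup(html,'html.parser')
# Select the dt tags under the dl tag
lt = soup.select('dl dt')
print(lt)
dd = soup.select('dl dd')
print(dd[0])
# Search with an id selector (named by_id to avoid shadowing the built-in id)
by_id = soup.select('#seeOriginal')
print(by_id)
# Search with a class selector
cla = soup.select('.btn-filter-sort')
print(cla[0])
The outputs are, respectively: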
soup.select('dl dt')
[<dt>排序:</dt>]
soup.select('dl dd')[0]
<dd><a class="btn-filter-sort active" data-report-query="" href="javascript:void(0);" target="_self">預設</a></dd>
soup.select('#seeOriginal')
[<form action="" id="seeOriginal">
<dl class="filter-sort-box d-flex align-items-center">
<dt>排序:</dt>
<dd><a class="btn-filter-sort active" data-report-query="" href="javascript:void(0);" target="_self">預設</a></dd>
<dd><a class="btn btn-sm rss" href="https://blog.csdn.net/youku1327/rss/list">
<svg aria-hidden="true" class="icon">
<use xlink:href="#csdnc-rss"></use>
</svg>RSS訂閱</a>
</dd>
</dl></form>]
soup.select('.btn-filter-sort')[0]
<a class="btn-filter-sort active" data-report-query="" href="javascript:void(0);" target="_self">預設</a>
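A few more selector forms may be useful; the sketch below runs on the same sample HTML and assumes a Beautiful Soup version whose select() supports these standard CSS selectors. select_one() returns the first match or None rather than a list. The selectors and variable names are chosen only for illustration.

```python
from bs4 import BeautifulSoup

html = """
<div class="filter-box d-flex align-items-center">
<form action="" id=seeOriginal>
<dl class="filter-sort-box d-flex align-items-center">
<dt>排序:</dt>
<dd><a href="javascript:void(0);" data-report-query="" class="btn-filter-sort active" target="_self">預設</a></dd>
<dd><a class="btn btn-sm rss" href="https://blog.csdn.net/youku1327/rss/list">
<svg class="icon" aria-hidden="true">
<use xlink:href="#csdnc-rss"></use>
</svg>RSS訂閱</a>
</dd>
</dl>"""
soup = BeautifulSoup(html, "html.parser")

# Child combinator: the first dd that is a direct child of dl
first_dd = soup.select_one("dl > dd")

# Attribute selector: a tags whose href starts with https
https_links = soup.select('a[href^="https"]')

# Several classes on one tag: a tags that have both btn and rss
rss_links = soup.select("a.btn.rss")

# select_one returns None when nothing matches
missing = soup.select_one("table")

print(first_dd.a.string, len(https_links), len(rss_links), missing)
```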