BeautifulSoup Library: A Detailed Usage Guide

阿新 • Published: 2018-11-01

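BeautifulSoup is a flexible, convenient library for parsing web pages. It is efficient, supports several parsers, and lets you extract information from a page without writing regular expressions.
Install: pip3 install beautifulsoup4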

Parsers

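Parser	Usage	Advantages	Disadvantages
Python standard library	BeautifulSoup(markup, "html.parser")	Built into Python; moderate speed; good document fault tolerance	Poor document fault tolerance in versions before Python 2.7.3 or 3.2.2
lxml HTML parser	BeautifulSoup(markup, "lxml")	Fast; good document fault tolerance	Requires a C library
lxml XML parser	BeautifulSoup(markup, "xml")	Fast; the only parser that supports XML	Requires a C library
html5lib	BeautifulSoup(markup, "html5lib")	Best fault tolerance; parses documents the way a browser does; produces HTML5-format documents	Slow; must be installed separately (pure Python, no C extension needed)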
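
Which of these parsers is usable depends on what is installed. As a minimal sketch (the helper name make_soup is hypothetical and not part of bs4), one could prefer lxml and fall back to the built-in html.parser when Beautiful Soup raises FeatureNotFound:

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(markup, preferred="lxml"):
    """Hypothetical helper: try the preferred parser, fall back to html.parser."""
    try:
        return BeautifulSoup(markup, preferred)
    except FeatureNotFound:
        # Raised by bs4 when the requested parser library is not installed.
        return BeautifulSoup(markup, "html.parser")


# Deliberately unclosed tags: either parser still builds a usable tree.
soup = make_soup("<p>Hello <b>world")
print(soup.b.string)
```

The fallback keeps a script runnable on machines without lxml, at the cost of slightly different handling of badly broken markup.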

Basic usage

html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
""" 

from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')  # pass in the parser: lxml
print(soup.prettify())  # pretty-print the markup; missing closing tags are filled in
print(soup.title.string)  # get the text inside the title tag

Tag selectors

Selecting elements

html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')  # pass in the parser: lxml
print(soup.title)  # select the title tag
print(type(soup.title))  # check its type
print(soup.head)

Getting the name

from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')  # pass in the parser: lxml
print(soup.title.name)

This returns the name of the tag.

Getting attributes


from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')  # pass in the parser: lxml
print(soup.p.attrs['name'])  # get the value of the name attribute of the p tag
print(soup.p['name'])  # alternative, more direct form
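
Indexing with [] raises KeyError when the attribute is absent. The sketch below, which uses html.parser and a made-up snippet so that it runs without lxml, shows the dict-style get() alternative and that class comes back as a list because it can hold several values:

```python
from bs4 import BeautifulSoup

# Illustrative snippet (not the article's full document).
snippet = '<p class="title main" name="dromouse"><b>The Dormouse\'s story</b></p>'
soup = BeautifulSoup(snippet, "html.parser")

name_value = soup.p["name"]   # raises KeyError if the attribute is missing
missing = soup.p.get("id")    # returns None instead of raising
classes = soup.p["class"]     # class is multi-valued, so this is a list

print(name_value, missing, classes)
```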

Getting content

print(soup.p.string)
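
Note that .string only returns text when a tag has a single child (or a single chain of children); otherwise it is None. A minimal sketch of the difference from get_text(), assuming html.parser and an illustrative snippet:

```python
from bs4 import BeautifulSoup

snippet = (
    '<p class="title"><b>The Dormouse\'s story</b></p>'
    '<p class="story">Once upon a time '
    '<a href="http://example.com/elsie">Elsie</a> lived.</p>'
)
soup = BeautifulSoup(snippet, "html.parser")
title_p, story_p = soup.find_all("p")

single = title_p.string          # single chain p > b > text, so the text is returned
mixed = story_p.string           # several children, so .string is None
full_text = story_p.get_text()   # concatenates every string under the tag

print(single, mixed, full_text)
```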

Nested selection

from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')  # pass in the parser: lxml

print(soup.head.title.string)

Children and descendants

html = """
<html>
    <head>
        <title>The Dormouse's story</title>
    </head>
    <body>
        <p class="story">
            Once upon a time there were three little sisters; and their names were
            <a href="http://example.com/elsie" class="sister" id="link1">
                <span>Elsie</span>
            </a>
            <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> 
            and
            <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
            and they lived at the bottom of a well.
        </p>
        <p class="story">...</p>
"""


from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')  # pass in the parser: lxml
print(soup.p.contents)  # get the direct children of the tag, as a list

Another approach is children:

from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')  # pass in the parser: lxml
print(soup.p.children)  # an iterator over the direct children of the tag
for i, child in enumerate(soup.p.children):  # i receives the index, child the node
    print(i, child)

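The output is the same as above, plus an index. Note that children returns only an iterator object, so you have to loop over it (or pass it to list()) to get at the child nodes.
Getting descendants: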

from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')  # pass in the parser: lxml
print(soup.p.descendants)  # an iterator over all descendants of the tag
for i, child in enumerate(soup.p.descendants):  # i receives the index, child the node
    print(i, child)
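
To make the difference concrete, here is a small sketch (html.parser and an illustrative one-line snippet are assumptions) that counts direct children against descendants. Text nodes count as nodes too, so filtering with isinstance(node, Tag) keeps only the tags:

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

snippet = "<p>one <a><span>Elsie</span></a> two</p>"
soup = BeautifulSoup(snippet, "html.parser")

children = list(soup.p.children)        # 'one ', <a>, ' two'
descendants = list(soup.p.descendants)  # 'one ', <a>, <span>, 'Elsie', ' two'

# Keep only Tag objects, dropping the text nodes.
child_tags = [node.name for node in children if isinstance(node, Tag)]
descendant_tags = [node.name for node in descendants if isinstance(node, Tag)]

print(len(children), len(descendants), child_tags, descendant_tags)
```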

Parents and ancestors

parent

from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')  # pass in the parser: lxml
print(soup.a.parent)  # get the parent of the tag

This prints the parent of the a node: the p tag.

parents

from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')  # pass in the parser: lxml
print(list(enumerate(soup.a.parents)))  # get all ancestors of the tag
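
Printing whole ancestor nodes is verbose, since each one contains everything below it. A shorter sketch, assuming html.parser and a minimal illustrative document, prints just the ancestor names. The last entry is the BeautifulSoup object itself, whose name is "[document]":

```python
from bs4 import BeautifulSoup

snippet = "<html><body><p><a>Elsie</a></p></body></html>"
soup = BeautifulSoup(snippet, "html.parser")

# Walk upward from the first <a> tag to the document root.
ancestor_names = [parent.name for parent in soup.a.parents]
print(ancestor_names)
```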

Sibling nodes

from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')  # pass in the parser: lxml
print(list(enumerate(soup.a.next_siblings)))  # siblings that come after the tag
print(list(enumerate(soup.a.previous_siblings)))  # siblings that come before the tag
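
The sibling iterators yield text nodes (commas, whitespace, connecting words) as well as tags. A minimal sketch, with html.parser and a condensed snippet as assumptions, that keeps only the tag siblings:

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

snippet = (
    '<p><a id="link1">Elsie</a>, <a id="link2">Lacie</a> and '
    '<a id="link3">Tillie</a></p>'
)
soup = BeautifulSoup(snippet, "html.parser")

raw_next = list(soup.a.next_siblings)  # ', ', <a link2>, ' and ', <a link3>
tag_ids = [node["id"] for node in raw_next if isinstance(node, Tag)]

print(len(raw_next), tag_ids)
```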

Standard selectors

find_all( name , attrs , recursive , text , **kwargs )
Searches the document by tag name, attributes, or text content.

name

html='''
<div class="panel">
    <div class="panel-heading">
        <h4>Hello</h4>
    </div>
    <div class="panel-body">
        <ul class="list" id="list-1">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
            <li class="element">Jay</li>
        </ul>
        <ul class="list list-small" id="list-2">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
        </ul>
    </div>
</div>
'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print(soup.find_all('ul'))  # find all ul tags
print(type(soup.find_all('ul')[0]))  # check the type of each result

Now let's see how to search for child tags nested under a tag.
The following example finds the li tags under every ul tag:

from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
for ul in soup.find_all('ul'):
    print(ul.find_all('li'))

attrs

(attributes)
Search for elements by their attributes:

html='''
<div class="panel">
    <div class="panel-heading">
        <h4>Hello</h4>
    </div>
    <div class="panel-body">
        <ul class="list" id="list-1" name="elements">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
            <li class="element">Jay</li>
        </ul>
        <ul class="list list-small" id="list-2">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
        </ul>
    </div>
</div>
'''


from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print(soup.find_all(attrs={'id': 'list-1'}))  # pass a dict of the attributes to match
print(soup.find_all(attrs={'name': 'elements'}))

Both calls return the same content, because the two attributes belong to the same tag.
Some attributes can be passed directly as keyword arguments:

from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print(soup.find_all(id='list-1'))  # id can be passed directly as a keyword argument
print(soup.find_all(class_='element'))  # class is a Python keyword, so use class_
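
Because class can hold several values, class_ matches a tag if any one of its classes equals the argument. The following sketch (html.parser and a shortened snippet are assumptions) also combines a tag name with a keyword filter and uses limit to cap the number of results:

```python
from bs4 import BeautifulSoup

snippet = """
<ul class="list" id="list-1"><li class="element">Foo</li><li class="element">Bar</li></ul>
<ul class="list list-small" id="list-2"><li class="element">Foo</li></ul>
"""
soup = BeautifulSoup(snippet, "html.parser")

# "list" matches both <ul> tags, including the one with two classes.
all_list_ids = [ul["id"] for ul in soup.find_all("ul", class_="list")]
small_ids = [ul["id"] for ul in soup.find_all("ul", class_="list-small")]

# limit stops the search after the given number of matches.
first_two = [li.get_text() for li in soup.find_all("li", limit=2)]

print(all_list_ids, small_ids, first_two)
```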

text

Select by text content:

html='''
<div class="panel">
    <div class="panel-heading">
        <h4>Hello</h4>
    </div>
    <div class="panel-body">
        <ul class="list" id="list-1">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
            <li class="element">Jay</li>
        </ul>
        <ul class="list list-small" id="list-2">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
        </ul>
    </div>
</div>
'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print(soup.find_all(text='Foo'))  # find text equal to Foo; returns strings, not tags

So the text argument is handy for matching content, but less convenient for locating the elements that contain it.
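
If the enclosing tags are what you need, two approaches are sketched below, assuming Beautiful Soup 4.4.0 or later (where the argument is also available under the name string) and html.parser: go from each matched string to its .parent, or combine a tag name with the string filter so that find_all returns tags directly.

```python
from bs4 import BeautifulSoup

snippet = (
    '<ul><li class="element">Foo</li>'
    '<li class="element">Bar</li>'
    '<li class="element">Foo</li></ul>'
)
soup = BeautifulSoup(snippet, "html.parser")

# Approach 1: match the strings, then step up to the tag that holds each one.
strings = soup.find_all(string="Foo")
parents = [s.parent for s in strings]

# Approach 2: a tag name plus string= returns the tags whose .string matches.
li_tags = soup.find_all("li", string="Foo")

print([p.name for p in parents], len(li_tags))
```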

Methods

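find
find() is used exactly like find_all(), but returns only the first matching element (or None if nothing matches).
find_parents(), find_parent()
find_parents() returns all ancestor nodes; find_parent() returns the direct parent.

find_next_siblings(), find_next_sibling()
The first returns all following siblings; the second returns the first following sibling.

find_previous_siblings(), find_previous_sibling()
The first returns all preceding siblings; the second works analogously.

find_all_next(), find_next()
The first returns all matching nodes after the current node; the second returns the first matching node after it.

find_all_previous() and find_previous()
Likewise, for nodes before the current node.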
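
A short sketch of several of these methods on one small tree. The snippet and the use of html.parser are illustrative assumptions:

```python
from bs4 import BeautifulSoup

snippet = (
    '<div id="outer"><ul id="list-1">'
    "<li>Foo</li><li>Bar</li><li>Jay</li>"
    "</ul></div>"
)
soup = BeautifulSoup(snippet, "html.parser")

first_li = soup.find("li")        # first match only
missing = soup.find("table")      # no match, so None
enclosing_div = first_li.find_parent("div")

next_li = first_li.find_next_sibling("li")
later = first_li.find_next_siblings("li")

last_li = soup.find_all("li")[-1]
previous_li = last_li.find_previous_sibling("li")

print(first_li.get_text(), missing, enclosing_div["id"])
print(next_li.get_text(), len(later), previous_li.get_text())
```

Since find() can return None, checking the result before indexing or chaining avoids an AttributeError or TypeError on pages where the element is absent.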

CSS selectors

Pass a CSS selector directly to select() to make the selection.

html='''
<div class="panel">
    <div class="panel-heading">
        <h4>Hello</h4>
    </div>
    <div class="panel-body">
        <ul class="list" id="list-1">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
            <li class="element">Jay</li>
        </ul>
        <ul class="list list-small" id="list-2">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
        </ul>
    </div>
</div>
'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print(soup.select('.panel .panel-heading'))  # . denotes a class; the space separates ancestor and descendant
print(soup.select('ul li'))  # select li tags under ul tags
print(soup.select('#list-2 .element'))  # '#' denotes an id: elements with class element under the tag with id list-2
print(type(soup.select('ul')[0]))  # print the node type
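
A few more selector forms, as a hedged sketch: select_one() returns the first match or None, ul.list-small > li uses the child combinator, and [id="list-1"] is an attribute selector. The snippet is a shortened version of the sample above, and html.parser is used so the example does not depend on lxml:

```python
from bs4 import BeautifulSoup

snippet = """
<div class="panel-body">
  <ul class="list" id="list-1"><li class="element">Foo</li><li class="element">Bar</li></ul>
  <ul class="list list-small" id="list-2"><li class="element">Jay</li></ul>
</div>
"""
soup = BeautifulSoup(snippet, "html.parser")

first_ul = soup.select_one("ul")                 # first match only
nothing = soup.select_one("table")               # no match, so None
small_items = soup.select("ul.list-small > li")  # direct li children of ul.list-small
by_attr = soup.select('ul[id="list-1"] li')      # attribute selector

print(first_ul["id"], nothing, len(small_items), len(by_attr))
```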

Now a nested selection:

from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
for ul in soup.select('ul'):
	print(ul.select('li'))

This prints the li tags inside each ul tag.

Getting attributes

from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
for ul in soup.select('ul'):
    print(ul['id'])  # use [] to get an attribute
    print(ul.attrs['id'])  # alternative form

Getting content

from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
for li in soup.select('li'):
    print(li.get_text())

The get_text() method returns the content.

Summary

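Prefer the lxml parser; fall back to html.parser when necessary.
Tag selection (for example soup.title) offers weak filtering but is fast.
Use find() and find_all() to match a single result or multiple results.
If you are comfortable with CSS selectors, use select().
Remember the common ways to get attributes and text values.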
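
As a recap, the sketch below combines these points: it parses a fragment, selects tags with a CSS selector, and reads both the text and an attribute from each. The function name extract_links is hypothetical, and html.parser is used here only so the example runs without lxml installed.

```python
from bs4 import BeautifulSoup


def extract_links(markup):
    """Hypothetical helper: return (text, href) pairs for every a.sister in p.story."""
    soup = BeautifulSoup(markup, "html.parser")
    return [
        (a.get_text(strip=True), a["href"])
        for a in soup.select("p.story a.sister")
    ]


html = """
<p class="story">
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
</p>
"""
links = extract_links(html)
for text, href in links:
    print(text, href)
```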
