[Python 3 Web Scraping Study Notes] Using Parsing Libraries 3: Beautiful Soup 1

阿新 • Published: 2018-11-11

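Beautiful Soup parses web pages by using their structure and attributes. With it, there is no need to write complicated regular expressions; a few simple statements are enough to extract an element from a page.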

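Beautiful Soup is a Python library for parsing HTML and XML, which makes it convenient to extract data from web pages. The official description reads:
Beautiful Soup provides a few simple, Pythonic functions for navigating, searching, and modifying the parse tree. It is a toolkit that parses a document and hands the user the data to be extracted; because it is simple, a complete application takes very little code.
Beautiful Soup automatically converts incoming documents to Unicode and outgoing documents to UTF-8. You do not need to think about encodings unless the document does not specify one, in which case you only need to state the original encoding.
Beautiful Soup has become as capable a Python parsing tool as lxml and html5lib, flexibly offering different parsing strategies or high speed.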
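
The encoding behaviour described above can be sketched as follows. This is a minimal illustration, not part of the original notes: the GBK-encoded sample bytes are made up, and html.parser is used only so that the sketch needs no third-party parser.

```python
from bs4 import BeautifulSoup

# Hypothetical input: GBK-encoded bytes that do not declare their encoding.
raw = "<html><head><title>你好</title></head></html>".encode("gbk")

# from_encoding states the original encoding of the incoming bytes.
soup = BeautifulSoup(raw, "html.parser", from_encoding="gbk")

title_text = soup.title.string      # a Unicode string after parsing
title_bytes = soup.title.encode()   # serialized to UTF-8 bytes by default

print(title_text)
print(title_bytes.decode("utf-8"))
```

If the input is already a Python str, no encoding needs to be stated at all.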

When parsing, Beautiful Soup actually relies on a parser. Besides the HTML parser in the Python standard library, it supports several third-party parsers, such as lxml.

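Parser	Usage	Advantages	Disadvantages
Python standard library	BeautifulSoup(markup, "html.parser")	Built into Python; moderate speed; tolerant of malformed documents	Poor tolerance of malformed documents in versions before Python 2.7.3 and Python 3.2.2
lxml HTML parser	BeautifulSoup(markup, "lxml")	Fast; tolerant of malformed documents	Requires a C library to be installed
lxml XML parser	BeautifulSoup(markup, "xml")	Fast; the only parser that supports XML	Requires a C library to be installed
html5lib	BeautifulSoup(markup, "html5lib")	Best error tolerance; parses documents the way a browser does; produces HTML5-format documents	Slow; requires an external Python dependency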

The comparison shows that the lxml parser can parse both HTML and XML, and that it is fast and tolerant of malformed input.
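
To see how the choice of parser affects the result, the same invalid markup can be fed to each parser in turn. The sketch below is illustrative only: the one-line markup is made up, and parsers that are not installed are skipped rather than assumed to be present.

```python
from bs4 import BeautifulSoup, FeatureNotFound

markup = "<a></p>"  # deliberately invalid: a stray closing tag
results = {}

for parser in ("html.parser", "lxml", "html5lib"):
    try:
        results[parser] = str(BeautifulSoup(markup, parser))
    except FeatureNotFound:
        # This third-party parser is not installed; record that and move on.
        results[parser] = None

for parser, output in results.items():
    print(parser, "->", output)
```

Each parser repairs the markup in its own way, so the same input can produce different trees.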
The following example shows the basic usage of Beautiful Soup:

html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print(soup.prettify())
print(soup.title.string)

Output:

<html>
 <head>
  <title>
   The Dormouse's story
  </title>
 </head>
 <body>
  <p class="title" name="dromouse">
   <b>
    The Dormouse's
   </b>
  </p>
  <p class="story">
   Once upon a time there were three little sisters; and their names were
   <a class="sister" href="http://example.com/elsie" id="link1">
    <!-- Elsie -->
   </a>
   ,
   <a class="sister" href="http://example.com/lacie" id="link2">
    Lacie
   </a>
   and
   <a class="sister" href="http://example.com/tillie" id="link3">
    Tillie
   </a>
   ;
and they lived at the bottom of a well.
  </p>
  <p class="story">
   ...
  </p>
 </body>
</html>
The Dormouse's story

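We first declare the variable html, which holds an HTML string. Note that it is not a complete HTML string, because the body and html nodes are never closed. We then pass it as the first argument to BeautifulSoup; the second argument is the parser type (lxml here). This completes the initialization of the BeautifulSoup object, which is assigned to the variable soup.
From here on, we can call the methods and attributes of soup to parse this HTML.
First we call prettify(), which outputs the parsed string with standard indentation. Note that the output contains closing body and html tags: Beautiful Soup automatically corrects non-standard HTML. This correction is not done by prettify(); it is already done when the BeautifulSoup object is initialized.
Then we call soup.title.string, which outputs the text of the title node. In other words, soup.title selects the title node, and its string attribute returns the text inside.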
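
A quick way to check that the repair happens at initialization rather than in prettify() is to serialize the tree with str() and never call prettify() at all. The short markup and the use of html.parser are assumptions made for this sketch.

```python
from bs4 import BeautifulSoup

# Incomplete markup: p, body and html are never closed.
broken = "<html><head><title>Demo</title></head><body><p>Hello"

soup = BeautifulSoup(broken, "html.parser")

# str(soup) serializes the parsed tree; prettify() is not involved.
repaired = str(soup)
print(repaired)
```

The closing tags are already present in the serialized tree, which is consistent with the explanation above.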

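Node selectors

A node can be selected simply by accessing its name as an attribute, and the string attribute then returns the text inside it. This way of selecting is very fast, and it is a good choice when the structure leading to a single node is clear.

Selecting elements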

html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print(soup.title)
print(type(soup.title))
print(soup.title.string)
print(soup.head)
print(soup.p)

Output:

<title>The Dormouse's story</title>
<class 'bs4.element.Tag'>
The Dormouse's story
<head><title>The Dormouse's story</title></head>
<p class="title" name="dromouse"><b>The Dormouse's</b></p>

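We reuse the HTML from before. The first print shows the result of selecting the title node: the node together with its text. The second shows its type, bs4.element.Tag, which is an important data structure in Beautiful Soup. Every result of a node selector is of this Tag type. A Tag has attributes such as string, which returns the node's text, so the third line of output is that text.
Next we select the head node, and the result is again the node with everything inside it. Finally we select the p node. This case is special: the result is only the first p node, and the later p nodes are not selected. When several nodes match, this kind of selection returns only the first one and ignores the rest.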
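
The first-match behaviour can be made explicit by comparing soup.p with the result of find_all(), a Beautiful Soup method that these notes have not introduced yet and that returns every match. The three-paragraph markup is made up for this sketch.

```python
from bs4 import BeautifulSoup

html = """
<p class="title">first</p>
<p class="story">second</p>
<p class="story">third</p>
"""
soup = BeautifulSoup(html, "html.parser")

first = soup.p                # only the first matching node
all_p = soup.find_all("p")    # every matching node, in document order

print(first)
print(len(all_p))
```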

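Extracting information

Getting the name

The name attribute returns the name of a node. Using the same text as above, select the title node and read its name attribute:
print(soup.title.name)
Output:
title

Getting attributes

A node may have several attributes, such as id and class. After selecting the node, call attrs to get all of them: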

print(soup.p.attrs)
print(soup.p.attrs['name'])

Output:

{'class': ['title'], 'name': 'dromouse'}
dromouse

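As shown, attrs returns a dictionary that maps each attribute name of the selected node to its value. Getting the name attribute is then just a dictionary lookup: square brackets with the attribute name.
There is a simpler way: skip attrs and put the square brackets directly after the node, passing the attribute name to get its value.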

print(soup.p['name'])
print(soup.p['class'])

Output:

dromouse
['title']
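
Note that name comes back as a string while class comes back as a list: class may hold several values, so Beautiful Soup always returns it as a list. The sketch below also shows Tag.get(), which returns None (or a default) for a missing attribute where square brackets would raise KeyError. The markup is made up for illustration.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<p class="title big" name="dromouse">text</p>', "html.parser")
p = soup.p

classes = p["class"]             # multi-valued attribute -> list of str
name = p["name"]                 # single-valued attribute -> str
missing = p.get("id")            # absent attribute -> None, no exception
fallback = p.get("id", "no-id")  # absent attribute -> supplied default

try:
    p["id"]                      # square brackets raise for a missing attribute
    raised = False
except KeyError:
    raised = True

print(classes, name, missing, fallback, raised)
```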

Getting content

The string attribute returns the text content of a node. For example, to get the text of the first p node:

print(soup.p.string)

Output:

The Dormouse's
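
The string attribute works here because the first p node contains a single b node, which in turn contains a single piece of text. The following sketch, with made-up markup modelled on the example, shows what happens in other cases; get_text() is a Beautiful Soup method not covered in these notes so far.

```python
from bs4 import BeautifulSoup, Comment

html = """
<p class="title"><b>The Dormouse's</b></p>
<p class="story">Once upon a time <a id="link1"><!-- Elsie --></a> and <a id="link2">Lacie</a></p>
"""
soup = BeautifulSoup(html, "html.parser")

title_p, story_p = soup.find_all("p")

single = title_p.string     # a chain of single children -> the text
several = story_p.string    # several children -> None
comment = soup.a.string     # a lone comment is returned as a Comment object
text = story_p.get_text()   # joins the text strings of all descendants

print(repr(single), repr(several), type(comment).__name__)
print(text)
```

When a node has more than one child, string cannot tell which text is meant and returns None.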

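Nested selection

In the examples above, every result is a bs4.element.Tag, and a Tag can itself be used to select further nodes. For example, after getting the head node we can go on to call title to select the title node inside it: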

html = """
<html><head><title>The Dormouse's story</title></head>
<body>
"""
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print(soup.head.title)
print(type(soup.head.title))
print(soup.head.title.string)

Output:

<title>The Dormouse's story</title>
<class 'bs4.element.Tag'>
The Dormouse's story

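The first line of output is the title node selected by calling head and then title. The second line is its type, which is still bs4.element.Tag. In other words, selecting from a Tag yields another Tag; since every step returns the same type, selections can be nested.
Finally, we print its string attribute, which is the text inside the node.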
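
One caveat when chaining selections: if a step finds no matching node, it returns None, and the next step in the chain would fail. A minimal sketch with made-up markup:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(
    "<html><head><title>Demo</title></head><body></body></html>",
    "html.parser",
)

same_node = soup.head.title is soup.title  # both paths reach the same Tag object
absent = soup.body.title                   # there is no title inside body -> None

# Guard the chain before reading .string from a node that may be absent.
text = absent.string if absent is not None else None

print(same_node, absent, text)
```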

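Associated selection

Sometimes the desired node cannot be selected in a single step. In that case, first select some node, then use it as a reference point to select its children, parent, siblings, and so on.

Child and descendant nodes

After selecting a node, its direct children are available through the contents attribute: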

html = """
<html>
<head>
<title>The Dormouse's story</title>
</head>
<body>
<p class="story">
    Once upon a time there were three little sisters; and their names were
    <a href="http://example.com/elsie" class="sister" id="link1">
<span>Elsie</span>
</a>
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
and they lived at the bottom of a well.
</p>
<p class="story">...</p>
"""
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print(soup.p.contents)

Output:

['\n    Once upon a time there were three little sisters; and their names were\n    ', <a class="sister" href="http://example.com/elsie" id="link1">
<span>Elsie</span>
</a>, '\n', <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, '\nand\n', <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>, '\nand they lived at the bottom of a well.\n']

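The result is a list. The p node contains both text and nodes, and they are all returned together in one list.
Note that every element of the list is a direct child of the p node. The first a node contains a span node, which is a grandchild of p, but span is not listed as a separate element. So contents gives the list of direct children.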
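
Because contents mixes text and nodes, it is often useful to keep only the nodes. A small sketch, using a shortened version of the markup and an isinstance check against bs4.element.Tag:

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

html = '<p>intro <a id="link1"><span>Elsie</span></a> and <a id="link2">Lacie</a></p>'
soup = BeautifulSoup(html, "html.parser")

# Keep the direct children that are nodes and drop the text strings.
direct_tags = [child for child in soup.p.contents if isinstance(child, Tag)]
names = [tag.name for tag in direct_tags]
ids = [tag["id"] for tag in direct_tags]

print(len(soup.p.contents), names, ids)
```

The span node does not appear, since it is a child of the first a node and not of p.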
The children attribute gives the corresponding result:

from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print(soup.p.children)
for i, child in enumerate(soup.p.children):
    print(i, child)

Output:

<list_iterator object at 0x000002347B2FD748>
0
    Once upon a time there were three little sisters; and their names were

1 <a class="sister" href="http://example.com/elsie" id="link1">
<span>Elsie</span>
</a>
2

3 <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
4
and

5 <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
6
and they lived at the bottom of a well.

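With the same HTML, we use the children attribute this time. It returns an iterator (a list_iterator in the output above) rather than a list, so we loop over it with for to print the contents.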
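
The relationship between children and contents can be checked directly. The sketch below uses made-up markup and relies only on children being an iterator over the same direct children.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p>one<b>two</b>three</p>", "html.parser")

children = soup.p.children
is_list = isinstance(children, list)  # False: an iterator, not a list
as_list = list(children)              # consume the iterator once
second_pass = list(children)          # already exhausted, so this is empty

print(is_list, as_list, second_pass)
```

If the children are needed more than once, use contents or store the result of list() instead of iterating twice.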
To get all descendant nodes, use the descendants attribute:

from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print(soup.p.descendants)
for i, child in enumerate(soup.p.descendants):
    print(i, child)

Output:

<generator object descendants at 0x000002347B603AF0>
0
    Once upon a time there were three little sisters; and their names were

1 <a class="sister" href="http://example.com/elsie" id="link1">
<span>Elsie</span>
</a>
2

3 <span>Elsie</span>
4 Elsie
5

6

7 <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
8 Lacie
9
and

10 <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
11 Tillie
12
and they lived at the bottom of a well.

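This time the result is a generator. Iterating over it shows that the output now includes the span node: descendants recursively walks all child nodes and yields every descendant.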
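
The difference between children and descendants is easiest to see by listing only the node names each one yields. The shortened markup is an assumption made for this sketch.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

html = '<p>intro <a id="link1"><span>Elsie</span></a> and <a id="link2">Lacie</a></p>'
soup = BeautifulSoup(html, "html.parser")

# Direct children only: span is not reached.
child_tags = [n.name for n in soup.p.children if isinstance(n, Tag)]

# All descendants, in document order: span appears right after its parent a.
descendant_tags = [n.name for n in soup.p.descendants if isinstance(n, Tag)]

total_descendants = len(list(soup.p.descendants))  # nodes and text strings together

print(child_tags, descendant_tags, total_descendants)
```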
