Using the Python Beautiful Soup Parsing Library

阿新 • Published: 2018-05-02

Beautiful Soup

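Beautiful Soup parses web pages by relying on their structure and attributes, which spares you from writing complex regular expressions.

Beautiful Soup is a Python library for parsing HTML and XML.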

1. Parsers

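Parser	Usage	Advantages	Disadvantages
Python standard library	BeautifulSoup(markup, "html.parser")	Moderate speed; lenient with malformed documents	Less lenient in Python versions before 2.7.3 and 3.2.2
lxml HTML parser	BeautifulSoup(markup, "lxml")	Fast; lenient with malformed documents	Requires a C library to be installed
lxml XML parser	BeautifulSoup(markup, "xml")	Fast; the only XML parser supported	Requires a C library to be installed
html5lib	BeautifulSoup(markup, "html5lib")	Most lenient; parses documents the way a browser does; produces valid HTML5	Slow; requires an external Python package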

All things considered, the lxml HTML parser is the recommended choice.

    from bs4 import BeautifulSoup

    # The <p> wrapper is written out explicitly so that soup.p exists under any parser
    soup = BeautifulSoup('<p>Hello World</p>', 'lxml')
    print(soup.p.string)
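
Because lxml is a separate install, a script may want to fall back to the built-in parser when it is missing. The following is a minimal sketch of that idea, not part of the original article: `make_soup` is a hypothetical helper name, and the sketch relies on Beautiful Soup raising `FeatureNotFound` when the requested parser is unavailable.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(markup):
    """Hypothetical helper: prefer lxml, fall back to html.parser."""
    try:
        return BeautifulSoup(markup, "lxml")
    except FeatureNotFound:
        # lxml is not installed; the standard-library parser needs no extras
        return BeautifulSoup(markup, "html.parser")


soup = make_soup("<p>Hello World</p>")
print(soup.p.string)
```

Either parser finds the same `<p>` element here. On badly broken markup the two can build different trees, so pinning one parser explicitly is the safer choice when results must be reproducible.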

2. Basic usage:

    # Sample document. The <p> tags are an assumed reconstruction (later examples use soup.p);
    # the closing </body> and </html> tags are missing, which the parser repairs.
    html = '''
    <html>
    <head><title>Infi-chu example</title></head>
    <body>
    <p>title example</p>
    <p>link
    <a href="http://example.com/elsie" class="sister" id="link1">elsie</a>,
    <a href="http://example.com/lacie" class="sister" id="link2">lacie</a>,
    <a href="http://example.com/tillie" class="sister" id="link3">tillie</a>,
    last sentence</p>
    '''

    from bs4 import BeautifulSoup

    soup = BeautifulSoup(html, 'lxml')
    print(soup.prettify())    # print the repaired HTML with standard indentation
    print(soup.title.string)  # print the text content of the title node

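3. Node selectors:

Selecting elements

Access an element as soup.<tag>, for example soup.title. When several tags share the name, only the first one is returned.

Extracting information

(1) Getting the name

soup.<tag>.name returns the element's tag name.

(2) Getting attributes

soup.<tag>.attrs returns all of the element's attributes as a dict.

soup.<tag>.attrs['name'] returns the value of the attribute called name.

(3) Element content

soup.<tag>.string returns the element's text content.

Nested selection

Selections can be chained: soup.<parent>.<tag>.string returns the text of a tag nested inside its parent.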
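
The accessors above can be tried together on a small document. This is an illustrative sketch: the sample HTML and the variable names are made up for the example, and it uses the built-in "html.parser" so that it runs without lxml.

```python
from bs4 import BeautifulSoup

# Made-up sample document, used only for this illustration
html = """
<html><head><title>Infi-chu example</title></head>
<body>
<p class="intro" id="first">first paragraph</p>
<p>second paragraph</p>
</body></html>
"""
soup = BeautifulSoup(html, "html.parser")

first_p = soup.p                     # only the first <p> is returned
tag_name = first_p.name              # the tag name as a string
all_attrs = first_p.attrs            # every attribute, as a dict
p_id = first_p.attrs["id"]           # one attribute value
p_class = first_p["class"]           # shorthand for first_p.attrs["class"]
p_text = first_p.string              # the text inside the tag
title_text = soup.head.title.string  # nested selection

print(tag_name, all_attrs, p_id, p_class, p_text, title_text)
```

Note that `class` comes back as a list, because an HTML element can carry several classes, while `id` comes back as a plain string.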

Related-node selection

(1) Children and descendants

    # Sample document. The <p> tags are an assumed reconstruction (the code below uses soup.p);
    # the closing </body> and </html> tags are missing, which the parser repairs.
    html = '''
    <html>
    <head><title>Infi-chu example</title></head>
    <body>
    <p>title example</p>
    <p>link
    <a href="http://example.com/elsie" class="sister" id="link1">elsie</a>,
    <a href="http://example.com/lacie" class="sister" id="link2">lacie</a>,
    <a href="http://example.com/tillie" class="sister" id="link3">tillie</a>,
    last sentence</p>
    '''

    from bs4 import BeautifulSoup

    # Direct children: the children attribute
    soup = BeautifulSoup(html, 'lxml')
    print(soup.p.children)  # an iterator, not a list
    for i, child in enumerate(soup.p.children):
        print(i, child)

    # All descendants: the descendants attribute
    print(soup.p.descendants)  # an iterator as well
    for i, child in enumerate(soup.p.descendants):
        print(i, child)
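
The difference between the two attributes is easiest to see on markup with one extra level of nesting. The snippet below is a small sketch with made-up markup (the `<span>` is added only for the illustration) and uses "html.parser" to avoid the lxml dependency.

```python
from bs4 import BeautifulSoup

# Made-up markup: the first link wraps its text in a <span>
markup = (
    '<p>link <a href="http://example.com/elsie"><span>elsie</span></a>, '
    '<a href="http://example.com/lacie">lacie</a></p>'
)
soup = BeautifulSoup(markup, "html.parser")

children = list(soup.p.children)        # direct children only
descendants = list(soup.p.descendants)  # children, grandchildren, and so on

print(len(children))
print(len(descendants))
for i, node in enumerate(descendants):
    print(i, repr(node))
```

The `<p>` has four direct children (two text nodes and two links), but seven descendants, because the `<span>` and the text inside each link are counted too.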

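(2) Parent and ancestor nodes

The parent attribute returns a node's direct parent.

The parents attribute returns all of its ancestors, as an iterator that walks from the direct parent up to the document root.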
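
As a quick illustration of both attributes, the sketch below walks upward from a link. The markup is made up for the example and parsed with "html.parser".

```python
from bs4 import BeautifulSoup

markup = "<html><body><p><a id='link1'>elsie</a></p></body></html>"
soup = BeautifulSoup(markup, "html.parser")

link = soup.a
parent_name = link.parent.name                         # direct parent
ancestor_names = [node.name for node in link.parents]  # nearest ancestor first

print(parent_name)
print(ancestor_names)
```

The last ancestor is the BeautifulSoup object itself, whose name is the special string "[document]".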

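(3) Sibling nodes

next_sibling: the next sibling node (which may be a text node)

previous_sibling: the previous sibling node

next_siblings: all following sibling nodes

previous_siblings: all preceding sibling nodes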
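
The four sibling attributes can be compared on three adjacent links. In this sketch the markup is made up and deliberately has no whitespace between the tags, since any text between two tags would itself count as a sibling node.

```python
from bs4 import BeautifulSoup

markup = (
    '<p><a id="link1">elsie</a><a id="link2">lacie</a>'
    '<a id="link3">tillie</a></p>'
)
soup = BeautifulSoup(markup, "html.parser")

middle = soup.find(id="link2")
next_id = middle.next_sibling["id"]
prev_id = middle.previous_sibling["id"]

# The plural forms are iterators; previous_siblings starts with the nearest node
following = [a["id"] for a in soup.a.next_siblings]
preceding = [a["id"] for a in soup.find(id="link3").previous_siblings]

print(next_id, prev_id, following, preceding)
```

A node with no sibling in the requested direction yields None, so `soup.a.previous_sibling` is None here.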

(4) Extracting information

4. Method selectors:

find_all()

find_all(name, attrs, recursive, text, **kwargs)

(1) name

    # Assumes soup was built from a document that contains <ul> lists
    print(soup.find_all(name='ul'))
    for ul in soup.find_all(name='ul'):
        print(ul.find_all(name='ul'))  # <ul> elements nested inside each <ul>

    for ul in soup.find_all(name='ul'):
        print(ul.find_all(name='li'))
        for li in ul.find_all(name='li'):
            print(li.string)

(2) attrs

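    # Query by attribute
    print(soup.find_all(attrs={'id': 'list1'}))
    print(soup.find_all(attrs={'name': 'elements'}))

    # Common attributes can also be passed as keyword arguments;
    # class is a reserved word in Python, so the keyword is class_
    print(soup.find_all(id='list1'))
    print(soup.find_all(class_='elements'))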
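
The snippets in this section query `<ul>` and `<li>` elements, which need a matching document. The following self-contained sketch supplies one; the sample HTML, its ids and class names are made up for the illustration, and "html.parser" is used so that nothing beyond bs4 is required.

```python
from bs4 import BeautifulSoup

# Made-up sample document with two lists
html = """
<ul class="list" id="list1">
  <li class="element">Foo</li>
  <li class="element">Bar</li>
</ul>
<ul class="list small" id="list2">
  <li class="element">Jay</li>
</ul>
"""
soup = BeautifulSoup(html, "html.parser")

lists = soup.find_all(name="ul")                 # query by tag name
by_attrs = soup.find_all(attrs={"id": "list1"})  # query by attribute dict
by_keyword = soup.find_all(id="list1")           # same query, keyword form
items = [li.string for li in soup.find_all("li", class_="element")]

print(len(lists), by_attrs == by_keyword, items)
```

The dict form and the keyword form return the same elements. The dict form is still needed for attributes such as name, because `name` is already taken by the first parameter of find_all().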

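(3) text

The text parameter matches against the text of nodes. It accepts either a string or a compiled regular expression object. Since Beautiful Soup 4.4.0 the same parameter is also available under the name string.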

    import re
    from bs4 import BeautifulSoup

    soup = BeautifulSoup(html, 'lxml')
    print(soup.find_all(text=re.compile('link')))  # text nodes that contain "link"

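find()

find() returns a single element, the first match, where find_all() returns a list of every match. When nothing matches, find() returns None.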
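
A short sketch of text matching and of the find() return values follows. The markup is made up for the example, and the sketch uses the `string` keyword, the newer spelling of `text`.

```python
import re
from bs4 import BeautifulSoup

markup = "<p>link one</p><p>other</p><p>second link</p>"
soup = BeautifulSoup(markup, "html.parser")

# A regular expression matches anywhere in the text, not only at the start
matches = soup.find_all(string=re.compile("link"))
first_p = soup.find("p")      # first match only
missing = soup.find("table")  # no match

print(matches)
print(first_p)
print(missing)
```

Because find() can return None, code that goes on to read an attribute of the result should check for None first.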

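[Note] The following method pairs take the same arguments as find_all() and find(), but each searches a different part of the tree. The first method of each pair returns every match, the second only the first match.

find_parents() and find_parent(): the ancestors of the node

find_next_siblings() and find_next_sibling(): the siblings after the node

find_previous_siblings() and find_previous_sibling(): the siblings before the node

find_all_next() and find_next(): everything after the node in the document

find_all_previous() and find_previous(): everything before the node in the document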
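
To make the search directions concrete, here is a small sketch that starts from the middle paragraph of a made-up document (the ids exist only for the illustration).

```python
from bs4 import BeautifulSoup

markup = '<div id="box"><p id="a">A</p><p id="b">B</p><p id="c">C</p></div>'
soup = BeautifulSoup(markup, "html.parser")

b = soup.find(id="b")
parent_id = b.find_parent("div")["id"]        # search upward
next_id = b.find_next_sibling("p")["id"]      # search following siblings
prev_id = b.find_previous_sibling("p")["id"]  # search preceding siblings

# Plural variants return every match instead of the first one
later_ids = [p["id"] for p in soup.find(id="a").find_next_siblings("p")]
after_ids = [p["id"] for p in b.find_all_next("p")]

print(parent_id, next_id, prev_id, later_ids, after_ids)
```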

5. CSS selectors:

The select() method takes a CSS selector and returns a list of the matching elements.

Nested selection

    for ul in soup.select('ul'):
        print(ul.select('li'))

Getting attributes

    for ul in soup.select('ul'):
        print(ul['id'])
        # equivalent to
        print(ul.attrs['id'])

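Getting text

Besides the string attribute, the get_text() method also returns an element's text.

    for li in soup.select('li'):
        # Both print the same text when the <li> contains only text
        print(li.get_text())
        print(li.string)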
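
The sketch below pulls the CSS selector calls together and shows a case where string and get_text() differ. The sample HTML is made up for the illustration; `select_one()`, which returns only the first match, is not used in the article but is part of the same API.

```python
from bs4 import BeautifulSoup

# Made-up sample document; the second item contains a nested <b> tag
html = """
<ul class="list" id="list1">
  <li class="element">Foo</li>
  <li class="element">Bar <b>bold</b></li>
</ul>
"""
soup = BeautifulSoup(html, "html.parser")

items = soup.select("#list1 .element")  # descendant selector, returns a list
list_id = soup.select_one("ul")["id"]   # first match only

plain, mixed = items
print(plain.string, "|", plain.get_text())
print(mixed.string, "|", mixed.get_text())
```

For the first item both calls give "Foo". For the second item string is None, because the element has more than one child, while get_text() joins all of the nested text into "Bar bold".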
