Using Beautiful Soup

阿新 • Published: 2017-06-25


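Beautiful Soup is simple, practical, and reasonably full-featured. Until now I have extracted information from downloaded pages with XPath myself; from now on I can use Beautiful Soup for simple parsing jobs, which saves time and effort.

Beautiful Soup is an HTML/XML parser written in Python. It copes well with malformed markup and builds a parse tree from it, and it is typically used to analyze web documents fetched by a crawler. For irregular HTML it also fills in much of the missing structure, which saves developers time and effort.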
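As a small illustration of the repair behaviour described above, the sketch below feeds an unclosed fragment to the parser and prints what comes back. It assumes the newer Beautiful Soup 4 package (imported as bs4) and Python's built-in html.parser, neither of which this article uses; the fragment itself is made up for the example.

```python
from bs4 import BeautifulSoup  # assumption: Beautiful Soup 4 is installed

# Hypothetical fragment: neither <p> nor <b> is ever closed.
broken = "<p>This is <b>bold"

soup = BeautifulSoup(broken, "html.parser")
repaired = str(soup)

# The serialized tree should now contain the closing tags the input lacked.
print(repaired)
```

Other parsers (lxml, html5lib) may repair the same input differently, so the exact output depends on which one is chosen.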

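The official Beautiful Soup documentation is thorough; working through the official examples once is enough to get comfortable with the library. The official documentation is available in both English and Chinese.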

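I. Installing Beautiful Soup

Installing BeautifulSoup is simple: download the BeautifulSoup source, unpack it, and run

python setup.py install

To check that the installation succeeded, enter import BeautifulSoup; if no exception is raised, the installation worked.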

II. Using BeautifulSoup

1. Import BeautifulSoup and create a BeautifulSoup object

from BeautifulSoup import BeautifulSoup           # HTML
from BeautifulSoup import BeautifulStoneSoup      # XML
# Alternative: import the whole module. Left commented out because it would
# rebind the name BeautifulSoup to the module and break the call below.
# import BeautifulSoup                            # ALL

doc = [
    '<html><head><title>Page title</title></head>',
    '<body><p id="firstpara" align="center">This is paragraph <b>one</b>.',
    '<p id="secondpara" align="blah">This is paragraph <b>two</b>.',
    '</html>'
]
# BeautifulSoup takes a single string argument
soup = BeautifulSoup(''.join(doc))
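The import above targets the older BeautifulSoup module. For readers on Python 3, a rough equivalent is sketched below; it assumes the Beautiful Soup 4 package (bs4) and the built-in html.parser, and it uses a well-formed variant of the sample document (closing tags added) so that the result does not depend on how a particular parser repairs it.

```python
from bs4 import BeautifulSoup  # assumption: Beautiful Soup 4, not the module used in this article

# Well-formed variant of the article's sample document.
html_doc = (
    '<html><head><title>Page title</title></head>'
    '<body><p id="firstpara" align="center">This is paragraph <b>one</b>.</p>'
    '<p id="secondpara" align="blah">This is paragraph <b>two</b>.</p>'
    '</body></html>'
)

# Beautiful Soup 4 takes the markup plus the name of the parser to use.
soup = BeautifulSoup(html_doc, "html.parser")

print(soup.title.string)
print(soup.p["id"])
```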

2. Overview of BeautifulSoup objects

When BeautifulSoup parses an HTML document, it treats the document as a tree, much like a DOM tree. The BeautifulSoup tree has three basic kinds of object.

2.1. The soup object: BeautifulSoup.BeautifulSoup

>>> type(soup)
<class 'BeautifulSoup.BeautifulSoup'>

2.2. Tags: BeautifulSoup.Tag

>>> type(soup.html)
<class 'BeautifulSoup.Tag'>

2.3. Text: BeautifulSoup.NavigableString

>>> type(soup.title.string)
<class 'BeautifulSoup.NavigableString'>
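The same three kinds of object can be checked programmatically. The sketch below assumes Beautiful Soup 4 (bs4) rather than the module used in this article, so the class paths differ from the output shown above; the short document is only an illustration.

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString, Tag

soup = BeautifulSoup(
    "<html><head><title>Page title</title></head></html>", "html.parser"
)

# Record the class name of each of the three basic objects.
kinds = {
    "soup": type(soup).__name__,
    "soup.html": type(soup.html).__name__,
    "soup.title.string": type(soup.title.string).__name__,
}
print(kinds)

# In bs4 the soup object is itself a kind of Tag, and text nodes are str subclasses.
print(isinstance(soup, Tag), isinstance(soup.title.string, str))
```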

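3. The BeautifulSoup parse tree

3.1 Methods of BeautifulSoup.Tag objects

Getting a tag object (Tag)

Access by tag name: write the tag name as an attribute of the soup object and a Tag object is returned. This is most useful when the tag occurs only once in the document, since only the first matching tag is returned. Alternatively, follow the structure of the tree and select one level at a time.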

>>> html = soup.html
>>> html
<html><head><title>Page title</title></head><body><p id="firstpara" align="center">This is paragraph <b>one</b>.</p><p id="secondpara" align="blah">This is paragraph <b>two</b>.</p></body></html>
>>> type(html)
<class 'BeautifulSoup.Tag'>
>>> title = soup.title
>>> title
<title>Page title</title>

The contents attribute

contents holds the direct children of a tag (or of the soup object) as a list.

>>> soup.contents
[<html><head><title>Page title</title></head><body><p id="firstpara" align="center">This is paragraph <b>one</b>.</p><p id="secondpara" align="blah">This is paragraph <b>two</b>.</p></body></html>]

>>> soup.contents[0].contents
[<head><title>Page title</title></head>, <body><p id="firstpara" align="center">This is paragraph <b>one</b>.</p><p id="secondpara" align="blah">This is paragraph <b>two</b>.</p></body>]
>>> len(soup.contents[0].contents)
2
>>> type(soup.contents[0].contents[1])
<class 'BeautifulSoup.Tag'>

Use contents to walk down the tree and parent to walk back up.
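Walking in both directions can be sketched with two small helpers. This assumes Beautiful Soup 4 (bs4) with html.parser; the helper names outline and ancestors are hypothetical and exist only for this example.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

html_doc = (
    '<html><head><title>Page title</title></head>'
    '<body><p id="firstpara">This is paragraph <b>one</b>.</p>'
    '<p id="secondpara">This is paragraph <b>two</b>.</p></body></html>'
)
soup = BeautifulSoup(html_doc, "html.parser")


def outline(node, depth=0):
    """Walk down via .contents and return indented tag names."""
    lines = []
    for child in node.contents:
        if isinstance(child, Tag):  # skip text nodes
            lines.append("  " * depth + child.name)
            lines.extend(outline(child, depth + 1))
    return lines


def ancestors(tag):
    """Walk up via .parent and return the names of all enclosing nodes."""
    names = []
    parent = tag.parent
    while parent is not None:
        names.append(parent.name)
        parent = parent.parent
    return names


print("\n".join(outline(soup)))
print(ancestors(soup.b))
```

The last name returned by ancestors is the soup object itself, which bs4 names "[document]".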

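The next attribute

next returns the element that was parsed immediately after the current one, which for a tag with children is its first child. The result can be either a Tag or a NavigableString.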

>>> head = soup.head
>>> head.next
<title>Page title</title>
>>> head.next.next
u'Page title'

>>> p1 = soup.p
>>> p1
<p id="firstpara" align="center">This is paragraph <b>one</b>.</p>
>>> p1.next
u'This is paragraph '

nextSibling returns the next sibling, which can be either a Tag or a NavigableString.

>>> head.nextSibling
<body><p id="firstpara" align="center">This is paragraph <b>one</b>.</p><p id="secondpara" align="blah">This is paragraph <b>two</b>.</p></body>
>>> p1.next.nextSibling
<b>one</b>

previousSibling is the counterpart of nextSibling and returns the previous sibling node.
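For comparison, Beautiful Soup 4 spells these attributes next_element, next_sibling and previous_sibling. The sketch below assumes bs4 with html.parser and a well-formed variant of the sample document; it is not the module used in the transcripts above.

```python
from bs4 import BeautifulSoup

html_doc = (
    '<html><head><title>Page title</title></head>'
    '<body><p id="firstpara">This is paragraph <b>one</b>.</p>'
    '<p id="secondpara">This is paragraph <b>two</b>.</p></body></html>'
)
soup = BeautifulSoup(html_doc, "html.parser")

head = soup.head
p1 = soup.p

# next_element follows parse order: first into <title>, then into its text.
after_head = head.next_element
after_title = head.next_element.next_element

# Siblings share the same parent: <head>/<body>, and the text/<b> inside <p>.
body = head.next_sibling
first_text = p1.next_element
bold = first_text.next_sibling

print(after_head.name, repr(str(after_title)))
print(body.name, body.previous_sibling.name)
print(repr(str(first_text)), bold.name)
```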

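The replaceWith method

replaceWith replaces the object in the tree with its argument; a string is passed here. The replaced object still exists afterwards but no longer has a parent.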

>>> head = soup.head
>>> head
<head><title>Page title</title></head>
>>> head.parent
<html><head><title>Page title</title></head><body><p id="firstpara" align="center">This is paragraph <b>one</b>.</p><p id="secondpara" align="blah">This is paragraph <b>two</b>.</p></body></html>
>>> head.replaceWith('head was replace')
>>> head
<head><title>Page title</title></head>
>>> head.parent
>>> soup
<html>head was replace<body><p id="firstpara" align="center">This is paragraph <b>one</b>.</p><p id="secondpara" align="blah">This is paragraph <b>two</b>.</p></body></html>
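The same replacement can be sketched with Beautiful Soup 4, where the method is named replace_with. This assumes bs4 with html.parser; the shortened document and the replacement text are illustrative only.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(
    "<html><head><title>Page title</title></head>"
    "<body><p>text</p></body></html>",
    "html.parser",
)

head = soup.head

# Swap the whole <head> element for a plain string.
removed = head.replace_with("head was replaced")

# The detached element keeps its own contents but has no parent any more.
print(removed)
print(head.parent)
print(soup)
```

In bs4, replace_with returns the element that was removed, so it can be kept and reinserted elsewhere if needed.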

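Search methods

There are two search methods, find and findAll. Both are available only on Tag objects and on the top-level parser object; NavigableString objects do not have them.

findAll(name, attrs, recursive, text, limit, **kwargs)

With a single argument, the tag name:

Find every p tag in the document; the result is returned as a list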

>>> soup.findAll('p')
[<p id="firstpara" align="center">This is paragraph <b>one</b>.</p>, <p id="secondpara" align="blah">This is paragraph <b>two</b>.</p>]
>>> type(soup.findAll('p'))
<class 'BeautifulSoup.ResultSet'>

Find the p tag with id="firstpara"; the return value is a result set (ResultSet)

>>> pid = type(soup.findAll('p', id='firstpara'))
>>> pid
<class 'BeautifulSoup.ResultSet'>

Passing one or more attribute pairs as a dictionary

>>> p2 = soup.findAll('p', {'align': 'blah'})
>>> p2
[<p id="secondpara" align="blah">This is paragraph <b>two</b>.</p>]
>>> type(p2)
<class 'BeautifulSoup.ResultSet'>
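The calling styles shown so far (tag name, keyword attribute, attribute dictionary) can be compared side by side. The sketch below assumes Beautiful Soup 4 (bs4), where findAll is spelled find_all, with html.parser and a well-formed variant of the sample document.

```python
from bs4 import BeautifulSoup

html_doc = (
    '<html><head><title>Page title</title></head>'
    '<body><p id="firstpara" align="center">This is paragraph <b>one</b>.</p>'
    '<p id="secondpara" align="blah">This is paragraph <b>two</b>.</p>'
    '</body></html>'
)
soup = BeautifulSoup(html_doc, "html.parser")

all_paragraphs = soup.find_all("p")              # by tag name
by_keyword = soup.find_all("p", id="firstpara")  # by keyword attribute
by_dict = soup.find_all("p", {"align": "blah"})  # by attribute dictionary
limited = soup.find_all("p", limit=1)            # stop after the first match
first_only = soup.find("p")                      # single Tag instead of a list
missing = soup.find("table")                     # no match gives None

print([p["id"] for p in all_paragraphs])
print([p["id"] for p in by_keyword], [p["id"] for p in by_dict])
print(first_only["id"], missing)
```

find is convenient when only one element is expected, but its result should be checked for None before use.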

Using regular expressions

>>> import re
>>> soup.findAll(id=re.compile("para$"))
[<p id="firstpara" align="center">This is paragraph <b>one</b>.</p>, <p id="secondpara" align="blah">This is paragraph <b>two</b>.</p>]
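To make clear what the pattern para$ selects, the sketch below adds one element whose id contains "para" but does not end with it. It assumes Beautiful Soup 4 (bs4) with html.parser; the extra div and its id are invented for the example.

```python
import re

from bs4 import BeautifulSoup

html_doc = (
    '<p id="firstpara">one</p>'
    '<p id="secondpara">two</p>'
    '<div id="paragraphs">three</div>'  # hypothetical element that must not match
)
soup = BeautifulSoup(html_doc, "html.parser")

# "$" anchors the match at the end of the attribute value.
pattern = re.compile("para$")
matches = [tag["id"] for tag in soup.find_all(id=pattern)]

print(matches)
```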

Reading and modifying attributes

>>> p1 = soup.p
>>> p1
<p id="firstpara" align="center">This is paragraph <b>one</b>.</p>
>>> p1['id']
u'firstpara'
>>> p1['id'] = 'changeid'
>>> p1
<p id="changeid" align="center">This is paragraph <b>one</b>.</p>
>>> p1['class'] = 'new class'
>>> p1
<p id="changeid" align="center" class="new class">This is paragraph <b>one</b>.</p>
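Attribute access works like a dictionary, which also means a missing key raises KeyError. A minimal sketch, assuming Beautiful Soup 4 (bs4) with html.parser, shows reading, changing, adding and safely probing attributes.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(
    '<p id="firstpara" align="center">This is paragraph <b>one</b>.</p>',
    "html.parser",
)
p1 = soup.p

original_id = p1["id"]       # read an existing attribute
p1["id"] = "changeid"        # overwrite it
p1["class"] = "new class"    # add an attribute that was not there
absent = p1.get("title")     # .get() returns None instead of raising KeyError

# .attrs exposes all attributes of the tag as a plain dict.
print(original_id, absent)
print(p1.attrs)
```

The serialized attribute order can differ between versions, so the example inspects attrs rather than the printed tag.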

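Those are the basic parse-tree methods. There are others, as well as more ways to combine them with regular expressions; see the official documentation for details.

3.2 Methods of BeautifulSoup.NavigableString objects

NavigableString objects are simpler: they are mainly used to get the text content.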

>>> soup.title
<title>Page title</title>
>>> title = soup.title.next
>>> title
u'Page title'
>>> type(title)
<class 'BeautifulSoup.NavigableString'>
>>> title.string
u'Page title'
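Getting text out of the tree can be sketched as follows, assuming Beautiful Soup 4 (bs4) with html.parser. The example also shows a common surprise: string is None for a tag that has more than one child, in which case get_text() collects the text of all descendants.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(
    "<html><head><title>Page title</title></head>"
    "<body><p>This is paragraph <b>one</b>.</p></body></html>",
    "html.parser",
)

title_text = soup.title.string     # a NavigableString
plain_title = str(title_text)      # convert to an ordinary str

mixed = soup.p.string              # <p> has text and a <b> child, so this is None
paragraph_text = soup.p.get_text() # concatenated text of all descendants

print(plain_title)
print(mixed)
print(paragraph_text)
```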

For how to traverse the tree to analyze a document, and for how to parse XML documents, see the official documentation.
