Web Scraping: The BeautifulSoup Module

阿新 • Published: 2019-02-10


The BeautifulSoup module

Introduction

Beautiful Soup is a Python library for extracting data from HTML and XML files.

Installation

pip install beautifulsoup4

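Installing a parser

Beautiful Soup supports the HTML parser in the Python standard library as well as several third-party parsers. One of them is lxml. Depending on your operating system, install lxml with one of the following commands:

$ apt-get install python-lxml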

$ easy_install lxml

$ pip install lxml

Another option is html5lib, a parser written in pure Python that parses pages the same way a web browser does. Install html5lib with one of the following commands:

$ apt-get install python-html5lib

$ easy_install html5lib

$ pip install html5lib

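Parsers

The table below lists the main parsers together with their strengths and weaknesses.

The official documentation recommends lxml as the parser because it is faster.

On Python 2 releases before 2.7.3 and Python 3 releases before 3.2.2, you must install lxml or html5lib, because the HTML parser built into the standard library of those versions is not reliable enough.

[Image: table comparing the main parsers]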
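Since the choice of parser depends on what is installed, here is a minimal sketch of picking the first available one. The helper name make_soup and the preference order are assumptions made for illustration; FeatureNotFound is the exception bs4 raises when the requested parser is missing.

```python
from bs4 import BeautifulSoup, FeatureNotFound

def make_soup(markup, preferred=("lxml", "html5lib", "html.parser")):
    """Return (soup, parser_name) using the first parser that is installed.

    make_soup is a hypothetical helper, not part of Beautiful Soup.
    """
    for parser in preferred:
        try:
            return BeautifulSoup(markup, parser), parser
        except FeatureNotFound:
            # This parser is not installed; try the next one.
            continue
    raise RuntimeError("no usable parser found")

soup, parser_name = make_soup("<p>Hello<b>world")
print(parser_name, soup.b.string)
```

html.parser ships with Python, so the loop should always find at least one parser.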

Basic usage

html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
 
"""

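# Basic usage and fault tolerance: even when the HTML is incomplete (html_doc above never closes <body> or <html>), BeautifulSoup still parses it into a BeautifulSoup object, which can then be printed with standard indentation.
from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc, 'lxml')  # tolerant of broken markup
res = soup.prettify()  # fix the indentation and show the structure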
print(res)
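To see the fault tolerance in isolation, here is a small sketch with a deliberately truncated document. The markup string is made up for illustration, and the example uses the standard-library html.parser so that it runs without lxml installed.

```python
from bs4 import BeautifulSoup

# Deliberately broken: <b>, <p>, <body> and <html> are never closed.
broken = "<html><head><title>Demo</title></head><body><p class='title'><b>Unclosed"

soup = BeautifulSoup(broken, "html.parser")

title_text = soup.title.string   # text of the <title> tag
bold_text = soup.b.string        # text of the unclosed <b> tag
repaired = soup.prettify()       # the missing closing tags are written out

print(title_text, bold_text)
print(repaired)
```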

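Traversing the document tree

Here you select an element directly by its tag name. This is fast,

but when several tags share the same name, only the first one is returned.

# Topics covered in this section:

# 1. Usage
# 2. Getting the tag name
# 3. Getting the tag attributes
# 4. Getting the tag content
# 5. Nested selection
# 6. Children and descendants
# 7. Parent and ancestors
# 8. Siblings

# Sample document used for the traversal examples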
html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p id="my p" class="title"><b id="bbb" class="boldest">The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""

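# 1. Usage
from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc, 'lxml')
# soup = BeautifulSoup(open('a.html'), 'lxml')

print(soup.p)  # only the first <p> is returned when several exist
print(soup.a)  # only the first <a> is returned when several exist

# 2. Getting the tag name
print(soup.p.name)

# 3. Getting the tag attributes
print(soup.p.attrs)

# 4. Getting the tag content
print(soup.p.string)  # the text if p has exactly one child, otherwise None
print(soup.p.strings)  # a generator over every text string under p
print(soup.p.text)  # all the text under p, joined into one string
for line in soup.stripped_strings:  # every string in the document, with whitespace stripped
    print(line)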


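'''
If a tag has more than one child, .string cannot tell which child's content to return, so it evaluates to None. If the tag has exactly one child, .string returns that child's text. In the structure below, soup.p.string is None, but soup.p.strings still yields all of the text.
<p id='list-1'>
    哈哈哈哈
    <a class='sss'>
        <span>
            <h1>aaaa</h1>
        </span>
    </a>
    <b>bbbbb</b>
</p>
'''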

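# 5. Nested selection
print(soup.head.title.string)
print(soup.body.a.string)


# 6. Children and descendants
print(soup.p.contents)  # a list of the direct children of p
print(soup.p.children)  # an iterator over the direct children of p

for i, child in enumerate(soup.p.children):
    print(i, child)

print(soup.p.descendants)  # a generator over every node below p, at any depth
for i, child in enumerate(soup.p.descendants):
    print(i, child)

# 7. Parent and ancestors
print(soup.a.parent)  # the parent of the first <a>
print(soup.a.parents)  # a generator over all ancestors of <a>: its parent, its parent's parent, and so on


# 8. Siblings
print('=====>')
print(soup.a.next_sibling)  # the next sibling
print(soup.a.previous_sibling)  # the previous sibling

print(list(soup.a.next_siblings))  # all following siblings (a generator, turned into a list here)
print(soup.a.previous_siblings)  # all preceding siblings (a generator object)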
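The difference between .string and .strings described in the note above can be checked with a short sketch. It reuses the nested structure from that note with the whitespace inside <a> removed, and parses it with html.parser (an assumption: other parsers may rearrange an <h1> placed inside a <p>).

```python
from bs4 import BeautifulSoup

snippet = """
<p id="list-1">
    hahaha
    <a class="sss"><span><h1>aaaa</h1></span></a>
    <b>bbbbb</b>
</p>
"""

soup = BeautifulSoup(snippet, "html.parser")
p = soup.find("p")

# p has several children, so .string cannot pick one and gives None.
print(p.string)

# .stripped_strings walks every text node under p and strips whitespace.
texts = list(p.stripped_strings)
print(texts)

# <a> has a single child chain (a > span > h1 > text), so .string follows it.
print(p.a.string)
```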

Searching the document tree

The five kinds of filters

from bs4 import BeautifulSoup
html_doc = '''<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="title"><b>$75</b></p>
<p id="meiyuan">啦啦啦啦啦啦</p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>'''
soup = BeautifulSoup(html_doc, "lxml")
# 1. Strings: a string filter must match exactly
print(soup.find_all(name="a"))  # every <a> tag
print(soup.find_all(name="a aa"))  # no tag has this name, so this prints []
print(soup.find_all(attrs={"class": "sister"}))
print(soup.find_all(text="The Dormouse's story"))  # search by text (newer Beautiful Soup releases call this argument string=)
print(soup.find_all(name="b", text="The Dormouse's story"))  # <b> tags whose text is "The Dormouse's story"
print(soup.p.find(name="b").text)  # text of the <b> inside the first <p>
print(soup.find_all(name="p", attrs={"class": "story"}))  # <p> tags whose class is "story"
print(soup.find(name="p", attrs={"class": "story"}).find_all(name="a")[2])  # the third <a> (index 2) inside the first <p class="story">
# 2. Regular expressions
import re
print(soup.find_all(name=re.compile("^b")))  # tags whose name starts with "b"
print(soup.find_all(attrs={"id": re.compile("link")}))  # tags whose id contains "link"
print(soup.find_all(text=re.compile(r"\$")))  # text that contains a dollar sign (a price)
#
# # 3. Lists: given a list, Beautiful Soup returns anything that matches any element of the list.
print(soup.find_all(name=["a", re.compile("^b")]))  # every <a> tag plus every tag whose name starts with "b"
print(soup.find_all(text=["$",]))  # finds nothing: a string must match the whole text exactly
print(soup.find_all(text=[re.compile(r"\$")]))  # ['$75']
print(soup.find_all(text=["a", re.compile(r"\$")]))

# # 4. True: matches any value
print(soup.find_all(name=True))  # every tag in the document
print(soup.find_all(attrs={"id": True}))  # every tag that has an id attribute
#
print(soup.find_all(name="p", attrs={"id": True}))  # <p> tags that have an id attribute
# 5. Functions: if no other filter fits, define a function that takes a single element as its argument. It returns True if the element matches and should be included, and False otherwise.
#
# # Tags that have a class attribute but no id attribute
def has_class_not_id(tag):
    return tag.has_attr('class') and not tag.has_attr('id')
    # return tag.has_attr('id') and not tag.has_attr('class')

    # return tag.name == "a" and tag.has_attr("class") and not tag.has_attr("id")
# #     # the last variant only matches <a> tags
print(soup.find_all(has_class_not_id))  # the function is called once for each tag


print(soup.find_all(name="a", limit=2))  # search for <a> tags but stop after the first two
print(soup.body.find_all(attrs={"class": "sister"}, recursive=False))  # class "sister", direct children of <body> only
print(soup.html.find_all('a'))
print(soup.html.find_all('a', recursive=False))
# recursive = True   # search all descendants (the default)
# recursive = False  # search only the tag's direct children, without going deeper

# **kwargs
print(soup.find_all(attrs={"class": "sister"}))
print(soup.find_all(class_="sister"))  # same as the line above

print(soup.find_all(attrs={"id": "link3"}))  # these two are equivalent, just written differently
print(soup.find_all(id="link3"))
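As a compact recap of the filters above, the following sketch runs a function filter, a regular expression, a list, and recursive=False against a tiny document. The markup and variable names are invented for the example, and html.parser is used so that no <html>/<body> wrapper is added around the fragment.

```python
import re
from bs4 import BeautifulSoup

doc = (
    '<div id="box"><p class="story">intro '
    '<a class="sister" id="link1" href="/elsie">Elsie</a>'
    '<a class="sister" href="/lacie">Lacie</a></p>'
    '<b class="note">bold</b></div>'
)
soup = BeautifulSoup(doc, "html.parser")

def has_class_but_no_id(tag):
    """Function filter: keep tags that define class but not id."""
    return tag.has_attr("class") and not tag.has_attr("id")

by_function = [t.name for t in soup.find_all(has_class_but_no_id)]
by_regex = [t.name for t in soup.find_all(re.compile("^b"))]   # names starting with "b"
by_list = [t.name for t in soup.find_all(["a", "b"])]           # <a> or <b>
direct = [t.name for t in soup.div.find_all(True, recursive=False)]  # children of <div> only

print(by_function, by_regex, by_list, direct)
```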

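find_all

find_all( name , attrs , recursive , text , **kwargs )

# 2. find_all( name , attrs , recursive , text , **kwargs )
# 2.1 name: the name argument accepts any kind of filter: a string, a regular expression, a list, a function, or True.
print(soup.find_all(name=re.compile('^t')))

# 2.2 keyword arguments: written as key=value, where value can be a filter: a string, a regular expression, a list, or True.
print(soup.find_all(id=re.compile('my')))
print(soup.find_all(href=re.compile('lacie'), id=re.compile(r'\d')))  # note: to filter on the class attribute, use class_
print(soup.find_all(id=True))  # tags that have an id attribute

# Some attribute names cannot be used as keyword arguments, for example the HTML5 data-* attributes:
data_soup = BeautifulSoup('<div data-foo="value">foo!</div>', 'lxml')
# data_soup.find_all(data-foo="value")  # SyntaxError: keyword can't be an expression
# You can still search for such attributes by passing a dictionary as the attrs argument of find_all():
print(data_soup.find_all(attrs={"data-foo": "value"}))
# [<div data-foo="value">foo!</div>]

# 2.3 Searching by class: the keyword is class_, and in class_=value the value can be any of the filters
print(soup.find_all('a', class_='sister'))  # <a> tags with the class sister
print(soup.find_all('a', class_='sister ssss'))  # <a> tags whose class attribute is exactly "sister ssss"; the same classes in another order do not match
print(soup.find_all(class_=re.compile('^sis')))  # every tag with a class that starts with "sis"

# 2.4 attrs
print(soup.find_all('p', attrs={'class': 'story'}))

# 2.5 text: the value can be a string, a list, True, or a regular expression
print(soup.find_all(text='Elsie'))
print(soup.find_all('a', text='Elsie'))

# 2.6 limit: searching a very large document tree can be slow. If you do not need every result, use limit to cap the number returned. It works like the LIMIT keyword in SQL: once the number of results reaches the limit, the search stops and returns what it has.
print(soup.find_all('a', limit=2))

# 2.7 recursive: when you call find_all() on a tag, Beautiful Soup examines all of that tag's descendants. To search only the tag's direct children, pass recursive=False.
print(soup.html.find_all('a'))
print(soup.html.find_all('a', recursive=False))

'''
Calling a tag is like calling find_all()
find_all() is probably the most used search method in Beautiful Soup, so the library provides a shorthand for it. A BeautifulSoup object or a tag object can be called like a function, and the result is the same as calling find_all() on that object. The following two lines are equivalent:
soup.find_all("a")
soup("a")
So are these two lines:
soup.title.find_all(text=True)
soup.title(text=True)
'''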
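The find_all() arguments above can be exercised together in one short sketch. The document string and the variable names are assumptions made for illustration, and html.parser stands in for lxml so the example has no extra dependency.

```python
from bs4 import BeautifulSoup

doc = (
    '<div data-foo="value">foo!</div>'
    '<a class="sister" id="link1">Elsie</a>'
    '<a class="sister big" id="link2">Lacie</a>'
    '<a class="sister" id="link3">Tillie</a>'
)
soup = BeautifulSoup(doc, "html.parser")

# limit stops the search after two matches.
first_two = [a["id"] for a in soup.find_all("a", limit=2)]

# class_ with one class name matches any tag carrying that class.
by_class = [a["id"] for a in soup.find_all("a", class_="sister")]

# class_ with several names must equal the class attribute string exactly.
exact = [a["id"] for a in soup.find_all("a", class_="sister big")]

# data-* attributes are not valid keyword names, so they go through attrs.
data_divs = soup.find_all(attrs={"data-foo": "value"})

# Calling the soup object is shorthand for find_all().
shorthand = [a["id"] for a in soup("a")]

print(first_two, by_class, exact, data_divs, shorthand)
```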

find

find( name , attrs , recursive , text , **kwargs )

# 3. find( name , attrs , recursive , text , **kwargs )
find_all() returns every tag in the document that matches, but sometimes you only want one result. If the document has a single <body> tag, for example, searching for it with find_all() is not a good fit. Rather than calling find_all() with limit=1, call find() directly. The following two lines are nearly equivalent:

soup.find_all('title', limit=1)
# [<title>The Dormouse's story</title>]
soup.find('title')
# <title>The Dormouse's story</title>

The only difference is that find_all() returns a list containing a single element, while find() returns the element itself.
When nothing matches, find_all() returns an empty list and find() returns None.
print(soup.find("nosuchtag"))
# None

soup.head.title is shorthand for navigating by tag name. It works by calling find() repeatedly on the current tag:

soup.head.title
# <title>The Dormouse's story</title>
soup.find("head").find("title")
# <title>The Dormouse's story</title>
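Because find() returns None when nothing matches, chained calls on a missing tag raise AttributeError. Below is one possible way to guard against that. The helper text_of and its default argument are hypothetical names chosen for this sketch, not part of Beautiful Soup.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(
    "<html><head><title>Demo</title></head><body></body></html>",
    "html.parser",
)

def text_of(soup, name, default=""):
    """Return the text of the first tag called `name`, or `default` if absent."""
    tag = soup.find(name)
    return tag.get_text() if tag is not None else default

as_list = soup.find_all("title", limit=1)   # a list with one element
as_tag = soup.find("title")                 # the element itself
found = text_of(soup, "title")
fallback = text_of(soup, "nosuchtag", default="(missing)")

print(as_list, as_tag, found, fallback)
```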

CSS selectors

# The module provides a select method that supports CSS selectors. For details see the official documentation: https://www.crummy.com/software/BeautifulSoup/bs4/doc/index.zh.html#id37
html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title">
    <b>The Dormouse's story</b>
    Once upon a time there were three little sisters; and their names were
    <a href="http://example.com/elsie" class="sister" id="link1">
        <span>Elsie</span>
    </a>
    <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
    <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
    <div class='panel-1'>
        <ul class='list' id='list-1'>
            <li class='element'>Foo</li>
            <li class='element'>Bar</li>
            <li class='element'>Jay</li>
        </ul>
        <ul class='list list-small' id='list-2'>
            <li class='element'><h1 class='yyyy'>Foo</h1></li>
            <li class='element xxx'>Bar</li>
            <li class='element'>Jay</li>
        </ul>
    </div>
    and they lived at the bottom of a well.
</p>
<p class="story">...</p>
"""
from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc, 'lxml')

# 1. CSS selectors
print(soup.p.select('.sister'))
print(soup.select('.sister span'))

print(soup.select('#link1'))
print(soup.select('#link1 span'))

print(soup.select('#list-2 .element.xxx'))

print(soup.select('#list-2')[0].select('.element'))  # select calls can be chained, but there is no need: a single selector does the same job

# 2. Getting attributes
print(soup.select('#list-2 h1')[0].attrs)

# 3. Getting content
print(soup.select('#list-2 h1')[0].get_text())
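The sketch below shows the same ideas on a trimmed copy of the two lists, and adds select_one(), which returns the first match or None instead of a list. The shortened markup is an assumption for the example, and html.parser is used in place of lxml.

```python
from bs4 import BeautifulSoup

doc = (
    '<div class="panel-1">'
    '<ul class="list" id="list-1">'
    '<li class="element">Foo</li><li class="element">Bar</li></ul>'
    '<ul class="list list-small" id="list-2">'
    '<li class="element"><h1 class="yyyy">Foo</h1></li>'
    '<li class="element xxx">Bar</li></ul>'
    '</div>'
)
soup = BeautifulSoup(doc, "html.parser")

# Descendant selector: every .element somewhere under #list-2.
items = [li.get_text() for li in soup.select("#list-2 .element")]

# Compound selector: an element carrying both classes.
first = soup.select_one("#list-2 .element.xxx")

# Attributes and text are read from the matched tags as usual.
h1_attrs = soup.select("#list-2 h1")[0].attrs

# Child combinator: only the direct <li> children of ul#list-1.
direct = [li.get_text() for li in soup.select("ul#list-1 > li")]

print(items, first.get_text(), h1_attrs, direct)
```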

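Other methods

See the official documentation.

Modifying the document tree

See the official documentation.

Summary

1. lxml is the recommended parser.

2. There are three ways to select elements: by tag name, with find and find_all, and with CSS selectors.

    1. Selecting by tag name offers little filtering but is fast.

    2. Use find and find_all to match a single result or several results.

    3. If you are comfortable with CSS selectors, use select.

3. Remember the common ways to get attributes (attrs) and text (get_text()).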
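As a quick recap, the three selection styles from the summary can reach the same element, after which attrs and get_text() read its data. This is a minimal sketch with a one-line document made up for the example, parsed with html.parser.

```python
from bs4 import BeautifulSoup

doc = (
    '<p class="story">'
    '<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>'
    '</p>'
)
soup = BeautifulSoup(doc, "html.parser")

by_name = soup.a                        # select by tag name: first <a> only
by_find = soup.find("a", id="link1")    # find with a keyword filter
by_css = soup.select_one("a#link1")     # CSS selector

href = by_find.attrs["href"]            # attribute lookup through attrs
label = by_find.get_text()              # text content
missing = by_find.get("title")          # .get() gives None for an absent attribute

print(by_name == by_find == by_css, href, label, missing)
```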
