Python Web Scraping, Part 6: The BeautifulSoup Library

阿新 • Published: 2018-11-30

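The BeautifulSoup Library

Introduction

BeautifulSoup is a flexible, convenient library for parsing web pages; it is efficient and supports several parsers.
With it you can extract information from web pages without writing regular expressions.
Install on macOS: pip3 install beautifulsoup4 (the current major version is 4)
soup = BeautifulSoup(html, 'lxml')
- soup.prettify()
- soup.title
- soup.title.name
- soup.title.string
- soup.head
- soup.head.title.string
- soup.p
- soup.p.string
- soup.p.attrs['name']
- soup.p['name']
- soup.p.children
- soup.p.descendants
- soup.a.parents
- the soup.find family of methods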

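Parser	Usage	Advantages	Disadvantages
Python standard library	BeautifulSoup(markup, "html.parser")	Built into Python; moderate speed; tolerant of malformed documents	Poor tolerance of malformed documents in versions before Python 2.7.3 or 3.2.2
lxml HTML parser	BeautifulSoup(markup, "lxml")	Fast; tolerant of malformed documents	Requires a C library to be installed
lxml XML parser	BeautifulSoup(markup, "xml")	Fast; the only parser that supports XML	Requires a C library to be installed
html5lib	BeautifulSoup(markup, "html5lib")	Best tolerance of malformed markup; parses documents the way a browser does; produces HTML5-format documents	Slow (pure Python, so no external C extension is needed)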
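
Because the parsers in the table are optional installs, a script can try the preferred one and fall back to the built-in parser. The following is only a sketch: make_soup is a hypothetical helper name, and it relies on bs4 raising FeatureNotFound when the requested parser is not installed.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(markup, preferred=("lxml", "html.parser")):
    """Hypothetical helper: try each parser name in order, return (soup, parser)."""
    for parser in preferred:
        try:
            return BeautifulSoup(markup, parser), parser
        except FeatureNotFound:
            # This parser is not installed; try the next one.
            continue
    raise RuntimeError("no usable parser is installed")


# Deliberately broken markup: <b> and <p> are never closed.
soup, used = make_soup("<p>Hello<b>world")
print(used)           # whichever parser was available
print(soup.b.string)  # the parser closes <b> for us
```

Different parsers may repair the same broken markup differently (for example, lxml wraps the fragment in html and body tags), so it is worth fixing one parser per project.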

Basic Usage

html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print(soup.prettify())  # pretty-print; the parser has already completed missing tags and repaired the markup
print(soup.title.string)  # print the text of <title>

Tag Selectors

Selecting Elements

html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print(soup.title)
print(type(soup.title))
print(soup.head)
print(soup.p)  # returns only the first match
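
To make the "first match only" behaviour concrete, here is a small sketch comparing attribute access with find() and find_all(). The two-paragraph markup is made up for illustration, and html.parser is used only so the snippet needs no extra install.

```python
from bs4 import BeautifulSoup

html = '<p class="title">first</p><p class="story">second</p>'
soup = BeautifulSoup(html, "html.parser")

first = soup.p              # attribute access stops at the first <p>
same = soup.find("p")       # the equivalent explicit call
all_p = soup.find_all("p")  # every <p>, in document order
missing = soup.table        # no such tag: None rather than an exception

print(first.string, len(all_p), missing)
```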

Getting the Name

html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print(soup.title.name)

Getting Attributes

html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print(soup.p.attrs['name'])
print(soup.p['name'])
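
Indexing a tag raises KeyError when the attribute is absent, so get() is often safer on real pages. A brief sketch on made-up markup; note that class is a multi-valued attribute and comes back as a list with the HTML parsers.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<p class="title big" name="dromouse">x</p>', "html.parser")
p = soup.p

name = p["name"]                 # raises KeyError if the attribute is absent
classes = p["class"]             # multi-valued attribute: returned as a list
missing = p.get("id")            # None instead of an exception
fallback = p.get("id", "no-id")  # or a default of your choice

print(p.attrs)  # the full attribute dictionary
```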

Getting the Content

html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print(soup.p.string)
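
One caveat worth illustrating: .string only works when a tag has a single child (it follows a lone child tag down to its text); with several children it is None, and get_text() is the usual alternative. A minimal sketch on shortened markup:

```python
from bs4 import BeautifulSoup

html = (
    "<p class=\"title\"><b>The Dormouse's story</b></p>"
    '<p class="story">Once <a>Elsie</a> and <a>Lacie</a></p>'
)
soup = BeautifulSoup(html, "html.parser")
first, second = soup.find_all("p")

single = first.string     # one child (<b>), so .string reaches its text
multi = second.string     # several children, so .string is None
text = second.get_text()  # concatenates all text inside the tag

print(single)
print(multi)
print(text)
```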

Nested Selection

html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print(soup.head.title.string)

Children and Descendants

html = """
<html>
    <head>
        <title>The Dormouse's story</title>
    </head>
    <body>
        <p class="story">
            Once upon a time there were three little sisters; and their names were
            <a href="http://example.com/elsie" class="sister" id="link1">
                <span>Elsie</span>
            </a>
            <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> 
            and
            <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
            and they lived at the bottom of a well.
        </p>
        <p class="story">...</p>
"""
# from bs4 import BeautifulSoup
# soup = BeautifulSoup(html, 'lxml')
# print(soup.p.contents)

from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print(soup.p.children)
for i, child in enumerate(soup.p.children):
    print(i, child)

# from bs4 import BeautifulSoup
# soup = BeautifulSoup(html, 'lxml')
# print(soup.p.descendants)
# for i, child in enumerate(soup.p.descendants):
#     print(i, child)
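
The difference between the three accessors is easier to see on compact markup without whitespace nodes. In this sketch (made-up one-line HTML), contents is a list, while children and descendants are iterators; descendants also walks into nested tags.

```python
from bs4 import BeautifulSoup, Tag

html = '<p>Once <a id="link1"><span>Elsie</span></a> and <a id="link2">Lacie</a></p>'
soup = BeautifulSoup(html, "html.parser")
p = soup.p

contents = p.contents              # direct children as a list
children = list(p.children)        # the same nodes, from an iterator
descendants = list(p.descendants)  # children plus everything nested inside them

# Text nodes are included above; keep only real tags when that is what you want.
child_tags = [c.name for c in p.children if isinstance(c, Tag)]
desc_tags = [d.name for d in p.descendants if isinstance(d, Tag)]

print(len(contents), len(descendants))
print(child_tags, desc_tags)
```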

Parents and Ancestors

html = """
<html>
    <head>
        <title>The Dormouse's story</title>
    </head>
    <body>
        <p class="story">
            Once upon a time there were three little sisters; and their names were
            <a href="http://example.com/elsie" class="sister" id="link1">
                <span>Elsie</span>
            </a>
            <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> 
            and
            <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
            and they lived at the bottom of a well.
        </p>
        <p class="story">...</p>
"""
# from bs4 import BeautifulSoup
# soup = BeautifulSoup(html, 'lxml')
# print(soup.a.parent)

from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print(list(enumerate(soup.a.parents)))
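
Printing whole ancestor tags is noisy, so a quick way to see the chain is to print only the tag names. A small sketch on made-up markup; the last entry is the BeautifulSoup object itself, whose name is "[document]".

```python
from bs4 import BeautifulSoup

html = "<html><body><p class='story'><a id='link1'><span>Elsie</span></a></p></body></html>"
soup = BeautifulSoup(html, "html.parser")
span = soup.span

parent = span.parent                        # the direct parent: <a>
ancestors = [t.name for t in span.parents]  # walks up to the document root

print(parent.name)
print(ancestors)
```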

Siblings

html = """
<html>
    <head>
        <title>The Dormouse's story</title>
    </head>
    <body>
        <p class="story">
            Once upon a time there were three little sisters; and their names were
            <a href="http://example.com/elsie" class="sister" id="link1">
                <span>Elsie</span>
            </a>
            <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> 
            and
            <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
            and they lived at the bottom of a well.
        </p>
        <p class="story">...</p>
"""
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print(list(enumerate(soup.a.next_siblings)))
print(list(enumerate(soup.a.previous_siblings)))
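
next_siblings and previous_siblings yield text nodes as well as tags, which is why the output above is full of strings. If only tags are wanted, find_next_siblings() and find_previous_siblings() can filter by name. A sketch on simplified markup:

```python
from bs4 import BeautifulSoup

html = '<p>Once <a id="link1">Elsie</a>, <a id="link2">Lacie</a> and <a id="link3">Tillie</a>.</p>'
soup = BeautifulSoup(html, "html.parser")
first = soup.a
last = soup.find("a", id="link3")

raw_next = list(first.next_siblings)      # text nodes and tags mixed
raw_prev = list(first.previous_siblings)  # only the leading text node

tag_next = first.find_next_siblings("a")     # just the <a> tags after it
tag_prev = last.find_previous_siblings("a")  # nearest sibling first

print([a["id"] for a in tag_next])
print([a["id"] for a in tag_prev])
```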

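Standard Selectors

find_all(name, attrs, recursive, text, **kwargs)
Searches the document by tag name, attributes, or text content. (Beautiful Soup 4.4.0 and later name the text argument string; older code, including the examples here, uses text.)

find(name, attrs, recursive, text, **kwargs)
find() returns a single element (the first match); find_all() returns all matching elements.

find_parents() / find_parent()
find_parents() returns all ancestor nodes; find_parent() returns the direct parent.

find_next_siblings() / find_next_sibling()
find_next_siblings() returns all following siblings; find_next_sibling() returns the first following sibling.

find_previous_siblings() / find_previous_sibling()
find_previous_siblings() returns all preceding siblings; find_previous_sibling() returns the first preceding sibling.

find_all_next() / find_next()
find_all_next() returns all matching nodes after the current node; find_next() returns the first matching node after it.

find_all_previous() / find_previous()
find_all_previous() returns all matching nodes before the current node; find_previous() returns the first matching node before it.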
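
The directional methods listed above can be tried from a single starting node. This is a sketch on a trimmed version of the list markup, assuming Beautiful Soup 4.4.0 or later for the string argument.

```python
from bs4 import BeautifulSoup

html = (
    '<div class="panel-body">'
    '<ul id="list-1"><li>Foo</li><li>Bar</li><li>Jay</li></ul>'
    '<ul id="list-2"><li>Foo</li></ul>'
    "</div>"
)
soup = BeautifulSoup(html, "html.parser")
bar = soup.find("li", string="Bar")  # start from the middle item of the first list

parent_ul = bar.find_parent("ul")        # nearest enclosing <ul>
divs = bar.find_parents("div")           # every enclosing <div>
after = bar.find_next_sibling("li")      # next <li> in the same list
before = bar.find_previous_sibling("li") # previous <li> in the same list
next_ul = bar.find_next("ul")            # next <ul> anywhere later in the document
later_li = bar.find_all_next("li")       # all later <li>, in any list
earlier_li = bar.find_all_previous("li") # all earlier <li>

print(parent_ul["id"], after.string, before.string, next_ul["id"])
```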

html='''
<div class="panel">
    <div class="panel-heading">
        <h4>Hello</h4>
    </div>
    <div class="panel-body">
        <ul class="list" id="list-1">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
            <li class="element">Jay</li>
        </ul>
        <ul class="list list-small" id="list-2">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
        </ul>
    </div>
</div>
'''
#name
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print(soup.find_all('ul'))
print(type(soup.find_all('ul')[0]))

# nested search: call find_all() on each result
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
for ul in soup.find_all('ul'):
    print(ul.find_all('li'))

# attrs
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print(soup.find_all(id='list-1'))
print(soup.find_all(class_='element'))

#text
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print(soup.find_all(text='Foo'))

from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print(soup.find('ul'))
print(type(soup.find('ul')))
print(soup.find('page'))  # no <page> tag exists, so find() returns None
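
A few more filters that find_all() accepts, shown on a trimmed copy of the list markup. This is a sketch rather than an exhaustive tour; the string argument assumes Beautiful Soup 4.4.0 or later.

```python
import re

from bs4 import BeautifulSoup

html = (
    '<ul class="list" id="list-1">'
    '<li class="element">Foo</li><li class="element">Bar</li><li class="element">Jay</li>'
    "</ul>"
    '<ul class="list list-small" id="list-2">'
    '<li class="element">Foo</li><li class="element">Bar</li>'
    "</ul>"
)
soup = BeautifulSoup(html, "html.parser")

by_attrs = soup.find_all(attrs={"id": "list-1"})       # attributes as a dict
limited = soup.find_all("li", limit=2)                 # stop after two matches
both_lists = soup.find_all(class_="list")              # matches any one of a tag's classes
small_only = soup.find_all("ul", class_="list-small")  # tag name plus class
ba_strings = soup.find_all(string=re.compile("^Ba"))   # text nodes matching a regex

# find() returns None when nothing matches, so guard before using the result.
page = soup.find("page")
page_text = page.get_text() if page is not None else ""

print(len(by_attrs), len(limited), len(both_lists), len(small_only), ba_strings)
```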

CSS Selectors

Pass a CSS selector directly to select() to make a selection.

html='''
<div class="panel">
    <div class="panel-heading">
        <h4>Hello</h4>
    </div>
    <div class="panel-body">
        <ul class="list" id="list-1">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
            <li class="element">Jay</li>
        </ul>
        <ul class="list list-small" id="list-2">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
        </ul>
    </div>
</div>
'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print(soup.select('.panel .panel-heading'))
print(soup.select('ul li'))
print(soup.select('#list-2 .element'))
print(type(soup.select('ul')[0]))

from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
for ul in soup.select('ul'):
    print(ul.select('li'))
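
Beyond descendant selectors, select() also understands combined classes, the child combinator, and attribute selectors, and select_one() returns only the first match. A short sketch on a trimmed copy of the markup:

```python
from bs4 import BeautifulSoup

html = """<div class="panel-body">
<ul class="list" id="list-1"><li class="element">Foo</li><li class="element">Bar</li><li class="element">Jay</li></ul>
<ul class="list list-small" id="list-2"><li class="element">Foo</li><li class="element">Bar</li></ul>
</div>"""
soup = BeautifulSoup(html, "html.parser")

first_ul = soup.select_one("ul")            # first match, or None
small = soup.select("ul.list.list-small")   # both classes required
direct = soup.select("#list-1 > li")        # direct children only
by_attr = soup.select('ul[id="list-2"] li') # attribute selector
nothing = soup.select_one("table")          # no match

print(first_ul["id"], len(small), len(direct))
print([li.string for li in by_attr])
```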

Getting Attributes

from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
for ul in soup.select('ul'):
    print(ul['id'])
    print(ul.attrs['id'])

Getting the Content

from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
for li in soup.select('li'):
    print(li.get_text())
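
Real pages usually carry stray whitespace, and get_text() takes a separator and a strip flag to deal with it. A minimal sketch on a made-up list item:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<li class="element"> Foo <b>Bar</b> </li>', "html.parser")
li = soup.li

raw = li.get_text()                   # text exactly as it appears, spaces included
stripped = li.get_text(strip=True)    # each piece stripped, then joined with ""
spaced = li.get_text(" ", strip=True) # same, but joined with a single space
pieces = list(li.stripped_strings)    # the individual stripped pieces

print(repr(raw), repr(stripped), repr(spaced), pieces)
```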

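Summary

The lxml parser is recommended; fall back to html.parser when necessary.
Tag selectors (soup.title, soup.p, and so on) offer weak filtering but are fast.
Use find() and find_all() to match a single result or multiple results.
If you are familiar with CSS selectors, use select().
Remember the common ways of getting attributes and text values.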
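
Putting the pieces together, here is one possible way to pull structured data out of the sample story. extract_links is a hypothetical helper name, the markup is a shortened variant of the earlier sample, and html.parser is used only to avoid an extra install.

```python
from bs4 import BeautifulSoup

html = """
<p class="story">Once upon a time there were three little sisters:
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>.</p>
"""


def extract_links(markup):
    """Hypothetical helper: return one dict per <a class="sister">."""
    soup = BeautifulSoup(markup, "html.parser")
    return [
        {
            "id": a.get("id"),                # get() avoids KeyError on missing attributes
            "href": a.get("href"),
            "text": a.get_text(strip=True),
        }
        for a in soup.select("a.sister")      # CSS selector: <a> tags with class "sister"
    ]


links = extract_links(html)
for link in links:
    print(link)
```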
