Web Scraping: BeautifulSoup in Python (lxml)

A Xin (阿新) • Published: 2019-02-13

1. Introduction

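Beautiful Soup is a flexible, convenient HTML parsing library. It is efficient, supports several parsers, and lets you extract content from web pages without writing regular expressions.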

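Parser	Usage	Advantages	Disadvantages
Python standard library	BeautifulSoup(markup,"html.parser")	Built into Python; moderate speed; lenient with malformed documents	Less lenient in versions before Python 2.7.3
lxml HTML parser	BeautifulSoup(markup,"lxml")	Very fast; lenient with malformed documents	Requires the external lxml C library
lxml XML parser	BeautifulSoup(markup,"lxml-xml") or BeautifulSoup(markup,"xml")	Very fast; the only XML parser supported	Requires the external lxml C library
html5lib	BeautifulSoup(markup,"html5lib")	Most lenient; parses the document the way a browser does; produces valid HTML5	Very slow; requires the external html5lib package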
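The practical difference between the parsers shows up most clearly on broken markup. The sketch below feeds the same invalid fragment to each parser and prints what comes back. The helper name `parse_with` is made up for illustration, and lxml and html5lib are optional installs, so the sketch reports a parser as missing instead of failing.

```python
from bs4 import BeautifulSoup, FeatureNotFound

# An invalid fragment: an unclosed <a> followed by a stray </p>.
broken = "<a></p>"


def parse_with(parser):
    """Return the repaired markup as a string, or None if the parser is not installed."""
    try:
        return str(BeautifulSoup(broken, parser))
    except FeatureNotFound:
        return None


results = {name: parse_with(name) for name in ("html.parser", "lxml", "html5lib")}
for name, output in results.items():
    print(f"{name:12} -> {output if output is not None else 'not installed'}")
```

Each parser repairs the fragment differently, so code that depends on the exact tree shape should name its parser explicitly rather than rely on whichever one happens to be installed.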

2. Basic usage with the lxml parser

# Direct children: contents, children
# Parent node: parent
# Sibling nodes: next_siblings, next_sibling, previous_siblings, previous_sibling
html = '''
<html>
    <head>
        <title>The Dormouse's story</title>
    </head>
    <body>
        <p class="story">
            Once upon a time there were three little sisters; and their names were
            <a href="http://example.com/elsie" class="sister" id="link1">
                <span>Elsie</span>
            </a>
            <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
            and
            <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
            and they lived at the bottom of a well.
        </p>
        <p class="story">...</p>
'''
from bs4 import BeautifulSoup
soup=BeautifulSoup(html,'html.parser')
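# contents: the direct children, returned as a list
print(soup.p.contents)
# children: an iterator over the same direct children; loop over it to see the nodes
for i in soup.p.children:
    print(i)
# printing children directly shows the iterator object, not the nodes
print(soup.p.children)
# parent node
print(soup.a.parent)
# sibling nodes
for i in soup.a.next_siblings:  # every sibling that follows the first <a> tag
    print(i)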
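One thing the loop above makes visible is that siblings include the text between tags (commas, the word "and", newlines), not only tags. A minimal sketch, assuming a shortened version of the same document, of two ways to keep only the tag siblings:

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

html = (
    '<p class="story">Once upon a time '
    '<a id="link1">Elsie</a>, '
    '<a id="link2">Lacie</a> and '
    '<a id="link3">Tillie</a>.</p>'
)
soup = BeautifulSoup(html, "html.parser")
first_link = soup.a

# next_siblings yields text nodes as well as tags, so filter by type.
tag_siblings = [node for node in first_link.next_siblings if isinstance(node, Tag)]
sibling_ids = [tag["id"] for tag in tag_siblings]

# find_next_siblings() applies the same kind of filter by tag name.
same_ids = [tag["id"] for tag in first_link.find_next_siblings("a")]

print(sibling_ids)
print(same_ids)
```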

html = '''
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
'''
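import lxml  # not needed by bs4 itself; the import only confirms that lxml is installed
from bs4 import BeautifulSoup
# Create the BeautifulSoup object. html.parser is Python's built-in parser; lxml is a third-party parser
soup = BeautifulSoup(html,'lxml')
# prettify() returns the document as an indented, formatted string
print(soup.prettify())
# soup.<tagname> returns that tag. Note: if the document contains several such tags, only the first one is returned
print(soup.a)
print(soup.p)
# tag name
print(soup.title.name)
print(soup.p.name)
# attribute value
print(soup.a['href'])
# text content: string, text
print(soup.a.string)
print(soup.a.text)
print(soup.title.string)
print(soup.title.text)
# nested selection: chain tag names to walk down the tree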
print(soup.p.b.string)
print(soup.head.title.text)
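In the example above `string` and `text` print the same thing, but they are not interchangeable. As a sketch with made-up markup: `string` is only defined when a tag has a single child to take the text from, while `text` (or `get_text()`) joins all the text beneath the tag. Attribute access has a similar split between `tag['name']` and `tag.get('name')`.

```python
from bs4 import BeautifulSoup

html = """
<p id="single"><b>only child</b></p>
<p id="mixed"><b>bold</b> and plain</p>
<a id="padded" href="http://example.com/elsie">
    <span>Elsie</span>
</a>
"""
soup = BeautifulSoup(html, "html.parser")
single = soup.find(id="single")
mixed = soup.find(id="mixed")
padded = soup.find(id="padded")

# One child all the way down: string reaches through <b> to the text.
single_string = single.string

# Two children (a tag and a text node): string gives up and returns None.
mixed_string = mixed.string
mixed_text = mixed.text

# Whitespace around <span> counts as extra children, so string is None here too.
padded_string = padded.string
padded_text = padded.get_text(strip=True)

# tag["title"] would raise KeyError; get() returns None for a missing attribute.
missing_title = padded.get("title")

print(single_string, mixed_string, mixed_text, padded_string, padded_text, missing_title)
```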

3. Standard selectors with the lxml parser: using find_all

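Searching the document tree:

(1) find_all(): searches the document by tag name, attributes, or text content and returns every match

(2) find(): returns only the first match

(3) find_parents() and find_parent(): search the ancestors of the current node

(4) find_next_siblings() and find_next_sibling(): search the siblings that follow the current node

(5) find_previous_siblings() and find_previous_sibling(): search the siblings that precede the current node

(6) find_all_next() and find_next(): search everything that comes after the current node in the document

(7) find_all_previous() and find_previous(): search everything that comes before the current node in the document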
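Only find_all() gets a worked example in this article, so here is a small sketch of the other methods in the list, run against a cut-down version of the panel markup used in this section. The variable names are illustrative only.

```python
from bs4 import BeautifulSoup

html = """
<div class="panel">
<h4>Hello</h4>
<ul class="list" id="list-1">
<li class="element">Foo</li>
<li class="element">Bar</li>
<li class="element">Jay</li>
</ul>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

bar = soup.find("li", string="Bar")   # first <li> whose text is exactly "Bar"
jay = soup.find("li", string="Jay")

parent_id = bar.find_parent("ul")["id"]                        # nearest <ul> ancestor
next_item = bar.find_next_sibling("li").get_text()             # following sibling
previous_item = bar.find_previous_sibling("li").get_text()     # preceding sibling
next_in_document = bar.find_next("li").get_text()              # next <li> in document order
heading_before = bar.find_previous("h4").get_text()            # closest earlier <h4>
earlier_items = [li.get_text() for li in jay.find_all_previous("li")]  # nearest first

print(parent_id, next_item, previous_item, next_in_document, heading_before, earlier_items)
```

The singular forms return one element or None, so check the result before indexing into it when a match is not guaranteed.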

html='''
<div class="panel">
    <div class="panel-heading">
        <h4>Hello</h4>
    </div>
    <div class="panel-body">
        <ul class="list" id="list-1">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
            <li class="element">Jay</li>
        </ul>
        <ul class="list list-small" id="list-2">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
        </ul>
    </div>
</div>
'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print(soup.find_all('ul'))
print(type(soup.find_all('ul')[0]))
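Because every item that find_all() returns is itself a Tag, the search can be repeated on each result. A minimal sketch, assuming a reduced version of the markup above, that groups the list items under the id of the list they belong to:

```python
from bs4 import BeautifulSoup

html = """
<ul class="list" id="list-1">
    <li class="element">Foo</li>
    <li class="element">Bar</li>
</ul>
<ul class="list list-small" id="list-2">
    <li class="element">Jay</li>
</ul>
"""
soup = BeautifulSoup(html, "html.parser")

items_by_list = {}
for ul in soup.find_all("ul"):
    # Searching from the <ul> tag restricts the results to that list's own items.
    items_by_list[ul["id"]] = [li.get_text() for li in ul.find_all("li")]

print(items_by_list)
```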

html='''
<div class="panel">
    <div class="panel-heading">
        <h4>Hello</h4>
    </div>
    <div class="panel-body">
        <ul class="list" id="list-1">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
            <li class="element">Jay</li>
        </ul>
        <ul class="panel-body" id="list-2">
            <li class="element">年後中好說歹說開發，什麼才能傷風膠囊</li>
            <li class="element">Bar</li>
        </ul>
         <a href="link1.html">first item</a>
         <a href="link2.html">second item</a>
         <li class="item-inactive"><a href="link3.html"><span class="bold">third item</span></a></li>
    </div>
</div>
'''
from bs4 import BeautifulSoup
soup=BeautifulSoup(html,"lxml")
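import re
# 1. The name argument
# Finds every tag whose name matches. The value can be any kind of filter: a string, a regular expression, a list, a function, or True
print(soup.find_all('li'))
# a list matches any of the listed names
print(soup.find_all(['li','a']))
# True matches every tag
print(soup.find_all(True))
# a regular expression is matched against the tag name
print(soup.find_all(re.compile('h4')))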
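# 2. Keyword arguments
# A keyword names a tag attribute such as id, title, or href. Note: class is a reserved word in Python, so write class_ with a trailing underscore
print(soup.find_all('a',href="link1.html"))
print(soup.find_all(id="list-1"))
print(soup.find_all('ul',class_="list"))
# a regular expression as the attribute value (the dot is escaped so it matches a literal ".")
print(soup.find_all(href=re.compile(r'3\.html')))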
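# 3. text
# Usually combined with name. The text argument searches the strings in the document. Like name, it accepts a string, a regular expression, a list, or True
# text is the older spelling of the string argument; recent Beautiful Soup releases prefer string
print(soup.find_all(text=re.compile('好')))
print(soup.find_all('li',text=re.compile('好')))
# combining several attributes
print(soup.find_all(class_="panel-body",id="list-2"))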
# 4. string
# Usually combined with name. The string argument searches the strings in the document. Like name, it accepts a string, a regular expression, a list, or True
print(soup.find_all("a", string="first item"))
print(soup.find_all(string="first item"))
# 5. limit
# Caps the number of results returned, much like LIMIT in SQL
print(soup.find_all("a",limit=2))
# 6. recursive
# find_all() on a tag searches all of its descendants by default. To search only the tag's direct children, pass recursive=False
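The recursive argument and the function form of the name filter are described above without code. The sketch below shows both on a small made-up document; the predicate name `has_class_but_no_id` is only an illustration.

```python
from bs4 import BeautifulSoup

html = (
    '<div id="outer" class="box">'
    '<p class="note">direct</p>'
    '<div id="inner"><p>nested</p></div>'
    '</div>'
)
soup = BeautifulSoup(html, "html.parser")
outer = soup.find(id="outer")

# Default: every <p> anywhere below the outer div.
all_paragraphs = [p.get_text() for p in outer.find_all("p")]

# recursive=False: only <p> tags that are direct children of the outer div.
direct_paragraphs = [p.get_text() for p in outer.find_all("p", recursive=False)]


def has_class_but_no_id(tag):
    """Filter function: called once per tag, returns True to keep it."""
    return tag.has_attr("class") and not tag.has_attr("id")


filtered_names = [tag.name for tag in soup.find_all(has_class_but_no_id)]

print(all_paragraphs, direct_paragraphs, filtered_names)
```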

4. CSS selectors

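# select() takes a CSS selector string and returns every matching tag
# . selects by class, # selects by id
# tag1,tag2 matches either tag
# tag1 tag2 matches tag2 nested anywhere inside tag1
# [attr] matches every tag that has the attribute
# [attr=value] matches tags whose attribute has that value, e.g. [target=blank]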
html='''
<div class="panel">
    <div class="panel-heading">
        <h4>Hello</h4>
    </div>
    <div class="panel-body">
        <ul class="list" id="list-1">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
            <li class="element">Jay</li>
        </ul>
        <ul class="list list-small" id="list-2">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
        </ul>
    </div>
</div>
'''
from bs4 import BeautifulSoup
soup=BeautifulSoup(html,"lxml")
# tags with class panel-heading inside a tag with class panel
print(soup.select(".panel .panel-heading"))
# every tag whose id is list-1 or list-2
print(soup.select('#list-1,#list-2'))
# li tags inside ul tags
print(soup.select('ul li'))
# every tag with class element inside the tag whose id is list-2
print(soup.select('#list-2 .element'))
# get_text(): the text of a tag
# the text of every li tag
for i in soup.select('li'):
    print(i.get_text())
# attribute values
for i in soup.select('ul'):
    print(i["id"])
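The attribute selectors listed at the top of this section are not exercised by the example. A short sketch, assuming markup with links added for the purpose, that uses `[attr]`, `[attr=value]`, and `select_one()`, which returns the first match instead of a list:

```python
from bs4 import BeautifulSoup

html = """
<ul class="list" id="list-1">
  <li class="element"><a href="foo.html" target="_blank">Foo</a></li>
  <li class="element"><a href="bar.html">Bar</a></li>
  <li class="element">Jay</li>
</ul>
"""
soup = BeautifulSoup(html, "html.parser")

# [attr]: every <a> that has an href attribute at all.
linked_texts = [a.get_text() for a in soup.select("a[href]")]

# [attr=value]: only the links that open in a new tab.
new_tab_links = [a["href"] for a in soup.select('a[target="_blank"]')]

# select_one() returns a single Tag (or None); > restricts to direct children.
first_item = soup.select_one("#list-1 > li.element").get_text()

print(linked_texts, new_tab_links, first_item)
```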

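Summary: lxml is the recommended parser; fall back to html.parser when necessary. Selecting by tag name (soup.tag) is fast but offers little filtering.
          Use find() and find_all() to match a single result or multiple results.
          If you are comfortable with CSS selectors, use select().
          Remember the common ways to get attribute values and text.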
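The recommendations above can be combined into one small helper. This is a sketch, not a fixed recipe: `make_soup` and `extract_links` are hypothetical names, and the fallback assumes that a missing lxml install is the only reason the first attempt can fail.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(markup):
    """Prefer lxml; fall back to the built-in html.parser if lxml is not installed."""
    try:
        return BeautifulSoup(markup, "lxml")
    except FeatureNotFound:
        return BeautifulSoup(markup, "html.parser")


def extract_links(markup):
    """Return (text, href) pairs for every <a> tag; href is None when absent."""
    soup = make_soup(markup)
    return [(a.get_text(strip=True), a.get("href")) for a in soup.find_all("a")]


links = extract_links('<p><a href="a.html"> A </a><a>no href</a></p>')
print(links)
```

Using `get("href")` instead of `a["href"]` keeps the helper from raising KeyError on anchors that have no href.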
