Getting Started with Python Web Scraping: Web Scraping with Python 1.0, Installing and Testing BeautifulSoup

阿新 • Published: 2019-02-14

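*About this article: the study material is Ryan Mitchell's book <Web Scraping with Python: Collecting Data from the Modern Web>. I am working through it step by step, sharing the problems I ran into along with a summary.

*Environment: I have Python 3.5 and Python 2.7 installed side by side.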

1.0.1

Installing BeautifulSoup

Windows

Run the following command in cmd:

pip3 install beautifulsoup4

Linux

sudo apt-get install python-bs4

On Debian and Ubuntu this package targets Python 2; the Python 3 build is packaged as python3-bs4.

On a Mac

First install pip, Python's package manager, with

sudo easy_install pip

then run the following to install the library:

pip install beautifulsoup4

Also, if both Python 2.x and Python 3.x are installed on your machine, the package may end up installed for Python 2.x rather than Python 3.x. In that case, use pip3 to install the Python 3.x version of the package:

pip3 install beautifulsoup4

You can check in a Python terminal whether the installation succeeded.

In cmd, enter

python3
from bs4 import BeautifulSoup

If no error is reported, the installation succeeded.
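With two Python versions installed side by side, it can also help to confirm which interpreter is answering and whether that interpreter can see the package. The following is a minimal sketch using only the standard library; the helper name `is_installed` is my own and not part of any library.

```python
import importlib.util
import sys


def is_installed(module_name):
    """Return True if the running interpreter can locate the module."""
    return importlib.util.find_spec(module_name) is not None


# Show which interpreter is running, then whether it can see bs4.
print(sys.executable)
print("Python %d.%d" % sys.version_info[:2])
print("bs4 installed:", is_installed("bs4"))
```

If the last line reports False under python3 but the package imports fine under python, the install most likely went to the other interpreter.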

Now for a simple first try:

from urllib.request import urlopen
from bs4 import BeautifulSoup
html = urlopen("http://www.pythonscraping.com/pages/page1.html")
bsObj = BeautifulSoup(html.read())
print(bsObj.h1)

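I got a warning here. It did not affect the program, and it did not appear on the second run. It is most likely BeautifulSoup's notice that no parser was specified, which passing a parser name as the second argument, for example BeautifulSoup(html.read(), "html.parser"), avoids. You can then try accessing other tags: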

print(bsObj.h1)
print(bsObj.html.body.h1)
print(bsObj.html.body.div)
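To see why the short and the fully spelled-out paths give the same result, here is a small sketch that parses an inline HTML string instead of downloading the page. The `SAMPLE_HTML` content is a made-up stand-in for the real page, and the parser name "html.parser" is my choice.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the downloaded page, so no network is needed.
SAMPLE_HTML = """
<html>
<head><title>A Useful Page</title></head>
<body>
<h1>An Interesting Title</h1>
<div>Some text in a div.</div>
</body>
</html>
"""

bsObj = BeautifulSoup(SAMPLE_HTML, "html.parser")

# Attribute access returns the first matching tag anywhere below the object,
# so the short path and the full path reach the same <h1>.
short_path = bsObj.h1
full_path = bsObj.html.body.h1
print(short_path)
print(full_path)
print(short_path == full_path)
print(bsObj.html.body.div.get_text())
```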

In practice, given the network and many other factors, writing the code this way is not good enough. Consider this line:

html = urlopen("http://www.pythonscraping.com/pages/page1.html")

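Two kinds of exception are most likely on this line:

1. The page does not exist on the server (or an error occurred while retrieving it).

2. The server does not exist (that is, http://www.pythonscraping.com/ cannot be opened, or the URL was mistyped).

In the first case, urlopen raises an HTTP error, such as "404 Page Not Found" or "500 Internal Server Error". We can handle this exception as follows: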

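from urllib.error import HTTPError

try:
    html = urlopen("http://www.pythonscraping.com/pages/page1.html")
except HTTPError as e:
    print(e)
    # Return None, stop the program, or fall back to another plan
else:
    # The program continues. Note: if you already return or break in the
    # except block above, the else clause is unnecessary
    pass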

If the server returns an HTTP error code, the program prints the error and skips the code under the else clause.
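The HTTPError object also carries the numeric status code, which is handy when different failures need different handling. Below is a rough sketch of that idea. The function `fetch`, its `opener` parameter, the fake opener `always_404` and the missing.html URL are all illustrative names of my own; the fake opener only exists so the example runs without touching the network.

```python
from urllib.error import HTTPError
from urllib.request import urlopen


def fetch(url, opener=urlopen):
    """Return the response, or None if the server answered with an HTTP error.

    `opener` is an illustrative parameter (not part of urllib) that lets the
    function be exercised without a real network request.
    """
    try:
        return opener(url)
    except HTTPError as e:
        # e.code holds the numeric status, e.reason the short description.
        print("Request failed with status %d: %s" % (e.code, e.reason))
        return None


def always_404(url):
    """Fake opener that behaves like a server that lacks the page."""
    raise HTTPError(url, 404, "Not Found", None, None)


result = fetch("http://www.pythonscraping.com/pages/missing.html", opener=always_404)
print(result)
```

Called with the default opener, `fetch` performs a real request through urlopen.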

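In the second case, urlopen does not return a None object. It raises a URLError (from urllib.error), which is the parent class of HTTPError. We can catch it the same way:

from urllib.error import URLError

try:
    html = urlopen("http://www.pythonscraping.com/pages/page1.html")
except URLError as e:
    print("URL is not found")
else:
    # The program continues
    pass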
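When a server cannot be reached at all, urlopen raises urllib.error.URLError, and HTTPError is a subclass of it, so the order of the except clauses matters if both cases should be told apart. The sketch below illustrates this under my own assumptions: `open_or_explain`, its `opener` parameter and the fake opener `unreachable` are hypothetical names, and example.invalid is a placeholder host.

```python
from urllib.error import HTTPError, URLError
from urllib.request import urlopen


def open_or_explain(url, opener=urlopen):
    """Return (response, None) on success or (None, message) on failure."""
    try:
        return opener(url), None
    except HTTPError as e:
        # Must come first: HTTPError is a subclass of URLError.
        return None, "server replied with HTTP %d" % e.code
    except URLError as e:
        # Raised when no server could be reached (bad host name, no network).
        return None, "could not reach server: %s" % e.reason


def unreachable(url):
    """Fake opener that behaves like a host that cannot be resolved."""
    raise URLError("name or service not known")


response, problem = open_or_explain("http://example.invalid/", opener=unreachable)
print(response, problem)
```

Swapping the two except clauses would send every HTTPError into the URLError branch, which is why the more specific class is listed first.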

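Accessing tags can go wrong as well. If the tag you ask for does not exist, BeautifulSoup returns a None object. If you then access a child tag on that None object, an AttributeError is raised. For example (bucunzai is pinyin for "does not exist"):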

print(bsObj.bucunzai)

Going one step further and accessing a child tag of it

print(bsObj.bucunzai.ex)

raises an AttributeError.

So whenever you access a tag on a BeautifulSoup object, add a check to make sure the tag really exists.

try:
    badContent = bsObj.nonExistingTag.anotherTag
except AttributeError as e:
    print("Tag was not found")
else:
    if badContent is None:
        print("Tag was not found")
    else:
        print(badContent)
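An alternative to catching AttributeError is to walk the chain of tags one step at a time and stop as soon as a step comes back as None. This is only a sketch of that idea; `find_nested` is a hypothetical helper of my own, and the inline HTML is made up for the demonstration.

```python
from bs4 import BeautifulSoup


def find_nested(root, *tag_names):
    """Follow tag_names one step at a time; return None if any is missing."""
    current = root
    for name in tag_names:
        if current is None:
            return None
        # find() returns the first matching descendant, or None.
        current = current.find(name)
    return current


soup = BeautifulSoup("<html><body><h1>Title</h1></body></html>", "html.parser")
print(find_nested(soup, "body", "h1"))
print(find_nested(soup, "nonExistingTag", "anotherTag"))
```

The caller then needs only a single None check, whichever step of the chain was missing.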

With that in mind, here is a more careful implementation:

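from urllib.request import urlopen
from urllib.error import HTTPError, URLError
from bs4 import BeautifulSoup


def getTitle(url):
    try:
        html = urlopen(url)
    except (HTTPError, URLError) as e:
        # The page is missing or the server could not be reached
        return None
    try:
        bsObj = BeautifulSoup(html.read(), "html.parser")
        title = bsObj.body.h1
    except AttributeError as e:
        # The page has no <body>, so bsObj.body is None
        return None
    return title


title = getTitle("http://www.pythonscraping.com/pages/page1.html")
if title is None:
    print("Title could not be found")
else:
    print(title)

If anything goes wrong while fetching the page, getTitle returns a None object. Inside the function we check for HTTPError as before, and also for URLError, which urlopen raises when the server cannot be reached (it does not return None in that case). The two BeautifulSoup lines are wrapped in a second try statement because an AttributeError can be raised there: if the page has no body tag, bsObj.body is None and bsObj.body.h1 fails.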
