Parsing Data with bs4 in a Python Web Scraper

阿新 • Published: 2020-08-28

Focused crawler:

A focused crawler scrapes specific content from a page rather than the page as a whole.

Coding workflow:

1. Specify the URL
2. Send the request
3. Get the response data
4. Parse the data
5. Persist the results
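The five steps above can be sketched as one small function per stage. This is only an illustration: the function names (fetch_html, parse_title, save_text, crawl) are hypothetical, the parsing step just grabs the page title, and the built-in 'html.parser' is used so the sketch runs without lxml.

```python
from bs4 import BeautifulSoup


def fetch_html(url, headers=None):
    """Steps 2-3: send the request and return the response body as text."""
    import requests  # imported here so the other steps can be tried offline

    response = requests.get(url, headers=headers, timeout=10)
    response.raise_for_status()
    return response.text


def parse_title(page_text):
    """Step 4: parse the data (here, just the text of the <title> tag)."""
    soup = BeautifulSoup(page_text, 'html.parser')
    return soup.title.get_text(strip=True) if soup.title else ''


def save_text(text, path):
    """Step 5: persist the parsed data to a local file."""
    with open(path, 'w', encoding='utf-8') as fp:
        fp.write(text + '\n')


def crawl(url, path, fetch=fetch_html):
    """Step 1 is the url argument; the remaining steps run in order."""
    page_text = fetch(url)
    title = parse_title(page_text)
    save_text(title, path)
    return title
```

Passing a different fetch callable makes it possible to exercise the parsing and storage steps against a fixed HTML string, without touching the network.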

Data-parsing approaches:

1. bs4
2. Regular expressions
3. xpath (***)

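Overview of how data parsing works:

The text to be extracted is stored either between tags or in tag attributes, so parsing takes two steps:

1. Locate the target tag

2. Extract (parse) the data stored in the tag or in its attributes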
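The two steps apply to any of the parsing approaches listed earlier. As a rough illustration using only the standard library, the sketch below uses a regular expression to locate an <a> tag and pull out both the attribute value and the text between the tags. The HTML snippet and the pattern are made up for this example, and a regex this simple only suits markup whose shape is known in advance.

```python
import re

# Hypothetical page fragment: the data sits in an attribute and between tags.
html = '<div class="song"><a href="/book/1.html" title="ch1">Chapter One</a></div>'

# Step 1: locate the tag. Step 2: capture the href attribute value and the
# text stored between the opening and closing tag.
pattern = re.compile(r'<a\s+href="([^"]*)"[^>]*>(.*?)</a>', re.S)
match = pattern.search(html)
href, text = match.group(1), match.group(2)
```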

How data parsing works with bs4, in short:

1. Locate the tag

2. Extract the data stored in the tag or in its attributes

How bs4 parses data:

1. Instantiate a BeautifulSoup object and load the page source into it

2. Call the object's attributes and methods to locate tags and extract data
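A minimal sketch of these two steps on an inline HTML string. The markup is invented for illustration, and the built-in 'html.parser' is assumed here so the snippet runs without lxml; the rest of this article passes 'lxml' instead.

```python
from bs4 import BeautifulSoup

# Hypothetical page source
page_text = (
    '<html><body>'
    '<p id="intro">Hello, <b>bs4</b>!</p>'
    '<img src="/logo.png" alt="logo">'
    '</body></html>'
)

# Step 1: instantiate the object and load the page source into it.
soup = BeautifulSoup(page_text, 'html.parser')

# Step 2: locate tags, then extract their text or attribute values.
intro_text = soup.find('p', id='intro').get_text()
logo_src = soup.find('img')['src']
```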

Environment setup:

pip install bs4 -i https://pypi.tuna.tsinghua.edu.cn/simple
pip install lxml -i https://pypi.tuna.tsinghua.edu.cn/simple

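Instantiating a BeautifulSoup object:

from bs4 import BeautifulSoup

There are two ways to instantiate the object:

1. Load the data from a local HTML document into the object

fp = open('./test.html','r',encoding='utf-8')

soup = BeautifulSoup(fp,'lxml')

2. Load page source fetched from the internet into the object (the common approach, recommended)

page_text = response.text
soup = BeautifulSoup(page_text,'lxml')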
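Both ways of loading can be tried side by side without a network connection. The sketch below writes a throwaway HTML file to a temporary directory, loads it through a file handle, and then loads the same markup from a string standing in for response.text. The file name and markup are placeholders, and 'html.parser' is assumed so that lxml is not required.

```python
import os
import tempfile

from bs4 import BeautifulSoup

html = '<html><body><h1>local file</h1></body></html>'

# Way 1: load from a local file. BeautifulSoup reads the handle when the
# object is constructed, so the with-block can close it right afterwards.
path = os.path.join(tempfile.mkdtemp(), 'test.html')
with open(path, 'w', encoding='utf-8') as fp:
    fp.write(html)
with open(path, 'r', encoding='utf-8') as fp:
    soup_from_file = BeautifulSoup(fp, 'html.parser')

# Way 2: load from a string, such as the .text of a requests response.
page_text = html  # stand-in for response.text
soup_from_text = BeautifulSoup(page_text, 'html.parser')
```

Either way the result is the same kind of object, so everything that follows works identically on both.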

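Methods and attributes for data parsing:

soup.tagName: returns the first tag named tagName in the document
soup.find():
find('tagName'): equivalent to soup.tagName (for example, find('div') is the same as soup.div)

1. Locating by attribute:

soup.find('div',class_/id/attr='song'): pass class_, id, or another attribute name as a keyword argument
soup.find_all('tagName'): returns all matching tags (as a list)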
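A short sketch of these lookups on a made-up fragment. The markup is an assumption for illustration and 'html.parser' is used so the snippet has no lxml dependency. Note that find() returns None when nothing matches, which is worth checking before using the result.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment with two <div> tags
html = '''
<div id="top">first div</div>
<div class="song">
  <p>one</p>
  <p>two</p>
</div>
'''
soup = BeautifulSoup(html, 'html.parser')

first_div = soup.div                         # first <div> in the document
same_div = soup.find('div')                  # equivalent to soup.div
song_div = soup.find('div', class_='song')   # locate by class (note the underscore)
top_div = soup.find('div', id='top')         # locate by id
by_attrs = soup.find('div', attrs={'class': 'song'})  # generic attrs form
paragraphs = soup.find_all('p')              # every <p>, as a list
no_match = soup.find('div', class_='missing')  # None when nothing matches
```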

select:
select('a CSS selector (id, class, tag, ...)'): returns a list.

2. Hierarchical selectors:

soup.select('.tang > ul > li > a'): > steps down exactly one level
soup.select('.tang > ul a'): a space spans any number of levels
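The difference between > and a space shows up as soon as one link is nested a level deeper than the others. In the sketch below the markup is invented, reusing the .tang class name from the selectors above, and 'html.parser' is assumed in place of lxml.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment: the second link is wrapped in an extra <span>
html = '''
<div class="tang">
  <ul>
    <li><a href="/a">direct</a></li>
    <li><span><a href="/b">nested</a></span></li>
  </ul>
</div>
'''
soup = BeautifulSoup(html, 'html.parser')

# '>' matches direct children only, one level per step.
direct = soup.select('.tang > ul > li > a')
# A space matches descendants at any depth below the <ul>.
any_depth = soup.select('.tang > ul a')

direct_hrefs = [a['href'] for a in direct]
any_depth_hrefs = [a['href'] for a in any_depth]
```

The first selector misses the link wrapped in <span>, because that link is not a direct child of its <li>.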

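3. Getting the text between tags:

soup.a.text/string/get_text()
text/get_text(): returns all the text inside a tag, including text in nested tags
string: returns the text only when the tag holds a single string (directly, or through a single nested child); otherwise it is None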
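A small sketch of how the three accessors differ, using an invented fragment and assuming 'html.parser' rather than lxml.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment: the <div> mixes its own text with a nested <p>
html = '<div class="song">outer <p>inner <b>bold</b></p></div><a href="/x">link</a>'
soup = BeautifulSoup(html, 'html.parser')

div = soup.find('div', class_='song')
all_text = div.text          # text of the div and of everything nested in it
same_text = div.get_text()   # same result as .text
direct_only = div.string     # None, because the div has more than one child
link_text = soup.a.string    # the <a> holds a single string, so it is returned
```

When a tag may contain nested markup, text or get_text() is the safer choice; string is convenient for leaf tags such as a simple link.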

4. Getting an attribute value from a tag:

soup.a['href']
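Subscript access raises KeyError when the attribute is absent, so get() can be handy when an attribute is optional. The sketch below uses a made-up <a> tag and assumes 'html.parser'.

```python
from bs4 import BeautifulSoup

# Hypothetical tag with a multi-valued class attribute
soup = BeautifulSoup('<a href="/book/1.html" class="nav main">ch1</a>', 'html.parser')
a = soup.a

href = a['href']           # subscript access; raises KeyError if missing
missing = a.get('title')   # get() returns None for a missing attribute
classes = a['class']       # class is multi-valued, so a list comes back
all_attrs = a.attrs        # every attribute as a dict
```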

Case study: scrape every chapter title and chapter text of the novel Romance of the Three Kingdoms. The code is as follows:

import requests
from bs4 import BeautifulSoup

if __name__ == "__main__":
  # Fetch the index page
  headers = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_0) AppleWebKit/537.36 (KHTML,like Gecko) Chrome/72.0.3626.121 Safari/537.36'
  }
  url = 'http://www.shicimingju.com/book/sanguoyanyi.html'
  page_text = requests.get(url=url,headers=headers).text

  # Parse the chapter titles and detail-page URLs out of the index page
  # Instantiate a BeautifulSoup object and load the page source into it
  soup = BeautifulSoup(page_text,'lxml')
  # Each <li> under .book-mulu holds one chapter link
  li_list = soup.select('.book-mulu > ul > li')
  # The with-block closes the output file even if a request fails midway
  with open('./sanguo.txt','w',encoding='utf-8') as fp:
    for li in li_list:
      title = li.a.string
      detail_url = 'http://www.shicimingju.com'+li.a['href']
      # Request the detail page and parse out the chapter text
      detail_page_text = requests.get(url=detail_url,headers=headers).text
      detail_soup = BeautifulSoup(detail_page_text,'lxml')
      div_tag = detail_soup.find('div',class_='chapter_content')
      # Skip the chapter if the expected container is missing
      if div_tag is None:
        continue
      # The chapter text is everything inside the container
      content = div_tag.text
      fp.write(title+':'+content+'\n')
      print(title,'scraped successfully!')

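The case study builds each detail-page URL by concatenating the site root with the href value, which works as long as every href starts with a slash. A slightly more forgiving option, sketched here as an alternative rather than as what the original code does, is urljoin from the standard library. The helper name build_detail_url and the sample href values in the usage note are hypothetical.

```python
from urllib.parse import urljoin

# The index page from the case study serves as the base for relative links.
base_url = 'http://www.shicimingju.com/book/sanguoyanyi.html'


def build_detail_url(href, base=base_url):
    """Resolve an href against the index page URL.

    Handles root-relative, page-relative, and already-absolute links.
    """
    return urljoin(base, href)
```

For instance, build_detail_url('/book/sanguoyanyi/1.html') and build_detail_url('sanguoyanyi/1.html') both resolve to http://www.shicimingju.com/book/sanguoyanyi/1.html, while an href that is already absolute is returned unchanged.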