
Python Web Scraping: Parsing with bs4

Overview of bs4 parsing

bs4 (the BeautifulSoup library) is a data-parsing approach specific to Python.

How bs4 parses data:

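  1. Instantiate a BeautifulSoup object and load the page source into it
  • Load a local HTML file
    # load from a local file
    fp1 = open("../data2/test.html", 'r', encoding="utf-8")
    soup1 = BeautifulSoup(fp1, 'lxml')
  • Load HTML fetched from the internet
    # response is the object returned by requests.get(...)
    fp2 = response.text
    soup2 = BeautifulSoup(fp2, 'lxml')
  2. Use the attributes and methods of the BeautifulSoup object to locate tags and extract data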
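
The two steps above can be sketched on a small inline page. The HTML string and its chapter names are made up for illustration, and the sketch uses Python's built-in "html.parser" instead of 'lxml' so that it runs without the extra dependency; with lxml installed, passing 'lxml' should behave the same way for this input.

```python
from bs4 import BeautifulSoup

# Hypothetical sample page, invented for this illustration
html = """
<html><body>
  <div class="book-mulu">
    <ul>
      <li><a href="/book/ch1.html">Chapter 1</a></li>
      <li><a href="/book/ch2.html">Chapter 2</a></li>
    </ul>
  </div>
</body></html>
"""

# Step 1: instantiate a BeautifulSoup object from the page source
soup = BeautifulSoup(html, "html.parser")

# Step 2: locate tags and extract data
first_link = soup.find("a")                       # first <a> tag in the document
all_links = soup.find_all("a")                    # every <a> tag, as a list
by_css = soup.select(".book-mulu > ul > li > a")  # CSS selector, returns a list

titles = [str(a.string) for a in by_css]  # text inside each tag
hrefs = [a["href"] for a in by_css]       # attribute access by key

print(first_link.string)
print(titles)
print(hrefs)
```

find returns a single tag (or None when nothing matches), while find_all and select return lists, so code that uses find should check for None before reading attributes.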

Setting up the environment

pip install bs4
# lxml is the XML/HTML parser passed to BeautifulSoup in this article
pip install lxml
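
If it is unclear whether lxml was installed successfully, a minimal check like the one below can pick a parser name at runtime. The helper name pick_parser is hypothetical, and falling back to the standard-library "html.parser" is an assumption of this sketch, not something the original setup requires.

```python
import importlib.util


def pick_parser() -> str:
    """Return 'lxml' if the lxml package is importable, else 'html.parser'."""
    if importlib.util.find_spec("lxml") is not None:
        return "lxml"
    return "html.parser"


parser = pick_parser()
print(f"BeautifulSoup parser to use: {parser}")
```

The returned string can be passed as the second argument to BeautifulSoup in place of the hard-coded 'lxml'.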

Scraping every chapter title and its content from Dream of the Red Chamber

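"""
Example: scrape every chapter title and its content from Dream of the Red Chamber
url = "https://www.shicimingju.com/book/hongloumeng.html"
    - Each chapter title is an a tag
    - The chapter content is on the page that the a tag's href points to
    - The a tags are nested as div class="book-mulu" -> ul -> li -> a
"""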

import requests
from bs4 import BeautifulSoup

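if __name__ == '__main__':
    # Spoof the User-Agent so the request looks like it comes from a browser
    headers = {
        "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) "
                      "Version/14.1 Safari/605.1.15 "
    }
    # Fetch the table-of-contents page
    url = "https://www.shicimingju.com/book/hongloumeng.html"
    page = requests.get(url=url, headers=headers)
    page.encoding = 'utf-8'
    # Parse out each chapter title and the URL of its detail page
    # 1. Load the page into a BeautifulSoup object
    soup = BeautifulSoup(page.text, 'lxml')
    # 2. Select the li tags that hold the chapter links
    li_list = soup.select('.book-mulu > ul > li')
    # Output file; the with block closes it even if a request fails
    with open("../data2/honglou.text", 'w', encoding='utf-8') as fp:
        for li in li_list:
            title = li.a.string
            detail_url = "https://www.shicimingju.com" + li.a['href']
            # Request the detail page and parse out the chapter content
            detail_page = requests.get(url=detail_url, headers=headers)
            detail_page.encoding = 'utf-8'
            detail_soup = BeautifulSoup(detail_page.text, 'lxml')
            div_tag = detail_soup.find("div", class_="card bookmark-list")
            if div_tag is None:
                # Skip chapters whose page lacks the expected container
                print(title + " skipped: content container not found")
                continue
            content = div_tag.text
            # Persist the chapter to the file
            fp.write(title + ':\n' + content + '\n')
            print(title + " scraped successfully!")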
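
The link-extraction step of the script can also be factored into a function and tried offline. The sketch below is one possible variant, not the original author's code: extract_chapter_links and the sample HTML are hypothetical, it uses "html.parser" so it runs without lxml, and it swaps the string concatenation for urllib.parse.urljoin, which resolves both relative and absolute href values against the page URL.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup


def extract_chapter_links(html, base_url):
    """Return (title, absolute_url) pairs for each chapter link in the page."""
    soup = BeautifulSoup(html, "html.parser")
    links = []
    for a in soup.select(".book-mulu > ul > li > a"):
        href = a.get("href")  # None when the attribute is missing
        if not href:
            continue          # skip anchors that carry no link
        links.append((a.get_text(strip=True), urljoin(base_url, href)))
    return links


# Hypothetical table-of-contents fragment, invented for this illustration
sample_html = """
<div class="book-mulu">
  <ul>
    <li><a href="/book/hongloumeng/1.html">Chapter 1</a></li>
    <li><a href="/book/hongloumeng/2.html">Chapter 2</a></li>
    <li><a>No link here</a></li>
  </ul>
</div>
"""

links = extract_chapter_links(
    sample_html, "https://www.shicimingju.com/book/hongloumeng.html"
)
for title, link in links:
    print(title, link)
```

Keeping the parsing in a function that takes an HTML string makes it possible to check the selector against saved pages before sending any requests to the site.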