Python web scraping: parsing with bs4
阿新 • Published: 2022-03-08
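Overview of bs4 parsing
bs4 (BeautifulSoup) parsing is a data-parsing approach that is specific to Python.
How bs4 parses data:
- Instantiate a BeautifulSoup object and load the page source into it
  - Load a local HTML file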
# Load from a local file
from bs4 import BeautifulSoup

with open("../data2/test.html", 'r', encoding="utf-8") as fp1:
    soup1 = BeautifulSoup(fp1, 'lxml')
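  - Load HTML fetched from the internet
# Load from an HTTP response (response is a requests.Response obtained earlier)
fp2 = response.text
soup2 = BeautifulSoup(fp2, 'lxml')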
- Use the attributes and methods of the BeautifulSoup object to locate tags and extract data
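As a minimal sketch of the second step, the snippet below parses a small inline HTML string and shows a few common ways to locate tags and pull data out of them: attribute-style access, find, find_all, select, and reading text or attribute values. The HTML is made up for illustration, and the built-in html.parser is used so that the snippet runs without lxml; passing 'lxml' instead should give the same results for this input.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML used only for this illustration
html = """
<html><body>
<div class="book-mulu">
  <ul>
    <li><a href="/book/1.html">Chapter 1</a></li>
    <li><a href="/book/2.html">Chapter 2</a></li>
  </ul>
</div>
<p id="intro">A short <b>intro</b> paragraph.</p>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")

# soup.<tag name> returns the first tag with that name
first_link = soup.a

# find() returns the first match; class_ filters on the class attribute
menu_div = soup.find("div", class_="book-mulu")

# find_all() returns every match as a list
all_links = soup.find_all("a")

# select() takes a CSS selector; ">" means direct child
items = soup.select(".book-mulu > ul > li")

# Data extraction: text content and attribute values
link_text = first_link.string   # text directly inside the tag
link_href = first_link["href"]  # value of the href attribute
intro_text = soup.find("p", id="intro").get_text()  # text of the tag and its descendants

print(link_text, link_href)
print(len(all_links), len(items))
print(intro_text)
```

Note that .string only works when a tag has a single text child, while get_text() (or .text) concatenates the text of all descendants.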
Environment setup
pip install bs4
# lxml is the parser used in this post; it handles both HTML and XML
pip install lxml
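To confirm that the installation worked, a quick check along these lines can be run. It is only a sketch: it falls back to Python's built-in html.parser when lxml cannot be found, and the variable names are illustrative.

```python
import importlib.util

from bs4 import BeautifulSoup

# Use lxml when it is installed, otherwise fall back to the standard-library parser
parser = "lxml" if importlib.util.find_spec("lxml") else "html.parser"

soup = BeautifulSoup("<p>hello</p>", parser)
print(parser, soup.p.string)
```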
Scraping every chapter title and chapter text of Dream of the Red Chamber (紅樓夢)
""" 案例:爬取紅樓夢全部標題和內容 url = "https://www.shicimingju.com/book/hongloumeng.html" - 每一個章節標題都是一個a標籤 - 章節的內容在href中 - a標籤的層級是 div class="book-mulu" -> ul -> li -> a """ import requests from bs4 import BeautifulSoup if __name__ == '__main__': # UA偽裝 headers = { "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) " "Version/14.1 Safari/605.1.15 " } # 檔案儲存位置 fp = open("../data2/honglou.text", 'w', encoding='utf-8') # 對頁面進行捕獲 url = "https://www.shicimingju.com/book/hongloumeng.html" page = requests.get(url=url, headers=headers) page.encoding='utf-8' # 解析出章節標題和內容的url # 1,載入頁面到靚湯物件中 soup = BeautifulSoup(page.text, 'lxml') # 2,解析章節標題和詳情頁的url li_list = soup.select('.book-mulu > ul > li') for li in li_list: title = li.a.string detail_url = "https://www.shicimingju.com" + li.a['href'] # 對詳情頁發起請求,解析出章節內容 detail_page = requests.get(url=detail_url, headers=headers) detail_page.encoding='utf-8' detail_soup = BeautifulSoup(detail_page.text, 'lxml') div_tag = detail_soup.find("div", class_="card bookmark-list") content = div_tag.text # 持久化儲存 fp.write(title + ':\n' + content + '\n') print(title + "爬取成功!") fp.close()