Web scraping with the `beautifulsoup4` module

阿新 • Published: 2020-08-03

The `beautifulsoup4` module

Introduction to beautifulsoup4

Beautiful Soup is a Python library for extracting data from HTML and XML files.

Installing beautifulsoup4

# install
pip install beautifulsoup4

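Parsers

Beautiful Soup supports the HTML parser in Python's standard library by default, and it also supports several third-party parsers.

No.	Parser	Usage	Advantages	Disadvantages
1	Python standard library	BeautifulSoup(html, 'html.parser')	Built into Python; moderate speed	Less lenient with malformed markup than lxml or html5lib
2	lxml HTML parser	BeautifulSoup(html, 'lxml')	Fast; lenient	Must be installed separately; depends on C libraries
3	lxml XML parser	BeautifulSoup(html, 'xml')	Fast; supports XML	Must be installed separately; depends on C libraries
4	html5lib parser	BeautifulSoup(html, 'html5lib')	Parses the way a browser does; the most lenient	Slow; must be installed separately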

Installing the lxml parser

# install the lxml parser
pip install lxml
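
The parser is chosen by the second argument to BeautifulSoup. As a minimal sketch, the helper below (make_soup is a hypothetical name, not part of bs4) tries a preferred parser and falls back to the built-in html.parser when that parser is not installed. It assumes that bs4 raises FeatureNotFound in that case.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(markup, preferred="lxml"):
    """Parse markup with the preferred parser, falling back to html.parser."""
    try:
        return BeautifulSoup(markup, preferred)
    except FeatureNotFound:
        # The requested parser library is not installed.
        return BeautifulSoup(markup, "html.parser")


# Deliberately malformed markup: neither <p> nor <b> is closed.
soup = make_soup("<p>unclosed paragraph<b>bold")
print(soup.b.text)
```

Either parser repairs the unclosed tags here, but the exact tree can differ between parsers (lxml, for instance, wraps the fragment in html and body tags), so it is safer to name the parser explicitly when results must be reproducible.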

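Usage

# Create a BeautifulSoup object
import requests
from bs4 import BeautifulSoup

res = requests.get("https://www.mzitu.com/")
soup = BeautifulSoup(res.text, 'lxml')  # arguments: the text to parse, the parser to use

See also: the Beautiful Soup 4 documentation (Chinese translation)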

Example 1: Scraping news from Autohome

Autohome (汽車之家)

import json

import requests
from bs4 import BeautifulSoup

res = requests.get("https://www.autohome.com.cn/news/1/#liststart")
soup = BeautifulSoup(res.text, 'lxml')

ul = soup.find(class_='article')  # find the element whose class is "article" (the news <ul>)
li_list = ul.find_all(name='li')  # find every <li> under that <ul>
new_list = []
for li in li_list:
    title = li.find(name='h3')  # find the <h3> tag under this <li>
    if title:  # skip <li> items without a title
        new_list.append({
            'title': title.text,
            # prepend the scheme to the href and src values
            'link': 'http:' + li.find('a').attrs.get('href'),
            'abstract': li.find('p').text,
            'img': 'http:' + li.find(name='img').attrs.get('src')
        })

# Write each entry as JSON, separated by blank lines
with open('news.doc', mode='wb') as fw:
    for item in new_list:
        fw.write(json.dumps(item, ensure_ascii=False).encode('utf-8'))
        fw.write('\n\n'.encode('utf-8'))
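
The loop above raises an exception as soon as one list item lacks an a, p, or img tag. The sketch below shows one defensive way to write the same extraction. SAMPLE_HTML, parse_news, and the example.com URLs are made up for illustration and only imitate the structure the scraper expects; urljoin is used instead of string concatenation as a design choice, not something the original code does.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

# Hypothetical markup imitating the structure the scraper above expects.
SAMPLE_HTML = """
<ul class="article">
  <li><a href="//www.example.com/news/1.html">
        <img src="//img.example.com/1.jpg">
        <h3>First headline</h3>
        <p>First summary</p></a></li>
  <li class="ad">no title here</li>
  <li><a href="//www.example.com/news/2.html">
        <h3>Second headline</h3></a></li>
</ul>
"""


def parse_news(html, base_url="https://www.example.com/"):
    """Return one dict per <li> that has an <h3>; missing parts become None or ''."""
    soup = BeautifulSoup(html, "html.parser")
    ul = soup.find(class_="article")
    if ul is None:
        return []
    items = []
    for li in ul.find_all("li"):
        title = li.find("h3")
        if title is None:
            continue  # not a news entry
        link = li.find("a")
        summary = li.find("p")
        img = li.find("img")
        items.append({
            "title": title.get_text(strip=True),
            # urljoin resolves protocol-relative ("//host/path") and relative URLs
            "link": urljoin(base_url, link["href"])
            if link is not None and link.has_attr("href") else None,
            "abstract": summary.get_text(strip=True) if summary is not None else "",
            "img": urljoin(base_url, img["src"])
            if img is not None and img.has_attr("src") else None,
        })
    return items


news = parse_news(SAMPLE_HTML)
for item in news:
    print(item)
```

Keeping the parsing in a function that takes an HTML string also makes it testable without any network access.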

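Using the `bs4` module

<1> Traversing the document tree
<2> Searching the document tree (5 kinds of filters)
<3> The limit and recursive parameters

Traversing the document tree

(0) Usage
(1) Getting a tag's name
(2) Getting a tag's attributes
(3) Getting a tag's content
(4) Chained access
(5) Child and descendant nodes
(6) Parent and ancestor nodes
(7) Sibling nodes

Usage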

html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p id="my_p" class="title">hello<b id="bbb" class="boldest">The Dormouse's story</b>
</p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a><a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""

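The lxml parser is lenient and can parse non-standard HTML documents such as the one above, which never closes its <body> and <html> tags.

# Option 1: pass a string
soup = BeautifulSoup(html_doc, 'lxml')
# print(soup.prettify())
# Option 2: pass an open file handle
with open('a.html') as f:
    soup = BeautifulSoup(f, 'lxml')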

Getting a tag's name

soup = BeautifulSoup(html_doc, 'lxml')
head = soup.head
print(type(head))  # <class 'bs4.element.Tag'>
print(head.name)   # head

Getting a tag's attributes (important)

p = soup.body.p   # the first <p> under <body>, a bs4.element.Tag
print(p.attrs)    # {'id': 'my_p', 'class': ['title']}

print(p.attrs.get('id'))  # my_p
print(p['id'])            # my_p
print(p.get('id'))        # my_p
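
The three access styles differ in how they treat a missing attribute, and class is returned as a list because it is a multi-valued attribute. A small sketch with made-up markup, using html.parser so that nothing beyond bs4 is required:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<p id="my_p" class="title big">hello</p>', "html.parser")
p = soup.p

classes = p["class"]            # multi-valued attribute, returned as a list
missing = p.get("lang")         # .get() returns None when the attribute is absent
fallback = p.get("lang", "en")  # .get() accepts a default, like dict.get()

try:
    p["lang"]                   # subscript access raises KeyError instead
    raised = False
except KeyError:
    raised = True

print(classes, missing, fallback, raised)
```

Use p.get() when an attribute may be absent and p['...'] when its absence should count as an error.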

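Getting a tag's content

# .text concatenates the text of all descendants
p = soup.body.p
print(p.text)     # helloThe Dormouse's story

# .string returns the text only when the tag has a single child; otherwise it is None
print(p.string)  # None

# .strings is a generator over every piece of text under p
print(p.strings) # <generator object Tag._all_strings at 0x7f02260213c0>
print(list(p.strings))  # ['hello', "The Dormouse's story", '\n']
# or iterate over it directly
for item in p.strings:
    print(item)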
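
Because .text keeps whitespace and joins the pieces without a separator, the result often needs cleaning. The sketch below uses made-up markup to show two related tools that the text above does not cover: .stripped_strings and get_text() with its separator and strip arguments.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p> hello <b>The Dormouse's story</b>\n</p>", "html.parser")
p = soup.p

# Like .strings, but each piece is stripped and whitespace-only pieces are skipped.
pieces = list(p.stripped_strings)

# get_text() can insert a separator between pieces and strip each one.
joined = p.get_text(separator=" ", strip=True)

# .string works on <b> because it has exactly one child, a string.
b_string = p.b.string

print(pieces)
print(joined)
print(b_string)
```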

Chained access

# Attribute access can be chained; each step returns the first matching tag
a = soup.body.a
print(a.get('id'))  # link1

Child and descendant nodes

print(soup.p.contents)  # all direct children of p, as a list: ['hello', <b class="boldest" id="bbb">The Dormouse's story</b>, '\n']

print(soup.p.children)  # the same direct children as an iterator, e.g. <list_iterator object at 0x7f24215e3b80>

print(list(soup.p.children))  # ['hello', <b class="boldest" id="bbb">The Dormouse's story</b>, '\n']
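
The examples above cover direct children only. For descendants at every depth, bs4 provides .descendants. A short sketch with made-up markup contrasts the two:

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

soup = BeautifulSoup("<p>hello<b>bold<i>italic</i></b></p>", "html.parser")
p = soup.p

children = list(p.children)        # direct children only: 'hello' and <b>
descendants = list(p.descendants)  # every node below p, in document order

# Descendants include text nodes; keep only the tags to list element names.
tag_names = [node.name for node in descendants if isinstance(node, Tag)]

print(len(children), len(descendants), tag_names)
```

Both properties yield text nodes (NavigableString) as well as tags, which is why the isinstance check is needed.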

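Parent and ancestor nodes

# The parent node of the first p tag
print(soup.p.parent)
# All ancestors of the first a tag (its parent, its parent's parent, and so on), as a generator
print(soup.a.parents)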
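
Printing .parents only shows a generator object. To see the ancestors themselves, iterate over it, as in this sketch with made-up markup:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(
    '<html><body><p><a id="link1">Elsie</a></p></body></html>', "html.parser"
)
a = soup.a

# Walk upward from <a>; the last ancestor is the BeautifulSoup object itself,
# whose name is "[document]".
ancestor_names = [tag.name for tag in a.parents]

# find_parent() returns the nearest ancestor that matches a filter.
nearest_body = a.find_parent("body")

print(ancestor_names)
```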

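Sibling nodes

print(soup.a.next_sibling)      # the next sibling
print(soup.a.previous_sibling)  # the previous sibling

print(list(soup.a.next_siblings))      # all following siblings (.next_siblings is a generator)
print(list(soup.a.previous_siblings))  # all preceding siblings (.previous_siblings is a generator)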
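
A sibling is not always a tag: text between two tags is a sibling node too. The sketch below uses made-up markup modelled on the three links in the sample document to show the difference between next_sibling and find_next_sibling().

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(
    '<p><a id="link1">Elsie</a><a id="link2">Lacie</a> and <a id="link3">Tillie</a></p>',
    "html.parser",
)
first = soup.a

second = first.next_sibling            # the adjacent <a id="link2"> tag
after_second = second.next_sibling     # the text node ' and ', not a tag
third = second.find_next_sibling("a")  # skips text nodes, returns <a id="link3">

print(second["id"], repr(str(after_second)), third["id"])
```

When only tags are of interest, find_next_sibling() and find_previous_sibling() avoid having to filter out the text nodes by hand.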

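Searching the document tree

<1> find()      # returns only the first match
<2> find_all()  # returns every match
<3> 5 kinds of filters: string, regular expression, list, True, function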

String filter

# The filter value is a string
a = soup.find(name='a')
res = soup.find(id='my_p')
res = soup.find(class_='story')
res = soup.find(href='http://example.com/elsie')

# The same lookups written with an attrs dict
res = soup.find(attrs={'id': 'my_p'})
res = soup.find(attrs={'class': 'story'})
print(res)

Regular expression

import re

re_b = re.compile('^b')  # tag names that start with "b"
res = soup.find(name=re_b)
res = soup.find_all(name=re_b)  # matches body and b
res = soup.find_all(id=re.compile('^l'))  # ids that start with "l"
print(res)

List

# A tag matches if it matches any item in the list
res = soup.find_all(name=['body', 'b'])
res = soup.find_all(class_=['sister', 'title'])
print(res)

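True

res = soup.find_all(name=True)   # every tag
res = soup.find_all(id=True)     # tags that have an id attribute
res = soup.find_all(id=False)    # tags that have no id attribute
res = soup.find_all(href=True)   # tags that have an href attribute
print(res)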

Function

# The function receives each tag and returns True for the tags to keep
def has_class_but_no_id(tag):
    return tag.has_attr('class') and not tag.has_attr('id')

print(soup.find_all(has_class_but_no_id))
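
To compare the five filter types side by side, the sketch below runs each of them against one small document. The markup is a cut-down, made-up variant of the sample document, and html.parser is used so the example needs nothing beyond bs4.

```python
import re

from bs4 import BeautifulSoup

HTML = """<body>
<p id="my_p" class="title">hello<b id="bbb" class="boldest">bold</b></p>
<p class="story"><a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a></p>
</body>"""
soup = BeautifulSoup(HTML, "html.parser")

# 1. String: exact tag name
by_string = [t["id"] for t in soup.find_all("a")]

# 2. Regular expression: tag names starting with "b"
by_regex = [t.name for t in soup.find_all(re.compile("^b"))]

# 3. List: any of the given tag names, in document order
by_list = [t.name for t in soup.find_all(["a", "b"])]

# 4. True: any tag that has the attribute at all
by_true = [t["id"] for t in soup.find_all(id=True)]


# 5. Function: called with each tag, keeps the ones it returns True for
def has_class_but_no_id(tag):
    return tag.has_attr("class") and not tag.has_attr("id")


by_function = [t.name for t in soup.find_all(has_class_but_no_id)]

print(by_string, by_regex, by_list, by_true, by_function)
```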

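The limit and recursive parameters

# limit caps the number of results returned
res = soup.find_all(name=True, limit=1)
print(res)
# recursive=True (the default) searches all descendants; recursive=False searches direct children only
res = soup.body.find_all(name='b', recursive=False)  # [] because b is nested inside p
res = soup.body.find_all(name='p', recursive=False)
res = soup.body.find_all(name='b', recursive=True)
print(res)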
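
The effect of both parameters is easiest to see by counting results. A minimal sketch with made-up markup:

```python
from bs4 import BeautifulSoup

HTML = "<body><p>one<b>bold</b></p><p>two</p><p>three</p></body>"
soup = BeautifulSoup(HTML, "html.parser")
body = soup.body

first_two = body.find_all("p", limit=2)         # stops after two matches
direct_b = body.find_all("b", recursive=False)  # <b> is not a direct child of <body>
nested_b = body.find_all("b")                   # the default, recursive=True, finds it
direct_p = body.find_all("p", recursive=False)  # all three <p> are direct children

print(len(first_two), len(direct_b), len(nested_b), len(direct_p))
```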

CSS selectors

# Selector reference: https://www.w3school.com.cn/cssref/css_selectors.asp
ret = soup.select('#my_p')
ret = soup.select('body p')   # all descendants
ret = soup.select('body>p')   # direct children only
ret = soup.select('body>p')[0].text  # text of the first direct child
# ret = soup.select('body>p')[1].a   # results are Tag objects, so traversal and find() still work
print(ret)
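
select() always returns a list, while select_one() returns the first match or None. The sketch below uses made-up markup to contrast the descendant and child combinators and to show an attribute selector. It assumes a bs4 version that ships with CSS selector support through the soupsieve package.

```python
from bs4 import BeautifulSoup

HTML = """<body>
<p id="my_p" class="title">hello</p>
<div><p class="story"><a class="sister" id="link1" href="http://example.com/elsie">Elsie</a></p></div>
</body>"""
soup = BeautifulSoup(HTML, "html.parser")

all_p = soup.select("body p")       # every <p> anywhere under <body>
direct_p = soup.select("body > p")  # only <p> tags that are direct children of <body>

first = soup.select_one("#my_p")    # a single Tag
nothing = soup.select_one("#nope")  # None when nothing matches

# Class selector combined with an attribute-prefix selector
hrefs = [a["href"] for a in soup.select('a.sister[href^="http://example.com/"]')]

print(len(all_p), len(direct_p), first.text, nothing, hrefs)
```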

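Modifying the document tree with bs4

bs4 can also modify the document tree, which is useful when a program's configuration file is in XML format.

Common configuration file formats:

# ini: configparser
# conf
# xml: bs4
# yaml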
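
As a minimal sketch of modifying a tree, the example below edits a made-up configuration fragment; the tag names config, port, and debug are hypothetical. It uses html.parser so it runs without extra dependencies. For real XML files the 'xml' parser, which requires lxml, is the more appropriate choice because html.parser lowercases tag names and applies HTML rules.

```python
from bs4 import BeautifulSoup

# Hypothetical configuration fragment.
soup = BeautifulSoup("<config><port>8080</port></config>", "html.parser")
config = soup.find("config")

# Replace the text inside an existing tag.
config.find("port").string = "9090"

# Add or change an attribute.
config["env"] = "prod"

# Create a new tag and append it as the last child.
debug = soup.new_tag("debug")
debug.string = "false"
config.append(debug)

result = str(soup)
print(result)
```

Calling str(soup) or soup.prettify() serializes the modified tree, which can then be written back to a file.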
