Using find and find_all in BeautifulSoup: A Detailed Guide
How to use find and find_all in BeautifulSoup, a handy web-scraping tool
Without further ado, here is a sample HTML document:

<html>
<head>
<title> index </title>
</head>
<body>
<div>
<ul>
<li id="flask" class="item-0"><a href="link1.html">first item</a></li>
<li class="item-1"><a href="link2.html">second item</a></li>
<li class="item-inactie"><a href="link3.html">third item</a></li>
<li class="item-1"><a href="link4.html">fourth item</a></li>
<li class="item-0"><a href="link5.html">fifth item</a>
</ul>
</div>
<li> hello world </li>
</body>
</html>
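Before using BeautifulSoup, you first need to create a BeautifulSoup instance:

from bs4 import BeautifulSoup

# Build the BeautifulSoup instance
soup = BeautifulSoup(html, 'lxml')
# The first argument is the markup to parse
# The second argument is the parser that BeautifulSoup should use

Note that the parser you name must be installed beforehand; lxml was already installed here. The BeautifulSoup documentation lists the parsers that are available.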
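If lxml may not be installed on every machine, one option is to fall back to Python's built-in html.parser. The sketch below assumes that behaviour is acceptable for your pages; make_soup is a hypothetical helper name, not part of bs4, and the shortened markup is only for illustration.

```python
from bs4 import BeautifulSoup, FeatureNotFound

# Shortened sample markup (an assumption for this sketch)
html = "<html><head><title> index </title></head><body><li> hello world </li></body></html>"


def make_soup(markup, preferred="lxml"):
    """Hypothetical helper: try the preferred parser, fall back to html.parser."""
    try:
        return BeautifulSoup(markup, preferred)
    except FeatureNotFound:
        # bs4 raises FeatureNotFound when the requested parser is not installed
        return BeautifulSoup(markup, "html.parser")


soup = make_soup(html)
print(soup.title.string.strip())
```

Different parsers can repair broken markup in slightly different ways, so it is worth settling on one parser for a given project.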
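Next, a look at find and find_all.
1. find
Returns only the first matching object.
Syntax:
find(name,attrs,recursive,text,**kwargs) # recursive: whether to search all descendants rather than only direct children
Parameters: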
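Parameter | Purpose |
---|---|
name | Tag name to search for |
text | Text content to search for (newer Beautiful Soup releases call this argument string) |
attrs | Dictionary of attribute names and values to match |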
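As a rough illustration of the parameters in the table, the sketch below calls find once per parameter. It assumes a recent Beautiful Soup release, so it uses string in place of text, and it uses html.parser plus a shortened copy of the sample markup so that it runs without lxml.

```python
from bs4 import BeautifulSoup

# Shortened sample markup (an assumption for this sketch)
html = (
    '<div><ul>'
    '<li id="flask" class="item-0"><a href="link1.html">first item</a></li>'
    '<li class="item-1"><a href="link2.html">second item</a></li>'
    '</ul></div>'
)
soup = BeautifulSoup(html, 'html.parser')

# name: match on the tag name
by_name = soup.find('a')

# attrs: match on a dictionary of attributes
by_attrs = soup.find('li', attrs={'id': 'flask'})

# string (formerly text): match on the text itself; returns the string, not the tag
by_string = soup.find(string='second item')

# recursive=False: look only at direct children of the div
direct_only = soup.div.find('li', recursive=False)  # li is a grandchild of div
any_depth = soup.div.find('li')

print(by_name['href'], by_attrs['class'], by_string.parent.name)
print(direct_only, any_depth['id'])
```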
Example, using the sample HTML above:

# find returns the first match
li = soup.find('li')
print('find_li:', li)
print('li.text (text content of the tag):', li.text)
print('li.attrs (attributes of the tag):', li.attrs)
print('li.string (content of the tag as a string):', li.string)

Output:
find_li: <li class="item-0" id="flask"><a href="link1.html">first item</a></li>
li.text (text content of the tag): first item
li.attrs (attributes of the tag): {'id': 'flask', 'class': ['item-0']}
li.string (content of the tag as a string): first item
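In the output above, text and string happen to agree because the li holds a single a tag with a single string inside it. The sketch below, on a made-up fragment, shows how they differ once a tag has several children: text joins all nested strings, while string is None.

```python
from bs4 import BeautifulSoup

# Made-up fragment: the li has three children (a tag, a bare string, a b tag)
fragment = '<ul><li><a href="a.html">first</a> and <b>second</b></li></ul>'
soup = BeautifulSoup(fragment, 'html.parser')
li = soup.find('li')

print(repr(li.text))    # all nested strings joined together
print(li.string)        # None, because the li has more than one child
print(li.a.string)      # the a tag has exactly one string child
print(li.attrs)         # empty dict, because the li has no attributes
print(li.get('class'))  # get() returns None for a missing attribute
```

Using get rather than li['class'] avoids a KeyError when the attribute is absent.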
find can also match on attribute=value keyword arguments:

li = soup.find(id='flask')
print(li, '\n')

Output:
<li class="item-0" id="flask"><a href="link1.html">first item</a></li>
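Keyword matching works for other attributes too, and find returns None when nothing matches, so it is usually worth checking the result before using it. A minimal sketch, assuming a shortened copy of the sample markup and html.parser:

```python
from bs4 import BeautifulSoup

# Shortened sample markup (an assumption for this sketch)
html = (
    '<ul>'
    '<li id="flask"><a href="link1.html">first item</a></li>'
    '<li><a href="link2.html">second item</a></li>'
    '</ul>'
)
soup = BeautifulSoup(html, 'html.parser')

# Any attribute can be used as a keyword, not only id
link = soup.find(href='link2.html')

# A tag name and an attribute keyword can be combined
flask_li = soup.find('li', id='flask')

# find returns None when nothing matches, so guard before using the result
missing = soup.find(id='django')
missing_text = missing.text if missing is not None else ''

print(link.text, flask_li.a['href'], repr(missing_text))
```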
Note that because class is a reserved keyword in Python, matching on a tag's class attribute needs a special approach. There are two options:
- Pass a dictionary through the attrs parameter
- Use BeautifulSoup's special keyword argument class_

# Option 1: pass a dictionary through attrs
find_class = soup.find(attrs={'class': 'item-1'})
print('findclass:', find_class, '\n')
# Option 2: BeautifulSoup's special keyword argument class_
beautifulsoup_class_ = soup.find(class_='item-1')
print('BeautifulSoup_class_:', beautifulsoup_class_, '\n')

Output:
findclass: <li class="item-1"><a href="link2.html">second item</a></li>
BeautifulSoup_class_: <li class="item-1"><a href="link2.html">second item</a></li>
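One detail worth knowing with either approach: class is a multi-valued attribute, so a tag matches if any one of its classes equals the value you pass. The sketch below uses a made-up fragment in which the first li carries two classes.

```python
from bs4 import BeautifulSoup

# Made-up fragment: the first li has two classes
fragment = '<ul><li class="item-1 active">second</li><li class="item-1">fourth</li></ul>'
soup = BeautifulSoup(fragment, 'html.parser')

# Either class on its own is enough to match the first li
by_keyword = soup.find(class_='active')
by_attrs = soup.find(attrs={'class': 'item-1'})

# The class attribute is exposed as a list of class names
print(by_keyword['class'])
print(by_attrs.text)
print(len(soup.find_all(class_='item-1')))
```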
2. find_all
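Returns every match as a list, unlike find, which returns only the first result.
Syntax:
find_all(name,attrs,recursive,text,limit,**kwargs)
Parameters:
Parameter | Purpose |
---|---|
name | Tag name to search for |
text | Text content to search for (newer Beautiful Soup releases call this argument string) |
attrs | Dictionary of attribute names and values to match |
limit | Maximum number of results to return |
The parameters work the same way as in find; limit additionally caps the number of results returned.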
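To make the relationship between find, find_all and limit concrete, here is a small sketch on a made-up three-item list. It also shows that find_all returns an empty list, rather than None, when nothing matches.

```python
from bs4 import BeautifulSoup

# Made-up fragment for this sketch
fragment = '<ul><li>one</li><li>two</li><li>three</li></ul>'
soup = BeautifulSoup(fragment, 'html.parser')

all_li = soup.find_all('li')               # every match
first_two = soup.find_all('li', limit=2)   # stop after two matches
no_tables = soup.find_all('table')         # empty list when nothing matches

# find behaves like find_all with limit=1, returning the tag itself
same_tag = soup.find('li') is soup.find_all('li', limit=1)[0]

print(len(all_li), [li.text for li in first_two], no_tables, same_tag)
```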
Here is an example that loops over every li:

# find_all returns every match
li_all = soup.find_all('li')
for li in li_all:
    print('---')
    print('Matched li:', li)
    print('li content:', li.text)
    print('li attributes:', li.attrs)

Output:
---
Matched li: <li class="item-0" id="flask"><a href="link1.html">first item</a></li>
li content: first item
li attributes: {'id': 'flask', 'class': ['item-0']}
---
Matched li: <li class="item-1"><a href="link2.html">second item</a></li>
li content: second item
li attributes: {'class': ['item-1']}
---
Matched li: <li class="item-inactie"><a href="link3.html">third item</a></li>
li content: third item
li attributes: {'class': ['item-inactie']}
---
Matched li: <li class="item-1"><a href="link4.html">fourth item</a></li>
li content: fourth item
li attributes: {'class': ['item-1']}
---
Matched li: <li class="item-0"><a href="link5.html">fifth item</a>
</li>
li content: fifth item
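A common follow-up in a scraper is to collect the link target and link text from every match. The sketch below is one possible way to do that, assuming a shortened copy of the sample markup and html.parser; the variable name links is only illustrative.

```python
from bs4 import BeautifulSoup

# Shortened sample markup (an assumption for this sketch)
html = (
    '<ul>'
    '<li class="item-0"><a href="link1.html">first item</a></li>'
    '<li class="item-1"><a href="link2.html">second item</a></li>'
    '<li class="item-1"><a href="link4.html">fourth item</a></li>'
    '</ul>'
)
soup = BeautifulSoup(html, 'html.parser')

# Build (href, text) pairs from every a tag in document order
links = [(a['href'], a.get_text(strip=True)) for a in soup.find_all('a')]

for href, text in links:
    print(href, '->', text)
```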
Here is a more flexible way to query with find_all:

# The most flexible approach: filter on an attrs dictionary
li_quick = soup.find_all(attrs={'class': 'item-1'})
for li in li_quick:
    print('Most flexible query:', li)

Output:
Most flexible query: <li class="item-1"><a href="link2.html">second item</a></li>
Most flexible query: <li class="item-1"><a href="link4.html">fourth item</a></li>
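Beyond plain strings and dictionaries, find_all also accepts a list of names, a compiled regular expression, or a function as a filter. The sketch below tries each one on a made-up fragment; the specific patterns are only examples.

```python
import re

from bs4 import BeautifulSoup

# Made-up fragment for this sketch
fragment = (
    '<ul>'
    '<li class="item-0"><a href="link1.html">first</a></li>'
    '<li class="item-1"><a href="link2.html">second</a></li>'
    '<li>plain</li>'
    '</ul>'
)
soup = BeautifulSoup(fragment, 'html.parser')

# A list matches any of the given tag names
li_and_a = soup.find_all(['li', 'a'])

# A compiled regular expression is tested against each class name
item_like = soup.find_all('li', class_=re.compile(r'^item-'))

# A function receives each tag and returns True to keep it
without_class = soup.find_all(lambda tag: tag.name == 'li' and not tag.has_attr('class'))

print([tag.name for tag in li_and_a])
print([li['class'] for li in item_like])
print([li.text for li in without_class])
```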
Full code:

# coding=utf8
# @Author= CaiJunxuan
# @QQ=469590490
# @Wechat:15916454524
# beautifulsoup

# Import the BeautifulSoup class
from bs4 import BeautifulSoup

# Sample HTML
html = '''
<html>
<head>
<title> index </title>
</head>
<body>
<div>
<ul>
<li id="flask" class="item-0"><a href="link1.html">first item</a></li>
<li class="item-1"><a href="link2.html">second item</a></li>
<li class="item-inactie"><a href="link3.html">third item</a></li>
<li class="item-1"><a href="link4.html">fourth item</a></li>
<li class="item-0"><a href="link5.html">fifth item</a>
</ul>
</div>
<li> hello world </li>
</body>
</html>
'''

# Build the BeautifulSoup instance
soup = BeautifulSoup(html, 'lxml')
# The first argument is the markup to parse
# The second argument is the parser that BeautifulSoup should use
# html.parser is Python's built-in parser, but it is slower than lxml, so it is used less often
# lxml uses the lxml module
# html5lib parses the markup into an HTML5 document
# Each parser needs its module installed; for example, using lxml requires installing lxml

# bs4 offers many ways to search, but two are used most often:

# find returns the first match
li = soup.find('li')
print('find_li:', li.string)
print(50 * '*', '\n')

# find can select by attribute=value
li = soup.find(id='flask')
print(li, '\n')

# class is a reserved keyword in Python, so it cannot be used directly as a keyword argument
# There are two ways to query the class attribute
# Option 1: pass a dictionary through attrs
find_class = soup.find(attrs={'class': 'item-1'})
print('findclass:', find_class, '\n')

# find_all returns every match
li_all = soup.find_all('li')
for li in li_all:
    print('---')
    print('Matched li:', li.attrs)

# The most flexible approach: filter on an attrs dictionary
li_quick = soup.find_all(attrs={'class': 'item-1'})
for li in li_quick:
    print('Most flexible query:', li)