Web Crawling: Getting Started with BeautifulSoup (Part 4)

阿新 • Published: 2019-01-22

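5. The find method with more arguments
The official documentation gives the signature as find(name, attrs, recursive, string, **kwargs). Its parameters are essentially the same as those of find_all (only limit is absent), so common usage is again shown by example:
The two methods are used in much the same way. Both calls below locate the same tag, but find is clearly the more convenient choice when only one result is wanted.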

soup.find_all('title', limit=1)
# [<title>The Dormouse's story</title>]

soup.find('title')
# <title>The Dormouse's story</title>

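Note the difference in return values: find_all returns a list, while find returns the matching element itself. When nothing matches, find_all returns an empty list and find returns None.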
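The following is a minimal sketch of how the find parameters and the two return conventions play out in practice. The HTML snippet and the helper tag_text are made up for this illustration and are not part of the original article.

```python
from bs4 import BeautifulSoup

# Hypothetical sample document, invented for this illustration.
html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="story">Once upon a time
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a> and
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
</p>
</body></html>
"""
soup = BeautifulSoup(html, "html.parser")

# Keyword filters (**kwargs) and the attrs dict both select by attribute.
first_sister = soup.find("a", class_="sister")
by_id = soup.find("a", attrs={"id": "link2"})

# string matches against the text of the tag.
by_text = soup.find("a", string="Lacie")

# recursive=False only looks at direct children; <title> sits under <head>,
# so searching the direct children of <html> finds nothing.
direct_title = soup.html.find("title", recursive=False)

# On a miss, find returns None and find_all returns an empty list.
missing_one = soup.find("table")
missing_all = soup.find_all("table")


def tag_text(tag, default=""):
    """Return the text of a tag, or a default when find() returned None."""
    return tag.get_text() if tag is not None else default


print(tag_text(first_sister))
print(tag_text(missing_one, default="(no table)"))
```

Checking for None before touching the result avoids the AttributeError that would otherwise appear when a page lacks the expected tag.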
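6. Output format and encoding
The prettify method formats a BeautifulSoup object for output, placing each tag and string on its own indented line, which is very useful in large projects. It can also be called on a single tag node, as follows: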

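from bs4 import BeautifulSoup

markup = '<a href="http://example.com/">I linked to <i>example.com</i></a>'
# Name the parser explicitly so the result does not depend on what is installed.
soup = BeautifulSoup(markup, 'html.parser')

print(soup.a.prettify())
# <a href="http://example.com/">
#  I linked to
#  <i>
#   example.com
#  </i>
# </a>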

If you only need the markup as a string and do not care about formatting, call str() on the tag, as follows:

str(soup.a)
# '<a href="http://example.com/">I linked to <i>example.com</i></a>'
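Since this section is about output format and encoding, a short comparison of the output forms may help. This sketch assumes the same markup as above; the variable names are illustrative only. It uses the tag's encode method, which returns the markup as bytes.

```python
from bs4 import BeautifulSoup

markup = '<a href="http://example.com/">I linked to <i>example.com</i></a>'
soup = BeautifulSoup(markup, "html.parser")

compact = str(soup.a)         # single-line markup as text
pretty = soup.a.prettify()    # indented, multi-line markup as text
raw = soup.a.encode("utf-8")  # the same markup encoded as UTF-8 bytes

print(type(compact).__name__, type(pretty).__name__, type(raw).__name__)
print(compact == raw.decode("utf-8"))
```

The text forms suit printing and logging, while the bytes form is the one to write to a file opened in binary mode.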

7. get_text()
To get the text contained in a tag, use the get_text() method, as follows:

soup.get_text()
# 'I linked to example.com'

soup.i.get_text()
# 'example.com'
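get_text() also accepts a separator string and a strip flag, which matter once the markup contains line breaks. The sketch below uses a variant of the markup with newlines inside the link, which is an assumption made for this illustration rather than the markup used above.

```python
from bs4 import BeautifulSoup

# Variant of the earlier markup with newlines inside the <a> tag (assumed here).
markup = '<a href="http://example.com/">\nI linked to <i>example.com</i>\n</a>'
soup = BeautifulSoup(markup, "html.parser")

plain = soup.get_text()                    # '\nI linked to example.com\n'
joined = soup.get_text("|")                # '\nI linked to |example.com|\n'
stripped = soup.get_text("|", strip=True)  # 'I linked to|example.com'
pieces = list(soup.stripped_strings)       # ['I linked to', 'example.com']

print(repr(plain))
print(repr(stripped))
print(pieces)
```

Passing strip=True trims whitespace from each piece of text before joining, which is usually what is wanted when the text goes into a table or a CSV file.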

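8. Practice
Source code for a practice project: web page table scraping
Overview: the crawler part of the project mainly uses get_text, find, find_all, and prettify to extract the tables from the web page at a given URL, then store and display them.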
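As a rough idea of how those methods combine for table extraction, here is a minimal sketch. It is not the project's actual code: the HTML string and the function extract_table are hypothetical, and a real crawler would first download the page from the given URL.

```python
from bs4 import BeautifulSoup

# Hypothetical page content; a real crawler would fetch this from a URL.
html = """
<table id="scores">
  <tr><th>Name</th><th>Score</th></tr>
  <tr><td>Alice</td><td> 90 </td></tr>
  <tr><td>Bob</td><td>85</td></tr>
</table>
"""


def extract_table(html_text):
    """Return the first table on the page as a list of rows of cell text."""
    soup = BeautifulSoup(html_text, "html.parser")
    table = soup.find("table")
    if table is None:  # find returns None when the page has no table
        return []
    rows = []
    for tr in table.find_all("tr"):
        # Accept both header cells and data cells.
        cells = tr.find_all(["th", "td"])
        rows.append([cell.get_text(strip=True) for cell in cells])
    return rows


rows = extract_table(html)
for row in rows:
    print("\t".join(row))
```

Returning plain lists of strings keeps the parsing step separate from storage and display, so the same rows can be written to CSV or rendered elsewhere.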
