Python 3 Web Scraping from Scratch: Using Beautiful Soup
By 阿新 · Published 2018-12-16
Basic Usage
Example 1:
from bs4 import BeautifulSoup

html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!--Elsie--></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...<p>
"""
soup = BeautifulSoup(html, 'lxml')
print(soup.prettify())
print(type(soup))
print(soup.title.string)
Result:
Note 1:
Notice that the input is not a complete HTML string (closing tags such as </body> are missing); BeautifulSoup automatically corrects the markup when it is initialized.
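A minimal sketch of that auto-correction, using the stdlib 'html.parser' backend so no third-party parser is required (with the 'lxml' parser used in this article, the fragment would additionally be wrapped in <html> and <body> tags):

```python
from bs4 import BeautifulSoup

# A deliberately incomplete fragment: no closing </b> or </p>.
broken = "<p class='title'><b>The Dormouse's story"
soup = BeautifulSoup(broken, 'html.parser')

# The parser closes the dangling tags for us.
print(soup.prettify())
print(soup.p.b.string)
```

The repaired tree can then be navigated as if the input had been well-formed.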
Note 2:
soup.title.string returns the text of the title node directly.
Node Selectors
Example 2: Extracting Information and Nested Selection
from bs4 import BeautifulSoup

html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!--Elsie--></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...<p>
"""
soup = BeautifulSoup(html, 'lxml')
# Selecting nodes #
print('soup.title:', soup.title)  # select an element
print('soup.a:', soup.a)  # only the first <a> node is returned
# Getting the tag name #
print('soup.title.name:', soup.title.name)  # .name returns the tag name
# Getting attributes #
print('soup.p.attrs:', soup.p.attrs)  # .attrs returns all attributes as a dict
print('soup.p.attrs["name"]:', soup.p.attrs['name'])  # get one specific attribute
print('soup.p["name"]:', soup.p['name'])  # shorthand form
# Getting content #
print('soup.p.string:', soup.p.string)
# Nested selection #
print('soup.head.title:', soup.head.title)
Result:
Example 3: Child Nodes
from bs4 import BeautifulSoup

html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!--Elsie--></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...<p>
"""
soup = BeautifulSoup(html, 'lxml')
for i, content in enumerate(soup.p.contents):
    print(i, content)
for i, child in enumerate(soup.p.children):
    print(i, child)
# Iterating over the two produces identical output.
print('contents:', soup.p.contents)
print('children:', soup.p.children)
print('type of contents:', type(soup.p.contents))
print('type of children:', type(soup.p.children))
Result:
Note: both the contents attribute and the children attribute return the direct child nodes, but mind the difference between them: contents is a list, while children is an iterator.
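That difference can be seen directly (a small sketch, using the stdlib 'html.parser' backend for portability): contents is a real list you can index and reuse, while children is a one-shot iterator.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p>one<b>two</b>three</p>", 'html.parser')

kids_list = soup.p.contents   # a list: indexable, reusable
kids_iter = soup.p.children   # an iterator: consumed once

print(kids_list[1])           # direct indexing works: <b>two</b>
first_pass = list(kids_iter)
second_pass = list(kids_iter) # already exhausted: []
print(len(first_pass), len(second_pass))
```

If you need to walk the children more than once, use contents or re-read the children attribute.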
Example 4: Descendant Nodes
from bs4 import BeautifulSoup
html ="""
<html><head><title>The Dormouse's story</title></head>
<body>
<p class = "story">Once upon a time there were three little sisters;and
their names were
<a href = "http://example.com/elsie" class = "sister" id = "link1"><!--Elsie--></a>,
<a href = "http://example.com/lacie" class = "sister" id = "link2">Lacie</a> and
<a href = "http://example.com/tillie" class = "sister" id = "link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class = "story">...<p>
"""
soup = BeautifulSoup(html, 'lxml')
for i, content in enumerate(soup.p.descendants):
    print(i, content)
print('type:', type(soup.p.descendants))
Result:
Example 5: Parent and Ancestor Nodes
from bs4 import BeautifulSoup
html ="""
<html><head><title>The Dormouse's story</title></head>
<body>
<p class = "story">Once upon a time there were three little sisters;and
their names were
<a href = "http://example.com/elsie" class = "sister" id = "link1"><!--Elsie--></a>,
and they lived at the bottom of a well.</p>
<p class = "story">...<p>
"""
soup = BeautifulSoup(html,'lxml')
print(list(enumerate(soup.p.parents)))
print(soup.p.parent)
print(type(soup.p.parents))
print(type(soup.p.children))
print(type(soup.p.contents))
print(type(soup.p.descendants))
Result:
Note 1:
The parent attribute returns the direct parent node, while the parents attribute returns all ancestor nodes.
Note 2:
Distinguish the types these attributes return: contents is a list, children is an iterator, and parents and descendants are generators.
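One way to make the parents chain concrete is to collect the tag names along it (a sketch, again with the stdlib 'html.parser' backend; the root of the tree reports the special name '[document]'):

```python
from bs4 import BeautifulSoup

html = "<html><body><p><a href='#'>link</a></p></body></html>"
soup = BeautifulSoup(html, 'html.parser')

# Walk from the <a> node up to the root, collecting tag names.
ancestors = [node.name for node in soup.a.parents]
print(ancestors)   # ['p', 'body', 'html', '[document]']
```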
Example 6: Sibling Selectors
from bs4 import BeautifulSoup
html ="""
<html><head><title>The Dormouse's story</title></head>
<body>
<p>hahaha</p>
Learning
<a>C++</a>
HELLO
<a>Java</a>
World
<a>Python</a>
<a>JS</a>
<p class = "story">...<p>
"""
soup = BeautifulSoup(html,'lxml')
print('Next Sibling:', soup.a.next_sibling)
print('Prev Sibling:', soup.a.previous_sibling)
print('Next Siblings:', list(enumerate(soup.a.next_siblings)))
print('Prev Siblings:', list(enumerate(soup.a.previous_siblings)))
Result:
Example 7: Method Selectors
The find_all() signature is as follows: find_all(name, attrs, recursive, text, **kwargs)
from bs4 import BeautifulSoup
html ="""
<div class = "C1">
<div class = "C2">
<h1>Hello</h1>
</div>
<div class = "C3">
<ul class = "U1" id = "list1">
<li class = "element">C++</li>
<li class = "element">Java</li>
</ul>
<ul class ="U2" id = "list2">
<li class = "element">Python</li>
</ul>
</div>
</div>
"""
soup = BeautifulSoup(html,'lxml')
print(soup.find_all(name='ul'))  # query elements by tag name
print()
print(soup.find_all(attrs={'class': 'element'}))  # query by attrs; the argument is a dict
print()
print(soup.find_all(id='list2'))  # an alternative keyword-argument form
print()
print(soup.find_all(class_='element'))  # class is a Python keyword, so remember the trailing underscore
Result:
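Two further find_all() parameters from the signature are worth a quick sketch (stdlib 'html.parser' backend; in recent Beautiful Soup versions the text argument is also accepted under the name string): recursive controls whether the search descends below direct children, and text matches node text by string or regex.

```python
import re
from bs4 import BeautifulSoup

html = "<div><ul><li>C++</li><li>Java</li></ul></div>"
soup = BeautifulSoup(html, 'html.parser')

# recursive=False searches only direct children: starting from <div>,
# the <li> nodes are grandchildren, so nothing is found.
print(soup.div.find_all('li', recursive=False))   # []
print(len(soup.div.find_all('li')))               # 2

# text matches against node text; it accepts a string or a compiled regex.
print(soup.find_all(text=re.compile('Ja')))       # ['Java']
```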
Supplement: in addition to find_all(), there is also a find() method, which returns only the first matching element.
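A quick sketch of find() versus find_all() (stdlib 'html.parser' backend): find() returns a single Tag, or None when nothing matches, so there is no list to index.

```python
from bs4 import BeautifulSoup

html = "<ul><li class='element'>C++</li><li class='element'>Java</li></ul>"
soup = BeautifulSoup(html, 'html.parser')

first = soup.find('li')   # a single Tag, not a list
print(first.string)       # C++
print(soup.find('span'))  # None when there is no match
```

Checking for None before using the result is the usual pattern with find().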
Example 8: CSS Selectors
from bs4 import BeautifulSoup
html ="""
<div class = "C1">
<div class = "C2">
<h1>Hello</h1>
</div>
<div class = "C3">
<ul class = "U1" id = "list1">
<li class = "element">C++</li>
<li class = "element">Java</li>
</ul>
<ul class ="U2" id = "list2">
<li class = "element">Python</li>
</ul>
</div>
</div>
"""
soup = BeautifulSoup(html,'lxml')
print(soup.select('.U1'))  # select by CSS class
print()
print(soup.select('ul li')[0])  # descendant selection
print()
print(soup.select('ul li')[0].attrs['class'])  # get an attribute
print()
print(soup.select('li')[2].string)  # get the text
print()
print(soup.select('li')[2].get_text())  # another way to get the text
Result:
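As a closing note, select() also works on Tag objects, not just on the top-level soup, so selections can be chained (a sketch in the spirit of Example 8, stdlib 'html.parser' backend):

```python
from bs4 import BeautifulSoup

html = """
<ul id="list1"><li>C++</li><li>Java</li></ul>
<ul id="list2"><li>Python</li></ul>
"""
soup = BeautifulSoup(html, 'html.parser')

# First narrow down to one <ul>, then select within it.
for ul in soup.select('ul'):
    names = [li.string for li in ul.select('li')]
    print(ul['id'], names)
```

This avoids writing ever-longer selector strings when drilling into nested structures.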