Python 3 bs4 BeautifulSoup: notes on parsing (points of confusion)

阿新 • Published: 2019-01-31

1 BeautifulSoup

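As long as the information you want sits inside or next to a tag, you can reach it through that tag's name without spelling out every level in between: attribute access such as soup.h1 searches all descendants, not just direct children.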

soup.html.body.h1

soup.body.h1

soup.html.h1

soup.h1

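All four expressions return the same content.

Still, include the important enclosing tags in the path; a path that is too short may match content you did not want.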
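To make the point about short and long paths concrete, here is a minimal sketch. The HTML string is a made-up stand-in (an assumption, not taken from any real page); it only shows that the four paths reach the same element, and that adding an enclosing tag changes which h1 is matched.

```python
from bs4 import BeautifulSoup

# Hypothetical document, used only for illustration.
html = (
    "<html><body>"
    "<h1>Site banner</h1>"
    "<div><h1>Article title</h1></div>"
    "</body></html>"
)
soup = BeautifulSoup(html, "html.parser")

# Every path below ends at the first <h1> in document order.
paths = [soup.html.body.h1, soup.body.h1, soup.html.h1, soup.h1]
same_tag = all(p is soup.h1 for p in paths)

# Including the enclosing <div> selects a different, more specific <h1>.
banner = soup.h1.string
article = soup.div.h1.string

print(same_tag)
print(banner)
print(article)
```

Here soup.h1 alone picks up the banner, while soup.div.h1 reaches the heading inside the div.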

<li>

</li>

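Here li and div are tags, so they can be accessed as soup.li and soup.div.

aria-label, class and role, on the other hand, are attributes rather than tags; they are read through div.attrs.

For example: divs = soup.findAll("div", {"role": "img"})

div is the tag name; inside the braces, role is an attribute of that tag and img is the value it must have.

This is equivalent to

divs = soup.findAll("div", attrs={"role": "img"})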
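The two call forms can be compared on a small sample. This is only a sketch: the HTML and the attribute values are invented for illustration, and it uses find_all, the current spelling of findAll.

```python
from bs4 import BeautifulSoup

# Hypothetical markup; the attribute values are made up.
html = """
<ul>
  <li><div role="img" aria-label="4 stars" class="rating"></div></li>
  <li><div role="note" class="rating"></div></li>
</ul>
"""
soup = BeautifulSoup(html, "html.parser")

# The attribute filter can be passed positionally or as attrs=...
positional = soup.find_all("div", {"role": "img"})
keyword = soup.find_all("div", attrs={"role": "img"})

# Attributes of a matched tag are read from the tag, not from soup.
labels = [div["aria-label"] for div in keyword]
roles = [div.attrs["role"] for div in keyword]

print(len(positional), len(keyword))
print(labels, roles)
```

Only the div whose role is img is returned; the second div has the same class but a different role, so the filter skips it.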

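3 Note the difference between find and findAll

soup.div.findAll("img") finds every img inside the first div tag, not the img tags of all div tags.

soup.div.find_next("div").findAll('img') finds every img inside the next div that follows the first one in document order, that is, the second div on the page.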
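A short sketch of the scoping described above, assuming a made-up page with two sibling div tags. The file names are placeholders.

```python
from bs4 import BeautifulSoup

# Hypothetical markup with two sibling <div> tags.
html = """
<div id="first"><img src="a.png"/><img src="b.png"/></div>
<div id="second"><img src="c.png"/></div>
"""
soup = BeautifulSoup(html, "html.parser")

# soup.div is only the FIRST <div>, so the search stays inside it.
first_div_imgs = [img["src"] for img in soup.div.find_all("img")]

# find_next("div") moves on to the next <div> in document order.
second_div_imgs = [img["src"] for img in soup.div.find_next("div").find_all("img")]

# To cover every <div>, iterate over all of them explicitly.
all_div_imgs = [
    img["src"]
    for div in soup.find_all("div")
    for img in div.find_all("img")
]

print(first_div_imgs)
print(second_div_imgs)
print(all_div_imgs)
```

Note that with nested div tags, find_next("div") would return the inner div, because it follows document order rather than sibling order.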

Data types in bs4

Tag

The basic unit of information in a document; <> and </> mark its start and end.

For example:

from bs4 import BeautifulSoup
import requests
r = requests.get("https://python123.io/ws/demo.html")
demo = r.text

soup=BeautifulSoup(demo,"html.parser")
print(soup.title)

print(soup.a)

Output:

<title>This is a python demo page</title>

<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a>

Any tag that exists in the HTML can be accessed as soup.<tag>.
When the HTML document contains several <tag> elements, soup.<tag> returns the first one.

Type: <class 'bs4.element.Tag'>
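The same behaviour can be checked offline. The snippet below is a sketch with an inline HTML string standing in for the demo page (the ids and link texts are assumptions); it shows that soup.a is the first match, and that a tag name not present in the document gives None.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Inline stand-in for a page with two links.
html = '<p><a id="link1">Basic Python</a> and <a id="link2">Advanced Python</a></p>'
soup = BeautifulSoup(html, "html.parser")

first = soup.a        # first <a> in document order
missing = soup.table  # no <table> in this document

print(type(first))
print(first["id"])
print(missing)
```

Because a missing tag yields None, chaining such as soup.table.tr fails with an AttributeError when the table is absent, so it is worth checking for None first.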

Tag name

name: the name of the tag; the name of <p>…</p> is 'p'. Format: <tag>.name

Example:

from bs4 import BeautifulSoup
import requests
r = requests.get("https://python123.io/ws/demo.html")
demo = r.text
soup=BeautifulSoup(demo,"html.parser")

print(soup.a.name)

print(soup.a.parent.name)

Output:

a

p

Every <tag> has a name, available as <tag>.name.

Type: string, <class 'str'>
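As a small extension of the example above, .name can be combined with .parents to see where a tag sits in the tree. This is a sketch on a hypothetical one-line document.

```python
from bs4 import BeautifulSoup

# Hypothetical document, used only for illustration.
html = "<html><body><p><a href='#'>link</a></p></body></html>"
soup = BeautifulSoup(html, "html.parser")

# Start with the tag itself, then walk up through its ancestors.
names = [soup.a.name] + [parent.name for parent in soup.a.parents]

print(type(soup.a.name))
print(names)
```

The last ancestor is the BeautifulSoup object itself, which also has a name.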

Tag attrs (attributes)

Attributes: the attributes of the tag, organized as a dictionary. Format: <tag>.attrs

Example:

from bs4 import BeautifulSoup
import requests
r = requests.get("https://python123.io/ws/demo.html")
demo = r.text
soup=BeautifulSoup(demo,"html.parser")

tag=soup.a

print(tag.attrs)

print(tag.attrs['class'])

Output:

{'href': 'http://www.icourse163.org/course/BIT-268001', 'id': 'link1', 'class': ['py1']}
['py1']

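A <tag> can have zero or more attributes.

Type: dictionary, <class 'dict'> (the value of a multi-valued attribute such as class is a <class 'list'>)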
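The following sketch illustrates the dictionary behaviour on an invented snippet (the URL and class names are placeholders): attrs is a plain dict, class comes back as a list, and a tag without attributes has an empty dict.

```python
from bs4 import BeautifulSoup

# Hypothetical markup; URL and class names are placeholders.
html = '<a class="py1 external" href="http://example.com/" id="link1">Basic Python</a><b>bold</b>'
soup = BeautifulSoup(html, "html.parser")

tag = soup.a
attrs_type = type(tag.attrs)   # dict
classes = tag["class"]         # multi-valued attribute -> list of strings
href = tag["href"]             # single-valued attribute -> string
title = tag.get("title")       # missing attribute -> None instead of KeyError
no_attrs = soup.b.attrs        # tag with zero attributes -> empty dict

print(attrs_type, classes, href, title, no_attrs)
```

tag["href"] is shorthand for tag.attrs["href"]; tag.get is the safer choice when the attribute may be absent.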

Tag NavigableString

NavigableString: the non-attribute string inside a tag, that is, the text between <> and </>. Format: <tag>.string

from bs4 import BeautifulSoup
import requests
r = requests.get("https://python123.io/ws/demo.html")
demo = r.text
soup=BeautifulSoup(demo,"html.parser")

print(soup.a.string)

soup.a is

<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a>

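The output is Basic Python. A NavigableString can span several levels of nesting: when a tag's only child is another tag, .string returns the text of that inner tag.

Type: <class 'bs4.element.NavigableString'>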
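A minimal sketch of how .string behaves across levels, using two made-up paragraphs. It also shows the case where .string gives None because the tag has more than one child.

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString

# Hypothetical markup: one deeply nested text, one paragraph with mixed content.
html = "<p><b><i>deep text</i></b></p><p>one <b>two</b></p>"
soup = BeautifulSoup(html, "html.parser")
first_p, second_p = soup.find_all("p")

# A chain of single children: .string reaches through <b> and <i>.
nested = first_p.string

# Two children (a text node and a <b> tag): .string is ambiguous, so it is None.
mixed = second_p.string

# get_text() joins all text below the tag instead.
joined = second_p.get_text()

print(repr(nested), type(nested).__name__)
print(mixed)
print(repr(joined))
```

When a tag may contain mixed content, get_text() is usually the more predictable way to extract its text.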

Tag Comment

Comment: the comment part of the string inside a tag; a special kind of NavigableString.

Type: <class 'bs4.element.Comment'>

Example:

from bs4 import BeautifulSoup
newsoup = BeautifulSoup("<b><!--This is a comment--></b><p>This is not a comment</p>", "html.parser")
print(newsoup.b.string)

Output:

This is a comment

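Type: <class 'bs4.element.Comment'>

Note that <b> here does not wrap ordinary text, as in <b>blalal</b>.

Its content is a comment written directly inside <b>, so .string is a Comment rather than an ordinary string. print shows only the comment text, without the <!-- --> markers, so check the type when the difference matters.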
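Since a printed Comment looks the same as ordinary text, a type check is one way to tell them apart. The sketch below reuses the two-tag string from the example; the variable names are my own.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment, NavigableString

newsoup = BeautifulSoup(
    "<b><!--This is a comment--></b><p>This is not a comment</p>",
    "html.parser",
)

b_string = newsoup.b.string   # the comment inside <b>
p_string = newsoup.p.string   # ordinary text inside <p>

# Comment is a subclass of NavigableString, so test for Comment specifically.
b_is_comment = isinstance(b_string, Comment)
p_is_comment = isinstance(p_string, Comment)

print(b_string, type(b_string).__name__, b_is_comment)
print(p_string, type(p_string).__name__, p_is_comment)
```

A scraper that only wants visible text can skip any string for which the Comment check is true.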

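Summary:

When traversing the HTML tree, pay attention to a few special return types.

soup.children returns an iterator, so BeautifulSoup search methods cannot be called on it directly; loop over it instead.

soup.findAll(...) returns a result set (a list-like collection of tags), not a single tag. Calling tag methods or tag-style indexing on the result set raises an error; iterate over it or index it like a list first.

Child tags are not attributes: for a tag a, a.attrs["td"] is wrong; use a("td"), which returns the list of td tags inside a.

Above all, remember that soup.find and findAll are very different: find returns a single tag, findAll returns a set of tags.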
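The differences in return types can be seen on a small table. This is a sketch under assumptions: the table is invented, and find_all is used as the current spelling of findAll.

```python
from bs4 import BeautifulSoup

# Hypothetical table, used only for illustration.
html = "<table><tr><td>1</td><td>2</td></tr><tr><td>3</td></tr></table>"
soup = BeautifulSoup(html, "html.parser")

first_row = soup.find("tr")   # a single Tag (or None when nothing matches)
rows = soup.find_all("tr")    # a list-like result set of Tag objects

# Tag methods do not exist on the result set itself.
try:
    rows.find_all("td")
    result_set_error = False
except AttributeError:
    result_set_error = True

# Iterate first, then search inside each Tag;
# row("td") is shorthand for row.find_all("td").
cells = [[str(td.string) for td in row("td")] for row in rows]

# .children is an iterator over direct children, so loop over it.
child_names = [child.name for child in soup.table.children]

print(type(first_row).__name__, len(rows), result_set_error)
print(cells)
print(child_names)
```

The pattern is always the same: narrow down to a single Tag (by find, indexing, or looping) before calling another search method.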
