Python 3 Web Scraping from Scratch: Using Beautiful Soup
By 阿新 · Published 2018-12-16
Basic Usage
Example 1:
from bs4 import BeautifulSoup

html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!--Elsie--></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...<p>
"""
soup = BeautifulSoup(html, 'lxml')
print(soup.prettify())
print(type(soup))
print(soup.title.string)
Result:
Note 1:
Notice that the input is not a complete HTML string (closing tags such as </body> are missing); BeautifulSoup automatically corrects the markup when it is initialized.
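A minimal sketch of that auto-correction, using the stdlib 'html.parser' backend so no third-party parser is required (with the 'lxml' parser used in this article, the fragment would additionally be wrapped in <html> and <body> tags):

```python
from bs4 import BeautifulSoup

# A deliberately incomplete fragment: no closing </b> or </p>.
broken = "<p class='title'><b>The Dormouse's story"
soup = BeautifulSoup(broken, 'html.parser')

# The parser closes the dangling tags for us.
print(soup.prettify())
print(soup.p.b.string)
```

The repaired tree can then be navigated as if the input had been well-formed.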
Note 2:
soup.title.string returns the text of the title node directly.
Node Selectors
Example 2: Extracting Information and Nested Selection
from bs4 import BeautifulSoup

html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!--Elsie--></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...<p>
"""
soup = BeautifulSoup(html, 'lxml')
# Selecting nodes #
print('soup.title:', soup.title)  # select an element
print('soup.a:', soup.a)  # only the first <a> node is returned
# Getting the tag name #
print('soup.title.name:', soup.title.name)  # .name returns the tag name
# Getting attributes #
print('soup.p.attrs:', soup.p.attrs)  # .attrs returns all attributes as a dict
print('soup.p.attrs["name"]:', soup.p.attrs['name'])  # get one specific attribute
print('soup.p["name"]:', soup.p['name'])  # shorthand form
# Getting content #
print('soup.p.string:', soup.p.string)
# Nested selection #
print('soup.head.title:', soup.head.title)
Result:
Example 3: Child Nodes
from bs4 import BeautifulSoup

html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!--Elsie--></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...<p>
"""
soup = BeautifulSoup(html, 'lxml')
for i, content in enumerate(soup.p.contents):
    print(i, content)
for i, child in enumerate(soup.p.children):
    print(i, child)
# Iterating over the two produces identical output.
print('contents:', soup.p.contents)
print('children:', soup.p.children)
print('type of contents:', type(soup.p.contents))
print('type of children:', type(soup.p.children))
Result:
Note: both the contents attribute and the children attribute return the direct child nodes, but mind the difference between them: contents is a list, while children is an iterator.
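That difference can be seen directly (a small sketch, using the stdlib 'html.parser' backend for portability): contents is a real list you can index and reuse, while children is a one-shot iterator.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p>one<b>two</b>three</p>", 'html.parser')

kids_list = soup.p.contents   # a list: indexable, reusable
kids_iter = soup.p.children   # an iterator: consumed once

print(kids_list[1])           # direct indexing works: <b>two</b>
first_pass = list(kids_iter)
second_pass = list(kids_iter) # already exhausted: []
print(len(first_pass), len(second_pass))
```

If you need to walk the children more than once, use contents or re-read the children attribute.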
Example 4: Descendant Nodes
from bs4 import BeautifulSoup
html ="""
<html><head><title>The Dormouse's story</title></head>
<body>
<p class = "story">Once upon a time there were three little sisters;and
their names were
<a href = "http://example.com/elsie" class = "sister" id = "link1"><!--Elsie--></a>,
<a href = "http://example.com/lacie" class = "sister" id = "link2">Lacie</a> and
<a href = "http://example.com/tillie" class = "sister" id = "link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class = "story">...<p>
"""
soup = BeautifulSoup(html, 'lxml')
for i, content in enumerate(soup.p.descendants):
    print(i, content)
print('type:', type(soup.p.descendants))
Result:
Example 5: Parent and Ancestor Nodes
from bs4 import BeautifulSoup
html ="""
<html><head><title>The Dormouse's story</title></head>
<body>
<p class = "story">Once upon a time there were three little sisters;and
their names were
<a href = "http://example.com/elsie" class = "sister" id = "link1"><!--Elsie--></a>,
and they lived at the bottom of a well.</p>
<p class = "story">...<p>
"""
soup = BeautifulSoup(html,'lxml')
print(list(enumerate(soup.p.parents)))
print(soup.p.parent)
print(type(soup.p.parents))
print(type(soup.p.children))
print(type(soup.p.contents))
print(type(soup.p.descendants))
Result:
Note 1:
The parent attribute returns the direct parent node, while the parents attribute returns all ancestor nodes.
Note 2:
Distinguish the types these attributes return: contents is a list, children is an iterator, and parents and descendants are generators.
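One way to make the parents chain concrete is to collect the tag names along it (a sketch, again with the stdlib 'html.parser' backend; the root of the tree reports the special name '[document]'):

```python
from bs4 import BeautifulSoup

html = "<html><body><p><a href='#'>link</a></p></body></html>"
soup = BeautifulSoup(html, 'html.parser')

# Walk from the <a> node up to the root, collecting tag names.
ancestors = [node.name for node in soup.a.parents]
print(ancestors)   # ['p', 'body', 'html', '[document]']
```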
Example 6: Sibling Selectors
from bs4 import BeautifulSoup
html ="""
<html><head><title>The Dormouse's story</title></head>
<body>
<p>hahaha</p>
Learning
<a>C++</a>
HELLO
<a>Java</a>
World
<a>Python</a>
<a>JS</a>
<p class = "story">...<p>
"""
soup = BeautifulSoup(html,'lxml')
print('Next Sibling:', soup.a.next_sibling)
print('Prev Sibling:', soup.a.previous_sibling)
print('Next Siblings:', list(enumerate(soup.a.next_siblings)))
print('Prev Siblings:', list(enumerate(soup.a.previous_siblings)))
Result:
Example 7: Method Selectors
The find_all() signature is as follows: find_all(name, attrs, recursive, text, **kwargs)
from bs4 import BeautifulSoup
html ="""
<div class = "C1">
<div class = "C2">
<h1>Hello</h1>
</div>
<div class = "C3">
<ul class = "U1" id = "list1">
<li class = "element">C++</li>
<li class = "element">Java</li>
</ul>
<ul class ="U2" id = "list2">
<li class = "element">Python</li>
</ul>
</div>
</div>
"""
soup = BeautifulSoup(html,'lxml')
print(soup.find_all(name='ul'))  # query elements by tag name
print()
print(soup.find_all(attrs={'class': 'element'}))  # query by attrs; the argument is a dict
print()
print(soup.find_all(id='list2'))  # an alternative keyword-argument form
print()
print(soup.find_all(class_='element'))  # class is a Python keyword, so remember the trailing underscore
Result:
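Two further find_all() parameters from the signature are worth a quick sketch (stdlib 'html.parser' backend; in recent Beautiful Soup versions the text argument is also accepted under the name string): recursive controls whether the search descends below direct children, and text matches node text by string or regex.

```python
import re
from bs4 import BeautifulSoup

html = "<div><ul><li>C++</li><li>Java</li></ul></div>"
soup = BeautifulSoup(html, 'html.parser')

# recursive=False searches only direct children: starting from <div>,
# the <li> nodes are grandchildren, so nothing is found.
print(soup.div.find_all('li', recursive=False))   # []
print(len(soup.div.find_all('li')))               # 2

# text matches against node text; it accepts a string or a compiled regex.
print(soup.find_all(text=re.compile('Ja')))       # ['Java']
```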
Supplement: in addition to find_all(), there is also a find() method, which returns only the first matching element.
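A quick sketch of find() versus find_all() (stdlib 'html.parser' backend): find() returns a single Tag, or None when nothing matches, so there is no list to index.

```python
from bs4 import BeautifulSoup

html = "<ul><li class='element'>C++</li><li class='element'>Java</li></ul>"
soup = BeautifulSoup(html, 'html.parser')

first = soup.find('li')   # a single Tag, not a list
print(first.string)       # C++
print(soup.find('span'))  # None when there is no match
```

Checking for None before using the result is the usual pattern with find().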
Example 8: CSS Selectors
from bs4 import BeautifulSoup
html ="""
<div class = "C1">
<div class = "C2">
<h1>Hello</h1>
</div>
<div class = "C3">
<ul class = "U1" id = "list1">
<li class = "element">C++</li>
<li class = "element">Java</li>
</ul>
<ul class ="U2" id = "list2">
<li class = "element">Python</li>
</ul>
</div>
</div>
"""
soup = BeautifulSoup(html,'lxml')
print(soup.select('.U1'))  # select by CSS class
print()
print(soup.select('ul li')[0])  # descendant selection
print()
print(soup.select('ul li')[0].attrs['class'])  # get an attribute
print()
print(soup.select('li')[2].string)  # get the text
print()
print(soup.select('li')[2].get_text())  # another way to get the text
Result:
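As a closing note, select() also works on Tag objects, not just on the top-level soup, so selections can be chained (a sketch in the spirit of Example 8, stdlib 'html.parser' backend):

```python
from bs4 import BeautifulSoup

html = """
<ul id="list1"><li>C++</li><li>Java</li></ul>
<ul id="list2"><li>Python</li></ul>
"""
soup = BeautifulSoup(html, 'html.parser')

# First narrow down to one <ul>, then select within it.
for ul in soup.select('ul'):
    names = [li.string for li in ul.select('li')]
    print(ul['id'], names)
```

This avoids writing ever-longer selector strings when drilling into nested structures.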