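Python Web Scraping: Extracting Text with BeautifulSoup in Detail
阿新 • Published: 2019-02-16
Four basic methods for extracting web page content:
2. Using BeautifulSoup (the bs4 module)
Prepare some sample HTML to practice extracting content: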
from bs4 import BeautifulSoup

# Sample HTML used to practice extracting content
html = '''
<html>
<head><title>The Dormouse's story</title></head>
<body>
<h1><b>123456</b></h1>
<p class="title" name="dromouse">
    <b>The Dormouse's story</b>
    aaaaa
</p>
<p class="title" name="dromouse" title='new'><b>The Dormouse's story</b>a</p>
<p class="story">Once upon a time there were three little sisters; and their names were
    <a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
    <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
    <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
    <a href="http://example.com/tillie" class="siterr" id="link4">Tillie</a>;
    <a href="http://example.com/tillie" class="siterr" id="link5">Tillie</a>;
    and they lived at the bottom of a well.
</p>
<p class="story">...</p>
<ul id="ulone">
    <li>01</li>
    <li>02</li>
    <li>03</li>
    <li>04</li>
    <li>05</li>
</ul>
<div class='div11'>
    <ul id="ultwo">
        <li>0001</li>
        <li>0002</li>
        <li>0003</li>
        <li>0004</li>
        <li>0005</li>
    </ul>
</div>
</body>
</html>
'''
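1. Create the BeautifulSoup object
soup = BeautifulSoup(html, 'html.parser')  # choose the parser
2. Extract content
(1) Get the title tag object
print(soup.title)
Get the title text as a string:
print(soup.title.string)  # a NavigableString; returns None when the tag has more than one child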
print(soup.title.text)
print(soup.title.get_text())
title = soup.find('title').get_text()
print(title)
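The calls above assume the page actually has a title. As a minimal sketch, the helper below (the name title_text is hypothetical, not part of bs4) guards against a missing title, since soup.title is None in that case and calling get_text() on it would raise an AttributeError.

```python
from bs4 import BeautifulSoup


def title_text(markup):
    """Return the <title> text, or None when the document has no <title>."""
    soup = BeautifulSoup(markup, "html.parser")
    if soup.title is None:
        # soup.title is None when no <title> tag exists
        return None
    # strip=True trims surrounding whitespace from the text
    return soup.title.get_text(strip=True)


with_title = title_text("<html><head><title> The Dormouse's story </title></head></html>")
without_title = title_text("<html><body><p>no title here</p></body></html>")
print(with_title)
print(without_title)
```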
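Navigate via parent/child relationships
print(soup.title.parent)
print(soup.title.contents)  # list of direct children; Tag has no .child attribute (soup.title.child looks for a <child> tag and returns None)
print(list(soup.title.children))  # .children is an iterator, so wrap it in list() to print the items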
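To make the navigation attributes concrete, here is a small sketch on a cut-down document (the markup is a shortened stand-in for the sample above). It also shows why an attribute such as .child quietly yields None: bs4 treats unknown attribute names as a search for a tag with that name.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(
    "<html><head><title>The Dormouse's story</title></head></html>",
    "html.parser",
)
title = soup.title

parent_name = title.parent.name      # the enclosing tag is <head>
children = list(title.children)      # one text node
missing = title.child                # no <child> tag exists, so this is None

print(parent_name, children, missing)
```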
(2) Get the first p tag
print(soup.p.get_text())
print(soup.find('p').text)
# Get the children of p (whitespace-only text nodes also count as children)
print(soup.p.children)  # an iterator object
for i, child in enumerate(soup.p.children):
    print(i, child)
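Because whitespace between tags shows up as text nodes, it is often convenient to keep only the real tags. The following sketch, which assumes the first p of the sample is laid out over several lines, filters the children with isinstance and bs4's Tag class.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

snippet = """<p class="title">
<b>The Dormouse's story</b>
aaaaa
</p>"""
soup = BeautifulSoup(snippet, "html.parser")

all_children = list(soup.p.children)
# Keep only element nodes, dropping text nodes such as "\n"
tag_children = [c for c in all_children if isinstance(c, Tag)]

print(len(all_children))                 # text, <b>, text
print([t.name for t in tag_children])
```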
(3) Get a tag's attributes
# soup.a only finds the first a tag
print('1',soup.a)
print('2',soup.a.name)
# Read attributes through .attrs
print(soup.a.attrs)
print(soup.a.attrs['href'])
print(soup.a.attrs['id'])
print(soup.a.attrs['class'][0])  # class comes back as a list, so index into it
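Indexing attrs with a key that is absent raises a KeyError. A short sketch of the more forgiving alternatives, using a single link copied from the sample: a tag can be indexed directly like a dictionary, and Tag.get returns None for a missing attribute.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(
    '<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>',
    "html.parser",
)
a = soup.a

href = a["href"]            # same as a.attrs["href"]
link_id = a.get("id")
classes = a.get("class")    # multi-valued attribute, returned as a list
missing = a.get("title")    # None instead of a KeyError

print(href, link_id, classes, missing)
```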
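(4) Get multiple tags
print(soup.find('p'))  # only the first match
print(soup.find_all('p'))  # every p tag in soup, returned as a list
(5) Nested lookups
find_all returns a list, so index into it to reach the element you want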
print(soup.find_all('ul'))
print(soup.find_all('ul')[0].find_all('li'))
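Printing the nested result shows whole li tags. If only the text is wanted, one possible approach is a list comprehension over the inner find_all, sketched here on a shortened copy of the two lists from the sample.

```python
from bs4 import BeautifulSoup

doc = """<ul id="ulone"><li>01</li><li>02</li></ul>
<div><ul id="ultwo"><li>0001</li><li>0002</li></ul></div>"""
soup = BeautifulSoup(doc, "html.parser")

lists = soup.find_all("ul")
# Text of every li inside the first ul
first_items = [li.get_text() for li in lists[0].find_all("li")]
# Map each ul's id to the text of its li children
by_id = {ul["id"]: [li.get_text() for li in ul.find_all("li")] for ul in lists}

print(first_items)
print(by_id)
```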
(6) Get objects by attribute
print(soup.find(id='ulone'))  # a single object
print(soup.find('ul',id='ulone'))
print(soup.find_all('ul',id='ulone'))  # a list, so it can be indexed
class is a Python keyword, so write class_ instead
print('class1',soup.find_all('p',class_='title'))
print('class2',soup.find_all('p',attrs={'class':'title'}))  # more general
print('class3',soup.find_all('p',attrs={'class':'title','title':'new'}))  # multiple conditions
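The attrs dictionary is also the way to filter on an attribute literally called name, because name is already the first positional parameter of find_all. A small sketch on three p tags modeled on the sample (their text content A, B, C is made up for illustration):

```python
from bs4 import BeautifulSoup

doc = """<p class="title" name="dromouse">A</p>
<p class="title" name="dromouse" title="new">B</p>
<p class="story">C</p>"""
soup = BeautifulSoup(doc, "html.parser")


def texts(tags):
    """Helper for this example only: text of each tag."""
    return [t.get_text() for t in tags]


by_kw = texts(soup.find_all("p", class_="title"))
by_attrs = texts(soup.find_all("p", attrs={"class": "title"}))
both = texts(soup.find_all("p", attrs={"class": "title", "title": "new"}))
by_name = texts(soup.find_all(attrs={"name": "dromouse"}))

print(by_kw, by_attrs, both, by_name)
```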
Pass a function as the argument to select elements
def judgeTitle1(t):
    # t is the class value of each tag (None when the tag has no class)
    return t == 'title'
print(soup.find_all(class_=judgeTitle1))
Filter by length
import re  # regular expressions
reg = re.compile("sis")
def judgeTitle2(t):
    # True when the class value t has length 6 and contains 'sis'
    return t is not None and len(t) == 6 and reg.search(t) is not None
print(soup.find_all(class_=judgeTitle2))
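A filter function may be handed None for tags that lack the attribute, so an explicit None check keeps it from failing. The sketch below uses three made-up links (link6 has no class at all) and a hypothetical function name, six_chars_with_sis, to show the guard in action.

```python
import re

from bs4 import BeautifulSoup

doc = (
    '<a class="sister" id="link1">x</a>'
    '<a class="siterr" id="link4">y</a>'
    '<a id="link6">z</a>'
)
soup = BeautifulSoup(doc, "html.parser")
pattern = re.compile("sis")


def six_chars_with_sis(css_class):
    """True for a class value of length 6 that contains 'sis'."""
    return (
        css_class is not None
        and len(css_class) == 6
        and pattern.search(css_class) is not None
    )


ids = [tag["id"] for tag in soup.find_all(class_=six_chars_with_sis)]
print(ids)
```

Here "siterr" has six characters but no "sis", and the third link has no class, so only the first link should match.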
# Get the text content
# <p class="title" name="dromouse">
# <b>The Dormouse's story</b>
# aaaaa
# </p>
print(soup.find('p').text)
print(soup.find('p').string)  # None: this p has several children (text plus <b>), so .string has no single value
print(soup.find('p').get_text())
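The difference between .string and get_text() is easiest to see side by side. This sketch reuses the first p from the sample (laid out as in the comment above) and also shows the separator and strip arguments of get_text, which tidy up the surrounding newlines.

```python
from bs4 import BeautifulSoup

snippet = """<p class="title" name="dromouse">
<b>The Dormouse's story</b>
aaaaa
</p>"""
soup = BeautifulSoup(snippet, "html.parser")
p = soup.find("p")

whole = p.string                      # None: p has more than one child
inner = p.b.string                    # <b> has exactly one text child
clean = p.get_text(" ", strip=True)   # strip each piece, join with a space

print(whole)
print(inner)
print(clean)
```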
find_all accepts a limit argument that caps the number of results returned
print(soup.find_all('a',limit=2))
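As a quick illustration of limit, the sketch below uses three made-up links. It also checks that find behaves like find_all with limit=1, except that find returns the tag itself rather than a one-element list.

```python
from bs4 import BeautifulSoup

doc = '<a id="link1">1</a><a id="link2">2</a><a id="link3">3</a>'
soup = BeautifulSoup(doc, "html.parser")

# Searching stops as soon as two matches are found
first_two = [a["id"] for a in soup.find_all("a", limit=2)]
first = soup.find("a")
first_via_limit = soup.find_all("a", limit=1)[0]

print(first_two)
print(first is first_via_limit)
```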
recursive=True (the default) searches all descendants; recursive=False searches only direct children
print(soup.find_all('body')[0].find_all('ul', recursive=False))
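To see the effect of recursive, compare both settings on a cut-down body that keeps the same nesting as the sample: ulone sits directly under body, while ultwo is wrapped in a div.

```python
from bs4 import BeautifulSoup

doc = (
    '<body><ul id="ulone"><li>01</li></ul>'
    '<div class="div11"><ul id="ultwo"><li>0001</li></ul></div></body>'
)
soup = BeautifulSoup(doc, "html.parser")
body = soup.body

# Default: every ul anywhere below body
all_ids = [ul["id"] for ul in body.find_all("ul")]
# recursive=False: only ul tags that are direct children of body
direct_ids = [ul["id"] for ul in body.find_all("ul", recursive=False)]

print(all_ids)
print(direct_ids)
```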