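Learning Python Web Scraping: Day 12
阿新 • Published: 2019-01-24
Today I learned how to use BeautifulSoup functions to retrieve specific nodes, and how to start from the current node and work my way to its children, descendants, siblings, and parent.
Exercise 1: use findAll to extract only the text inside <span class="green"></span> tags
While I was at it, I also extracted the content of the class='red' tags.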
from urllib.request import urlopen
from bs4 import BeautifulSoup

r = urlopen('http://www.pythonscraping.com/pages/warandpeace.html')
# Name the parser explicitly so the result does not depend on what is installed
bsObj = BeautifulSoup(r, 'html.parser')
# Character names are in green spans, spoken lines in red spans
persons = bsObj.findAll('span', {'class': 'green'})
conversations = bsObj.findAll('span', {'class': 'red'})
for name in persons:
    print(name.get_text())
print('\n')
for talks in conversations:
    print(talks.get_text())
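To try the same lookup without hitting the network, here is a minimal sketch that runs against a small inline HTML string. The HTML is made up for illustration and only imitates the structure of the real page. It also shows that the dict form used above and the class_ keyword argument give the same result.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML standing in for warandpeace.html (not the real page content)
html = """
<div>
  <span class="red">Well, Prince, so Genoa and Lucca are now just family estates.</span>
  <span class="green">Anna Pavlovna</span>
  <span class="green">Prince Vasili</span>
</div>
"""

soup = BeautifulSoup(html, 'html.parser')

# Attribute dict, as in the exercise
names_by_dict = [tag.get_text() for tag in soup.find_all('span', {'class': 'green'})]
# Keyword form; class_ has a trailing underscore because class is a Python keyword
names_by_keyword = [tag.get_text() for tag in soup.find_all('span', class_='green')]
quotes = [tag.get_text() for tag in soup.find_all('span', class_='red')]

print(names_by_dict)
print(quotes)
```

find_all is the current spelling of findAll in bs4; both names call the same function.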
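Exercise 2: find HTML elements by matching their content
I already practiced finding HTML elements yesterday, using the find/findAll functions.
The tag and attributes arguments of these two functions are enough to retrieve most tags. We can also use the text argument (as in the example below) to match on the text inside tags. Note that a plain string has to equal the text exactly rather than merely appear in it.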
from urllib.request import urlopen
from bs4 import BeautifulSoup

r = urlopen('http://www.pythonscraping.com/pages/warandpeace.html')
bsObj = BeautifulSoup(r, 'html.parser')
# Returns the text nodes whose content is exactly 'the prince'
test = bsObj.findAll(text='the prince')
# Number of exact matches on the page
print(len(test))
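The difference between an exact match and a "contains" match is easy to miss, so here is a small sketch on made-up inline HTML. It uses string=, which is the newer name for the text argument in bs4, and a compiled regular expression for the substring case.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical HTML for illustration only
html = """
<p><span class="green">the prince</span> spoke first.</p>
<p>Then <span class="green">the prince</span> left, and the princess stayed.</p>
"""

soup = BeautifulSoup(html, 'html.parser')

# A plain string matches only text nodes that equal it exactly
exact = soup.find_all(string='the prince')
# A compiled pattern is applied with search(), so it also matches substrings
partial = soup.find_all(string=re.compile('the prince'))

print(len(exact))
print(len(partial))
```

The regex version also picks up the sentence containing "the princess", which is why the two counts differ. Both calls return text nodes, not tags; use .parent on a result to get the enclosing tag.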
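Exercise 3: children and descendants (note the difference)
A child is a tag exactly one level below its parent, while a descendant is a tag at any level below it. All children are descendants, but not all descendants are children.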
from urllib.request import urlopen
from bs4 import BeautifulSoup

r = urlopen('http://www.pythonscraping.com/pages/page3.html')
bsObj = BeautifulSoup(r, 'html.parser')
# .children yields only the direct children of the table
for child in bsObj.find('table', {'id': 'giftList'}).children:
    print(child)
print('\n')
# .descendants yields every node nested anywhere inside the table
for descendant in bsObj.find('table', {'id': 'giftList'}).descendants:
    print(descendant)
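Printing whole nodes makes the difference hard to see, so the following sketch collects only tag names from a tiny made-up table. The HTML is an assumption that mimics the shape of giftList and is written without whitespace so that no stray text nodes appear.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Hypothetical table imitating the structure of giftList on page3.html
html = ('<table id="giftList">'
        '<tr><th>Item</th><th>Cost</th></tr>'
        '<tr><td>Vegetable Basket</td><td>$15.00</td></tr>'
        '</table>')

soup = BeautifulSoup(html, 'html.parser')
table = soup.find('table', {'id': 'giftList'})

# Keep only Tag objects; both iterators also yield text nodes
child_tags = [node.name for node in table.children if isinstance(node, Tag)]
descendant_tags = [node.name for node in table.descendants if isinstance(node, Tag)]

print(child_tags)
print(descendant_tags)
```

The children list contains only the two rows, while the descendants list also includes the cells nested inside them, in document order.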
Exercise 4: get sibling nodes with next_siblings
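from urllib.request import urlopen
from bs4 import BeautifulSoup

r = urlopen('http://www.pythonscraping.com/pages/page3.html')
bsObj = BeautifulSoup(r, 'html.parser')
# .tr is the first row (the header); next_siblings yields everything after it,
# so the header row itself is not printed
for sibling in bsObj.find('table', {'id': 'giftList'}).tr.next_siblings:
    print(sibling)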
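As a quick check that next_siblings really skips the node it starts from, here is a sketch on a made-up three-row table. The HTML is hypothetical and has no whitespace between rows; on a real page the newlines between rows also show up as sibling text nodes.

```python
from bs4 import BeautifulSoup

# Hypothetical table: one header row followed by two product rows
html = ('<table id="giftList">'
        '<tr><th>Item</th></tr>'
        '<tr><td>Vegetable Basket</td></tr>'
        '<tr><td>Russian Nesting Dolls</td></tr>'
        '</table>')

soup = BeautifulSoup(html, 'html.parser')
header_row = soup.find('table', {'id': 'giftList'}).tr

# Only the rows after the header are returned; the header itself is excluded
rows_after_header = [row.get_text() for row in header_row.next_siblings]

print(rows_after_header)
```

previous_siblings works the same way in the other direction, and next_sibling / previous_sibling return a single node instead of an iterator.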
Exercise 5: work with parent nodes using parent/parents
from urllib.request import urlopen
from bs4 import BeautifulSoup

r = urlopen('http://www.pythonscraping.com/pages/page3.html')
bsObj = BeautifulSoup(r, 'html.parser')
# Go from the image up to its parent cell, then back one cell to the price
money = bsObj.find('img', {'src': '../img/gifts/img1.jpg'}).parent.previous_sibling
print(money.get_text())
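The exercise only uses .parent, so here is a short sketch that also walks .parents. It runs on a made-up single-row table; the HTML is an assumption modeled on the image row of page3.html.

```python
from bs4 import BeautifulSoup

# Hypothetical row: name cell, price cell, then the cell holding the image
html = ('<table id="giftList"><tr>'
        '<td>Vegetable Basket</td>'
        '<td>$15.00</td>'
        '<td><img src="../img/gifts/img1.jpg"></td>'
        '</tr></table>')

soup = BeautifulSoup(html, 'html.parser')
img = soup.find('img', {'src': '../img/gifts/img1.jpg'})

# .parent is the enclosing <td>; its previous sibling is the price cell
price = img.parent.previous_sibling.get_text()
# .parents walks upward through every ancestor, ending at the soup object itself
ancestors = [tag.name for tag in img.parents]

print(price)
print(ancestors)
```

The last entry in ancestors is '[document]', the name bs4 gives to the BeautifulSoup object at the root of the tree.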