
The BeautifulSoup select method


from bs4 import BeautifulSoup

html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""

# Parse the sample document; the examples below query this soup object.
soup = BeautifulSoup(html, 'lxml')

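When writing CSS, a tag name is written without any decoration, a class name is prefixed with a dot, and an id is prefixed with #. The same notation can be used here to filter elements. The method is soup.select(), and it returns a list.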
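As a minimal sketch of the three notations just described, the snippet below runs one selector of each kind against a tiny document. The markup and the use of Python's built-in "html.parser" are assumptions made for illustration only.

```python
from bs4 import BeautifulSoup

# Hypothetical markup used only for this illustration.
doc = '<div><p class="intro" id="first">Hello</p><p>World</p></div>'

# "html.parser" ships with Python; "lxml" also works if it is installed.
soup = BeautifulSoup(doc, "html.parser")

by_tag = soup.select("p")         # bare tag name
by_class = soup.select(".intro")  # class name, prefixed with a dot
by_id = soup.select("#first")     # id, prefixed with '#'

# select() returns a list-like result, even when only one element matches.
print(isinstance(by_tag, list), len(by_tag))
print([tag.get_text() for tag in by_class])
print(by_id[0]["class"])
```

Because the result is always list-like, a single match still has to be indexed (by_id[0]) before its attributes can be read.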
(1) Finding by tag name

print(soup.select('title'))
# [<title>The Dormouse's story</title>]

print(soup.select('a'))
# [<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

print(soup.select('b'))
# [<b>The Dormouse's story</b>]

(2) Finding by class name

print(soup.select('.sister'))
# [<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

(3) Finding by id

print(soup.select('#link1'))
# [<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>]

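(4) Combined lookup

Combined lookup follows the same principle as combining tag names, class names and ids in a CSS file. For example, to find the element whose id is link1 inside a p tag, write the two parts separated by a space.

print(soup.select('p #link1'))
# [<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>]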

Direct child lookup

print(soup.select("head > title"))
# [<title>The Dormouse's story</title>]
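The difference between the space (any descendant) and > (direct child only) is easiest to see side by side. The following is a small sketch on hypothetical markup, parsed with the built-in "html.parser" as an assumption for portability.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the <a> is nested two levels below <div>.
doc = '<div id="box"><p><a id="link1" href="/x">x</a></p></div>'
soup = BeautifulSoup(doc, "html.parser")

# A space matches descendants at any depth.
descendants = soup.select("div a")

# '>' matches direct children only, so this finds nothing here.
direct_children = soup.select("div > a")

# Chaining '>' through the intermediate <p> matches again.
chained = soup.select("div > p > a")

print(len(descendants), len(direct_children), len(chained))
```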

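(5) Attribute lookup

A selector can also include an attribute, which must be enclosed in square brackets. Note that the attribute and the tag belong to the same node, so there must be no space between them; otherwise nothing will match.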

print(soup.select('a[href="http://example.com/elsie"]'))
# [<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>]
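To illustrate why the space matters, here is a short sketch that runs the same attribute selector with and without a space. The one-line document and the "html.parser" choice are assumptions for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical one-link document.
doc = '<p><a href="http://example.com/elsie" id="link1">Elsie</a></p>'
soup = BeautifulSoup(doc, "html.parser")

# No space: the attribute test applies to the <a> tag itself.
same_node = soup.select('a[href="http://example.com/elsie"]')

# With a space: this asks for a descendant of <a> that carries the
# attribute, and the <a> here contains only text, so nothing matches.
with_space = soup.select('a [href="http://example.com/elsie"]')

print(len(same_node), len(with_space))
```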

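Likewise, attribute selectors can be combined with the lookups above: parts that refer to different nodes are separated by a space, and parts that refer to the same node are written without one.

print(soup.select('p a[href="http://example.com/elsie"]'))
# [<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>]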
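The same-node versus different-node rule can be sketched as follows. The two-link document is a hypothetical reduction of the sample above, and "html.parser" is an assumed parser choice.

```python
from bs4 import BeautifulSoup

# Hypothetical reduction of the sample document to two links.
doc = (
    '<p class="story">'
    '<a class="sister" id="link1" href="http://example.com/elsie">Elsie</a>'
    '<a class="sister" id="link2" href="http://example.com/lacie">Lacie</a>'
    '</p>'
)
soup = BeautifulSoup(doc, "html.parser")

# Tag, class, id and attribute all describe the same <a>, so no spaces.
same_node = soup.select('a.sister#link2[href="http://example.com/lacie"]')

# <p> and <a> are different nodes, so a space separates them.
nested = soup.select("p.story a.sister")

print([a["id"] for a in same_node])
print([a["id"] for a in nested])
```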


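Example code:
from bs4 import BeautifulSoup
import requests

# Start page for the Hangzhou area on 58.com (58同城)
start_url = 'http://hz.58.com/sale.shtml'
url_host = 'http://hz.58.com'

def get_index_url(url):
    # Fetch the page and parse it with the lxml parser
    wb_data = requests.get(url)
    soup = BeautifulSoup(wb_data.text, 'lxml')
    # Anchors inside the category menu
    links = soup.select('ul.ym-mainmnu > li > span > a')
    print(links)
    for link in links:
        # link.get('href') returns None when the anchor has no href attribute
        page_url = url_host + str(link.get('href'))
        print(page_url)

get_index_url(start_url)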

Output:

C:\Users\licl11092\AppData\Local\Programs\Python\Python35\python.exe D:/Spider/58spider/channel_extact.py
[<a href="/shouji/">手機</a>, <a href="/tongxunyw/">通訊</a>, <a href="/danche/">摩托車</a>, <a href="/diandongche/">電動車</a>, <a href="/diannao/">電腦</a>, <a href="/shuma/">數碼</a>, <a href="/jiadian/">家電</a>, <a href="/ershoujiaju/">家具</a>, <a href="/yingyou/">母嬰玩具</a>, <a href="/fushi/">服裝箱包</a>, <a href="/meirong/">美容保健</a>, <a href="/yishu/">藝術收藏</a>, <a href="/tushu/">圖書音像</a>, <a href="/wenti/">文體戶外</a>, <a href="/bangong/">辦公設備</a>, <a href="/shebei.shtml">二手設備</a>, <a href="/chengren/" onclick="clickLog('from=pc_index_loucengdb_ershoujiaoyi_gongcheng')">成人用品</a>, <a>更多分類</a>]
http://hz.58.com/shouji/
http://hz.58.com/tongxunyw/
http://hz.58.com/danche/
http://hz.58.com/diandongche/
http://hz.58.com/diannao/
http://hz.58.com/shuma/
http://hz.58.com/jiadian/
http://hz.58.com/ershoujiaju/
http://hz.58.com/yingyou/
http://hz.58.com/fushi/
http://hz.58.com/meirong/
http://hz.58.com/yishu/
http://hz.58.com/tushu/
http://hz.58.com/wenti/
http://hz.58.com/bangong/
http://hz.58.com/shebei.shtml
http://hz.58.com/chengren/
http://hz.58.comNone

Process finished with exit code 0
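The last URL in the output, http://hz.58.comNone, comes from the final anchor (<a>更多分類</a>), which has no href attribute, so link.get('href') returns None. One possible way to avoid this, sketched below on a hypothetical offline fragment rather than the live page, is to select only anchors that carry an href by using the a[href] selector, and to build the URLs with urllib.parse.urljoin. The fragment and the "html.parser" choice are assumptions; the real page structure may differ.

```python
from urllib.parse import urljoin
from bs4 import BeautifulSoup

url_host = "http://hz.58.com"

# Hypothetical fragment imitating the menu structure; the real page may differ.
doc = """
<ul class="ym-mainmnu">
  <li><span><a href="/shouji/">手機</a></span></li>
  <li><span><a href="/shebei.shtml">二手設備</a></span></li>
  <li><span><a>更多分類</a></span></li>
</ul>
"""
soup = BeautifulSoup(doc, "html.parser")

# [href] keeps only anchors that actually have an href attribute.
links = soup.select("ul.ym-mainmnu > li > span > a[href]")
page_urls = [urljoin(url_host, link["href"]) for link in links]

for page_url in page_urls:
    print(page_url)
```

Filtering in the selector keeps the loop body free of None checks; testing link.get('href') inside the loop would be an equally valid choice.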
