XPath Syntax Explained
阿新 • Published: 2018-12-14
This walkthrough uses Python's lxml library to demonstrate XPath.
Install lxml
pip install lxml
Basic XPath usage
Sample HTML
htm = """
<html>
<div>
<ul>
<li class="item-0"><a href="link1.html">first item</a></li>
<li class="item-1"><a href="link2.html">second item</a></li>
<li class="item-inactive"><a href="link3.html">third item</a></li>
<li class="item-1"><a href="link4.html">fourth item</a></li>
<li class="item-0"><a href="link5.html">fifth item</a></li>
<li class="else-1">something else</li>
this is ul item
</ul>
</div>
</html>
"""
Find all xx elements under xxx
from lxml import etree  # some IDEs underline etree in red because they cannot resolve it; this is harmless
htm = """
<html>
<div>
<ul>
<li class="item-0"><a href="link1.html">first item</a></li>
<li class="item-1"><a href="link2.html">second item</a></li>
<li class="item-inactive"><a href="link3.html">third item</a></li>
<li class="item-1"><a href="link4.html">fourth item</a></li>
<li class="item-0"><a href="link5.html">fifth item</a></li>
<li class="else-1">something else</li>
this is ul item
</ul>
</div>
</html>
"""
selector = etree.HTML(htm)  # parse the HTML string into an element tree
all_li = selector.xpath('//div/ul/li')  # // matches at any depth in the document; this selects every li directly under div/ul
for i in all_li:
    print(i)
Output:
<Element li at 0x1a7955a2808>
<Element li at 0x1a7955a27c8>
<Element li at 0x1a7955a28c8>
<Element li at 0x1a7955a2908>
<Element li at 0x1a7955a2948>
<Element li at 0x1a7955a29c8>
The hexadecimal value is a memory address. Each entry is an Element object; to see its content, extend the expression, for example with /a/text() to read the text of the a tag (demonstrated further down).
Find the first xx element under xxx
from lxml import etree  # some IDEs underline etree in red because they cannot resolve it; this is harmless
htm = """
<html>
<div>
<ul>
<li class="item-0"><a href="link1.html">first item</a></li>
<li class="item-1"><a href="link2.html">second item</a></li>
<li class="item-inactive"><a href="link3.html">third item</a></li>
<li class="item-1"><a href="link4.html">fourth item</a></li>
<li class="item-0"><a href="link5.html">fifth item</a></li>
<li class="else-1">something else</li>
this is ul item
</ul>
</div>
</html>
"""
selector = etree.HTML(htm)  # parse the HTML string into an element tree
all_li = selector.xpath('//div/ul/li[1]')  # select the first li; XPath positions start at 1, not 0
print(all_li)
Output:
[<Element li at 0x1d0e2612608>]
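Note:
Without a position index, xpath() returns a list containing every matching element, not just the first one. If nothing matches, the list is empty, so taking [0] from it raises an IndexError.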
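The behaviour described in the note can be checked with a minimal sketch. It uses a shortened copy of the sample HTML, and first_or_none is a hypothetical helper introduced here for illustration, not part of lxml.

```python
from lxml import etree

htm = """
<html><div><ul>
<li class="item-0"><a href="link1.html">first item</a></li>
<li class="item-1"><a href="link2.html">second item</a></li>
</ul></div></html>
"""
selector = etree.HTML(htm)

all_li = selector.xpath('//ul/li')       # no index: every matching li
first_li = selector.xpath('//ul/li[1]')  # position 1: only the first li
missing = selector.xpath('//ul/li[9]')   # no such element: an empty list


def first_or_none(nodes):
    """Hypothetical helper: return the first match, or None if nothing matched."""
    return nodes[0] if nodes else None


print(len(all_li), len(first_li), missing)  # 2 1 []
print(first_or_none(missing))               # None
```

Guarding the lookup this way avoids an IndexError when a page does not contain the expected element.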
Get the text of an xx element
from lxml import etree  # some IDEs underline etree in red because they cannot resolve it; this is harmless
htm = """
<html>
<div>
<ul>
<li class="item-0"><a href="link1.html">first item</a></li>
<li class="item-1"><a href="link2.html">second item</a></li>
<li class="item-inactive"><a href="link3.html">third item</a></li>
<li class="item-1"><a href="link4.html">fourth item</a></li>
<li class="item-0"><a href="link5.html">fifth item</a></li>
<li class="else-1">something else</li>
this is ul item
</ul>
</div>
</html>
"""
selector = etree.HTML(htm)  # parse the HTML string into an element tree
# all_li = selector.xpath('//div/ul/li[1]/a/text()')[0]  # appending [0] returns the first text string directly
all_li = selector.xpath('//div/ul/li[1]/a/text()')  # text() extracts the text inside the a tag
print(all_li)  # or index the list, e.g. all_li[0], to get the string first item
Output:
['first item']
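Tip
If the element you want is unique in the HTML page, the expression does not have to start from the root. A few simple examples:
all_li = selector.xpath('//ul/li[1]/a/text()')[0]  # leaving out div works just as well
all_li = selector.xpath('//*[@class="item-inactive"]/a/text()')[0]  # use the class to get the text of the third li
all_li = selector.xpath('//a[@href="link2.html"]/text()')[0]  # use the href to get the text of the second li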
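As a rough check of the shortcut expressions, the sketch below (on a shortened copy of the sample HTML) compares a full path with two shorter ones, and shows what happens when the chosen attribute value is not unique.

```python
from lxml import etree

htm = """
<html><div><ul>
<li class="item-0"><a href="link1.html">first item</a></li>
<li class="item-1"><a href="link2.html">second item</a></li>
<li class="item-inactive"><a href="link3.html">third item</a></li>
<li class="item-1"><a href="link4.html">fourth item</a></li>
</ul></div></html>
"""
selector = etree.HTML(htm)

full = selector.xpath('//div/ul/li[3]/a/text()')
by_class = selector.xpath('//*[@class="item-inactive"]/a/text()')
by_href = selector.xpath('//a[@href="link3.html"]/text()')
print(full == by_class == by_href)  # True: all three give ['third item']

# A class shared by several li elements matches all of them,
# so taking [0] would silently drop the rest.
shared = selector.xpath('//*[@class="item-1"]/a/text()')
print(shared)  # ['second item', 'fourth item']
```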
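Get the attributes of elements under xxx
Get a single attribute
from lxml import etree  # some IDEs underline etree in red because they cannot resolve it; this is harmless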
htm = """
<html>
<div>
<ul>
<li class="item-0"><a href="link1.html">first item</a></li>
<li class="item-1"><a href="link2.html">second item</a></li>
<li class="item-inactive"><a href="link3.html">third item</a></li>
<li class="item-1"><a href="link4.html">fourth item</a></li>
<li class="item-0"><a href="link5.html">fifth item</a></li>
<li class="else-1">something else</li>
this is ul item
</ul>
</div>
</html>
"""
selector = etree.HTML(htm)  # parse the HTML string into an element tree
all_li = selector.xpath('//li[3]/a/@href')[0]  # get the href attribute of the a tag inside the third li
print(all_li)
Output:
link3.html
Get the class attribute of every li
from lxml import etree  # some IDEs underline etree in red because they cannot resolve it; this is harmless
htm = """
<html>
<div>
<ul>
<li class="item-0"><a href="link1.html">first item</a></li>
<li class="item-1"><a href="link2.html">second item</a></li>
<li class="item-inactive"><a href="link3.html">third item</a></li>
<li class="item-1"><a href="link4.html">fourth item</a></li>
<li class="item-0"><a href="link5.html">fifth item</a></li>
<li class="else-1">something else</li>
this is ul item
</ul>
</div>
</html>
"""
selector = etree.HTML(htm)  # parse the HTML string into an element tree
all_li = selector.xpath('//li/@class')  # get the class attribute of every li
print(all_li)
Output:
['item-0', 'item-1', 'item-inactive', 'item-1', 'item-0', 'else-1']
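Attributes can also be read from the Element objects themselves rather than through @attr in the expression. The sketch below, on a shortened copy of the sample HTML, uses lxml's Element.get() and Element.attrib, which is handy when an attribute and the text of the same element are needed together.

```python
from lxml import etree

htm = """
<html><div><ul>
<li class="item-0"><a href="link1.html">first item</a></li>
<li class="item-1"><a href="link2.html">second item</a></li>
</ul></div></html>
"""
selector = etree.HTML(htm)

# Pair each link's href with its text by reading from the Element itself.
pairs = [(a.get('href'), a.text) for a in selector.xpath('//li/a')]
print(pairs)  # [('link1.html', 'first item'), ('link2.html', 'second item')]

first_li = selector.xpath('//li')[0]
print(dict(first_li.attrib))  # {'class': 'item-0'}
print(first_li.get('id'))     # None, because the attribute is absent
```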
Advanced XPath usage
Find elements whose attribute value starts with xx
The same HTML is used for the demonstration:
htm = """
<html>
<div>
<ul>
<li class="item-0"><a href="link1.html">first item</a></li>
<li class="item-1"><a href="link2.html">second item</a></li>
<li class="item-inactive"><a href="link3.html">third item</a></li>
<li class="item-1"><a href="link4.html">fourth item</a></li>
<li class="item-0"><a href="link5.html">fifth item</a></li>
<li class="else-1">something else</li>
this is ul item
</ul>
</div>
</html>
"""
Using starts-with()
Sample code:
from lxml import etree  # some IDEs underline etree in red because they cannot resolve it; this is harmless
htm = """
<html>
<div>
<ul>
<li class="item-0"><a href="link1.html">first item</a></li>
<li class="item-1"><a href="link2.html">second item</a></li>
<li class="item-inactive"><a href="link3.html">third item</a></li>
<li class="item-1"><a href="link4.html">fourth item</a></li>
<li class="item-0"><a href="link5.html">fifth item</a></li>
<li class="else-1">something else</li>
this is ul item
</ul>
</div>
</html>
"""
selector = etree.HTML(htm)  # parse the HTML string into an element tree
all_li = selector.xpath("//li[starts-with(@class, 'item-')]")  # select every li whose class starts with item-
all_a = []
for i in all_li:
    all_a.append(i.xpath('a/text()')[0])  # run a relative XPath on each li to read the text of its a tag
print(all_a)
Output:
['first item', 'second item', 'third item', 'fourth item', 'fifth item']
It can also be written like this:
from lxml import etree  # some IDEs underline etree in red because they cannot resolve it; this is harmless
htm = """
<html>
<div>
<ul>
<li class="item-0"><a href="link1.html">first item</a></li>
<li class="item-1"><a href="link2.html">second item</a></li>
<li class="item-inactive"><a href="link3.html">third item</a></li>
<li class="item-1"><a href="link4.html">fourth item</a></li>
<li class="item-0"><a href="link5.html">fifth item</a></li>
<li class="else-1">something else</li>
this is ul item
</ul>
</div>
</html>
"""
selector = etree.HTML(htm)  # parse the HTML string into an element tree
all_li = selector.xpath("//li[starts-with(@class, 'item-')]/a/text()")  # text of the a tag in every li whose class starts with item-
print(all_li)
Output:
['first item', 'second item', 'third item', 'fourth item', 'fifth item']
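starts-with() has close relatives in XPath 1.0, the version lxml evaluates. The sketch below, on a shortened copy of the sample HTML, shows contains() for a substring match and not() for inverting a condition. Note that ends-with() belongs to XPath 2.0 and, as far as I know, is not available through lxml's xpath().

```python
from lxml import etree

htm = """
<html><div><ul>
<li class="item-0"><a href="link1.html">first item</a></li>
<li class="item-inactive"><a href="link3.html">third item</a></li>
<li class="else-1">something else</li>
</ul></div></html>
"""
selector = etree.HTML(htm)

# contains(): the class includes the substring anywhere in its value.
with_item = selector.xpath("//li[contains(@class, 'item')]/a/text()")
print(with_item)  # ['first item', 'third item']

# not(): invert the starts-with() test to select the remaining li elements.
others = selector.xpath("//li[not(starts-with(@class, 'item-'))]/text()")
print(others)  # ['something else']
```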
Get all text
Using string()
Sample code:
from lxml import etree  # some IDEs underline etree in red because they cannot resolve it; this is harmless
htm = """
<html>
<div>
<ul>
<li class="item-0"><a href="link1.html">first item</a></li>
<li class="item-1"><a href="link2.html">second item</a></li>
<li class="item-inactive"><a href="link3.html">third item</a></li>
<li class="item-1"><a href="link4.html">fourth item</a></li>
<li class="item-0"><a href="link5.html">fifth item</a></li>
<li class="else-1">something else</li>
this is ul item
</ul>
</div>
</html>
"""
selector = etree.HTML(htm)  # parse the HTML string into an element tree
all_li = selector.xpath("string(//ul)")  # get all the text under ul as a single string
print(all_li)
Output:
first item
second item
third item
fourth item
fifth item
something else
this is ul item
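string() returns a single string that keeps the whitespace of the source markup. When a clean list of text fragments is more convenient, one option is to select the text nodes with //ul//text() and strip each piece in Python. A minimal sketch on a shortened copy of the sample HTML:

```python
from lxml import etree

htm = """
<html><div><ul>
<li class="item-0"><a href="link1.html">first item</a></li>
<li class="else-1">something else</li>
this is ul item
</ul></div></html>
"""
selector = etree.HTML(htm)

raw = selector.xpath("string(//ul)")  # one str, whitespace included
print(repr(raw))

# Select every text node under ul, then drop whitespace-only pieces.
pieces = [t.strip() for t in selector.xpath("//ul//text()") if t.strip()]
print(pieces)  # ['first item', 'something else', 'this is ul item']
```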
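A small example
Fetch the text and link of the 豆瓣讀書 (Douban Books) entry on the Douban home page, then save one image from the home page to a local file.
import requests
from lxml import etree  # some IDEs underline etree in red because they cannot resolve it; this is harmless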
r = requests.get('https://www.douban.com/')
r.encoding = 'utf-8'
html = etree.HTML(r.text)
text = html.xpath('//*[@id="anony-nav"]/div[1]/ul/li[1]/a/@href')[0]
h1 = html.xpath('//*[@id="anony-nav"]/div[1]/ul/li[1]/a/text()')[0]
logs = html.xpath('//*[@id="anony-sns"]/div/div[3]/div/div[1]/ul/li[3]/div/a/img/@src')[0]
print(text)
print(h1)
print(logs)
log = requests.get(logs)
with open('d:/a.gif', 'wb') as file:  # 'wb' opens the file for binary writing
    file.write(log.content)  # save the image bytes
Output:
https://book.douban.com
豆瓣讀書
https://img3.doubanio.com/f/shire/a1fdee122b95748d81cee426d717c05b5174fe96/pics/blank.gif