
XPath Usage Summary


Reference: https://cuiqingcai.com/5545.html

XPath

  • XML Path Language
  • Finds information in XML documents; it works equally well on HTML documents
  • Locates information by means of path expressions

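Common XPath rules

  • nodename: selects the child elements of the current node that have this name
  • /: selects direct children of the current node
  • //: selects descendants of the current node, at any depth
  • .: selects the current node
  • ..: selects the parent of the current node
  • @: selects attributes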
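The rules above can be tried one at a time on a small document. The following is a minimal sketch using lxml; the HTML fragment and the variable names are made up for this illustration and are not part of the original example.

```python
from lxml import etree

# Hypothetical fragment, used only to illustrate the rules.
html = ('<div><ul>'
        '<li class="a"><a href="x.html">X</a></li>'
        '<li class="b"><a href="y.html">Y</a></li>'
        '</ul></div>')
root = etree.HTML(html)  # lxml wraps the fragment in <html><body>

ul = root.xpath('//ul')[0]      # //       : any ul in the document
items = ul.xpath('li')          # nodename : child elements named li
links = ul.xpath('./li/a')      # . and /  : start at ul, step through direct children
parent = ul.xpath('..')[0]      # ..       : parent of ul
hrefs = ul.xpath('.//a/@href')  # @        : attribute values

print([el.tag for el in items])
print(parent.tag)
print(hrefs)
```

Calling xpath() on an element rather than on the whole tree makes that element the "current node" that ., .. and a bare nodename refer to.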

text = '''
  <div>
    <ul>
         <li class="item-0"><a href="link1.html">first item</a></li>
         <li class="item-1"><a href="link2.html">second item</a></li>
         <li class="item-inactive"><a href="link3.html">third item</a></li>
         <li class="item-1"><a href="link4.html">fourth item</a></li>
         <li class="item-0"><a href="link5.html">fifth item</a>
     </ul>
 </div>
'''

Selecting all nodes

from lxml import etree

selector = etree.HTML(text)
# '//*' matches every element in the document
result = selector.xpath('//*')
print(result)

Output

[<Element html at 0x1761bfd5508>, <Element body at 0x1761bfd5a88>, <Element div at 0x1761bfd5ac8>, <Element ul at 0x1761bfd5b08>, <Element li at 0x1761bfd5e88>, <Element a at 0x1761bfd5f08>, <Element li at 0x1761bfd5f48>, <Element a at 0x1761bfd5f88>, <Element li at 0x1761bfd5fc8>, <Element a at 0x1761bfd5ec8>, <Element li at 0x1761bfdb048>, <Element a at 0x1761bfdb088>, <Element li at 0x1761bfdb0c8>, <Element a at 0x1761bfdb108>]
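Element objects print as memory addresses, which makes the result hard to read. As a small sketch (the shortened HTML string below is an assumption made for brevity), the tag names can be listed instead, and the wildcard can be replaced by a node name to select only one kind of element:

```python
from lxml import etree

# Shortened, hypothetical version of the sample document.
html = '''
<div>
    <ul>
        <li class="item-0"><a href="link1.html">first item</a></li>
        <li class="item-1"><a href="link2.html">second item</a></li>
    </ul>
</div>
'''
selector = etree.HTML(html)

# Tag names of every element, in document order.
all_tags = [el.tag for el in selector.xpath('//*')]
# Only the li elements.
li_nodes = selector.xpath('//li')

print(all_tags)
print(len(li_nodes))
```

The html and body elements appear in the list because etree.HTML adds them around the fragment.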

Child nodes

from lxml import etree

selector = etree.HTML(text)
# '/' steps from each li to its direct a children
result = selector.xpath('//li/a')
print(result)

Output

[<Element a at 0x1761c02dec8>, <Element a at 0x1761c02de88>, <Element a at 0x1761c02df08>, <Element a at 0x1761c02df48>, <Element a at 0x1761c02df88>]
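The difference between / and // shows up as soon as the target is more than one level down. A minimal sketch, using a shortened copy of the sample document (an assumption for brevity):

```python
from lxml import etree

html = '''
<div>
    <ul>
        <li class="item-0"><a href="link1.html">first item</a></li>
        <li class="item-1"><a href="link2.html">second item</a></li>
    </ul>
</div>
'''
selector = etree.HTML(html)

# a is a grandchild of ul, so a direct-child step finds nothing.
direct = selector.xpath('//ul/a')
# A descendant step reaches a at any depth below ul.
nested = selector.xpath('//ul//a')

print(direct)
print(len(nested))
```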

Parent nodes

from lxml import etree

selector = etree.HTML(text)
# '..' steps from each li up to its parent
result = selector.xpath('//li/..')
print(result)

Output

[<Element ul at 0x1761ae7c288>]
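The parent step is often combined with an attribute step, for example to read the class of the li that contains a particular link. The sketch below assumes a shortened copy of the sample document; parent::* is the axis form of the .. abbreviation.

```python
from lxml import etree

html = '''
<div>
    <ul>
        <li class="item-0"><a href="link1.html">first item</a></li>
        <li class="item-1"><a href="link2.html">second item</a></li>
    </ul>
</div>
'''
selector = etree.HTML(html)

# Abbreviated form: step up with '..', then read the class attribute.
via_dots = selector.xpath('//a[@href="link2.html"]/../@class')
# Equivalent axis form.
via_axis = selector.xpath('//a[@href="link2.html"]/parent::*/@class')

print(via_dots)
print(via_axis)
```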

Attribute matching

from lxml import etree

selector = etree.HTML(text)
# The predicate keeps only li elements whose class is exactly "item-0"
result = selector.xpath('//li[@class="item-0"]')
print(result)

Output

[<Element li at 0x1761afe2dc8>, <Element li at 0x1761c067748>]

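Note: the value in [@class="item-0"] must be quoted. Using double quotes inside the expression lets the expression itself be written as a single-quoted Python string.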
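XPath itself accepts either quote style around a string literal; what matters is that the quotes inside the expression differ from the ones delimiting the Python string. lxml also accepts XPath variables passed as keyword arguments, which avoids the quoting question altogether. A short sketch, with a shortened sample document and a variable name (cls) chosen for this illustration:

```python
from lxml import etree

html = '''
<ul>
    <li class="item-0"><a href="link1.html">first item</a></li>
    <li class="item-1"><a href="link2.html">second item</a></li>
</ul>
'''
selector = etree.HTML(html)

# Double quotes inside a single-quoted Python string.
a = selector.xpath('//li[@class="item-0"]')
# Single quotes inside a double-quoted Python string.
b = selector.xpath("//li[@class='item-0']")
# XPath variable $cls, bound through a keyword argument.
c = selector.xpath('//li[@class=$cls]', cls='item-0')

print(len(a), len(b), len(c))
```

The variable form is also the safer choice when the value comes from user input, since it is never spliced into the expression text.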

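Getting text

from lxml import etree

selector = etree.HTML(text)
# Text nodes directly under the matching li elements
result1 = selector.xpath('//li[@class="item-0"]/text()')
# Text nodes of the a element inside each matching li
result2 = selector.xpath('//li[@class="item-0"]/a/text()')
print(result1)
print(result2)

Output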

['\n     ']
['first item', 'fifth item']

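Note: //li[@class="item-0"]/text() returns ['\n     '] because "/" selects only direct children. The link text belongs to the a element, not to the li, so the only text node found directly under a matching li is the whitespace that follows the a element in the last, unclosed li.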
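When the nesting depth is not known in advance, //text() collects text nodes at any depth, and the XPath string() function returns the concatenated text of the first matching node. A minimal sketch on a shortened, well-formed copy of the sample document (an assumption for brevity):

```python
from lxml import etree

html = '''
<ul>
    <li class="item-0"><a href="link1.html">first item</a></li>
    <li class="item-1"><a href="link2.html">second item</a></li>
</ul>
'''
selector = etree.HTML(html)

# No text sits directly under the li, so the direct-child step is empty.
direct = selector.xpath('//li[@class="item-0"]/text()')
# The descendant step reaches the text inside the a element.
nested = selector.xpath('//li[@class="item-0"]//text()')
# string() returns one str instead of a list.
joined = selector.xpath('string(//li[@class="item-0"])')

print(direct)
print(nested)
print(joined)
```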

Getting attributes

from lxml import etree

selector = etree.HTML(text)
# '@href' returns the attribute values rather than the elements
result = selector.xpath('//li[@class="item-0"]/a/@href')
print(result)

Output

['link1.html', 'link5.html']
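If both the attribute and the text of each link are needed, one option is to select the elements once and read both values through the element API (get() and .text) instead of running two separate queries. A sketch under the assumption of a shortened sample document:

```python
from lxml import etree

html = '''
<ul>
    <li class="item-0"><a href="link1.html">first item</a></li>
    <li class="item-1"><a href="link2.html">second item</a></li>
</ul>
'''
selector = etree.HTML(html)

# One query for the elements, then plain attribute and text access.
links = [(a.get('href'), a.text) for a in selector.xpath('//li/a')]

print(links)
```

This keeps each href paired with its own text, which two independent result lists do not guarantee when some links lack one of the two.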

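Matching multi-valued attributes

from lxml import etree

text1 = '''
<li class="li li-first"><a href="link.html">first item</a></li>
'''

selector = etree.HTML(text1)
# Exact comparison fails: the attribute value is "li li-first", not "li"
result1 = selector.xpath('//li[@class="li"]/a/text()')
# contains() succeeds because "li" occurs inside the attribute value
result2 = selector.xpath('//li[contains(@class,"li")]/a/text()')
print(result1)
print(result2)

Output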

[]
['first item']
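One caveat: contains() is a plain substring test, so contains(@class,"li") also matches a class such as "list". A common XPath 1.0 idiom pads the attribute with spaces and searches for the padded token. The sketch below uses a made-up fragment to show the difference; it is not part of the original example.

```python
from lxml import etree

# Hypothetical fragment: the second li has class "list", not the token "li".
html = ('<ul>'
        '<li class="li li-first"><a>first</a></li>'
        '<li class="list"><a>second</a></li>'
        '</ul>')
root = etree.HTML(html)

# Substring test: matches both elements.
loose = root.xpath('//li[contains(@class,"li")]/a/text()')
# Token test: pad with spaces so only the whole word "li" matches.
strict = root.xpath(
    '//li[contains(concat(" ", normalize-space(@class), " "), " li ")]/a/text()'
)

print(loose)
print(strict)
```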

Matching multiple attributes

from lxml import etree

text2 = '''
<li class="li li-first" name="item"><a href="link.html">first item</a></li>
'''

selector = etree.HTML(text2)
# 'and' requires both conditions to hold for the same li
result = selector.xpath('//li[contains(@class,"li") and @name="item"]/a/text()')
print(result)

Output

['first item']
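Besides and, XPath predicates also support the or operator and the not() function. The following sketch compares the three on a made-up fragment with three list items (the fragment and names are assumptions for illustration):

```python
from lxml import etree

# Hypothetical fragment with different class/name combinations.
html = ('<ul>'
        '<li class="li li-first" name="item"><a>first</a></li>'
        '<li class="li" name="other"><a>second</a></li>'
        '<li class="nav" name="item"><a>third</a></li>'
        '</ul>')
root = etree.HTML(html)

# Both conditions must hold.
both = root.xpath('//li[contains(@class,"li") and @name="item"]/a/text()')
# Either condition is enough.
either = root.xpath('//li[contains(@class,"li") or @name="item"]/a/text()')
# Negation of a condition.
negated = root.xpath('//li[not(@name="item")]/a/text()')

print(both)
print(either)
print(negated)
```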

Selecting by position

from lxml import etree

text = '''
<div>
    <ul>
         <li class="item-0"><a href="link1.html">first item</a></li>
         <li class="item-1"><a href="link2.html">second item</a></li>
         <li class="item-inactive"><a href="link3.html">third item</a></li>
         <li class="item-1"><a href="link4.html">fourth item</a></li>
         <li class="item-0"><a href="link5.html">fifth item</a>
     </ul>
 </div>
'''
selector = etree.HTML(text)
# XPath positions start at 1, not 0
result1 = selector.xpath('//li[1]/a/text()')
print(result1)
# last() is the position of the last li
result2 = selector.xpath('//li[last()]/a/text()')
print(result2)
# position() < 3 keeps the first and second li
result3 = selector.xpath('//li[position()<3]/a/text()')
print(result3)
# last() - 2 is the third li from the end
result4 = selector.xpath('//li[last()-2]/a/text()')
print(result4)

Output

['first item']
['fifth item']
['first item', 'second item']
['third item']
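A positional predicate counts among siblings under the same parent, not across the whole result. With a single list that distinction is invisible, but with two lists //li[1] returns the first item of each. Wrapping the path in parentheses applies the position to the combined result instead. A sketch on a made-up two-list fragment:

```python
from lxml import etree

# Hypothetical fragment with two separate lists.
html = ('<div>'
        '<ul><li>a1</li><li>a2</li></ul>'
        '<ul><li>b1</li><li>b2</li></ul>'
        '</div>')
root = etree.HTML(html)

# First li within each ul.
per_parent = root.xpath('//li[1]/text()')
# First li in the whole document.
overall_first = root.xpath('(//li)[1]/text()')
# Last li in the whole document.
overall_last = root.xpath('(//li)[last()]/text()')

print(per_parent)
print(overall_first)
print(overall_last)
```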

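Node axes

from lxml import etree

text3 = '''
<div>
    <ul>
         <li class="item-0"><a href="link1.html"><span>first item</span></a></li>
         <li class="item-1"><a href="link2.html">second item</a></li>
         <li class="item-inactive"><a href="link3.html">third item</a></li>
         <li class="item-1"><a href="link4.html">fourth item</a></li>
         <li class="item-0"><a href="link5.html">fifth item</a>
     </ul>
 </div>
'''
selector = etree.HTML(text3)
# ancestor::* selects every ancestor of the first li
result1 = selector.xpath('//li[1]/ancestor::*')
print(result1)
# ancestor::div keeps only the div ancestors
result2 = selector.xpath('//li[1]/ancestor::div')
print(result2)
# attribute::* selects all attribute values of the first li
result3 = selector.xpath('//li[1]/attribute::*')
print(result3)
# child::a selects a elements, filtered here by href
result4 = selector.xpath('//child::a[@href="link1.html"]')
print(result4)
# descendant::span selects span elements at any depth below the first li
result5 = selector.xpath('//li[1]/descendant::span')
print(result5)
# following::*[2] is the second element after the first li in document order
result6 = selector.xpath('//li[1]/following::*[2]')
print(result6)
# following-sibling::* selects the later siblings of the first li
result7 = selector.xpath('//li[1]/following-sibling::*')
print(result7)

Output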

[<Element html at 0x1761c02db88>, <Element body at 0x1761c07bf08>, <Element div at 0x1761c078308>, <Element ul at 0x1761c086088>]
[<Element div at 0x1761c078308>]
['item-0']
[<Element a at 0x1761c086288>]
[<Element span at 0x1761c06e6c8>]
[<Element a at 0x1761c06e688>]
[<Element li at 0x1761c078b08>, <Element li at 0x1761c078648>, <Element li at 0x1761c0864c8>, <Element li at 0x1761c086448>]
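XPath 1.0 defines further axes beyond the ones used above, among them preceding-sibling and self. The sketch below tries a few of them on a shortened, made-up three-item list and reads the class attribute so that the results are easy to compare:

```python
from lxml import etree

# Hypothetical three-item list.
html = ('<ul>'
        '<li class="item-0"><a href="link1.html">first item</a></li>'
        '<li class="item-1"><a href="link2.html">second item</a></li>'
        '<li class="item-2"><a href="link3.html">third item</a></li>'
        '</ul>')
root = etree.HTML(html)

# Siblings that come before the second li.
before = root.xpath('//li[2]/preceding-sibling::*/@class')
# Siblings that come after the second li.
after = root.xpath('//li[2]/following-sibling::*/@class')
# The li that contains a given link, found by walking up from the link.
owner = root.xpath('//a[@href="link2.html"]/ancestor::li/@class')
# self:: keeps the context node only if it has the given name.
same = root.xpath('//li[2]/self::li/@class')

print(before)
print(after)
print(owner)
print(same)
```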
