Python Web Scraping: Basic Usage of XPath
阿新 • Published: 2021-11-28
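Preface:
For a while I have been practicing web scraping with BeautifulSoup, and now I plan to start working with XPath. XPath's selection capabilities are very powerful, and it provides concise, readable path expressions for selecting nodes.
Usage rules:
Learning by example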
from lxml import etree

text = '''
<p>
<ul>
<li class="item-0"><a href="https://s1.bdstatic.com/">item 0 </a></li>
<li class="item-1"><a href="https://s2.bdstatic.com/">item 1 </a></li>
<li class="item-2"><a href="https://s3.bdstatic.com/">item 2 </a></li>
<li class="item-3"><a href="https://s4.bdstatic.com/">item 3 </a></li>
<li class="item-4"><a href="https://s5.bdstatic.com/">item 4 </a></li>
<li class="item-5"><a href="https://s6.bdstatic.com/">item 5 </a></li>
</ul>
</p>
'''
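Getting the content of a tag
Note: to select all the a elements, do not add a trailing forward slash after a in the path, otherwise an error is raised.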
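The trailing-slash error can be reproduced with a short sketch. The markup below is a cut-down assumption rather than the full snippet above, and catching etree.XPathError (the base class of lxml's XPath exceptions) is just one way to handle the failure.

```python
from lxml import etree

# Cut-down sample markup (an assumption for illustration)
sample = '<ul><li><a href="https://s1.bdstatic.com/">item 0 </a></li></ul>'
doc = etree.HTML(sample)

# A path that ends in "a" is valid and returns the matching elements
links = doc.xpath('/html/body/ul/li/a')
print(len(links))

# A path that ends in "/" is not a valid XPath expression
try:
    doc.xpath('/html/body/ul/li/a/')
    trailing_slash_ok = True
except etree.XPathError as exc:
    trailing_slash_ok = False
    print('XPath error:', exc)
```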
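html = etree.HTML(text)

# text() returns strings directly, so print each item as-is
html_data = html.xpath('/html/body/ul/li/a/text()')
for i in html_data:
    print(i)

Or:

# Without text(), the result is a list of elements, so read each one's .text
html_data = html.xpath('/html/body/ul/li/a')
for i in html_data:
    print(i.text)

text() selects the text content inside a tag.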
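To make the difference between the two forms concrete, here is a minimal sketch. The two-item markup and the names sample, doc, texts and elements are assumptions for illustration. It shows that text() yields strings while the plain path yields elements whose .text holds the same value.

```python
from lxml import etree

# Shortened sample markup (an assumption for illustration)
sample = '''
<ul>
<li class="item-0"><a href="https://s1.bdstatic.com/">item 0 </a></li>
<li class="item-1"><a href="https://s2.bdstatic.com/">item 1 </a></li>
</ul>
'''
doc = etree.HTML(sample)

# text() yields string results, not elements
texts = doc.xpath('//li/a/text()')
# Without text(), the result is a list of elements
elements = doc.xpath('//li/a')

print(texts)
print([el.text for el in elements])
# strip() removes the trailing space kept from the markup
print([t.strip() for t in texts])
```

Since the items returned by text() are already strings, calling .text on them fails with an AttributeError. Print them directly instead.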
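Printing an attribute of the a tags under a given path
Attribute values are selected with @attribute_name, in the same way that text() selects tag content. Iterating over the result gives the value of that attribute for each matching tag.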
html = etree.HTML(text)
html_data = html.xpath('/html/body/ul/li/a/@href')
for i in html_data:
    print(i)
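When both the link and its text are needed, one option is to select the elements and read the attribute from each with Element.get(). The sketch below uses shortened sample markup, and the variable names are assumptions for illustration.

```python
from lxml import etree

# Shortened sample markup (an assumption for illustration)
sample = '''
<ul>
<li class="item-0"><a href="https://s1.bdstatic.com/">item 0 </a></li>
<li class="item-1"><a href="https://s2.bdstatic.com/">item 1 </a></li>
</ul>
'''
doc = etree.HTML(sample)

# @href returns the attribute values as strings
hrefs = doc.xpath('//li/a/@href')
print(hrefs)

# Alternatively, select the elements and read attribute and text from each
pairs = [(a.get('href'), a.text.strip()) for a in doc.xpath('//li/a')]
print(pairs)
```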
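The square brackets hold a predicate, here a condition on an attribute. contains() tests whether a string contains a given substring, and it is often used for attribute matching when an attribute holds several values. For example, '//li[@class="item-1"]/a/text()' gets the text of the a tag inside the li whose class is item-1.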
from lxml import etree

text = '''
<li class="zxc asd wer" name="222"><a href="https://s2.bdstatic.com/">1 item</a></li>
<li class="ddd zxc eee" name="111"><a href="https://s3.bdstatic.com/">2 item</a></li>
'''
html = etree.HTML(text)
result = html.xpath('//li[contains(@class, "zxc") and @name="111"]/a/text()')
print(result)
# Output: ['2 item']
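One caveat: contains() is a plain substring test, so contains(@class, "zxc") also matches a class such as "zxcv". The sketch below uses hypothetical markup to show this, together with a common XPath 1.0 idiom that matches a whole class token by padding the attribute with spaces.

```python
from lxml import etree

# Hypothetical markup: "zxcv" contains the substring "zxc" but is a different class
sample = '''
<ul>
<li class="zxc asd"><a href="#">exact</a></li>
<li class="zxcv asd"><a href="#">lookalike</a></li>
</ul>
'''
doc = etree.HTML(sample)

# Substring match: both li elements qualify
loose = doc.xpath('//li[contains(@class, "zxc")]/a/text()')
print(loose)

# Token match: pad the class list with spaces and look for " zxc "
strict = doc.xpath(
    '//li[contains(concat(" ", normalize-space(@class), " "), " zxc ")]/a/text()'
)
print(strict)
```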
from lxml import etree

text = '''
<div>
<ul>
<li class="item-0"><a href="https://s1.bdstatic.com/">item 0 </a></li>
<li class="item-1"><a href="https://s2.bdstatic.com/">item 1 </a></li>
<li class="item-2"><a href="https://s3.bdstatic.com/">item 2 </a></li>
<li class="item-3"><a href="https://s4.bdstatic.com/">item 3 </a></li>
<li class="item-4"><a href="https://s5.bdstatic.com/">item 4 </a></li>
<li class="item-5"><a href="https://s6.bdstatic.com/">item 5 </a></li>
</ul>
</div>
'''
html = etree.HTML(text)

# Get the first item (XPath positions start at 1)
result = html.xpath('//li[1]/a/text()')
print(result)

# Get the last item
result = html.xpath('//li[last()]/a/text()')
print(result)

# Get the first two items
result = html.xpath('//li[position()<3]/a/text()')
print(result)

# Get the third item from the end
result = html.xpath('//li[last()-2]/a/text()')
print(result)

# Output:
# ['item 0 ']
# ['item 5 ']
# ['item 0 ', 'item 1 ']
# ['item 3 ']
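A point worth keeping in mind with position predicates: in //li[1] the position is counted among the siblings under each parent, not across the whole document. The sketch below uses hypothetical markup with two lists to show the difference, and wraps the path in parentheses to index the combined result instead.

```python
from lxml import etree

# Hypothetical markup with two separate lists
sample = '''
<div>
<ul><li><a>a0</a></li><li><a>b0-not</a></li></ul>
<ul><li><a>b0</a></li><li><a>b1</a></li></ul>
</div>
'''
doc = etree.HTML(sample)

# First li under each parent ul
per_parent = doc.xpath('//li[1]/a/text()')
print(per_parent)

# First li in the whole document
overall = doc.xpath('(//li)[1]/a/text()')
print(overall)

# Last li in the whole document
last_overall = doc.xpath('(//li)[last()]/a/text()')
print(last_overall)
```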