Python Web Scraping 7: XPath, the lxml Library, and the XPath Helper Extension

阿新 • Published: 2018-11-19

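Some readers say: my regular expressions are weak, and processing HTML documents with them is exhausting. Is there another way?

Yes: XPath. We first parse the HTML into an XML-style document tree, then use XPath to look up HTML nodes or elements.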

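What is XML

XML stands for EXtensible Markup Language.
XML is a markup language, much like HTML.
XML was designed to carry data, not to display it.
XML tags are not predefined; you define your own.
XML is designed to be self-descriptive.
XML is a W3C Recommendation.

W3School documentation: http://www.w3school.com.cn/xml/index.asp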

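Differences between XML and HTML

Format	Description	Design goal
XML	Extensible Markup Language	Designed to transport and store data; the focus is on the content of the data.
HTML	HyperText Markup Language	Designed to display data; the focus is on how the data is presented.
HTML DOM	Document Object Model for HTML	Through the HTML DOM you can access every HTML element together with the text and attributes it contains. You can modify or delete that content, and you can also create new elements.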

XML document example

<?xml version="1.0" encoding="utf-8"?>

<bookstore> 

  <book category="cooking"> 
    <title lang="en">Everyday Italian</title>  
    <author>Giada De Laurentiis</author>  
    <year>2005</year>  
    <price>30.00</price> 
  </book>  

  <book category="children"> 
    <title lang="en">Harry Potter</title>  
    <author>J K. Rowling</author>  
    <year>2005</year>  
    <price>29.99</price> 
  </book>  

  <book category="web"> 
    <title lang="en">XQuery Kick Start</title>  
    <author>James McGovern</author>  
    <author>Per Bothner</author>  
    <author>Kurt Cagle</author>  
    <author>James Linn</author>  
    <author>Vaidyanathan Nagarajan</author>  
    <year>2003</year>  
    <price>49.99</price> 
  </book> 

  <book category="web" cover="paperback"> 
    <title lang="en">Learning XML</title>  
    <author>Erik T. Ray</author>  
    <year>2003</year>  
    <price>39.95</price> 
  </book> 

</bookstore>

HTML DOM model example

The HTML DOM defines a standard way to access and manipulate HTML documents, and it represents an HTML document as a tree.

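Node relationships in XML

1. Parent

Every element and attribute has exactly one parent.

In the following simple XML example, the book element is the parent of the title, author, year, and price elements: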

<?xml version="1.0" encoding="utf-8"?>

<book>
  <title>Harry Potter</title>
  <author>J K. Rowling</author>
  <year>2005</year>
  <price>29.99</price>
</book>

2. Children

An element node can have zero, one, or more children.

In the following example, the title, author, year, and price elements are all children of the book element:

<?xml version="1.0" encoding="utf-8"?>

<book>
  <title>Harry Potter</title>
  <author>J K. Rowling</author>
  <year>2005</year>
  <price>29.99</price>
</book>

3. Siblings

Nodes that share the same parent.

In the following example, the title, author, year, and price elements are all siblings:

<?xml version="1.0" encoding="utf-8"?>

<book>
  <title>Harry Potter</title>
  <author>J K. Rowling</author>
  <year>2005</year>
  <price>29.99</price>
</book>

4. Ancestors

A node's parent, its parent's parent, and so on.

In the following example, the ancestors of the title element are the book element and the bookstore element:

<?xml version="1.0" encoding="utf-8"?>

<bookstore>

<book>
  <title>Harry Potter</title>
  <author>J K. Rowling</author>
  <year>2005</year>
  <price>29.99</price>
</book>

</bookstore>

5. Descendants

A node's children, its children's children, and so on.

In the following example, the descendants of bookstore are the book, title, author, year, and price elements:

<?xml version="1.0" encoding="utf-8"?>

<bookstore>

<book>
  <title>Harry Potter</title>
  <author>J K. Rowling</author>
  <year>2005</year>
  <price>29.99</price>
</book>

</bookstore>
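
The five relationships above can also be checked in code. The following is a minimal sketch, assuming the lxml library (covered later in this article) is installed; the variable names are only illustrative.

```python
from lxml import etree

xml = """
<bookstore>
  <book>
    <title>Harry Potter</title>
    <author>J K. Rowling</author>
    <year>2005</year>
    <price>29.99</price>
  </book>
</bookstore>
""".strip()

root = etree.fromstring(xml)   # the <bookstore> element
book = root[0]                 # first child element of <bookstore>
title = book.find("title")

parent_tag = title.getparent().tag                         # parent of <title>
child_tags = [child.tag for child in book]                 # children of <book>
sibling_tags = [s.tag for s in title.itersiblings()]       # siblings that follow <title>
ancestor_tags = [a.tag for a in title.iterancestors()]     # nearest ancestor first
descendant_tags = [d.tag for d in root.iterdescendants()]  # document order

print(parent_tag)
print(child_tags)
print(sibling_tags)
print(ancestor_tags)
print(descendant_tags)
```

Note that itersiblings() only walks the siblings that come after the element by default; passing preceding=True walks the ones before it.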

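What is XPath?

XPath (XML Path Language) is a language for finding information in an XML document. It can be used to traverse the elements and attributes of an XML document.

W3School documentation: http://www.w3school.com.cn/xpath/index.asp

XPath development tools

Open-source XPath expression editor: XMLQuire (works with XML files)
Chrome extension: XPath Helper
Firefox add-on: XPath Checker

Selecting nodes

XPath uses path expressions to select nodes or node sets in an XML document. These path expressions look very much like the paths used in an ordinary computer file system.

The most commonly used path expressions are listed below: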

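Expression	Description
nodename	Selects the child elements of the current node that are named nodename.
/	Selects from the root node.
//	Selects matching nodes at any depth, regardless of their position in the document.
.	Selects the current node.
..	Selects the parent of the current node.
@	Selects attributes.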

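The table below lists some path expressions and their results:

Path expression	Result
bookstore	Selects all bookstore elements that are children of the current node.
/bookstore	Selects the root element bookstore. Note: a path that starts with a slash ( / ) is always an absolute path to an element.
bookstore/book	Selects all book elements that are children of bookstore.
//book	Selects all book elements, wherever they are in the document.
bookstore//book	Selects all book elements that are descendants of bookstore, wherever they sit below bookstore.
//@lang	Selects all attributes named lang.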
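
To see these expressions in action, here is a minimal sketch that evaluates them with lxml (covered later in this article). It assumes lxml is installed and uses a shortened copy of the bookstore document; the variable names are only illustrative.

```python
from lxml import etree

xml = """
<bookstore>
  <book category="cooking">
    <title lang="en">Everyday Italian</title>
    <price>30.00</price>
  </book>
  <book category="web">
    <title lang="en">Learning XML</title>
    <price>39.95</price>
  </book>
</bookstore>
""".strip()

root = etree.fromstring(xml)   # context node: the <bookstore> root element

# A bare name is relative: it matches children of the context node.
books = root.xpath("book")          # the two <book> children
nothing = root.xpath("bookstore")   # empty: <bookstore> has no <bookstore> child

# A leading / makes the path absolute, whatever the context node is.
absolute = root.xpath("/bookstore")

# // searches at any depth; @ selects attributes.
all_titles = root.xpath("//title")
langs = root.xpath("//@lang")       # attribute values come back as strings

# . is the current node and .. is its parent.
first_title = all_titles[0]
same = first_title.xpath(".")[0]
parent = first_title.xpath("..")[0]

print(len(books), nothing, absolute[0].tag)
print(langs, same.text, parent.tag)
```

The empty result for "bookstore" is the point worth remembering: a bare node name is matched against the children of the node you call xpath() on, not against the whole document.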

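Predicates

A predicate finds a specific node, or a node that contains a specific value. Predicates are written inside square brackets.

The table below lists some path expressions with predicates and their results (XPath positions start at 1, not 0):

Path expression	Result
/bookstore/book[1]	Selects the first book element that is a child of bookstore.
/bookstore/book[last()]	Selects the last book element that is a child of bookstore.
/bookstore/book[last()-1]	Selects the second-to-last book element that is a child of bookstore.
/bookstore/book[position()<3]	Selects the first two book elements that are children of bookstore.
//title[@lang]	Selects all title elements that have an attribute named lang.
//title[@lang='eng']	Selects all title elements whose lang attribute has the value eng.
/bookstore/book[price>35.00]	Selects all book elements of bookstore whose price element has a value greater than 35.00.
/bookstore/book[price>35.00]/title	Selects the title elements of all book elements of bookstore whose price element has a value greater than 35.00.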
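
The predicates above can be tried out against the bookstore data. This is a small sketch, assuming lxml is installed; the helper function titles() is hypothetical and exists only to keep the example short.

```python
from lxml import etree

xml = """
<bookstore>
  <book><title lang="en">Everyday Italian</title><price>30.00</price></book>
  <book><title lang="en">Harry Potter</title><price>29.99</price></book>
  <book><title lang="en">XQuery Kick Start</title><price>49.99</price></book>
  <book><title lang="en">Learning XML</title><price>39.95</price></book>
</bookstore>
""".strip()

root = etree.fromstring(xml)


def titles(path):
    """Return the <title> text of each <book> element matched by path."""
    return [book.findtext("title") for book in root.xpath(path)]


first = titles("/bookstore/book[1]")
last = titles("/bookstore/book[last()]")
second_to_last = titles("/bookstore/book[last()-1]")
first_two = titles("/bookstore/book[position()<3]")
expensive = titles("/bookstore/book[price>35.00]")

english = root.xpath("//title[@lang='en']/text()")
# No match here: the sample data uses lang="en", not "eng".
eng = root.xpath("//title[@lang='eng']")

print(first, last, second_to_last)
print(first_two)
print(expensive)
print(len(english), eng)
```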

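Selecting unknown nodes

XPath wildcards can be used to select XML elements whose names are not known in advance.

Wildcard	Description
*	Matches any element node.
@*	Matches any attribute node.
node()	Matches a node of any kind.

The table below lists some path expressions and their results:

Path expression	Result
/bookstore/*	Selects all child elements of the bookstore element.
//*	Selects all elements in the document.
//title[@*]	Selects all title elements that have at least one attribute.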
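
A quick way to get a feel for the wildcards is to count what they match. The sketch below assumes lxml is installed; the second book entry, which has no attributes, is made up purely for illustration.

```python
from lxml import etree

xml = """
<bookstore>
  <book category="web" cover="paperback">
    <title lang="en">Learning XML</title>
    <price>39.95</price>
  </book>
  <book>
    <title>Untitled Draft</title>
    <price>10.00</price>
  </book>
</bookstore>
""".strip()

root = etree.fromstring(xml)

children = root.xpath("/bookstore/*")        # every child element of <bookstore>
everything = root.xpath("//*")               # every element in the document
with_attrs = root.xpath("//title[@*]")       # <title> elements with any attribute
book_attrs = sorted(root.xpath("//book/@*"))  # all attribute values on <book> elements

# * matches elements only, while node() also matches text nodes
# (here, the whitespace between the tags).
child_elements = root.xpath("/bookstore/book[1]/*")
child_nodes = root.xpath("/bookstore/book[1]/node()")

print([e.tag for e in children])
print(len(everything))
print([t.text for t in with_attrs])
print(book_attrs)
print(len(child_elements), len(child_nodes))
```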

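Selecting several paths

By using the "|" operator in a path expression you can select several paths at once.

Example

The table below lists some path expressions and their results:

Path expression	Result
//book/title | //book/price	Selects all title and price elements of all book elements.
//title | //price	Selects all title and price elements in the document.
/bookstore/book/title | //price	Selects all title elements of the book elements of bookstore, plus all price elements in the document.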
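
As a small illustration of the "|" operator, the sketch below (assuming lxml is installed) selects title and price elements in one query. With lxml the combined node set comes back in document order, not grouped by path.

```python
from lxml import etree

xml = """
<bookstore>
  <book>
    <title>Everyday Italian</title>
    <author>Giada De Laurentiis</author>
    <price>30.00</price>
  </book>
  <book>
    <title>Harry Potter</title>
    <author>J K. Rowling</author>
    <price>29.99</price>
  </book>
</bookstore>
""".strip()

root = etree.fromstring(xml)

# One query, two paths: the <author> elements are skipped.
both = root.xpath("//book/title | //book/price")

tags = [element.tag for element in both]
texts = [element.text for element in both]

print(tags)
print(texts)
```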

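XPath operators

The operators that can be used in XPath expressions are listed below:

Operator	Description	Example
|	Union of two node sets	//book | //cd
+	Addition	6 + 4
-	Subtraction	6 - 4
*	Multiplication	6 * 4
div	Division	8 div 4
=	Equal to	price=9.80
!=	Not equal to	price!=9.80
<	Less than	price<9.80
<=	Less than or equal to	price<=9.80
>	Greater than	price>9.80
>=	Greater than or equal to	price>=9.80
or	Logical or	price=9.80 or price=9.70
and	Logical and	price>9.00 and price<9.90
mod	Modulus (remainder of a division)	5 mod 2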

That covers the XPath syntax. To use it when scraping with Python, the HTML first has to be parsed into an XML-style element tree.
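
XPath operators can be evaluated on their own or inside predicates. Here is a minimal sketch, assuming lxml is installed; with lxml, numeric expressions come back as Python floats and comparisons as booleans.

```python
from lxml import etree

xml = """
<bookstore>
  <book category="cooking"><title>Everyday Italian</title><price>30.00</price></book>
  <book category="children"><title>Harry Potter</title><price>29.99</price></book>
  <book category="web"><title>XQuery Kick Start</title><price>49.99</price></book>
  <book category="web"><title>Learning XML</title><price>39.95</price></book>
</bookstore>
""".strip()

root = etree.fromstring(xml)

# Arithmetic operators
total = root.xpath("6 + 4")
quotient = root.xpath("8 div 4")
remainder = root.xpath("5 mod 2")

# Comparison operator applied to the result of a function
has_four_books = root.xpath("count(//book) = 4")

# Logical operators inside predicates
web_over_35 = root.xpath("//book[@category='web' and price>35]/title/text()")
cheap_or_cooking = root.xpath("//book[price<30 or @category='cooking']/title/text()")

print(total, quotient, remainder, has_four_books)
print(web_over_35)
print(cheap_or_cooking)
```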

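The lxml library

lxml is an HTML/XML parser; its main job is parsing HTML/XML and extracting data from it.

Like Python's regular expression engine, lxml is implemented in C. It is a high-performance HTML/XML parser for Python, and with the XPath syntax covered above we can quickly locate specific elements and read node information.

lxml documentation: http://lxml.de/index.html

lxml depends on C libraries. It can be installed with pip: pip install lxml (or from a wheel file).

Getting started

We can use it to parse HTML. A simple example: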

# lxml_test.py

# Use the etree module from lxml
from lxml import etree

# Note: the last <li> below is deliberately missing its closing </li> tag
text = '''
<div>
    <ul>
         <li class="item-0"><a href="link1.html">first item</a></li>
         <li class="item-1"><a href="link2.html">second item</a></li>
         <li class="item-inactive"><a href="link3.html">third item</a></li>
         <li class="item-1"><a href="link4.html">fourth item</a></li>
         <li class="item-0"><a href="link5.html">fifth item</a>
     </ul>
 </div>
'''

# Parse the string as an HTML document with etree.HTML
html = etree.HTML(text)

# Serialize the HTML document; tostring() returns bytes
result = etree.tostring(html)

# Decode the bytes so the markup prints as readable text
print(result.decode('utf-8'))

Output:

<html><body>
<div>
    <ul>
         <li class="item-0"><a href="link1.html">first item</a></li>
         <li class="item-1"><a href="link2.html">second item</a></li>
         <li class="item-inactive"><a href="link3.html">third item</a></li>
         <li class="item-1"><a href="link4.html">fourth item</a></li>
         <li class="item-0"><a href="link5.html">fifth item</a></li>
</ul>
 </div>
</body></html>

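lxml repairs the HTML automatically: in this example it not only added the missing </li> tag but also wrapped the fragment in html and body tags.

Reading from a file:

Besides parsing a string directly, lxml can also read content from a file. Create a file named hello.html: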
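
The repair can also be confirmed by inspecting the parsed tree instead of reading the serialized output. This is a minimal sketch, assuming lxml is installed; the shortened HTML snippet and the variable names are made up for illustration.

```python
from lxml import etree

# Neither <li> is closed, and there are no <html> or <body> tags.
broken = (
    '<ul>'
    '<li class="item-0"><a href="link1.html">first item</a>'
    '<li class="item-1"><a href="link2.html">second item</a>'
    '</ul>'
)

html = etree.HTML(broken)

root_tag = html.tag                                  # root element added by the parser
top_level = [child.tag for child in html]            # children of the root
li_parents = [li.getparent().tag for li in html.xpath('//li')]
link_texts = html.xpath('//li/a/text()')

print(root_tag, top_level)
print(li_parents)
print(link_texts)
```

Both li elements end up as siblings under ul, which shows that the parser closed the first one before opening the second.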

<!-- hello.html -->

<div>
    <ul>
         <li class="item-0"><a href="link1.html">first item</a></li>
         <li class="item-1"><a href="link2.html">second item</a></li>
         <li class="item-inactive"><a href="link3.html"><span class="bold">third item</span></a></li>
         <li class="item-1"><a href="link4.html">fourth item</a></li>
         <li class="item-0"><a href="link5.html">fifth item</a></li>
     </ul>
 </div>

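Then read the file with etree.parse().

# lxml_parse.py

from lxml import etree

# Read the external file hello.html (etree.parse() uses the XML parser by default)
html = etree.parse('./hello.html')
result = etree.tostring(html, pretty_print=True)

# tostring() returns bytes; decode them for readable output
print(result.decode('utf-8'))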

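The output reproduces the markup in hello.html rather than the repaired document shown earlier. etree.parse() uses the XML parser by default, so the file must be well-formed and no html or body tags are added; to get the repairing behaviour, pass an HTML parser, as in etree.parse('./hello.html', etree.HTMLParser()). The serialized div element:

<div>
    <ul>
         <li class="item-0"><a href="link1.html">first item</a></li>
         <li class="item-1"><a href="link2.html">second item</a></li>
         <li class="item-inactive"><a href="link3.html"><span class="bold">third item</span></a></li>
         <li class="item-1"><a href="link4.html">fourth item</a></li>
         <li class="item-0"><a href="link5.html">fifth item</a></li>
     </ul>
 </div>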
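
When reading files it helps to know which parser is in use. The sketch below, assuming lxml is installed, writes a small temporary file (the file content and variable names are illustrative) and parses it twice: once with the default XML parser and once with etree.HTMLParser().

```python
import os
import tempfile

from lxml import etree

markup = '<div><ul><li class="item-0"><a href="link1.html">first item</a></li></ul></div>'

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "hello.html")
    with open(path, "w", encoding="utf-8") as f:
        f.write(markup)

    xml_tree = etree.parse(path)                        # default: strict XML parser
    html_tree = etree.parse(path, etree.HTMLParser())   # lenient HTML parser

xml_root_tag = xml_tree.getroot().tag     # the file's own top-level element
html_root_tag = html_tree.getroot().tag   # wrapper added by the HTML parser
hrefs = html_tree.xpath('//li/a/@href')

print(xml_root_tag, html_root_tag, hrefs)
```

The XML parser raises an error on markup that is not well-formed, so for pages fetched from the web the HTML parser is usually the safer choice.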

XPath examples

1. Get all <li> tags

# xpath_li.py

from lxml import etree

html = etree.parse('hello.html')
print(type(html))  # show the type returned by etree.parse()

result = html.xpath('//li')

print(result)  # the list of matched <li> elements
print(len(result))
print(type(result))
print(type(result[0]))

Output:

<class 'lxml.etree._ElementTree'>
[<Element li at 0x1014e0e18>, <Element li at 0x1014e0ef0>, <Element li at 0x1014e0f38>, <Element li at 0x1014e0f80>, <Element li at 0x1014e0fc8>]
5
<class 'list'>
<class 'lxml.etree._Element'>

2. Get the class attribute of every <li> tag

# xpath_li.py

from lxml import etree

html = etree.parse('hello.html')
result = html.xpath('//li/@class')

print(result)

Output:

['item-0', 'item-1', 'item-inactive', 'item-1', 'item-0']

3. Get the <a> tag under <li> whose href is link1.html

# xpath_li.py

from lxml import etree

html = etree.parse('hello.html')
result = html.xpath('//li/a[@href="link1.html"]')

print(result)

Output:

[<Element a at 0x10ffaae18>]

4. Get all <span> tags under the <li> tags

# xpath_li.py

from lxml import etree

html = etree.parse('hello.html')

# result = html.xpath('//li/span')
# The line above would find nothing: / selects direct children only,
# and <span> is not a direct child of <li>, so use a double slash instead.

result = html.xpath('//li//span')

print(result)

Output:

[<Element span at 0x10d698e18>]

5. Get every class attribute inside the <a> tags under the <li> tags

# xpath_li.py

from lxml import etree

html = etree.parse('hello.html')
result = html.xpath('//li/a//@class')

print(result)

Output:

['bold']

6. Get the href of the <a> in the last <li>

# xpath_li.py

from lxml import etree

html = etree.parse('hello.html')

result = html.xpath('//li[last()]/a/@href')
# The predicate [last()] selects the last element

print(result)

Output:

['link5.html']

7. Get the text of the <a> in the second-to-last <li>

# xpath_li.py

from lxml import etree

html = etree.parse('hello.html')
result = html.xpath('//li[last()-1]/a')

# The .text attribute holds the element's text content
print(result[0].text)

Output:

fourth item

8. Get the tag name of the element whose class is bold

# xpath_li.py

from lxml import etree

html = etree.parse('hello.html')

result = html.xpath('//*[@class="bold"]')

# The .tag attribute holds the tag name
print(result[0].tag)

Output:

span
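
Putting the pieces together, a typical scraping step is to collect the text and href of every link. The following is a minimal sketch, assuming lxml is installed; it parses the same markup as hello.html from a string so that it is self-contained, and the variable names are illustrative. It uses itertext() rather than .text because .text is None for an <a> whose text sits inside a nested <span>.

```python
from lxml import etree

html_text = '''
<div>
    <ul>
         <li class="item-0"><a href="link1.html">first item</a></li>
         <li class="item-1"><a href="link2.html">second item</a></li>
         <li class="item-inactive"><a href="link3.html"><span class="bold">third item</span></a></li>
         <li class="item-1"><a href="link4.html">fourth item</a></li>
         <li class="item-0"><a href="link5.html">fifth item</a></li>
     </ul>
 </div>
'''

root = etree.HTML(html_text)

links = []
for a in root.xpath('//li/a'):
    # itertext() yields the text of the element and all of its descendants
    text = ''.join(a.itertext()).strip()
    links.append((text, a.get('href')))

# The third <a> has no text of its own, only a <span> child.
third_a_text = root.xpath('//li[3]/a')[0].text

for text, href in links:
    print(text, href)
print(third_a_text)
```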

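The XPath Helper extension

Chrome also has a handy extension called XPath Helper, which lets you run XPath expressions directly against an HTML page and see the elements they match.

XPath Helper download link: https://pan.baidu.com/s/1dFgzBSd password: zwvb

Press Ctrl+Shift+X to open its panel and type an XPath expression.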
