Study notes on Python's lxml module

阿新 • Published: 2020-11-30

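Question 1: Given an XML file, how do you parse it?
Question 2: After parsing, how do you find and locate a particular tag?
Question 3: Once a tag is located, how do you work with it, for example to read its attributes and text content?

Start with from lxml import etree to import the module. Most of the library's commonly used XML-processing features live in lxml.etree.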

from lxml import etree
import requests

url = 'http://www.nbzhuti.cn/'
html = requests.get(url)  # fetch the page

selector = etree.HTML(html.text)  # parse the HTML text into an Element tree
content_field = selector.xpath('//div[@class="lesson-list"]/ul/li')  # list of matching <li> elements
print(content_field)
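
The same steps can be tried without a network request. The sketch below is only a stand-in: the inline markup and the lesson names are made up for illustration, and only the XPath expression comes from the snippet above.

```python
from lxml import etree

# Hypothetical markup standing in for the downloaded page.
sample_html = """
<html><body>
  <div class="lesson-list">
    <ul>
      <li>Lesson 1</li>
      <li>Lesson 2</li>
    </ul>
  </div>
</body></html>
"""

selector = etree.HTML(sample_html)
items = selector.xpath('//div[@class="lesson-list"]/ul/li')

# xpath() returns a list of Element objects; read .text to get the content.
lesson_names = [li.text for li in items]
print(lesson_names)
```

Printing the elements themselves only shows their repr (something like <Element li at 0x...>), which is why the example reads .text instead.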

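The Element class

Element is the core class for XML processing. An Element object can be thought of simply as an XML node, and most XML node handling revolves around this class. This part covers three topics: operating on nodes, on node attributes, and on the text inside nodes.

Node operations

Creating an Element object
Use the Element factory function; its argument is the node name.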

>>> root = etree.Element('root')
>>> print(root)
<Element root at 0x2da0708>
The "at 0x2da0708" part is not an error; it is just the object's memory address. I mistook it for an error at first too.

Getting the node name
Use the tag attribute to get the name of the node.

>>> print(root.tag)
root

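Adding child nodes
Use the SubElement function to create a child node; the first argument is the parent node (an Element object) and the second is the name of the child.

>>> child1 = etree.SubElement(root, 'child1')
>>> child2 = etree.SubElement(root, 'child2')
>>> child3 = etree.SubElement(root, 'child3')

Serializing to XML
Use the tostring function to serialize the XML; its argument is an Element object.

>>> print(etree.tostring(root))
b'<root><child1/><child2/><child3/></root>'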

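Removing child nodes
Use the remove method to delete a given child; its argument is an Element object. The clear method resets the element, removing all of its children along with its attributes and text.

>>> root.remove(child1)  # remove one specific child
>>> print(etree.tostring(root))
b'<root><child2/><child3/></root>'
>>> root.clear()  # remove all children (and attributes and text)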
>>> print(etree.tostring(root))
b'<root/>'
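
A common follow-up is removing only the children that match some condition. A minimal sketch, assuming invented tag names keep and drop; it iterates over a copy of the child list so that removals do not interfere with the loop:

```python
from lxml import etree

root = etree.XML('<root><keep/><drop/><keep/><drop/></root>')

# Iterate over a copy (list(root)) so that removing children
# does not disturb the iteration itself.
for child in list(root):
    if child.tag == 'drop':
        root.remove(child)

remaining = [child.tag for child in root]
print(remaining)
```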

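Working with child nodes as a list
The children of an Element object can be treated as a list. Because clear() emptied root above, the examples below first rebuild it with the three children:

>>> root = etree.Element('root')
>>> child1 = etree.SubElement(root, 'child1')
>>> child2 = etree.SubElement(root, 'child2')
>>> child3 = etree.SubElement(root, 'child3')

>>> child = root[0]  # index access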
>>> print(child.tag)
child1

>>> print(len(root))  # number of children
3

>>> root.index(child2)  # get the index
1

>>> for child in root:  # iterate
...     print(child.tag)
child1
child2
child3

>>> root.insert(0, etree.Element('child0'))  # insert
>>> start = root[:1]  # slice
>>> end = root[-1:]

>>> print(start[0].tag)
child0
>>> print(end[0].tag)
child3

>>> root.append(etree.Element('child4'))  # append at the end
>>> print(etree.tostring(root))
b'<root><child0/><child1/><child2/><child3/><child4/></root>'
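
One difference from a real Python list is worth knowing: in lxml an element has exactly one parent, so appending an element that is already in the tree moves it rather than duplicating it. A minimal sketch (tag names invented) that shows the move and uses copy.deepcopy when a copy is wanted:

```python
import copy
from lxml import etree

root = etree.XML('<root><a/><b/><c/></root>')

# Appending an element that already sits in the tree moves it.
root.append(root[0])
tags_after_move = [child.tag for child in root]

# To duplicate instead, append a deep copy.
root.append(copy.deepcopy(root[0]))
tags_after_copy = [child.tag for child in root]
print(tags_after_move, tags_after_copy)
```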

Getting the parent node
Use the getparent method to get the parent node.

>>> print(child1.getparent().tag)
root
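
Besides getparent, lxml elements can also navigate to their siblings with getprevious and getnext. A short sketch on a freshly built tree, so it does not depend on the state of root above:

```python
from lxml import etree

root = etree.XML('<root><child1/><child2/><child3/></root>')
middle = root[1]

print(middle.getparent().tag)    # the parent element
print(middle.getprevious().tag)  # the preceding sibling
print(middle.getnext().tag)      # the following sibling

# At the edges of the tree these methods return None.
print(root.getparent(), root[0].getprevious(), root[-1].getnext())
```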

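Attribute operations

Attributes are stored as key-value pairs, much like a dictionary.

1. Creating attributes

Attributes can be set when the Element object is created by passing them as keyword arguments after the node name: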

>>> root = etree.Element('root', interesting='totally')
>>> print(etree.tostring(root))
b'<root interesting="totally"/>'
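You can also use the set method to add an attribute to an existing Element object; its two arguments are the attribute name and the attribute value: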

>>> root.set('hello', 'Huhu')
>>> print(etree.tostring(root))
b'<root interesting="totally" hello="Huhu"/>'

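2. Reading attributes

Since attributes behave like a dictionary, the accessors are easiest to show by example.

# get returns the value of a single attribute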
>>> print(root.get('interesting'))
totally

# keys returns all attribute names
>>> sorted(root.keys())
['hello', 'interesting']

# items returns all name-value pairs
>>> for name, value in sorted(root.items()):
...     print('%s = %r' % (name, value))
hello = 'Huhu'
interesting = 'totally'

The attrib property exposes all attributes and their values at once as a dictionary-like object:

>>> attributes = root.attrib
>>> print(attributes)
{'interesting': 'totally', 'hello': 'Huhu'}

>>> attributes['good'] = 'Bye'  # changes made through attrib affect the node
>>> print(root.get('good'))
Bye
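
Because attrib is a live view of the node, it helps to know how to read defensively and how to take a detached copy. A small sketch under those assumptions, reusing the attribute names from the examples above (the name missing is invented):

```python
from lxml import etree

root = etree.Element('root', interesting='totally')

# get() accepts a default for attributes that are absent.
missing = root.get('missing', 'n/a')

# dict(root.attrib) takes an independent snapshot of the attributes.
snapshot = dict(root.attrib)
root.set('hello', 'Huhu')

# Deleting through attrib removes the attribute from the node.
del root.attrib['interesting']
print(missing, snapshot, etree.tostring(root))
```

The snapshot keeps the value it had when it was taken, while later changes to the node are visible only through root.attrib.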

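Text operations

With tags and their attributes covered, all that remains is the text inside the tags. It can be accessed through the text and tail attributes or through XPath.

1. The text and tail attributes

In the usual case, the text of a tag can be accessed through the Element's text attribute.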

>>> root = etree.Element('root')
>>> root.text = 'Hello, World!'
>>> print(root.text)
Hello, World!
>>> print(etree.tostring(root))
b'<root>Hello, World!</root>'

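XML tags normally come in pairs, an opening tag and a closing tag, but text can also follow an element inside its parent, as with the text after the `<br/>` in this HTML snippet:

`<html><body>Text<br/>Tail</body></html>`

The Element class provides the tail attribute for this case: it holds the text that directly follows an element, up to the next tag.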
>>> html = etree.Element('html')
>>> body = etree.SubElement(html, 'body')
>>> body.text = 'Text'
>>> print(etree.tostring(html))
b'<html><body>Text</body></html>'

>>> br = etree.SubElement(body, 'br')
>>> print(etree.tostring(html))
b'<html><body>Text<br/></body></html>'

# tail sets the text that comes right after this tag
>>> br.tail = 'Tail'
>>> print(etree.tostring(br))
b'<br/>Tail'

>>> print(etree.tostring(html))
b'<html><body>Text<br/>Tail</body></html>'

# passing method='text' to tostring drops the tags and outputs all of the text
>>> print(etree.tostring(html, method='text'))
b'TextTail'
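
Two related options are worth a quick look: tostring takes a with_tail flag for serializing an element without its tail, and itertext yields the text pieces one by one. A minimal sketch on the same markup, parsed from a string here so it stands alone:

```python
from lxml import etree

doc = etree.XML('<html><body>Text<br/>Tail</body></html>')
br = doc.find('.//br')

# with_tail=False leaves out the text that follows the element.
br_only = etree.tostring(br, with_tail=False)
print(br_only)

# itertext() walks all text and tail strings in document order.
pieces = list(doc.itertext())
print(pieces)
```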

2. Using XPath

# Option 1: drop the tags and return the text as a single string
>>> print(html.xpath('string()'))
TextTail

# Option 2: return a list, with the text split at the tags
>>> print(html.xpath('//text()'))
['Text', 'Tail']

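Each item in the list returned by option 2 remembers the node it belongs to and whether it is regular text or tail text:

>>> texts = html.xpath('//text()')

>>> print(texts[0])
Text
# the node it belongs to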
>>> parent = texts[0].getparent()  
>>> print(parent.tag)
body

>>> print(texts[1], texts[1].getparent().tag)
Tail br

# kind of text: regular text or tail text
>>> print(texts[0].is_text)
True
>>> print(texts[1].is_text)
False
>>> print(texts[1].is_tail)
True
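
These flags can be combined into a small helper that labels every text node. The function name describe_text_nodes is invented for this sketch; it relies only on the getparent, is_text and is_tail members shown above:

```python
from lxml import etree

doc = etree.XML('<html><body>Text<br/>Tail</body></html>')

def describe_text_nodes(element):
    """Return (parent tag, kind, value) for every text node under element."""
    described = []
    for node in element.xpath('.//text()'):
        kind = 'tail' if node.is_tail else 'text'
        described.append((node.getparent().tag, kind, str(node)))
    return described

result = describe_text_nodes(doc)
print(result)
```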

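Parsing and output

This part covers how to parse XML data into Element objects and how to serialize Element objects back to XML.

1. Parsing

Three functions are commonly used for parsing: fromstring, XML and HTML. All of them take a string as their argument.

>>> xml_data = '<root>data</root>'

# fromstring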
>>> root1 = etree.fromstring(xml_data)
>>> print(root1.tag)
root
>>> print(etree.tostring(root1))
b'<root>data</root>'

# XML, which works essentially the same way as fromstring
>>> root2 = etree.XML(xml_data)
>>> print(root2.tag)
root
>>> print(etree.tostring(root2))
b'<root>data</root>'

# HTML, which adds the <html> and <body> tags automatically if they are missing
>>> root3 = etree.HTML(xml_data)
>>> print(root3.tag)
html
>>> print(etree.tostring(root3))
b'<html><body><root>data</root></body></html>'
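
The three functions above work on strings. For actual files, lxml.etree also provides parse, which takes a filename or a file-like object and returns an ElementTree. The sketch below uses an in-memory io.BytesIO as a stand-in for a file, and also shows the XMLSyntaxError raised for malformed input:

```python
import io
from lxml import etree

# parse() accepts a filename or a file-like object and returns an ElementTree.
tree = etree.parse(io.BytesIO(b'<root>data</root>'))
root = tree.getroot()
print(root.tag, root.text)

# Malformed XML raises etree.XMLSyntaxError.
try:
    etree.fromstring('<root><unclosed></root>')
    parse_failed = False
except etree.XMLSyntaxError as exc:
    parse_failed = True
    print('parse error:', exc)
```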

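2. Output

Output is simply the tostring function used throughout these notes. Two more of its parameters are worth adding here: xml_declaration, which emits the XML declaration, and encoding, which sets the output encoding.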

>>> root = etree.XML('<root><a><b/></a></root>')

>>> print(etree.tostring(root))
b'<root><a><b/></a></root>'

# XML declaration
>>> print(etree.tostring(root, xml_declaration=True))
b"<?xml version='1.0' encoding='ASCII'?>\n<root><a><b/></a></root>"

# specify the encoding
>>> print(etree.tostring(root, encoding='iso-8859-1'))
b"<?xml version='1.0' encoding='iso-8859-1'?>\n<root><a><b/></a></root>"
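
Two further tostring options come up often in practice: pretty_print for indented output and encoding='unicode' for getting a str rather than bytes. A brief sketch on the same tree:

```python
from lxml import etree

root = etree.XML('<root><a><b/></a></root>')

# pretty_print=True indents the output for human readers.
pretty = etree.tostring(root, pretty_print=True)
print(pretty.decode())

# encoding='unicode' returns a str rather than bytes.
as_text = etree.tostring(root, encoding='unicode')
print(as_text)
```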

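ElementPath

Before covering ElementPath we need the ElementTree class: an ElementTree object can be thought of as a complete XML tree in which every node is an Element object. ElementPath is a simple XPath-like path language (a subset of XPath) used to search for and locate Element objects.

Two commonly used methods cover most search needs; both take an ElementPath expression as their argument:
findall(): returns a list of all matching elements
find(): returns the first matching element, or None if nothing matches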

>>> root = etree.XML("<root><a x='123'>aText<b/><c/><b/></a></root>")

# find the first b tag: 'b' only matches direct children of root, so this returns None
>>> print(root.find('b'))
None
>>> print(root.find('a').tag)
a

# find all b tags anywhere below root; returns a list of Element objects
>>> [ b.tag for b in root.findall('.//b') ]
['b', 'b']

# find by attribute
>>> print(root.findall('.//a[@x]')[0].tag)
a
>>> print(root.findall('.//a[@y]'))
[]
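
A few neighbouring methods round out find and findall. The sketch below, on the same sample tree, uses findtext to fetch text directly, iter to walk the subtree, and xpath for expressions that go beyond ElementPath, such as the count() function:

```python
from lxml import etree

root = etree.XML("<root><a x='123'>aText<b/><c/><b/></a></root>")

# findtext() returns the text of the first match.
a_text = root.findtext('a')

# iter() walks the whole subtree, optionally filtered by tag name.
b_count = sum(1 for _ in root.iter('b'))

# xpath() supports full XPath 1.0, e.g. functions such as count().
b_count_xpath = root.xpath('count(//b)')

# Attribute-value predicates work in ElementPath as well.
match = root.find(".//a[@x='123']")
print(a_text, b_count, b_count_xpath, match.tag)
```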

There is also an HTML-specific module, lxml.html:

from lxml import html
import requests

Next, we use requests.get to fetch our data from the web page, parse it with the html module, and store the result in tree.

page = requests.get('http://www.nbzhuti.cn/')
tree = html.fromstring(page.text)

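tree now holds the whole HTML document in a neat tree structure, which we can query in two ways: XPath or CSS selectors. In this example we use the former.

XPath is a way of locating information in structured documents such as HTML or XML. W3Schools has a good introduction to XPath.

Many tools can give you an element's XPath, such as FireBug for Firefox or the Chrome Inspector. In Chrome, right-click the element, choose 'Inspect element', highlight the code, right-click again and choose 'Copy XPath'.

After a quick analysis, we see that the data on the page is held in two elements: a div whose title is 'buyer-name' and a span whose class is 'item-price':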

<div title="buyer-name">Carson Busses</div>
<span class="item-price">$29.95</span>

Knowing this, we can build the right XPath queries and use lxml's xpath method, like this:

# this builds the list of buyers:
buyers = tree.xpath('//div[@title="buyer-name"]/text()')
# this builds the list of prices:
prices = tree.xpath('//span[@class="item-price"]/text()')

Then just print buyers and prices to see the results.
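
To try the whole flow offline, the page can be replaced by an inline string. The markup below is hypothetical: the first buyer and price come from the snippet above, the second pair is invented so that the pairing step has something to do.

```python
from lxml import html

# Hypothetical page content; the element names mirror the snippet above.
page_text = """
<html><body>
  <div title="buyer-name">Carson Busses</div>
  <span class="item-price">$29.95</span>
  <div title="buyer-name">Another Buyer</div>
  <span class="item-price">$8.37</span>
</body></html>
"""

tree = html.fromstring(page_text)
buyers = tree.xpath('//div[@title="buyer-name"]/text()')
prices = tree.xpath('//span[@class="item-price"]/text()')

# Pair each buyer with the price at the same position.
for buyer, price in zip(buyers, prices):
    print(buyer, price)
```

Pairing by position with zip assumes every buyer has exactly one price in the same order; a real page may need a query scoped to each row instead.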
