Python Web Scraping: lxml

阿新 • Published: 2019-01-17

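Parsing an XML string
A downloaded web page arrives as a string. etree.fromstring(s) parses it and returns the root node as an etree._Element object (not an etree._ElementTree), and etree.tostring(node) serializes a node back to a string (a bytes object in Python 3 unless encoding='unicode' is passed).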

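from lxml import etree

xml_string = '<root><foo id="foo-id" class="foo zoo">Foo</foo><bar>中文</bar><baz></baz></root>'
root = etree.fromstring(xml_string)

# In Python 3, tostring() returns ASCII bytes by default, with non-ASCII
# characters written as character references, so decode before printing.
print(etree.tostring(root).decode())
# <root><foo id="foo-id" class="foo zoo">Foo</foo><bar>&#20013;&#25991;</bar><baz/></root>

print(etree.tostring(root, pretty_print=True).decode())
# baz has no children, so it is serialized as a self-closing tag
"""
<root>
  <foo id="foo-id" class="foo zoo">Foo</foo>
  <bar>&#20013;&#25991;</bar>
  <baz/>
</root>
"""

print(type(root))
# fromstring returns an _Element object, which is also the root node of the whole XML tree
# <class 'lxml.etree._Element'>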
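Downloaded pages often arrive as raw bytes that carry their own encoding declaration, so a short sketch of that case may help. The document below is made up for illustration and is not from the original example. It shows that fromstring accepts bytes with a declaration, that it rejects a str carrying one, and that passing encoding='unicode' to tostring keeps the Chinese text readable instead of turning it into character references.

```python
from lxml import etree

# Hypothetical payload: bytes as they might come back from an HTTP response.
raw = '<?xml version="1.0" encoding="utf-8"?><root><bar>中文</bar></root>'.encode('utf-8')

# Bytes input lets the parser honor the encoding declaration itself.
root = etree.fromstring(raw)

# Default serialization: ASCII bytes with character references.
ascii_out = etree.tostring(root)

# encoding='unicode' returns a str and leaves non-ASCII text as-is.
text_out = etree.tostring(root, encoding='unicode')

# A str that still contains an encoding declaration is rejected.
try:
    etree.fromstring(raw.decode('utf-8'))
    rejected = False
except ValueError:
    rejected = True

print(ascii_out)
print(text_out)
print(rejected)
```

In practice this means passing the response body as bytes to fromstring, rather than decoding it first, when the document declares its own encoding.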

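The Element structure
etree._Element is a neatly designed structure. It can be treated as an object, to read the node's own tag and text; as a list, whose items are its child nodes; or as a dictionary, to iterate over its attributes: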

foo = root[0]
result = {}
for attr, val in foo.items():
    result[attr] = val
print(result)
# {'id': 'foo-id', 'class': 'foo zoo'}

# Get the value of the id attribute on the foo tag
print(foo.get('id'))
# foo-id

# All attributes of the foo tag
print(foo.attrib)
# {'id': 'foo-id', 'class': 'foo zoo'}
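The snippet above only exercises the dictionary-style access. As a rough sketch, the three views of an element (object, list, dictionary) can be seen side by side on the same sample document. The attribute name 'title' and the fallback 'n/a' are made up here to show the default argument of get.

```python
from lxml import etree

root = etree.fromstring(
    '<root><foo id="foo-id" class="foo zoo">Foo</foo><bar>中文</bar><baz></baz></root>'
)
foo = root[0]

# As an object: the node's own tag and text
foo_tag = foo.tag
foo_text = foo.text
baz_text = root[2].text          # an empty element has no text (None)

# As a list: children are the items, so len(), indexing and iteration work
child_count = len(root)
child_tags = [child.tag for child in root]
last_tag = root[-1].tag

# As a dictionary: attributes are key/value pairs
attr_names = sorted(foo.keys())
missing = foo.get('title', 'n/a')  # hypothetical attribute, falls back to the default

print(foo_tag, foo_text, baz_text)
print(child_count, child_tags, last_tag)
print(attr_names, missing)
```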

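Element and ElementTree
XML is a tree structure. lxml uses etree._Element to represent a node in the tree and etree._ElementTree to represent the tree itself; etree.Element and etree.ElementTree are the corresponding factory functions.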

# Get the tree object that a node belongs to
t = root.getroottree()
print(t)
# <lxml.etree._ElementTree object at 0x7fedd2caca28>

# Get the root node of the tree
print(t.getroot())
# <Element root at 0x7fbd2ecb3a70>

# Build a tree from the foo node; that node becomes the root of the new tree
foo_tree = etree.ElementTree(root[0])
print(foo_tree)
# <lxml.etree._ElementTree object at 0x7f1bf391e998>
print(foo_tree.getroot().tag)
# foo
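To make the factory functions a little more concrete, here is a minimal sketch that builds a small tree from scratch rather than parsing one. The tag names 'page' and 'title' are invented for illustration, and etree.SubElement (a helper for appending a child) does not appear in the original text.

```python
from lxml import etree

# etree.Element is a factory: the object it returns is an etree._Element
page = etree.Element('page')

# etree.SubElement creates a child and appends it to the parent
title = etree.SubElement(page, 'title')
title.text = 'lxml'

# etree.ElementTree is a factory too: it wraps a root node in an etree._ElementTree
tree = etree.ElementTree(page)

is_element = isinstance(page, etree._Element)
is_tree = isinstance(tree, etree._ElementTree)
root_tag = tree.getroot().tag

# Any node can find its way back to the tree it lives in
owner_root_tag = title.getroottree().getroot().tag

serialized = etree.tostring(page, encoding='unicode')
print(is_element, is_tree, root_tag, owner_root_tag)
print(serialized)
```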

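XPath
Both _Element and _ElementTree have an xpath method. They differ as follows:
For a relative path, _Element.xpath is evaluated relative to the current node, while _ElementTree.xpath is evaluated relative to the root node.
For an absolute path, _Element.xpath is evaluated from the root of the tree that contains the current node (the tree returned by getroottree()), just like _ElementTree.xpath.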

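foo = root[0]

# Absolute path: evaluated from the root of the tree, even when called on foo
print(foo.xpath('/root')[0].tag)
# root
# Relative path: evaluated from the current node
print(foo.xpath('.')[0].tag)
# foo

t = root.getroottree()
print(t.xpath('/root')[0].tag)
# root
# Relative path on a tree: evaluated from the root node
print(t.xpath('.')[0].tag)
# root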
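The same rules can be sketched with paths that select something other than the context node. The expressions below are illustrative choices, not taken from the original. Attribute and text() queries return string results, and a relative path that matches nothing returns an empty list.

```python
from lxml import etree

root = etree.fromstring(
    '<root><foo id="foo-id" class="foo zoo">Foo</foo><bar>中文</bar><baz></baz></root>'
)
foo = root[0]
tree = root.getroottree()

# Relative paths on an element start at that element
ids = root.xpath('foo/@id')
texts = root.xpath('bar/text()')
no_match = foo.xpath('bar')  # foo has no bar child, so nothing matches
siblings = [e.tag for e in foo.xpath('following-sibling::*')]

# An absolute path on an element still starts at the root of its tree
from_foo = [e.tag for e in foo.xpath('/root/baz')]

# A relative path on the tree starts at the root node
from_tree = [e.tag for e in tree.xpath('foo')]

print(ids, texts, no_match)
print(siblings, from_foo, from_tree)
```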
