Study notes on the Python lxml package

阿新 • Published: 2019-01-22

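The Python lxml package parses XML and HTML files and can locate elements with XPath and CSS selectors. In my opinion it is more powerful and more flexible than BeautifulSoup. This post lists the commonly used functions, based on the official lxml documentation and my own understanding. The code in this post was written for Python 3.4 and lxml 2.0.

Parsing XML, using text from the PubMed literature database as an example

Loading an XML string

There are several ways to load an XML string. The one I use most often is lxml.etree.XML(xml_string); etree.fromstring(xml_string) works as well.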

import lxml.etree
import urllib.request
from lxml.etree import *
str_url = 'http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=pubmed&id=26693255&retmode=text&rettype=xml'
request = urllib.request.Request(str_url)
xml_text = urllib.request.urlopen(request).read()
root = lxml.etree.XML(xml_text)  # xml_text holds the raw XML returned by the server

root is an lxml.etree._Element object, which provides many methods.

These include find, findall, xpath, get, and getchildren; run help(root) for the full details.
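
To try the two loading functions without a network request, the sketch below parses a small inline document. The XML here is a hypothetical, heavily trimmed stand-in for an efetch response, not real PubMed output; only `etree.XML` and `etree.fromstring` come from the text above.

```python
from lxml import etree

# Hypothetical, trimmed-down record used only for illustration.
xml_text = b"""<?xml version="1.0" encoding="UTF-8"?>
<PubmedArticleSet>
  <PubmedArticle>
    <MedlineCitation>
      <Article>
        <Journal>
          <ISSN IssnType="Print">1866-9956</ISSN>
          <Title>Cognitive computation</Title>
        </Journal>
      </Article>
    </MedlineCitation>
  </PubmedArticle>
</PubmedArticleSet>"""

# Both calls parse the bytes and return the root element.
root_a = etree.XML(xml_text)
root_b = etree.fromstring(xml_text)

print(root_a.tag)                                        # PubmedArticleSet
print(etree.tostring(root_a) == etree.tostring(root_b))  # True
```

Passing bytes, as urlopen(...).read() does in the original code, lets the parser pick up the encoding from the XML declaration.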

### findall, find
findall(…)
| findall(self, path, namespaces=None)
|
| Finds all matching subelements, by tag name or path.
| Takes the tag of a direct child element, or a relative path (for example one starting with .//), and returns all matching elements as a list
| The optional namespaces argument accepts a
| prefix-to-namespace mapping that allows the usage of XPath
| prefixes in the path expression.

# Example: get the journal title and ISSN
# With plain tag names you have to descend one level at a time
journal_name = root.find('PubmedArticle').find('MedlineCitation').find('Article').find('Journal').find('Title').text
print('tag:', journal_name)

tag: Cognitive computation

# A path expression also works (it must be relative, e.g. starting with .//; for absolute paths use the xpath method instead)
journal_name = root.find('.//Title').text
print('xpath:' ,journal_name)

xpath: Cognitive computation
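
As a small sketch of the three ways of addressing the same element with find, the example below uses a hypothetical inline document shaped like the record above. The slash-separated form is an addition to what the text shows: find also accepts a relative path made of several steps.

```python
from lxml import etree

# Hypothetical, trimmed-down record used only for illustration.
root = etree.XML(b"""<PubmedArticleSet>
  <PubmedArticle>
    <MedlineCitation>
      <Article>
        <Journal>
          <ISSN IssnType="Print">1866-9956</ISSN>
          <Title>Cognitive computation</Title>
        </Journal>
      </Article>
    </MedlineCitation>
  </PubmedArticle>
</PubmedArticleSet>""")

# 1. One find() call per level.
stepwise = (root.find('PubmedArticle').find('MedlineCitation')
                .find('Article').find('Journal').find('Title').text)

# 2. A single relative path with one step per level.
by_path = root.find('PubmedArticle/MedlineCitation/Article/Journal/Title').text

# 3. A descendant search at any depth.
anywhere = root.find('.//Title').text

# A bare tag name only matches direct children, so this finds nothing.
not_a_child = root.find('Title')

print(stepwise, by_path, anywhere, not_a_child, sep=' | ')
```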

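# text is an attribute of the element object and holds the element's text content.
# To read an attribute of the tag itself, use the get method.
# For example, to read the IssnType attribute of <ISSN IssnType="Print">1866-9956</ISSN>: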
issn_attr = root.find('.//ISSN').get('IssnType')
print('issn attr:', issn_attr)

issn attr: Print

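# Using the tostring function
# It serializes the tag together with everything under it. tostring is a module-level function of lxml.etree, so run from lxml.etree import * before using it
tostring(root.find('.//JournalIssue'))  # the full content of the JournalIssue tag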

b'<JournalIssue CitedMedium="Print">\n                    <Volume>7</Volume>\n                    <Issue>6</Issue>\n                    <PubDate>\n                        <MedlineDate>2015</MedlineDate>\n                    </PubDate>\n                </JournalIssue>\n                '

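findall works like find but returns every match as a list. When at least one element matches, find('tag') is equivalent to findall('tag')[0]; when nothing matches, find returns None, whereas findall returns an empty list.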
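
A minimal sketch of the relationship between find and findall, using a hypothetical author list rather than real PubMed data:

```python
from lxml import etree

# Hypothetical author list used only for illustration.
root = etree.XML(
    b'<AuthorList>'
    b'<Author><LastName>Li</LastName></Author>'
    b'<Author><LastName>Wang</LastName></Author>'
    b'</AuthorList>'
)

all_names = [e.text for e in root.findall('.//LastName')]  # every match
first_name = root.find('.//LastName').text                 # first match only

# Nothing in the document matches this tag.
missing_one = root.find('.//Affiliation')
missing_all = root.findall('.//Affiliation')

print(all_names, first_name, missing_one, missing_all)
# ['Li', 'Wang'] Li None []
```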

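The xpath function

For an introduction to XPath itself, see http://www.w3school.com.cn/xpath/xpath_syntax.asp
Like findall, xpath returns a list. The differences are that it accepts only XPath expressions, and that those expressions may be either relative or absolute paths.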

journal_name = root.xpath('//Title')[0].text
print(journal_name)

Cognitive computation
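
The sketch below illustrates a few XPath forms that the xpath method accepts but find and findall do not: an absolute path, an attribute predicate combined with text(), and an attribute selection. The inline document is a hypothetical, trimmed-down record.

```python
from lxml import etree

# Hypothetical, trimmed-down record used only for illustration.
root = etree.XML(b"""<PubmedArticleSet>
  <PubmedArticle>
    <MedlineCitation>
      <Article>
        <Journal>
          <ISSN IssnType="Print">1866-9956</ISSN>
          <Title>Cognitive computation</Title>
        </Journal>
      </Article>
    </MedlineCitation>
  </PubmedArticle>
</PubmedArticleSet>""")

# Absolute path from the document root; the result is a list of elements.
titles_abs = [e.text for e in root.xpath(
    '/PubmedArticleSet/PubmedArticle/MedlineCitation/Article/Journal/Title')]

# text() returns the strings directly; the predicate filters on an attribute.
print_issn = root.xpath('//ISSN[@IssnType="Print"]/text()')

# @name returns attribute values.
issn_types = root.xpath('//ISSN/@IssnType')

# A relative expression is evaluated from the element it is called on.
journal = root.find('.//Journal')
titles_rel = journal.xpath('./Title/text()')

print(titles_abs, print_issn, issn_types, titles_rel)
```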

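The getchildren function

Returns all direct child elements. The lxml documentation marks getchildren as deprecated; list(element), or iterating over the element directly, gives the same result.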
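
As a short sketch of reading direct children, the example below treats the element as a sequence, which is the approach the lxml documentation suggests in place of getchildren. The Journal fragment is hypothetical.

```python
from lxml import etree

# Hypothetical fragment used only for illustration.
journal = etree.XML(
    b'<Journal>'
    b'<ISSN IssnType="Print">1866-9956</ISSN>'
    b'<Title>Cognitive computation</Title>'
    b'</Journal>'
)

children = list(journal)                 # direct children as a list
tags = [child.tag for child in journal]  # iterating yields the same elements
count = len(journal)                     # number of direct children
first_child = journal[0]                 # index access also works

print(tags, count, first_child.text)
# ['ISSN', 'Title'] 2 1866-9956
```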

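Note: when using findall, find, or xpath, always confirm that the element exists (for example with an if check) before reading its text attribute. Otherwise you will run into errors such as "Type 'NoneType' cannot be serialized.", "list index out of range", or "'NoneType' object has no attribute 'text'".

Besides the read functions above, lxml also includes many functions for modifying a document. They are powerful; see the official lxml documentation for details.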
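
One way to apply this check is sketched below. `text_or_default` is a hypothetical helper written for this example, not part of lxml; `findtext` with a `default` argument is an existing element method that covers the same case for find-style paths.

```python
from lxml import etree

# Hypothetical fragment: it has a Title but no AbstractText.
root = etree.XML(
    b'<Article><Journal><Title>Cognitive computation</Title></Journal></Article>'
)

def text_or_default(element, path, default=None):
    """Hypothetical helper: text of the first match for path, else default."""
    found = element.find(path)
    # Test with "is not None": an lxml element with no children can
    # evaluate as false, so "if found:" is not a safe existence check.
    return found.text if found is not None else default

title = text_or_default(root, './/Title')
abstract = text_or_default(root, './/AbstractText', default='')

# Built-in shortcut for the same pattern.
abstract_builtin = root.findtext('.//AbstractText', default='')

# xpath returns a list, so check that it is non-empty before indexing.
hits = root.xpath('//AbstractText')
first_hit = hits[0].text if hits else None

print(repr(title), repr(abstract), repr(abstract_builtin), first_hit)
```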

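Parsing HTML with lxml, using a scrape of the weekly word-of-mouth chart on the Douban Movie home page as an example

To load an HTML string, use lxml.html.fromstring(html_text)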

import lxml.html
str_url = 'http://movie.douban.com/'
request = urllib.request.Request(str_url)
html_text = urllib.request.urlopen(request).read()
root = lxml.html.fromstring(html_text)

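find and findall are still available and work exactly as in the XML section: they take the tag of a direct child or a relative path as input, so they are not covered again here.

The cssselect() function returns a list of all matching elements. It accepts CSS selectors, much like jQuery.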

# Get the names of all entries in the chart on this page
movies_list = [a.text for a in  root.cssselect('div.billboard-bd tr td a')]
print(movies_list)

['老炮兒', '八惡人', '卡羅爾', '海街日記', '荒野獵人', '尋龍訣', '丹麥女孩', '龍蝦', '邊境殺手', '實習生']

# Get the hyperlinks of all the movies
movies_href = [a.get('href') for a in  root.cssselect('div.billboard-bd tr td a')]
print(movies_href)

['http://movie.douban.com/subject/24751756/', 'http://movie.douban.com/subject/25787888/', 'http://movie.douban.com/subject/10757577/', 'http://movie.douban.com/subject/25895901/', 'http://movie.douban.com/subject/5327268/', 'http://movie.douban.com/subject/3077412/', 'http://movie.douban.com/subject/3071604/', 'http://movie.douban.com/subject/20514947/', 'http://movie.douban.com/subject/25881247/', 'http://movie.douban.com/subject/10594965/']
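
The live page changes over time, so the sketch below runs the same selector against a small inline page. The markup is a hypothetical reduction of the chart, not Douban's actual HTML. In recent lxml releases cssselect() relies on the separately installed cssselect package and raises ImportError without it, so the sketch also spells out an XPath expression intended to select the same links and falls back to it.

```python
import lxml.html

# Hypothetical markup that only mimics the structure the selector expects.
html_text = (
    '<html><body><div class="billboard-bd"><table>'
    '<tr><td class="order">1</td><td class="title">'
    '<a href="http://movie.douban.com/subject/24751756/">老炮兒</a></td></tr>'
    '<tr><td class="order">2</td><td class="title">'
    '<a href="http://movie.douban.com/subject/25787888/">八惡人</a></td></tr>'
    '</table></div>'
    '<div class="other"><a href="http://movie.douban.com/">home</a></div>'
    '</body></html>'
)
root = lxml.html.fromstring(html_text)

# XPath counterpart of the CSS selector 'div.billboard-bd tr td a'.
xpath_expr = ('//div[contains(concat(" ", normalize-space(@class), " "), '
              '" billboard-bd ")]//tr//td//a')
links_xpath = root.xpath(xpath_expr)

try:
    links = root.cssselect('div.billboard-bd tr td a')
except ImportError:
    # The cssselect package is not installed; use the XPath result instead.
    links = links_xpath

titles = [a.text for a in links]
hrefs = [a.get('href') for a in links]
print(titles)
print(hrefs)
```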

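Other functions

text_content() returns all the text under the element (with every <> tag stripped)

.make_links_absolute(base_href, resolve_base_href=True): links are sometimes relative paths; this function converts them to absolute ones

.rewrite_links(link_repl_func, resolve_base_href=True, base_href=None): rewrites every link using the given replacement function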
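
A minimal sketch of these three methods on a hypothetical fragment. The link paths and the http-to-https replacement function are made up for illustration.

```python
import lxml.html

# Hypothetical fragment with one root-relative and one relative link.
root = lxml.html.fromstring(
    '<div><p>See <a href="/subject/1/">movie <b>one</b></a> and '
    '<a href="subject/2/">movie two</a>.</p></div>'
)

# All text below the element, with the tags stripped.
text = root.text_content()

# Resolve every link against a base URL (modifies the tree in place).
root.make_links_absolute('http://movie.douban.com/')
absolute = [a.get('href') for a in root.iter('a')]

# Pass every link through a replacement function (also in place).
root.rewrite_links(lambda link: link.replace('http://', 'https://'))
rewritten = [a.get('href') for a in root.iter('a')]

print(text)
print(absolute)
print(rewritten)
```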
