Efficiently Extracting Text and Tag Attribute Values with XPath in lxml

阿新 • Published: 2018-12-18

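# The goal of scraping a web page is simply to locate nodes in the DOM tree, then read their text or attribute values

from lxml import etree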

myPage = '''<html>
<title>TITLE</title>
<body>
<h1>My Blog</h1>
<div>My Articles</div>
<div id="photos">
<img src="pic1.jpeg"/><span id="pic1">PIC1 is beautiful!</span>
<img src="pic2.jpeg"/><span id="pic2">PIC2 is beautiful!</span>
<p><a href="http://www.example.com/more_pic.html">More pictures</a></p>
<a href="http://www.baidu.com">Go to Baidu</a>
<a href="http://www.163.com">Go to NetEase</a>
<a href="http://www.sohu.com">Go to Sohu</a>
</div>
<p class="myclassname">Hello,\nworld!<br/>-- by Adam</p>
<div class="foot">Some other notes placed in the footer</div>
</body>
</html>'''

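# fromstring parses the string as XML, which works here because the sample markup is well-formed
# (etree.HTML is lxml's lenient parser for real-world HTML)
html = etree.fromstring(myPage)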

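# 1. Locating nodes
divs1 = html.xpath('//div')                 # every div in the document
divs2 = html.xpath('//div[@id]')            # divs that have an id attribute
divs3 = html.xpath('//div[@class="foot"]')  # divs whose class is exactly "foot"
divs4 = html.xpath('//div[@*]')             # divs that have at least one attribute
divs5 = html.xpath('//div[1]')              # divs that are the first div child of their parent
divs6 = html.xpath('//div[last()-1]')       # divs that are the second-to-last div child of their parent
divs7 = html.xpath('//div[position()<3]')   # divs among the first two div children of their parent
divs8 = html.xpath('//div|//h1')            # union: every div plus every h1
divs9 = html.xpath('//div[not(@*)]')        # divs that have no attributes at all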
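Positional predicates such as `[1]` are evaluated per parent, which is easy to misread as "the first match in the document". The sketch below illustrates the difference on a small hypothetical fragment with nested divs (the markup and the ids `a`, `a1`, `b1` are invented for this example and are not part of the sample page above).

```python
from lxml import etree

# Hypothetical markup with nested divs, used only for this illustration.
doc = etree.fromstring(
    '<body>'
    '<div id="a"><div id="a1"/><div id="a2"/></div>'
    '<div id="b"><div id="b1"/></div>'
    '</body>'
)

# //div[1]: every div that is the first div child of its own parent.
per_parent = doc.xpath('//div[1]/@id')

# (//div)[1]: the single first div in document order.
first_overall = doc.xpath('(//div)[1]/@id')

print(per_parent)     # ['a', 'a1', 'b1']
print(first_overall)  # ['a']
```

In the sample page all three divs share one parent, so the two forms happen to agree there; on pages with nested divs they usually do not.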

# 2. Extracting text: text() vs. html.xpath('string()')
# text() returns a list of the text nodes directly under each matched element;
# string() returns a single string that also includes the text of all descendants
text1 = html.xpath('//div/text()')
text2 = html.xpath('//div[@id]/text()')
text3 = html.xpath('//div[@class="foot"]/text()')
text4 = html.xpath('//div[@*]/text()')
text5 = html.xpath('//div[1]/text()')
text6 = html.xpath('//div[last()-1]/text()')
text7 = html.xpath('//div[position()<3]/text()')
text8 = html.xpath('//div/text()|//h1/text()')
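To make the text() versus string() distinction concrete, here is a minimal sketch on a hypothetical fragment (the markup is invented for illustration). It assumes the element contains a nested tag, which is where the two approaches diverge.

```python
from lxml import etree

# Hypothetical fragment: text on both sides of a nested element.
div = etree.fromstring('<div>Intro <a href="#">link text</a> outro</div>')

# text() yields one string per text node that is a direct child of div.
pieces = div.xpath('text()')

# string() concatenates the text of div and all of its descendants.
whole = div.xpath('string()')

print(pieces)  # ['Intro ', ' outro']
print(whole)   # Intro link text outro
```

text() is the better fit when only an element's own text is wanted; string() is convenient when the full visible text of a subtree should come back as one value.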

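# 3. Extracting attribute values with @
value1 = html.xpath('//a/@href')          # href of every a element
value2 = html.xpath('//img/@src')         # src of every img element
value3 = html.xpath('//div[2]/span/@id')  # id of each span directly under the second div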
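An `@attr` expression returns a flat list of values and silently skips elements that lack the attribute, so the result cannot always be zipped back onto the elements. One possible way to keep text and attribute paired, sketched on a hypothetical fragment (the markup and URLs are invented for illustration), is to select the elements and call `get()` on each:

```python
from lxml import etree

# Hypothetical fragment: the middle link has no href attribute.
nav = etree.fromstring(
    '<div>'
    '<a href="http://www.example.com/a">A</a>'
    '<a>no link</a>'
    '<a href="http://www.example.com/b">B</a>'
    '</div>'
)

# Only attributes that actually exist appear in the result.
hrefs = nav.xpath('a/@href')

# Selecting elements first keeps each text aligned with its own href;
# get() returns None when the attribute is missing.
pairs = [(a.text, a.get('href')) for a in nav.xpath('a')]

print(hrefs)  # ['http://www.example.com/a', 'http://www.example.com/b']
print(pairs)
```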

# 4. Locating nodes (advanced)
# 1. The find and findall methods of a document (DOM) Element
divs = html.xpath('//div[position()<3]')
for div in divs:
    ass = div.findall('a')  # finds only direct children (div->a), not nested links (div->p->a)
    for a in ass:
        # print(dir(a))  # uncomment to list the Element's members
        print(a.text, a.attrib.get('href'))  # useful Element attributes: text, attrib

# 2. Equivalent to 1: selects the same direct-child links, but returns only their href values
a_href = html.xpath('//div[position()<3]/a/@href')
print(a_href)

# 3. Note the difference from 1 and 2: // also matches links nested deeper, such as div->p->a
a_href = html.xpath('//div[position()<3]//a/@href')
print(a_href)
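The same child-versus-descendant distinction applies when querying from an element rather than from the whole tree. The sketch below, on a hypothetical fragment (markup, ids, and file names invented for illustration), compares `findall` with element-relative XPath and shows that a path starting with `//` ignores the element it is called on.

```python
from lxml import etree

# Hypothetical page with one nested link, one direct link, and one link elsewhere.
page = etree.fromstring(
    '<body>'
    '<div id="photos">'
    '<p><a href="more.html">more</a></p>'
    '<a href="one.html">one</a>'
    '</div>'
    '<div id="foot"><a href="foot.html">foot</a></div>'
    '</body>'
)
photos = page.xpath('//div[@id="photos"]')[0]

# findall('a'): direct children only.
direct = [a.get('href') for a in photos.findall('a')]

# findall('.//a'): any depth below this element.
nested = [a.get('href') for a in photos.findall('.//a')]

# './/a' in XPath is also relative to this element.
relative = photos.xpath('.//a/@href')

# '//a' is absolute: it searches the whole document, not just this div.
absolute = photos.xpath('//a/@href')

print(direct)    # ['one.html']
print(nested)    # ['more.html', 'one.html']
print(relative)  # ['more.html', 'one.html']
print(absolute)  # ['more.html', 'one.html', 'foot.html']
```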
