
Python Web Scraping Notes, Part 2: xpath from lxml

  1. First, understand that xpath is only an element selector. In Python it is provided by the lxml library, so you must install lxml before you can use xpath.
  2. Installing lxml is simple; see http://www.jianshu.com/p/2bc5aa0db486 for details.
    That link covers how to install lxml, a getting-started program for xpath, and the basics of xpath syntax.
pip install lxml
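Before touching a real site, the selector can be tried on an inline HTML string. This is a made-up fragment, not the Lianjia page, just to show the parse-then-select workflow:

```python
from lxml import etree

# parse an HTML fragment into an element tree
doc = etree.HTML("<div><span class='price'>500</span><span class='price'>620</span></div>")

# select every <span class='price'> and read its text
prices = [span.text for span in doc.xpath("//span[@class='price']")]
print(prices)
```

`etree.HTML` tolerates incomplete markup (no `<html>`/`<body>` needed), which is why it is handy for scraped pages.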

A record of the initial practice code:

from lxml import etree
import requests

url = 'http://sh.lianjia.com/ershoufang/'
region = 'pudong'
finalURL = url + region
price = 'p23'

# fetch the page once and decode it as UTF-8
r = requests.get(finalURL)
html = r.content.decode('utf-8')
dom_tree = etree.HTML(html)

links = dom_tree.xpath("//div/span[@class='info-col row2-text']/a")
for i in links:
    print(i.text)

links_yaoshi = dom_tree.xpath("//div/span[@class='c-prop-tag2']")
for i in links_yaoshi:
    print(i.text)

links_danjia = dom_tree.xpath("//span[@class='info-col price-item minor']")
for index in range(len(links_yaoshi)):
    print(index)
    print(links[index].text)
    print(links_yaoshi[index].text)
    print(links_danjia[index].text)

The idea was to extract the information under these different tags separately and merge it when printing, but the output format was inconsistent, and the mismatched element counts caused an index error.
Note that an xpath query can use "|" to combine multiple result sets, so the final code is as follows:
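As a side note, lxml returns the nodes of a "|" union in document order, which is why the fields belonging to one listing come out next to each other. A toy illustration on made-up markup (not the actual Lianjia DOM):

```python
from lxml import etree

html = """
<ul>
  <li><span class="title">Flat A</span><span class="price">500w</span></li>
  <li><span class="title">Flat B</span><span class="price">620w</span></li>
</ul>
"""
tree = etree.HTML(html)

# one query unions both selectors; results come back in document order
nodes = tree.xpath("//span[@class='title'] | //span[@class='price']")
print([n.text for n in nodes])
```

Because the union is document-ordered, each title is immediately followed by its own price, even though the two selectors were written separately.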

from lxml import etree
import requests
url = 'http://sh.lianjia.com/ershoufang/'
region = 'pudong'
finalURL = url+region
price = 'p23'
r = requests.get(finalURL)
html = r.content.decode('utf-8')
dom_tree = etree.HTML(html)
"""
links = dom_tree.xpath("//div/span[@class='info-col row2-text']/a")

for i in links:
    print(i.text)

links_yaoshi = dom_tree.xpath("//div/span[@class='c-prop-tag2']")

for i in links_yaoshi:
   print(i.text)

links_danjia = dom_tree.xpath("//span[@class='info-col price-item minor']")

for index in range(len(links_yaoshi)):
    print(index)
    print(links[index].text)
    print(links_yaoshi[index].text)
    print(links_danjia[index].text)
"""
data = dom_tree.xpath("//div[@class='info-table']/text()")
# info = data[0].xpath('string(.)').extract()[0]

dataRes = dom_tree.xpath("//div/span[@class='info-col row2-text']/a | //div/span[@class='c-prop-tag2'] | //span[@class='info-col price-item minor']")


for i in dataRes:
    print(i.text)

The printed result:
(screenshot: sample text output)

3. When writing code, you will sometimes see output of the form <Element a at 0x334fcb0>. This is a Python object; it shows the memory address at which element a is stored.
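The difference between the element object and its text is easy to see on a small inline example: printing the element shows its repr, while .text gives the string inside the tag.

```python
from lxml import etree

tree = etree.HTML("<div><a href='#'>link text</a></div>")
a = tree.xpath("//a")[0]

print(a)       # the element object's repr, e.g. <Element a at 0x...>
print(a.text)  # the text stored inside the tag
```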
4. To extract the content between elements, e.g. the content in the red box below:
(screenshot: DOM structure)
When running the following command:

# extract the content between elements
bloger = dom_tree.xpath("//div[@class='info-table']")
# "info-table" is the element two levels up in the DOM screenshot: <div class="info-table">
print(bloger[0].xpath('string(.)').strip())

the output is all of the text under the entire div (a nice surprise).
(screenshot: output)
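That behavior of string(.) -- gathering every piece of text below the node, including text inside child elements -- can be checked on a small made-up fragment:

```python
from lxml import etree

html = "<div class='info-table'><span>2 rooms</span> <span>88 sqm</span></div>"
node = etree.HTML(html).xpath("//div[@class='info-table']")[0]

# string(.) concatenates all text under the node, child elements included
print(node.xpath('string(.)').strip())
```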

5. So to extract all the text, the <li> elements can be selected directly, or "|" can be used to pull in the remaining information.
(screenshot: the other element structure)
The final code that extracts all the results:

# all the messages
all_message = dom_tree.xpath("//ul[@class='js_fang_list']/li")
print(all_message[0].xpath('string(.)').strip())  # prints only the first row

for index in range(len(all_message)):
    print(all_message[index].xpath('string(.)').strip())

A sample of the final output:
(screenshot: output)

That wraps up single-page scraping and outputting the corresponding text.

[Note] Applications of Python and xpath:

[Appendix] The final Fang.py file:

from lxml import etree
import requests
url = 'http://sh.lianjia.com/ershoufang/'
region = 'pudong'
finalURL = url+region
price = 'p23'
r = requests.get(finalURL)
html = r.content.decode('utf-8')
dom_tree = etree.HTML(html)
"""
links = dom_tree.xpath("//div/span[@class='info-col row2-text']/a")

for i in links:
    print(i.text)

links_yaoshi = dom_tree.xpath("//div/span[@class='c-prop-tag2']")

for i in links_yaoshi:
   print(i.text)

links_danjia = dom_tree.xpath("//span[@class='info-col price-item minor']")

for index in range(len(links_yaoshi)):
    print(index)
    print(links[index].text)
    print(links_yaoshi[index].text)
    print(links_danjia[index].text)
"""
data = dom_tree.xpath("//div[@class='info-table']/text()")

# extract the content between elements
bloger = dom_tree.xpath("//div[@class='info-table']")
print(bloger[0].xpath('string(.)').strip())

# all the messages
all_message = dom_tree.xpath("//ul[@class='js_fang_list']/li")
print(all_message[0].xpath('string(.)').strip())  # prints only the first row

for index in range(len(all_message)):
    print(all_message[index].xpath('string(.)').strip())


print(dom_tree.xpath("//*[@id='js-ershoufangList']/div[2]/div[3]/div[1]/ul/li[1]/div/div[2]/div[1]/span")[0].xpath('string(.)').strip())

# info = data[0].xpath('string(.)').extract()[0]
data_fangxing = dom_tree.xpath("//div/div[2]/div[1]/span[@class='info-col row1-text']/text()")
#results = etree.tostring(data_fangxing.pop, pretty_print=True)

#results = etree.tostring(data_fangxing.pop(0), pretty_print=True)
#print(results)

dataRes = dom_tree.xpath("//div/span[@class='info-col row2-text']/a | //div/div[2]/div[1]/span[@class='info-col row1-text'] | //div/span[@class='c-prop-tag2'] | //span[@class='info-col price-item minor']")


# for i in dataRes:
#     print(i.text)