
Crawler (XPath): scraping tieba.baidu.com (bug)


Tool: Python 3

Problem: I ran into trouble when running loadPage.

link_list = content.xpath('//div[@class="t_con cleafix"]/div/div/div/a/@href')
In XPath Helper, this XPath expression finds the corresponding href values:

[Screenshot]

However, when the program runs, link_list = content.xpath('//div[@class="t_con cleafix"]/div/div/div/a/@href') returns an empty list:

[Screenshot]

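Opening the two printed fullurl values in a browser loads the correct page in both cases, so the URL passed in from the previous step is not the problem.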

So what is the actual cause?

import urllib.parse
import urllib.request

from lxml import etree


class Spider:
    def __init__(self):
        self.headers = {
            "User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/66.0.3359.181 Safari/537.36",
        }

    def loadPage(self, link):
        """Download a listing page and follow every thread link on it."""
        print("Downloading data...")
        request = urllib.request.Request(link, headers=self.headers)
        html = urllib.request.urlopen(request).read()
        # Save the raw response bytes so the HTML can be inspected later
        with open("meinvba.txt", "wb") as f:
            f.write(html)
        # Parse the HTML document into an HTML DOM
        content = etree.HTML(html)
        print(content)
        # Collect every match as a list (this is the call that comes back empty)
        link_list = content.xpath('//div[@class="t_con cleafix"]/div/div/div/a/@href')
        print(link_list)
        for i in link_list:
            print("__4__")
            fulllink = "http://tieba.baidu.com" + i
            self.loadImage(fulllink)
            print("___3___")

    def loadImage(self, link):
        """Extract the image links from a single thread."""
        request = urllib.request.Request(link, headers=self.headers)
        html = urllib.request.urlopen(request).read()
        content = etree.HTML(html)
        link_list = content.xpath('//img[@class="BDE_Image"]/@src')
        print("____1____")
        for src in link_list:
            self.writeImage(src)

    def writeImage(self, link):
        """Download one image and save it under the last five characters of its URL."""
        request = urllib.request.Request(link, headers=self.headers)
        image = urllib.request.urlopen(request).read()
        filename = link[-5:]
        print("___2____")
        with open(filename, "wb") as f:
            f.write(image)
        print("*" * 30)

    def startWork(self, kw, beginpage, endpage):
        """Drive the crawler over the requested page range."""
        url = "http://tieba.baidu.com/f?"
        key = urllib.parse.urlencode({"kw": kw})
        print("key:" + key)
        for page in range(int(beginpage), int(endpage) + 1):
            pn = (page - 1) * 50
            # Rebuild the URL on every pass; appending to one variable would stack "&pn=" values
            fullurl = url + key + "&pn=" + str(pn)
            self.loadPage(fullurl)
            # print("fullurl:" + fullurl)


if __name__ == "__main__":
    tiebaSpider = Spider()
    kw = input("Enter the name of the tieba to crawl: ")
    beginpage = input("Enter the first page: ")
    endpage = input("Enter the last page: ")
    tiebaSpider.startWork(kw, beginpage, endpage)
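A side note on the page loop in startWork: each page offset is computed as pn = (page - 1)*50 and then added to the URL as a pn parameter. The sketch below shows one way to build the listing URLs so that every URL carries exactly one pn value. The helper name build_page_urls is hypothetical and not part of the original spider; the base URL and the page size of 50 are taken from the code in this post.

```python
from urllib.parse import urlencode

BASE_URL = "http://tieba.baidu.com/f?"  # same base URL as in startWork
PAGE_SIZE = 50  # offset step taken from the original pn formula


def build_page_urls(kw, beginpage, endpage):
    """Return one listing URL per page (hypothetical helper, for illustration)."""
    urls = []
    for page in range(int(beginpage), int(endpage) + 1):
        pn = (page - 1) * PAGE_SIZE
        # Encode kw and pn together so each URL is built from scratch
        urls.append(BASE_URL + urlencode({"kw": kw, "pn": pn}))
    return urls


for url in build_page_urls("python", 1, 3):
    print(url)
```

Printing the URLs before requesting them makes it easy to spot a duplicated or missing parameter.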

I would really like to know where this goes wrong!
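One commonly reported cause of this symptom, which the post itself does not confirm, is that the server sends the thread list wrapped in HTML comments (<!-- ... -->) and the browser uncomments it with JavaScript. XPath Helper works on the final DOM in the browser, while etree.HTML only sees the raw response, where commented-out markup is not part of the element tree. The sketch below is a way to test that hypothesis on a saved response. The helper names and the sample HTML are made up for illustration.

```python
import re

COMMENT_RE = re.compile(r"<!--(.*?)-->", re.DOTALL)


def marker_only_in_comments(html, marker):
    """True if marker occurs in html, but only inside <!-- ... --> comments."""
    if marker not in html:
        return False
    outside = COMMENT_RE.sub("", html)  # drop comments together with their content
    return marker not in outside


def strip_comment_delimiters(html):
    """Remove the comment delimiters but keep the markup they were hiding."""
    return html.replace("<!--", "").replace("-->", "")


# Made-up sample that mimics the suspected shape of the response
sample = (
    '<html><body><code id="list"><!--'
    '<div class="t_con cleafix"><a href="/p/123">thread</a></div>'
    "--></code></body></html>"
)

print(marker_only_in_comments(sample, "t_con cleafix"))   # True
exposed = strip_comment_delimiters(sample)
print(marker_only_in_comments(exposed, "t_con cleafix"))  # False
```

If the first check returns True for the real response text, passing the output of strip_comment_delimiters to etree.HTML before calling xpath would be the next thing to try.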
