Python crawler (parsing a web page with XPath and downloading photos)

阿新 • Published: 2019-01-15

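XPath (XML Path Language) is a language for finding information in XML documents. It is used to traverse the elements and attributes of an XML document.

lxml is an HTML/XML parser whose main job is parsing HTML/XML data and extracting information from it. Like Python's regular-expression engine, lxml is implemented in C (it is built on the libxml2 and libxslt libraries), which makes it a high-performance HTML/XML parser for Python. With the XPath syntax covered earlier, we can quickly locate specific elements and node information.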
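As a minimal sketch of how lxml and XPath work together, the snippet below parses a small HTML string and pulls out image links. The HTML fragment and the helper name `extract_img_links` are made up for illustration; only the XPath expression matches the one used by the spider in this post.

```python
from lxml import etree

# Hypothetical HTML fragment, invented for this example only.
SAMPLE_HTML = """
<html><body>
  <div class="user"><img class="img" src="/upload/a.jpg" alt="A"></div>
  <div class="user"><img class="img" src="/upload/b.jpg" alt="B"></div>
  <div class="ad"><img class="banner" src="/ads/x.gif"></div>
</body></html>
"""


def extract_img_links(html_text):
    """Return the src attribute of every <img> whose class is exactly "img"."""
    tree = etree.HTML(html_text)  # build an element tree from the HTML text
    return tree.xpath('//img[@class="img"]/@src')


img_links = extract_img_links(SAMPLE_HTML)
print(img_links)  # ['/upload/a.jpg', '/upload/b.jpg']
```

The `//img` step matches `<img>` elements anywhere in the document, the `[@class="img"]` predicate filters on the attribute value, and `/@src` selects the attribute itself rather than the element, so the result is a list of strings.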

Official lxml documentation: [http://lxml.de/index.html](http://lxml.de/index.html)

lxml depends on C libraries. It can be installed with pip: pip install lxml (or installed from a wheel).
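One quick way to confirm that the installation worked is to print the version of lxml and of the libxml2 C library it is using. This is only a sanity check and is not part of the original post.

```python
from lxml import etree

# LXML_VERSION is the version of the lxml package itself;
# LIBXML_VERSION is the version of the underlying libxml2 C library in use.
print("lxml:", ".".join(str(n) for n in etree.LXML_VERSION))
print("libxml2:", ".".join(str(n) for n in etree.LIBXML_VERSION))
```

If the import fails, the package (or its compiled extension) is not installed correctly for the current interpreter.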

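The code below scrapes the images from a dating site and saves them locally: it uses XPath to extract the image link addresses, then writes each image to disk. The code only fetches the images on a single page. By analyzing how the url (http://www.qyw520.com/user/list-1-0--0-0-0-0-0-0-0-0-0-0-0-0-0---0-0-0-2.html) changes, you can see that the page number is tied to the digits after list, so to fetch several pages of data you only need an offset that controls the page number.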

import urllib.request
import random
from lxml import etree


class MySpider:

    userName = 1  # counter used to number the saved image files
    
    def headers(self):
        """
        Pick a random User-Agent string for the request header.
        :return: a User-Agent value (without the "User-Agent:" field name)
        """
        # Only the header value belongs here; load_page() supplies the field name.
        headers_list = [
            "Mozilla/5.0(compatible;MSIE9.0;WindowsNT6.1;Trident/5.0)",
            "Mozilla/4.0(compatible;MSIE8.0;WindowsNT6.0;Trident/4.0)",
            "Mozilla/4.0(compatible;MSIE7.0;WindowsNT6.0)",
            "Mozilla/5.0(WindowsNT6.1;rv:2.0.1)Gecko/20100101Firefox/4.0.1",
            "Opera/9.80(WindowsNT6.1;U;en)Presto/2.8.131Version/11.11",
            "Mozilla/4.0(compatible;MSIE7.0;WindowsNT5.1;TencentTraveler4.0)",
            "Mozilla/4.0(compatible;MSIE7.0;WindowsNT5.1;Maxthon2.0)",
            "Mozilla/4.0(compatible;MSIE7.0;WindowsNT5.1;360SE)",
            "Mozilla/4.0(compatible;MSIE7.0;WindowsNT5.1)",
        ]
        ua_agent = random.choice(headers_list)
        return ua_agent

    def load_page(self, url, header):
        """Fetch url with the given User-Agent and return the raw response bytes."""
        headers = {"User-Agent": header}
        request = urllib.request.Request(url, headers=headers)
        response = urllib.request.urlopen(request)

        return response.read()

    def parse(self, html):
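        # Decode the raw bytes, then parse the text into an HTML element tree
        content = html.decode("utf-8")

        selector = etree.HTML(content)
        # Collect the src attribute of every <img class="img"> element
        img_links = selector.xpath('//img[@class="img"]/@src')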

        for link in img_links:
            self.write_img(link)

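    def write_img(self, imgurl):
        print("Saving file %d ..." % self.userName)
        # 1. Fetch the image bytes first, so a failed request leaves no empty file
        images = urllib.request.urlopen('http://www.qyw520.com' + imgurl)
        data = images.read()
        # 2. Open the output file (the images/ directory must already exist)
        with open('images/' + str(self.userName) + '.png', 'wb') as f:
            # 3. Call the file object's write() method to store the image bytes
            f.write(data)
        print("File %d saved successfully!" % self.userName)
        self.userName += 1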

    def main(self, url):
        header = self.headers()
        html = self.load_page(url, header)
        self.parse(html)

if __name__ == "__main__":

    url = "http://www.qyw520.com/user/list-1-0--0-0-0-0-0-0-0-0-0-0-0-0-0---0-0-0-2.html"

    myspider = MySpider()
    myspider.main(url)
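The multi-page idea mentioned before the listing could be sketched roughly as follows. The post only says that the page number is tied to the digits after `list`; it does not say which field it is, so this sketch assumes it is the first number after `list-`. The helper names `build_page_url` and `page_urls` are hypothetical, and the pattern would need adjusting if the page number turns out to live in a different field.

```python
import re

BASE_URL = (
    "http://www.qyw520.com/user/"
    "list-1-0--0-0-0-0-0-0-0-0-0-0-0-0-0---0-0-0-2.html"
)


def build_page_url(base_url, page):
    """Replace the first number after 'list-' with the given page number.

    Assumption: that first number is the page index (not confirmed by the post).
    """
    return re.sub(
        r"(/list-)\d+",
        lambda m: m.group(1) + str(page),
        base_url,
        count=1,
    )


def page_urls(base_url, first_page, last_page):
    """Return the URLs for pages first_page..last_page, inclusive."""
    return [build_page_url(base_url, p) for p in range(first_page, last_page + 1)]


for page_url in page_urls(BASE_URL, 1, 3):
    print(page_url)
```

Each generated URL could then be passed to `main()` on a single `MySpider` instance, so that the `userName` counter keeps increasing and files from later pages do not overwrite earlier ones.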
