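XPath: Scraping Images from Baidu Tieba and Saving Them Locally

By 阿新 • Published: 2018-11-09

With XPath, we can first parse an HTML file into an XML-style document tree and then use XPath expressions to locate HTML nodes or elements.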

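What is XML

XML stands for EXtensible Markup Language.
XML is a markup language, much like HTML.
XML was designed to carry data, not to display it.
XML tags are not predefined; we define our own.
XML is designed to be self-descriptive.
XML is a W3C Recommendation.

W3School tutorial: http://www.w3school.com.cn/xml/index.asp

Differences between XML and HTML

Example XML document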

<?xml version="1.0" encoding="utf-8"?>

<bookstore> 

  <book category="cooking"> 
    <title lang="en">Everyday Italian</title>  
    <author>Giada De Laurentiis</author>  
    <year>2005</year>  
    <price>30.00</price> 
  </book>  

  <book category="children"> 
    <title lang="en">Harry Potter</title>  
    <author>J K. Rowling</author>  
    <year>2005</year>  
    <price>29.99</price> 
  </book>  

  <book category="web"> 
    <title lang="en">XQuery Kick Start</title>  
    <author>James McGovern</author>  
    <author>Per Bothner</author>  
    <author>Kurt Cagle</author>  
    <author>James Linn</author>  
    <author>Vaidyanathan Nagarajan</author>  
    <year>2003</year>  
    <price>49.99</price> 
  </book> 

  <book category="web" cover="paperback"> 
    <title lang="en">Learning XML</title>  
    <author>Erik T. Ray</author>  
    <year>2003</year>  
    <price>39.95</price> 
  </book> 

</bookstore>

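HTML DOM model example

The HTML DOM defines a standard way to access and manipulate HTML documents, and it represents an HTML document as a tree.

Node relationships in XML

1. Parent

Every element and attribute has exactly one parent.

In the following simple XML example, the book element is the parent of the title, author, year, and price elements: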


<?xml version="1.0" encoding="utf-8"?>

<book>
  <title>Harry Potter</title>
  <author>J K. Rowling</author>
  <year>2005</year>
  <price>29.99</price>
</book>

2. Children

An element node may have zero, one, or more children.

In the following example, the title, author, year, and price elements are all children of the book element:

<?xml version="1.0" encoding="utf-8"?>

<book>

  <title>Harry Potter</title>

  <author>J K. Rowling</author>

  <year>2005</year>

  <price>29.99</price>

</book>

3. Siblings

Siblings are nodes that share the same parent.

In the following example, the title, author, year, and price elements are all siblings:

<?xml version="1.0" encoding="utf-8"?>

<book>

  <title>Harry Potter</title>

  <author>J K. Rowling</author>

  <year>2005</year>

  <price>29.99</price>

</book>

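4. Ancestors

A node's ancestors are its parent, its parent's parent, and so on.

In the following example, the ancestors of the title element are the book element and the bookstore element: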

<?xml version="1.0" encoding="utf-8"?>

<bookstore>

<book>

  <title>Harry Potter</title>

  <author>J K. Rowling</author>

  <year>2005</year>

  <price>29.99</price>

</book>

</bookstore>

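5. Descendants

A node's descendants are its children, its children's children, and so on.

In the following example, the descendants of bookstore are the book, title, author, year, and price elements: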

<?xml version="1.0" encoding="utf-8"?>

<bookstore>

<book>

  <title>Harry Potter</title>

  <author>J K. Rowling</author>

  <year>2005</year>

  <price>29.99</price>

</book>

</bookstore>
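
The five relationships described in this section can be checked in a few lines of Python 3. This is only an illustrative sketch: it uses the standard library's xml.etree.ElementTree instead of lxml, and because ElementTree elements keep no pointer to their parent, it first builds a child-to-parent dictionary (the name parent_of is made up for this example).

```python
import xml.etree.ElementTree as ET

XML = """<bookstore>
  <book>
    <title>Harry Potter</title>
    <author>J K. Rowling</author>
    <year>2005</year>
    <price>29.99</price>
  </book>
</bookstore>"""

root = ET.fromstring(XML)  # root is the <bookstore> element

# Map every child element to its parent (hypothetical helper for this sketch).
parent_of = {child: parent for parent in root.iter() for child in parent}

title = root.find("./book/title")
book = parent_of[title]                                   # parent of title

children = [child.tag for child in book]                  # children of book
siblings = [c.tag for c in book if c is not title]        # siblings of title

# Walk upwards from title to collect its ancestors.
ancestors = []
node = title
while node in parent_of:
    node = parent_of[node]
    ancestors.append(node.tag)

# Everything below bookstore, in document order.
descendants = [el.tag for el in root.iter() if el is not root]

print(book.tag)
print(children)
print(siblings)
print(ancestors)
print(descendants)
```

With lxml the dictionary is unnecessary, since lxml elements provide getparent(); the dictionary here is just what the standard library needs to answer the same questions.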

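What is XPath?

XPath (XML Path Language) is a language for finding information in an XML document. It can be used to navigate through the elements and attributes of an XML document.

W3School tutorial: http://www.w3school.com.cn/xpath/index.asp

XPath development tools

Open-source XPath expression editor: XMLQuire (works with XML files)

Chrome extension: XPath Helper

Firefox extension: XPath Checker

Selecting nodes

XPath uses path expressions to select nodes or node-sets in an XML document.

These path expressions look very much like the paths used in an ordinary computer file system.

Path expressions: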

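The table below lists the most useful path expressions:

Expression    Description
nodename      Selects the child elements of the current node that are named nodename
/             Selects from the root node
//            Selects matching nodes anywhere below the current node, regardless of their position
.             Selects the current node
..            Selects the parent of the current node
@             Selects attributes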
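
The path expressions in this section can be tried out without installing anything. The sketch below is an illustration under stated assumptions: it uses Python 3 and the standard library's xml.etree.ElementTree, which implements only a subset of XPath, and it runs against a trimmed copy of the bookstore document. ElementTree evaluates paths relative to an element, so the leading /bookstore step is dropped, and attributes are read with get() because a path cannot return them directly.

```python
import xml.etree.ElementTree as ET

XML = """<bookstore>
  <book category="cooking">
    <title lang="en">Everyday Italian</title>
    <price>30.00</price>
  </book>
  <book category="children">
    <title lang="en">Harry Potter</title>
    <price>29.99</price>
  </book>
  <book category="web">
    <title lang="en">Learning XML</title>
    <price>39.95</price>
  </book>
</bookstore>"""

root = ET.fromstring(XML)  # root is the <bookstore> element

# "book": child elements of the current node named book
books = root.findall("book")

# ".//title": title elements anywhere below the current node
titles = [t.text for t in root.findall(".//title")]

# "./book/title/..": ".." steps back up to the parent of each title
parents = [p.tag for p in root.findall("./book/title/..")]

# ElementTree cannot select "@lang" in a path, so read the attribute instead.
langs = [t.get("lang") for t in root.findall(".//title")]

print(len(books))
print(titles)
print(parents)
print(langs)
```

With lxml the full expressions, for example /bookstore/book or //title/@lang, can be passed to xpath() unchanged.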

Predicates

A predicate finds a specific node, or a node that contains a specific value. Predicates are written inside square brackets.

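The table below lists some path expressions with predicates, together with what each one selects:

Path expression                       Result
/bookstore/book[1]                    The first book element that is a child of bookstore
/bookstore/book[last()]               The last book element that is a child of bookstore
/bookstore/book[last()-1]             The second-to-last book element that is a child of bookstore
/bookstore/book[position()<3]         The first two book elements that are children of bookstore
//title[@lang]                        All title elements that have a lang attribute
//title[@lang='en']                   All title elements whose lang attribute is 'en'
/bookstore/book[price>35.00]          All book children of bookstore whose price is greater than 35.00
/bookstore/book[price>35.00]/title    The title elements of those books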
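
Predicates are easiest to understand by running them. The following Python 3 sketch is an illustration, not the article's own code: it uses the standard library's xml.etree.ElementTree on a condensed copy of the bookstore document, and title_of is a helper invented for this example. ElementTree supports positional and attribute predicates but not numeric comparisons such as price>35.00, so that one is emulated with an ordinary Python filter.

```python
import xml.etree.ElementTree as ET

XML = """<bookstore>
  <book category="cooking"><title lang="en">Everyday Italian</title><price>30.00</price></book>
  <book category="children"><title lang="en">Harry Potter</title><price>29.99</price></book>
  <book category="web"><title lang="en">XQuery Kick Start</title><price>49.99</price></book>
  <book category="web" cover="paperback"><title lang="en">Learning XML</title><price>39.95</price></book>
</bookstore>"""

root = ET.fromstring(XML)


def title_of(book):
    """Return the text of a book's <title> child (helper for this sketch)."""
    return book.find("title").text


first = title_of(root.find("./book[1]"))                 # positions start at 1
last = title_of(root.find("./book[last()]"))
second_to_last = title_of(root.find("./book[last()-1]"))

# Attribute predicates: "has the attribute" and "attribute equals a value".
with_cover = [title_of(b) for b in root.findall("./book[@cover]")]
web = [title_of(b) for b in root.findall("./book[@category='web']")]

# price>35.00 is not supported by ElementTree, so filter in Python instead.
expensive = [title_of(b) for b in root.findall("./book")
             if float(b.find("price").text) > 35.00]

print(first, last, second_to_last, sep=" | ")
print(with_cover)
print(web)
print(expensive)
```

Note that XPath positions are 1-based, unlike Python list indexes.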

Selecting unknown nodes

XPath wildcards can be used to select XML elements whose names are not known in advance.

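The table below lists the wildcards and what each one matches:

Wildcard    Description
*           Matches any element node
@*          Matches any attribute node
node()      Matches any node of any kind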
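
A short Python 3 sketch of wildcard selection follows. It is an illustration that assumes the standard library's xml.etree.ElementTree and a one-book version of the bookstore document. ElementTree understands * but not @*, so the element's attrib dictionary stands in for "all attributes" here.

```python
import xml.etree.ElementTree as ET

XML = """<bookstore>
  <book category="children">
    <title lang="en">Harry Potter</title>
    <price>29.99</price>
  </book>
</bookstore>"""

root = ET.fromstring(XML)

# "*": every child element of the current node, whatever its name
children_of_root = [el.tag for el in root.findall("*")]

# "./book/*": every child element of book
children_of_book = [el.tag for el in root.findall("./book/*")]

# ".//*": every element below the current node, at any depth
all_descendants = [el.tag for el in root.findall(".//*")]

# Stand-in for "@*": all attributes of the title element as a dict
title_attrs = dict(root.find(".//title").attrib)

print(children_of_root)
print(children_of_book)
print(all_descendants)
print(title_attrs)
```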

Selecting several paths

By using the | operator in a path expression, you can select several paths at once.

Examples

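The table below lists some path expressions that use |, together with what each one selects:

Path expression                    Result
//book/title | //book/price        All title and price elements of all book elements
//title | //price                  All title and price elements in the document
/bookstore/book/title | //price    All title elements of book children of bookstore, plus all price elements in the document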

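The following operators can be used in XPath expressions:

|                  Union of two node-sets
+  -  *  div  mod  Arithmetic
=  !=              Equality and inequality
<  <=  >  >=       Comparison
or  and            Boolean logic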

That covers the XPath syntax. To use it for scraping in Python, the HTML must first be parsed into an XML-style document tree.

XPath examples
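
The examples that follow all read a file named hello.html, which is not reproduced in the article. The Python 3 sketch below writes a file that is consistent with every output shown in the examples; its exact contents are an assumption reconstructed from those outputs, not the author's original file. The sanity check uses the standard library's xml.etree.ElementTree, so it runs without lxml.

```python
from pathlib import Path
import xml.etree.ElementTree as ET

# Assumed contents of hello.html, reconstructed from the example outputs.
HELLO_HTML = """<div>
    <ul>
        <li class="item-0"><a href="link1.html">first item</a></li>
        <li class="item-1"><a href="link2.html">second item</a></li>
        <li class="item-inactive"><a href="link3.html"><span class="bold">third item</span></a></li>
        <li class="item-1"><a href="link4.html">fourth item</a></li>
        <li class="item-0"><a href="link5.html">fifth item</a></li>
    </ul>
</div>
"""

Path("hello.html").write_text(HELLO_HTML, encoding="utf-8")

# The file must be well-formed XML, because lxml's etree.parse()
# uses an XML parser unless told otherwise.
tree = ET.parse("hello.html")
li_classes = [li.get("class") for li in tree.findall(".//li")]
print(li_classes)
```

For real-world pages that are not well-formed XML, lxml's etree.HTML(), which the crawler at the end of the article uses, is the more forgiving choice.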

1. Get all <li> tags

# xpath_li.py

from lxml import etree

html = etree.parse('hello.html')
print(type(html))  # show the type returned by etree.parse()

result = html.xpath('//li')

print(result)  # print the list of <li> elements
print(len(result))
print(type(result))
print(type(result[0]))

Output:

<class 'lxml.etree._ElementTree'>

[<Element li at 0x1014e0e18>, <Element li at 0x1014e0ef0>, <Element li at 0x1014e0f38>, <Element li at 0x1014e0f80>, <Element li at 0x1014e0fc8>]

5

<class 'list'>

<class 'lxml.etree._Element'>

2. Get the class attribute of every <li> tag

# xpath_li.py

from lxml import etree

html = etree.parse('hello.html')
result = html.xpath('//li/@class')

print(result)

Output:

['item-0', 'item-1', 'item-inactive', 'item-1', 'item-0']

3. Get the <a> tag under <li> whose href is link1.html

# xpath_li.py

from lxml import etree

html = etree.parse('hello.html')
result = html.xpath('//li/a[@href="link1.html"]')

print(result)

Output:

[<Element a at 0x10ffaae18>]

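4. Get all <span> tags under <li>

# xpath_li.py

from lxml import etree

html = etree.parse('hello.html')

# result = html.xpath('//li/span')

# The commented-out expression above is wrong here:
# a single / selects only direct children, and <span> is not a direct child of <li>,
# so use a double slash to match descendants at any depth.

result = html.xpath('//li//span')

print(result)

Output:

[<Element span at 0x10d698e18>]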
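
The difference between / and // in example 4 can be reproduced on a single list item. This Python 3 sketch is an illustration that assumes the standard library's xml.etree.ElementTree and a one-item snippet modelled on the third <li> of hello.html.

```python
import xml.etree.ElementTree as ET

SNIPPET = """<ul>
  <li class="item-inactive"><a href="link3.html"><span class="bold">third item</span></a></li>
</ul>"""

root = ET.fromstring(SNIPPET)

# Single slash: <span> would have to be a direct child of <li>. It is not.
direct = root.findall("./li/span")

# Double slash: <span> may sit at any depth below <li>.
nested = root.findall("./li//span")

print(len(direct))
print([el.tag for el in nested])
print(nested[0].text)
```

The first query comes back empty because the <span> is wrapped in an <a>; the second finds it two levels down.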

5. Get every class attribute inside the <a> tags under <li>

# xpath_li.py

from lxml import etree

html = etree.parse('hello.html')
result = html.xpath('//li/a//@class')

print(result)

Output:

['bold']

6. Get the href of the <a> in the last <li>

# xpath_li.py

from lxml import etree

html = etree.parse('hello.html')

result = html.xpath('//li[last()]/a/@href')
# the predicate [last()] selects the last element
print(result)

Output:

['link5.html']

7. Get the content of the second-to-last element

# xpath_li.py

from lxml import etree

html = etree.parse('hello.html')
result = html.xpath('//li[last()-1]/a')

# the text attribute holds the element's text content
print(result[0].text)

Output:

fourth item

8. Get the name of the tag whose class is bold

# xpath_li.py

from lxml import etree

html = etree.parse('hello.html')

result = html.xpath('//*[@class="bold"]')

# the tag attribute holds the tag name
print(result[0].tag)

Output:

span

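A crawler using XPath

Now let's use XPath to build a simple crawler. It visits every thread in a given Tieba forum and downloads the images posted on each floor of each thread to the local disk.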

# tieba_xpath.py


#!/usr/bin/env python3
# -*- coding:utf-8 -*-

import os
import urllib.parse
import urllib.request
from lxml import etree

class Spider:
    def __init__(self):
        self.tiebaName = input("Enter the name of the Tieba forum to visit: ")
        self.beginPage = int(input("Enter the first page: "))
        self.endPage = int(input("Enter the last page: "))

        self.url = 'http://tieba.baidu.com/f'
        self.ua_header = {"User-Agent" : "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0)"}

        # image counter, also used as the file name
        self.userName = 1

    def tiebaSpider(self):
        # make sure the output directory exists before any image is written
        os.makedirs('./images', exist_ok=True)

        for page in range(self.beginPage, self.endPage + 1):
            pn = (page - 1) * 50 # offset of the first thread on this page (50 per page)
            word = {'pn' : pn, 'kw': self.tiebaName}

            word = urllib.parse.urlencode(word) # URL-encode the query parameters into a string
            myUrl = self.url + "?" + word

            # Example: http://tieba.baidu.com/f?kw=%E7%BE%8E%E5%A5%B3&pn=50
            # loadPage collects every thread link on the page
            # and downloads the images of each thread
            self.loadPage(myUrl)

    # read a list page and follow its thread links
    def loadPage(self, url):
        req = urllib.request.Request(url, headers = self.ua_header)
        html = urllib.request.urlopen(req).read()

        # parse the response into an HTML document tree
        selector=etree.HTML(html)

        # grab the path part of every thread URL on the page, i.e. the thread ID,
        # e.g. "/p/4884069807" in http://tieba.baidu.com/p/4884069807
        links = selector.xpath('//div[@class="threadlist_lz clearfix"]/div/a/@href')

        # links is a list of string-like XPath results
        # join each one into a full thread URL and call loadImages
        for link in links:
            link = "http://tieba.baidu.com" + link
            self.loadImages(link)

    # find the images in a thread
    def loadImages(self, link):
        req = urllib.request.Request(link, headers = self.ua_header)
        html = urllib.request.urlopen(req).read()

        selector = etree.HTML(html)

        # get the src of every image in this thread
        imagesLinks = selector.xpath('//img[@class="BDE_Image"]/@src')

        # download and save each image in turn
        for imagesLink in imagesLinks:
            self.writeImages(imagesLink)

    # save one image
    def writeImages(self, imagesLink):
        '''
            Write the binary content of the image to a file named after userName
        '''

        print(imagesLink)
        print("Saving file %d ..." % self.userName)
        # 1. fetch the image data
        images = urllib.request.urlopen(imagesLink).read()

        # 2. write the image data; the with block closes the file afterwards
        with open('./images/' + str(self.userName)  + '.png', 'wb') as file:
            file.write(images)

        # 3. increment the counter
        self.userName += 1

# emulate a main function
if __name__ == "__main__":

    # first create the spider object
    mySpider = Spider()
    # then call its method to start crawling
    mySpider.tiebaSpider()
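
The page-to-URL step of the crawler can be tested on its own, without touching the network. The Python 3 sketch below is an illustration: build_page_url and THREADS_PER_PAGE are names invented here, and the page size of 50 is taken from the crawler's (page - 1) * 50 calculation rather than from any Baidu documentation.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

BASE_URL = "http://tieba.baidu.com/f"
THREADS_PER_PAGE = 50  # assumed from the crawler's (page - 1) * 50


def build_page_url(forum_name, page):
    """Return the list-page URL for a 1-based page number."""
    offset = (page - 1) * THREADS_PER_PAGE
    # urlencode percent-encodes the forum name as UTF-8.
    return BASE_URL + "?" + urlencode({"kw": forum_name, "pn": offset})


url = build_page_url("美女", 2)
print(url)

# Decode the query string again to confirm the round trip.
query = parse_qs(urlsplit(url).query)
print(query)
```

Page 2 maps to pn=50 and the forum name is percent-encoded, which matches the example URL in the crawler's comment.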

