Scrapy framework: response

阿新 • Published: 2018-11-11

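1. Scrapy response

1.1 Response methods and attributes

(1) body: the body of the HTTP response, as bytes
(2) body_as_unicode: the response body as a string (later Scrapy releases deprecate it in favor of text)
(3) copy: returns a copy of the response
(4) css: matches with a CSS selector
(5) encoding: the character encoding of the response
(6) headers: the response headers
(7) meta: the meta dict of the request that produced the response, used to pass parameters along
(8) replace: returns a copy with the given attributes replaced
(9) request: the Request object that produced this response
(10) selector: the Scrapy Selector built on this response
(11) status: the HTTP status code, e.g. 200 or 400
(12) text: the response body as text
(13) url: the URL of the HTTP response
(14) urljoin: builds an absolute URL from a relative one
(15) xpath: matches with an XPath expression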


1.2 Response classes

TextResponse, HtmlResponse, XmlResponse

2. Scrapy selector

Selector is the core of matching in Scrapy.

BeautifulSoup is convenient to use, but it parses slowly.

lxml parses quickly.

Scrapy's Selector aims to combine the advantages of both.

① Code:

from scrapy.selector import Selector

def parse(self, response):
    selector = Selector(response)
    self.log(selector)

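2.1 What the Selector object supports

2.1.1 CSS queries

Expression	Description	Example
*	All elements	* matches every tag
tag	A given tag	img matches every img tag
tag1,tag2	Several tags	img,a matches img and a tags
tag1 tag2	Descendant tags	img a matches a tags nested under img
[attrib=value]	A given attribute value	[id="1"] matches tags whose id equals 1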

① Code (calling through Selector)

from scrapy.selector import Selector

def parse(self, response):
    selector = Selector(response)
    img_list = selector.css("img")
    for img in img_list:
        self.log(img)

② Code (calling on response directly, a simplified version)

There is no need to import Selector, and the result is the same as before.

def parse(self, response):
    img_list = response.css("img")
    for img in img_list:
        self.log(img)

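2.1.2 XPath queries

(1) XPath in Scrapy has no methods such as text(), attrib() or tag(); these have to be written into the expression itself.

(2) . is the current node

(3) .. is the parent node

Expression	Description	Example
/	Root of the document, or a step down one level	/html/body/div selects the div
text()	Text	/html/body/div/a/text() selects the text of a
@attrib	Attribute	/html/body/div/a/@href selects the href attribute of a
*	Everything	/html/body/*[@class='hello'] selects every tag whose class equals hello; /html/body/a/@* selects every attribute of a
[]	Predicate	/html/body/div[4] selects the fourth div; /html/body/div[@class="xxx"] selects the divs whose class equals xxx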

① Code:

def parse(self, response):
    img_list = response.xpath("//img/@src")
    for img in img_list:
        self.log(img)

2.1.3 re queries

re() cannot be used on its own; it can only be chained after a css() or xpath() match.

① Code:

def parse(self, response):
    img_list = response.xpath("//img/@src")
    for img in img_list:
        img = img.re(".*png$")
        self.log(img)


2.2 Methods that return string results

2.2.1 extract() for a single object

① Code:

def parse(self, response):
    img_list = response.xpath("//img/@src")
    for img in img_list:
        img = img.extract()
        self.log(img)

2.2.2 extract_first() for a list

① Code:

def parse(self, response):
    img_list = response.xpath("//img/@src")
    first_src = img_list.extract_first()
    self.log(first_src)


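3. Scrapy item

3.1 Introduction to items

(1) A major advantage of Scrapy is that it lets us define a data model: with Item we define a data model class, in a way similar to Django models, though not identical.

(2) Scrapy creates a default model when the project is generated (in items.py), and we can define the data model we want there.

(3) Every field of a Scrapy item is declared as a Field.

(4) An item holds the parsed results in a dict-like form.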

3.2 Using items

① Code (spider):

import scrapy
from ScrapySpider.items import ScrapyTest
from scrapy import Request
class TestSpider(scrapy.spiders.Spider):

    name = "baiduSpider"

    def start_requests(self):
        url = "https://www.baidu.com/"
        headers = {
            "User-Agent": "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.106 Safari/537.36"
        }
        yield Request(url,headers=headers)

    def parse(self,response):
        img_list = response.xpath("//img/@src")
        for img in img_list:
            item = ScrapyTest()
            item["src"] = img.extract()
            self.log(item)

② Code (items):

import scrapy

class ScrapyspiderItem(scrapy.Item):
    # define the fields for your item here like:
    name = scrapy.Field()

# Inherits from ScrapyspiderItem
class ScrapyTest(ScrapyspiderItem):
    src = scrapy.Field()

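4. Scrapy pipeline

At this point we can structure the data, but to output it we still need pipelines.

4.1 Introduction to pipelines

(1) The first step in using pipelines is to uncomment the ITEM_PIPELINES setting in settings.py.

The setting has two parts:

① the path of the pipeline class

② the priority, conventionally in the range 0-1000; the smaller the value, the earlier the pipeline runs

(2) The spider's callback has to work as a generator that yields items.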
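
As an illustration of the two parts of this setting, the fragment below assumes the project package is named ScrapySpider (as in the imports used by the spiders in this article) and that the pipeline class is the ScrapyspiderPipeline from section 5.3. CleanupPipeline is a hypothetical second pipeline, added only to show the ordering.

```python
# settings.py (fragment)
ITEM_PIPELINES = {
    # "path.to.PipelineClass": priority
    "ScrapySpider.pipelines.ScrapyspiderPipeline": 300,
    "ScrapySpider.pipelines.CleanupPipeline": 100,   # hypothetical; lower value, so it runs first
}

# Not part of settings.py: shows the order in which items pass through the pipelines.
run_order = [path for path, priority in sorted(ITEM_PIPELINES.items(), key=lambda kv: kv[1])]
print(run_order)
```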

4.2 Using pipelines
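
A minimal sketch of what a pipeline class can look like. This is not the code from the original article: JsonLinesPipeline and the items.jl file name are hypothetical. The methods open_spider, process_item and close_spider are the hooks Scrapy calls on a pipeline; the lines at the bottom call them by hand so the sketch runs without a crawl.

```python
import json
import os
import tempfile


class JsonLinesPipeline:
    """Hypothetical pipeline that writes every item as one JSON line."""

    def __init__(self, path="items.jl"):
        self.path = path
        self.file = None

    def open_spider(self, spider):
        # Called once when the spider starts.
        self.file = open(self.path, "w", encoding="utf-8")

    def process_item(self, item, spider):
        # Called for every item the spider yields; must return the item
        # so that later pipelines (higher priority value) receive it too.
        self.file.write(json.dumps(dict(item), ensure_ascii=False) + "\n")
        return item

    def close_spider(self, spider):
        # Called once when the spider finishes.
        self.file.close()


# Manual walk-through of the calls Scrapy would make during a crawl.
out_path = os.path.join(tempfile.mkdtemp(), "items.jl")
pipeline = JsonLinesPipeline(out_path)
pipeline.open_spider(spider=None)
returned = pipeline.process_item({"src": "/img/a.png"}, spider=None)
pipeline.close_spider(spider=None)

with open(out_path, encoding="utf-8") as f:
    lines = f.read().splitlines()
print(lines)
```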


5. Scrapy project example

5.1 Create the spider file qiushi.py

import scrapy
from ScrapySpider.items import ScrapyTest

class qiushiTest(scrapy.spiders.Spider):
    name = "qiushi"
    def start_requests(self):
        url = "https://www.qiushibaike.com/"
        headers = {
            "Referer": "https://www.baidu.com/link?url=0NjZXCRuEfuf8lcVVYy8j3o_548KY5Nvc_GHkq6auqOxoY7-LnODt6dLkTcihaWC&wd=&eqid=8e7edbd1000211e4000000055bc19182",
            "User-Agent": "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.106 Safari/537.36"
        }
        yield scrapy.Request(url,headers=headers)

    def parse(self,response):
        img_list = response.xpath("//img/@src")
        for img in img_list:
            item = ScrapyTest()
            item["src"] = img.extract()
            self.log(item)
            yield item

5.2 items.py

5.3 pipelines.py

from urllib import request
class ScrapyspiderPipeline(object):
    def process_item(self, item, spider):
        src = item["src"]
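        # Protocol-relative links ("//host/path") need a scheme; absolute URLs are used as-is
        url = "http:" + src if src.startswith("//") else src
        if "?" in src:
            URL = src.split("?")[0]   # split on "?" and keep the part before the query string
            name = URL.rsplit("/", 1)[1]  # split once on "/" from the right and keep the file name
        else:
            name = src.rsplit("/", 1)[1]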
        print("===========")
        print(name)
        print(url)
        path = "F:\\img\\" + name
        try:
            request.urlretrieve(url, path)
        except Exception as e:
            print(e)
        else:
            print("%s is down" % name)
        return item
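
The URL and file-name handling in the pipeline above can also be written with the standard library's urllib.parse. This is an alternative sketch, not the article's method: image_url_and_name is a hypothetical helper and the sample URLs are made up. Note that urljoin takes the scheme from the page URL, so a protocol-relative src on an https page becomes https rather than http.

```python
import posixpath
from urllib.parse import urljoin, urlsplit


def image_url_and_name(page_url, src):
    """Hypothetical helper: resolve src against the page and derive a file name."""
    url = urljoin(page_url, src)                       # handles "//host/x", "/x", "x" and absolute URLs
    name = posixpath.basename(urlsplit(url).path)      # the path excludes the "?query" part
    return url, name


# Made-up sample values for illustration.
page = "https://www.qiushibaike.com/"
url1, name1 = image_url_and_name(page, "//pic.example.com/system/avatar/a.jpg?imageView2/1/w/90/h/90")
url2, name2 = image_url_and_name(page + "hot/", "static/logo.png")

print(url1, name1)
print(url2, name2)
```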

5.4 run.py

from scrapy import cmdline
cmdline.execute("scrapy crawl qiushi".split())
