A Scrapy Project Spider That Takes Arguments

阿新 • Published: 2018-11-02

In this chapter we scrape the content of "http://quotes.toscrape.com/" for a specific tag.

First, look at the site's URL structure. Taking the humor tag as an example, its URL is "http://quotes.toscrape.com/tag/humor/". So the URL for any tag on this site is "http://quotes.toscrape.com/tag/" + tag_name.
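The URL pattern above can be sketched as a small helper. This is only an illustration: the names `BASE_URL` and `tag_url` are assumptions introduced here, not part of the spider, and percent-encoding the tag is an extra precaution the original pattern does not mention.

```python
from urllib.parse import quote, urljoin

# Assumed constant for illustration; the site root used throughout this post
BASE_URL = "http://quotes.toscrape.com/"


def tag_url(tag_name=None):
    """Return the listing URL for a tag, or the site root when no tag is given."""
    if not tag_name:
        return BASE_URL
    # quote() percent-encodes characters that are unsafe in a path segment
    return urljoin(BASE_URL, "tag/{}/".format(quote(tag_name, safe="")))


print(tag_url("humor"))
print(tag_url())
```

Falling back to the site root when no tag is given mirrors what the getattr() version of the spider does.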

Next we build a spider that takes a tag argument, where tag is the tag we want to scrape.

# -*- coding: utf-8 -*-

import scrapy

class QuotesSpider(scrapy.Spider):
    name = "quotes"

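    def start_requests(self):
        # self.tag is set by Scrapy from the command-line option "-a tag=<name>"
        url = 'http://quotes.toscrape.com/tag/{}'.format(self.tag)
        yield scrapy.Request(url, self.parse)

    # Scrape the quotes on the tag page and follow pagination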
    def parse(self, response):
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').extract_first(),
                'author': quote.css('small.author::text').extract_first(),
            }

        next_page = response.css('li.next a::attr(href)').extract_first()
        if next_page is not None:
            next_page = response.urljoin(next_page)
            yield scrapy.Request(next_page, self.parse)
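The pagination step relies on turning the href of the "next" link into an absolute URL. As a rough sketch of what response.urljoin() does, the standard library's urljoin can be used on its own. The helper name `next_page_url` and the example href values are assumptions made for illustration, not values taken from a real crawl.

```python
from urllib.parse import urljoin


def next_page_url(page_url, href):
    """Resolve a pagination href against the URL of the page it was found on."""
    if href is None:
        # No "next" link: this is the last page
        return None
    return urljoin(page_url, href)


current = "http://quotes.toscrape.com/tag/humor/"
# A root-relative href and a page-relative href resolve to the same URL here
print(next_page_url(current, "/tag/humor/page/2/"))
print(next_page_url(current, "page/2/"))
print(next_page_url(current, None))
```

Because the result is always absolute, the spider can pass it straight to scrapy.Request regardless of how the site writes its links.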

Alternatively, we can use Python's built-in getattr() function, so the spider still works when no tag is passed.

# -*- coding: utf-8 -*-

import scrapy

class QuotesSpider(scrapy.Spider):
    name = "quotes"

    def start_requests(self):
        url = 'http://quotes.toscrape.com/'
        # getattr() returns self.tag if it exists, otherwise None
        tag = getattr(self, 'tag', None)
        if tag is not None:
            url = url + 'tag/' + tag
        # Request the resulting URL and handle the response with parse()
        yield scrapy.Request(url, self.parse)

    # Scrape the quotes on the tag page and follow pagination
    def parse(self, response):
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').extract_first(),
                'author': quote.css('small.author::text').extract_first(),
            }

        next_page = response.css('li.next a::attr(href)').extract_first()
        if next_page is not None:
            next_page = response.urljoin(next_page)
            yield scrapy.Request(next_page, self.parse)
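The difference between reading self.tag directly and using getattr() only shows up when the argument is omitted. The sketch below uses a stand-in class rather than Scrapy itself: `FakeSpider` and `resolve_tag` are hypothetical names, and the class only imitates the way a spider's default __init__ copies its arguments onto the instance as attributes.

```python
class FakeSpider:
    """Hypothetical stand-in for a spider; stores keyword arguments as attributes."""

    def __init__(self, **kwargs):
        for key, value in kwargs.items():
            setattr(self, key, value)


def resolve_tag(spider):
    # Same lookup as in start_requests(): fall back to None when tag is missing
    return getattr(spider, "tag", None)


with_tag = FakeSpider(tag="humor")
without_tag = FakeSpider()

print(resolve_tag(with_tag))
print(resolve_tag(without_tag))

# Direct attribute access fails when the argument was not supplied
try:
    without_tag.tag
except AttributeError:
    print("tag was not supplied")
```

This is why the first version of the spider requires -a tag=..., while the getattr() version can fall back to crawling from the site root.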

We can also override __init__, call the superclass's __init__, and then define start_urls there:

# -*- coding: utf-8 -*-

import scrapy

class QuotesSpider(scrapy.Spider):
    name = "quotes"

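    def __init__(self, tag=None, *args, **kwargs):
        super(QuotesSpider, self).__init__(*args, **kwargs)
        # Fall back to the site root when no tag is passed, instead of requesting "tag/None"
        if tag is None:
            self.start_urls = ['http://quotes.toscrape.com/']
        else:
            self.start_urls = ['http://quotes.toscrape.com/tag/{}'.format(tag)]

    # Scrape the quotes on the tag page and follow pagination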
    def parse(self, response):
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').extract_first(),
                'author': quote.css('small.author::text').extract_first(),
            }

        next_page = response.css('li.next a::attr(href)').extract_first()
        if next_page is not None:
            next_page = response.urljoin(next_page)
            yield scrapy.Request(next_page, self.parse)

When running the spider from the command line, we pass the argument with -a:

scrapy crawl quotes -o quotes-humor.json -a tag=humor

All quotes tagged humor have now been scraped and saved to the JSON file quotes-humor.json.
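As a quick check of the exported file, the items can be loaded back with the standard library. This is a minimal sketch that assumes the file is a JSON array of objects with the "text" and "author" keys produced by parse(). The function names `load_quotes` and `quotes_per_author` are introduced here for illustration only.

```python
import json
from collections import Counter


def load_quotes(path):
    """Load the list of scraped items from a JSON export file."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


def quotes_per_author(items):
    """Count how many scraped quotes belong to each author."""
    return Counter(item["author"] for item in items)
```

For example, quotes_per_author(load_quotes("quotes-humor.json")).most_common(3) would list the three authors with the most quotes under the tag, assuming the crawl above has already written that file.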
