
pyspider API Usage

self.crawl

manbuheiniu · 2016-09-05 15:00:31 (last edited 2016-09-05 16:13:49)

self.crawl(url, **kwargs)

self.crawl is the most important interface in the pyspider system; it tells pyspider which URLs need to be crawled.

Parameters:

url

The URL or list of URLs to be crawled.

callback

Specifies which method handles the content after the page has been crawled; the content is generally parsed as a response. default: __call__. For example:

 

def on_start(self): 
    self.crawl('http://scrapy.org/', callback=self.index_page)

 

self.crawl also accepts the following optional parameters.

age

Specifies the validity period of the task; the same task will not be recrawled within this period. default: -1 (never expire, i.e. crawl only once)

 

@config(age=10 * 24 * 60 * 60)
def index_page(self, response):
    ...

 

Explanation: every task whose callback is index_page is valid for 10 days; within those 10 days the same task will be ignored if it is encountered again (unless a force-recrawl parameter is set).

priority

Specifies the priority of the task; the larger the number, the earlier the task is executed. default: 0

def index_page(self):
    self.crawl('http://www.example.org/page2.html', callback=self.index_page)
    self.crawl('http://www.example.org/233.html', callback=self.detail_page, priority=1)

 

If these two tasks are put into the task queue at the same time, 233.html is crawled first. Using this parameter you can do a BFS and reduce the number of tasks in the queue (a large queue may cost more memory).

exetime

The execution time of the task, as a Unix timestamp. default: 0 (immediately)

import time

def on_start(self):
    self.crawl('http://www.example.org/', callback=self.callback, exetime=time.time()+30*60)

 

The page would be crawled 30 minutes later.

retries

Number of retries when the task fails. default: 3
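
A minimal sketch of overriding it per call (the URL and retry count below are only illustrative):

def on_start(self):
    # allow two extra attempts for an endpoint assumed to be flaky
    self.crawl('http://www.example.org/flaky-page', callback=self.callback, retries=5)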

itag

A marker value for the task. It is compared when the task is about to be crawled; if the value has changed, the page is recrawled regardless of whether the age has expired. It is mostly used to detect whether content has been modified or to force a recrawl. default: None

def index_page(self, response):
    for item in response.doc('.item').items():
        self.crawl(item.find('a').attr.url, callback=self.detail_page,
                   itag=item.find('.update-time').text())

 

In this example, the value of the page's update-time element is used as the itag to decide whether the content has been updated.

class Handler(BaseHandler):
    crawl_config = {
        'itag': 'v223'
    }

Change the project-wide itag so that all tasks are executed again (you still need to click the run button to start them).

auto_recrawl

When enabled, the task will be recrawled every age interval. default: False

def on_start(self):
    self.crawl('http://www.example.org/', callback=self.callback, age=5*60*60, auto_recrawl=True)

 

The page would be recrawled every 5 hours.

method

The HTTP method to use. default: GET
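
For example, a HEAD request can check a resource without downloading its body (a sketch; the URL is only illustrative):

def on_start(self):
    # only fetch the response headers of the assumed resource
    self.crawl('http://httpbin.org/get', callback=self.callback, method='HEAD')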

params

Appends a dictionary of query parameters to the URL, e.g.:

def on_start(self):
    self.crawl('http://httpbin.org/get', callback=self.callback, params={'a': 123, 'b': 'c'})
    self.crawl('http://httpbin.org/get?a=123&b=c', callback=self.callback)

 

Explanation: these two calls create the same task.

data

Attached to the request body; if it is a dictionary, it is form-encoded before being attached.

def on_start(self):
    self.crawl('http://httpbin.org/post', callback=self.callback, method='POST', data={'a': 123, 'b': 'c'})

 

files

Dictionary of {field: {filename: 'content'}} files for multipart upload.
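
A minimal sketch of a multipart upload, assuming httpbin.org/post as the target; the field name, filename and content are placeholders:

def on_start(self):
    self.crawl('http://httpbin.org/post', callback=self.callback, method='POST',
               # field 'report' with file 'report.txt' is only an example
               files={'report': {'report.txt': 'example file content'}})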

headers

Custom request headers (dictionary).

cookies

Custom cookies for the request (dictionary).
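
A sketch combining the two parameters above (the header and cookie values are only examples):

def on_start(self):
    self.crawl('http://httpbin.org/get', callback=self.callback,
               headers={'User-Agent': 'my-crawler/1.0'},   # example header
               cookies={'session': 'abc123'})              # example cookie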

connect_timeout

Connection timeout for the request, in seconds. default: 20

timeout

Maximum number of seconds to wait while fetching the page content. default: 120
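
A sketch with tighter limits than the defaults (the values are only illustrative):

def on_start(self):
    # give up after 10 s connecting and 60 s waiting for the body
    self.crawl('http://www.example.org/', callback=self.callback,
               connect_timeout=10, timeout=60)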

allow_redirects

Whether to follow 30x redirects. default: True

validate_cert

Whether to verify the certificate of HTTPS URLs. default: True
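
A sketch disabling both, e.g. to inspect a 30x response from a host with a self-signed certificate (the URL is made up):

def on_start(self):
    self.crawl('https://self-signed.example.org/old-path', callback=self.callback,
               allow_redirects=False,   # keep the 30x response instead of following it
               validate_cert=False)     # accept the self-signed certificate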

proxy

The proxy server to use, in the form username:password@hostname:port. Only HTTP proxies are currently supported.

class Handler(BaseHandler):
    crawl_config = {
        'proxy': 'localhost:8080'
    }

 

Configuring proxy in Handler.crawl_config applies it to the whole project; every task in the project will be fetched through the proxy.

etag

Use the HTTP Etag mechanism to skip the page if its content has not changed. default: True

last_modified

Use the HTTP Last-Modified header mechanism to skip the page if its content has not changed. default: True
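
A sketch that forces a full fetch even if the server reports the page as unchanged (the URL is only illustrative):

def on_start(self):
    # ignore Etag and Last-Modified so the body is always downloaded
    self.crawl('http://www.example.org/', callback=self.callback,
               etag=False, last_modified=False)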

fetch_type

Whether to enable the JavaScript rendering engine (set to 'js' to enable it). default: None

js_script

JavaScript to run before or after the page is loaded; it should be wrapped in a function, e.g. function() { document.write("binux"); }.

def on_start(self):
    self.crawl('http://www.example.org/', callback=self.callback, fetch_type='js', js_script='''
               function() {
                   window.scrollTo(0, document.body.scrollHeight);
                   return 123;
               }
               ''')

 

The script would scroll the page to the bottom. The value returned by the function can be captured via Response.js_script_result.

js_run_at

Run the JavaScript specified via js_script at document-start or document-end. default: document-end

js_viewport_width/js_viewport_height

Set the size of the viewport for the JavaScript fetcher's layout process.

load_images

Load images when the JavaScript fetcher is enabled. default: False
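
A sketch combining the JavaScript fetcher options above (the viewport values are only examples):

def on_start(self):
    self.crawl('http://www.example.org/', callback=self.callback, fetch_type='js',
               js_viewport_width=1280, js_viewport_height=800,  # assumed desktop-like layout
               load_images=False)                               # skip images to speed up rendering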

save

An object passed along with the task; in the parsing callback it can be retrieved via response.save.

def on_start(self):
    self.crawl('http://www.example.org/', callback=self.callback, save={'a': 123})

def callback(self, response):
    return response.save['a']

 

In the callback, 123 will be returned.

taskid

A unique taskid used to distinguish tasks. By default the taskid is the MD5 hash of the URL. You can also override the built-in behaviour with a def get_taskid(self, task) method to customize the task id, e.g.:

import json
from pyspider.libs.utils import md5string

def get_taskid(self, task):
    return md5string(task['url'] + json.dumps(task['fetch'].get('data', '')))

 

In this example, the task id is computed not only from the URL; different data parameters also produce different task ids.

force_update

Force an update of the task params even if the task is in ACTIVE status.

cancel

Cancel a task; it should be used together with force_update to cancel an active task. To cancel an auto_recrawl task, you should also set auto_recrawl=False.
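
A sketch of cancelling a previously scheduled auto_recrawl task (the URL is assumed to match the existing task):

def on_start(self):
    self.crawl('http://www.example.org/', callback=self.callback,
               auto_recrawl=False,   # stop the periodic recrawl
               force_update=True,    # update the params of the ACTIVE task
               cancel=True)          # cancel the task itself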

cURL command

self.crawl(curl_command)

cURL is a command-line tool for making HTTP requests. A command can easily be obtained from the Chrome DevTools > Network panel: right-click the request and choose "Copy as cURL".

You can use a cURL command as the first argument of self.crawl. pyspider will parse the command and make the HTTP request just as curl does.
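
A sketch, assuming a command copied from DevTools and trimmed for readability:

def on_start(self):
    # the cURL string is passed in place of a plain URL
    self.crawl("curl 'http://httpbin.org/get' -H 'Accept: text/html' --compressed",
               callback=self.callback)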

@config(**kwargs)

Default parameters of self.crawl when the decorated method is used as the callback. For example:

@config(age=15*60)
def index_page(self, response):
    self.crawl('http://www.example.org/list-1.html', callback=self.index_page)
    self.crawl('http://www.example.org/product-233', callback=self.detail_page)

@config(age=10*24*60*60)
def detail_page(self, response):
    return {...}

 

The age of list-1.html is 15 minutes, while the age of product-233 is 10 days. Because the callback of product-233 is detail_page, it shares the config of detail_page.

Handler.crawl_config = {}

Default parameters of self.crawl for the whole project. The scheduler parameters in crawl_config (priority, retries, exetime, age, itag, force_update, auto_recrawl, cancel) are joined when the task is created, while the parameters for the fetcher and processor are joined when the task is executed. You can use this mechanism to change the fetch config (e.g. cookies) afterwards.

class Handler(BaseHandler):
    crawl_config = {
        'headers': {
            'User-Agent': 'GoogleBot',
        }
    }

    ...

 

crawl_config sets a project-level User-Agent.