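Scrapy crawlers: the formdata argument of scrapy.FormRequest explained
1. Background
When crawling web pages, you sometimes need scrapy.FormRequest to submit data (a form submission) to the target site. Following the official Scrapy documentation, the standard way to write it is: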
# Header information
unicornHeader = {
    'Host': 'www.example.com',
    'Referer': 'http://www.example.com/',
}
# Data to submit with the form
myFormData = {'name': 'John Doe', 'age': '27'}
# Custom data, passed down to the callback through the response
customerData = {'key1': 'value1', 'key2': 'value2'}

yield scrapy.FormRequest(
    url="http://www.example.com/post/action",
    headers=unicornHeader,
    method='POST',              # GET or POST
    formdata=myFormData,        # data submitted with the form
    meta=customerData,          # custom data passed on to the response
    callback=self.after_post,
    errback=self.error_handle,
    # Needed when the form is submitted several times to the same URL;
    # otherwise the request is filtered out as a duplicate page
    dont_filter=True,
)
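But how should the request be written when the form data myFormData contains a dict nested inside a dict?
2. Example: a parameter whose value is a dict
While crawling Amazon, entering a seller's storefront and fetching its product list, I found that the page loads the list through an ajax request that returns JSON data.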
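The request and response were inspected in the browser's developer tools (the screenshots are not reproduced here). The Form Data section of the request contains a dict: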
marketplaceID:ATVPDKIKX0DER
seller:A2FE6D62A4WM6Q
productSearchRequestData:{"marketplace":"ATVPDKIKX0DER","seller":"A2FE6D62A4WM6Q","url":"/sp/ajax/products","pageSize":12,"searchKeyword":"","extraRestrictions":{},"pageNumber":"1"}

# formdata must be constructed as follows:
myFormData = {
    'marketplaceID': 'ATVPDKIKX0DER',
    'seller': 'A2FE6D62A4WM6Q',
    # Note the next line: the inner dict is passed as a single string
    'productSearchRequestData': '{"marketplace":"ATVPDKIKX0DER","pageNumber":"1"}',
}
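Writing the inner string by hand is error-prone, so one option is to build it with json.dumps. The following is a minimal sketch, not the author's code: the helper name build_form_data and its parameters are made up for illustration, and the compact separators are chosen only to reproduce the spacing-free form shown in the Form Data above.

```python
import json


def build_form_data(marketplace_id, seller_id, page_number, page_size=12):
    """Return a flat formdata dict whose nested part is a JSON string.

    Hypothetical helper: the field names are taken from the Form Data
    listing above, everything else is an assumption.
    """
    search_request = {
        "marketplace": marketplace_id,
        "seller": seller_id,
        "url": "/sp/ajax/products",
        "pageSize": page_size,
        "searchKeyword": "",
        "extraRestrictions": {},
        "pageNumber": str(page_number),
    }
    return {
        "marketplaceID": marketplace_id,
        "seller": seller_id,
        # Serialize the inner dict to one string; every formdata value stays a str
        "productSearchRequestData": json.dumps(search_request, separators=(",", ":")),
    }


form_data = build_form_data("ATVPDKIKX0DER", "A2FE6D62A4WM6Q", 1)
print(form_data["productSearchRequestData"])
```

The resulting dict can be passed directly as formdata, because all of its values are plain strings.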
The construction actually used against Amazon looks like this:
def sendRequestForProducts(self, response):
    ajaxParam = response.meta
    for pageIdx in range(1, ajaxParam['totalPageNum'] + 1):
        ajaxParam['isFirstAjax'] = False
        ajaxParam['pageNumber'] = pageIdx
        unicornHeader = {
            'Host': 'www.amazon.com',
            'Origin': 'https://www.amazon.com',
            'Referer': ajaxParam['referUrl'],
        }
        # Form data expected by the endpoint:
        #   marketplaceID:ATVPDKIKX0DER
        #   seller:AYZQAQRQKEXRP
        #   productSearchRequestData:{"marketplace":"ATVPDKIKX0DER","seller":"AYZQAQRQKEXRP","pageNumber":1}
        # The inner dict is assembled as one string
        productSearchRequestData = (
            '{"marketplace": "ATVPDKIKX0DER","seller": "' + str(ajaxParam['sellerID'])
            + '","url": "/sp/ajax/products","pageSize": 12,"searchKeyword": "",'
            + '"extraRestrictions": {},"pageNumber": "' + str(pageIdx) + '"}'
        )
        formdataProduct = {
            'marketplaceID': ajaxParam['marketplaceID'],
            'seller': ajaxParam['sellerID'],
            'productSearchRequestData': productSearchRequestData,
        }
        productAjaxMeta = ajaxParam
        # Request the storefront's product list
        yield scrapy.FormRequest(
            url='https://www.amazon.com/sp/ajax/products',
            headers=unicornHeader,
            formdata=formdataProduct,
            method='POST',
            meta=productAjaxMeta,
            callback=self.solderProductAjax,
            errback=self.error,   # handles HTTP errors
            dont_filter=True,     # required here: every page posts to the same URL
        )
3. How it works
Suppose we have the following data:
formdata = {
    'Field': {"pageIdx": 99, "size": "10"},
    'func': 'nextPage',
}
In the browser, the request data sent by the page looks like this:
Field=%7B%22pageIdx%22%3A99%2C%22size%22%3A%2210%22%7D&func=nextPage
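To see what that encoded body actually carries, it can be decoded with the standard library. This is a small sketch for inspection only; the variable names are illustrative.

```python
import json
from urllib.parse import parse_qs

raw_body = "Field=%7B%22pageIdx%22%3A99%2C%22size%22%3A%2210%22%7D&func=nextPage"

# parse_qs percent-decodes the body and groups values by field name
fields = parse_qs(raw_body)

# The Field value is plain text that happens to be JSON
field_text = fields["Field"][0]
field_value = json.loads(field_text)

print(field_text)   # {"pageIdx":99,"size":"10"}
print(field_value)  # {'pageIdx': 99, 'size': '10'}
```

The point is that the browser sends the nested dict as one string field, not as separate form fields.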
First approach: sending the request as follows gives the correct result:
yield scrapy.FormRequest(
    url='https://www.example.com/sp/ajax',
    formdata={
        'Field': '{"pageIdx":99,"size":"10"}',
        'func': 'nextPage',
    },
    callback=self.handleFunc,
)
# Request data sent: Field=%7B%22pageIdx%22%3A99%2C%22size%22%3A%2210%22%7D&func=nextPage
Second approach: sending the request as follows gives the wrong result (the server does not return the expected data):
yield scrapy.FormRequest(
    url='https://www.example.com/sp/ajax',
    formdata={
        'Field': {"pageIdx": 99, "size": "10"},
        'func': 'nextPage',
    },
    callback=self.handleFunc,
)
# After the incorrect encoding, the request sent is: Field=size&Field=pageIdx&func=nextPage
Let's trace through the Scrapy source code:
# E:/Miniconda/Lib/site-packages/scrapy/http/request/form.py
# FormRequest
class FormRequest(Request):

    def __init__(self, *args, **kwargs):
        formdata = kwargs.pop('formdata', None)
        if formdata and kwargs.get('method') is None:
            kwargs['method'] = 'POST'

        super(FormRequest, self).__init__(*args, **kwargs)

        if formdata:
            items = formdata.items() if isinstance(formdata, dict) else formdata
            querystr = _urlencode(items, self.encoding)
            if self.method == 'POST':
                self.headers.setdefault(b'Content-Type', b'application/x-www-form-urlencoded')
                self._set_body(querystr)
            else:
                self._set_url(self.url + ('&' if '?' in self.url else '?') + querystr)


# The key function: _urlencode
def _urlencode(seq, enc):
    values = [(to_bytes(k, enc), to_bytes(v, enc))
              for k, vs in seq
              for v in (vs if is_listlike(vs) else [vs])]
    return urlencode(values, doseq=1)
The analysis goes as follows:
# Step 1: items = formdata.items() if isinstance(formdata, dict) else formdata
# Result: after items() runs, the original dict becomes the following list-like view:
dict_items([('func', 'nextPage'), ('Field', {'size': '10', 'pageIdx': 99})])

# Step 2: _urlencode then turns the items into:
[(b'func', b'nextPage'), (b'Field', b'size'), (b'Field', b'pageIdx')]

# The problem arises inside _urlencode. A dict counts as list-like, and iterating
# over a dict yields only its keys, so when a value is itself a dict, only the
# keys of that inner dict survive.
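The two steps above can be reproduced without Scrapy. The sketch below re-implements _urlencode with simplified stand-ins for Scrapy's is_listlike and to_bytes helpers (these stand-ins are assumptions written for this example, not the library's exact code), and shows both the collapsed and the correct encoding.

```python
from urllib.parse import urlencode


def is_listlike(value):
    # Simplified stand-in: anything iterable that is not str/bytes.
    # Note that a dict qualifies, which is the root of the problem.
    return hasattr(value, "__iter__") and not isinstance(value, (str, bytes))


def to_bytes(text, encoding="utf-8"):
    # Simplified stand-in: accept only str or bytes, like the original helper
    if isinstance(text, bytes):
        return text
    if not isinstance(text, str):
        raise TypeError("to_bytes must receive a str or bytes object, got %s"
                        % type(text).__name__)
    return text.encode(encoding)


def urlencode_like_scrapy(items, encoding="utf-8"):
    values = [(to_bytes(k, encoding), to_bytes(v, encoding))
              for k, vs in items
              for v in (vs if is_listlike(vs) else [vs])]
    return urlencode(values, doseq=1)


nested = {"Field": {"pageIdx": 99, "size": "10"}, "func": "nextPage"}
as_string = {"Field": '{"pageIdx":99,"size":"10"}', "func": "nextPage"}

broken = urlencode_like_scrapy(nested.items())
fixed = urlencode_like_scrapy(as_string.items())

print(broken)  # Field=pageIdx&Field=size&func=nextPage
print(fixed)   # Field=%7B%22pageIdx%22%3A99%2C%22size%22%3A%2210%22%7D&func=nextPage
```

On current Python versions dicts keep insertion order, so the keys come out as pageIdx then size; the order in the author's output reflects an older interpreter.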
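Solution: treat the dict as an ordinary string, encode it (convert it to bytes), and send it. When the data reaches the server, the server decodes it again to recover the dict as a string, and then parses that string as a dict.
Extension: any other special type of data can be passed the same way, by packing it into a string first.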
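A generic version of this idea might look like the sketch below. The helper flatten_formdata is hypothetical, and the "server side" is simulated with parse_qs and json.loads; a real server may of course expect a different serialization. Only dict values are converted, since a list value in formdata already has a meaning (a repeated field).

```python
import json
from urllib.parse import parse_qs, urlencode


def flatten_formdata(data):
    """Return a copy of data in which every dict value is a JSON string."""
    flat = {}
    for key, value in data.items():
        if isinstance(value, dict):
            flat[key] = json.dumps(value, separators=(",", ":"))
        else:
            flat[key] = value
    return flat


payload = {"Field": {"pageIdx": 99, "size": "10"}, "func": "nextPage"}

# Client side: flatten, then form-encode
body = urlencode(flatten_formdata(payload), doseq=True)

# Simulated server side: decode the form, then parse the string as JSON
received = parse_qs(body)
decoded = json.loads(received["Field"][0])

print(body)
print(decoded)
```

In a spider, the output of flatten_formdata(payload) would be what gets passed as formdata.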
4. Supplement 1: parameter types
Each value in formdata must be a unicode, str, or bytes object; it cannot be an integer.
Example:
yield FormRequest(
    url='https://www.amztracker.com/unicorn.php',
    # formdata values must be strings; the integer 10 below triggers the error
    formdata={'rank': 10, 'category': productDetailInfo['topCategory']},
    method='GET',
    meta={'productDetailInfo': productDetailInfo},
    callback=self.amztrackerSale,
    # (in this project, most failures of this request end up in the errback)
    dont_filter=True,   # in principle this argument should not be needed
)

# The following ERROR is reported:
Traceback (most recent call last):
  File "E:\Miniconda\lib\site-packages\scrapy\utils\defer.py", line 102, in iter_errback
    yield next(it)
  File "E:\Miniconda\lib\site-packages\scrapy\spidermiddlewares\offsite.py", line 29, in process_spider_output
    for x in result:
  File "E:\Miniconda\lib\site-packages\scrapy\spidermiddlewares\referer.py", line 339, in <genexpr>
    return (_set_referer(r) for r in result or ())
  File "E:\Miniconda\lib\site-packages\scrapy\spidermiddlewares\urllength.py", line 37, in <genexpr>
    return (r for r in result or () if _filter(r))
  File "E:\Miniconda\lib\site-packages\scrapy\spidermiddlewares\depth.py", line 58, in <genexpr>
    return (r for r in result or () if _filter(r))
  File "E:\PyCharmCode\categorySelectorAmazon1\categorySelectorAmazon1\spiders\categorySelectorAmazon1Clawer.py", line 224, in parseProductDetail
    dont_filter = True,
  File "E:\Miniconda\lib\site-packages\scrapy\http\request\form.py", line 31, in __init__
    querystr = _urlencode(items, self.encoding)
  File "E:\Miniconda\lib\site-packages\scrapy\http\request\form.py", line 66, in _urlencode
    for k, vs in seq
  File "E:\Miniconda\lib\site-packages\scrapy\http\request\form.py", line 67, in <listcomp>
    for v in (vs if is_listlike(vs) else [vs])]
  File "E:\Miniconda\lib\site-packages\scrapy\utils\python.py", line 117, in to_bytes
    'object, got %s' % type(text).__name__)
TypeError: to_bytes must receive a unicode, str or bytes object, got int

# Correct version: convert the number to a string first
formdata = {'rank': str(productDetailInfo['topRank']), 'category': productDetailInfo['topCategory']}
How it works (source code):
# Stage 1: the dict is broken down into items
if formdata:
    items = formdata.items() if isinstance(formdata, dict) else formdata
    querystr = _urlencode(items, self.encoding)

# Stage 2: to_bytes is called on every key and value
def _urlencode(seq, enc):
    values = [(to_bytes(k, enc), to_bytes(v, enc))
              for k, vs in seq
              for v in (vs if is_listlike(vs) else [vs])]
    return urlencode(values, doseq=1)

# Stage 3: to_bytes runs; its argument must be bytes or str
def to_bytes(text, encoding=None, errors='strict'):
    """Return the binary representation of `text`. If `text`
    is already a bytes object, return it as-is."""
    if isinstance(text, bytes):
        return text
    if not isinstance(text, six.string_types):
        raise TypeError('to_bytes must receive a unicode, str or bytes '
                        'object, got %s' % type(text).__name__)
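One way to avoid this TypeError is to normalize the values before building the request. The helper below is a hypothetical sketch (stringify_formdata is not part of Scrapy): str and bytes pass through untouched, list and tuple values are converted element by element, and any other scalar is converted with str().

```python
def stringify_formdata(data):
    """Return a copy of data whose values are safe to pass as formdata."""
    def to_text(value):
        # str and bytes are already acceptable; everything else becomes str
        return value if isinstance(value, (str, bytes)) else str(value)

    clean = {}
    for key, value in data.items():
        if isinstance(value, (list, tuple)):
            # A list means "repeat this field"; convert each element
            clean[key] = [to_text(item) for item in value]
        else:
            clean[key] = to_text(value)
    return clean


safe = stringify_formdata({"rank": 10, "category": "Toys", "ids": [1, 2]})
print(safe)  # {'rank': '10', 'category': 'Toys', 'ids': ['1', '2']}
```

The sample values ("Toys", the ids list) are invented for the demonstration. Nested dicts are deliberately not handled here; they need to be serialized to a string explicitly, as described in section 3.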
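5. Supplement 2: Chinese parameter values
As above, each value in formdata must be a unicode, str, or bytes object, not an integer. A bytes value is passed through unchanged, which makes it possible to send text in an encoding other than the request's default.
Take searching for a product on 1688 as an example: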
The search request looks like this (search keyword: 动漫周边):
In the request URL, 动漫周边 appears as %B6%AF%C2%FE%D6%DC%B1%DF, which is the percent-encoded GBK byte sequence of the keyword.
# The request is constructed in Scrapy as follows
# In Python 3, all strings are unicode
unicornHeaders = {
    ':authority': 's.1688.com',
    'Referer': 'https://www.1688.com/',
}
# 动漫周边 converted to bytes with GBK is: %B6%AF%C2%FE%D6%DC%B1%DF
formatStr = "动漫周边".encode('gbk')
print(f"formatStr = {formatStr}")
yield FormRequest(
    url='https://s.1688.com/selloffer/offer_search.htm',
    headers=unicornHeaders,
    method='GET',
    formdata={'keywords': formatStr, 'n': 'y', 'spm': 'a260k.635.1998096057.d1'},
    meta={},
    callback=self.parseCategoryPage,
    dont_filter=True,   # in principle this argument should not be needed
)

# The log reads:
formatStr = b'\xb6\xaf\xc2\xfe\xd6\xdc\xb1\xdf'
2017-11-16 15:11:02 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (302) to <GET https://sec.1688.com/query.htm?smApp=searchweb2&smPolicy=searchweb2-selloffer-anti_Spider-seo-html-checklogin&smCharset=GBK&smTag=MTE1LjIxNi4xNjAuNDYsLDU5OWQ1NWIyZTk0NDQ1Y2E5ZDAzODRlOGM1MDI2OTZj&smReturn=https%3A%2F%2Fs.1688.com%2Fselloffer%2Foffer_search.htm%3Fkeywords%3D%25B6%25AF%25C2%25FE%25D6%25DC%25B1%25DF%26n%3Dy%26spm%3Da260k.635.1998096057.d1&smSign=05U0%2BJXfKLQmSbsnce55Yw%3D%3D> from <GET https://s.1688.com/selloffer/offer_search.htm?keywords=%B6%AF%C2%FE%D6%DC%B1%DF&n=y&spm=a260k.635.1998096057.d1>

# URL of the request that was actually sent:
# https://s.1688.com/selloffer/offer_search.htm?keywords=%B6%AF%C2%FE%D6%DC%B1%DF&n=y&spm=a260k.635.1998096057.d1
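The effect of encoding the keyword to GBK bytes first can be checked with the standard library alone. This sketch only illustrates the encoding step; it does not send a request, and the variable names are illustrative.

```python
from urllib.parse import quote, unquote, urlencode

keyword = "动漫周边"

# Encoding to GBK first and then percent-encoding the raw bytes
gbk_bytes = keyword.encode("gbk")
gbk_quoted = quote(gbk_bytes)

# Percent-encoding the str directly uses UTF-8 by default, which differs
utf8_quoted = quote(keyword)

# urlencode accepts bytes values and percent-encodes them as they are
query = urlencode({"keywords": gbk_bytes, "n": "y"})

print(gbk_quoted)   # %B6%AF%C2%FE%D6%DC%B1%DF
print(utf8_quoted)
print(query)        # keywords=%B6%AF%C2%FE%D6%DC%B1%DF&n=y
```

Decoding works the same way in reverse: unquote(gbk_quoted, encoding="gbk") gives back the original keyword.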