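Sending POST requests with a request payload in a Python crawler
1. Background
While crawling a site recently, I found that its POST requests send data as a request payload rather than in the more familiar form data format. Submitting the same data as form data did not succeed.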
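1.1. Form Data vs. Request Payload in HTTP requests
AJAX POST requests commonly pass parameters in one of two forms: form data and request payload.
1.1.1. Form data
In a GET request, the parameters appear directly in the URL in the form key1=value1&key2=value2, for example: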
http://news.baidu.com/ns?word=NBA&tn=news&from=news&cl=2&rn=20&ct=1
In a POST request, the form parameters are in the request body, again in the form key1=value1&key2=value2. Chrome's developer tools show it like this:
Request URL:http://127.0.0.1:8080/test/test.do
Request Method:POST
Status Code:200 OK

Request Headers
Accept:text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8
Accept-Encoding:gzip,deflate,sdch
Accept-Language:zh-CN,zh;q=0.8,en;q=0.6
AlexaToolbar-ALX_NS_PH:AlexaToolbar/alxg-3.2
Cache-Control:max-age=0
Connection:keep-alive
Content-Length:25
Content-Type:application/x-www-form-urlencoded
Cookie:JSESSIONID=74AC93F9F572980B6FC10474CD8EDD8D
Host:127.0.0.1:8080
Origin:http://127.0.0.1:8080
Referer:http://127.0.0.1:8080/test/index.jsp
User-Agent:Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML,like Gecko) Chrome/33.0.1750.149 Safari/537.36

Form Data
name:mikan
address:street

Response Headers
Content-Length:2
Date:Sun,11 May 2014 11:05:33 GMT
Server:Apache-Coyote/1.1
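Note that the Content-Type of this POST request is application/x-www-form-urlencoded (the default), and the parameters are in the request body, shown above as Form Data.
Front-end code: submitting data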
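As a minimal sketch of the encoding described above, Python's standard urllib.parse.urlencode builds the same key1=value1&key2=value2 body from the name and address fields of the captured request. The variable names are illustrative and not part of the original code.

```python
from urllib.parse import urlencode

# Form fields taken from the captured request above
form_fields = {"name": "mikan", "address": "street"}

# urlencode joins the pairs as key1=value1&key2=value2
form_body = urlencode(form_fields)

# Content-Length counts the bytes of the encoded body
content_length = len(form_body.encode("utf-8"))

print(form_body)
print(content_length)
```

The computed length is 25, which matches the Content-Length header in the capture.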
xhr.setRequestHeader("Content-type","application/x-www-form-urlencoded");
xhr.send("name=foo&value=bar");
Back-end code: receiving the submitted data. In a servlet, form parameters can be read with request.getParameter(name).
/**
 * Get a parameter from the HTTP request
 *
 * @param request
 * @param name
 * @return
 */
protected String getParameterValue(HttpServletRequest request, String name) {
    return StringUtils.trimToEmpty(request.getParameter(name));
}
1.1.2. Request payload
With a native AJAX POST request, Chrome's developer tools show the request as follows. The main thing to notice is where the parameters appear:
Remote Address:192.168.234.240:80
Request URL:http://tuanbeta3.XXX.com/qimage/upload.htm
Request Method:POST
Status Code:200 OK

Request Headers
Accept:application/json,text/javascript,*/*; q=0.01
Accept-Encoding:gzip,en;q=0.6
Connection:keep-alive
Content-Length:151
Content-Type:application/json;charset=UTF-8
Cookie:JSESSIONID=E08388788943A651924CA0A10C7ACAD0
Host:tuanbeta3.XXX.com
Origin:http://tuanbeta3.XXX.com
Referer:http://tuanbeta3.XXX.com/qimage/customerlist.htm?menu=19
User-Agent:Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML,like Gecko) Chrome/35.0.1916.114 Safari/537.36
X-Requested-With:XMLHttpRequest

Request Payload
[{widthEncode:NNNcaXN,heightEncode:NNNN5NN,displayUrl:201409/03/66I5P266rtT86oKq6,…}]

Response Headers
Connection:keep-alive
Content-Encoding:gzip
Content-Type:application/json;charset=UTF-8
Date:Thu,04 Sep 2014 06:49:44 GMT
Server:nginx/1.4.7
Transfer-Encoding:chunked
Vary:Accept-Encoding
Note that the request's Content-Type is application/json;charset=UTF-8 and the parameters are in the Request Payload.
Back-end code: reading the data (using IOUtils from org.apache.commons.io):
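For comparison with the form-encoded case, here is a small sketch of how a payload-style request is put together with the standard library only. The endpoint and field values are hypothetical, and the request is only built, never sent.

```python
import json
from urllib.request import Request

# Hypothetical endpoint and fields, used only for illustration
url = "http://127.0.0.1:8080/test/upload"
payload = [{"widthEncode": "abc", "heightEncode": "def"}]

# The body is the JSON text itself, not key=value pairs
body = json.dumps(payload).encode("utf-8")

req = Request(
    url,
    data=body,
    headers={"Content-Type": "application/json;charset=UTF-8"},
    method="POST",
)

print(req.get_method())
# urllib stores header names capitalized as "Content-type"
print(req.get_header("Content-type"))
print(req.data.decode("utf-8"))
```

The only differences from a form POST are the Content-Type header and the way the body is serialized.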
/**
 * Read the payload data from the request
 *
 * @param request
 * @return
 * @throws IOException
 */
private String getRequestPayload(HttpServletRequest request) throws IOException {
    return IOUtils.toString(request.getReader());
}
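1.1.3. Differences between the two
If a request's Content-Type is set to application/x-www-form-urlencoded, the POST is treated as an HTTP form POST, and the request body is a standard query string of key-value pairs joined by &. This is the default for HTML forms, which is why it used to be the more common format.
Other kinds of POST request put their data in the request payload (JSON is used here because it is easy to read), with the Content-Type set to application/json;charset=UTF-8 or left unspecified.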
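The difference can be illustrated from the receiving side. The sketch below is a hypothetical helper (parse_body is not from the original article) that decodes a body according to its Content-Type and shows that the same data can arrive in either encoding.

```python
import json
from urllib.parse import parse_qs


def parse_body(content_type, body):
    """Decode a POST body according to its Content-Type (illustrative helper)."""
    media_type = content_type.split(";")[0].strip().lower()
    if media_type == "application/x-www-form-urlencoded":
        # parse_qs returns a list per key because a key may repeat
        return {key: values[0] for key, values in parse_qs(body).items()}
    if media_type == "application/json":
        return json.loads(body)
    raise ValueError(f"unsupported Content-Type: {content_type}")


form = parse_body("application/x-www-form-urlencoded", "name=mikan&address=street")
payload = parse_body(
    "application/json;charset=UTF-8",
    '{"name": "mikan", "address": "street"}',
)

print(form)
print(payload)
```

A server that only implements the JSON branch cannot read a form-encoded body, which is consistent with the failed form data submission described in the background.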
2. Environment
Python 3.6.1
OS: Windows 7
IDE: PyCharm
requests 2.14.2
scrapy 1.4.0
3. POSTing a payload with the requests module
import json
import requests
import datetime

postUrl = 'https://sellercentral.amazon.com/fba/profitabilitycalculator/getafnfee?profitcalcToken=en2kXFaY81m513NydhTZ9sdb6hoj3D'

# Payload data
payloadData = {
    'afnPriceStr': 10,
    'currency': 'USD',
    'productInfoMapping': {
        'asin': 'B072JW3Z6L',
        'dimensionUnit': 'inches',
    }
}

# Request headers
payloadHeader = {
    'Host': 'sellercentral.amazon.com',
    'Content-Type': 'application/json',
}

# Download timeout in seconds
timeOut = 25

# Proxy
proxy = "183.12.50.118:8080"
proxies = {
    "http": proxy,
    "https": proxy,
}

# Serialize the dict to a JSON string and send it as the request body
dumpJsonData = json.dumps(payloadData)
print(f"dumpJsonData = {dumpJsonData}")
res = requests.post(postUrl, data=dumpJsonData, headers=payloadHeader, timeout=timeOut, proxies=proxies, allow_redirects=True)

# Passing the dict directly through the json parameter also works
# res = requests.post(postUrl, json=payloadData, headers=payloadHeader)
print(f"responseTime = {datetime.datetime.now()}, statusCode = {res.status_code}, res text = {res.text}")
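4. POSTing a payload in Scrapy
There is bad news here: Scrapy's FormRequest (as of the Scrapy 1.4.0 used here) does not support payload-style requests. Scrapy is also strict about the format of formdata; see this article for details: https://www.jb51.net/article/185824.htm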
4.1. Reading the Scrapy source
See the comments in the code.
# File: E:\Miniconda\Lib\site-packages\scrapy\http\request\form.py
class FormRequest(Request):

    def __init__(self, *args, **kwargs):
        formdata = kwargs.pop('formdata', None)
        if formdata and kwargs.get('method') is None:
            kwargs['method'] = 'POST'

        super(FormRequest, self).__init__(*args, **kwargs)

        if formdata:
            items = formdata.items() if isinstance(formdata, dict) else formdata
            querystr = _urlencode(items, self.encoding)
            # The Content-Type is hard-coded here: submitted data is always sent as form data.
            # Even if this line were overridden, nothing afterwards handles JSON data.
            if self.method == 'POST':
                self.headers.setdefault(b'Content-Type', b'application/x-www-form-urlencoded')
                self._set_body(querystr)
            else:
                self._set_url(self.url + ('&' if '?' in self.url else '?') + querystr)
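To make the branch in the source above easier to follow, here is a simplified stand-in written with the standard library only. build_form_request is a hypothetical function, not Scrapy code. It mirrors the same decisions: default to POST when form data is given, put the encoded pairs in the body for POST, and append them to the URL otherwise.

```python
from urllib.parse import urlencode


def build_form_request(url, formdata, method=None):
    """Simplified stand-in for the FormRequest logic above (not Scrapy code)."""
    if formdata and method is None:
        method = "POST"
    method = method or "GET"

    headers, body = {}, b""
    if formdata:
        items = formdata.items() if isinstance(formdata, dict) else formdata
        querystr = urlencode(list(items))
        if method == "POST":
            # The body is always form-encoded; there is no JSON branch
            headers.setdefault("Content-Type", "application/x-www-form-urlencoded")
            body = querystr.encode("utf-8")
        else:
            url = url + ("&" if "?" in url else "?") + querystr
    return method, url, headers, body


post = build_form_request("http://example.com/api", {"a": "1", "b": "2"})
get = build_form_request("http://example.com/api?x=0", {"a": "1"}, method="GET")

print(post)
print(get)
```

In both branches the data ends up URL-encoded, so a nested structure such as productInfoMapping cannot be sent as JSON through this path.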
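4.2. Approach: embedding the requests module in Scrapy
Analyzing the request
The returned query result
Step 1: construct the request in the spider, carrying all the parameters and other required information.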
# File: mySpider.py
payloadData = {
    'afnPriceStr': 0,
    'currency': asinInfo['currencyCodeHidden'],
    'futureFeeDate': asinInfo['futureFeeDateHidden'],
    'hasFutureFee': False,
    'hasTaxPage': True,
    'marketPlaceId': asinInfo['marketplaceIdHidden'],
    'mfnPriceStr': 0,
    'mfnShippingPriceStr': 0,
}

# productInfoMapping copies these fields from dataFieldJson unchanged
productInfoKeys = (
    'asin', 'binding', 'dimensionUnit', 'dimensionUnitString',
    'encryptedMarketplaceId', 'gl', 'height', 'imageUrl', 'isAsinLimits',
    'isWhiteGloveRequired', 'length', 'link', 'originalUrl', 'productGroup',
    'subCategory', 'thumbStringUrl', 'title', 'weight', 'weightUnit',
    'weightUnitString', 'width',
)
payloadData['productInfoMapping'] = {key: dataFieldJson[key] for key in productInfoKeys}

# https://sellercentral.amazon.com/fba/profitabilitycalculator/getafnfee?profitcalcToken=en2kXFaY81m513NydhTZ9sdb6hoj3D
postUrl = f"https://sellercentral.amazon.com/fba/profitabilitycalculator/getafnfee?profitcalcToken={asinInfo['tokenValue']}"
payloadHeader = {
    'Host': 'sellercentral.amazon.com',
}
# Scrapy source: self.headers.setdefault(b'Content-Type', b'application/x-www-form-urlencoded')
print(f"payloadData = {payloadData}")

# This Request is not the one that actually gets sent: built this way, it cannot be
# submitted successfully and returns a 404.
# It only carries the query parameters to the downloader middleware, which sends them
# with the requests module. "payloadFlag" marks requests of this kind.
yield Request(
    url=postUrl,
    headers=payloadHeader,
    meta={'payloadFlag': True, 'payloadData': payloadData, 'headers': payloadHeader, 'asinInfo': asinInfo},
    callback=self.parseAsinSearchFinallyRes,
    errback=self.error,
    dont_filter=True,
)
Step 2: handle this request in the middleware with the requests module.
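# File: middlewares.py
import datetime
import json

import requests
from scrapy.http import HtmlResponse


class PayLoadRequestMiddleware:
    def process_request(self, request, spider):
        # Requests flagged as payload requests are handled here
        if request.meta.get('payloadFlag', False):
            print("PayLoadRequestMiddleware enter")
            postUrl = request.url
            headers = dict(request.meta.get('headers', {}))
            # The body is JSON, so declare it unless the spider already did
            headers.setdefault('Content-Type', 'application/json')
            payloadData = request.meta.get('payloadData', {})
            # 'proxy' may be absent from meta, so avoid a KeyError
            proxy = request.meta.get('proxy')
            proxies = {"http": proxy, "https": proxy} if proxy else None
            timeOut = request.meta.get('download_timeout', 25)
            # dont_redirect=True means redirects must NOT be followed
            allow_redirects = not request.meta.get('dont_redirect', False)
            dumpJsonData = json.dumps(payloadData)
            print(f"dumpJsonData = {dumpJsonData}")
            # This call is synchronous and blocking, which hurts throughput badly
            res = requests.post(postUrl, data=dumpJsonData, headers=headers, timeout=timeOut, proxies=proxies, allow_redirects=allow_redirects)
            # res = requests.post(postUrl, json=payloadData, headers=headers)
            print(f"responseTime = {datetime.datetime.now()}, res text = {res.text}, statusCode = {res.status_code}")
            if 200 <= res.status_code < 300:
                # Returning a Response sends it straight to the callback; Scrapy will not download this request again
                return HtmlResponse(url=request.url,
                                    body=res.content,
                                    request=request,
                                    # Ideally chosen according to the page's actual encoding
                                    encoding='utf-8',
                                    status=200)
            else:
                print(f"request mode getting page error, statusCode = {res.status_code}")
                return HtmlResponse(url=request.url, status=500, request=request)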
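4.3. Remaining issues
Scrapy is powerful because of its high concurrency. Because of the GIL, Python threads cannot run CPU-bound code in parallel, but downloading and parsing can still overlap: while a download is waiting, Scrapy can parse data, schedule requests, and so on. This is thanks to Scrapy's asynchronous architecture.
However, downloading pages with the requests module inside the middleware is a synchronous operation, so the crawler blocks at that point, which lowers the efficiency of the whole crawl.
So the right approach depends on the project. This also raises a new topic: the two crawl orders Scrapy offers, depth-first and breadth-first. How can Scrapy's concurrency be used as fully as possible? How can data be retrieved as reliably as possible when the environment is unstable?
Depth-first versus breadth-first order is configured in settings.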
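# File: settings.py
# By default Scrapy's scheduler uses LIFO queues, so it crawls in DFO (depth-first) order.
# Per the Scrapy FAQ, BFO (breadth-first) order needs a positive DEPTH_PRIORITY (default 0)
# together with FIFO scheduler queues.
# Breadth-first may accumulate a large number of pending requests and use a lot of memory,
# and the final data is only crawled in the last batch.
DEPTH_PRIORITY = 1
SCHEDULER_DISK_QUEUE = 'scrapy.squeues.PickleFifoDiskQueue'
SCHEDULER_MEMORY_QUEUE = 'scrapy.squeues.FifoMemoryQueue'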
Depth-first: DEPTH_PRIORITY = 0
Breadth-first: DEPTH_PRIORITY = 1
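The effect of queue order on crawl order can be seen in a toy model. The sketch below is not Scrapy's scheduler; the link graph and the crawl_order function are hypothetical and only show how a LIFO queue gives depth-first order while a FIFO queue gives breadth-first order.

```python
from collections import deque

# Hypothetical link graph: each page maps to the pages it links to
links = {
    "root": ["a", "b"],
    "a": ["a1", "a2"],
    "b": ["b1"],
    "a1": [],
    "a2": [],
    "b1": [],
}


def crawl_order(start, fifo):
    """Return the visit order using a FIFO (breadth-first) or LIFO (depth-first) queue."""
    pending = deque([start])
    order = []
    while pending:
        # FIFO takes the oldest pending page, LIFO takes the newest
        page = pending.popleft() if fifo else pending.pop()
        order.append(page)
        pending.extend(links[page])
    return order


print(crawl_order("root", fifo=False))
print(crawl_order("root", fifo=True))
```

With the LIFO queue, one branch is followed to its leaves before the next starts. With the FIFO queue, every page at one depth is visited before any deeper page, which is why pending requests pile up in breadth-first order.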
I would like to make this step asynchronous but have not found a way to do it. Suggestions are welcome.
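One generic way to keep a blocking call from stalling other work is to hand it to a worker thread. The sketch below uses only the standard library and a stand-in blocking_fetch function (hypothetical; it sleeps instead of calling requests.post). It does not show how to wire this into a Scrapy downloader middleware, which remains an open question here.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def blocking_fetch(url):
    """Stand-in for a blocking call such as requests.post (hypothetical)."""
    time.sleep(0.1)  # simulated network wait; sleeping releases the GIL
    return f"response for {url}"


urls = [f"http://example.com/item/{i}" for i in range(4)]

start = time.perf_counter()
with ThreadPoolExecutor(max_workers=4) as pool:
    # map preserves input order even though the calls overlap in time
    results = list(pool.map(blocking_fetch, urls))
elapsed = time.perf_counter() - start

print(results)
print(f"elapsed: {elapsed:.2f}s")
```

Because the waits overlap, the four calls should finish in roughly the time of one rather than four in sequence. This helps with I/O-bound waiting only and does nothing for CPU-bound parsing.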