Scraping Lagou job listings with Python's requests library
阿新 • Published: 2020-11-23
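Press F12 to open the browser's developer tools and capture the network traffic; this is enough to locate the endpoint that serves the job listings.
The captured request shows the endpoint's URL and its form data. In the form, pn is the page number to request and kd is the job keyword to search for.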
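As a small illustration of the two form fields, the payload for each page can be produced by a helper. The name `build_form` is hypothetical, and the `first` field is simply copied from the captured request; its meaning is not established here.

```python
def build_form(page, keyword):
    """Return the form payload for one page of search results."""
    return {
        "first": "true",   # copied as-is from the captured request
        "pn": str(page),   # page number to request
        "kd": keyword,     # job keyword to search for
    }


# Payloads for the first three pages of "python" jobs
forms = [build_form(page, "python") for page in range(1, 4)]
print(forms[0])
```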
Build the POST request in Python:
import requests

# Form data: pn is the page number, kd is the search keyword
data = {
    'first': 'true',
    'pn': '1',
    'kd': 'python'
}
headers = {
    'referer': 'https://www.lagou.com/jobs/list_python/p-city_0?&cl=false&fromSearch=true&labelWords=&suginput=',
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.198 Safari/537.36'
}
# POST the form to the job-listing endpoint
res = requests.post("https://www.lagou.com/jobs/positionAjax.json?needAddtionalResult=false", data=data, headers=headers)
print(res.text)
This request does not return any job data from the endpoint.
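One way to detect this case in code, rather than by reading the printed text, is to check whether the decoded JSON contains the job list. The sketch below assumes the key path `content` → `positionResult` → `result`, which is the path the full crawler at the end of this article reads. The function name `has_results` and both sample payloads are hypothetical.

```python
def has_results(payload):
    """Return True if the decoded JSON response contains a job list."""
    try:
        result = payload["content"]["positionResult"]["result"]
    except (KeyError, TypeError):
        # A key is missing, or an intermediate value is not a dict
        return False
    return isinstance(result, list)


# Hypothetical payloads for illustration; real responses carry more fields.
ok = {"content": {"positionResult": {"result": [{"positionName": "Python developer"}]}}}
blocked = {"status": False, "msg": "too many requests"}

print(has_results(ok))       # True
print(has_results(blocked))  # False
```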
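Switching to a different network did not help: the endpoint still returned an "operation too frequent" error. A closer look showed that the endpoint requires dynamic cookies, and without them it keeps returning that error. The fix is to request the search page first and pass the cookies it sets to the POST request: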
import requests

data = {
    'first': 'true',
    'pn': '1',
    'kd': 'python'
}
# The headers must include user-agent and referer, otherwise no cookies are returned
headers = {
    'referer': 'https://www.lagou.com/jobs/list_python/p-city_0?&cl=false&fromSearch=true&labelWords=&suginput=',
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.198 Safari/537.36'
}
# Visit the search page to obtain the cookies
r1 = requests.get("https://www.lagou.com/jobs/list_python/p-city_0?&cl=false&fromSearch=true&labelWords=&suginput=", headers=headers)
# Pass the cookies from the first response into the POST request
r2 = requests.post("https://www.lagou.com/jobs/positionAjax.json?needAddtionalResult=false", data=data, headers=headers, cookies=r1.cookies)
print(r2.text)
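The same two-step flow can also be written around a `requests.Session`, which stores the cookies set by one response and sends them with later requests, so they do not have to be passed by hand. This is a sketch of an alternative, not the approach used in this article, and it has not been checked against the live site. The function name `fetch_jobs` and the `timeout` parameter are assumptions; the session is passed in as an argument so the function can be exercised without network access.

```python
SEARCH_PAGE = "https://www.lagou.com/jobs/list_python/p-city_0?&cl=false&fromSearch=true&labelWords=&suginput="
AJAX_URL = "https://www.lagou.com/jobs/positionAjax.json?needAddtionalResult=false"


def fetch_jobs(session, headers, data, timeout=10):
    """Visit the search page, then POST to the endpoint on the same session.

    `session` is expected to be a requests.Session (or any object with
    compatible get/post methods). A real Session keeps the cookies set by
    the GET response and attaches them to the POST automatically.
    """
    session.get(SEARCH_PAGE, headers=headers, timeout=timeout)
    return session.post(AJAX_URL, data=data, headers=headers, timeout=timeout)
```

With the real library this would be called as `fetch_jobs(requests.Session(), headers, data)`, using the `headers` and `data` dictionaries defined earlier.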
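Note: the cookies are also renewed after every ten requests to the endpoint, so they have to be fetched again. The complete crawler code follows.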
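The renewal rule can be isolated into a small loop that counts the requests made and fetches fresh cookies before every tenth one. This is a minimal sketch: `crawl`, `get_cookies`, `get_page`, and `refresh_every` are hypothetical names, and the two callables are stand-ins for whatever functions actually fetch the cookies and a page of results.

```python
def crawl(pages, get_cookies, get_page, refresh_every=10):
    """Fetch each page, renewing the cookies every `refresh_every` requests."""
    results = []
    cookies = None
    for sent, page in enumerate(pages):
        # `sent` is the number of requests already made;
        # 0 triggers the initial cookie fetch
        if sent % refresh_every == 0:
            cookies = get_cookies()
        results.append(get_page(page, cookies))
    return results
```

Counting requests instead of testing the page number keeps the rule correct even when the crawl does not start at page 1.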
import json
import requests

# Fetch fresh cookies by visiting the search page
def getCookie():
    res = requests.get("https://www.lagou.com/jobs/list_python/p-city_0?&cl=false&fromSearch=true&labelWords=&suginput=", headers=headers)
    return res.cookies

# Request one page of results and return the decoded JSON
def getPage(i, cookies, kw):
    data = {
        'first': 'true',
        'pn': i,
        'kd': kw
    }
    res = requests.post("https://www.lagou.com/jobs/positionAjax.json?needAddtionalResult=false", data=data, headers=headers, cookies=cookies)
    return json.loads(res.text)

# Join a list of strings into one space-separated string
def reduceList(l):
    return " ".join(l)

# Extract the fields and write them to the file
def saveInCsv(f, data):
    js = data["content"]["positionResult"]["result"]
    for node in js:
        # Handle a missing district
        district = node["district"]
        if district is not None:
            district = "-" + district
        else:
            district = ""
        f.write(
            node["positionName"] + "·" + node["city"] + district + "·"
            + node["salary"] + "·" + node["workYear"] + "·" + node["education"] + "·"
            + reduceList(node["skillLables"]) + "·" + node["companyShortName"] + "·"
            + node["companySize"] + "·" + node["positionAdvantage"] + "\n")

if __name__ == '__main__':
    # Request headers (a module-level name, so getCookie and getPage can read it)
    headers = {
        'referer': 'https://www.lagou.com/jobs/list_python/p-city_0?&cl=false&fromSearch=true&labelWords=&suginput=',
        'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.198 Safari/537.36'
    }
    # Initialise the cookies
    cookies = getCookie()
    with open("file.csv", "w", encoding="utf-8") as f:
        for i in range(1, 31):
            # Fetch new cookies every ten requests
            if i % 10 == 0:
                cookies = getCookie()
            # Parse the fields and store them
            data = getPage(i, cookies, "python")
            saveInCsv(f, data)
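The crawler above joins fields with "·" and only guards against a missing district. As a variation, the standard-library `csv` module can handle quoting, and every field can be read defensively. The field names below are the ones used in the crawler; the helper names `node_to_row` and `write_rows` and the sample record are hypothetical.

```python
import csv
import io


def node_to_row(node):
    """Flatten one job record into a list of strings, tolerating missing values."""
    def text(key):
        return node.get(key) or ""

    district = node.get("district")
    location = text("city") + ("-" + district if district else "")
    return [
        text("positionName"), location, text("salary"), text("workYear"),
        text("education"), " ".join(node.get("skillLables") or []),
        text("companyShortName"), text("companySize"), text("positionAdvantage"),
    ]


def write_rows(f, nodes):
    """Write one CSV row per job record to the open text file `f`."""
    writer = csv.writer(f)
    for node in nodes:
        writer.writerow(node_to_row(node))


# Hypothetical record for illustration only
sample = {
    "positionName": "Python Dev", "city": "Beijing", "district": None,
    "salary": "15k-25k", "workYear": "3-5", "education": "BSc",
    "skillLables": ["Python", "Django"], "companyShortName": "Acme",
    "companySize": "50-150", "positionAdvantage": "Flexible, remote",
}
buffer = io.StringIO()
write_rows(buffer, [sample])
print(buffer.getvalue())
```

When writing to a real file with `csv.writer`, open it with `newline=""` as the `csv` documentation recommends.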