Lagou Scraper in Practice (enter a keyword, crawl any position, and save the results automatically as a .csv file)
![178bc26d6a28e9f177010e9150d849f2.jpg](https://upload-images.jianshu.io/upload_images/9136378-068a8b1de5a0204f.jpg?imageMogr2/auto-orient/strip%7CimageView2/2/w/1240)
This post only goes over the main approach and the points to watch out for. If anything is unclear, ask in the comments and work through the code given at the end.
1. Analysis shows that Lagou's job data is loaded via Ajax. Lagou also only shows users 30 pages per job search, and the same search returns different data in different browsers. The server, however, holds more than 30 pages for each job; like Baidu, it picks 30 pages to display according to its own rules based on the user's search. By sending requests in a loop we can still pull down every page of job data the server holds (a minimal request sketch follows the screenshot below).
![image.png](https://upload-images.jianshu.io/upload_images/9136378-e5f7f26d796b3a8b.png?imageMogr2/auto-orient/strip%7CimageView2/2/w/1240)
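To make the "request in a loop" idea concrete, here is a minimal sketch of fetching a single page from the Ajax endpoint used by the full code at the end. It assumes the `requests` library; the real headers and cookies from that code are still needed, otherwise Lagou usually answers with an anti-crawler message instead of JSON.

```python
import requests

# Ajax endpoint behind the job-list page (same URL as in the full code at the end)
AJAX_URL = "https://www.lagou.com/jobs/positionAjax.json?needAddtionalResult=false"

def fetch_page(keyword, pn, headers):
    """Sketch only: fetch one page of results for `keyword`.

    `headers` must carry the Referer / Cookie fields shown in the full code;
    without them Lagou normally rejects the request."""
    form = {"first": "false", "pn": pn, "kd": keyword}
    resp = requests.post(AJAX_URL, headers=headers, data=form)
    return resp.json()

# Getting past the 30 visible pages is then just a loop over pn = 1, 2, 3, ...
# (detecting the last page is covered in step 4 below).
```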
2. Next, analyze the POST request (a sketch of its form data follows the screenshot).
![image.png](https://upload-images.jianshu.io/upload_images/9136378-5bbacf3a4902ed34.png?imageMogr2/auto-orient/strip%7CimageView2/2/w/1240)
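Judging from the screenshot above and the full code below, the POST body only carries three form fields. A commented sketch follows; the note on `first` is my reading of the captured request, not something the code depends on.

```python
# Form data posted to positionAjax.json:
#   first - the site appears to send "true" only for the first page of a search;
#           the crawler below simply sends "false" throughout
#   pn    - page number, starting from 1
#   kd    - the search keyword typed by the user
data = {
    "first": "false",
    "pn": 1,
    "kd": "Python",
}
```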
3. What remains is constructing the request, fetching the data, and writing the parsing rules (a parsing sketch follows the screenshot).
![image.png](https://upload-images.jianshu.io/upload_images/9136378-8955835c987e1f55.png?imageMogr2/auto-orient/strip%7CimageView2/2/w/1240)
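The JSON that comes back nests the actual job records under content → positionResult → result. Here is a minimal sketch of the parsing rules, with the field names taken from the full code at the end:

```python
def parse_positions(items):
    """Yield one flat dict per job record.

    `items` is the parsed JSON returned by positionAjax.json."""
    for info in items["content"]["positionResult"]["result"]:
        yield {
            "職位": info["positionName"],
            "工資": info["salary"],
            "工作經驗": info["workYear"],
            "學歷要求": info["education"],
            "公司名稱": info["companyFullName"],
        }
```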
4. We can send a probe request, say with page == 1000, to see what the server returns once the maximum page is exceeded, then look at which fields reveal that the last page has been reached and write an if check so the program knows the data for this job has been fully crawled and can exit. From such requests it turns out that once we are past the last page, the returned pageNo field is 0; see the sketch below and the full code at the end.
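A small sketch of that stop condition, matching the check inside run() in the full code:

```python
def reached_last_page(items):
    """True once the crawl has gone past the last real page.

    In the response, content.pageNo echoes the page actually served; request a
    page number beyond the end and it comes back as 0, which is the signal to stop."""
    return items["content"]["pageNo"] == 0
```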
5. Please read through the code; if anything is unclear, feel free to ask in the comments.
```python
import requests
import csv
import json
from json import JSONDecodeError
from retrying import retry


class lagouspider():
    def __init__(self):
        # Request headers captured from the browser; the Cookie and Referer are
        # what keep Lagou from rejecting the Ajax request as a crawler.
        self.header = {
            'Host': 'www.lagou.com',
            'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.84 Safari/537.36',
            'Accept': 'application/json, text/javascript, */*; q=0.01',
            'Accept-Language': 'zh-CN,en-US;q=0.7,en;q=0.3',
            'Accept-Encoding': 'gzip, deflate, br',
            'Referer': 'https://www.lagou.com/jobs/list_Python?labelWords=&fromSearch=true&suginput=',
            'Content-Type': 'application/x-www-form-urlencoded; charset=UTF-8',
            'X-Requested-With': 'XMLHttpRequest',
            'X-Anit-Forge-Token': 'None',
            'X-Anit-Forge-Code': '0',
            'Content-Length': '26',
            'Cookie': '_ga=GA1.2.1015646365.1538989463; user_trace_token=20181008170425-277ca381-cad9-11e8-bb68-5254005c3644; LGUID=20181008170425-277caaa8-cad9-11e8-bb68-5254005c3644; index_location_city=%E5%85%A8%E5%9B%BD; JSESSIONID=ABAAABAAAGGABCB4878F933065B5D43F6FB0748D8A0E39E; LGSID=20181018091439-2f6bfe7d-d273-11e8-bda4-5254005c3644; PRE_UTM=; PRE_HOST=www.google.com; PRE_SITE=https%3A%2F%2Fwww.google.com%2F; PRE_LAND=https%3A%2F%2Fwww.lagou.com%2F; _gid=GA1.2.465338117.1539825277; Hm_lvt_4233e74dff0ae5bd0a3d81c6ccf756e6=1538989463,1539825277,1539825286,1539825338; TG-TRACK-CODE=index_search; _gat=1; LGRID=20181018094311-2bef568a-d277-11e8-8058-525400f775ce; Hm_lpvt_4233e74dff0ae5bd0a3d81c6ccf756e6=1539826990; SEARCH_ID=44a9a700439e406a80372da370820d72',
            'Connection': 'keep-alive',
            'Pragma': 'no-cache',
            'Cache-Control': 'no-cache'
        }
        self.start_url = "https://www.lagou.com/jobs/positionAjax.json?needAddtionalResult=false"

    def get_post(self, url, data):
        # POST the form data and return the raw response body
        response = requests.post(url, headers=self.header, data=data)
        return response.text

    def get_data(self, kd, pn):
        # Build the form data: kd is the search keyword, pn is the page number
        data = {
            "first": "false",
            "pn": pn,
            "kd": kd
        }
        return data

    def json2dict(self, infos):
        # Parse the JSON response text into a dict
        try:
            items = json.loads(infos)
            return items
        except JSONDecodeError as e:
            print(e)

    @retry(stop_max_attempt_number=3)
    def get_info(self, items, item):
        # Pull the fields we keep out of each job record and give each a column name
        try:
            for info in items["content"]["positionResult"]["result"]:
                item['職位'] = info['positionName']
                item['職位釋出時間'] = info['createTime']
                item['工作經驗'] = info['workYear']
                item['學歷要求'] = info['education']
                item['工資'] = info['salary']
                item['職位誘惑'] = info['companyLabelList'] if len(info['companyLabelList']) > 0 else '空欄位'
                item['職位標籤'] = info['positionLables'] if len(info['positionLables']) > 0 else '空欄位'
                item['公司名稱'] = info['companyFullName']
                yield item
        except Exception:
            pass

    def run(self, key):
        num = 1
        item = {}
        while 1:
            try:
                data = self.get_data(kd=key, pn=num)
                infos = self.get_post(self.start_url, data=data)
                items = self.json2dict(infos)
                page = items["content"]["pageNo"]
                print("*" * 100)
                print("正在爬取第%d頁" % page)
                print("*" * 100)
                try:
                    # Append each parsed record to the CSV created in __main__
                    for i in self.get_info(items, item):
                        with open(key + '拉鉤網職位資訊.csv', 'a', newline='') as f:
                            writer = csv.DictWriter(f, headers)
                            writer.writerow(i)
                except PermissionError as e:
                    print(e)
                except UnicodeEncodeError as f:
                    print(f)
                num += 1
                if page == 0:
                    # pageNo comes back as 0 once we are past the last page
                    print("爬取結束")
                    break
            except TypeError as m:
                print(m)


if __name__ == '__main__':
    headers = ['職位', '職位釋出時間', '工作經驗', '工資', '職位誘惑', '職位標籤', '公司名稱']
    headers.insert(4, '學歷要求')  # keep the column order used by get_info()
    key = input("請輸入需要爬取職位的關鍵字:")
    # Create the CSV and write the header row once before crawling
    with open(key + '拉鉤網職位資訊.csv', 'w', newline='') as f:
        writer = csv.DictWriter(f, headers)
        writer.writeheader()
    spider = lagouspider()
    spider.run(key)
```