Lagou Scraper in Practice (enter a keyword, crawl any position, and save the results automatically as a .csv file)
![178bc26d6a28e9f177010e9150d849f2.jpg](https://upload-images.jianshu.io/upload_images/9136378-068a8b1de5a0204f.jpg?imageMogr2/auto-orient/strip%7CimageView2/2/w/1240)
This post only goes over the main approach and the points to watch out for. If anything is unclear, ask in the comments and work through the code given at the end.
1. Analysis shows that Lagou's job data is loaded via Ajax. Lagou also only shows users 30 pages per job search, and the same search returns different data in different browsers. The server, however, holds more than 30 pages for each job; like Baidu, it picks 30 pages to display according to its own rules based on the user's search. By sending requests in a loop we can still pull down every page of job data the server holds (a minimal request sketch follows the screenshot below).
![image.png](https://upload-images.jianshu.io/upload_images/9136378-e5f7f26d796b3a8b.png?imageMogr2/auto-orient/strip%7CimageView2/2/w/1240)
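To make the "request in a loop" idea concrete, here is a minimal sketch of fetching a single page from the Ajax endpoint used by the full code at the end. It assumes the `requests` library; the real headers and cookies from that code are still needed, otherwise Lagou usually answers with an anti-crawler message instead of JSON.

```python
import requests

# Ajax endpoint behind the job-list page (same URL as in the full code at the end)
AJAX_URL = "https://www.lagou.com/jobs/positionAjax.json?needAddtionalResult=false"

def fetch_page(keyword, pn, headers):
    """Sketch only: fetch one page of results for `keyword`.

    `headers` must carry the Referer / Cookie fields shown in the full code;
    without them Lagou normally rejects the request."""
    form = {"first": "false", "pn": pn, "kd": keyword}
    resp = requests.post(AJAX_URL, headers=headers, data=form)
    return resp.json()

# Getting past the 30 visible pages is then just a loop over pn = 1, 2, 3, ...
# (detecting the last page is covered in step 4 below).
```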
2. Next, analyze the POST request (a sketch of its form data follows the screenshot).
![image.png](https://upload-images.jianshu.io/upload_images/9136378-5bbacf3a4902ed34.png?imageMogr2/auto-orient/strip%7CimageView2/2/w/1240)
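Judging from the screenshot above and the full code below, the POST body only carries three form fields. A commented sketch follows; the note on `first` is my reading of the captured request, not something the code depends on.

```python
# Form data posted to positionAjax.json:
#   first - the site appears to send "true" only for the first page of a search;
#           the crawler below simply sends "false" throughout
#   pn    - page number, starting from 1
#   kd    - the search keyword typed by the user
data = {
    "first": "false",
    "pn": 1,
    "kd": "Python",
}
```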
3. What remains is constructing the request, fetching the data, and writing the parsing rules (a parsing sketch follows the screenshot).
![image.png](https://upload-images.jianshu.io/upload_images/9136378-8955835c987e1f55.png?imageMogr2/auto-orient/strip%7CimageView2/2/w/1240)
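The JSON that comes back nests the actual job records under content → positionResult → result. Here is a minimal sketch of the parsing rules, with the field names taken from the full code at the end:

```python
def parse_positions(items):
    """Yield one flat dict per job record.

    `items` is the parsed JSON returned by positionAjax.json."""
    for info in items["content"]["positionResult"]["result"]:
        yield {
            "職位": info["positionName"],
            "工資": info["salary"],
            "工作經驗": info["workYear"],
            "學歷要求": info["education"],
            "公司名稱": info["companyFullName"],
        }
```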
4. We can send a probe request, say with page == 1000, to see what the server returns once the maximum page is exceeded, then look at which fields reveal that the last page has been reached and write an if check so the program knows the data for this job has been fully crawled and can exit. From such requests it turns out that once we are past the last page, the returned pageNo field is 0; see the sketch below and the full code at the end.
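A small sketch of that stop condition, matching the check inside run() in the full code:

```python
def reached_last_page(items):
    """True once the crawl has gone past the last real page.

    In the response, content.pageNo echoes the page actually served; request a
    page number beyond the end and it comes back as 0, which is the signal to stop."""
    return items["content"]["pageNo"] == 0
```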
5. Please read through the code; if anything is unclear, feel free to ask in the comments.
```python
import requests
import csv
import json
from json import JSONDecodeError
from retrying import retry


class lagouspider():
    def __init__(self):
        # Request headers captured from the browser; the Cookie and Referer are
        # what keep Lagou from rejecting the Ajax request as a crawler.
        self.header = {
            'Host': 'www.lagou.com',
            'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.84 Safari/537.36',
            'Accept': 'application/json, text/javascript, */*; q=0.01',
            'Accept-Language': 'zh-CN,en-US;q=0.7,en;q=0.3',
            'Accept-Encoding': 'gzip, deflate, br',
            'Referer': 'https://www.lagou.com/jobs/list_Python?labelWords=&fromSearch=true&suginput=',
            'Content-Type': 'application/x-www-form-urlencoded; charset=UTF-8',
            'X-Requested-With': 'XMLHttpRequest',
            'X-Anit-Forge-Token': 'None',
            'X-Anit-Forge-Code': '0',
            'Content-Length': '26',
            'Cookie': '_ga=GA1.2.1015646365.1538989463; user_trace_token=20181008170425-277ca381-cad9-11e8-bb68-5254005c3644; LGUID=20181008170425-277caaa8-cad9-11e8-bb68-5254005c3644; index_location_city=%E5%85%A8%E5%9B%BD; JSESSIONID=ABAAABAAAGGABCB4878F933065B5D43F6FB0748D8A0E39E; LGSID=20181018091439-2f6bfe7d-d273-11e8-bda4-5254005c3644; PRE_UTM=; PRE_HOST=www.google.com; PRE_SITE=https%3A%2F%2Fwww.google.com%2F; PRE_LAND=https%3A%2F%2Fwww.lagou.com%2F; _gid=GA1.2.465338117.1539825277; Hm_lvt_4233e74dff0ae5bd0a3d81c6ccf756e6=1538989463,1539825277,1539825286,1539825338; TG-TRACK-CODE=index_search; _gat=1; LGRID=20181018094311-2bef568a-d277-11e8-8058-525400f775ce; Hm_lpvt_4233e74dff0ae5bd0a3d81c6ccf756e6=1539826990; SEARCH_ID=44a9a700439e406a80372da370820d72',
            'Connection': 'keep-alive',
            'Pragma': 'no-cache',
            'Cache-Control': 'no-cache'
        }
        self.start_url = "https://www.lagou.com/jobs/positionAjax.json?needAddtionalResult=false"

    def get_post(self, url, data):
        # POST the form data and return the raw response body
        response = requests.post(url, headers=self.header, data=data)
        return response.text

    def get_data(self, kd, pn):
        # Build the form data: kd is the search keyword, pn is the page number
        data = {
            "first": "false",
            "pn": pn,
            "kd": kd
        }
        return data

    def json2dict(self, infos):
        # Parse the JSON response text into a dict
        try:
            items = json.loads(infos)
            return items
        except JSONDecodeError as e:
            print(e)

    @retry(stop_max_attempt_number=3)
    def get_info(self, items, item):
        # Pull the fields we keep out of each job record and give each a column name
        try:
            for info in items["content"]["positionResult"]["result"]:
                item['職位'] = info['positionName']
                item['職位釋出時間'] = info['createTime']
                item['工作經驗'] = info['workYear']
                item['學歷要求'] = info['education']
                item['工資'] = info['salary']
                item['職位誘惑'] = info['companyLabelList'] if len(info['companyLabelList']) > 0 else '空欄位'
                item['職位標籤'] = info['positionLables'] if len(info['positionLables']) > 0 else '空欄位'
                item['公司名稱'] = info['companyFullName']
                yield item
        except Exception:
            pass

    def run(self, key):
        num = 1
        item = {}
        while 1:
            try:
                data = self.get_data(kd=key, pn=num)
                infos = self.get_post(self.start_url, data=data)
                items = self.json2dict(infos)
                page = items["content"]["pageNo"]
                print("*" * 100)
                print("正在爬取第%d頁" % page)
                print("*" * 100)
                try:
                    # Append each parsed record to the CSV created in __main__
                    for i in self.get_info(items, item):
                        with open(key + '拉鉤網職位資訊.csv', 'a', newline='') as f:
                            writer = csv.DictWriter(f, headers)
                            writer.writerow(i)
                except PermissionError as e:
                    print(e)
                except UnicodeEncodeError as f:
                    print(f)
                num += 1
                if page == 0:
                    # pageNo comes back as 0 once we are past the last page
                    print("爬取結束")
                    break
            except TypeError as m:
                print(m)


if __name__ == '__main__':
    headers = ['職位', '職位釋出時間', '工作經驗', '工資', '職位誘惑', '職位標籤', '公司名稱']
    headers.insert(4, '學歷要求')  # keep the column order used by get_info()
    key = input("請輸入需要爬取職位的關鍵字:")
    # Create the CSV and write the header row once before crawling
    with open(key + '拉鉤網職位資訊.csv', 'w', newline='') as f:
        writer = csv.DictWriter(f, headers)
        writer.writeheader()
    spider = lagouspider()
    spider.run(key)
```