pyspider爬蟲框架之拉勾網招聘資訊爬取
阿新 • • 發佈:2019-01-27
需求
- 遍歷所有職位目錄
- 點選職位分類,進入之後按照地區抓取,職位名稱,釋出時間,薪酬,工作年限要求,學歷要求,招聘公司,所屬行業,所處輪次
- 進入職位詳情頁,抓取HR聊天意願(用時),簡歷處理,活躍時段。
程式碼
程式碼中有詳細的註解,這裡就不再一步一步講解了。另外需要注意:如果沒有搭建代理,請慎用本爬蟲,IP 很容易被封禁。
#!/usr/bin/env python
# -*- encoding: utf-8 -*-
# Created on 2018-08-17 14:49:15
# Project: lagou
from pyspider.libs.base_handler import *
import re
import datetime
from pymongo import MongoClient
# MongoDB sink configuration: results are upserted into research.lagou_recruit.
DB_NAME = 'research'
DB_COL = 'lagou_recruit'
# BUG FIX: the original referenced an undefined name ``client`` (NameError at
# import time) -- the connection must be created first. Defaults to
# localhost:27017; point this at the real server if it runs elsewhere.
client = MongoClient()
db = client[DB_NAME]
col = db[DB_COL]
class Handler(BaseHandler):
    """Crawl job postings from lagou.com.

    Pipeline: home page -> job sub-category -> city -> district ->
    paginated AJAX search results -> job detail page. Each finished
    record is upserted into MongoDB by ``on_result``.

    NOTE(review): without a working proxy the site bans the crawling IP
    quickly -- keep ``proxy`` pointed at a live proxy service.
    """

    crawl_config = {
        "headers": {
            "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/66.0.3359.181 Safari/537.36"
        },
        "proxy": "http://localhost:6666"  # locally hosted proxy service
    }

    url = 'https://www.lagou.com/'

    def format_date(self, date):
        """Parse a ``'YYYY-MM-DD'`` string into a ``datetime`` object."""
        return datetime.datetime.strptime(date, '%Y-%m-%d')

    def get_today(self):
        """Return today's date as a datetime with the time part zeroed."""
        return datetime.datetime.strptime(
            datetime.datetime.now().strftime('%Y-%m-%d'), '%Y-%m-%d')

    @every(minutes=24 * 60)
    def on_start(self):
        """Entry point: schedule the home page once a day."""
        self.crawl(self.url, callback=self.index_page)

    @config(age=60)
    def index_page(self, response):
        """Walk the category menu on the home page; schedule one crawl per
        sub-category, carrying the sub-category name along in ``save``."""
        page = response.etree
        # Top-level category groups in the navigation menu.
        cat_list = page.xpath(
            "//div[@class='mainNavs']/div[@class='menu_box']/div[@class='menu_sub dn']/dl")
        for each in cat_list:
            # Big-category title (logged only).
            b_title = each.xpath("./dt/span/text()")[0]
            print('-----------', b_title, '------------')
            # Sub-category links under this group.
            sub_list = each.xpath("./dd/a")
            for sub in sub_list:
                sub_title = sub.xpath("./text()")[0]
                link_url = sub.xpath("./@href")[0]
                print(sub_title, ' ', link_url)
                save = {"belonging": sub_title}
                self.crawl(link_url, callback=self.parse_categories, save=save)

    @config(age=60)
    def parse_categories(self, response):
        """For one sub-category page, schedule one listing crawl per city."""
        page = response.etree
        base_url = "https://www.lagou.com/jobs/list_"
        # City links; drop the leading "nationwide" entry and the trailing
        # "more" entry.
        city_list = page.xpath(
            "//div[@class='details']/div[@class='has-more']/div[@class='more more-positions workPosition']/li/a")[1:-1]
        for each in city_list:
            city = each.xpath("./text()")[0]
            print(city)
            link_url = base_url + response.save["belonging"]
            params = {"px": "default",
                      "city": city}
            save = {"belonging": response.save["belonging"], "city": city}
            self.crawl(link_url, callback=self.parse_city,
                       params=params, save=save)

    @config(age=60)
    def parse_city(self, response):
        """For one city listing page, schedule one crawl per district."""
        page = response.etree
        # District links; skip the first entry ("no preference").
        district_list = page.xpath(
            "//div[@class='contents' and @data-type='district']/a")[1:]
        print(response.url)
        for num, each in enumerate(district_list):
            district = each.xpath("./text()")[0]
            print(district)
            params = {
                "district": district
            }
            # The fragment makes every task URL unique so pyspider does not
            # de-duplicate requests that differ only in ``params``.
            link_url = response.url + "#%s" % num
            save = {"belonging": response.save["belonging"],
                    "city": response.save["city"],
                    "district": district}
            self.crawl(link_url, callback=self.parse_district,
                       params=params, save=save)

    @config(age=60)
    def parse_district(self, response):
        """Read the pager for one district and schedule every page of the
        position-search AJAX endpoint."""
        page = response.etree
        # Without these headers the AJAX endpoint returns no data.
        headers = {"Host": "www.lagou.com",
                   "Origin": "https://www.lagou.com",
                   "Referer": response.url,
                   "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/66.0.3359.181 Safari/537.36"
                   }
        base_url = 'https://www.lagou.com/jobs/positionAjax.json'
        # Total page count lives in the pager widget; if it is absent or
        # malformed there is nothing to crawl for this district.
        try:
            page_num = int(page.xpath(
                "//div[@class='page-number']/span[@class='span totalNum']/text()")[0])
            print(page_num)
        except (IndexError, ValueError):  # was a bare ``except:``
            return
        for each in range(1, page_num + 1):
            data = {"first": "false",
                    "pn": each,
                    "kd": response.save["belonging"]
                    }
            params = {"px": "default",
                      "city": response.save["city"],
                      "district": response.save["district"],
                      "needAddtionalResult": "false"
                      }
            # Unique fragment per page so pyspider schedules every page.
            link_url = base_url + "#%s" % each
            self.crawl(link_url, callback=self.parse_page, params=params,
                       method='POST', data=data, save=response.save,
                       headers=headers)

    @config(age=60)
    def parse_page(self, response):
        """Parse one JSON page of search results; schedule the detail page
        of every position, carrying all listing fields in ``save``."""
        page = response.json
        base_url = 'https://www.lagou.com/jobs/{}.html'
        # Result list for this page.
        contents = page["content"]["positionResult"]["result"]
        print(contents)
        for each in contents:
            # Position title.
            position_name = each["positionName"]
            print(position_name)
            # Publication time arrives in relative forms; normalise each of
            # them to 'YYYY-MM-DD'.
            public_time = each["formatCreateTime"]
            print(public_time)
            if re.findall(r'\d+:\d+', public_time):
                # "HH:MM" means posted today.
                public_time = datetime.datetime.now().strftime('%Y-%m-%d')
                print(public_time)
            if re.findall(r'(\d+)天前', public_time):
                # "N days ago".
                delta = int(re.findall(r'(\d+)天前', public_time)[0])
                public_time = (datetime.datetime.now()
                               + datetime.timedelta(days=-delta)).strftime('%Y-%m-%d')
                print(public_time)
            if re.findall(u'昨天', public_time):
                # "yesterday".
                public_time = (datetime.datetime.now()
                               + datetime.timedelta(days=-1)).strftime('%Y-%m-%d')
                print(public_time)
            # Pay range.
            salary = each["salary"]
            print(salary)
            # Required years of experience.
            experience = each["workYear"]
            print(experience)
            # Required education level.
            education = each["education"]
            print(education)
            # Hiring company.
            company = each["companyFullName"]
            print(company)
            # Company's industry.
            company_belong = each["industryField"]
            print(company_belong)
            # Funding round.
            rounds = each["financeStage"]
            print(rounds)
            # Perks; ``or []`` guards against a null label list in the JSON.
            welfare = '-'.join(each["companyLabelList"] or [])
            print(welfare)
            print('----------------------------------------')
            save = {"belonging": response.save["belonging"],
                    "city": response.save["city"],
                    "district": response.save["district"],
                    "position_name": position_name,
                    "public_time": public_time,
                    "salary": salary,
                    "experience": experience,
                    "education": education,
                    "company": company,
                    "company_belong": company_belong,
                    "rounds": rounds,
                    "welfare": welfare
                    }
            position_id = each["positionId"]
            link_url = base_url.format(position_id)
            self.crawl(link_url, callback=self.parse_detail, save=save)

    def parse_detail(self, response):
        """Scrape HR responsiveness stats from a job detail page and yield
        the combined record for ``on_result``."""
        page = response.etree
        try:
            # HR info panel: chat willingness (response time), resume
            # handling rate, active hours.
            hr_info = page.xpath(
                "//dd[@class='jd_publisher']/div/div[@class='publisher_data']")[0]
            chat_will = hr_info.xpath("./div[1]/span[@class='data']/text()")[0]
            print(chat_will)
            resume_processing = hr_info.xpath("./div[2]/span[@class='data']/text()")[0]
            print(resume_processing)
            active_time = hr_info.xpath("./div[3]/span[@class='data']/text()")[0]
            print(active_time)
        except IndexError:  # panel missing -> leave all three fields blank
            chat_will = ''
            resume_processing = ''
            active_time = ''
        result = {"belonging": response.save["belonging"],
                  "city": response.save["city"],
                  "district": response.save["district"],
                  "position_name": response.save["position_name"],
                  "public_time": self.format_date(response.save["public_time"]),
                  "salary": response.save["salary"],
                  "experience": response.save["experience"],
                  "education": response.save["education"],
                  "company": response.save["company"],
                  "company_belong": response.save["company_belong"],
                  "rounds": response.save["rounds"],
                  "welfare": response.save["welfare"],
                  "chat_will": chat_will,
                  "resume_processing": resume_processing,
                  "active_time": active_time,
                  "update_time": datetime.datetime.now(),
                  "date": self.get_today()
                  }
        yield result

    def on_result(self, result):
        """Upsert each crawled record into MongoDB, keyed on the fields
        that identify a unique posting."""
        if result is None:
            return
        update_key = {
            'position_name': result['position_name'],
            'public_time': result['public_time'],
            'city': result['city'],
            'district': result['district'],
            'company': result['company'],
            'belonging': result['belonging']
        }
        # ``Collection.update`` is deprecated in pymongo 3.x; ``update_one``
        # with the same filter/$set/upsert is the exact replacement.
        col.update_one(update_key, {'$set': result}, upsert=True)