Scraping Python Job Listings from 51job (Qianchengwuyou) with Scrapy
阿新 • Published: 2018-12-22
Originally I wanted to do some data analysis. I went looking for tutorials online and followed along with the relevant blog posts, but with no data to work on I had to go scrape it myself. A word of advice for the 51job staff: while crawling I set Scrapy's concurrency to 128 with a download delay of only two seconds, fully expecting my IP to get banned. It turned out that simply changing the request headers was enough to get through, which basically amounts to having no anti-scraping mechanism at all. On top of that, while cleaning the data I found a lot of fake job postings, which is something the site really ought to moderate.
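For reference, the crawl speed described above maps onto two values in settings.py. This is a sketch based on the numbers in this paragraph, not a dump of my actual config:

```python
# settings.py -- crawl speed as described above
CONCURRENT_REQUESTS = 128  # concurrency of 128
DOWNLOAD_DELAY = 2         # two-second delay between requests
```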
Inspiration: https://www.jianshu.com/p/309493fe5c7b and https://blog.csdn.net/lbship/article/details/79452459
Quick analysis
Go to the official site and search for python; you get redirected to https://search.51job.com/list/000000%252C00,000000,0000,00,9,99,python,2,1.html followed by a long string of parameters which, after some testing, turn out to be useless. What matters is the number right before .html: it is the page number. Change it to 2 and you will see what I mean (a quick URL-building sketch follows below). If you have never used Scrapy, start with this course first: https://www.imooc.com/learn/1017
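Since only that trailing page number changes, the listing URLs can be generated directly. A minimal sketch of my own, not code from the project:

```python
# Build listing URLs for pages 1-5 by filling in the page number before ".html".
base_url = 'https://search.51job.com/list/000000,000000,0000,00,9,99,python,2,{}.html'
page_urls = [base_url.format(page) for page in range(1, 6)]
```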
The page structure is straightforward: in parse(), use XPath or CSS selectors to grab the list of job rows, yield an item for each row, and once a page has been consumed, check whether it is the last page; if not, follow the link to the next page and keep crawling.
```python
import scrapy

from ..items import S1JobobpypositionItem


class JobspiderSpider(scrapy.Spider):
    name = 'jobSpider'
    allowed_domains = ['51job.com']
    start_urls = ['https://search.51job.com/list/000000,000000,0000,00,9,99,python,2,1.html']

    def parse(self, response):
        jobs = response.xpath('//*[@id="resultList"]/div[@class="el"]')
        for job in jobs:
            # Create a fresh item per row; reusing a single item across yields
            # would make every yielded result share the same mutable object.
            item = S1JobobpypositionItem()
            item['position'] = job.xpath('./p//a/@title').extract_first()
            item['company'] = job.xpath('./span[1]/a/@title').extract_first()
            item['location'] = job.xpath('./span[2]/text()').extract_first()
            item['salary'] = job.xpath('./span[3]/text()').extract_first()
            item['time'] = job.xpath('./span[4]/text()').extract_first()
            yield item

        # The last li.bk link is the "next page" link; on the final page
        # its href is empty, so the crawl stops there.
        next_page_url = response.css('#resultList li.bk a::attr(href)').extract()[-1]
        if next_page_url:
            yield scrapy.Request(url=next_page_url, callback=self.parse)
```
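The spider references S1JobobpypositionItem from items.py, which the post never shows. Here is a hypothetical definition matching the five fields the spider fills:

```python
# items.py -- hypothetical, reconstructed from the fields used in parse()
import scrapy


class S1JobobpypositionItem(scrapy.Item):
    position = scrapy.Field()  # job title
    company = scrapy.Field()   # company name
    location = scrapy.Field()  # work location
    salary = scrapy.Field()    # salary range, as displayed
    time = scrapy.Field()      # posting date
```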
The item data is written to a MySQL database in pipelines.py (for MySQL connection details see: https://blog.csdn.net/qq_42776455/article/details/82959857):
```python
import pymysql


class saveMysql(object):
    def __init__(self):
        self.client = pymysql.Connect(
            host='localhost',
            port=3306,
            user='root',
            passwd='root',
            db='51job',
            charset='utf8'
        )
        self.cursor = self.client.cursor()

    def process_item(self, item, spider):
        # Let the driver escape the values instead of interpolating them
        # into the SQL string, which breaks on quotes in job titles.
        sql = ('insert into python_jobs (position, company, salary, location, time) '
               'values (%s, %s, %s, %s, %s)')
        self.cursor.execute(sql, (item['position'], item['company'],
                                  item['salary'], item['location'], item['time']))
        self.client.commit()
        return item

    def close_spider(self, spider):
        self.cursor.close()
        self.client.close()
```
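The pipeline assumes a python_jobs table already exists. The post never shows the schema, so the following one-off setup script is my guess at what it might look like; adjust the column types to taste:

```python
# One-off table setup; the column names mirror the pipeline's INSERT,
# the types are assumptions.
import pymysql

client = pymysql.Connect(host='localhost', port=3306, user='root',
                         passwd='root', db='51job', charset='utf8')
cursor = client.cursor()
cursor.execute("""
    CREATE TABLE IF NOT EXISTS python_jobs (
        id INT AUTO_INCREMENT PRIMARY KEY,
        position VARCHAR(255),
        company VARCHAR(255),
        salary VARCHAR(64),
        location VARCHAR(128),
        time VARCHAR(32)
    ) DEFAULT CHARSET = utf8
""")
client.commit()
cursor.close()
client.close()
```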
Remember to register the pipeline in settings.py.
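Something like the following; the dotted module path assumes the Scrapy project is named S1job, so adjust it to your own project layout:

```python
# settings.py -- enable the MySQL pipeline (module path is an assumption)
ITEM_PIPELINES = {
    'S1job.pipelines.saveMysql': 300,
}
```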
Add a downloader middleware to build the request headers: https://blog.csdn.net/qq_42776455/article/details/83150012
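The linked post has the full write-up; as a minimal sketch, a middleware that rotates the User-Agent header on every request could look like this (the UA strings below are just examples):

```python
# middlewares.py -- minimal random User-Agent middleware (illustrative)
import random


class RandomUserAgentMiddleware(object):
    USER_AGENTS = [
        'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36',
        'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6) AppleWebKit/605.1.15',
    ]

    def process_request(self, request, spider):
        # Pick a different User-Agent for each outgoing request.
        request.headers['User-Agent'] = random.choice(self.USER_AGENTS)
```

Enable it through DOWNLOADER_MIDDLEWARES in settings.py, e.g. {'S1job.middlewares.RandomUserAgentMiddleware': 400} (again, the module path depends on your project name).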