Scraping 51job (前程無憂) Job Listings with Python

阿新 • Published: 2021-06-20

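I. Background

New graduates are often at a loss because they do not know how pay and benefits differ between positions. This project collects several points of reference on what jobs offer after graduation, so that positions can be compared before choosing.

1. Data source

51job (前程無憂, https://www.51job.com/)

2. What is scraped

The scraped fields include job title, company name, location, salary, education requirement, and posting date.

II. Scraping steps

1. Packages required by the code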

import urllib.request
import xlwt
import re
import urllib.parse
import time

2. Go to the 51job website and search for job listings

3. Open the browser's developer tools

4. Mimic a browser

header = {
    'Host': 'search.51job.com',
    'Upgrade-Insecure-Requests': '1',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.108 Safari/537.36'
}
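A headers dictionary only takes effect once it is attached to a request. Here is a minimal sketch of how that could be done with urllib.request.Request. The shortened User-Agent string and the sample URL are placeholders chosen for illustration, and nothing is actually sent over the network.

```python
import urllib.request

# Placeholder values for illustration; the real dictionary is the `header` shown above
header = {
    'Host': 'search.51job.com',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)',
}

url = 'https://search.51job.com/list/000000,000000,0000,00,9,99,python,2,1.html'

# Attaching the headers builds the request object but does not send it yet
req = urllib.request.Request(url, headers=header)

# urllib stores header names capitalized, e.g. 'User-agent'
print(req.get_header('User-agent'))

# Sending it would look like this (left commented out on purpose):
# html = urllib.request.urlopen(req).read().decode('gbk')
```

Passing `req` instead of a bare URL string to `urlopen` is what makes the server see the browser-like headers.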

5. To do the scraping, I wrote a function that takes the job keyword you want to look up and fetches the matching results

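# page is the page number; item is the search keyword typed by the user (see below)
def getfront(page, item):
    # Percent-encode the keyword so that it is safe to put in a URL
    result = urllib.parse.quote(item)
    ur1 = result + ',2,' + str(page) + '.html'
    ur2 = 'https://search.51job.com/list/000000,000000,0000,00,9,99,'
    res = ur2 + ur1  # assemble the full URL
    # Attach the browser-like headers defined above; without this they are never sent
    req = urllib.request.Request(res, headers=header)
    a = urllib.request.urlopen(req)
    # Read the page source and decode it from GBK into a str
    html = a.read().decode('gbk')
    return html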
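The URL assembly can be tried on its own, without touching the network. The sketch below repeats the same string concatenation in a helper called build_search_url (a name made up here) and shows what urllib.parse.quote does to a non-ASCII keyword.

```python
import urllib.parse

BASE = 'https://search.51job.com/list/000000,000000,0000,00,9,99,'

def build_search_url(keyword, page):
    """Return the search URL for one results page (hypothetical helper)."""
    # quote() percent-encodes the UTF-8 bytes of non-ASCII characters
    encoded = urllib.parse.quote(keyword)
    return BASE + encoded + ',2,' + str(page) + '.html'

print(build_search_url('python', 1))
print(urllib.parse.quote('大数据'))  # %E5%A4%A7%E6%95%B0%E6%8D%AE
```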

def getInformation(html):
    # re.S lets '.' match newlines too, so one pattern can span several lines of HTML
    reg = re.compile(r'class="t1 ">.*? <a target="_blank" title="(.*?)" href="(.*?)".*? <span class="t2"><a target="_blank" title="(.*?)" href="(.*?)".*?<span class="t3">(.*?)</span>.*?<span class="t4">(.*?)</span>.*?<span class="t5">(.*?)</span>.*?', re.S)
    items = re.findall(reg, html)
    return items

Besides the basic fields, the URLs behind the job link and the company link are captured as well.
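To see how this kind of pattern pulls fields out of a results row, here is a small self-contained sketch. The HTML in SAMPLE is made up to resemble the class names used in the pattern (t1 to t5); it is not copied from the real site. Named groups are an illustrative choice that makes each captured field easier to identify than a positional index.

```python
import re

# Made-up snippet that imitates one row of the results list
SAMPLE = '''
<p class="t1 ">
  <a target="_blank" title="Data Analyst" href="https://example.com/job/1">Data Analyst</a>
</p>
<span class="t2"><a target="_blank" title="Acme Ltd" href="https://example.com/co/1">Acme Ltd</a></span>
<span class="t3">Shanghai</span>
<span class="t4">1-1.5万/月</span>
<span class="t5">06-20</span>
'''

ROW = re.compile(
    r'class="t1 ">.*?<a target="_blank" title="(?P<title>.*?)" href="(?P<url>.*?)"'
    r'.*?<span class="t2"><a target="_blank" title="(?P<company>.*?)" href=".*?"'
    r'.*?<span class="t3">(?P<location>.*?)</span>'
    r'.*?<span class="t4">(?P<salary>.*?)</span>'
    r'.*?<span class="t5">(?P<date>.*?)</span>',
    re.S,  # let '.' match newlines, since a row spans several lines
)

def parse_rows(html):
    """Return one dict per matched row."""
    return [m.groupdict() for m in ROW.finditer(html)]

rows = parse_rows(SAMPLE)
print(rows[0]['title'], rows[0]['salary'])
```

Regex-based parsing like this breaks as soon as the page layout changes, so the pattern has to be checked against the current page source before relying on it.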

6. Save the scraped data to an Excel file, which is clearer and easier to read.

# Create a new workbook
excel1 = xlwt.Workbook()
# Add a worksheet; cell_overwrite_ok=True allows a cell to be written more than once
sheet1 = excel1.add_sheet('Job', cell_overwrite_ok=True)

# Header row: serial no., job title, company name, location, company nature, salary,
# education, experience, company size, company type, benefits, posting date.
# The names stay in Chinese because the cleaning code below looks columns up by name.
columns = ['序號', '職位', '公司名稱', '公司地點', '公司性質', '薪資',
           '學歷要求', '工作經驗', '公司規模', '公司型別', '公司福利', '釋出時間']
for col, name in enumerate(columns):
    sheet1.write(0, col, name)

The scraping code is as follows:

number = 1
item = input()  # job keyword to search for

for j in range(1, 1000):
    try:
        print("Scraping page " + str(j) + "...")
        # Fetch the source of one results page
        html = getfront(j, item)

        for i in getInformation(html):
            try:
                # URL of the job detail page
                url1 = i[1]
                res1 = urllib.request.urlopen(url1).read().decode('gbk')
                company = re.findall(re.compile(r'<div class="com_tag">.*?<p class="at" title="(.*?)"><span class="i_flag">.*?<p class="at" title="(.*?)">.*?<p class="at" title="(.*?)">.*?', re.S), res1)

                # Note: the '|' characters in this pattern are not escaped, so they act as
                # regex alternation; the fields are picked out by match index below.
                job_need = re.findall(re.compile(r'<p class="msg ltype".*?>.*?&nbsp;&nbsp;<span>|</span>&nbsp;&nbsp;(.*?)&nbsp;&nbsp;<span>|</span>&nbsp;&nbsp;(.*?)&nbsp;&nbsp;<span>|</span>&nbsp;&nbsp;.*?</p>', re.S), res1)

                welfare = re.findall(re.compile(r'<span class="sp4">(.*?)</span>', re.S), res1)
                print(i[0], i[2], i[4], i[5], company[0][0], job_need[2][0], job_need[1][0],
                      company[0][1], company[0][2], welfare, i[6])

                sheet1.write(number, 0, number)
                sheet1.write(number, 1, i[0])
                sheet1.write(number, 2, i[2])
                sheet1.write(number, 3, i[4])
                sheet1.write(number, 4, company[0][0])
                sheet1.write(number, 5, i[5])
                sheet1.write(number, 6, job_need[1][0])
                sheet1.write(number, 7, job_need[2][0])
                sheet1.write(number, 8, company[0][1])
                sheet1.write(number, 9, company[0][2])
                sheet1.write(number, 10, "  ".join(str(w) for w in welfare))
                sheet1.write(number, 11, i[6])

                number += 1
                excel1.save("51job.xls")
                # Pause between requests so that a large crawl is not mistaken
                # for an attack and the IP does not get banned
                time.sleep(0.3)
            except Exception:
                # Skip listings whose detail page cannot be fetched or parsed
                pass
    except Exception:
        # Skip result pages that cannot be fetched or decoded
        pass
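A loop that walks up to 999 pages and silently swallows every error keeps requesting pages long after the results have run out. One possible alternative, sketched below, stops at the first empty or failed page. The function crawl and its parameters fetch_page and parse_rows are names invented for this sketch; in the script above their roles would be played by getfront and getInformation. The demo uses fake in-memory pages so that it runs without network access.

```python
import time

def crawl(fetch_page, parse_rows, max_pages=999, delay=0.3):
    """Collect rows page by page until a page is empty or cannot be fetched."""
    rows = []
    for page in range(1, max_pages + 1):
        try:
            html = fetch_page(page)
        except OSError as exc:  # urllib.error.URLError is a subclass of OSError
            print(f"page {page} failed: {exc}")
            break
        page_rows = parse_rows(html)
        if not page_rows:
            break  # no listings: assume we are past the last page
        rows.extend(page_rows)
        time.sleep(delay)  # be polite between requests
    return rows

# Demo with fake pages instead of real HTTP requests
fake_pages = {1: 'job-a,job-b', 2: 'job-c'}
result = crawl(
    fetch_page=lambda page: fake_pages.get(page, ''),
    parse_rows=lambda html: html.split(',') if html else [],
    delay=0,
)
print(result)
```

Reporting the failure instead of using a bare `except: pass` also makes it much easier to notice when the site starts blocking requests.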

The result is as follows:

III. Data cleaning and processing

1. First open the file

# coding: utf-8
import pandas as pd
import re

# The xlrd package must also be installed so that pandas can read .xls files

data = pd.read_excel(r'51job.xls', sheet_name='Job')
result = pd.DataFrame(data)

Cleaning approach:

1. Rows that contain any empty value are dropped entirely

a = result.dropna(axis=0, how='any')
# Show every row when printing, without truncation
pd.set_option('display.max_rows', None)

2. Wrong job titles (the scraped job is unrelated to the intended one)

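# Keyword that a relevant job title must contain ("data");
# use '数据' instead if the scraped titles are in Simplified Chinese
b = '資料'
number = 1
li = a['職位']
# Iterate over the surviving row labels: after dropna() they are no longer 0..n-1,
# so range(len(li)) would hit missing labels and never reach the last rows
for i in li.index:
    if b in str(li[i]):
        number += 1
    else:
        a = a.drop(i, axis=0)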
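One thing to watch for in loops like this: after dropna() the row labels are no longer a contiguous 0..n-1 range, so indexing with range(len(...)) can hit labels that no longer exist and never reach the rows at the end. The small example below uses made-up data (the column name title stands in for '職位') to show the effect.

```python
import pandas as pd

df = pd.DataFrame({'title': ['data engineer', None, 'chef', 'data analyst']})
clean = df.dropna(axis=0, how='any')

# The row with label 1 is gone, but the other labels keep their original values
labels = list(clean.index)
print(labels)

# range(len(clean)) would be 0, 1, 2: label 1 is missing and label 3 is never visited.
# Iterating over clean.index visits exactly the rows that still exist.
kept = [i for i in clean.index if 'data' in clean.loc[i, 'title']]
print(kept)
```

Calling `clean.reset_index(drop=True)` right after `dropna()` is another way to get contiguous labels back.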

3. Misaligned fields elsewhere, for example a "hiring N people" (招…人) value showing up in the education column

b2 = '人'  # appears in "hiring N people" values that slipped into the education column
li2 = a['學歷要求']
# Iterate over the surviving row labels rather than range(len(li2))
for i in li2.index:
    if b2 in str(li2[i]):
        a = a.drop(i, axis=0)
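The same two filters can also be written without explicit loops, using boolean masks. This is a minimal sketch on made-up English data; the column names title and education are placeholders for '職位' and '學歷要求', and the keywords stand in for the ones used above.

```python
import pandas as pd

jobs = pd.DataFrame({
    'title': ['data engineer', 'chef', 'data analyst'],
    'education': ['bachelor', 'college', 'hiring 3 people'],
})

# True where the title contains the wanted keyword
is_relevant = jobs['title'].str.contains('data', regex=False)
# True where a "hiring N people" value slipped into the education column
is_misaligned = jobs['education'].str.contains('hiring', regex=False)

# Keep rows that are relevant and not misaligned
filtered = jobs[is_relevant & ~is_misaligned]
print(filtered)
```

Masks avoid dropping rows from a frame while iterating over it, which is where index problems tend to come from.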

4. Convert inconsistent salary units

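# Units to convert: 萬/年 (10,000 yuan per year) and 千/月 (1,000 yuan per month).
# If the scraped text is in Simplified Chinese, these may need to be '万/年' and '千/月'.
b3 = '萬/年'
b4 = '千/月'
li3 = a['薪資']

for i in li3.index:
    try:
        text = li3[i]
        x = re.findall(r'\d*\.?\d+', text)
        if b3 in text:
            # Per year -> per month: divide by 12 and keep two decimals
            min_ = format(float(x[0]) / 12, '.2f')
            max_ = format(float(x[1]) / 12, '.2f')
            # Write back into the DataFrame; a str cannot be modified through li3[i][1]
            a.loc[i, '薪資'] = min_ + '-' + max_ + '萬/月'
        elif b4 in text:
            # Thousands -> ten-thousands: divide by 10
            min_ = format(float(x[0]) / 10, '.2f')
            max_ = format(float(x[1]) / 10, '.2f')
            a.loc[i, '薪資'] = min_ + '-' + max_ + '萬/月'
        print(i, a.loc[i, '薪資'])
    except (IndexError, TypeError, ValueError):
        # Salary is not given as a "low-high" range; leave the row unchanged
        pass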
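The unit conversion can be isolated into a small pure function, which makes it easy to test away from the DataFrame. The sketch below is one possible version. The helper name to_wan_per_month is invented here, and it accepts both the Simplified (万) and Traditional (萬) spellings, since which one the scraped text contains depends on the source.

```python
import re

# Divisor that turns each unit into "ten thousand yuan per month"
DIVISORS = {
    '万/年': 12, '萬/年': 12,   # per year -> per month
    '千/月': 10,                # thousands -> ten-thousands
    '万/月': 1, '萬/月': 1,     # already in the target unit
}

def to_wan_per_month(text):
    """Return (low, high) in units of 10,000 yuan per month, or None if unparseable."""
    numbers = re.findall(r'\d+(?:\.\d+)?', text)
    if len(numbers) < 2:
        return None  # not a "low-high" range
    for unit, divisor in DIVISORS.items():
        if unit in text:
            low, high = float(numbers[0]), float(numbers[1])
            return round(low / divisor, 2), round(high / divisor, 2)
    return None  # unknown unit

print(to_wan_per_month('12-24万/年'))
print(to_wan_per_month('6-8千/月'))
```

Returning numbers instead of a re-formatted string also saves the later visualization step from parsing the salary text a second time.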

After cleaning, save the result to a new Excel file.

a.to_excel('51job2.xlsx', sheet_name='Job', index=False)

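IV. Data visualization

Visualization makes the data more intuitive and easier to analyze; one could even argue that it is the most important part of data mining.

1. Packages needed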

# -*- coding: utf-8 -*-
import pandas as pd
import re
# This import style is the pyecharts 0.5.x API; from 1.0 on the chart classes live in pyecharts.charts
from pyecharts import Funnel, Pie, Geo
import matplotlib.pyplot as plt

2. Open the file

# The cleaning step saved its output as 51job2.xlsx, so read that file name here
file = pd.read_excel(r'51job2.xlsx', sheet_name='Job')
f = pd.DataFrame(file)
pd.set_option('display.max_rows', None)

3. Create several lists to hold salary, work experience, education requirement, and company location separately

add = f['公司地點']
sly = f['薪資']
edu = f['學歷要求']
exp = f['工作經驗']
address = []
salary = []
education = []
experience = []
for i in range(0, len(f)):
    try:
        # Keep only the city part of values written as "city-district"
        a = add[i].split('-')
        address.append(a[0])
        # Extract the low and high ends of the salary range
        s = re.findall(r'\d*\.?\d+', sly[i])
        s1 = float(s[0])
        s2 = float(s[1])
        salary.append([s1, s2])
        education.append(edu[i])
        experience.append(exp[i])
    except Exception:
        # Skip rows whose location or salary cannot be parsed
        pass

4. Experience vs. salary chart and education vs. salary chart

# List of minimum salaries
min_s = []
# List of maximum salaries
max_s = []
for i in range(0, len(experience)):
    min_s.append(salary[i][0])
    max_s.append(salary[i][1])  # index 1 is the upper end of the range

# Relate work experience to salary
my_df = pd.DataFrame({'experience': experience, 'min_salary': min_s, 'max_salary': max_s})
data1 = my_df.groupby('experience')['min_salary'].mean().plot(kind='line')
plt.show()

# Relate education to salary
my_df2 = pd.DataFrame({'education': education, 'min_salary': min_s, 'max_salary': max_s})
data2 = my_df2.groupby('education')['min_salary'].mean().plot(kind='line')
plt.show()
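What groupby followed by mean computes is easier to see on a tiny table. The numbers and category labels below are made up for illustration; the plotting call is left out so that the example only shows the aggregation.

```python
import pandas as pd

df = pd.DataFrame({
    'experience': ['1 year', '1 year', '3 years'],
    'min_salary': [1.0, 2.0, 3.0],
    'max_salary': [2.0, 3.0, 5.0],
})

# One row per experience level, holding the average of each salary column
avg = df.groupby('experience')[['min_salary', 'max_salary']].mean()
print(avg)
```

Each row of `avg` is one point on the line chart: the x value is the group label and the y value is the group's average salary.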

5. Education requirement ring (rose) chart

def get_edu(items):
    # Count how many times each education level occurs
    education2 = {}
    for i in set(items):
        education2[i] = items.count(i)
    return education2

dir1 = get_edu(education)

# Plain lists of labels and their counts
attr = list(dir1.keys())
value = list(dir1.values())
pie = Pie("學歷要求")
pie.add("", attr, value, center=[50, 50], is_random=False, radius=[30, 75], rosetype='radius',
        is_legend_show=False, is_label_show=True, legend_orient='vertical')
pie.render('學歷要求玫瑰圖.html')
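Counting occurrences with list.count inside a loop works, but the standard library's collections.Counter does the same job in a single pass. A possible helper is sketched below; count_values is a name chosen here, and it returns plain lists sorted by frequency, which is the shape the chart calls expect for labels and values.

```python
from collections import Counter

def count_values(values):
    """Return (labels, counts), most frequent first."""
    ordered = Counter(values).most_common()
    labels = [label for label, _ in ordered]
    counts = [count for _, count in ordered]
    return labels, counts

labels, counts = count_values(['本科', '大專', '本科', '碩士', '本科', '大專'])
print(labels, counts)
```

The same helper would serve the location and work experience counts used for the later charts, so the three near-identical counting functions could share one implementation.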

6. Geographic distribution of big-data job demand by city

def get_address(items):
    # Count how many listings each city has
    address2 = {}
    for i in set(items):
        address2[i] = items.count(i)

    # "異地招聘" (recruiting for another location) is not a place on the map;
    # the default of None avoids a KeyError when the key is absent
    address2.pop('異地招聘', None)

    # Other entries that may need removing if the map cannot place them:
    #address2.pop('山東')
    #address2.pop('怒江')
    #address2.pop('池州')

    return address2

dir2 = get_address(address)

geo = Geo("大資料人才需求分佈圖", title_color="#2E2E2E",
          title_text_size=24, title_top=20, title_pos="center", width=1300, height=600)

attr2 = list(dir2.keys())
value2 = list(dir2.values())

geo.add("", attr2, value2, type="effectScatter", is_random=True, visual_range=[0, 1000], maptype='china', symbol_size=8, effect_scale=5, is_visualmap=True)

geo.render('大資料城市需求分佈圖.html')

7. Work experience requirement funnel chart

def get_experience(items):
    # Count how many listings ask for each level of experience
    experience2 = {}
    for i in set(items):
        experience2[i] = items.count(i)
    return experience2

dir3 = get_experience(experience)

attr3 = list(dir3.keys())
value3 = list(dir3.values())
funnel = Funnel("工作經驗漏斗圖", title_pos='center')

funnel.add("", attr3, value3, is_label_show=True, label_pos="inside", label_text_color="#fff", legend_orient='vertical', legend_pos='left')

funnel.render('工作經驗要求漏斗圖.html')

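V. Summary

Because my fundamentals were weak, this scraping project took quite a long time, but the result turned out well. The Excel file and the visual analysis give a clear, intuitive view of what the positions require, which is largely the outcome I wanted. pyecharts offers many more chart types, though, so I still need to keep exploring them and strengthen my professional knowledge.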
