Python: Data Analysis of the Zhaopin Recruitment Website
I. Background of the Topic
Why choose this topic, and what are the expected goals of the data analysis? (10 points)
Big-data positions are advertised in cities all over the country, and the ties between positions and employers are intricate: companies come in many types, each with its own culture and its own requirements for applicants, and pay varies with an applicant's experience. To find a suitable position, an applicant must weigh the basic requirements a recruiter publishes, such as experience and education, and will also examine the nature and type of the company. Below, we analyze the companies posting these openings.
II. Design Plan for the Thematic Web Crawler (10 points)
1. Name of the thematic web crawler
Python: Data Analysis of the Zhaopin Recruitment Website
2. Content to crawl and characteristics of the data
Crawl the job-listing pages of the Zhaopin site and select the relevant fields
Import the data from storage
Clean the data, including handling missing values and other preprocessing
Data analysis and visualization:
3.1 Average salary
3.2 Salary vs. work experience
3.3 Salary vs. education
3.4 Text analysis of job descriptions
3. Overview of the crawler design (approach and technical difficulties)
The fields to crawl are: (1) position title (2) salary (3) city (4) work experience (5) education requirement (6) number of openings (7) position highlights (8) position description (9) company address (10) company name (11) company industry (12) company size (13) brief company description
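As a sketch of the record layout, each posting can be held as one dict and serialized with the standard csv module. The English field names below are illustrative stand-ins for the thirteen fields above, not keys used by the site, and `records_to_csv` is a hypothetical helper:

```python
import csv
import io

# Illustrative schema for one scraped job posting (hypothetical English
# equivalents of the thirteen fields listed above).
FIELDS = [
    "position", "salary", "city", "experience", "education",
    "headcount", "highlights", "description", "address",
    "company_name", "industry", "company_size", "company_intro",
]

def records_to_csv(records):
    """Serialize a list of job-posting dicts to CSV text."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(records)
    return buf.getvalue()

sample = {f: "" for f in FIELDS}
sample.update({"position": "資料分析師", "salary": "1萬-1.5萬", "city": "上海"})
print(records_to_csv([sample]).splitlines()[0])
```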
III. Structural Analysis of the Target Pages (10 points)
1. Structure and features of the pages
The home page and its layout
2. HTML page parsing
3. Node (tag) lookup and traversal methods
(draw the node tree where necessary)
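To illustrate node lookup and traversal, here is a minimal sketch using only the standard library's html.parser; the project itself uses BeautifulSoup, and the `ClassCollector` class and the sample snippet are invented for illustration:

```python
from html.parser import HTMLParser

# Walk the tag tree and collect the text of every tag matching a given
# tag name and class attribute: the same idea as soup.find_all(...).
class ClassCollector(HTMLParser):
    def __init__(self, tag, cls):
        super().__init__()
        self.tag, self.cls = tag, cls
        self.hits = []
        self._grab = False

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs
        if tag == self.tag and dict(attrs).get("class") == self.cls:
            self._grab = True

    def handle_data(self, data):
        if self._grab:
            self.hits.append(data)
            self._grab = False

html = '<ul><li class="city">上海</li><li class="city">北京</li></ul>'
p = ClassCollector("li", "city")
p.feed(html)
print(p.hits)  # → ['上海', '北京']
```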
IV. Web Crawler Programming (60 points)
The crawler program should contain each of the parts below; attach the source code with detailed comments, and provide a screenshot of the output after each part.
- Data crawling and collection
#!/usr/bin/python3
# -*- coding: utf-8 -*-
"""
Created on Fri Aug 14 17:47:47 2020: 2021/3/30 上午1:13
@Author : liudong
@Software: PyCharm
"""

import requests
import re
from copyheaders import headers_raw_to_dict
from bs4 import BeautifulSoup
import pandas as pd


# Fetch the page HTML for a given url and query parameters:

def get_html(url, params):

    my_headers = b'''
accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9
accept-language: zh-CN,zh;q=0.9
cache-control: max-age=0
cookie: x-zp-client-id=448f2b96-6b3a-48e3-e912-e6c8dd73e6cb; adfbid=0; adfbid2=0; Hm_lvt_38ba284938d5eddca645bb5e02a02006=1617108464; sajssdk_2015_cross_new_user=1; sts_deviceid=178832cf3f2680-0b20242883a4a9-6618207c-1296000-178832cf3f3780; sts_sg=1; sts_chnlsid=Unknown; zp_src_url=https%3A%2F%2Fwww.google.com.hk%2F; FSSBBIl1UgzbN7N443S=kc8_mcJe5xsW.UilCMHXpkoWeyQ8te3q7QhYV8Y8aA0Se9k9JJXcnQVvrOJ9NYDP; locationInfo_search={%22code%22:%22538%22%2C%22name%22:%22%E4%B8%8A%E6%B5%B7%22%2C%22message%22:%22%E5%8C%B9%E9%85%8D%E5%88%B0%E5%B8%82%E7%BA%A7%E7%BC%96%E7%A0%81%22}; zp_passport_deepknow_sessionId=a2ea7206sade7641768f38078ea6b45afef0; at=02a0ea392e1d4fd6a4d6003ac136aae0; rt=82f98e13344843d6b5bf3dadf38e8bb2; sensorsdata2015jssdkcross=%7B%22distinct_id%22%3A%221071739258%22%2C%22first_id%22%3A%22178832cf3bd20f-0be4af1633ae3d-6618207c-1296000-178832cf3be4b8%22%2C%22props%22%3A%7B%22%24latest_traffic_source_type%22%3A%22%E7%9B%B4%E6%8E%A5%E6%B5%81%E9%87%8F%22%2C%22%24latest_search_keyword%22%3A%22%E6%9C%AA%E5%8F%96%E5%88%B0%E5%80%BC_%E7%9B%B4%E6%8E%A5%E6%89%93%E5%BC%80%22%2C%22%24latest_referrer%22%3A%22%22%7D%2C%22%24device_id%22%3A%22178832cf3bd20f-0be4af1633ae3d-6618207c-1296000-178832cf3be4b8%22%7D; urlfrom=121126445; urlfrom2=121126445; adfcid=none; adfcid2=none; ZL_REPORT_GLOBAL={%22//www%22:{%22seid%22:%2202a0ea392e1d4fd6a4d6003ac136aae0%22%2C%22actionid%22:%2243ffc74e-c32e-42ee-ba04-1e24611fecde-cityPage%22}}; LastCity=%E4%B8%8A%E6%B5%B7; LastCity%5Fid=538; Hm_lpvt_38ba284938d5eddca645bb5e02a02006=1617111259; zpfe_probe_token=ae612f12s0feb44ac697a7434fe1f22af086; d4d6cd0b4a19fa72b8cc377185129bb7=ab637759-b57a-4214-a915-8dcbc5630065; selectCity_search=538; FSSBBIl1UgzbN7N443T=5pRoIYmxrZTzxVozDFEYjcClKKRpXbK9zf0gYH4zU5AyLqGUMT5fnVzyE0SMv7ZDGFLY0HV8o6iXLPBGBBTJhDhz3TIaQ3omm324Q2m4BSJzD0VgZzesPGIXudf636xQZkuag1QJmdqzgFLv6YPcKq.ukZPymp1IazfsOec5vBcMT9yemSrYb9UBk2XF.rZIeM3mIOBqpNii26kDRzjxHP5TsGLJzWaaZvklHnh61NT4acHPQt3Lq1.w2X4htg9ck.uGhzHt9w954igFEqhLCmggLi9OjPUaiU8TA4yn1oR1T5Qmjm1I5AA0PIu76e0T2u6w2f7thMkv6E7lkoDggrRMta0Z_uVEP3Y1sS8hJw7ycE2PTVtVassRyoN6UuTBHtSZ
sec-ch-ua: "Google Chrome";v="89", "Chromium";v="89", ";Not A Brand";v="99"
sec-ch-ua-mobile: ?0
sec-fetch-dest: document
sec-fetch-mode: navigate
sec-fetch-site: same-origin
sec-fetch-user: ?1
upgrade-insecure-requests: 1
user-agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 11_2_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/89.0.4389.90 Safari/537.36
'''
    my_headers = headers_raw_to_dict(my_headers)  # convert the raw headers copied from the browser into a dict
    req = requests.get(url, headers=my_headers, params=params)
    req.encoding = req.apparent_encoding
    html = req.text

    return html


# Given the url and a city code, return a list of strings, one per job posting's HTML tag:

def get_html_list(url, city_num):

    html_list = list()

    for i in range(1, 12):
        params = {'jl': str(city_num), 'kw': '資料分析師', 'p': str(i)}
        html = get_html(url, params)
        soup = BeautifulSoup(html, 'html.parser')
        html_list += soup.find_all(name='a', attrs={'class': 'joblist-box__iteminfo iteminfo'})

    for i in range(len(html_list)):
        html_list[i] = str(html_list[i])

    return html_list


# Extract the useful fields from each tag string in the list above and return rows for a CSV file:

def get_csv(html_list):

    # city = position = company_name = company_size = company_type = salary = education = ability = experience = evaluation = list()
    # The assignment above would be WRONG here: every name would point at the same list object,
    # so appending to one of the lists would change all of them at once.

    # table = pd.DataFrame(columns = ['城市','職位名稱','公司名稱','公司規模','公司型別','薪資','學歷要求','技能要求','工作經驗要求'])
    city, position, company_name, company_size, company_type, salary, education, ability, experience = ([] for i in range(9))  # nine independent lists in one assignment

    demand_pat = (r'<li class="iteminfo__line2__jobdesc__demand__item">(.*?)</li> '
                  r'<li class="iteminfo__line2__jobdesc__demand__item">(.*?)</li> '
                  r'<li class="iteminfo__line2__jobdesc__demand__item">(.*?)</li>')
    compdesc_pat = (r'<span class="iteminfo__line2__compdesc__item">(.*?) </span> '
                    r'<span class="iteminfo__line2__compdesc__item">(.*?) </span>')

    for item in html_list:

        # city / experience / education sit in three sibling <li> tags
        m = re.search(demand_pat, item)
        if m:
            city.append(m.group(1))
            experience.append(m.group(2))
            education.append(m.group(3))
        else:
            city.append(' ')
            experience.append(' ')
            education.append(' ')

        m = re.search(r'<span class="iteminfo__line1__jobname__name" title="(.*?)">', item)
        position.append(m.group(1) if m else ' ')

        m = re.search(r'<span class="iteminfo__line1__compname__name" title="(.*?)">', item)
        company_name.append(m.group(1) if m else ' ')

        m = re.search(compdesc_pat, item)
        if m:
            company_type.append(m.group(1))
            company_size.append(m.group(2))
        else:
            company_type.append(' ')
            company_size.append(' ')

        m = re.search(r'<p class="iteminfo__line2__jobdesc__salary">([\s\S]*?)<', item)
        salary.append(m.group(1).strip() if m else ' ')

        # concatenate all the welfare/skill keywords into one space-separated string
        s = str()
        for tag in re.findall(r'<div class="iteminfo__line3__welfare__item">(.*?)</div>', item):
            s = s + tag + ' '
        ability.append(s)

    table = list(zip(city, position, company_name, company_size, company_type, salary, education, ability, experience))

    return table


if __name__ == '__main__':

    url = 'https://sou.zhaopin.com/'
    citys = {'上海': 538, '北京': 530, '廣州': 763, '深圳': 765, '天津': 531, '武漢': 736, '西安': 854,
             '成都': 801, '南京': 635, '杭州': 653, '重慶': 551, '廈門': 682}
    for i in citys.keys():
        html_list = get_html_list(url, citys[i])
        table = get_csv(html_list)
        df = pd.DataFrame(table, columns=['city', 'position', 'company_name', 'company_size', 'company_type',
                                          'salary', 'education', 'ability', 'experience'])
        file_name = i + '.csv'
        df.to_csv(file_name)
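The extraction above hinges on one regular expression with three capture groups for the sibling `<li>` tags. A reduced, self-contained demonstration on a synthetic snippet (the class names are the real ones from the code; the values are made up), which also shows that running `re.search` once and reusing the match object is cheaper than repeating the search for each group:

```python
import re

# Three identical <li> classes in a row: group 1 is the city,
# group 2 the experience requirement, group 3 the education requirement.
DEMAND = (
    '<li class="iteminfo__line2__jobdesc__demand__item">(.*?)</li> '
    '<li class="iteminfo__line2__jobdesc__demand__item">(.*?)</li> '
    '<li class="iteminfo__line2__jobdesc__demand__item">(.*?)</li>'
)

snippet = (
    '<li class="iteminfo__line2__jobdesc__demand__item">上海</li> '
    '<li class="iteminfo__line2__jobdesc__demand__item">1-3年</li> '
    '<li class="iteminfo__line2__jobdesc__demand__item">本科</li>'
)

m = re.search(DEMAND, snippet)  # search once, read all three groups
print(m.group(1), m.group(2), m.group(3))  # → 上海 1-3年 本科
```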
- Data cleaning and processing
#!/usr/bin/python3
# -*- coding: utf-8 -*-
"""
Created on Fri Aug 14 17:47:47 2020: 2021/4/2 上午1:30
@Author : liudong
@Software: PyCharm
"""

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
plt.rcParams['font.sans-serif'] = ['Heiti TC']  # default font that can render the Chinese labels in the plots
plt.rcParams['axes.unicode_minus'] = False      # keep the minus sign from rendering as a box in saved figures
import re
import os
import seaborn as sns
from wordcloud import WordCloud


citys = ['上海', '北京', '廣州', '深圳', '天津', '武漢', '西安', '成都', '南京', '杭州', '重慶', '廈門']


# Data cleaning: turn salary ranges like "7千-1.2萬" into their numeric midpoint in yuan

def data_clear():

    for city in citys:

        file_name = './' + city + '.csv'
        df = pd.read_csv(file_name, index_col=0)

        for i in range(0, df.shape[0]):

            s = df.loc[[i], ['salary']].values.tolist()[0][0]

            m = re.search('(.*)-(.*)', s)
            if m:
                a = m.group(1)
                if a[-1] == '千':
                    a = float(a[:-1]) * 1000
                elif a[-1] == '萬':
                    a = float(a[:-1]) * 10000
                b = m.group(2)
                if b[-1] == '千':
                    b = float(b[:-1]) * 1000
                elif b[-1] == '萬':
                    b = float(b[:-1]) * 10000
                s = (a + b) / 2
                df.loc[[i], ['salary']] = s
            else:
                df.loc[[i], ['salary']] = ''

        os.remove(file_name)
        df.to_csv(file_name)
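The salary normalization in data_clear() can be isolated into a small helper for testing. A sketch assuming the site's "low-high" format with 千/萬 unit suffixes; `parse_salary` is a hypothetical name, not part of the script above:

```python
import re

def parse_salary(s):
    """Return the midpoint in yuan of a range like '7千-1.2萬', or None."""
    m = re.search(r'(.*)-(.*)', s)
    if not m:
        return None  # e.g. '面議' (negotiable) carries no number

    def to_yuan(part):
        # 千 = thousand, 萬 = ten thousand; no suffix means plain yuan
        unit = {'千': 1000, '萬': 10000}.get(part[-1])
        return float(part[:-1]) * unit if unit else float(part)

    lo, hi = to_yuan(m.group(1)), to_yuan(m.group(2))
    return (lo + hi) / 2

print(parse_salary('7千-1.2萬'))  # → 9500.0
```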
1. Data analysis and visualization (e.g. bar chart, histogram, scatter plot, box plot, distribution plot)
# Bar chart of the number of data-analysis positions in each city:

def citys_jobs():

    job_num = list()
    for i in citys:
        file_name = './' + i + '.csv'
        df = pd.read_csv(file_name, index_col=0)
        job_num.append(df.shape[0])
    df = pd.DataFrame(list(zip(citys, job_num)))
    df = df.sort_values(1, ascending=False)
    x = list(df[0])
    y = list(df[1])

    fig = plt.figure(dpi=200)
    ax = fig.add_axes([0.1, 0.1, 0.8, 0.8])
    ax.bar(x, y, alpha=0.8)
    ax.set_title('資料分析職位在全國主要城市的數量分佈')
    ax.set_ylim(0, 350)

    plt.savefig('./資料分析職位在全國主要城市的數量分佈.jpg')
    plt.show()
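The chart above sorts the (city, count) pairs by count before plotting. The same ordering step can be checked without pandas; the counts here are made-up illustrative numbers:

```python
citys = ['上海', '北京', '廣州']
job_num = [180, 310, 295]

# Sort city/count pairs by count, descending, as the bar chart does
pairs = sorted(zip(citys, job_num), key=lambda p: p[1], reverse=True)
x = [p[0] for p in pairs]
y = [p[1] for p in pairs]
print(x, y)  # → ['北京', '廣州', '上海'] [310, 295, 180]
```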
5. Based on the relationships in the data, compute the correlation coefficient between two variables, draw a scatter plot, and build a regression equation (simple or multivariate) between them.
1. Combine the code from each part above into a single complete program listing.
#!/usr/bin/python3
# -*- coding: utf-8 -*-
"""
Created on Fri Aug 14 17:47:47 2020: 2021/4/2 上午1:30
@Author : liudong
@Software: PyCharm
"""

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
plt.rcParams['font.sans-serif'] = ['Heiti TC']  # default font that can render the Chinese labels in the plots
plt.rcParams['axes.unicode_minus'] = False      # keep the minus sign from rendering as a box in saved figures
import re
import os
import seaborn as sns
from wordcloud import WordCloud


citys = ['上海', '北京', '廣州', '深圳', '天津', '武漢', '西安', '成都', '南京', '杭州', '重慶', '廈門']


# Data cleaning: turn salary ranges like "7千-1.2萬" into their numeric midpoint in yuan

def data_clear():

    for city in citys:

        file_name = './' + city + '.csv'
        df = pd.read_csv(file_name, index_col=0)

        for i in range(0, df.shape[0]):

            s = df.loc[[i], ['salary']].values.tolist()[0][0]

            m = re.search('(.*)-(.*)', s)
            if m:
                a = m.group(1)
                if a[-1] == '千':
                    a = float(a[:-1]) * 1000
                elif a[-1] == '萬':
                    a = float(a[:-1]) * 10000
                b = m.group(2)
                if b[-1] == '千':
                    b = float(b[:-1]) * 1000
                elif b[-1] == '萬':
                    b = float(b[:-1]) * 10000
                s = (a + b) / 2
                df.loc[[i], ['salary']] = s
            else:
                df.loc[[i], ['salary']] = ''

        os.remove(file_name)
        df.to_csv(file_name)


# Bar chart of the number of data-analysis positions in each city:

def citys_jobs():

    job_num = list()
    for i in citys:
        file_name = './' + i + '.csv'
        df = pd.read_csv(file_name, index_col=0)
        job_num.append(df.shape[0])
    df = pd.DataFrame(list(zip(citys, job_num)))
    df = df.sort_values(1, ascending=False)
    x = list(df[0])
    y = list(df[1])

    fig = plt.figure(dpi=200)
    ax = fig.add_axes([0.1, 0.1, 0.8, 0.8])
    ax.bar(x, y, alpha=0.8)
    ax.set_title('資料分析職位在全國主要城市的數量分佈')
    ax.set_ylim(0, 350)

    plt.savefig('./資料分析職位在全國主要城市的數量分佈.jpg')
    plt.show()


# Bar chart of average salary by city:

def citys_salary():

    y = list()
    x = citys

    for i in citys:
        file_name = './' + i + '.csv'
        df = pd.read_csv(file_name, index_col=0)
        y0 = df['salary'].mean()
        y.append(round(y0 / 1000, 1))

    df = pd.DataFrame(list(zip(x, y)))
    df = df.sort_values(1, ascending=False)
    x = list(df[0])
    y = list(df[1])

    fig = plt.figure(dpi=200)
    ax = fig.add_axes([0.1, 0.1, 0.8, 0.8])
    ax.bar(x, y, alpha=0.8)
    ax.set_title('資料分析職位在一些主要城市的薪資分佈(單位:千)')
    ax.set_ylim(5, 18)
    for a, b, label in zip(x, y, y):  # zip() pairs each bar's x position and height with its value label
        plt.text(a, b, label, horizontalalignment='center', fontsize=10)  # draw the label at position (a, b)

    plt.savefig('./資料分析職位在一些主要城市的薪資分佈.jpg')
    plt.show()


# Overall salary distribution across all collected data-analysis positions

def salary_distribute():

    salary_list = list()
    for i in citys:
        file_name = './' + i + '.csv'
        df = pd.read_csv(file_name, index_col=0)
        salary_list += list(df['salary'])
    salarys = list()
    for i in range(len(salary_list)):
        if not pd.isnull(salary_list[i]):  # missing values read back through pandas are NaN; '== np.nan' cannot detect them, so use pd.isnull()
            salarys.append(round(salary_list[i] / 1000, 1))
    mean = np.mean(salarys)

    plt.figure(dpi=200)
    sns.distplot(salarys, hist=True, kde=True, kde_kws={"color": "r", "lw": 1.5, 'linestyle': '-'})
    plt.axvline(mean, color='r', linestyle=":")
    plt.text(mean, 0.01, '平均薪資: %.1f千' % (mean), color='r', horizontalalignment='center', fontsize=15)
    plt.xlim(0, 50)
    plt.xlabel('薪資分佈(單位:千)')
    plt.title('資料分析職位整體薪資分佈')
    plt.savefig('./資料分析職位整體薪資分佈.jpg')
    plt.show()


# Pie chart of education requirements for data-analysis positions

def education_distribute():

    table = pd.DataFrame()
    for i in citys:
        file_name = './' + i + '.csv'
        df = pd.read_csv(file_name, index_col=0)
        table = pd.concat([table, df])
    table = pd.DataFrame(pd.value_counts(table['education']))
    table = table.sort_values(['education'], ascending=False)
    x = list(table.index)
    y = list(table['education'])
    print(x)

    fig = plt.figure(dpi=200)
    ax = fig.add_axes([0.1, 0.1, 0.8, 0.8])
    explode = (0, 0, 0, 0.2, 0.4, 0.6, 0.8)
    ax.axis('equal')
    ax.pie(y, labels=x, autopct='%.1f%%', explode=explode)  # autopct formats each wedge's percentage; '%%' is a literal percent sign
    ax.set_title('資料分析職位對學歷要求的佔比')
    ax.legend(x, loc=1)
    plt.savefig('./資料分析職位對學歷要求的佔比.jpg')
    plt.show()


# Skill keyword frequency statistics

def wordfrequence():

    table = pd.DataFrame()
    for i in citys:
        file_name = './' + i + '.csv'
        df = pd.read_csv(file_name, index_col=0)
        table = pd.concat([table, df])
    l1 = list(table['ability'])
    l2 = list()
    for i in range(len(l1)):
        if not pd.isnull(l1[i]):
            l2.append(l1[i])
    words = ''.join(l2)

    cloud = WordCloud(
        font_path='/System/Library/Fonts/STHeiti Light.ttc',  # font file that supports Chinese; the default font does not
        background_color='white',   # background color; the default is black
        max_words=20,               # maximum number of words shown in the cloud
        random_state=1,             # fix the random state so the color layout is reproducible
        collocations=False,         # drop word pairs, which would otherwise produce near-duplicate entries
        width=1200, height=900      # output size; the default image is small and blurry
    ).generate(words)
    plt.figure(dpi=200)
    plt.imshow(cloud)   # draw the generated word-cloud image on the figure
    plt.axis('off')     # no axes on a word-cloud image
    plt.savefig("./技能關鍵詞頻統計.jpg")
    plt.show()


if __name__ == "__main__":

    data_clear()
    citys_jobs()
    citys_salary()
    salary_distribute()
    education_distribute()
    wordfrequence()
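Before rendering the word cloud, the keyword frequencies behind it can be sanity-checked with collections.Counter. A rough sketch on made-up ability strings (WordCloud does its own tokenization; this simply splits on whitespace):

```python
from collections import Counter

# Sample contents of the 'ability' column; the values are illustrative
ability_col = ['SQL Python 資料視覺化 ', 'SQL Excel ', 'Python SQL ']

# Join all rows and count individual keywords
words = ' '.join(ability_col).split()
top = Counter(words).most_common(2)
print(top)  # → [('SQL', 3), ('Python', 2)]
```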
V. Conclusions (10 points)
1. What conclusions can be drawn from the analysis and visualization of the data? Were the expected goals met?
Conclusions:
The data above show that the longer a data analyst's experience, the higher the salary.
Taken together, the data give a basic entry profile for data-analyst positions:
The usual education threshold is an associate or bachelor degree; 1-5 years of experience offer the most room for growth, while 5-10 years tend to be a plateau.
Positions concentrate in first- and second-tier cities such as Beijing, Guangzhou, Shenzhen, Wuhan, Hefei, Tianjin, Chongqing, Zhengzhou, Shenyang, Xi'an, Chengdu, Xiamen, Shanghai, Nanjing and Jinan.
Private and joint-stock companies account for most of the postings.
Employers are mostly small and mid-to-large internet companies.
Salary outlook by experience:
Associate or bachelor degree with 3-5 years of experience: average salary above 12k.
Associate or bachelor degree with 1-3 years of experience: average salary around 7k-10k.
Associate or bachelor degree with under 1 year of experience: average salary below 6k.
The choice of city also matters. Combining Chart 5, the table of job counts and average salary by city, Hangzhou, Guangzhou and Shenzhen offer both high pay and many openings, while Beijing, Xiamen, Shanghai and Jinan are among the most attractive cities for employment.