Scraping Job-Site Data with Python and Visualizing It

阿新 • Published: 2020-12-15

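The text and images in this article come from the internet and are for study and discussion only, with no commercial use. If there is any problem, please contact us promptly so that we can handle it.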

The following article comes from 青燈程式設計; author: 清風


Development environment

Python 3.6
PyCharm

Modules used

Scraper modules

import requests
import re
import parsel
import csv

Word-cloud modules

import jieba
import wordcloud

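Analyzing the target page

The browser's developer tools show that the data returned by the search page sits inside window.__SEARCH_RESULT__, so it can be extracted with a regular expression.

A detail-page URL looks like this: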

'https://jobs.51job.com/beijing-ftq/127676506.html?s=01&t=0'

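Every posting's detail page has its own ID. It is enough to extract the ID values with a regular expression, build each detail-page URL from them, and then scrape the posting data from the detail pages.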

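# url (the search-results page) and headers are assumed to be defined earlier
response = requests.get(url=url, headers=headers)
lis = re.findall(r'"jobid":"(\d+)"', response.text)   # raw string so \d is a regex escape
for li in lis:
    page_url = 'https://jobs.51job.com/beijing-hdq/{}.html?s=01&t=0'.format(li)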
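As an alternative to matching the IDs directly, the window.__SEARCH_RESULT__ payload described above can be pulled out and parsed as JSON. The following is only a sketch: the sample HTML is made up, and the list key engine_search_result and the field job_name are assumptions about the payload layout rather than something the article confirms.

```python
import json
import re

# Hypothetical page fragment; the real payload layout is an assumption.
SAMPLE_HTML = '''<script type="text/javascript">
window.__SEARCH_RESULT__ = {"engine_search_result": [{"jobid": "127676506", "job_name": "Python developer"}, {"jobid": "127676507", "job_name": "Data analyst"}]}</script>'''


def extract_search_result(html):
    """Return the object assigned to window.__SEARCH_RESULT__, or None if absent."""
    match = re.search(
        r'window\.__SEARCH_RESULT__\s*=\s*(\{.*?\})\s*</script>', html, re.S
    )
    if match is None:
        return None
    return json.loads(match.group(1))


def extract_job_ids(payload, list_key="engine_search_result"):
    """Collect the jobid of every entry under list_key (an assumed key name)."""
    return [item["jobid"] for item in payload.get(list_key, []) if "jobid" in item]


payload = extract_search_result(SAMPLE_HTML)
job_ids = extract_job_ids(payload)
print(job_ids)
```

Parsing the payload keeps the other fields of each posting available, whereas the single regular expression only returns the IDs.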

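The site serves static pages, but the text comes back garbled under the default decoding, so the response has to be re-decoded during scraping.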

f = open('招聘.csv', mode='a', encoding='utf-8', newline='')
csv_writer = csv.DictWriter(f, fieldnames=['標題', '地區', '工作經驗', '學歷', '薪資', '福利', '招聘人數', '釋出日期'])
csv_writer.writeheader()
response = requests.get(url=page_url, headers=headers)
response.encoding = response.apparent_encoding
selector = parsel.Selector(response.text)
title = selector.css('.cn h1::text').get()      # title
salary = selector.css('div.cn strong::text').get()       # salary
welfare = selector.css('.jtag div.t1 span::text').getall()       # benefits
welfare_info = '|'.join(welfare)
data_info = selector.css('.cn p.msg.ltype::attr(title)').get().split('  |  ')
area = data_info[0]         # area
work_experience = data_info[1]      # work experience
educational_background = data_info[2]       # education
number_of_people = data_info[3]     # headcount
release_date = data_info[-1].replace('釋出', '')     # release date
all_info_list = selector.css('div.tCompany_main > div:nth-child(1) > div p span::text').getall()
all_info = '\n'.join(all_info_list)
dit = {
    '標題': title,
    '地區': area,
    '工作經驗': work_experience,
    '學歷': educational_background,
    '薪資': salary,
    '福利': welfare_info,
    '招聘人數': number_of_people,
    '釋出日期': release_date,
}
csv_writer.writerow(dit)
with open('招聘資訊.txt', mode='a', encoding='utf-8') as txt_file:  # separate name so the CSV handle f is not shadowed
    txt_file.write(all_info)
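The line response.encoding = response.apparent_encoding in the code above is what removes the garbling. requests falls back to ISO-8859-1 for text responses whose Content-Type header names no charset, while apparent_encoding is a guess made from the bytes themselves. The effect can be illustrated offline; GBK is used here only as an assumed page encoding, and the sample text is made up.

```python
# Offline illustration; GBK is an assumed page encoding, not confirmed by the article.
text = '招聘信息'
raw = text.encode('gbk')              # bytes as a GBK page would send them

garbled = raw.decode('iso-8859-1')    # wrong codec: decodes without error but is unreadable
fixed = raw.decode('gbk')             # right codec: the original text is restored

print(repr(garbled))
print(fixed)
```

Decoding with the wrong codec raises no exception, which is why the problem only shows up as unreadable text in the scraped output.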

These steps complete the scraping of the job-posting data.

Quick and rough data cleaning

Salary

import pandas as pd

content = pd.read_csv(r'D:\python\demo\資料分析\招聘\招聘.csv', encoding='utf-8')
salary = content['薪資']
salary_1 = salary[salary.notnull()]       # drop rows with no salary
salary_count = salary_1.value_counts()    # frequency of each salary string
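Counting raw salary strings treats every distinct range as its own category. If the ranges need to be compared, they can first be normalized to a common unit. The sketch below is one possible approach under loud assumptions: the formats it recognizes (such as 1-1.5万/月 or 20-30万/年) are guesses about how the site writes salaries, and parse_salary is a hypothetical helper, not part of the original code.

```python
import re

# Assumed unit and period markers; the site's real formats may differ.
UNIT_FACTOR = {'千': 1_000, '万': 10_000}
PERIOD_MONTHS = {'月': 1, '年': 12}


def parse_salary(text):
    """Return (low, high) monthly salary in yuan, or None if the format is not recognized."""
    match = re.fullmatch(
        r'(\d+(?:\.\d+)?)-(\d+(?:\.\d+)?)(千|万)/(月|年)', text.strip()
    )
    if match is None:
        return None
    low, high = float(match.group(1)), float(match.group(2))
    unit = UNIT_FACTOR[match.group(3)]
    months = PERIOD_MONTHS[match.group(4)]
    return round(low * unit / months), round(high * unit / months)


for sample in ['1-1.5万/月', '8-10千/月', '20-30万/年', '面议']:
    print(sample, parse_salary(sample))
```

Strings that do not match return None, so they can be dropped with the same notnull() filter used above.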

Education requirements

import pandas as pd
from pyecharts.charts import Bar

content = pd.read_csv(r'D:\python\demo\資料分析\招聘\招聘.csv', encoding='utf-8')
educational_background = content['學歷']
educational_background_1 = educational_background[educational_background.notnull()]
educational_background_count = educational_background_1.value_counts().head()  # five most common values
print(educational_background_count)
bar = Bar()
bar.add_xaxis(educational_background_count.index.tolist())
bar.add_yaxis("學歷", educational_background_count.values.tolist())
bar.render('bar.html')

Values that show a headcount here correspond to postings with no education requirement.
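Headcounts end up in the education column because the scraper indexes the split fields by position, so a posting that omits the education field shifts everything after it by one. A possible remedy is to classify each field by its content. This is a sketch only: the sample strings, the list of education labels, and the helper parse_msg are all assumptions, and the samples use Simplified Chinese as the site would presumably serve it.

```python
import re

# Assumed education labels; the site's real vocabulary may differ.
EDUCATION_LEVELS = ('初中', '高中', '中专', '中技', '大专', '本科', '硕士', '博士')


def parse_msg(title):
    """Classify each '|'-separated field by what it looks like, not by position."""
    parts = [p.strip() for p in title.split('|') if p.strip()]
    record = {'area': None, 'experience': None, 'education': None,
              'headcount': None, 'date': None}
    if parts:
        record['area'] = parts[0]          # the first field is taken to be the area
    for part in parts[1:]:
        if '经验' in part:
            record['experience'] = part
        elif part in EDUCATION_LEVELS:
            record['education'] = part
        elif re.fullmatch(r'招.+人', part):
            record['headcount'] = part
        elif part.endswith('发布'):
            record['date'] = part[:-2]     # drop the trailing "发布"
    return record


full = parse_msg('北京-海淀区  |  3-4年经验  |  本科  |  招2人  |  12-15发布')
missing = parse_msg('北京  |  无需经验  |  招若干人  |  12-15发布')
print(full)
print(missing)
```

With this approach a missing education field stays empty instead of being filled with the headcount.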

Work experience

import pandas as pd
from pyecharts.charts import Bar

content = pd.read_csv(r'D:\python\demo\資料分析\招聘\招聘.csv', encoding='utf-8')
work_experience = content['工作經驗']
work_experience_count = work_experience.value_counts()   # frequency of each experience level
print(work_experience_count)
bar = Bar()
bar.add_xaxis(work_experience_count.index.tolist())
bar.add_yaxis("經驗要求", work_experience_count.values.tolist())
bar.render('bar.html')

Word-cloud analysis of the required skills

import imageio

py = imageio.imread("python.png")    # mask image for the cloud shape
f = open('python招聘資訊.txt', encoding='utf-8')

re_txt = f.read()
result = re.findall(r'[a-zA-Z]+', re_txt)   # keep only runs of Latin letters
txt = ' '.join(result)

# Segment the text into words with jieba
txt_list = jieba.lcut(txt)
string = ' '.join(txt_list)
# Word-cloud settings
wc = wordcloud.WordCloud(
        width=1000,         # image width
        height=700,         # image height
        background_color='white',   # background color
        font_path='msyh.ttc',    # font used for the words
        mask=py,     # mask image
        scale=15,
        stopwords={' '},
        # contour_width=5,
        # contour_color='red'  # contour color
)
# Feed the text to the word cloud
wc.generate(string)
# Save the word-cloud image
wc.to_file(r'python招聘資訊.png')
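The regular expression in the code above keeps only runs of Latin letters, which in a job description are mostly technology names. Counting them gives the numbers behind the cloud. The sketch below uses a made-up sample sentence, and lower-casing the words before counting is an added assumption so that Python and python are counted together.

```python
import re
from collections import Counter

# Made-up sample of a job description.
sample_text = "熟悉Python，掌握Django或Flask框架；熟悉MySQL、Redis；有python爬虫经验，了解Linux"


def keyword_counts(text):
    """Count Latin-letter words, ignoring case."""
    words = re.findall(r'[a-zA-Z]+', text)
    return Counter(word.lower() for word in words)


counts = keyword_counts(sample_text)
print(counts.most_common(3))
```

The resulting counts could also be passed to a bar chart in the same way as the education and experience counts.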

Summary:

The data analysis here is really rough, honestly a bit of an eyesore.
