
Ruby crawler: scraping Lagou job postings and generating a word cloud report


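Approach:

1. Get the number of result pages for the position search on Lagou.
2. Call the API to get the position ids.
3. Visit each position page by id and match the keywords.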

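URL requests go through unirest. Lagou has anti-crawler measures and restricts clients that make frequent requests within a short period, so the crawler does not use multiple threads and waits 10 s between page visits. Pages are parsed with nokogiri, and a regular expression keeps only the English words found in the skill requirements, so the data may not be fully accurate.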
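A minimal sketch of the regex step described above, assuming the job description is already available as a plain string. The helper name `extract_keywords` and the sample strings are illustrative and not part of the original script. The second call shows one way the results can be inaccurate: punctuation inside a technology name splits or truncates it.

```ruby
# Hypothetical helper: keep only runs of ASCII letters, as the crawler's regex does.
def extract_keywords(text)
  text.scan(/[a-zA-Z]+/)
end

# Illustrative job description mixing Chinese text and English skill names.
sample = '熟悉Python或Java,了解Linux、MySQL,有Selenium經驗'
keywords = extract_keywords(sample)
# keywords => ["Python", "Java", "Linux", "MySQL", "Selenium"]

# Names containing symbols are not matched as a whole.
lossy = extract_keywords('C++ / Node.js')
# lossy => ["C", "Node", "js"]
```

Single-letter matches such as "C" are later dropped by the length filter in the word-count step, which is another source of missing keywords.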

The data is persisted to Excel, and the word_cloud report is generated with Ruby ERB.
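The post does not include the ERB template itself, so the following is only a sketch of how such a report could be rendered. The template, the `render_report` helper, and the font-size formula are all assumptions made for illustration; the input is a JSON array of `name`/`value` pairs, the shape the crawler's `parse` method produces.

```ruby
require 'erb'
require 'json'

# Hypothetical template (not from the original post): each word becomes a
# <span> whose font size grows with its frequency.
TEMPLATE = <<~HTML
  <html>
  <body>
  <h1>word_cloud</h1>
  <% words.each do |w| %>
  <span style="font-size: <%= 12 + w['value'] * 4 %>px"><%= w['name'] %></span>
  <% end %>
  </body>
  </html>
HTML

# Render the report from a JSON string of {"name" => ..., "value" => ...} items.
def render_report(json)
  words = JSON.parse(json)
  ERB.new(TEMPLATE).result_with_hash(words: words)
end

# Illustrative data only, not real crawl results.
sample = JSON.generate([{ 'name' => 'python', 'value' => 5 },
                        { 'name' => 'linux', 'value' => 2 }])
html = render_report(sample)
```

Writing the result with `File.write('word_cloud.html', html)` would give a file that opens in a browser. `result_with_hash` requires Ruby 2.5 or later.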

Crawler code:

require 'unirest'
require 'uri'
require 'nokogiri'
require 'json'
require 'win32ole'

# Search keyword and city, sent to Lagou as-is
@position = '測試開發工程師'
@city = '杭州'

# Page access
def query_url(method, url, headers: {}, parameters: nil)
  case method
  when :get
    Unirest.get(url, headers: headers).body
  when :post
    Unirest.post(url, headers: headers, parameters: parameters).body
  end
end

# Get the number of pages
def get_page_num(url)
  html = query_url(:get, url).force_encoding('utf-8')
  html.scan(/<span class="span totalNum">(\d+)<\/span>/).first.first
end

# Get the ids of all positions shown on each page
def get_positionsId(url, headers: {}, parameters: nil)
  response = query_url(:post, url, headers: headers, parameters: parameters)
  positions_id = Array.new
  response['content']['positionResult']['result'].each { |i| positions_id << i['positionId'] }
  positions_id
end

# Match the English keywords of a position
def get_skills(url)
  puts "loading url: #{url}"
  html = query_url(:get, url).force_encoding('utf-8')
  doc = Nokogiri::HTML(html)
  data = doc.css('dd.job_bt')
  data.text.scan(/[a-zA-Z]+/)
end

# Count word frequency
def word_count(arr)
  arr.map!(&:downcase)
  arr.select! { |i| i.length > 1 }
  counter = Hash.new(0)
  arr.each { |k| counter[k] += 1 }
  # Filter out words that appear only once
  counter.select! { |_, v| v > 1 }
  counter2 = counter.sort_by { |_, v| -v }.to_h
  counter2
end

# Convert the counts to a JSON array of name/value pairs
def parse(hash)
  data = Array.new
  hash.each do |k, v|
    word = Hash.new
    word['name'] = k
    word['value'] = v
    data << word
  end
  JSON.generate(data)
end

# Persist the data
def save_excel(hash)
  excel = WIN32OLE.new('Excel.Application')
  excel.visible = false
  workbook = excel.Workbooks.Add()
  worksheet = workbook.Worksheets(1)
  # puts hash.size
  (1..hash.size + 1).each do |i|
    if i == 1
      # Header row: keyword, frequency
      worksheet.Range("A#{i}:B#{i}").value = ['關鍵詞', '頻次']
    else
      # puts hash.keys[i-2], hash.values[i-2]
      worksheet.Range("A#{i}:B#{i}").value = [hash.keys[i - 2], hash.values[i - 2]]
    end
  end
  excel.DisplayAlerts = false
  workbook.saveas(File.dirname(__FILE__) + '\lagouspider.xls')
  workbook.saved = true
  excel.ActiveWorkbook.Close(1)
  excel.Quit()
end

# Get the number of pages
# Note: URI.encode was removed in Ruby 3.0, so as written this needs an older Ruby.
url = URI.encode("https://www.lagou.com/jobs/list_#{@position}?city=#{@city}&cl=false&fromSearch=true&labelWords=&suginput=")
num = get_page_num(url).to_i
puts "Found #{num} result pages"

skills = Array.new
(1..num).each do |i|
  puts "On page #{i}"
  # Get the position ids
  url2 = URI.encode("https://www.lagou.com/jobs/positionAjax.json?city=#{@city}&needAddtionalResult=false")
  headers = { Referer: url, 'User-Agent': (i % 2 == 1 ? 'Mozilla/5.0' : 'Chrome/67.0.3396.87') }
  parameters = { first: (i == 1), pn: i, kd: @position }
  positions_id = get_positionsId(url2, headers: headers, parameters: parameters)
  positions_id.each do |id|
    # Visit the position page and extract the English skill keywords
    url3 = "https://www.lagou.com/jobs/#{id}.html"
    skills.concat get_skills(url3)
    sleep 10
  end
end

count = word_count(skills)
save_excel(count)
@data = parse(count)
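To make the word-count and conversion steps concrete, here is a small worked example on made-up input. It follows the same rules as the crawler (lowercase, drop one-letter words, drop words seen only once, sort by descending count) but uses `Enumerable#tally`, which assumes Ruby 2.7 or later and is not what the original script uses. The word list is illustrative only.

```ruby
require 'json'

# Illustrative input, standing in for the keywords collected from job pages.
words = %w[Python python Java Linux linux Python SQL a]

counts = words.map(&:downcase)
              .select { |w| w.length > 1 }   # drop one-letter matches
              .tally                         # word => occurrences (Ruby 2.7+)
              .select { |_, v| v > 1 }       # drop words seen only once
              .sort_by { |_, v| -v }         # most frequent first
              .to_h
# counts => {"python"=>3, "linux"=>2}

# Same name/value shape that the crawler's parse method produces.
payload = JSON.generate(counts.map { |k, v| { 'name' => k, 'value' => v } })
# payload => [{"name":"python","value":3},{"name":"linux","value":2}]
```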

Results:

[Images: two screenshots of the generated report]
