Ruby crawler: scraping Lagou job postings and generating a word cloud report
阿新 · Published 2018-06-23
Approach: 1. Get the number of result pages a Lagou search returns for the target position.
2. Call the listing API to collect the position IDs on each page.
3. Visit each position's detail page by ID and extract the keywords.
Page requests use unirest. Lagou throttles crawlers, so frequent requests in a short window get blocked; for that reason the script is single-threaded and sleeps 10 seconds between page visits. Pages are parsed with nokogiri, and a regular expression keeps only the English words found in the skill-requirements section, so the extracted data may not be perfectly accurate.
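To illustrate that extraction step, here is a small standalone sketch of the Nokogiri + regex logic run against a made-up fragment shaped like Lagou's job-description block (dd.job_bt); note that any stray English filler words in a real description would be captured too, which is the source of the inaccuracy mentioned above:

require 'nokogiri'

# Made-up HTML fragment in the shape of Lagou's dd.job_bt block (for illustration only)
html = <<~HTML
  <dd class="job_bt">
    <p>崗位要求:熟悉 Ruby 或 Python,了解 Selenium 和 Jenkins,有 API 自動化測試經驗。</p>
  </dd>
HTML

doc = Nokogiri::HTML(html)
# Keep only runs of ASCII letters, exactly as the crawler's get_skills does
p doc.css('dd.job_bt').text.scan(/[a-zA-Z]+/)
# => ["Ruby", "Python", "Selenium", "Jenkins", "API"]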
The frequency data is persisted to an Excel file, and the word_cloud report is generated with Ruby ERB (a minimal sketch of that step follows the crawler code below).
Crawler code:
require 'unirest'
require 'uri'
require 'nokogiri'
require 'json'
require 'win32ole'

@position = '測試開發工程師'   # search keyword: "test development engineer"
@city = '杭州'                 # city filter: Hangzhou

# Fetch a URL with GET or POST and return the response body
def query_url(method, url, headers: {}, parameters: nil)
  case method
  when :get
    Unirest.get(url, headers: headers).body
  when :post
    Unirest.post(url, headers: headers, parameters: parameters).body
  end
end

# Get the total number of result pages
def get_page_num(url)
  html = query_url(:get, url).force_encoding('utf-8')
  html.scan(/<span class="span totalNum">(\d+)<\/span>/).first.first
end

# Get the IDs of all positions listed on one page
def get_positionsId(url, headers: {}, parameters: nil)
  response = query_url(:post, url, headers: headers, parameters: parameters)
  positions_id = Array.new
  response['content']['positionResult']['result'].each { |i| positions_id << i['positionId'] }
  positions_id
end

# Extract the English keywords from a job-description page
def get_skills(url)
  puts "loading url: #{url}"
  html = query_url(:get, url).force_encoding('utf-8')
  doc = Nokogiri::HTML(html)
  data = doc.css('dd.job_bt')
  data.text.scan(/[a-zA-Z]+/)
end

# Count word frequencies
def word_count(arr)
  arr.map!(&:downcase)
  arr.select! { |i| i.length > 1 }
  counter = Hash.new(0)
  arr.each { |k| counter[k] += 1 }
  # drop words that appear only once
  counter.select! { |_, v| v > 1 }
  counter.sort_by { |_, v| -v }.to_h
end

# Convert the frequency hash into a JSON array of {name, value} pairs
def parse(hash)
  data = Array.new
  hash.each do |k, v|
    word = Hash.new
    word['name'] = k
    word['value'] = v
    data << word
  end
  JSON data
end

# Persist the frequency data to Excel
def save_excel(hash)
  excel = WIN32OLE.new('Excel.Application')
  excel.visible = false
  workbook = excel.Workbooks.Add()
  worksheet = workbook.Worksheets(1)
  (1..hash.size + 1).each do |i|
    if i == 1
      worksheet.Range("A#{i}:B#{i}").value = ['關鍵詞', '頻次']   # header row: keyword / frequency
    else
      worksheet.Range("A#{i}:B#{i}").value = [hash.keys[i - 2], hash.values[i - 2]]
    end
  end
  excel.DisplayAlerts = false
  workbook.saveas(File.dirname(__FILE__) + '\lagouspider.xls')
  workbook.saved = true
  excel.ActiveWorkbook.Close(1)
  excel.Quit()
end

# Get the number of result pages
url = URI.encode("https://www.lagou.com/jobs/list_#@position?city=#@city&cl=false&fromSearch=true&labelWords=&suginput=")
num = get_page_num(url).to_i
puts "found #{num} result pages"

skills = Array.new
(1..num).each do |i|
  puts "processing page #{i}"
  # get the position IDs on this page
  url2 = URI.encode("https://www.lagou.com/jobs/positionAjax.json?city=#@city&needAddtionalResult=false")
  headers = { Referer: url, 'User-Agent': i % 2 == 1 ? 'Mozilla/5.0' : 'Chrome/67.0.3396.87' }
  parameters = { first: (i == 1), pn: i, kd: @position }
  positions_id = get_positionsId(url2, headers: headers, parameters: parameters)
  positions_id.each do |id|
    # visit each position's detail page and extract English skill keywords
    url3 = "https://www.lagou.com/jobs/#{id}.html"
    skills.concat get_skills(url3)
    sleep 10
  end
end

count = word_count(skills)
save_excel(count)
@data = parse(count)
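The ERB step that turns @data into the word-cloud report is mentioned above but not included in the post. The following is a minimal sketch, assuming @data is the JSON string produced by parse(count) in the crawler, and using ECharts with its word-cloud extension purely as an example renderer (the original template and library choice are not shown):

require 'erb'

# Assumes @data holds the JSON string produced by parse(count) above.
# The CDN URLs and the ECharts word-cloud extension are illustrative choices,
# not necessarily what the original report used.
template = <<~'HTML'
  <!DOCTYPE html>
  <html>
  <head>
    <meta charset="utf-8">
    <script src="https://cdn.jsdelivr.net/npm/echarts@4/dist/echarts.min.js"></script>
    <script src="https://cdn.jsdelivr.net/npm/echarts-wordcloud@1/dist/echarts-wordcloud.min.js"></script>
  </head>
  <body>
    <div id="cloud" style="width:900px;height:600px;"></div>
    <script>
      echarts.init(document.getElementById('cloud')).setOption({
        series: [{ type: 'wordCloud', data: <%= @data %> }]
      });
    </script>
  </body>
  </html>
HTML

File.write('word_cloud_report.html', ERB.new(template).result(binding))

Opening word_cloud_report.html in a browser then draws the keywords sized by frequency, since parse already emits the [{name, value}] items that the ECharts word-cloud series expects.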
Result:
[Screenshot: the generated word cloud of Lagou job keywords]