
Scrapy Project: Tencent Recruitment

Goal:

Practice using the Scrapy framework by crawling the Tencent recruitment site (https://careers.tencent.com/search.html)

Steps:

1. Use packet capture to confirm whether the target content is served at the current URL. Testing shows the job postings are not in the page's HTML; they come from a separate request whose response is a JSON string.

2. The request URL found by packet capture is very long, so try dropping some of its parts. Testing reduces it to:

'https://careers.tencent.com/tencentcareer/api/post/Query?keyword=&pageIndex=1&pageSize=10&language=zh-cn&area=cn'
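Before writing the spider, it is worth confirming that the trimmed URL really returns the expected JSON. A minimal check, assuming the requests library is installed; the field names (Data, Posts, RecruitPostName, ...) are the same ones the spider parses later:

import json
import requests

# Fetch page 1 of the trimmed API URL and confirm the body is JSON
# with the fields the spider will rely on.
api = ('https://careers.tencent.com/tencentcareer/api/post/Query'
       '?keyword=&pageIndex=1&pageSize=10&language=zh-cn&area=cn')
resp = requests.get(api)
data = json.loads(resp.text)
first = data["Data"]["Posts"][0]
print(first["RecruitPostName"], first["LocationName"], first["LastUpdateTime"])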

3. In this URL, pageIndex is the page number, so a list of URLs for the crawl loop can be constructed, as in the sketch below.
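Building the page URLs by hand first verifies the pattern before committing it to the spider; a short sketch:

# pageIndex is the only part of the URL that changes between pages.
url_template = ('https://careers.tencent.com/tencentcareer/api/post/Query'
                '?keyword=&pageIndex={}&pageSize=10&language=zh-cn&area=cn')
url_list = [url_template.format(page) for page in range(1, 11)]  # pages 1..10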

4. Write the spider code:

  - scrapy startproject tencent tencent.com

  - cd tencent.com

  - scrapy genspider hr tencent.com

5. Save the results as a CSV file.

import scrapy
import json


class HrSpider(scrapy.Spider):
    name = 'hr'
    allowed_domains = ['tencent.com']
    start_urls = ['https://careers.tencent.com/tencentcareer/api/post/Query'
                  '?keyword=&pageIndex=1&pageSize=10&language=zh-cn&area=cn']
    # URL template: pageIndex is the only part that changes per page
    url = ('https://careers.tencent.com/tencentcareer/api/post/Query'
           '?keyword=&pageIndex={}&pageSize=10&language=zh-cn&area=cn')
    pageIndex = 1

    def parse(self, response):
        # The response body is a JSON string, not HTML
        json_str = json.loads(response.body)
        for content in json_str["Data"]["Posts"]:
            content_dic = {"title": content["RecruitPostName"],
                           "location": content["LocationName"],
                           "date": content["LastUpdateTime"]}
            yield content_dic

        # Follow pages 2..10 by filling in the next pageIndex
        if self.pageIndex < 10:
            self.pageIndex += 1
            next_url = self.url.format(self.pageIndex)
            yield scrapy.Request(url=next_url, callback=self.parse)

Run the spider:

  - scrapy crawl hr

Save the output while crawling:

  - scrapy crawl hr -o tencent.csv
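Alternatively, the CSV export can be configured in the project itself rather than on the command line. A sketch of the equivalent settings.py entry, assuming Scrapy 2.1 or newer (where the FEEDS setting was introduced):

# settings.py -- same effect as `scrapy crawl hr -o tencent.csv`
FEEDS = {
    "tencent.csv": {
        "format": "csv",
        "encoding": "utf-8",
    },
}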