Scraping BOSS Zhipin job postings with Python 3 + Scrapy
阿新 • Published: 2018-11-04
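Preface: This article records how the project was implemented and draws on other articles; if anything here is unclear, consult the original article. Because the aim is to keep a record, steps the author already knows well may be carried out without explanation. If you have questions, leave a comment or use a search engine.
References: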
1. Install Scrapy
Open PowerShell as administrator and run:
pip install scrapy
Note: pip must be installed before this step; search for instructions if you need them.
2. Create a Scrapy project in a directory of your choice
scrapy startproject boss
3. Enter the project directory
cd boss
4. Create the spider
scrapy genspider bosszhipin www.zhipin.com
5. Import the project into PyCharm and edit settings.py
Change ROBOTSTXT_OBEY = True
to ROBOTSTXT_OBEY = False
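For context, ROBOTSTXT_OBEY controls whether Scrapy consults a site's robots.txt before fetching a URL. The sketch below uses the standard library's urllib.robotparser to show the kind of check involved. It is not Scrapy's own implementation, and the robots.txt rules and example.com URLs are made up for illustration rather than taken from www.zhipin.com.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, for illustration only.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def is_allowed(robots_txt, user_agent, url):
    """Return True if the given robots.txt text permits user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes the file content as a list of lines, so no network access is needed.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == '__main__':
    for url in ('https://example.com/jobs/', 'https://example.com/private/page'):
        print(url, is_allowed(SAMPLE_ROBOTS_TXT, '*', url))
```

With ROBOTSTXT_OBEY = False, Scrapy skips this kind of check and requests every URL the spider asks for.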
6. Write bosszhipin.py and run.py
# -*- coding: utf-8 -*-
import scrapy


class BosszhipinSpider(scrapy.Spider):
    name = 'bosszhipin'
    allowed_domains = ['www.zhipin.com']
    start_urls = ['https://www.zhipin.com/c101270100-p100101/?page=1&ka=page-1']

    def parse(self, response):
        # Print the raw HTML to confirm that the page can be fetched
        print(response.text)
run.py goes in the project root directory (next to scrapy.cfg):
from scrapy.cmdline import execute
execute(['scrapy','crawl','bosszhipin'])
Running it produces an error:
2018-11-04 13:03:36 [scrapy.core.engine] INFO: Spider opened
2018-11-04 13:03:36 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2018-11-04 13:03:36 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023
2018-11-04 13:03:37 [scrapy.core.engine] DEBUG: Crawled (403) <GET https://www.zhipin.com/c101270100-p100101/?page=1&ka=page-1> (referer: None)
2018-11-04 13:03:37 [scrapy.spidermiddlewares.httperror] INFO: Ignoring response <403 https://www.zhipin.com/c101270100-p100101/?page=1&ka=page-1>: HTTP status code is not handled or not allowed
2018-11-04 13:03:37 [scrapy.core.engine] INFO: Closing spider (finished)
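The request was rejected with HTTP 403, most likely by the site's anti-scraping measures, so use a downloader middleware to change the request headers.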
Add the following to middlewares.py:
import random


class UserAgentMiddleware(object):
    def __init__(self, user_agent_list):
        self.user_agent = user_agent_list

    @classmethod
    def from_crawler(cls, crawler, *args, **kwargs):
        # Read the MY_USER_AGENT field from the settings file
        middleware = cls(crawler.settings.get('MY_USER_AGENT'))
        return middleware

    def process_request(self, request, spider):
        # Pick a random user agent for each request
        request.headers['user-agent'] = random.choice(self.user_agent)
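In settings.py, enable the middleware and set the value of MY_USER_AGENT:
USER_AGENT = 'boss (+http://www.yourdomain.com)'
...
DOWNLOADER_MIDDLEWARES = {
    'boss.middlewares.BossDownloaderMiddleware': 543,
}
(The code above already exists in the generated settings.py, just commented out. Try enabling it first, and look for another solution only if it does not work.)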
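If the defaults are not enough, the UserAgentMiddleware class added to middlewares.py can be wired up in settings.py roughly as follows. This is a sketch, not the author's exact configuration: the user agent strings in MY_USER_AGENT are example values to replace with your own, and reusing priority 543 is an assumption. Mapping Scrapy's built-in user agent middleware to None disables it.

```python
# settings.py (fragment)

# Pool of user agent strings read by UserAgentMiddleware.from_crawler().
# Example values only; substitute strings from the browsers you want to imitate.
MY_USER_AGENT = [
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 '
    '(KHTML, like Gecko) Chrome/70.0.3538.77 Safari/537.36',
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:63.0) '
    'Gecko/20100101 Firefox/63.0',
]

DOWNLOADER_MIDDLEWARES = {
    # Disable the built-in middleware so only the custom one sets the header.
    'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,
    # Custom middleware from middlewares.py; the priority value is an assumption.
    'boss.middlewares.UserAgentMiddleware': 543,
}
```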
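Run run.py again; this time the page HTML is retrieved.
Below is the full code for the first stage. I plan to add MongoDB later, because printing the scraped text straight to the console is not much use...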
# -*- coding: utf-8 -*-
import scrapy


class BosszhipinSpider(scrapy.Spider):
    name = 'bosszhipin'
    allowed_domains = ['www.zhipin.com']
    start_urls = ['https://www.zhipin.com/c101270100-p100101/?page=1&ka=page-1']

    def parse(self, response):
        # print(response.text)
        # Each <li> under the job list holds one posting
        job_node_table = response.xpath("//*[@id=\"main\"]/div/div[2]/ul")
        job_node_list = job_node_table.xpath("./li")
        for job_node in job_node_list:
            enterprise_node = job_node.xpath("./div/div[2]/div/h3/a")
            salary_node = job_node.xpath("./div/div[1]/h3/a/span")
            requirement_node = job_node.xpath("./div/div[1]/p")
            time_node = job_node.xpath("./div/div[3]/p")
            # string(.) concatenates all text inside the node
            enterprise = enterprise_node.xpath('string(.)')
            salary = salary_node.xpath('string(.)')
            requirement = requirement_node.xpath('string(.)')
            time = time_node.xpath('string(.)')
            # default='' avoids calling strip() on None when a node is missing
            print("Company", enterprise.extract_first(default='').strip())
            print("Salary", salary.extract_first(default='').strip())
            print("Requirements", requirement.extract_first(default='').strip())
            print("Updated", time.extract_first(default='').strip())
            print()
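Since printed output is hard to reuse, one option before adding a database is to have parse() yield a dict per posting, which Scrapy can then write to a file with `scrapy crawl bosszhipin -o jobs.json`. The helper below is a hypothetical sketch of that idea and not part of the original project: build_job_item and its field names are invented here, and it works on plain strings so it can be tried without running a crawl.

```python
def build_job_item(enterprise, salary, requirement, updated):
    """Return one job posting as a dict of whitespace-trimmed strings.

    Each argument is the text extracted for that field, or None when the
    corresponding node was missing from the page.
    """
    raw = {
        'enterprise': enterprise,
        'salary': salary,
        'requirement': requirement,
        'updated': updated,
    }
    # Treat None as an empty string so strip() is always safe to call.
    return {key: (value or '').strip() for key, value in raw.items()}


if __name__ == '__main__':
    # Made-up sample values standing in for extracted text.
    item = build_job_item('  Example Co.\n', '10k-20k', None, ' yesterday ')
    print(item)
```

Inside the loop in parse(), the print calls could then be replaced by `yield build_job_item(enterprise.extract_first(), salary.extract_first(), requirement.extract_first(), time.extract_first())`.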