Python Crawler: Scraping a Job Site with the Scrapy Framework and Storing the Results in MongoDB
阿新 • Published: 2020-01-09
Create the project
scrapy startproject zhaoping
Create the spider
cd zhaoping
scrapy genspider hr zhaopingwang.com
Directory structure
The files edited below are items.py, pipelines.py and spiders/hr.py.
items.py
import scrapy

class TencentItem(scrapy.Item):
    title = scrapy.Field()
    position = scrapy.Field()
    publish_date = scrapy.Field()
pipelines.py
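from pymongo import MongoClient

# the connection is opened once, when Scrapy imports this module
mongoclient = MongoClient(host='192.168.226.150', port=27017)
collection = mongoclient['zhaoping']['hr']

class TencentPipeline(object):
    def process_item(self, item, spider):
        print(item)
        # the item has to be converted to a dict before it is stored;
        # insert_one replaces Collection.insert, which was removed in PyMongo 4
        collection.insert_one(dict(item))
        return item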
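The pipeline above opens its MongoDB connection at import time. Scrapy also calls the optional open_spider and close_spider hooks on a pipeline, so the connection can be tied to the lifetime of the crawl instead. The following is a minimal sketch of that variant, not the author's code: the class name MongoPipeline, the client_factory parameter and the helper default_client_factory are hypothetical names introduced here so the class can be exercised without a running MongoDB server.

```python
def default_client_factory():
    """Create a real client; pymongo is imported lazily so the sketch loads without it."""
    from pymongo import MongoClient
    # host and port are the values used in the article's pipelines.py
    return MongoClient(host='192.168.226.150', port=27017)


class MongoPipeline:
    """Hypothetical variant of TencentPipeline that manages its own connection."""

    def __init__(self, client_factory=default_client_factory,
                 db_name='zhaoping', collection_name='hr'):
        self.client_factory = client_factory
        self.db_name = db_name
        self.collection_name = collection_name
        self.client = None
        self.collection = None

    def open_spider(self, spider):
        # called by Scrapy once, when the spider starts
        self.client = self.client_factory()
        self.collection = self.client[self.db_name][self.collection_name]

    def close_spider(self, spider):
        # called by Scrapy once, when the spider finishes
        self.client.close()

    def process_item(self, item, spider):
        # dict(item) works for scrapy Items and for plain dicts
        self.collection.insert_one(dict(item))
        return item
```

Passing a different client_factory is only a convenience for trying the class out with a stand-in client; with the default factory it talks to the same server and collection as the original pipeline.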
spiders/hr.py
import scrapy
from zhaoping.items import TencentItem

# parse() belongs to the spider class that genspider created in this file
def parse(self, response):
    # skip the first and the last row
    tr_list = response.xpath("//table[@class='tablelist']/tr")[1:-1]
    for tr in tr_list:
        item = TencentItem()
        # XPath positions are counted from 1
        item["title"] = tr.xpath("./td[1]/a/text()").extract_first()
        item["position"] = tr.xpath("./td[2]/text()").extract_first()
        item["publish_date"] = tr.xpath("./td[5]/text()").extract_first()
        yield item
    next_url = response.xpath("//a[@id='next']/@href").extract_first()
    # build the URL of the next page; stop when there is no usable link
    if next_url and next_url != "javascript:;":
        print(next_url)
        next_url = "https://hr.tencent.com/" + next_url
        yield scrapy.Request(url=next_url, callback=self.parse)
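The row slicing, the 1-based XPath positions and the next-page check can be tried without Scrapy or network access. The sketch below uses the standard library's xml.etree.ElementTree as a stand-in for Scrapy's selectors (it supports only a subset of XPath) and urllib.parse.urljoin in place of string concatenation. SAMPLE_PAGE, extract_rows and build_next_url are hypothetical names, and the sample markup is invented for illustration; the real page may look different.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urljoin

# Hypothetical, well-formed sample page; the real site's markup may differ.
SAMPLE_PAGE = """
<html><body>
<table class="tablelist">
  <tr><td>Title</td><td>Category</td><td>Count</td><td>City</td><td>Date</td></tr>
  <tr><td><a href="#">Backend Engineer</a></td><td>Tech</td><td>2</td><td>Shenzhen</td><td>2020-01-08</td></tr>
  <tr><td><a href="#">Product Manager</a></td><td>Product</td><td>1</td><td>Beijing</td><td>2020-01-07</td></tr>
  <tr><td>pager</td></tr>
</table>
<a id="next" href="position.php?start=10">next</a>
</body></html>
""".strip()


def extract_rows(root):
    """Return one dict per data row, skipping the first and the last <tr>."""
    rows = root.findall(".//table[@class='tablelist']/tr")[1:-1]
    items = []
    for tr in rows:
        items.append({
            # td[1], td[2], td[5]: XPath positions start at 1, not 0
            "title": tr.findtext("./td[1]/a"),
            "position": tr.findtext("./td[2]"),
            "publish_date": tr.findtext("./td[5]"),
        })
    return items


def build_next_url(page_url, href):
    """Resolve the next-page link, or return None when there is no next page."""
    if not href or href == "javascript:;":
        return None
    return urljoin(page_url, href)


root = ET.fromstring(SAMPLE_PAGE)
items = extract_rows(root)
next_href = root.find(".//a[@id='next']").get("href")
next_url = build_next_url("https://hr.tencent.com/position.php", next_href)
```

Inside a real spider, response.urljoin(next_url) performs the same resolution against the URL of the current response.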
That is all it takes to collect the data.
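One step the walkthrough does not show: Scrapy only calls a pipeline that is listed in the ITEM_PIPELINES setting. A minimal sketch of that entry in settings.py, assuming the default module layout produced by scrapy startproject zhaoping and the class name TencentPipeline from pipelines.py; the priority 300 is an arbitrary choice within the customary 0 to 1000 range.

```python
# zhaoping/settings.py (fragment)
ITEM_PIPELINES = {
    # dotted path to the pipeline class; lower numbers run earlier
    "zhaoping.pipelines.TencentPipeline": 300,
}
```

With that in place, running scrapy crawl hr from the project directory starts the spider.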