Scraping Zhaopin (智聯招聘): two implementations, with the Scrapy framework and with the requests library
阿新 · Posted: 2018-12-12
# First, analyze the target site. The listing data turns out to come from a JSON API, and the company details have to be fetched with a second request to each company's own page.
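A quick way to confirm that analysis before writing the full scraper is to hit the JSON endpoint once and look at a single result. This snippet is not part of the original post; the query string is shortened here (the full one appears in the script below), and it only prints the first result's company page URL.

import requests
import json

# the same search endpoint the scraper uses (page 1, keyword "python", city 530; query string shortened)
url = "https://fe-api.zhaopin.com/c/i/sou?start=0&pageSize=60&cityId=530&workExperience=-1&education=-1&companyType=-1&employmentType=-1&jobWelfareTag=-1&kw=python&kt=3"
data = json.loads(requests.get(url).text)

# each result carries a company page URL that has to be requested again for the details
print(data["data"]["results"][0]["company"]["url"])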
# 1. Scraping and saving directly with requests
## Points to note:
- Different companies' detail pages may use different layouts, so the code has to detect which form it is dealing with and parse each one accordingly
- When saving to a CSV file, mind the format: to keep the rows from being separated by blank lines, open the file with newline='' (see the minimal illustration after this list)
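As a standalone illustration of the newline='' point (the file name and sample rows below are made up and not part of the scraper): without newline='', the csv module on Windows tends to emit a blank line after every data row.

import csv

# made-up sample rows, just to show the effect of newline=''
rows = [["公司", "薪酬"], ["某公司", "10K-15K"]]

# newline='' hands line-ending control to the csv module,
# so the output contains no blank line between data rows
with open("demo.csv", "w", encoding="utf-8", newline="") as f:
    writer = csv.writer(f, dialect="excel")
    writer.writerows(rows)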
import requests
import json
from lxml import etree
import csv

lists = []
for n in range(0, 1):
    url = "https://fe-api.zhaopin.com/c/i/sou?start={}&pageSize=60&cityId=530&workExperience=-1&education=-1&companyType=-1&employmentType=-1&jobWelfareTag=-1&kw=python&kt=3&lastUrlQuery=%7B%22p%22:2,%22pageSize%22:%2260%22,%22jl%22:%22530%22,%22kw%22:%22python%22,%22kt%22:%223%22%7D".format(n*60)
    response = json.loads(requests.get(url).text)
    # print(response)
    for i in range(0, 60):
        page = response["data"]["results"][i]["company"]["url"]
        # print(page)
        if len(page) < 48:
            html = requests.get(page).text
            a = etree.HTML(html)
            dizi = a.xpath('//table[@class="comTinyDes"]//span[@class="comAddress"]/text()')
            jianjie = a.xpath('string(//div[@class="part2"]//div)').strip()
            gongsi = response["data"]["results"][i]["company"]["name"]
            guimo = response["data"]["results"][i]["company"]["size"]["name"]
            xinchou = response["data"]["results"][i]["salary"]
            lists.append([i+1, gongsi, page, guimo, xinchou, dizi, jianjie])
            print(lists)
            print(gongsi)
            print(page)
            print(guimo)
            print(xinchou)
            print(dizi)
            print(jianjie)
            print("*"*50)
            # with open("aa.txt", "a", encoding="utf-8") as f:
            #     f.write("{}{} {} {} {} {} {}".format(i+1, gongsi, page, guimo, xinchou, dizi, jianjie))
            #     f.write("\n")
        else:
            continue

with open("aa.csv", 'w', encoding='utf-8', newline='') as f:
    k = csv.writer(f, dialect='excel')
    k.writerow(["數量", "公司", "網址", "規模", "薪酬", "地址", "簡介"])
    for list in lists:
        k.writerow(list)
    # print("="*20)
# 2. Scraping with the Scrapy framework
## Points to note:
- Passing an item between callbacks: create the item in the first callback, attach it to the follow-up request via meta, then retrieve it and finish filling it in the second callback:

def parse(self, response):
    item = ItemClass()
    yield Request(url, meta={'item': item}, callback=self.parse_item)

def parse_item(self, response):
    item = response.meta['item']
    item['field'] = value
    yield item

(Snippet adapted from 何健's answer on 知乎: https://www.zhihu.com/question/54773510/answer/141177867; copyright remains with the author.)
- Handling the blank-line problem when exporting to CSV with scrapy crawl zhilian -o aaa.csv: one fix is to edit the Scrapy source itself (located here at D:\Python36\Lib\site-packages\scrapy\exporters.py) and add a single newline="" line, as shown below; an alternative that leaves the installed package untouched follows the snippet.
class CsvItemExporter(BaseItemExporter):

    def __init__(self, file, include_headers_line=True, join_multivalued=',', **kwargs):
        ...
        self.stream = io.TextIOWrapper(
            file,
            newline="",            # the newly added line
            line_buffering=False,
            write_through=True,
            encoding=self.encoding
        ) if six.PY3 else file

(This fix comes from 範翻番樊's CSDN blog; full text at https://blog.csdn.net/u011361138/article/details/79912895?utm_source=copy)
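Editing the installed Scrapy package is easy to lose on upgrade. A gentler alternative, sketched below, is a small item pipeline that writes the CSV itself with newline='' and skips the -o export entirely. The module path zhilianzp/pipelines.py, the class name CsvWriterPipeline, and the output file name zhilian.csv are assumptions made for this sketch, not part of the original post.

# zhilianzp/pipelines.py (assumed module) -- write the CSV directly from a pipeline
import csv

class CsvWriterPipeline(object):

    def open_spider(self, spider):
        # newline='' keeps the csv module from emitting a blank line after every row on Windows
        self.file = open("zhilian.csv", "w", encoding="utf-8", newline="")
        self.writer = csv.writer(self.file, dialect="excel")
        self.writer.writerow(["公司", "規模", "薪酬", "地址", "簡介"])

    def process_item(self, item, spider):
        self.writer.writerow([
            item.get("gongsi"),
            item.get("guimo"),
            item.get("xinchou"),
            item.get("dizi"),
            item.get("jianjie"),
        ])
        return item

    def close_spider(self, spider):
        self.file.close()

Register it in settings.py with ITEM_PIPELINES = {'zhilianzp.pipelines.CsvWriterPipeline': 300} and run scrapy crawl zhilian without the -o flag.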
- Fix for the import error that often appears when the spider references the items module (the path looks wrong): this is an IDE problem, because PyCharm does not automatically add the current project directory to its own source path.
The concrete fix is as follows:
1) Right-click your Scrapy project folder
2) Click Mark Directory as
3) Then click Sources Root
4) When the folder turns blue, the fix has worked
# Finally, the Scrapy spider code for scraping Zhaopin:
# -*- coding: utf-8 -*-
import scrapy
import json
from zhilianzp.items import ZhilianzpItem

cc = {}

class ZhilianSpider(scrapy.Spider):
    name = 'zhilian'
    # start_urls = ['https://www.baidu.com/']

    def start_requests(self):
        url = "https://fe-api.zhaopin.com/c/i/sou?start=0&pageSize=60&cityId=530&workExperience=-1&education=-1&companyType=-1&employmentType=-1&jobWelfareTag=-1&kw=python&kt=3&lastUrlQuery=%7B%22p%22:2,%22pageSize%22:%2260%22,%22jl%22:%22530%22,%22kw%22:%22python%22,%22kt%22:%223%22%7D"
        yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        content = json.loads(response.text)
        for i in range(0, 60):
            page = content["data"]["results"][i]["company"]["url"]
            # print(page)
            # create a fresh item for each result so concurrent requests do not overwrite each other's fields
            item = ZhilianzpItem()
            if len(page) < 48:
                item["gongsi"] = content["data"]["results"][i]["company"]["name"]
                item["guimo"] = content["data"]["results"][i]["company"]["size"]["name"]
                item["xinchou"] = content["data"]["results"][i]["salary"]
                # hand the half-filled item to the company-page callback via meta
                yield scrapy.Request(page, meta={"key": item}, callback=self.next_parse)
                # print(item["gongsi"])
            else:
                continue
        # return content
        # yield item

    def next_parse(self, response):
        # pick the item back up and fill in the fields scraped from the company page
        item = response.meta['key']
        item["dizi"] = response.xpath('//table[@class="comTinyDes"]//span[@class="comAddress"]/text()').extract()
        item["jianjie"] = response.xpath('string(//div[@class="part2"]//div)').extract_first()
        yield item
        # print(jianjie)
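The spider imports ZhilianzpItem from zhilianzp.items, but that file is not shown in the post. A minimal items.py matching the fields used above would look roughly like this (the field set is inferred from the spider, not taken from the original source):

# zhilianzp/items.py -- fields inferred from what the spider assigns
import scrapy

class ZhilianzpItem(scrapy.Item):
    gongsi = scrapy.Field()    # company name
    guimo = scrapy.Field()     # company size
    xinchou = scrapy.Field()   # salary
    dizi = scrapy.Field()      # address
    jianjie = scrapy.Field()   # company profile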