Crawler practice: recursively crawling every link under an entry page (distributed with scrapy-redis)
阿新 · Published 2019-02-19
1. Preparation before implementing scrapy-redis
- Install the scrapy and scrapy-redis modules in PyCharm
- Open the folder containing the scrapy-redis source code in PyCharm
- As with a plain scrapy project, modify four files: items, settings, pipelines, and the custom spider code dmoz
2. Differences between scrapy-redis and scrapy
scrapy-redis uses Redis to turn scrapy into a distributed crawler.
Scheduler
- scrapy
  - Rewrites Python's double-ended queue into its own priority queue, but when several spiders exist in one scrapy process they cannot share the same queue of pending requests
- scrapy-redis
  - Moves the scrapy request queue into a Redis database, so multiple crawlers read from and share a single queue (a simplified sketch of the idea follows this list)
  - FIFO and LIFO queues are also supported
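To make the shared-queue idea concrete, here is a deliberately simplified sketch (not the real scrapy_redis.scheduler implementation) of several worker processes feeding from a single Redis list. The key name dmoz:requests mirrors scrapy-redis's default '%(spider)s:requests' pattern, and the payload here is plain JSON rather than the serialized Request objects scrapy-redis actually stores.

import json
import redis

# One Redis server is shared by every crawler process
r = redis.StrictRedis(host='10.25.34.65', port=6379)

QUEUE_KEY = 'dmoz:requests'  # scrapy-redis default is '%(spider)s:requests'

def push_request(url, priority=0):
    # Any process may enqueue work; here a request is just a JSON blob
    r.lpush(QUEUE_KEY, json.dumps({'url': url, 'priority': priority}))

def pull_request(timeout=5):
    # Every worker blocks on the same key, so whichever process is idle
    # picks up the next request, which is what makes the crawl distributed
    result = r.brpop(QUEUE_KEY, timeout=timeout)
    return json.loads(result[1]) if result else None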
Duplication Filter
- scrapy
  - Uses an in-memory set for deduplication
  - The fingerprint of every request already sent is stored in the set, and each new request is checked against it to decide whether it has been requested before
- scrapy-redis
  - A Redis set never stores duplicate members
  - Fingerprints are stored in Redis, and only requests whose fingerprints are new are written to the request queue (see the sketch after this list)
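A rough sketch of the fingerprint mechanism, assuming a plain SHA1 over method + URL + body and a Redis SET; the real scrapy_redis.dupefilter.RFPDupeFilter builds its fingerprint with Scrapy's request-fingerprint utilities, so treat this only as an illustration.

import hashlib
import redis

r = redis.StrictRedis(host='10.25.34.65', port=6379)

DUPEFILTER_KEY = 'dmoz:dupefilter'  # scrapy-redis default is '%(spider)s:dupefilter'

def request_seen(method, url, body=b''):
    # Hash the parts of the request that identify it
    fp = hashlib.sha1()
    fp.update(method.encode())
    fp.update(url.encode())
    fp.update(body)
    fingerprint = fp.hexdigest()
    # SADD returns 1 if the member is new and 0 if it already existed,
    # so every crawler process shares one view of "already requested"
    return r.sadd(DUPEFILTER_KEY, fingerprint) == 0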
Item Pipeline
- scrapy
  - Scraped data is handed straight to the pipeline file
- scrapy-redis
  - Scraped items are pushed into a Redis queue, so item processing can run as a cluster of separate processes (a consumer sketch follows this list)
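As a hedged example of that "items processes cluster", the consumer below pops serialized items from the list that scrapy_redis.pipelines.RedisPipeline writes to (the default key pattern is '%(spider)s:items', so dmoz:items here) and could run on any machine that can reach the Redis server; the processing step is only a placeholder print.

import json
import redis

r = redis.StrictRedis(host='10.25.34.65', port=6379)

ITEMS_KEY = 'dmoz:items'  # RedisPipeline's default key pattern is '%(spider)s:items'

def consume_items():
    # Several copies of this loop can run on different machines;
    # BLPOP hands each serialized item to exactly one consumer
    while True:
        result = r.blpop(ITEMS_KEY, timeout=10)
        if result is None:
            continue
        item = json.loads(result[1])
        print(item.get('positionName'), item.get('salary'))  # placeholder processing

if __name__ == '__main__':
    consume_items()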
Base Spider
- scrapy
  - The plain Spider class
- scrapy-redis
  - Inherits from both the Spider and RedisMixin classes and reads URLs from Redis (see the RedisSpider sketch after this list)
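For reference, a minimal spider built on scrapy_redis.spiders.RedisSpider (which combines RedisMixin with Spider) could look like the sketch below: redis_key is the Redis list that start URLs are read from, and the parse body is only a placeholder. The spider later in this article keeps hard-coded start_urls instead, so this sketch is an assumption about how the Redis-driven variant would be wired, not the code used below.

from scrapy_redis.spiders import RedisSpider

class DistributedDmozSpider(RedisSpider):
    name = 'dmoz_redis'
    # Start URLs are popped from this Redis list instead of start_urls;
    # push one with: LPUSH dmoz_redis:start_urls <url>
    redis_key = 'dmoz_redis:start_urls'

    def parse(self, response):
        # Placeholder: yield whatever items or follow-up links the page produces
        yield {'url': response.url, 'title': response.css('title::text').get()}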
3. Code
settings
# Scrapy settings for lagou project
#
# For simplicity, this file contains only the most important settings by
# default. All the other settings are documented here:
#
# http://doc.scrapy.org/topics/settings.html
#
SPIDER_MODULES = ['lagou.spiders']
NEWSPIDER_MODULE = 'lagou.spiders'
#USER_AGENT = 'scrapy-redis (+https://github.com/rolando/scrapy-redis)'
# Deduplicate requests through Redis instead of Scrapy's in-memory filter
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"
# Scheduler that keeps and dispatches the request queue in Redis
SCHEDULER = "scrapy_redis.scheduler.Scheduler"
# Keep the request queue and dupefilter in Redis when the spider closes (don't flush them)
SCHEDULER_PERSIST = True
# Queue implementation provided by scrapy-redis
SCHEDULER_QUEUE_CLASS = "scrapy_redis.queue.SpiderPriorityQueue"  # commonly used: priority queue
#SCHEDULER_QUEUE_CLASS = "scrapy_redis.queue.SpiderQueue"  # FIFO queue, first in first out
#SCHEDULER_QUEUE_CLASS = "scrapy_redis.queue.SpiderStack"  # LIFO queue, last in first out
ITEM_PIPELINES = {
'lagou.pipelines.lagouPipeline': 300,
'scrapy_redis.pipelines.RedisPipeline': 400,
}
# Log level
# LOG_LEVEL = 'DEBUG'
# Introduce an artificial delay between requests so the target site is not
# overloaded when several crawler processes run in parallel
# Crawl delay (seconds)
DOWNLOAD_DELAY = 30
# Default request headers
DEFAULT_REQUEST_HEADERS = {
'Referer': 'https://www.lagou.com/jobs/list_%E6%9C%BA%E5%99%A8%E5%AD%A6%E4%B9%A0?city=%E5%8C%97%E4%BA%AC&cl=false&fromSearch=true&labelWords=&suginput=',
'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36',
}
# Cookies are not needed for this crawl
COOKIES_ENABLED = False
# Do not obey robots.txt
ROBOTSTXT_OBEY = False
# Retry failed requests
RETRY_ENABLED = True
RETRY_TIMES = 5  # number of retries
DOWNLOAD_TIMEOUT = 5  # download timeout, in seconds
# Connect to the remote Redis server; pointing several machines (or a Redis
# cluster) at the same server is what makes the crawl distributed
REDIS_HOST = '10.25.34.65'
REDIS_PORT = 6379
items
# Define here the models for your scraped items
#
# See documentation in:
# http://doc.scrapy.org/topics/items.html
from scrapy.item import Item, Field
from scrapy.loader import ItemLoader
from scrapy.loader.processors import MapCompose, TakeFirst, Join
class ExampleItem(Item):
    # Default fields from the scrapy-redis example project
name = Field()
description = Field()
link = Field()
crawled = Field()
spider = Field()
url = Field()
    # Custom fields for lagou job postings
positionName = Field()
companyFullName = Field()
companyShortName = Field()
companySize = Field()
financeStage = Field()
district = Field()
education = Field()
workYear = Field()
salary = Field()
positionAdvantage = Field()
class ExampleLoader(ItemLoader):
default_item_class = ExampleItem
default_input_processor = MapCompose(lambda s: s.strip())
default_output_processor = TakeFirst()
description_out = Join()
pipelines
# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: http://doc.scrapy.org/topics/item-pipeline.html
from datetime import datetime
import os
import pandas
class lagouPipeline(object):
    def process_item(self, item, spider):
        # Fields used by the framework example
        item["crawled"] = datetime.utcnow()
        item["spider"] = spider.name
        # Custom fields: read each value from its own item key
        positionName = item['positionName']
        companyFullName = item['companyFullName']
        companyShortName = item['companyShortName']
        companySize = item['companySize']
        financeStage = item['financeStage']
        district = item['district']
        education = item['education']
        workYear = item['workYear']
        salary = item['salary']
        positionAdvantage = item['positionAdvantage']
        # One row per item; wrap it in an outer list so pandas builds a 1x10 frame
        data = [[companyFullName, companyShortName, companySize, financeStage, district,
                 positionName, workYear, education, salary, positionAdvantage]]
        columns = ['公司全名', '公司簡稱', '公司規模', '融資階段', '區域', '職位名稱', '工作經驗', '學歷要求', '工資', '職位福利']
        df = pandas.DataFrame(data=data, columns=columns)
        # Append so earlier rows are kept; write the header only for the first row
        df.to_csv('北京-機器學習.csv', mode='a', header=not os.path.exists('北京-機器學習.csv'), index=None)
        return item
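Writing one DataFrame per item works after the fixes above, but an alternative worth noting (a sketch, not part of the original project; the class name BatchedCsvPipeline is made up and would need to be registered in ITEM_PIPELINES) is to buffer rows and write the CSV once in close_spider, so the file is not reopened for every scraped job.

import pandas

class BatchedCsvPipeline(object):
    # Hypothetical variant: buffer rows in memory and flush once when the spider closes
    def open_spider(self, spider):
        self.rows = []

    def process_item(self, item, spider):
        self.rows.append({
            '公司全名': item['companyFullName'],
            '公司簡稱': item['companyShortName'],
            '公司規模': item['companySize'],
            '融資階段': item['financeStage'],
            '區域': item['district'],
            '職位名稱': item['positionName'],
            '工作經驗': item['workYear'],
            '學歷要求': item['education'],
            '工資': item['salary'],
            '職位福利': item['positionAdvantage'],
        })
        return item

    def close_spider(self, spider):
        # One write at the end instead of one write per item
        pandas.DataFrame(self.rows).to_csv('北京-機器學習.csv', index=None)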
Custom spider code dmoz
import json
import math
import scrapy
from scrapy.spiders import CrawlSpider, Rule
from lagou.items import ExampleItem
class DmozSpider(CrawlSpider):
name = 'dmoz'
allowed_domains = ['www.lagou.com']
start_urls=['https://www.lagou.com/jobs/positionAjax.json?px=default&city=北京&needAddtionalResult=false']
    # rules = [
    #     Rule(LinkExtractor(
    #         allow=(r'a regex matching every link inside the crawl domain www.lagou.com')
    #     ), callback='start_requests', follow=True),
    # ]
def start_requests(self):
print('start_requests--------------------------------------------------------')
url = 'https://www.lagou.com/jobs/positionAjax.json?px=default&city=北京&needAddtionalResult=false'
yield scrapy.FormRequest(
url= url,
formdata={
'first': 'true',
'pn': '1',
'kd': '機器學習'
},
callback=self.get_pagenum,
)
def get_pagenum(self,response):
        # Work out how many result pages to request
meta = json.loads(response.body)
print(meta)
jobnum = meta['content']['positionResult']['totalCount']
pagedemo=math.ceil(jobnum / 15)
if pagedemo>30:
pagenum=30
else:
pagenum=pagedemo
        print(f'Total pages: {pagenum}')
url = response.url
for num in range(1,pagenum+1):
yield scrapy.FormRequest(
url= url,
formdata={
'first': 'true',
'pn': str(num),
'kd': '機器學習'
},
callback=self.get_message,
)
    def get_message(self, response):
        # json.loads turns the JSON response body into a dict; the job list
        # sits under content -> positionResult -> result
        meta = json.loads(response.body)
        print(f'meta:{meta}')
        joblist = meta['content']['positionResult']['result']
        for job in joblist:
            item = ExampleItem()
            item['positionName'] = job['positionName']
            item['companyFullName'] = job['companyFullName']
            item['companyShortName'] = job['companyShortName']
            item['companySize'] = job['companySize']
            item['financeStage'] = job['financeStage']
            item['district'] = job['district']
            item['education'] = job['education']
            item['workYear'] = job['workYear']
            item['salary'] = job['salary']
            item['positionAdvantage'] = job['positionAdvantage']
            # Hand each item to the item pipelines
            yield item
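Once the spider runs on one or more machines, all shared state lives in Redis. As a rough sanity check (an illustration, not part of the project code), the keys scrapy-redis creates for this spider can be inspected like this, assuming the default key names 'dmoz:requests', 'dmoz:dupefilter' and 'dmoz:items':

import redis

r = redis.StrictRedis(host='10.25.34.65', port=6379)

# Pending requests live in a sorted set when SpiderPriorityQueue is used
print('pending requests:', r.zcard('dmoz:requests'))
# Fingerprints of requests already seen by the shared dupefilter
print('seen fingerprints:', r.scard('dmoz:dupefilter'))
# Items pushed by scrapy_redis.pipelines.RedisPipeline
print('items waiting:', r.llen('dmoz:items'))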