Another way to handle pagination in Scrapy, persisting items to MongoDB, and using the from_crawler class method

阿新 • Published: 2019-03-08


I. Working with the Scrapy framework

　　1. Pagination

　　　　The example crawls Amazon.

　　　　Spider file (.py)

# -*- coding: utf-8 -*-
import scrapy
from Amazon.items import AmazonItem


class AmazonSpider(scrapy.Spider):
    name = 'amazon'
    allowed_domains = ['www.amazon.cn']
    start_urls = ['https://www.amazon.cn']

    def start_requests(self):
        # Override the parent method so the crawl starts from the product search page
        url = 'https://www.amazon.cn/s/ref=nb_sb_noss?__mk_zh_CN=亞馬遜網站&url=search-alias%3Daps&field-keywords=iphone+-xs&rh=i%3Aaps%2Ck%3Aiphone+-xs&ajr=0'
        yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        # Extract the URL of every product on the results page
        links = response.xpath('//*[contains(@id,"result_")]/div/div[3]/div[1]/a/@href').extract()
        # Grab the link to the next page at the same time
        next_page_url = response.xpath('//a[@id="pagnNextLink"]/@href').extract_first()
        print('>>>>>>>>>>>>>', next_page_url)
        # Request the detail page of each product (urljoin resolves relative hrefs)
        for link in links:
            yield scrapy.Request(url=response.urljoin(link), callback=self.parse_detail)

        # Pagination: after all product requests on this page have been yielded,
        # check for a next page and, if there is one, request it with this same callback
        if next_page_url:
            # The request must be yielded; a Request that is only constructed is never scheduled
            yield scrapy.Request(url=response.urljoin(next_page_url), callback=self.parse)

    def parse_detail(self, response):
        # Parse the fields we want from each product detail page;
        # default='' keeps .strip() from failing when a node is missing
        title = response.xpath('//*[@id="productTitle"]/text()').extract_first(default='').strip()
        price = (response.xpath("//*[@id='priceblock_ourprice']/text()") or response.xpath(
            "//*[@id='priceblock_saleprice']/text()")).extract_first(default='').strip()
        deliver = response.xpath('//*[@id="ddmMerchantMessage"]/*[1]/text()').extract_first(default='').strip()

        # Put the data into the item container
        item = AmazonItem()
        item['title'] = title
        item['price'] = price
        item['deliver'] = deliver
        # Remember to yield the item, otherwise the pipeline never receives it
        yield item
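The control flow of this pagination style can be tried out without Scrapy or a network connection. The sketch below is only a simulation under stated assumptions: `PAGES`, `parse`, and `crawl` are hypothetical stand-ins, not Scrapy APIs, and the URLs are invented. It shows a callback that yields every detail link first and then yields the next results page, which is fed back into the same callback.

```python
from collections import deque

# Hypothetical site: each results page lists product links and an optional next page.
PAGES = {
    "/s?page=1": {"links": ["/dp/A1", "/dp/A2"], "next": "/s?page=2"},
    "/s?page=2": {"links": ["/dp/B1"], "next": None},
}


def parse(url):
    """Stand-in for the spider's parse callback."""
    page = PAGES[url]
    # First yield one "request" per product detail page.
    for link in page["links"]:
        yield ("detail", link)
    # Then, if a next page exists, yield a request handled by parse again.
    if page["next"]:
        yield ("list", page["next"])


def crawl(start_url):
    """Tiny stand-in for the scheduler: processes requests in FIFO order."""
    queue = deque([("list", start_url)])
    visited_details = []
    while queue:
        kind, url = queue.popleft()
        if kind == "list":
            queue.extend(parse(url))
        else:
            visited_details.append(url)
    return visited_details


details = crawl("/s?page=1")
print(details)
```

If the `yield` in front of the next-page request is dropped, the generator never hands that request to the scheduler and the crawl stops after the first results page.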

　　2. Persisting to MongoDB and using from_crawler

　　　　pipelines.py

# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://doc.scrapy.org/en/latest/topics/item-pipeline.html

import pymongo


class AmazonPipeline(object):

    @classmethod
    def from_crawler(cls, crawler):
        """
        Scrapy uses getattr to check whether we defined from_crawler; if we
        did, Scrapy calls it to build the instance, so it runs before __init__.
        The values read here have to be configured in settings.py.
        """
        HOST = crawler.settings.get('HOST')
        PORT = crawler.settings.get('PORT')
        USER = crawler.settings.get('USER')
        PWD = crawler.settings.get('PWD')
        DB = crawler.settings.get('DB')
        TABLE = crawler.settings.get('TABLE')
        return cls(HOST, PORT, USER, PWD, DB, TABLE)

    def __init__(self, host, port, user, pwd, db, table):
        self.host = host
        self.port = port
        self.user = user
        self.pwd = pwd
        self.db = db
        self.table = table

    def open_spider(self, spider):
        # Runs once, when the spider is opened.
        # Note: user and pwd are stored but not used for this connection.
        self.client = pymongo.MongoClient(host=self.host, port=self.port)

    def process_item(self, item, spider):
        dic_item = dict(item)
        if dic_item:
            # Collection.save() was removed in PyMongo 4; insert_one stores the new document
            self.client[self.db][self.table].insert_one(dic_item)
        return item

    def close_spider(self, spider):
        # Runs once, when the spider is closed
        self.client.close()
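The way from_crawler feeds settings into `__init__` can be illustrated with plain Python. This is a minimal sketch, not Scrapy's actual implementation: `FakeSettings`, `FakeCrawler`, `DemoPipeline`, and `build_component` are all hypothetical names, and only the `settings.get()` call is imitated.

```python
class FakeSettings:
    """Stand-in for crawler.settings; only .get() is imitated."""

    def __init__(self, values):
        self._values = values

    def get(self, name, default=None):
        return self._values.get(name, default)


class FakeCrawler:
    """Stand-in for the crawler object that is passed to from_crawler."""

    def __init__(self, values):
        self.settings = FakeSettings(values)


class DemoPipeline:
    def __init__(self, host, port):
        self.host = host
        self.port = port

    @classmethod
    def from_crawler(cls, crawler):
        # Read configuration first, then call the constructor with it.
        return cls(crawler.settings.get("HOST"), crawler.settings.get("PORT"))


def build_component(cls, crawler):
    """Roughly the check described in the docstring: prefer from_crawler if defined."""
    factory = getattr(cls, "from_crawler", None)
    if factory is not None:
        return factory(crawler)
    return cls()


pipeline = build_component(DemoPipeline, FakeCrawler({"HOST": "127.0.0.1", "PORT": 27017}))
print(pipeline.host, pipeline.port)
```

The benefit of this design is that the pipeline's constructor stays a plain function of its arguments, while all knowledge of where the values come from lives in one class method.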

　　settings.py

# -*- coding: utf-8 -*-

# Scrapy settings for Amazon project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
#     https://doc.scrapy.org/en/latest/topics/settings.html
#     https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
#     https://doc.scrapy.org/en/latest/topics/spider-middleware.html

BOT_NAME = 'Amazon'

SPIDER_MODULES = ['Amazon.spiders']
NEWSPIDER_MODULE = 'Amazon.spiders'


# Crawl responsibly by identifying yourself (and your website) on the user-agent
USER_AGENT = 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.26 Safari/537.36'

# Obey robots.txt rules
ROBOTSTXT_OBEY = False

# Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS = 32

# Configure a delay for requests for the same website (default: 0)
# See https://doc.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
#DOWNLOAD_DELAY = 3
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16

# Disable cookies (enabled by default)
#COOKIES_ENABLED = False

# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False

# Override the default request headers:
#DEFAULT_REQUEST_HEADERS = {
#   'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
#   'Accept-Language': 'en',
#}

# Enable or disable spider middlewares
# See https://doc.scrapy.org/en/latest/topics/spider-middleware.html
#SPIDER_MIDDLEWARES = {
#    'Amazon.middlewares.AmazonSpiderMiddleware': 543,
#}

# Enable or disable downloader middlewares
# See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
DOWNLOADER_MIDDLEWARES = {
   'Amazon.middlewares.AmazonDownloaderMiddleware': 543,
}

# Enable or disable extensions
# See https://doc.scrapy.org/en/latest/topics/extensions.html
#EXTENSIONS = {
#    'scrapy.extensions.telnet.TelnetConsole': None,
#}

# Configure item pipelines
# See https://doc.scrapy.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
   'Amazon.pipelines.AmazonPipeline': 300,
}

# Enable and configure the AutoThrottle extension (disabled by default)
# See https://doc.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False

# Enable and configure HTTP caching (disabled by default)
# See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = 'httpcache'
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'


### MongoDB configuration (read by AmazonPipeline.from_crawler)
HOST = '127.0.0.1'
PORT = 27017
USER = 'root'
PWD = ''
DB = 'amazon'
TABLE = 'goods'
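The pipeline reads USER and PWD but connects with host and port only. If the MongoDB server requires authentication, one option is to build a connection URI from these settings. The helper below is a sketch under assumptions: `mongo_uri` is a hypothetical name, and it is assumed that credentials are used only when both user and password are non-empty. Special characters in either are percent-escaped with `urllib.parse.quote_plus`.

```python
from urllib.parse import quote_plus


def mongo_uri(host, port, user="", pwd=""):
    """Build a mongodb:// URI; credentials are included only when both are set."""
    if user and pwd:
        # Percent-escape so characters such as '@' or '/' cannot break the URI.
        return "mongodb://%s:%s@%s:%d" % (quote_plus(user), quote_plus(pwd), host, port)
    return "mongodb://%s:%d" % (host, port)


# With the values from settings.py (empty PWD), no credentials are added.
plain_uri = mongo_uri("127.0.0.1", 27017, "root", "")
# With an invented password containing special characters.
auth_uri = mongo_uri("127.0.0.1", 27017, "root", "p@ss/word")
print(plain_uri)
print(auth_uri)
```

The resulting string can be passed to `pymongo.MongoClient` in place of the separate host and port arguments.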


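II. A small extra tip

　　Starting the spider from the command line every time gets tedious. Here is an alternative.

　　Create a .py file in the root directory of the crawler project and put the following in it:

# The first and second elements stay the same; the third is the spider's name (its name attribute).
# A fourth element, '--nolog', can be added to suppress log output.
from scrapy.cmdline import execute
execute(['scrapy', 'crawl', 'amazon'])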
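A slightly more flexible variant of the launcher keeps the argument list in a small function so that the `--nolog` switch can be toggled. This is only a sketch: `crawl_argv` and `main` are hypothetical names, and Scrapy is imported inside `main` so the helper itself can be used without Scrapy installed.

```python
def crawl_argv(spider_name, nolog=False):
    """Return the argument list for 'scrapy crawl <spider_name>'."""
    argv = ["scrapy", "crawl", spider_name]
    if nolog:
        # Optional fourth element: suppress Scrapy's log output.
        argv.append("--nolog")
    return argv


def main():
    # Imported here so that crawl_argv works even where Scrapy is absent.
    from scrapy.cmdline import execute

    execute(crawl_argv("amazon", nolog=True))


print(crawl_argv("amazon"))
print(crawl_argv("amazon", nolog=True))
```

To launch the spider, call `main()` at the bottom of the file and run the file from the project's root directory.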
