34. Solving pagination in a Scrapy spider

阿新 • Published: 2018-09-26


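The main problems addressed here:

1. Paging requires two parameters that are loaded in the page and must be sent back with every paging request:

  '__VIEWSTATE': '{}'.format(response.meta['data']['__VIEWSTATE']),

  '__EVENTVALIDATION': '{}'.format(response.meta['data']['__EVENTVALIDATION']),

One more thing to note is dont_filter=False (Scrapy's default, which leaves the duplicate-request filter enabled):

yield scrapy.FormRequest(url=response.url, callback=self.parse, formdata=data, method="POST", dont_filter=False)

2. Dates: in my own run I collected the data for 2008 through 2018.
3. The scraped fields can end up mixed up when they are written to the database.
4. Some problems, such as index-out-of-range errors, are not really solved; the code only catches the exception and does nothing with it.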
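
The two hidden fields from point 1 and the date range from point 2 can be combined into one POST body per page and month. The following is a minimal standard-library sketch rather than the spider itself: the sample HTML, the helper names hidden_field and build_form, and the YYYY-MM date format are assumptions made for illustration; the form field names are the ones used in the spider listing in this post.

```python
import re
from itertools import product

# Hypothetical, shortened page source; a real ASP.NET page carries much longer values.
SAMPLE_HTML = (
    '<input type="hidden" name="__VIEWSTATE" id="__VIEWSTATE" value="abc123" />'
    '<input type="hidden" name="__EVENTVALIDATION" id="__EVENTVALIDATION" value="xyz789" />'
)


def hidden_field(html, field_id):
    """Return the value of the hidden input with the given id, or None if absent."""
    match = re.search(r'id="{}" value="(.*?)"'.format(re.escape(field_id)), html)
    return match.group(1) if match else None


def build_form(html, page, year, month):
    """Build the POST body for one result page of one month."""
    return {
        '__VIEWSTATE': hidden_field(html, '__VIEWSTATE'),
        '__EVENTVALIDATION': hidden_field(html, '__EVENTVALIDATION'),
        '__EVENTTARGET': 'ctl00$ContentPlaceContent$Pager',
        '__EVENTARGUMENT': str(page),
        # Assumed date format: YYYY-MM
        'ctl00$ContentPlaceContent$txtPublishDate': '{}-{:02d}'.format(year, month),
    }


# One (year, month) pair for every month from 2008 through 2018.
periods = list(product(range(2008, 2019), range(1, 13)))

form = build_form(SAMPLE_HTML, page=2, year=2008, month=1)
print(form['__VIEWSTATE'], form['ctl00$ContentPlaceContent$txtPublishDate'], len(periods))
```

In a spider, a dictionary like this would be passed as the formdata argument of scrapy.FormRequest, with every value given as a string. Returning None for a missing field makes a changed page layout easy to detect before any request is sent.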

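This site is a little more troublesome than usual, so start by analysing it. The address is http://www.nbzj.net/MaterialPriceList.aspx (寧波造價網, the Ningbo construction cost website). What I mainly want from it is the material price information (材料信息價) data.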

nbzj.py

# -*- coding: utf-8 -*-
import scrapy
import re
from nbzj_web.items import NbzjWebItem

class NbzjSpider(scrapy.Spider):
    name = 'nbzj'
    allowed_domains = ['www.nbzj.net']
    start_urls = ['http://www.nbzj.net/MaterialPriceList.aspx']
    custom_settings = {
        "DOWNLOAD_DELAY": 1,
        "ITEM_PIPELINES": {
            'nbzj_web.pipelines.MysqlPipeline': 300,
        },
        "DOWNLOADER_MIDDLEWARES": {
            'nbzj_web.middlewares.NbzjWebDownloaderMiddleware': 500,
        },
    }

    def parse(self, response):
        _response = response.text

        # Paging parameters, read from the hidden ASP.NET form fields
        viewstate = re.findall(r'id="__VIEWSTATE" value="(.*?)" />', _response)
        A = viewstate[0]
        event_validation = re.findall(r'id="__EVENTVALIDATION" value="(.*?)" />', _response)
        B = event_validation[0]

        # Page number: taken from the link right after the "next page" link
        # and used as the maximum page. The Chinese text in the pattern must
        # match the site's HTML exactly.
        page_num = re.findall(r'>下頁</a><a title="轉到第(.*?)頁"', _response)
        max_page = page_num[0]

        content = {
            '__VIEWSTATE': A,
            '__EVENTVALIDATION': B,
            'page_num': max_page,
        }


        # Get the list of <td> tags from the price table
        tag_list = response.xpath("//div[@class='fcon']/table[@class='mytable']//tr/td").extract()
        # Taking the text directly caused problems, so the raw tags are kept
        # here and the markup is stripped below.

        # Each table row has 9 cells, so split the flat cell list into rows.
        # (The original hard-coded 15 slices and skipped cells 90-98, which
        # looks like a typo; slicing in a loop also avoids index errors.)
        rows = [tag_list[i:i + 9] for i in range(0, len(tag_list), 9)]

        # Column order: code, name, district, model/specification, unit,
        # price excluding tax, price including tax, date
        fields = ['code', 'name', 'district', '_type', 'unit',
                  'except_tax_price', 'tax_price', 'time']

        for tag in rows:
            if len(tag) < len(fields):
                continue  # incomplete row, skip it
            item = NbzjWebItem()
            for field, cell in zip(fields, tag):
                # Strip the opening <td ...> and the closing </td>
                item[field] = re.sub(r'^<td[^>]*>|</td>$', '', cell)
            yield item

        yield scrapy.Request(url=response.url,callback=self.parse_detail,meta={"data": content})

    def parse_detail(self, response):
        months = ['01', '02', '03', '04', '05', '06', '07', '08', '09', '10', '11', '12']
        for h in range(2008, 2019):  # years 2008 through 2018
            for j in months:
                try:
                    max_page = response.meta['data']['page_num']
                    # range() excludes its end value; add 1 to include the
                    # last page (assuming page_num is the last page number)
                    for i in range(2, int(max_page) + 1):
                        data = {
                            '__VIEWSTATE': '{}'.format(response.meta['data']['__VIEWSTATE']),
                            '__VIEWSTATEGENERATOR': 'E53A32FA',
                            '__EVENTTARGET': 'ctl00$ContentPlaceContent$Pager',
                            '__EVENTARGUMENT': '{}'.format(i),
                            '__EVENTVALIDATION': '{}'.format(response.meta['data']['__EVENTVALIDATION']),
                            'HeadSearchType': 'localsite',
                            'ctl00$ContentPlaceContent$txtnewCode': '',
                            'ctl00$ContentPlaceContent$txtMaterualName': '',
                            'ctl00$ContentPlaceContent$ddlArea': '',
                            # The original '{} - 0{}' produced values such as '2008 - 001';
                            # 'YYYY-MM' is assumed to be the format the form expects.
                            'ctl00$ContentPlaceContent$txtPublishDate': '{}-{}'.format(h, j),
                            'ctl00$ContentPlaceContent$ddlCategoryOne': '',
                            'ctl00$ContentPlaceContent$hidCateId': '',
                            'ctl00$ContentPlaceContent$txtSpecification': '',
                            'ctl00$ContentPlaceContent$Pager_input': '{}'.format(i - 1),
                            'ctl00$foot$ddlsnzjw': '0',
                            'ctl00$foot$ddlswzjw': '0',
                            'ctl00$foot$ddlqtxgw': '0'
                        }
                        yield scrapy.FormRequest(url=response.url, callback=self.parse, formdata=data, method="POST", dont_filter=False)
                except Exception as e:
                    # Still not handled properly; log the error instead of hiding it
                    self.logger.warning('Skipping %s-%s: %s', h, j, e)
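
The row handling in parse (a flat list of td strings cut into groups of 9, then stripped of markup) can be tried out without Scrapy. Below is a small sketch under stated assumptions: the helper names strip_td and rows_from_cells and the sample cells are hypothetical, while the field names and the 9-cells-per-row layout come from the spider above.

```python
import re

FIELDS = ['code', 'name', 'district', '_type', 'unit',
          'except_tax_price', 'tax_price', 'time']
CELLS_PER_ROW = 9  # as in the spider: 9 <td> cells per table row


def strip_td(cell):
    """Remove the opening <td ...> tag and the closing </td> tag from one cell."""
    return re.sub(r'^<td[^>]*>|</td>$', '', cell.strip())


def rows_from_cells(cells):
    """Group a flat list of <td> strings into rows and map them to field names."""
    rows = []
    for start in range(0, len(cells), CELLS_PER_ROW):
        chunk = cells[start:start + CELLS_PER_ROW]
        if len(chunk) < len(FIELDS):
            continue  # skip a trailing, incomplete row instead of raising IndexError
        rows.append({name: strip_td(cell) for name, cell in zip(FIELDS, chunk)})
    return rows


# Hypothetical sample: one full row (9 cells) followed by a partial row (2 cells).
sample = ['<td style="text-align: center">A{}</td>'.format(i) for i in range(9)]
sample += ['<td>left</td>', '<td>over</td>']
print(rows_from_cells(sample))
```

Stepping through the list in a loop avoids both the hand-written slice boundaries and the out-of-range errors mentioned in point 4, because a short final chunk is simply skipped.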

items.py

# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# https://doc.scrapy.org/en/latest/topics/items.html

import scrapy


class NbzjWebItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()

    code=scrapy.Field()
    name=scrapy.Field()
    district=scrapy.Field()
    _type=scrapy.Field()
    unit=scrapy.Field()
    except_tax_price=scrapy.Field()
    tax_price =scrapy.Field()
    time=scrapy.Field()

middlewares.py


# -*- coding: utf-8 -*-

# Define here the models for your spider middleware
#
# See documentation in:
# https://doc.scrapy.org/en/latest/topics/spider-middleware.html

from scrapy import signals


class NbzjWebSpiderMiddleware(object):
    # Not all methods need to be defined. If a method is not defined,
    # scrapy acts as if the spider middleware does not modify the
    # passed objects.

    @classmethod
    def from_crawler(cls, crawler):
        # This method is used by Scrapy to create your spiders.
        s = cls()
        crawler.signals.connect(s.spider_opened, signal=signals.spider_opened)
        return s

    def process_spider_input(self, response, spider):
        # Called for each response that goes through the spider
        # middleware and into the spider.

        # Should return None or raise an exception.
        return None

    def process_spider_output(self, response, result, spider):
        # Called with the results returned from the Spider, after
        # it has processed the response.

        # Must return an iterable of Request, dict or Item objects.
        for i in result:
            yield i

    def process_spider_exception(self, response, exception, spider):
        # Called when a spider or process_spider_input() method
        # (from other spider middleware) raises an exception.

        # Should return either None or an iterable of Response, dict
        # or Item objects.
        pass

    def process_start_requests(self, start_requests, spider):
        # Called with the start requests of the spider, and works
        # similarly to the process_spider_output() method, except
        # that it doesn’t have a response associated.

        # Must return only requests (not items).
        for r in start_requests:
            yield r

    def spider_opened(self, spider):
        spider.logger.info('Spider opened: %s' % spider.name)


class NbzjWebDownloaderMiddleware(object):
    # Not all methods need to be defined. If a method is not defined,
    # scrapy acts as if the downloader middleware does not modify the
    # passed objects.

    @classmethod
    def from_crawler(cls, crawler):
        # This method is used by Scrapy to create your spiders.
        s = cls()
        crawler.signals.connect(s.spider_opened, signal=signals.spider_opened)
        return s

    def process_request(self, request, spider):
        # Called for each request that goes through the downloader
        # middleware.

        # Must either:
        # - return None: continue processing this request
        # - or return a Response object
        # - or return a Request object
        # - or raise IgnoreRequest: process_exception() methods of
        #   installed downloader middleware will be called
        return None

    def process_response(self, request, response, spider):
        # Called with the response returned from the downloader.

        # Must either;
        # - return a Response object
        # - return a Request object
        # - or raise IgnoreRequest
        return response

    def process_exception(self, request, exception, spider):
        # Called when a download handler or a process_request()
        # (from other downloader middleware) raises an exception.

        # Must either:
        # - return None: continue processing this exception
        # - return a Response object: stops process_exception() chain
        # - return a Request object: stops process_exception() chain
        pass

    def spider_opened(self, spider):
        spider.logger.info('Spider opened: %s' % spider.name)

pipelines.py

# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://doc.scrapy.org/en/latest/topics/item-pipeline.html

import pymysql


class NbzjWebPipeline(object):
    def process_item(self, item, spider):
        return item


# Save the data to MySQL
class MysqlPipeline(object):

    def open_spider(self, spider):
        # The scrapy.conf module is deprecated; read the settings from the
        # running spider instead.
        settings = spider.settings
        self.host = settings.get('MYSQL_HOST')
        self.port = settings.get('MYSQL_PORT')
        self.user = settings.get('MYSQL_USER')
        self.password = settings.get('MYSQL_PASSWORD')
        self.db = settings.get('MYSQL_DB')
        self.table = settings.get('TABLE')
        self.client = pymysql.connect(host=self.host, user=self.user, password=self.password, port=self.port, db=self.db, charset='utf8')

    def process_item(self, item, spider):
        item_dict = dict(item)
        cursor = self.client.cursor()
        values = ','.join(['%s'] * len(item_dict))
        keys = ','.join(item_dict.keys())
        sql = 'INSERT INTO {table}({keys}) VALUES ({values})'.format(table=self.table, keys=keys, values=values)
        try:
            # First argument is the SQL statement, second is a tuple of values
            if cursor.execute(sql, tuple(item_dict.values())):
                print('Row inserted successfully!')
                self.client.commit()
        except Exception as e:
            print(e)
            print('Insert failed (the row may already exist)!')
            self.client.rollback()
        return item

    def close_spider(self, spider):
        self.client.close()
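
The way MysqlPipeline assembles its INSERT statement from the item's keys can be checked in isolation. This is only a sketch under stated assumptions: the helper build_insert and the sample values are hypothetical, and an in-memory SQLite table stands in for MySQL, which is why the second call switches the placeholder from %s to ?.

```python
import sqlite3


def build_insert(table, item, placeholder='%s'):
    """Return (sql, params) for inserting one item with a parameterized query."""
    keys = list(item.keys())
    columns = ', '.join(keys)
    marks = ', '.join([placeholder] * len(keys))
    sql = 'INSERT INTO {} ({}) VALUES ({})'.format(table, columns, marks)
    return sql, tuple(item[k] for k in keys)


item = {'code': '0101', 'name': 'cement', 'unit': 't'}  # hypothetical values

# pymysql-style placeholders, as in MysqlPipeline
sql, params = build_insert('web_nbzj', item)
print(sql)

# The same statement shape run against an in-memory SQLite table (SQLite uses '?').
conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE web_nbzj (code TEXT, name TEXT, unit TEXT)')
lite_sql, lite_params = build_insert('web_nbzj', item, placeholder='?')
conn.execute(lite_sql, lite_params)
conn.commit()
stored = conn.execute('SELECT code, name, unit FROM web_nbzj').fetchall()
print(stored)
conn.close()
```

Only the values travel as query parameters; the table and column names are formatted into the SQL text, so they should come from the item definition and the settings, never from scraped data. Building the column list and the value tuple from the same key order is also what keeps fields from landing in the wrong columns (point 3).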

settings.py

# -*- coding: utf-8 -*-

# Scrapy settings for nbzj_web project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
#     https://doc.scrapy.org/en/latest/topics/settings.html
#     https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
#     https://doc.scrapy.org/en/latest/topics/spider-middleware.html

BOT_NAME = 'nbzj_web'

SPIDER_MODULES = ['nbzj_web.spiders']
NEWSPIDER_MODULE = 'nbzj_web.spiders'


# MySQL connection parameters
MYSQL_HOST = "172.16.10.197"
MYSQL_PORT = 3306
MYSQL_USER = "root"
MYSQL_PASSWORD = "123456"
MYSQL_DB = 'web_datas'
TABLE = "web_nbzj"

# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = 'nbzj_web (+http://www.yourdomain.com)'

# Obey robots.txt rules
ROBOTSTXT_OBEY = False

# Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS = 32

# Configure a delay for requests for the same website (default: 0)
# See https://doc.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
#DOWNLOAD_DELAY = 3
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16

# Disable cookies (enabled by default)
#COOKIES_ENABLED = False

# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False

# Override the default request headers:
#DEFAULT_REQUEST_HEADERS = {
#   'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
#   'Accept-Language': 'en',
#}

# Enable or disable spider middlewares
# See https://doc.scrapy.org/en/latest/topics/spider-middleware.html
#SPIDER_MIDDLEWARES = {
#    'nbzj_web.middlewares.NbzjWebSpiderMiddleware': 543,
#}

# Enable or disable downloader middlewares
# See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
DOWNLOADER_MIDDLEWARES = {
   'nbzj_web.middlewares.NbzjWebDownloaderMiddleware': 543,
}

# Enable or disable extensions
# See https://doc.scrapy.org/en/latest/topics/extensions.html
#EXTENSIONS = {
#    'scrapy.extensions.telnet.TelnetConsole': None,
#}

# Configure item pipelines
# See https://doc.scrapy.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
   'nbzj_web.pipelines.NbzjWebPipeline': 300,
}

# Enable and configure the AutoThrottle extension (disabled by default)
# See https://doc.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False

# Enable and configure HTTP caching (disabled by default)
# See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = 'httpcache'
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'

The result of running scrapy crawl nbzj:

[Image: screenshot of the crawl output]
