
Scrapy: Scraping the Inner Mongolia Environmental Protection Department (內蒙古自治區環境保護廳)
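This post walks through a small Scrapy project that collects project-acceptance listings (建設項目受理情況) from the website of the Inner Mongolia Environmental Protection Department and writes them to a CSV file. It covers the spider, the item definition, the generated middleware templates, a CSV pipeline, and the settings that wire everything together.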


Spider (main program):
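The spider starts from the acceptance-list index page, queues the five paginated index pages, and hands every detail link to info_parse. Because the site publishes its tables in several inconsistent layouts, info_parse branches on the number of <td> cells per row and on the table's class attribute, and falls back to regex extraction when a page has no table at all.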

# -*- coding: utf-8 -*-
import re, scrapy
from urllib.parse import urljoin
from nmgepb.items import NmgepbItem


class BasicNmgepbSpider(scrapy.Spider):
    name = 'basic_nmgepb'
    allowed_domains = ['nmgepb.gov.cn']
    start_urls = ['http://nmgepb.gov.cn/']

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.countNum = 1
        self.startLink = 'http://www.nmgepb.gov.cn/ywgl/hjpj/xmslqk/index.html'

    def start_requests(self):
        yield scrapy.Request(url=self.startLink, dont_filter=True, callback=self.link_parse)

    def customXpathParse(self, value):
        # Join extracted text fragments and strip surrounding whitespace
        return ''.join(value).strip()

    def customReParse(self, condition, index=1):
        # Return the captured group when the regex matched, else ''
        if condition:
            return condition.group(index).strip()
        return ''

    def yieldItem(self, item):
        # Emit the item unless the row is empty or a table header ('編號' / '項目名稱')
        if item['title'] and item['title'] not in ('編號', '項目名稱'):
            print('{0:>3}:\n\t項目鏈接:{1}\n\t項目名稱:{2}\n\n'.format(self.countNum, item['link'], item['title']))
            self.countNum += 1
            yield item

    def link_parse(self, response):
        if len(response.text) < 1000:
            # Truncated response: re-queue the same URL
            yield scrapy.Request(url=response.url, dont_filter=True, callback=self.link_parse)
        else:
            allLinks = response.xpath('/html/body/div[3]/div/div[3]/div[2]/ul/li/span[2]/a/@href').extract()
            for link in allLinks:
                link = urljoin(response.url, link)
                yield scrapy.Request(url=link, callback=self.info_parse)
            if response.url == self.startLink:
                # Queue the five paginated index pages: index_1.html .. index_5.html
                for pageNum in range(1, 6):
                    link = '{0}_{1}.html'.format(self.startLink.split('.html')[0], pageNum)
                    yield scrapy.Request(url=link, callback=self.link_parse)

    def info_parse(self, response):
        # These three detail pages have broken markup and are skipped outright
        skipLinks = (
            'http://www.nmgepb.gov.cn/ywgl/hjpj/xmslqk/201412/t20141230_1472451.html',
            'http://www.nmgepb.gov.cn/ywgl/hjpj/xmslqk/201412/t20141230_1472450.html',
            'http://www.nmgepb.gov.cn/ywgl/hjpj/xmslqk/201412/t20141230_1472443.html',
        )
        if response.url in skipLinks:
            return
        item = NmgepbItem()
        trData = response.xpath('//table//tr')
        tableClass = self.customXpathParse(response.xpath('//table/@class').extract())
        if trData:
            for data in trData:
                tdNum = len(data.xpath('./td'))
                firstTd = self.customXpathParse(data.xpath('./td[1]//text()').extract())
                lastTd = self.customXpathParse(data.xpath('./td[6]//text()').extract())
                if tdNum == 3:
                    if tableClass == 'MsoTableGrid':
                        item['link'] = response.url
                        item['title'] = self.customXpathParse(data.xpath('./td[2]//text()').extract())
                        item['place'] = self.customXpathParse(data.xpath('./td[3]//text()').extract())
                        item['company'] = ''
                        item['mechanism'] = ''
                        item['date'] = ''
                        yield from self.yieldItem(item)
                    elif tableClass == 'FCK__ShowTableBorders':
                        item['link'] = response.url
                        item['title'] = self.customXpathParse(data.xpath('./td[3]//text()').extract())
                        item['place'] = ''
                        item['company'] = ''
                        item['mechanism'] = ''
                        item['date'] = self.customXpathParse(data.xpath('./td[2]//text()').extract())
                        yield from self.yieldItem(item)
                elif tdNum == 6 and lastTd:
                    item['link'] = response.url
                    item['title'] = self.customXpathParse(data.xpath('./td[2]//text()').extract())
                    item['place'] = self.customXpathParse(data.xpath('./td[3]//text()').extract())
                    item['company'] = self.customXpathParse(data.xpath('./td[4]//text()').extract())
                    item['mechanism'] = self.customXpathParse(data.xpath('./td[5]//text()').extract())
                    item['date'] = self.customXpathParse(data.xpath('./td[6]//text()').extract())
                    yield from self.yieldItem(item)
                elif tdNum in (5, 6) and not lastTd:
                    if firstTd.isdigit():
                        # First cell is a row number, so the data columns shift by one
                        item['link'] = response.url
                        item['title'] = self.customXpathParse(data.xpath('./td[2]//text()').extract())
                        item['place'] = self.customXpathParse(data.xpath('./td[3]//text()').extract())
                        item['company'] = self.customXpathParse(data.xpath('./td[4]//text()').extract())
                        item['mechanism'] = ''
                        item['date'] = self.customXpathParse(data.xpath('./td[5]//text()').extract())
                        yield from self.yieldItem(item)
                    else:
                        item['link'] = response.url
                        item['title'] = self.customXpathParse(data.xpath('./td[1]//text()').extract())
                        item['place'] = self.customXpathParse(data.xpath('./td[2]//text()').extract())
                        item['company'] = self.customXpathParse(data.xpath('./td[3]//text()').extract())
                        item['mechanism'] = self.customXpathParse(data.xpath('./td[4]//text()').extract())
                        item['date'] = self.customXpathParse(data.xpath('./td[5]//text()').extract())
                        yield from self.yieldItem(item)
                elif tdNum == 7:
                    item['link'] = response.url
                    item['title'] = self.customXpathParse(data.xpath('./td[2]//text()').extract())
                    item['place'] = self.customXpathParse(data.xpath('./td[3]//text()').extract())
                    item['company'] = self.customXpathParse(data.xpath('./td[4]//text()').extract())
                    item['mechanism'] = ''
                    item['date'] = self.customXpathParse(data.xpath('./td[7]//text()').extract())
                    yield from self.yieldItem(item)
                elif tdNum == 9:
                    item['link'] = response.url
                    item['title'] = self.customXpathParse(data.xpath('./td[2]//text()').extract())
                    item['place'] = self.customXpathParse(data.xpath('./td[3]//text()').extract())
                    item['company'] = self.customXpathParse(data.xpath('./td[4]//text()').extract())
                    item['mechanism'] = ''
                    item['date'] = self.customXpathParse(data.xpath('./td[9]//text()').extract())
                    yield from self.yieldItem(item)
        else:
            # No table at all: fall back to regex extraction from the page source
            item['link'] = response.url
            item['title'] = self.customReParse(re.search(r'<strong>項目名稱:</strong>(.*?)<', response.text, re.I))
            item['place'] = self.customReParse(re.search(r'<strong>建設地點:</strong>(.*?)<', response.text, re.I))
            item['company'] = self.customReParse(re.search(r'<strong>建設單位:</strong>(.*?)<', response.text, re.I))
            item['mechanism'] = self.customReParse(re.search(r'<strong>環境影響評價機構:</strong>(.*?)<', response.text, re.I))
            item['date'] = self.customReParse(re.search(r'<strong>受理日期:</strong>(.*?)<', response.text, re.I))
            yield from self.yieldItem(item)
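Two details worth noting: the len(response.text) < 1000 check is a crude retry heuristic for responses the server occasionally truncates, and dont_filter=True is what allows the same URL to be re-queued past Scrapy's duplicate filter. The three URLs skipped at the top of info_parse point to detail pages whose markup does not fit any of the handled layouts.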

items:

import scrapy


class NmgepbItem(scrapy.Item):
    link = scrapy.Field()
    title = scrapy.Field()
    place = scrapy.Field()
    company = scrapy.Field()
    mechanism = scrapy.Field()
    date = scrapy.Field()
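The six fields mirror the CSV columns written by the pipeline below. A scrapy.Item behaves like a dict over its declared fields, and .get() returns None for a field that was never assigned, which is why the pipeline reads values with item.get() instead of item[...]. A quick illustration:

item = NmgepbItem()
item['title'] = '某項目'   # fields are set like dict keys
print(item.get('date'))    # None: declared but never assigned
print(item['title'])       # '某項目'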

middlewares:

from scrapy import signals


class NmgepbSpiderMiddleware(object):
    # Not all methods need to be defined. If a method is not defined,
    # scrapy acts as if the spider middleware does not modify the
    # passed objects.

    @classmethod
    def from_crawler(cls, crawler):
        # This method is used by Scrapy to create your spiders.
        s = cls()
        crawler.signals.connect(s.spider_opened, signal=signals.spider_opened)
        return s

    def process_spider_input(self, response, spider):
        # Called for each response that goes through the spider
        # middleware and into the spider.

        # Should return None or raise an exception.
        return None

    def process_spider_output(self, response, result, spider):
        # Called with the results returned from the Spider, after
        # it has processed the response.

        # Must return an iterable of Request, dict or Item objects.
        for i in result:
            yield i

    def process_spider_exception(self, response, exception, spider):
        # Called when a spider or process_spider_input() method
        # (from other spider middleware) raises an exception.

        # Should return either None or an iterable of Response, dict
        # or Item objects.
        pass

    def process_start_requests(self, start_requests, spider):
        # Called with the start requests of the spider, and works
        # similarly to the process_spider_output() method, except
        # that it doesn’t have a response associated.

        # Must return only requests (not items).
        for r in start_requests:
            yield r

    def spider_opened(self, spider):
        spider.logger.info('Spider opened: %s' % spider.name)


class NmgepbDownloaderMiddleware(object):
    # Not all methods need to be defined. If a method is not defined,
    # scrapy acts as if the downloader middleware does not modify the
    # passed objects.

    @classmethod
    def from_crawler(cls, crawler):
        # This method is used by Scrapy to create your spiders.
        s = cls()
        crawler.signals.connect(s.spider_opened, signal=signals.spider_opened)
        return s

    def process_request(self, request, spider):
        # Called for each request that goes through the downloader
        # middleware.

        # Must either:
        # - return None: continue processing this request
        # - or return a Response object
        # - or return a Request object
        # - or raise IgnoreRequest: process_exception() methods of
        #   installed downloader middleware will be called
        return None

    def process_response(self, request, response, spider):
        # Called with the response returned from the downloader.

        # Must either;
        # - return a Response object
        # - return a Request object
        # - or raise IgnoreRequest
        return response

    def process_exception(self, request, exception, spider):
        # Called when a download handler or a process_request()
        # (from other downloader middleware) raises an exception.

        # Must either:
        # - return None: continue processing this exception
        # - return a Response object: stops process_exception() chain
        # - return a Request object: stops process_exception() chain
        pass

    def spider_opened(self, spider):
        spider.logger.info('Spider opened: %s' % spider.name)
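Both classes are the untouched templates that scrapy startproject generates; every hook passes requests and responses through unchanged. They are also inactive unless explicitly enabled in settings.py, which this project does not do. If you ever need the hooks to fire, enabling them would look like this (543 is the priority the template suggests):

SPIDER_MIDDLEWARES = {
    'nmgepb.middlewares.NmgepbSpiderMiddleware': 543,
}
DOWNLOADER_MIDDLEWARES = {
    'nmgepb.middlewares.NmgepbDownloaderMiddleware': 543,
}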

pipelines:

import os, csv


class NmgepbPipeline(object):
    def __init__(self):
        # Write nmgepb.csv to the project root (two levels above this file)
        self.csvFilePath = os.path.join(os.path.dirname(os.path.dirname(os.path.abspath(__file__))), 'nmgepb.csv')
        self.csvFile = open(self.csvFilePath, 'w', encoding='gb18030', newline='')
        self.csvWrite = csv.writer(self.csvFile)
        self.csvWrite.writerow(['頁面鏈接', '項目名稱', '建設地點', '建設單位', '評價機構', '受理日期'])

    def process_item(self, item, spider):
        self.csvWrite.writerow([
            item.get('link'), item.get('title'), item.get('place'),
            item.get('company'), item.get('mechanism'), item.get('date'),
        ])
        return item

    def close_spider(self, spider):
        self.csvFile.close()
        print("恭喜, 數據采集完成, 存儲路徑: %s" % self.csvFilePath)

settings (add the following):

DEFAULT_REQUEST_HEADERS = {
    "User-Agent":"Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:59.0) Gecko/20100101 Firefox/59.0"
}
ITEM_PIPELINES = {
    'nmgepb.pipelines.NmgepbPipeline': 300,
}
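With the settings in place, the crawl can be started from the directory containing scrapy.cfg with scrapy crawl basic_nmgepb. It can also be launched programmatically; the snippet below is a minimal sketch, assuming it is saved as run.py (an illustrative name) in the project root:

# run.py: minimal programmatic launcher (sketch)
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

process = CrawlerProcess(get_project_settings())  # loads the project's settings.py
process.crawl('basic_nmgepb')                     # spider name, resolved by the spider loader
process.start()                                   # blocks until the crawl finishes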
