
A relatively complete Scrapy spider

First, generate the spider project skeleton with the scrapy commands:

scrapy startproject mySpider

cd mySpider

scrapy genspider itcast itcast.cn
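These commands typically produce a project layout like the following (scrapy genspider adds itcast.py under the spiders/ package; the comments are mine):

    mySpider/
        scrapy.cfg            # deploy configuration
        mySpider/
            __init__.py
            items.py          # item definitions
            middlewares.py    # spider / downloader middlewares
            pipelines.py      # item pipelines
            settings.py       # project settings
            spiders/
                __init__.py
                itcast.py     # generated by scrapy genspider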
itcast.py

# -*- coding: utf-8 -*-
import scrapy

from mySpider.items import ItcastItem


# scrapy.Spider is the base class of all spiders; user-defined spiders must inherit from it.
class ItcastSpider(scrapy.Spider):
    # Spider name. It must be unique: different spiders must use different names.
    name = 'itcast'
    # allowed_domains is the domain scope of the spider, i.e. its crawling boundary:
    # only pages under these domains are crawled, and URLs outside them are ignored.
    allowed_domains = ['itcast.cn']
    # start_urls: the tuple/list of URLs the spider starts crawling from.
    # The first downloads are made from these URLs; further requests are
    # generated from the responses they return.
    start_urls = ("http://www.itcast.cn/channel/teacher.shtml",)

    def parse(self, response):
        """
        parse(self, response) is called once each start URL has finished downloading,
        receiving the Response object returned for that URL as its only argument.
        Its two jobs are:
        1. parse the returned page data (response.body) and extract structured data (items);
        2. generate requests for the next URLs to crawl.
        """
        for each in response.xpath("//div[@class='li_txt']"):
            # wrap the extracted data in an ItcastItem object
            item = ItcastItem()
            # extract() returns a list of unicode strings
            name = each.xpath("h3/text()").extract()
            title = each.xpath("h4/text()").extract()
            info = each.xpath("p/text()").extract()
            # each XPath query returns a single-element list
            item['name'] = name[0]
            item['title'] = title[0]
            item['info'] = info[0]
            yield item  # hand the item to the pipeline
        # to return all the data at once without going through the pipeline,
        # collect the items in a list and return it instead of yielding

Before writing the extraction logic, you can check that the spider works by simply dumping the page to disk:

    def parse(self, response):
        filename = 'teacher.html'
        with open(filename, 'w') as f:
            f.write(response.text)

Then run the spider from the mySpider directory:

    scrapy crawl itcast

Yes, itcast: it is the name attribute of the ItcastSpider class, i.e. the unique spider name passed to scrapy genspider. If the log ends with [scrapy] INFO: Spider closed (finished), the run completed, and a teacher.html file containing the full source of the crawled page appears in the current directory.

We already defined an ItcastItem class in mySpider/items.py. Importing it with from mySpider.items import ItcastItem lets parse() wrap each teacher's data (name, title, info) in an ItcastItem object, as shown above.
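One detail worth noting about the loop above: extract() always returns a list, so indexing with [0] raises an IndexError when a node is missing. Scrapy selectors also provide extract_first() (get() in newer versions), which returns a default instead; a minimal sketch of the same loop using it:

    def parse(self, response):
        for each in response.xpath("//div[@class='li_txt']"):
            item = ItcastItem()
            # extract_first() returns the first match, or the given default when
            # nothing matches, so a missing node no longer raises IndexError
            item['name'] = each.xpath("h3/text()").extract_first(default="")
            item['title'] = each.xpath("h4/text()").extract_first(default="")
            item['info'] = each.xpath("p/text()").extract_first(default="")
            yield item

If the page contained links to further pages, the second job of parse() (generating follow-up requests) would be handled by yielding scrapy.Request(next_url, callback=self.parse) from the same method.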
items.py

# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# https://doc.scrapy.org/en/latest/topics/items.html

import scrapy
'''
An Item defines the structured data fields used to store the scraped data.
It works much like a Python dict, but provides some extra protection to reduce errors.
'''
class ItcastItem(scrapy.Item):
    name = scrapy.Field()
    level = scrapy.Field()
    title = scrapy.Field()
    info = scrapy.Field()
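The "extra protection" mainly means that assigning to a field that was not declared raises a KeyError, which catches typos early (the values here are just illustrative):

    item = ItcastItem()
    item['name'] = 'Mr. Wang'   # fine: 'name' is a declared Field
    item['nmae'] = 'Mr. Wang'   # KeyError: ItcastItem does not support field: nmae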
pipelines.py

# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://doc.scrapy.org/en/latest/topics/item-pipeline.html
'''
    After an Item has been collected in a Spider, it is passed to the Item Pipeline,
and the pipeline components process it in the order they are defined.
Each Item Pipeline is a Python class implementing a few simple methods, for example
deciding whether an Item should be dropped or stored. Typical uses of an item pipeline are:
- validating scraped data (checking that an item contains certain fields, e.g. name)
- detecting (and dropping) duplicates
- storing the scraped results in a file or a database

'''
import json
class ItcastJsonPipeline(object):
    '''
    Write items to a JSON file.
    This pipeline stores every item scraped (from all spiders) in a single
    standalone file (teacher.json here), one item serialized as JSON per line.
    '''

    def __init__(self):
        self.file = open('teacher.json', 'w+', encoding='utf-8')

    def process_item(self, item, spider):
        content = json.dumps(dict(item), ensure_ascii=False) + "\n"
        self.file.write(content)
        return item

    def close_spider(self, spider):
        self.file.close()
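For this particular job you could also skip the custom pipeline and use Scrapy's built-in feed exports, e.g. scrapy crawl itcast -o teacher.json. The other typical uses listed above (validation and de-duplication) would go in their own pipeline; a minimal sketch, assuming teachers are uniquely identified by the name field (the class name and that assumption are mine, not part of the original project):

    from scrapy.exceptions import DropItem

    class ValidateAndDedupPipeline(object):
        def __init__(self):
            self.seen_names = set()

        def process_item(self, item, spider):
            # drop items that are missing the required name field
            if not item.get('name'):
                raise DropItem("missing name in %s" % item)
            # drop duplicates already seen during this crawl
            if item['name'] in self.seen_names:
                raise DropItem("duplicate teacher: %s" % item['name'])
            self.seen_names.add(item['name'])
            return item

To take effect it would also need its own entry in ITEM_PIPELINES in settings.py.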
settings.py

# -*- coding: utf-8 -*-

# Scrapy settings for mySpider project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
# https://doc.scrapy.org/en/latest/topics/settings.html
# https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
# https://doc.scrapy.org/en/latest/topics/spider-middleware.html

BOT_NAME = 'mySpider'

SPIDER_MODULES = ['mySpider.spiders']
NEWSPIDER_MODULE = 'mySpider.spiders'


# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = 'mySpider (+http://www.yourdomain.com)'

# Obey robots.txt rules
ROBOTSTXT_OBEY = False

# Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS = 32

# Configure a delay for requests for the same website (default: 0)
# See https://doc.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
DOWNLOAD_DELAY = 3
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16

# Disable cookies (enabled by default)
COOKIES_ENABLED = False

# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False

# Override the default request headers:
#DEFAULT_REQUEST_HEADERS = {
# 'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
# 'Accept-Language': 'en',
#}

# Enable or disable spider middlewares
# See https://doc.scrapy.org/en/latest/topics/spider-middleware.html
#SPIDER_MIDDLEWARES = {
# 'mySpider.middlewares.MyspiderSpiderMiddleware': 543,
#}

# Enable or disable downloader middlewares
# See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
DOWNLOADER_MIDDLEWARES = {
    # 'mySpider.middlewares.MyspiderDownloaderMiddleware': 543,
    'mySpider.middlewares.RandomUserAgent': 1,
    # 'mySpider.middlewares.RandomProxy': 100,
}
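# Lower numbers sit closer to the engine: process_request() is called in ascending
# order, so RandomUserAgent (priority 1) sets the User-Agent header before Scrapy's
# built-in UserAgentMiddleware (priority 400) applies its default.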

# Enable or disable extensions
# See https://doc.scrapy.org/en/latest/topics/extensions.html
#EXTENSIONS = {
# 'scrapy.extensions.telnet.TelnetConsole': None,
#}

# Configure item pipelines
# See https://doc.scrapy.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
    'mySpider.pipelines.ItcastJsonPipeline': 300,
}
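# Pipeline priority values are integers, customarily in the 0-1000 range; items
# pass through the enabled pipelines in ascending order of these numbers.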

# Enable and configure the AutoThrottle extension (disabled by default)
# See https://doc.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False

# Enable and configure HTTP caching (disabled by default)
# See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = 'httpcache'
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'
USER_AGENTS = [
    "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Win64; x64; Trident/5.0; .NET CLR 3.5.30729; .NET CLR 3.0.30729; .NET CLR 2.0.50727; Media Center PC 6.0)",
    "Mozilla/5.0 (compatible; MSIE 8.0; Windows NT 6.0; Trident/4.0; WOW64; Trident/4.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; .NET CLR 1.0.3705; .NET CLR 1.1.4322)",
    "Mozilla/4.0 (compatible; MSIE 7.0b; Windows NT 5.2; .NET CLR 1.1.4322; .NET CLR 2.0.50727; InfoPath.2; .NET CLR 3.0.04506.30)",
    "Mozilla/5.0 (Windows; U; Windows NT 5.1; zh-CN) AppleWebKit/523.15 (KHTML, like Gecko, Safari/419.3) Arora/0.3 (Change: 287 c9dfb30)",
    "Mozilla/5.0 (X11; U; Linux; en-US) AppleWebKit/527+ (KHTML, like Gecko, Safari/419.3) Arora/0.6",
    "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.2pre) Gecko/20070215 K-Ninja/2.1.1",
    "Mozilla/5.0 (Windows; U; Windows NT 5.1; zh-CN; rv:1.9) Gecko/20080705 Firefox/3.0 Kapiko/3.0",
    "Mozilla/5.0 (X11; Linux i686; U;) Gecko/20070322 Kazehakase/0.4.5",
]

 

middlewares.py

import random

from mySpider.settings import USER_AGENTS


class RandomUserAgent(object):
    def process_request(self, request, spider):
        # pick a random User-Agent from the list defined in settings.py
        useragent = random.choice(USER_AGENTS)
        request.headers.setdefault("User-Agent", useragent)
Nothing else was changed.
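The RandomProxy middleware that is commented out in settings.py is not included in this post. If you wanted one, a minimal sketch could look like the following, assuming a PROXIES list of "http://host:port" strings is added to settings.py (both that list and this class body are illustrative, not part of the original project):

    import random

    from mySpider.settings import PROXIES   # hypothetical list of proxy URLs

    class RandomProxy(object):
        def process_request(self, request, spider):
            # attach a randomly chosen proxy; Scrapy's HttpProxyMiddleware
            # honours request.meta['proxy'] when downloading
            request.meta['proxy'] = random.choice(PROXIES)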