
A relatively complete Scrapy spider

First, generate the spider project skeleton with the scrapy commands:

scrapy startproject mySpider

cd mySpider

scrapy genspider itcast itcast.cn
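These commands typically produce a project layout like the following (scrapy genspider adds itcast.py under the spiders/ package; the comments are mine):

    mySpider/
        scrapy.cfg            # deploy configuration
        mySpider/
            __init__.py
            items.py          # item definitions
            middlewares.py    # spider / downloader middlewares
            pipelines.py      # item pipelines
            settings.py       # project settings
            spiders/
                __init__.py
                itcast.py     # generated by scrapy genspider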
itcast.py

# -*- coding: utf-8 -*-
import scrapy

from mySpider.items import ItcastItem


# scrapy.Spider is the base class of all spiders; user-defined spiders must inherit from it.
class ItcastSpider(scrapy.Spider):
    # Spider name. It must be unique: different spiders must use different names.
    name = 'itcast'
    # allowed_domains is the domain scope of the spider, i.e. its crawling boundary:
    # only pages under these domains are crawled, and URLs outside them are ignored.
    allowed_domains = ['itcast.cn']
    # start_urls: the tuple/list of URLs the spider starts crawling from.
    # The first downloads are made from these URLs; further requests are
    # generated from the responses they return.
    start_urls = ("http://www.itcast.cn/channel/teacher.shtml",)

    def parse(self, response):
        """
        parse(self, response) is called once each start URL has finished downloading,
        receiving the Response object returned for that URL as its only argument.
        Its two jobs are:
        1. parse the returned page data (response.body) and extract structured data (items);
        2. generate requests for the next URLs to crawl.
        """
        for each in response.xpath("//div[@class='li_txt']"):
            # wrap the extracted data in an ItcastItem object
            item = ItcastItem()
            # extract() returns a list of unicode strings
            name = each.xpath("h3/text()").extract()
            title = each.xpath("h4/text()").extract()
            info = each.xpath("p/text()").extract()
            # each XPath query returns a single-element list
            item['name'] = name[0]
            item['title'] = title[0]
            item['info'] = info[0]
            yield item  # hand the item to the pipeline
        # to return all the data at once without going through the pipeline,
        # collect the items in a list and return it instead of yielding

Before writing the extraction logic, you can check that the spider works by simply dumping the page to disk:

    def parse(self, response):
        filename = 'teacher.html'
        with open(filename, 'w') as f:
            f.write(response.text)

Then run the spider from the mySpider directory:

    scrapy crawl itcast

Yes, itcast: it is the name attribute of the ItcastSpider class, i.e. the unique spider name passed to scrapy genspider. If the log ends with [scrapy] INFO: Spider closed (finished), the run completed, and a teacher.html file containing the full source of the crawled page appears in the current directory.

We already defined an ItcastItem class in mySpider/items.py. Importing it with from mySpider.items import ItcastItem lets parse() wrap each teacher's data (name, title, info) in an ItcastItem object, as shown above.
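One detail worth noting about the loop above: extract() always returns a list, so indexing with [0] raises an IndexError when a node is missing. Scrapy selectors also provide extract_first() (get() in newer versions), which returns a default instead; a minimal sketch of the same loop using it:

    def parse(self, response):
        for each in response.xpath("//div[@class='li_txt']"):
            item = ItcastItem()
            # extract_first() returns the first match, or the given default when
            # nothing matches, so a missing node no longer raises IndexError
            item['name'] = each.xpath("h3/text()").extract_first(default="")
            item['title'] = each.xpath("h4/text()").extract_first(default="")
            item['info'] = each.xpath("p/text()").extract_first(default="")
            yield item

If the page contained links to further pages, the second job of parse() (generating follow-up requests) would be handled by yielding scrapy.Request(next_url, callback=self.parse) from the same method.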
items.py

# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# https://doc.scrapy.org/en/latest/topics/items.html

import scrapy
'''
An Item defines the structured data fields used to store the scraped data.
It works much like a Python dict, but provides some extra protection to reduce errors.
'''
class ItcastItem(scrapy.Item):
    name = scrapy.Field()
    level = scrapy.Field()
    title = scrapy.Field()
    info = scrapy.Field()
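The "extra protection" mainly means that assigning to a field that was not declared raises a KeyError, which catches typos early (the values here are just illustrative):

    item = ItcastItem()
    item['name'] = 'Mr. Wang'   # fine: 'name' is a declared Field
    item['nmae'] = 'Mr. Wang'   # KeyError: ItcastItem does not support field: nmae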
pipelines.py

# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://doc.scrapy.org/en/latest/topics/item-pipeline.html
'''
    After an Item has been collected in a Spider, it is passed to the Item Pipeline,
and the pipeline components process it in the order they are defined.
Each Item Pipeline is a Python class implementing a few simple methods, for example
deciding whether an Item should be dropped or stored. Typical uses of an item pipeline are:
- validating scraped data (checking that an item contains certain fields, e.g. name)
- detecting (and dropping) duplicates
- storing the scraped results in a file or a database

'''
import json
class ItcastJsonPipeline(object):
    '''
    Write items to a JSON file.
    This pipeline stores every item scraped (from all spiders) in a single
    standalone file (teacher.json here), one item serialized as JSON per line.
    '''

    def __init__(self):
        self.file = open('teacher.json', 'w+', encoding='utf-8')

    def process_item(self, item, spider):
        content = json.dumps(dict(item), ensure_ascii=False) + "\n"
        self.file.write(content)
        return item

    def close_spider(self, spider):
        self.file.close()
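For this particular job you could also skip the custom pipeline and use Scrapy's built-in feed exports, e.g. scrapy crawl itcast -o teacher.json. The other typical uses listed above (validation and de-duplication) would go in their own pipeline; a minimal sketch, assuming teachers are uniquely identified by the name field (the class name and that assumption are mine, not part of the original project):

    from scrapy.exceptions import DropItem

    class ValidateAndDedupPipeline(object):
        def __init__(self):
            self.seen_names = set()

        def process_item(self, item, spider):
            # drop items that are missing the required name field
            if not item.get('name'):
                raise DropItem("missing name in %s" % item)
            # drop duplicates already seen during this crawl
            if item['name'] in self.seen_names:
                raise DropItem("duplicate teacher: %s" % item['name'])
            self.seen_names.add(item['name'])
            return item

To take effect it would also need its own entry in ITEM_PIPELINES in settings.py.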
settings.py

# -*- coding: utf-8 -*-

# Scrapy settings for mySpider project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
# https://doc.scrapy.org/en/latest/topics/settings.html
# https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
# https://doc.scrapy.org/en/latest/topics/spider-middleware.html

BOT_NAME = 'mySpider'

SPIDER_MODULES = ['mySpider.spiders']
NEWSPIDER_MODULE = 'mySpider.spiders'


# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = 'mySpider (+http://www.yourdomain.com)'

# Obey robots.txt rules
ROBOTSTXT_OBEY = False

# Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS = 32

# Configure a delay for requests for the same website (default: 0)
# See https://doc.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
DOWNLOAD_DELAY = 3
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16

# Disable cookies (enabled by default)
COOKIES_ENABLED = False

# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False

# Override the default request headers:
#DEFAULT_REQUEST_HEADERS = {
# 'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
# 'Accept-Language': 'en',
#}

# Enable or disable spider middlewares
# See https://doc.scrapy.org/en/latest/topics/spider-middleware.html
#SPIDER_MIDDLEWARES = {
# 'mySpider.middlewares.MyspiderSpiderMiddleware': 543,
#}

# Enable or disable downloader middlewares
# See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
DOWNLOADER_MIDDLEWARES = {
    # 'mySpider.middlewares.MyspiderDownloaderMiddleware': 543,
    'mySpider.middlewares.RandomUserAgent': 1,
    # 'mySpider.middlewares.RandomProxy': 100,
}
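# Lower numbers sit closer to the engine: process_request() is called in ascending
# order, so RandomUserAgent (priority 1) sets the User-Agent header before Scrapy's
# built-in UserAgentMiddleware (priority 400) applies its default.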

# Enable or disable extensions
# See https://doc.scrapy.org/en/latest/topics/extensions.html
#EXTENSIONS = {
# 'scrapy.extensions.telnet.TelnetConsole': None,
#}

# Configure item pipelines
# See https://doc.scrapy.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
    'mySpider.pipelines.ItcastJsonPipeline': 300,
}
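# Pipeline priority values are integers, customarily in the 0-1000 range; items
# pass through the enabled pipelines in ascending order of these numbers.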

# Enable and configure the AutoThrottle extension (disabled by default)
# See https://doc.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False

# Enable and configure HTTP caching (disabled by default)
# See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = 'httpcache'
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'
USER_AGENTS = [
    "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Win64; x64; Trident/5.0; .NET CLR 3.5.30729; .NET CLR 3.0.30729; .NET CLR 2.0.50727; Media Center PC 6.0)",
    "Mozilla/5.0 (compatible; MSIE 8.0; Windows NT 6.0; Trident/4.0; WOW64; Trident/4.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; .NET CLR 1.0.3705; .NET CLR 1.1.4322)",
    "Mozilla/4.0 (compatible; MSIE 7.0b; Windows NT 5.2; .NET CLR 1.1.4322; .NET CLR 2.0.50727; InfoPath.2; .NET CLR 3.0.04506.30)",
    "Mozilla/5.0 (Windows; U; Windows NT 5.1; zh-CN) AppleWebKit/523.15 (KHTML, like Gecko, Safari/419.3) Arora/0.3 (Change: 287 c9dfb30)",
    "Mozilla/5.0 (X11; U; Linux; en-US) AppleWebKit/527+ (KHTML, like Gecko, Safari/419.3) Arora/0.6",
    "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.2pre) Gecko/20070215 K-Ninja/2.1.1",
    "Mozilla/5.0 (Windows; U; Windows NT 5.1; zh-CN; rv:1.9) Gecko/20080705 Firefox/3.0 Kapiko/3.0",
    "Mozilla/5.0 (X11; Linux i686; U;) Gecko/20070322 Kazehakase/0.4.5",
]

 

middlewares.py

import random

from mySpider.settings import USER_AGENTS


class RandomUserAgent(object):
    def process_request(self, request, spider):
        # pick a random User-Agent from the list defined in settings.py
        useragent = random.choice(USER_AGENTS)
        request.headers.setdefault("User-Agent", useragent)
Nothing else was changed.
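The RandomProxy middleware that is commented out in settings.py is not included in this post. If you wanted one, a minimal sketch could look like the following, assuming a PROXIES list of "http://host:port" strings is added to settings.py (both that list and this class body are illustrative, not part of the original project):

    import random

    from mySpider.settings import PROXIES   # hypothetical list of proxy URLs

    class RandomProxy(object):
        def process_request(self, request, spider):
            # attach a randomly chosen proxy; Scrapy's HttpProxyMiddleware
            # honours request.meta['proxy'] when downloading
            request.meta['proxy'] = random.choice(PROXIES)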