Python 3 Crawlers: A Scrapy Quick Start

阿新 • Published: 2019-02-18

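Getting to know Scrapy

Scrapy is an application framework written for crawling websites and extracting structured data. It can be used in a range of programs, including data mining, information processing, and archiving historical data. It was originally designed for page scraping (more precisely, web scraping), but it can also fetch data returned by APIs (such as Amazon Associates Web Services) or serve as a general-purpose web crawler.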

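Setting up the Scrapy environment

1. Install Python 3.6 (the version used in this article) and make sure it is on your PATH. Anaconda3 is recommended (https://www.anaconda.com/download/).

2. Install Scrapy (1.5.1 at the time of writing) with pip: pip install Scrapy

3. Install any other packages you may need, such as pymssql for database storage (for SQL Server): pip install pymssql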
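
A quick way to confirm that the packages from the steps above are importable is to ask Python's import machinery directly. This is only a minimal sketch; the helper name missing_packages is made up for illustration and is not part of Scrapy.

```python
import importlib.util


def missing_packages(names):
    """Return the package names that cannot be found in the current environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# An empty list means every package listed is installed.
print(missing_packages(["scrapy", "pymssql"]))
```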

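Creating a Scrapy project

Open a command-line window and create a new Scrapy project:

scrapy startproject SpiderProjectName

SpiderProjectName is the project name. The command creates a new directory called SpiderProjectName under the current directory, so change to the location you want first.

Once the project skeleton has been created, enter the directory and run tree /f (a Windows command) to view the file hierarchy.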

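The main files are:
scrapy.cfg: the project's deploy configuration file
SpiderProjectName/: the project's Python module; your code is imported from here
SpiderProjectName/items.py: the project's items file
SpiderProjectName/pipelines.py: the project's pipelines file
SpiderProjectName/settings.py: the project's settings file
SpiderProjectName/spiders: the directory where spiders are placed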
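
On systems without tree /f, the same listing can be produced with a few lines of Python. The sketch below uses only the standard library; format_tree is a hypothetical helper name, and the directories-first ordering is just one reasonable choice.

```python
from pathlib import Path


def format_tree(root, indent=""):
    """Return the directory listing under root as indented lines, directories first."""
    lines = []
    entries = sorted(Path(root).iterdir(), key=lambda p: (p.is_file(), p.name))
    for entry in entries:
        suffix = "/" if entry.is_dir() else ""
        lines.append(f"{indent}{entry.name}{suffix}")
        if entry.is_dir():
            # Recurse with one more level of indentation.
            lines.extend(format_tree(entry, indent + "    "))
    return lines
```

Calling print("\n".join(format_tree("SpiderProjectName"))) from the directory where the project was created prints the layout described above.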

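Defining an item

Edit items.py. Items are the containers that will hold the scraped data. They work like Python dictionaries but offer extra protection: assigning to an undeclared field raises an error, which guards against typos. In items.py, Scrapy expects us to define a container for the data the spider scrapes. It is declared by creating a class that subclasses scrapy.Item and defining its attributes as scrapy.Field objects, much like an object-relational mapping (ORM). We model the items we need in order to control the data pulled from the site. For a news site, for example, we might want the title, body, publication time, image URL, and page URL, so we would define fields for those five attributes. Scrapy already provides the base item, so our own item only has to inherit from scrapy.Item.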

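# Example
import scrapy


class ClawlerItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    # Topic ID (read by the pipeline as item['id'])
    id = scrapy.Field()
    # Topic title
    title = scrapy.Field()
    # Topic category
    category = scrapy.Field()
    # Language
    lang = scrapy.Field()
    # Returned content
    content = scrapy.Field()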

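The spider itself

In the spiders folder, create a new file SpiderName.py. Scrapy already defines the base spider for us, so we only need to inherit from scrapy.Spider and override the parse callback. This involves using XPath to locate page elements. XPath is the XML path language: it uses path expressions to select nodes or node sets in an XML document, and nodes are selected by following a path or steps. XPath works on HTML documents as well, because Scrapy parses them into the same kind of node tree. Interested readers can consult the XPath documentation.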

Storing the data

Edit pipelines.py, which is used to save the data held in the items to a database, a local text file, and so on.

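# Database storage. The connection parameters are defined in settings.py, which keeps them easy to change.
from datetime import datetime

import pymssql

# Scrapy's settings object only loads upper-case names, so the mixed-case
# SQLServer_* values are imported from the settings module directly.
from SpiderProjectName.settings import (SQLServer_HOST, SQLServer_USER, SQLServer_PASSWD,
                                        SQLServer_DBNAME, SQLServer_PORT)
# Assumes util.py sits inside the project package.
from SpiderProjectName.util import get_logger


class ClawlerPipeline(object):
    # Open the database connection
    def __init__(self):
        self.connect = pymssql.connect(host=SQLServer_HOST, user=SQLServer_USER, password=SQLServer_PASSWD,
                                       database=SQLServer_DBNAME, port=SQLServer_PORT)
        self.cursor = self.connect.cursor()

    def process_item(self, item, spider):
        try:
            sql = "insert into dbo.YB_ALL(ztid,title,category,lang,content,gxrq,demo) values (%s,%s,%s,%s,%s,%s,%s)"
            # gxrq is filled with the time the row is written
            self.cursor.execute(sql, (item['id'], item['title'], item['category'], item['lang'],
                                      str(item['content']), datetime.now(), None))
            self.connect.commit()
        except Exception as error:
            get_logger('error.log').error(str(error))
        # Hand the item on to any later pipeline
        return item

    def close_spider(self, spider):
        # Scrapy calls this once when the spider finishes, so the connection stays open for every item
        self.connect.close()

    # Local file storage alternative
    # def process_item(self, item, spider):
    #     with open("test1.txt", 'a', encoding='utf-8') as fw:
    #         # json.dump(str(item['content']), fw, ensure_ascii=False)
    #         fw.write(str(item['content']) + '\n')
    #     return item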
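
The commented-out local storage variant above can also be written as a pipeline of its own. The sketch below is one possible shape, not the author's code: the class name, the default file name items.jl, and the one-JSON-object-per-line format are assumptions. It relies only on the open_spider, process_item, and close_spider hooks that Scrapy calls on pipelines.

```python
import json


class JsonLinesPipeline:
    """Append each item to a text file as one JSON object per line."""

    def __init__(self, path="items.jl"):  # hypothetical default file name
        self.path = path
        self.file = None

    def open_spider(self, spider):
        # Called once when the spider starts: open the file a single time.
        self.file = open(self.path, "a", encoding="utf-8")

    def process_item(self, item, spider):
        # ensure_ascii=False keeps non-ASCII text readable in the file.
        self.file.write(json.dumps(dict(item), ensure_ascii=False) + "\n")
        return item

    def close_spider(self, spider):
        # Called once when the spider finishes.
        self.file.close()
```

Keeping the file open between open_spider and close_spider avoids reopening it for every item, which the commented-out version does.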

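# The logging helper from the util.py utility module, shared here as well

import logging
import os


def get_logger(filename):
    logger = logging.getLogger('logger')
    logger.setLevel(logging.DEBUG)
    logging.basicConfig(format='%(message)s', level=logging.DEBUG)
    root = logging.getLogger()
    # Attach the file handler only once per file; otherwise every call would duplicate each log line
    if not any(getattr(h, 'baseFilename', None) == os.path.abspath(filename) for h in root.handlers):
        handler = logging.FileHandler(filename)
        handler.setLevel(logging.DEBUG)
        handler.setFormatter(logging.Formatter('%(asctime)s:%(levelname)s: %(message)s'))
        root.addHandler(handler)
    return logger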

Enabling the pipeline

Edit ITEM_PIPELINES in settings.py and add the following code:

# ClawlerPipeline is the class you defined in pipelines.py
# Format: ProjectName.pipelines.PipelineClassName; adjust to your own project
ITEM_PIPELINES = {
    'SpiderProjectName.pipelines.ClawlerPipeline': 300,
}

# Encoding
FEED_EXPORT_ENCODING = 'utf-8'

# Database settings
SQLServer_HOST = '127.0.0.1'
SQLServer_DBNAME = 'DATA_SAMPLE'
SQLServer_USER = 'sa'
SQLServer_PASSWD = '12345'
SQLServer_PORT = 1433

Database properties, encoding, and so on can all be set in settings.py.

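Running the spider

From the project root directory, run the following command. SpiderName stands for the value of the name attribute defined in your spider class, which is not necessarily the project name:

scrapy crawl SpiderName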
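
If the crawl needs to be started from another Python script, for example a scheduled job, one option is to build the same command and hand it to subprocess. This is a sketch under assumptions: the helper names are made up, and "demo" stands for whatever name your spider declares.

```python
import subprocess
import sys


def crawl_command(spider_name, output_file=None):
    """Build the argument list for `scrapy crawl`, run through the current interpreter."""
    command = [sys.executable, "-m", "scrapy", "crawl", spider_name]
    if output_file:
        command += ["-o", output_file]  # -o appends the scraped items to a feed file
    return command


def run_crawl(spider_name, output_file=None):
    # Must be called from the project root, where scrapy.cfg lives.
    return subprocess.run(crawl_command(spider_name, output_file), check=True)
```

For example, run_crawl("demo", "items.json") would start the spider named demo and export its items to items.json.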
