Python Web Scraping with the Scrapy Framework

阿新 • Published: 2021-06-17

Scrapy commands

Common Scrapy commands

1. Create a project:

scrapy startproject <project name>

2. Create a spider:

cd <project name>
scrapy genspider <spider name> <allowed domain>

3. Run a spider:

scrapy crawl <spider name>

Common settings.py options

USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/89.0.4389.128 Safari/537.36'  # spoof the User-Agent
ROBOTSTXT_OBEY = False  # do not obey robots.txt
LOG_LEVEL = "WARNING"  # minimum log level to print

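Scrapy concepts

Scrapy is an application framework written for crawling websites and extracting structured data.

Scrapy workflow

Flow:

1. The start URLs in the spider are built into request objects --> spider middleware --> engine --> scheduler
2. The scheduler hands over a request --> engine --> downloader middleware --> downloader
3. The downloader sends the request and gets the response --> downloader middleware --> engine --> spider middleware --> spider
4. The spider extracts URLs and builds new request objects --> spider middleware --> engine --> scheduler, then step 2 repeats
5. The spider extracts data --> engine --> pipeline, which processes and stores the data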
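The five steps above can be traced with a toy loop. This is only a sketch in plain Python, not Scrapy's real engine: FAKE_WEB, downloader, spider_parse, and run_engine are made-up names, and the middlewares are left out to keep it short.

```python
from collections import deque

# Toy stand-ins for Scrapy's components; none of this is Scrapy's real code.
FAKE_WEB = {
    "page1": {"data": "teacher A", "next": "page2"},
    "page2": {"data": "teacher B", "next": None},
}


def downloader(url):
    """Pretend to send the request and return a response."""
    return {"url": url, **FAKE_WEB[url]}


def spider_parse(response):
    """Yield extracted data and, if there is one, a follow-up request."""
    yield ("item", response["data"])
    if response["next"]:
        yield ("request", response["next"])


def run_engine(start_urls):
    scheduler = deque(start_urls)  # step 1: start requests reach the scheduler
    stored = []                    # stands in for the pipeline's storage
    while scheduler:
        request = scheduler.popleft()      # step 2: scheduler -> engine -> downloader
        response = downloader(request)     # step 3: response travels back to the spider
        for kind, value in spider_parse(response):
            if kind == "request":
                scheduler.append(value)    # step 4: new requests are queued again
            else:
                stored.append(value)       # step 5: data is handed to the pipeline
    return stored


results = run_engine(["page1"])
```

The loop ends when the scheduler runs out of requests, which is also roughly when a real crawl finishes.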

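Notes:

The Chinese labels in the figure were added afterwards to make it easier to follow
The green lines in the figure show how data is passed
The position of each middleware in the figure determines what it does
Note the position of the engine: all modules are independent of one another and interact only with the engine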

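What each Scrapy module does

The role of each module in Scrapy:

Engine: passes data and signals between the different modules
Scheduler: implements a queue that holds the request objects sent by the engine
Downloader: sends the requests passed on by the engine, gets the responses, and hands them back to the engine
Spider: processes the responses sent by the engine, extracts data and URLs, and hands them to the engine
Pipeline: processes the data passed on by the engine, for example by storing it
Downloader middleware: a customizable download extension, for example for setting a proxy IP
Spider middleware: can customize requests and filter responses; its role overlaps with that of the downloader middleware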

Structure of a Scrapy project

Three built-in objects

request: the request object

response: the response object

item: the data object

Five components

spider: the spider module

pipeline

scheduler

downloader

engine

Two middlewares

process_request(self, request, spider)

process_response(self, request, response, spider)
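These two signatures are the hook methods of a downloader middleware. Below is a minimal sketch of one that sets a proxy. ProxyMiddleware and the proxy address are made up for illustration, and the fake request and response only stand in for Scrapy's objects so the hooks can be called without running a crawl.

```python
from types import SimpleNamespace


class ProxyMiddleware:
    """Hypothetical downloader middleware that routes requests through a proxy."""

    PROXY_URL = "http://127.0.0.1:8888"  # placeholder address, an assumption

    def process_request(self, request, spider):
        # Scrapy's built-in HttpProxyMiddleware reads the proxy from request.meta["proxy"].
        request.meta["proxy"] = self.PROXY_URL
        return None  # None tells Scrapy to carry on processing this request

    def process_response(self, request, response, spider):
        # Must return a response (or a new request); here it passes through unchanged.
        return response


# Stand-in objects so the hooks can be exercised without running a crawl.
fake_request = SimpleNamespace(meta={})
fake_response = SimpleNamespace(status=200)
middleware = ProxyMiddleware()
returned = middleware.process_request(fake_request, spider=None)
passed_through = middleware.process_response(fake_request, fake_response, spider=None)
```

In a real project a class like this would live in middlewares.py and be switched on through the DOWNLOADER_MIDDLEWARES setting.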

Scrapy project development flow

Create a project

scrapy startproject <project name>

Example: scrapy startproject mySpider

Create a spider

cd <project name>

scrapy genspider <spider name> <allowed domain>

Example:

cd mySpider

scrapy genspider itcast itcast.cn

Data modeling

Middleware

Spider file (itcast.py)

import scrapy

class ItcastSpider(scrapy.Spider):  # inherits from scrapy.Spider
    # spider name
    name = 'itcast'
    # domains the spider is allowed to crawl
    allowed_domains = ['itcast.cn']
    # URLs to start crawling from
    start_urls = ['http://www.itcast.cn/channel/teacher.shtml']

    # data extraction method; receives the response passed back through the downloader middleware
    def parse(self, response):
        # Scrapy response objects support XPath directly
        names = response.xpath('//div[@class="tea_con"]//li/div/h3/text()')
        print(names)

        # to get the actual text, group by <li> first
        li_list = response.xpath('//div[@class="tea_con"]//li')
        for li in li_list:
            # build a dict for each teacher
            item = {}
            # locate elements with Scrapy's XPath selectors, then call extract() or extract_first() to get the result
            item['name'] = li.xpath('.//h3/text()').extract_first()  # teacher's name
            item['level'] = li.xpath('.//h4/text()').extract_first()  # teacher's level
            item['text'] = li.xpath('.//p/text()').extract_first()  # teacher's introduction
            print(item)

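Notes:

The parts you need to modify are allowed_domains, start_urls, and parse()

Locating elements and extracting data or attribute values:

response.xpath() returns a list-like object containing selector objects; it behaves like a list but has some extra methods
Extra method extract(): returns a list of strings
Extra method extract_first(): returns the first string in the list, or None if the list is empty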

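Common attributes of the response object

response.url: URL of the current response
response.request.url: URL of the request that produced the current response
response.headers: response headers
response.request.headers: headers of the request that produced the current response
response.body: response body, i.e. the HTML source, as bytes
response.status: HTTP status code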

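Saving data

Enable the pipeline in settings.py

ITEM_PIPELINES = {
    'myspider.pipelines.ItcastPipeline': 400
}

Each key in this setting is the pipeline class to use, written as a dotted path: the first part is the project directory, the second is the file, and the third is the pipeline class defined in it.

Each value is the pipeline's order: the smaller the number, the earlier the pipeline runs. The value is usually kept below 1000.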
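To make the key and value rules concrete, here is a small plain-Python sketch that splits each dotted path and sorts by the number. It only mimics the ordering rule; it is not how Scrapy loads pipelines internally, and CleanPipeline and MongoPipeline are hypothetical class names.

```python
# Example setting; CleanPipeline and MongoPipeline are hypothetical names.
ITEM_PIPELINES = {
    "myspider.pipelines.MongoPipeline": 500,
    "myspider.pipelines.ItcastPipeline": 400,
    "myspider.pipelines.CleanPipeline": 100,
}


def execution_order(pipelines):
    """Return (priority, module path, class name) tuples, lowest priority first."""
    ordered = sorted(pipelines.items(), key=lambda pair: pair[1])
    result = []
    for dotted_path, priority in ordered:
        # rsplit separates 'project.file' from the class name at the last dot
        module_path, class_name = dotted_path.rsplit(".", 1)
        result.append((priority, module_path, class_name))
    return result


order = execution_order(ITEM_PIPELINES)
class_names = [class_name for _, _, class_name in order]
```

The order in which the keys are written in the dict does not matter; only the numbers do.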

Running Scrapy

Run in the project directory:

scrapy crawl <spider name>

Example: scrapy crawl itcast

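Using Scrapy

user-agent

Change or add in settings.py:

USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/89.0.4389.128 Safari/537.36'  # spoof the User-Agent

cookie

Fixed cookie: suitable for sites whose cookies are long-lived (common on less rigorous sites) and where the amount of data is small enough to fetch everything before the cookie expires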

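Method 1: override Scrapy's start_requests method and return requests carrying the cookies argument to the engine

In the spider file:

def start_requests(self):  # override start_requests
    # cookies_str is copied from a captured browser request
    cookies_str = '...'
    # convert cookies_str into cookies_dict; partition splits on the first '=' only, since values may contain '='
    cookies_dict = {i.partition('=')[0]: i.partition('=')[2] for i in cookies_str.split('; ')}
    yield scrapy.Request(  # return the request with cookies to the engine
        self.start_urls[0],
        callback=self.parse,
        cookies=cookies_dict
    )

Note:

In Scrapy, cookies should not be placed in headers; requests have a dedicated cookies parameter that accepts cookies as a dict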
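Turning the copied cookie string into the dict that the cookies parameter expects is plain string handling, so it can be tested on its own. A minimal sketch follows; parse_cookie_string and the sample cookie values are made up.

```python
def parse_cookie_string(cookies_str):
    """Turn 'a=1; b=2' into {'a': '1', 'b': '2'}."""
    cookies = {}
    for pair in cookies_str.split(";"):
        pair = pair.strip()
        if not pair or "=" not in pair:
            continue  # skip empty or malformed fragments
        name, value = pair.split("=", 1)  # split once: values may contain '='
        cookies[name.strip()] = value.strip()
    return cookies


# Invented sample; a real string would be copied from the browser's request headers.
sample = "sessionid=abc123; token=dG9rZW4=; theme=dark"
cookies_dict = parse_cookie_string(sample)
```

Splitting only on the first '=' matters for values such as base64 strings, which often end in '='.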

Method 2: send a POST request with scrapy.FormRequest(); suitable for sites whose cookies change frequently

import scrapy

class Login2Spider(scrapy.Spider):
    name = 'login'
    allowed_domains = ['']
    start_urls = ['']

    def parse(self, response):
        authenticity_token = response.xpath("//input[@name='authenticity_token']/@value").extract_first()
        utf8 = response.xpath("//input[@name='utf8']/@value").extract_first()
        commit = response.xpath("//input[@name='commit']/@value").extract_first()

        # build the POST request and hand it to the engine
        yield scrapy.FormRequest(
            "https://github.com/session",
            formdata={
                "authenticity_token": authenticity_token,  # extracted above and sent with the form
                "utf8": utf8,
                "commit": commit,
                "login": "username",
                "password": "***"
            },
            callback=self.parse_login
        )

    def parse_login(self, response):
        print(response.body)

Note:

Setting COOKIES_DEBUG = True in settings.py shows in the terminal how cookies are passed along

ip

Pagination requests

Data modeling (items)

Define the fields to extract in items.py:

import scrapy

class MyspiderItem(scrapy.Item):
    name = scrapy.Field()   # lecturer's name
    title = scrapy.Field()  # lecturer's title
    desc = scrapy.Field()   # lecturer's introduction

Import and instantiate it in the spider file; after that it is used the same way as a dict

itcast.py:

from myspider.items import MyspiderItem   # import the Item; mind the path
...
    def parse(self, response):

        item = MyspiderItem()  # ready to use once instantiated

        item['name'] = node.xpath('./h3/text()').extract_first()
        item['title'] = node.xpath('./h4/text()').extract_first()
        item['desc'] = node.xpath('./p/text()').extract_first()

        print(item)

In the line from myspider.items import MyspiderItem, make sure the import path to the item is correct, and ignore the error PyCharm flags

Rule of thumb for Python import paths: import relative to where execution starts

Saving and cleaning data (pipelines)

Pipelines can clean and store data, and several pipelines can be defined to do different jobs
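As an example of the cleaning side, here is a sketch of a pipeline that tidies up whitespace in string fields. CleanTextPipeline is a made-up name, and the sample item is a plain dict so the class can be called directly without Scrapy.

```python
class CleanTextPipeline:
    """Hypothetical pipeline that collapses whitespace in string fields."""

    def process_item(self, item, spider):
        for key, value in list(item.items()):
            if isinstance(value, str):
                # split() with no arguments drops leading, trailing and repeated whitespace
                item[key] = " ".join(value.split())
        return item  # pass the cleaned item on to the next pipeline


raw = {"name": "  Alice \n", "desc": "Teaches\n   Python ", "age": 30}
cleaned = CleanTextPipeline().process_item(raw, spider=None)
```

A pipeline like this would normally get a smaller number in ITEM_PIPELINES than the storage pipelines, so that data is cleaned before it is saved.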

Saving data

One spider

Multiple spiders

import json

from itemadapter import ItemAdapter
from pymongo import MongoClient

class ItcastspiderPipeline:
    def open_spider(self, spider):
        if spider.name == 'itcast':
            self.file = open('./itcast.json', 'w', encoding='utf-8')

    def process_item(self, item, spider):
        if spider.name == 'itcast':
            # convert the item object to a dict
            item = dict(item)
            json_data = json.dumps(item, ensure_ascii=False) + ',\n'
            self.file.write(json_data)
        return item

    def close_spider(self, spider):
        if spider.name == 'itcast':
            self.file.close()

class ItcspiderPipeline:
    def open_spider(self, spider):
        if spider.name == 'itc':
            self.file = open('./itc.json', 'w', encoding='utf-8')

    def process_item(self, item, spider):
        if spider.name == 'itc':
            # convert the item object to a dict
            item = dict(item)
            json_data = json.dumps(item, ensure_ascii=False) + ',\n'
            self.file.write(json_data)
        return item

    def close_spider(self, spider):
        if spider.name == 'itc':
            self.file.close()

class itMongoPipeline(object):
    def open_spider(self, spider):
        if spider.name == 'itcast':
            con = MongoClient()
            self.collection = con.itcast.teachers

    def process_item(self, item, spider):
        if spider.name == 'itcast':
            # dict() is harmless if an earlier pipeline already converted the item to a dict
            # insert_one() replaces Collection.insert(), which was removed in PyMongo 4
            self.collection.insert_one(dict(item))
        return item

Enabling the pipelines:

Enable the pipelines in settings.py

......
ITEM_PIPELINES = {
   'itcastspider.pipelines.ItcastspiderPipeline': 300,  # the number is the weight; the smaller the weight, the earlier the pipeline runs
   'itcastspider.pipelines.ItcspiderPipeline': 301,
   'itcastspider.pipelines.itMongoPipeline': 400,
}
......

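Notes

Pipelines must be enabled in settings before they are used.
In the setting, each key gives the pipeline's location (so where a pipeline lives in the project is up to you), and each value gives its distance from the engine; the closer it is, the earlier the data passes through it, so pipelines with smaller weights run first
Different pipelines can handle data from different spiders, told apart by the spider.name attribute
Different pipelines can apply different processing to the data of one or more spiders, for example one cleaning the data and another storing it
A single pipeline class can also handle data from different spiders, again told apart by spider.name
When there are several pipelines, process_item must return item; otherwise the next pipeline receives None
Every pipeline must define process_item; otherwise it cannot receive or process items
process_item(self, item, spider): processes the item data; it receives the item and the spider, where spider is the spider that sent the item
If an earlier pipeline has already converted the item to a dict, it does not need converting again; whether another pipeline ran first depends on the pipeline settings, where a smaller number means it runs earlier.
open_spider(self, spider): runs once when the spider is opened
close_spider(self, spider): runs once when the spider is closed
These two methods are often used for working with a database: open the connection when the spider opens and close it when the spider closes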
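The "must return item" rule is easiest to see with a toy runner. The sketch below is plain Python and only imitates the calling order described in these notes; run_pipelines and the three pipeline classes are made up, and Scrapy's real machinery is more involved.

```python
class UppercasePipeline:
    def open_spider(self, spider):
        self.seen = 0               # runs once when the spider opens

    def process_item(self, item, spider):
        self.seen += 1
        item["name"] = item["name"].upper()
        return item                 # hand the item to the next pipeline


class ForgetfulPipeline:
    def process_item(self, item, spider):
        item["checked"] = True      # no return statement, so None is passed on


class CollectPipeline:
    def open_spider(self, spider):
        self.received = []

    def process_item(self, item, spider):
        self.received.append(item)
        return item


def run_pipelines(configured, items, spider=None):
    """Toy runner: call pipelines in ascending weight order."""
    ordered = [p for p, _ in sorted(configured, key=lambda pair: pair[1])]
    for p in ordered:
        if hasattr(p, "open_spider"):
            p.open_spider(spider)
    for item in items:
        for p in ordered:
            item = p.process_item(item, spider)
    for p in ordered:
        if hasattr(p, "close_spider"):
            p.close_spider(spider)


good_sink = CollectPipeline()
run_pipelines([(good_sink, 400), (UppercasePipeline(), 300)], [{"name": "alice"}])

bad_sink = CollectPipeline()
run_pipelines([(bad_sink, 400), (ForgetfulPipeline(), 300)], [{"name": "bob"}])
```

In the first run the collector sees the uppercased item; in the second it only sees None, because the earlier pipeline forgot to return the item.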

Saving data to MongoDB

itcast.py

......
    def parse(self, response):
        ...
        yield item  # the spider must yield the item to the engine so the pipelines can receive it
......

pipelines.py

from pymongo import MongoClient

class MongoPipeline(object):
    def open_spider(self, spider):
        # these are MongoDB's default host and port, so both can be omitted on the local machine
        con = MongoClient(host='127.0.0.1', port=27017)
        self.collection = con.itcast.teachers

    def process_item(self, item, spider):
        # dict() is harmless if an earlier pipeline already converted the item to a dict
        # insert_one() replaces Collection.insert(), which was removed in PyMongo 4
        self.collection.insert_one(dict(item))
        return item

Enable the pipeline in settings.py

......
ITEM_PIPELINES = {
    'itcastspider.pipelines.MongoPipeline': 500,  # the smaller the weight, the earlier it runs; itcastspider is the name of the current project
}
......
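The notes earlier recommend closing the database connection when the spider closes. Here is a sketch of a MongoDB pipeline that does so. The client_factory argument and the two Fake classes are assumptions added only so the example can run without a MongoDB server; Scrapy itself would create the pipeline with no arguments, in which case the real MongoClient is used.

```python
from types import SimpleNamespace


class MongoPipeline:
    def __init__(self, client_factory=None):
        # client_factory is a hypothetical hook for testing; it is not a Scrapy feature.
        self._client_factory = client_factory

    def open_spider(self, spider):
        if self._client_factory is None:
            from pymongo import MongoClient  # imported lazily; needs pymongo installed
            self._client_factory = MongoClient
        self.client = self._client_factory(host="127.0.0.1", port=27017)
        self.collection = self.client.itcast.teachers

    def process_item(self, item, spider):
        self.collection.insert_one(dict(item))  # insert_one expects a plain dict
        return item

    def close_spider(self, spider):
        self.client.close()  # release the connection when the spider finishes


class FakeCollection:
    """Stand-in for a MongoDB collection; just remembers inserted documents."""

    def __init__(self):
        self.docs = []

    def insert_one(self, doc):
        self.docs.append(doc)


class FakeClient:
    """Stand-in for MongoClient exposing client.itcast.teachers and close()."""

    def __init__(self, host, port):
        self.closed = False
        self.itcast = SimpleNamespace(teachers=FakeCollection())

    def close(self):
        self.closed = True


pipeline = MongoPipeline(client_factory=FakeClient)
pipeline.open_spider(spider=None)
pipeline.process_item({"name": "Alice"}, spider=None)
pipeline.close_spider(spider=None)
```

Keeping the client on self is what makes it possible to close it later in close_spider.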

Starting MongoDB

MongoDB --> bin --> double-click mongod.exe

Checking that the data was saved to MongoDB

Saving data to MySQL

Cleaning data

Scrapy practice projects

robots and UA exercise

cookie exercise

Logging in to gitee with a cookie

1. Create the gitee project

scrapy startproject giteeproject

cd giteeproject
scrapy genspider giteespider gitee.com

2. Modify the gitee project

giteespider.py

import scrapy


class GiteeSpider(scrapy.Spider):
    name = 'giteespider'  # must match the name passed to scrapy crawl
    # allowed_domains = ['gitee.com']
    start_urls = ['https://gitee.com/profile/account_information']

    # override start_requests
    def start_requests(self):
        url = self.start_urls[0]
        temp = 'cookie string copied from a logged-in gitee session'
        # split the cookie string into key/value pairs; partition splits on the first '=' only
        cookies = {data.partition('=')[0]: data.partition('=')[2] for data in temp.split('; ')}
        # return the request with cookies to the engine
        yield scrapy.Request(
            url=url,
            callback=self.parse,  # parse is the default callback, so callback could be omitted
            cookies=cookies
        )

    def parse(self, response):
        title = response.xpath('//div[@class="user-info"]/a/text()').extract_first()
        print(title)

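settings.py

Uncomment ROBOTSTXT_OBEY, USER_AGENT, and LOG_LEVEL and change them:

ROBOTSTXT_OBEY = False  # do not obey robots.txt
USER_AGENT = 'Mozilla/5.0'  # spoof the User-Agent
LOG_LEVEL = "WARNING"  # minimum log level to print

No other files need to be changed

3. Run the gitee project

scrapy crawl giteespider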

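Logging in to github with a POST request

Test site: the github login page

Analysis

Open the github login page, press F12 to open the developer tools, tick Network --> Preserve log, then click the Sign in button

You can see that a POST request is sent to https://github.com/session carrying the username, password, and other parameters

Work out which parameters change: only authenticity_token, timestamp, and timestamp_secret change between logins; the rest stay the same

Find the parameter values: looking in the page first, all three values can be found in the source of the login page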

Create the github spider project

scrapy startproject githubProject

cd githubProject

scrapy genspider githubSpider github.com

Fill in the code

In githubSpider.py:

import scrapy


class GithubspiderSpider(scrapy.Spider):
    name = 'githubSpider'
    allowed_domains = ['github.com']
    start_urls = ['https://github.com/login']

    def parse(self, response):
        # extract the values the POST request needs from the login page source
        authenticity_token = response.xpath('//input[@name="authenticity_token"]/@value').extract_first()
        timestamp = response.xpath('//input[@name="timestamp"]/@value').extract_first()
        timestamp_secret = response.xpath('//input[@name="timestamp_secret"]/@value').extract_first()
        # print(f'{authenticity_token}\n{timestamp}\n{timestamp_secret}')
        yield scrapy.FormRequest(  # send the request with FormRequest
            'https://github.com/session',
            formdata={
                'commit': 'Sign in',
                'authenticity_token': authenticity_token,
                'login': 'your github account',
                'password': 'your github password',
                'webauthn-support': 'supported',
                'webauthn-iuvpaa-support': 'supported',
                'timestamp': timestamp,
                'timestamp_secret': timestamp_secret,
            },
            callback=self.parse_login,
        )

    def parse_login(self, response):
        # response.text is the decoded body; str(response.body) would search the bytes repr instead
        if 'email' in response.text:
            print('yes')
        else:
            print('error')

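Change or add the corresponding variables in settings.py:

USER_AGENT = 'Mozilla/5.0'  # spoof the User-Agent
ROBOTSTXT_OBEY = False  # do not obey robots.txt
LOG_LEVEL = "WARNING"  # minimum log level to print

Run the github spider project

scrapy crawl githubSpider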

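Logging in to gitee with a POST request (unfinished)

Press Ctrl+Shift+N to open an incognito browser window, go to the gitee login page, press F12 to bring up the developer tools, and tick Network --> Preserve log

Enter your username and password and click the login button, then watch the Network tab: the POST request to https://gitee.com/login carries the username and password and is followed by a 302 redirect

Log out and log in again the same way; the authenticity_token and encrypt_data[user[password]] values in the login request change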

ip exercise

items exercise

pipeline exercise

Saving itcast teacher information to MongoDB

Target site

Source code:

itcast.py

import scrapy
from itcastspider.items import ItcastspiderItem

class ItcastSpider(scrapy.Spider):
    name = 'itcast'
    # allowed_domains = ['itcast.cn']
    start_urls = ['http://www.itcast.cn/channel/teacher.shtml#ajavaee']

    def parse(self, response):
        teachers = response.xpath('//div[@class="maincon"]/ul/li')
        for node in teachers:
            # temp={}
            item = ItcastspiderItem()
            item['name'] = node.xpath('.//div[@class="main_bot"]//text()').extract()
            item['desc'] = node.xpath('.//div[@class="main_mask"]//text()').extract()
            yield item

items.py

import scrapy

class ItcastspiderItem(scrapy.Item):
    # define the fields for your item here like:
    name = scrapy.Field()
    title = scrapy.Field()
    desc = scrapy.Field()

pipelines.py

from itemadapter import ItemAdapter
from pymongo import MongoClient

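class MongoPipeline(object):
    def open_spider(self, spider):
        con = MongoClient()  # host and port can be omitted on the local machine
        self.collection = con.itcast.teachers

    def process_item(self, item, spider):
        # convert the item object to a dict
        item = dict(item)
        # insert_one() replaces Collection.insert(), which was removed in PyMongo 4
        self.collection.insert_one(item)
        return item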

settings.py

ROBOTSTXT_OBEY = False
LOG_LEVEL = "WARNING"

ITEM_PIPELINES = {
   'itcastspider.pipelines.MongoPipeline': 200,
}
