Scrapy in detail: crawling Baidu Tieba data and saving it to a file and a database

阿新 • Published: 2019-01-09

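Scrapy is an application framework written for crawling websites and extracting structured data. It can be used in a wide range of programs, including data mining, information processing, and archiving historical data. Using a framework for crawling saves a lot of effort: we do not have to download pages ourselves or write the data-processing plumbing. We only need to focus on the extraction rules. Scrapy is one of the more popular crawling frameworks in Python, so today we will use it to crawl the Baidu Tieba forum 黑中介 ("shady agents"). Don't ask why I picked that forum; I have some personal experience with the subject. Ahem, back to the point: let's talk about how to crawl the data (corrupt officials are fiercer than tigers!).

Note: you need to install Python and the Scrapy framework yourself first.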

1. Create the project

scrapy startproject <your project name>

scrapy startproject baidutieba

This command creates a baidutieba directory with the following contents:

baidutieba/
    scrapy.cfg
    baidutieba/
        __init__.py
        items.py
        pipelines.py
        settings.py
        spiders/
            __init__.py
            ...

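scrapy.cfg: the project's configuration file
baidutieba/: the project's Python module. You will add your code here.
baidutieba/items.py: the project's item file.
baidutieba/pipelines.py: the project's pipelines file.
baidutieba/settings.py: the project's settings file.
baidutieba/spiders/: the directory that holds spider code.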

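2. Create the spider file

To write a crawler, we first create a Spider.
In the baidutieba/spiders/ directory we create a file named MySpider.py. The file contains a MySpider class, which must inherit from scrapy.Spider. It must also define the following three attributes:
1. name: identifies the Spider. The name must be unique; you cannot give different Spiders the same name.
2. start_urls: the list of URLs the Spider starts crawling from. The first pages fetched therefore come from this list; later URLs are extracted from the data retrieved from these initial URLs.
3. parse(): a method of the spider. When it is called, the Response object produced by downloading each initial URL is passed to it as its only argument. The method is responsible for parsing the response data, extracting data (producing items), and producing Request objects for URLs that need further processing.

Once created, MySpider.py looks like this: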

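# import the framework
import scrapy

class MySpider(scrapy.Spider):
    # identifies the Spider
    name = "MySpider"
    # domains the spider is allowed to visit
    allowed_domains = []
    # URLs to start crawling from
    start_urls = []
    # callback that parses each downloaded page
    def parse(self, response):
        pass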

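3. Define the Item

The main goal of crawling is to extract structured data from unstructured sources such as web pages. Scrapy provides the Item class for this purpose.
An Item object is a simple container that holds the scraped data. It offers a dictionary-like API and a simple syntax for declaring the available fields.

First, let's decide which data elements to scrape.

The project directory already contains an items file (items.py). We can either edit that file or create a new one to define our item.
Here we create a new item file, Tbitems.py, at the same level: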

# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# http://doc.scrapy.org/en/latest/topics/items.html

import scrapy

class Tbitem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    # fields scraped for each thread
    user_info = scrapy.Field()
    title = scrapy.Field()
    url = scrapy.Field()
    short_content = scrapy.Field()
    imgs = scrapy.Field()

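As shown above, we created the Tbitem container to hold the scraped information: user_info is the poster's information, title is the thread title, url is the address of the thread detail page, short_content is the thread's short summary, and imgs holds the thread's images.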

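Common operations:

# create an item
info = Tbitem()
# set a value
info['title'] = "語文"
# read a value
info['title']
info.get('title')
# get all keys
info.keys()
# get all values
info.values()
# get all (key, value) pairs
info.items()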

4. Flesh out the main spider program (part 1):

# coding=utf-8
import scrapy
from baidutieba.Tbitems import Tbitem


class MySpider(scrapy.Spider):
    name = "MySpider"
    allowed_domains = ['tieba.baidu.com']
    start_urls = ['https://tieba.baidu.com/f?ie=utf-8&kw=%E9%BB%91%E4%B8%AD%E4%BB%8B&fr=search']

    def parse(self, response):
        # each thread on the list page is an <li> whose class contains j_thread_list
        boxs = response.xpath("//li[contains(@class,'j_thread_list')]")
        for box in boxs:
            # create a fresh item per thread so yielded items do not share state
            item = Tbitem()
            # extract_first() returns None instead of raising when nothing matches
            item['user_info'] = box.xpath('./@data-field').extract_first()
            item['title'] = box.xpath(".//div[contains(@class,'threadlist_title')]/a/text()").extract_first()
            item['url'] = box.xpath(".//div[contains(@class,'threadlist_title')]/a/@href").extract_first()
            item['short_content'] = box.xpath(".//div[contains(@class,'threadlist_abs')]/text()").extract_first()
            # image URLs found in the thread preview (an empty list if there are none)
            item['imgs'] = box.xpath('.//img/@src').extract()
            yield item
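The kw parameter in start_urls is simply the forum name, percent-encoded as UTF-8. As a small sketch (the helper name tieba_list_url is made up for illustration and is not part of Scrapy), the same URL can be built with the standard library, which makes it easy to point the spider at a different forum:

```python
from urllib.parse import quote, unquote


def tieba_list_url(forum_name):
    """Build a thread-list URL for a forum (hypothetical helper)."""
    # quote() percent-encodes the UTF-8 bytes of the forum name
    return 'https://tieba.baidu.com/f?ie=utf-8&kw=' + quote(forum_name) + '&fr=search'


url = tieba_list_url('黑中介')
decoded = unquote('%E9%BB%91%E4%B8%AD%E4%BB%8B')
print(url)
print(decoded)
```

Decoding the kw value the other way round with unquote() is a quick check of which forum a given URL points at.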

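Note: the spider uses XPath to pull information out of the page. XPath itself is not covered here; see any XPath tutorial online.

The expressions above were debugged with the Chrome extension XPath-Helper (XPath-Helper_v2.0.2). Chrome's developer tools can also copy an element's XPath directly.

Note that the crawling itself happens in the parse() method of the MySpider class.
parse() is responsible for processing the response and returning the extracted data.

This method, like every other Request callback, must return an iterable of Request and/or Item objects (for details on yield item, see the article 徹底理解Python中的yield, "Thoroughly understanding yield in Python").

(Scrapy lets you locate information in several ways. XPath is used here, but you can also use libraries such as BeautifulSoup or lxml. The framework also provides its own mechanism for this, called Selectors. Since this article is only an introduction, it is not explained further.)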
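The spider stores the raw data-field attribute in user_info as a string. On Tieba list pages this attribute has typically been a JSON string, so it can be decoded into a dict afterwards. The sketch below assumes that; the sample value and the key names (author_name, reply_num) are made up for illustration, so check the real attribute in your browser before relying on them:

```python
import json

# Made-up sample of a data-field attribute value
raw = '{"id": 1, "author_name": "someone", "reply_num": 12}'


def parse_user_info(raw_value):
    """Return the decoded JSON value, or {} if it is missing or not valid JSON."""
    if not raw_value:
        return {}
    try:
        return json.loads(raw_value)
    except ValueError:  # json.JSONDecodeError is a subclass of ValueError
        return {}


info = parse_user_info(raw)
print(info.get('author_name'), info.get('reply_num'))
```

Returning an empty dict on bad input keeps a single malformed thread from stopping the whole crawl.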

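cd into the project folder, then run on the command line:

scrapy crawl <your spider name>

scrapy crawl MySpider

As you can see, the run succeeded and we got data. However, when you run it you will notice that only one page was crawled. How do we crawl all of the paginated data?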

    def parse(self, response):
        boxs = response.xpath("//li[contains(@class,'j_thread_list')]")
        for box in boxs:
            # create a fresh item per thread so yielded items do not share state
            item = Tbitem()
            item['user_info'] = box.xpath('./@data-field').extract_first()
            item['title'] = box.xpath(".//div[contains(@class,'threadlist_title')]/a/text()").extract_first()
            item['url'] = box.xpath(".//div[contains(@class,'threadlist_title')]/a/@href").extract_first()
            item['short_content'] = box.xpath(".//div[contains(@class,'threadlist_abs')]/text()").extract_first()
            item['imgs'] = box.xpath('.//img/@src').extract()
            yield item

        # URL follow-up: start
        # get the URL of the next page; a[10] is the position of the next-page link
        # in the pager, so this expression may need adjusting if the page layout changes
        url = response.xpath('//*[@id="frs_list_pager"]/a[10]/@href').extract_first()
        if url:
            # the href is protocol-relative (it starts with //), so resolve it
            # against the current page URL before requesting it
            yield scrapy.Request(response.urljoin(url), callback=self.parse)
        # URL follow-up: end
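Note that the URL follow-up is at the same level as the for loop. In other words, after the for loop finishes (that is, after this page's data has been scraped), the spider moves on to the next page: it takes the address of the next-page button and yields it as a Request, which is how the paginated data gets crawled.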
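The next-page href on the list page is protocol-relative (it starts with //), so it needs a scheme before it can be requested. As a small standard-library sketch of that resolution (both sample hrefs are made up), urljoin resolves such links against the current page URL; Scrapy's response.urljoin() does the same relative to response.url:

```python
from urllib.parse import urljoin

page_url = 'https://tieba.baidu.com/f?ie=utf-8&kw=%E9%BB%91%E4%B8%AD%E4%BB%8B&fr=search'

# Made-up protocol-relative link, as found in the pager
next_href = '//tieba.baidu.com/f?kw=%E9%BB%91%E4%B8%AD%E4%BB%8B&ie=utf-8&pn=50'
next_url = urljoin(page_url, next_href)

# A made-up site-relative link resolves the same way
other_url = urljoin(page_url, '/f?pn=100')

print(next_url)
print(other_url)
```

Resolving against the page URL, rather than concatenating a fixed scheme, keeps working if the site switches between relative and absolute links.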

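5. Save the scraped data

After an Item has been collected in the Spider, it is passed to the Item Pipeline, where several components process it in a defined order.
Each item pipeline component (sometimes just called an "Item Pipeline") is a Python class that implements a simple method. It receives an Item and performs some action on it, and it also decides whether the Item continues through the pipeline or is dropped and processed no further.
Typical uses of item pipelines:
(1) Cleaning HTML data
(2) Validating scraped data (checking that items contain certain fields)
(3) Checking for duplicates (and dropping them)
(4) Storing the scraped results in a file or database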
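As an illustration of use (3), here is a minimal sketch of a pipeline that drops threads whose url has already been seen. The class name DuplicateUrlPipeline is made up for this example, and the try/except around the import is only there so the sketch also runs where Scrapy is not installed; in a real project you would import DropItem directly:

```python
try:
    from scrapy.exceptions import DropItem
except ImportError:
    # Stand-in so the sketch runs without Scrapy installed
    class DropItem(Exception):
        pass


class DuplicateUrlPipeline(object):
    """Drop items whose url was already processed (hypothetical example class)."""

    def __init__(self):
        self.seen_urls = set()

    def process_item(self, item, spider):
        url = item.get('url')
        if url in self.seen_urls:
            # raising DropItem stops the item from reaching later pipelines
            raise DropItem('duplicate thread: %s' % url)
        self.seen_urls.add(url)
        return item


# Exercise the pipeline by hand with plain dicts standing in for items
pipeline = DuplicateUrlPipeline()
first = pipeline.process_item({'url': '/p/1', 'title': 'a'}, spider=None)
try:
    pipeline.process_item({'url': '/p/1', 'title': 'a again'}, spider=None)
    dropped = False
except DropItem:
    dropped = True
print(first, dropped)
```

Returning the item passes it on to the next pipeline, while raising DropItem discards it.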

1. Save the data to a file

First, in the project directory, at the same level as pipelines.py, we create our BaidutiebaPipeline.py file:

# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: http://doc.scrapy.org/en/latest/topics/item-pipeline.html
# Python 3: no default-encoding workaround is needed
import codecs
import json


class JsonWithEncodingPipeline(object):
    '''Pipeline that saves items to a file.
       1. Register it in settings.py.
       2. Items yielded by the spider are passed to it automatically.'''
    def __init__(self):
        # output file, one JSON object per line
        self.file = codecs.open('info.json', 'w', encoding='utf-8')

    def process_item(self, item, spider):
        # ensure_ascii=False keeps Chinese text readable instead of \uXXXX escapes
        line = json.dumps(dict(item), ensure_ascii=False) + "\n"
        self.file.write(line)
        return item

    def close_spider(self, spider):
        # Scrapy calls close_spider when the spider finishes; close the file here
        self.file.close()
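One detail worth knowing when writing Chinese text as JSON: by default json.dumps escapes non-ASCII characters as \uXXXX sequences. The file is valid JSON either way, but passing ensure_ascii=False keeps it human-readable. A quick sketch with a made-up item:

```python
import json

item = {'title': '語文', 'url': '/p/123'}  # made-up sample item

escaped = json.dumps(item)                       # default: ASCII-only output
readable = json.dumps(item, ensure_ascii=False)  # keeps the original characters

print(escaped)
print(readable)

# both forms decode back to the same data
assert json.loads(escaped) == json.loads(readable) == item
```

Because both forms decode to the same data, the choice only affects how the file looks when opened in an editor.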

That completes the Item Pipeline that saves our data to a file. To use it, we first have to register the pipeline:

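settings.py is in the same directory. Open it, find ITEM_PIPELINES, and register our pipeline.

Format: <project package>.<pipeline file name>.<pipeline class name>

The integer after it sets the execution priority. Values are customarily in the 0-1000 range, and pipelines with lower values run first.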

# Configure item pipelines
# See http://scrapy.readthedocs.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
	'baidutieba.BaidutiebaPipeline.JsonWithEncodingPipeline': 300,
}
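To see what the priority number means in practice, the sketch below sorts a pipeline mapping by its value, which mirrors the order in which items pass through the pipelines (lowest value first). The second entry, SomeCleaningPipeline, is a hypothetical pipeline added only so there is something to order:

```python
ITEM_PIPELINES = {
    'baidutieba.BaidutiebaPipeline.JsonWithEncodingPipeline': 300,
    'baidutieba.BaidutiebaPipeline.SomeCleaningPipeline': 100,  # hypothetical
}

# lower value = earlier in the chain
order = [path for path, priority in sorted(ITEM_PIPELINES.items(), key=lambda kv: kv[1])]
for path in order:
    print(path)
```

A cleaning or validation step would normally get a lower number than a storage step, so that only cleaned items are written out.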

Now we run the spider again:

scrapy crawl MySpider

2. Save the data to a database

Likewise, add our database pipeline to settings.py and put the database configuration there, as follows:

# Configure item pipelines
# See http://scrapy.readthedocs.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
	'baidutieba.BaidutiebaPipeline.JsonWithEncodingPipeline': 300,
	'baidutieba.BaidutiebaPipeline.WebcrawlerScrapyPipeline': 300,
}


# MySQL connection settings
MYSQL_HOST = '127.0.0.1'
MYSQL_DBNAME = 'test'         # database name, change as needed
MYSQL_USER = 'homestead'      # database user, change as needed
MYSQL_PASSWD = 'secret'       # database password, change as needed
MYSQL_PORT = 3306             # database port

Modify the BaidutiebaPipeline.py file:

# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: http://doc.scrapy.org/en/latest/topics/item-pipeline.html
# Python 3: no default-encoding workaround is needed
import codecs
import json

import MySQLdb
import MySQLdb.cursors
from twisted.enterprise import adbapi


class JsonWithEncodingPipeline(object):
    '''Pipeline that saves items to a file.
       1. Register it in settings.py.
       2. Items yielded by the spider are passed to it automatically.'''
    def __init__(self):
        # output file, one JSON object per line
        self.file = codecs.open('info.json', 'w', encoding='utf-8')

    def process_item(self, item, spider):
        # ensure_ascii=False keeps Chinese text readable instead of \uXXXX escapes
        line = json.dumps(dict(item), ensure_ascii=False) + "\n"
        self.file.write(line)
        return item

    def close_spider(self, spider):
        # Scrapy calls close_spider when the spider finishes; close the file here
        self.file.close()


class WebcrawlerScrapyPipeline(object):
    '''Pipeline that saves items to the database.
       1. Register it in settings.py.
       2. Items yielded by the spider are passed to it automatically.'''

    def __init__(self, dbpool):
        self.dbpool = dbpool
        ''' The commented-out code below hard-codes the connection pool parameters;
            reading them from the settings file is more flexible.
            self.dbpool=adbapi.ConnectionPool('MySQLdb',
                                          host='127.0.0.1',
                                          db='crawlpicturesdb',
                                          user='root',
                                          passwd='123456',
                                          cursorclass=MySQLdb.cursors.DictCursor,
                                          charset='utf8',
                                          use_unicode=False)'''

    @classmethod
    def from_settings(cls, settings):
        '''1. @classmethod declares a class method, as opposed to the usual instance methods.
           2. The first parameter of a class method is cls (the class itself); the first
              parameter of an instance method is self (an instance of the class).
           3. It can be called on the class, as in C.f(), much like a static method in Java.'''
        dbparams = dict(
            host=settings['MYSQL_HOST'],  # read the configuration from settings
            port=settings['MYSQL_PORT'],
            db=settings['MYSQL_DBNAME'],
            user=settings['MYSQL_USER'],
            passwd=settings['MYSQL_PASSWD'],
            charset='utf8',  # set the charset, otherwise Chinese text may come out garbled
            cursorclass=MySQLdb.cursors.DictCursor,
            use_unicode=True,
        )
        # ** expands the dict into keyword arguments, i.e. host=xxx, db=yyy, ...
        dbpool = adbapi.ConnectionPool('MySQLdb', **dbparams)
        return cls(dbpool)  # hand dbpool to the class so it is available on self

    # called by Scrapy for every item
    def process_item(self, item, spider):
        # run the insert in a pooled connection
        query = self.dbpool.runInteraction(self._conditional_insert, item)
        # attach the error handler
        query.addErrback(self._handle_error, item, spider)
        return item

    # write the item to the database
    def _conditional_insert(self, tx, item):
        sql = "insert into test(name,url) values(%s,%s)"
        params = (item["title"], item["url"])
        tx.execute(sql, params)

    # error handler
    def _handle_error(self, failure, item, spider):
        print('--------------database operation exception!!-----------------')
        print('-------------------------------------------------------------')
        print(failure)
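The insert statement above expects a table named test with name and url columns, which the article does not show. As a rough, self-contained sketch of the same parameterized insert, the snippet below uses the standard library's sqlite3 with an in-memory database as a stand-in for MySQL, so it runs without a server. The column types and the helper name insert_item are assumptions, and note that sqlite3 uses ? placeholders where MySQLdb uses %s:

```python
import sqlite3

conn = sqlite3.connect(':memory:')
# Assumed schema: the article only shows the column names
conn.execute('create table test (name text, url text)')


def insert_item(connection, item):
    """Insert one scraped item with a parameterized query (hypothetical helper)."""
    # values are passed separately from the SQL text, never formatted into it
    connection.execute('insert into test(name, url) values(?, ?)',
                       (item['title'], item['url']))
    connection.commit()


insert_item(conn, {'title': '語文', 'url': '/p/123'})
rows = conn.execute('select name, url from test').fetchall()
print(rows)
```

Passing the values as parameters, as the pipeline does with tx.execute(sql, params), lets the driver handle quoting of thread titles that contain quotes or other special characters.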

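This is an introductory example. If you have questions, leave me a comment or send me a private message. Also, because Baidu Tieba gets upgraded from time to time, the extraction rules may need adjusting, but the overall structure will stay the same; you will need to adapt the program yourself.

Program completed: 2017.7.18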

