Scrapy Crawler (7): A Data Storage Example

阿新 • Published: 2020-10-27

This chapter walks through an example of saving scraped data to a database.

Data storage

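Scrapy can export data to files in formats such as csv, jl, jsonlines, pickle, marshal, json, and xml. Files are fine for small amounts of data, but keeping very large amounts of data in files becomes unwieldy (images, of course, still have to be saved as files), because the data ultimately has to be queried and reused.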

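We therefore usually store the data in a database, and this chapter uses MySQL, one of the most common choices. The pipelines file in the Scrapy project has also gone unused so far. That file handles the items handed down by the spiders, so the pipeline is where the storage logic belongs.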
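To illustrate the pipeline's role, here is a minimal sketch that runs without Scrapy. The class name CollectingPipeline and the hand-written driver loop are hypothetical; in a real crawl Scrapy itself calls open_spider, process_item, and close_spider on each enabled pipeline.

```python
class CollectingPipeline:
    """Hypothetical pipeline that keeps items in memory instead of a database."""

    def open_spider(self, spider):
        # Called once when the spider starts; a real pipeline would connect here.
        self.items = []
        self.closed = False

    def process_item(self, item, spider):
        # Called for every item a spider yields. Returning the item lets
        # any later pipeline receive it as well.
        self.items.append(dict(item))
        return item

    def close_spider(self, spider):
        # Called once when the spider finishes; a real pipeline would disconnect here.
        self.closed = True


# Drive the pipeline by hand, roughly the way Scrapy would during a crawl.
pipeline = CollectingPipeline()
pipeline.open_spider(spider=None)
for scraped in [{"music_name": "a"}, {"music_name": "b"}]:
    pipeline.process_item(scraped, spider=None)
pipeline.close_spider(spider=None)
print(len(pipeline.items))
```

The database pipeline later in this chapter has the same shape, with process_item writing to MySQL instead of a list.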

Adding the MySQL library (PyMySQL)

In PyCharm, open File –> Default Settings –> Project Interpreter, click the "+" in the lower-left corner, and search for PyMySQL.

Click Install Package. If the installation fails, tick the "Install to user's site…" option above the button to install into the user's directory instead.
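A quick way to confirm that the interpreter can see the package is to look the module up without importing it. The helper name is_installed is hypothetical; it relies only on the standard library's importlib.

```python
import importlib.util


def is_installed(module_name):
    """Return True if the current interpreter can find a top-level module."""
    return importlib.util.find_spec(module_name) is not None


# PyMySQL is imported under the lowercase name "pymysql".
print("pymysql available:", is_installed("pymysql"))
```

If this prints False in the project interpreter, the package was probably installed into a different environment.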

Configuring the MySQL service

1. Install MySQL

root@ubuntu:~# sudo apt-get install mysql-server

root@ubuntu:~# apt install mysql-client

root@ubuntu:~# apt install libmysqlclient-dev

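During installation a dialog asks for the root account's password; enter the same password twice.

2. Check that the installation succeeded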

root@ubuntu:~# sudo netstat -tap | grep mysql

tcp6        0       0       [::]:mysql    [::]:*    LISTEN    7510/mysqld

3. Enable remote access to MySQL

Edit the MySQL configuration file and comment out the line "bind-address = 127.0.0.1":

root@ubuntu:~# vi /etc/mysql/mysql.conf.d/mysqld.cnf

#bind-address = 127.0.0.1

Log in to MySQL as root:

root@ubuntu:~# mysql -u root -p123456

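At the mysql prompt, run grant all on *.* to username@'%' identified by 'password';
or grant all on *.* to username@'%' identified by 'password' with grant option;
Note that MySQL 8.0 and later no longer accept identified by inside grant; there, create the account with create user first and then grant the privileges.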

mysql> grant all on *.* to china@'%' identified by '123456';

Reload the privileges with flush privileges; and then restart MySQL with /etc/init.d/mysql restart:

mysql> flush privileges;

root@ubuntu:~# /etc/init.d/mysql restart

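When connecting remotely, point the client at the server's address and use the user name and password granted above.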
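Before configuring a client, it can help to check that the MySQL port is reachable at all from the remote machine. The sketch below assumes the server listens on MySQL's default port 3306; the helper name port_is_open is hypothetical, and 127.0.0.1 is a placeholder for the server's real address.

```python
import socket


def port_is_open(host, port=3306, timeout=3.0):
    """Return True if a TCP connection to host:port can be established."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


# Replace the placeholder address with the MySQL server's address.
print("MySQL port reachable:", port_is_open("127.0.0.1"))
```

A False result after the steps above usually means bind-address is still restricting the listener or a firewall is blocking the port.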

4. Common problems

1045 access denied for user 'root'@'localhost(ip)' using password yes

1. mysql -u root -p
2. GRANT ALL PRIVILEGES ON *.* TO 'myuser'@'%' IDENTIFIED BY 'mypassword' WITH GRANT OPTION;
3. FLUSH PRIVILEGES;

Create the four item tables in MySQL
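The table definitions themselves are not shown here, so the following is only a sketch of what they might look like. The table and column names are the ones used by the pipeline's SQL later in this chapter; the column types, the id primary key, and the unique index on the URL column are assumptions, and create_table_sql is a hypothetical helper that merely builds the statements as text.

```python
# Table name -> (columns used by the pipeline, URL column used for deduplication).
TABLES = {
    "music_douban": (
        ["music_name", "music_alias", "music_singer", "music_time",
         "music_rating", "music_votes", "music_tags", "music_url"],
        "music_url"),
    "music_review_douban": (
        ["review_title", "review_content", "review_author",
         "review_music", "review_time", "review_url"],
        "review_url"),
    "video_douban": (
        ["video_name", "video_alias", "video_actor", "video_year", "video_time",
         "video_rating", "video_votes", "video_tags", "video_url",
         "video_director", "video_type", "video_bigtype", "video_area",
         "video_language", "video_length", "video_writer", "video_desc",
         "video_episodes"],
        "video_url"),
    "video_review_douban": (
        ["review_title", "review_content", "review_author",
         "review_video", "review_time", "review_url"],
        "review_url"),
}


def create_table_sql(table, columns, url_column):
    """Build a CREATE TABLE statement; all column types are assumed."""
    lines = ["id INT AUTO_INCREMENT PRIMARY KEY"]
    for column in columns:
        # Assumption: long free-text fields get TEXT, everything else VARCHAR(255).
        col_type = "TEXT" if column.endswith(("_content", "_desc")) else "VARCHAR(255)"
        lines.append(f"{column} {col_type}")
    # Assumption: a unique index on the URL backs up the pipeline's deduplication.
    lines.append(f"UNIQUE KEY uniq_{url_column} ({url_column})")
    body = ",\n  ".join(lines)
    return f"CREATE TABLE IF NOT EXISTS {table} (\n  {body}\n) DEFAULT CHARSET=utf8"


for name, (columns, url_column) in TABLES.items():
    print(create_table_sql(name, columns, url_column) + ";\n")
```

The printed statements can be pasted into the mysql prompt after selecting the database named in MYSQL_DBNAME; adjust the types to the data actually scraped.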

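Creating the project

With PyMySQL installed, the storage logic can go into the pipeline. First create the project with scrapy startproject mysql. This example reuses the spiders from the previous chapter's multi-spider example and saves their four item types to the MySQL database.
Then open the newly created mysql project and add the database connection constants to settings.py.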

# -*- coding: utf-8 -*-
BOT_NAME = 'mysql'

SPIDER_MODULES = ['mysql.spiders']
NEWSPIDER_MODULE = 'mysql.spiders'

MYSQL_HOST = 'localhost'
MYSQL_DBNAME = 'spider'
MYSQL_USER = 'root'
MYSQL_PASSWD = '123456'

DOWNLOAD_DELAY = 1

ITEM_PIPELINES = {
    'mysql.pipelines.DoubanPipeline': 301,
}

ITEM_PIPELINES is the setting that registers the pipeline so that it takes effect.

ITEM_PIPELINES = {
    'mysql.pipelines.DoubanPipeline': 301,
}
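The integer assigned to each pipeline is its order: Scrapy runs pipelines with lower values first, and values are conventionally kept between 0 and 1000. As a small sketch of that ordering, the snippet below adds a hypothetical second entry, CleanupPipeline, which does not exist in this project.

```python
# CleanupPipeline is a hypothetical second pipeline, added only to show ordering.
ITEM_PIPELINES = {
    "mysql.pipelines.DoubanPipeline": 301,
    "mysql.pipelines.CleanupPipeline": 100,
}

# Each item passes through the pipelines in ascending order of their values.
run_order = sorted(ITEM_PIPELINES, key=ITEM_PIPELINES.get)
for path in run_order:
    print(ITEM_PIPELINES[path], path)
```

With a single pipeline, as in this chapter, the exact value does not matter.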

Configuring pipelines.py

# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: http://doc.scrapy.org/en/latest/topics/item-pipeline.html
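import logging

import pymysql

from mysql import settings
from mysql.items import MusicItem, MusicReviewItem, VideoItem, VideoReviewItem

# scrapy.log is deprecated and absent from current Scrapy releases, and a
# module cannot be called anyway, so route the log(error) calls below
# through the standard logging module.
log = logging.getLogger(__name__).error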


class DoubanPipeline(object):
    def __init__(self):
        self.connect = pymysql.connect(
            host=settings.MYSQL_HOST,
            db=settings.MYSQL_DBNAME,
            user=settings.MYSQL_USER,
            passwd=settings.MYSQL_PASSWD,
            charset='utf8',
            use_unicode=True)
        self.cursor = self.connect.cursor()

    def process_item(self, item, spider):
        if isinstance(item, MusicItem):
            try:
                self.cursor.execute("""select * from music_douban where music_url = %s""", (item["music_url"],))
                ret = self.cursor.fetchone()
                if ret:
                    self.cursor.execute(
                        """update music_douban set music_name = %s,music_alias = %s,music_singer = %s,
                            music_time = %s,music_rating = %s,music_votes = %s,music_tags = %s,music_url = %s
                            where music_url = %s""",
                        (item['music_name'],
                         item['music_alias'],
                         item['music_singer'],
                         item['music_time'],
                         item['music_rating'],
                         item['music_votes'],
                         item['music_tags'],
                         item['music_url'],
                         item['music_url']))
                else:
                    self.cursor.execute(
                        """insert into music_douban(music_name,music_alias,music_singer,music_time,music_rating,
                          music_votes,music_tags,music_url)
                          value (%s,%s,%s,%s,%s,%s,%s,%s)""",
                        (item['music_name'],
                         item['music_alias'],
                         item['music_singer'],
                         item['music_time'],
                         item['music_rating'],
                         item['music_votes'],
                         item['music_tags'],
                         item['music_url']))
                self.connect.commit()
            except Exception as error:
                log(error)
            return item

        elif isinstance(item, MusicReviewItem):
            try:
                self.cursor.execute("""select * from music_review_douban where review_url = %s""", (item["review_url"],))
                ret = self.cursor.fetchone()
                if ret:
                    self.cursor.execute(
                        """update music_review_douban set review_title = %s,review_content = %s,review_author = %s,
                            review_music = %s,review_time = %s,review_url = %s
                            where review_url = %s""",
                        (item['review_title'],
                         item['review_content'],
                         item['review_author'],
                         item['review_music'],
                         item['review_time'],
                         item['review_url'],
                         item['review_url']))
                else:
                    self.cursor.execute(
                        """insert into music_review_douban(review_title,review_content,review_author,review_music,review_time,
                          review_url)
                          value (%s,%s,%s,%s,%s,%s)""",
                        (item['review_title'],
                         item['review_content'],
                         item['review_author'],
                         item['review_music'],
                         item['review_time'],
                         item['review_url']))
                self.connect.commit()
            except Exception as error:
                log(error)
            return item

        elif isinstance(item, VideoItem):
            try:
                self.cursor.execute("""select * from video_douban where video_url = %s""", (item["video_url"],))
                ret = self.cursor.fetchone()
                if ret:
                    self.cursor.execute(
                        """update video_douban set video_name= %s,video_alias= %s,video_actor= %s,video_year= %s,
                          video_time= %s,video_rating= %s,video_votes= %s,video_tags= %s,video_url= %s,
                          video_director= %s,video_type= %s,video_bigtype= %s,video_area= %s,video_language= %s,
                          video_length= %s,video_writer= %s,video_desc= %s,video_episodes= %s where video_url = %s""",
                        (item['video_name'],
                         item['video_alias'],
                         item['video_actor'],
                         item['video_year'],
                         item['video_time'],
                         item['video_rating'],
                         item['video_votes'],
                         item['video_tags'],
                         item['video_url'],
                         item['video_director'],
                         item['video_type'],
                         item['video_bigtype'],
                         item['video_area'],
                         item['video_language'],
                         item['video_length'],
                         item['video_writer'],
                         item['video_desc'],
                         item['video_episodes'],
                         item['video_url']))
                else:
                    self.cursor.execute(
                        """insert into video_douban(video_name,video_alias,video_actor,video_year,video_time,
                          video_rating,video_votes,video_tags,video_url,video_director,video_type,video_bigtype,
                          video_area,video_language,video_length,video_writer,video_desc,video_episodes)
                          value (%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s)""",
                        (item['video_name'],
                         item['video_alias'],
                         item['video_actor'],
                         item['video_year'],
                         item['video_time'],
                         item['video_rating'],
                         item['video_votes'],
                         item['video_tags'],
                         item['video_url'],
                         item['video_director'],
                         item['video_type'],
                         item['video_bigtype'],
                         item['video_area'],
                         item['video_language'],
                         item['video_length'],
                         item['video_writer'],
                         item['video_desc'],
                         item['video_episodes']))
                self.connect.commit()
            except Exception as error:
                log(error)
            return item

        elif isinstance(item, VideoReviewItem):
            try:
                self.cursor.execute("""select * from video_review_douban where review_url = %s""", (item["review_url"],))
                ret = self.cursor.fetchone()
                if ret:
                    self.cursor.execute(
                        """update video_review_douban set review_title = %s,review_content = %s,review_author = %s,
                            review_video = %s,review_time = %s,review_url = %s
                            where review_url = %s""",
                        (item['review_title'],
                         item['review_content'],
                         item['review_author'],
                         item['review_video'],
                         item['review_time'],
                         item['review_url'],
                         item['review_url']))
                else:
                    self.cursor.execute(
                        """insert into video_review_douban(review_title,review_content,review_author,review_video,review_time,
                          review_url)
                          value (%s,%s,%s,%s,%s,%s)""",
                        (item['review_title'],
                         item['review_content'],
                         item['review_author'],
                         item['review_video'],
                         item['review_time'],
                         item['review_url']))
                self.connect.commit()
            except Exception as error:
                log(error)
            return item
        else:
            # Pass unrecognized item types through unchanged instead of returning None.
            return item

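The pipeline above already deduplicates at the database level: it looks each item up by its URL and updates the existing row instead of inserting a duplicate.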
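The select-then-update-or-insert pattern behind this deduplication can be tried without a MySQL server. The sketch below swaps in Python's built-in sqlite3 module as a stand-in, so it uses ? placeholders where PyMySQL uses %s, and it keeps only three of the music_douban columns. The function name save_music, the sample URL, and the sample values are made up for illustration.

```python
import sqlite3


def save_music(conn, item):
    """Insert the item, or update the existing row that has the same URL."""
    cur = conn.cursor()
    cur.execute("SELECT 1 FROM music_douban WHERE music_url = ?", (item["music_url"],))
    if cur.fetchone():
        # The URL is already stored: refresh the other columns in place.
        cur.execute(
            "UPDATE music_douban SET music_name = ?, music_rating = ? WHERE music_url = ?",
            (item["music_name"], item["music_rating"], item["music_url"]))
    else:
        # First time this URL is seen: add a new row.
        cur.execute(
            "INSERT INTO music_douban (music_name, music_rating, music_url) VALUES (?, ?, ?)",
            (item["music_name"], item["music_rating"], item["music_url"]))
    conn.commit()


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE music_douban (music_name TEXT, music_rating TEXT, music_url TEXT)")

# Saving the same URL twice leaves one row holding the newer rating.
url = "https://example.com/music/1"
save_music(conn, {"music_name": "Song", "music_rating": "8.0", "music_url": url})
save_music(conn, {"music_name": "Song", "music_rating": "8.5", "music_url": url})
rows = conn.execute("SELECT music_name, music_rating FROM music_douban").fetchall()
print(rows)
```

Because the lookup and the write are separate statements, two processes saving the same URL at once could still both insert; a unique index on the URL column would guard against that.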

Running the crawler

Run run.py in PyCharm; afterwards the MySQL tables contain the data we wanted.

GitHub address
