
Specifying pipelines for multiple spiders in Scrapy

During this time I wrote two spiders for a crawler project (http://blog.csdn.net/mr_blued?t=1). Both need to store their data in a MySQL database through pipelines, so I wrote two classes in pipelines.py.

One is MoviePipeline(), the other is BookPipeline():

import pymysql


class MoviePipeline(object):
    def __init__(self):
        # Connect to the database
        self.conn = pymysql.connect(host='127.0.0.1', port=3306, user='root', passwd='1likePython',
                                    db='TESTDB', charset='utf8')
        # Create a cursor object
        self.cursor = self.conn.cursor()
        self.cursor.execute('truncate table Movie')
        self.conn.commit()

    def process_item(self, item, spider):
        try:
            self.cursor.execute("insert into Movie (name,movieInfo,star,number,quote) \
            VALUES (%s,%s,%s,%s,%s)", (item['movie_name'],item['movie_message'],item['movie_star'],
                                       item['number'], item['movie_quote']))
            self.conn.commit()
        except pymysql.Error:
            print("Error%s,%s,%s,%s,%s" % (item['movie_name'],item['movie_message'],item['movie_star'],
                                       item['number'], item['movie_quote']))
        return item

class BookPipeline(object):
    def __init__(self):
        # Connect to the database
        self.conn = pymysql.connect(host='127.0.0.1', port=3306, user='root', passwd='1likePython',
                                    db='TESTDB', charset='utf8')
        # Create a cursor object
        self.cursor = self.conn.cursor()
        self.cursor.execute('truncate table Book')
        self.conn.commit()

    def process_item(self, item, spider):
        try:
            self.cursor.execute("insert into Book (book_name,author,book_type,book_state,book_update,book_time,new_href,book_intro) \
            VALUES (%s,%s,%s,%s,%s,%s,%s,%s)", (item['book_name'], item['author'], item['book_type'],
                                                   item['book_state'], item['book_update'], item['book_time'],
                                                   item['new_href'], item['book_intro']))
            self.conn.commit()
        except pymysql.Error:
            print("Error%s,%s,%s,%s,%s,%s,%s,%s" % (item['book_name'], item['author'], item['book_type'],
                                                   item['book_state'], item['book_update'], item['book_time'],
                                                   item['new_href'], item['book_intro']))
        return item

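Both pipelines read item fields such as movie_name and book_name. For context, here is a rough sketch of what the matching Item classes in items.py could look like; the class names and field declarations below are my own assumptions inferred from the pipeline code above, not something taken from the original project.

import scrapy


class MovieItem(scrapy.Item):
    # Fields read by MoviePipeline (names inferred from the pipeline code)
    movie_name = scrapy.Field()
    movie_message = scrapy.Field()
    movie_star = scrapy.Field()
    number = scrapy.Field()
    movie_quote = scrapy.Field()


class BookItem(scrapy.Item):
    # Fields read by BookPipeline
    book_name = scrapy.Field()
    author = scrapy.Field()
    book_type = scrapy.Field()
    book_state = scrapy.Field()
    book_update = scrapy.Field()
    book_time = scrapy.Field()
    new_href = scrapy.Field()
    book_intro = scrapy.Field()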

Then I registered both classes in settings.py:

# Configure item pipelines
# See https://doc.scrapy.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
    'Mycrawl.pipelines.MoviePipeline': 100,
    'Mycrawl.pipelines.BookPipeline': 300,
}

Then I ran the spiders and found that the book spider's items were also being passed to MoviePipeline() (every pipeline enabled in ITEM_PIPELINES receives the items of every spider), which obviously raises errors. To avoid them, I had to comment out this line in settings.py:

 'Mycrawl.pipelines.MoviePipeline': 100,
But as soon as I wanted to run the movie spider again, I had to remove that comment and comment out the other line instead, which quickly becomes a hassle.

So I modified the code in pipelines.py so that it checks the name of the spider that is currently running. The modified version looks like this:

class MycrawlPipeline(object):
    def __init__(self):
        # Connect to the database
        self.conn = pymysql.connect(host='127.0.0.1', port=3306, user='root', passwd='1likePython',
                                    db='TESTDB', charset='utf8')
        # Create a cursor object
        self.cursor = self.conn.cursor()
        self.cursor.execute('truncate table Movie')
        self.cursor.execute('truncate table Book')
        self.conn.commit()

    def process_item(self, item, spider):
        # If the spider's name is movie
        if spider.name == 'movie':
            try:
                self.cursor.execute("insert into Movie (name,movieInfo,star,number,quote) \
                VALUES (%s,%s,%s,%s,%s)", (item['movie_name'],item['movie_message'],item['movie_star'],
                                           item['number'], item['movie_quote']))
                self.conn.commit()
            except pymysql.Error:
                print("Error%s,%s,%s,%s,%s" % (item['movie_name'],item['movie_message'],item['movie_star'],
                                           item['number'], item['movie_quote']))
            return item
        # If the spider's name is book
        elif spider.name == 'book':
            try:
                self.cursor.execute("insert into Book (book_name,author,book_type,book_state,book_update,book_time,new_href,book_intro) \
                        VALUES (%s,%s,%s,%s,%s,%s,%s,%s)", (item['book_name'], item['author'], item['book_type'],
                                                            item['book_state'], item['book_update'], item['book_time'],
                                                            item['new_href'], item['book_intro']))
                self.conn.commit()
            except pymysql.Error:
                print("Error%s,%s,%s,%s,%s,%s,%s,%s" % (item['book_name'], item['author'], item['book_type'],
                                                        item['book_state'], item['book_update'], item['book_time'],
                                                        item['new_href'], item['book_intro']))
            return item
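
One detail the class above leaves out is releasing the database connection when the crawl finishes. As an optional addition (my own suggestion, not part of the original code), Scrapy's close_spider hook could be appended to the same class:

    def close_spider(self, spider):
        # Called once when the spider closes; free the MySQL cursor and connection
        self.cursor.close()
        self.conn.close()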

This way, a single pipelines class is enough to match each spider with its own processing logic.
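
As a side note, depending on the Scrapy version in use, each spider can also override ITEM_PIPELINES just for itself through the custom_settings class attribute, which would let the two original pipeline classes stay as they were. A rough sketch, where the spider class names are my assumptions:

import scrapy


class MovieSpider(scrapy.Spider):
    name = 'movie'
    # Only MoviePipeline handles this spider's items
    custom_settings = {
        'ITEM_PIPELINES': {'Mycrawl.pipelines.MoviePipeline': 100},
    }


class BookSpider(scrapy.Spider):
    name = 'book'
    # Only BookPipeline handles this spider's items
    custom_settings = {
        'ITEM_PIPELINES': {'Mycrawl.pipelines.BookPipeline': 300},
    }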