Scrapy: specifying pipelines for multiple spiders
阿新 • Published: 2019-02-09
Recently, in a crawler project (http://blog.csdn.net/mr_blued?t=1), I wrote two spiders, both of which need to save their data to a MySQL database through pipelines, so I wrote two classes in pipelines.py:
one MoviePipeline() and one BookPipeline().
import pymysql

class MoviePipeline(object):
    def __init__(self):
        # Connect to the database
        self.conn = pymysql.connect(host='127.0.0.1', port=3306, user='root',
                                    passwd='1likePython', db='TESTDB', charset='utf8')
        # Create a cursor object
        self.cursor = self.conn.cursor()
        self.cursor.execute('truncate table Movie')
        self.conn.commit()

    def process_item(self, item, spider):
        try:
            self.cursor.execute("insert into Movie (name,movieInfo,star,number,quote) \
                                VALUES (%s,%s,%s,%s,%s)",
                                (item['movie_name'], item['movie_message'], item['movie_star'],
                                 item['number'], item['movie_quote']))
            self.conn.commit()
        except pymysql.Error:
            print("Error%s,%s,%s,%s,%s" % (item['movie_name'], item['movie_message'],
                                           item['movie_star'], item['number'], item['movie_quote']))
        return item

class BookPipeline(object):
    def __init__(self):
        # Connect to the database
        self.conn = pymysql.connect(host='127.0.0.1', port=3306, user='root',
                                    passwd='1likePython', db='TESTDB', charset='utf8')
        # Create a cursor object
        self.cursor = self.conn.cursor()
        self.cursor.execute('truncate table Book')
        self.conn.commit()

    def process_item(self, item, spider):
        try:
            self.cursor.execute("insert into Book (book_name,author,book_type,book_state,book_update,book_time,new_href,book_intro) \
                                VALUES (%s,%s,%s,%s,%s,%s,%s,%s)",
                                (item['book_name'], item['author'], item['book_type'],
                                 item['book_state'], item['book_update'], item['book_time'],
                                 item['new_href'], item['book_intro']))
            self.conn.commit()
        except pymysql.Error:
            print("Error%s,%s,%s,%s,%s,%s,%s,%s" % (item['book_name'], item['author'],
                                                    item['book_type'], item['book_state'],
                                                    item['book_update'], item['book_time'],
                                                    item['new_href'], item['book_intro']))
        return item
I then registered both classes in settings.py:
# Configure item pipelines
# See https://doc.scrapy.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
'Mycrawl.pipelines.MoviePipeline': 100,
'Mycrawl.pipelines.BookPipeline': 300,
}
Then I ran the spiders and found that the book spider's items were also passed through MoviePipeline() — Scrapy sends every item through every pipeline enabled in ITEM_PIPELINES, regardless of which spider produced it — which obviously raises an error, since a book item lacks the movie fields. To avoid the error, this line in settings.py has to be commented out:
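The failure can be seen without touching Scrapy or MySQL. A minimal sketch (plain Python, with a hypothetical book item dict) of why MoviePipeline breaks when it receives a book item — the field access raises KeyError before the SQL ever runs, and the `except pymysql.Error` clause does not catch it:

```python
# A hypothetical item produced by the book spider: it has only book fields.
book_item = {'book_name': 'Demo', 'author': 'anon'}

def movie_field_access(item):
    # Mirrors the field access in MoviePipeline.process_item (no DB involved)
    return (item['movie_name'], item['movie_message'])

try:
    movie_field_access(book_item)
    failed = False
except KeyError:
    # Raised because the book item has no 'movie_name' key
    failed = True

print(failed)  # → True
```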
'Mycrawl.pipelines.MoviePipeline': 100,
However, when I wanted to run the movie spider again, I had to uncomment that line and comment out the other one instead, which quickly becomes tedious. So I modified pipelines.py to branch on the name of the currently running spider:
import pymysql

class MycrawlPipeline(object):
    def __init__(self):
        # Connect to the database
        self.conn = pymysql.connect(host='127.0.0.1', port=3306, user='root',
                                    passwd='1likePython', db='TESTDB', charset='utf8')
        # Create a cursor object
        self.cursor = self.conn.cursor()
        self.cursor.execute('truncate table Movie')
        self.cursor.execute('truncate table Book')
        self.conn.commit()

    def process_item(self, item, spider):
        # If the running spider is named 'movie'
        if spider.name == 'movie':
            try:
                self.cursor.execute("insert into Movie (name,movieInfo,star,number,quote) \
                                    VALUES (%s,%s,%s,%s,%s)",
                                    (item['movie_name'], item['movie_message'], item['movie_star'],
                                     item['number'], item['movie_quote']))
                self.conn.commit()
            except pymysql.Error:
                print("Error%s,%s,%s,%s,%s" % (item['movie_name'], item['movie_message'],
                                               item['movie_star'], item['number'], item['movie_quote']))
            return item
        # If the running spider is named 'book'
        elif spider.name == 'book':
            try:
                self.cursor.execute("insert into Book (book_name,author,book_type,book_state,book_update,book_time,new_href,book_intro) \
                                    VALUES (%s,%s,%s,%s,%s,%s,%s,%s)",
                                    (item['book_name'], item['author'], item['book_type'],
                                     item['book_state'], item['book_update'], item['book_time'],
                                     item['new_href'], item['book_intro']))
                self.conn.commit()
            except pymysql.Error:
                print("Error%s,%s,%s,%s,%s,%s,%s,%s" % (item['book_name'], item['author'],
                                                        item['book_type'], item['book_state'],
                                                        item['book_update'], item['book_time'],
                                                        item['new_href'], item['book_intro']))
            return item
This way, a single pipelines class is enough to pair each spider with its own handling logic.
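The dispatch on spider.name can be verified without a database. A minimal sketch, using a stand-in spider object and in-memory lists in place of the MySQL tables (FakeSpider and DispatchPipeline are hypothetical names, not part of Scrapy):

```python
# Stand-in for a Scrapy spider: only the .name attribute matters for dispatch.
class FakeSpider:
    def __init__(self, name):
        self.name = name

class DispatchPipeline:
    """Routes items to per-spider storage, mirroring MycrawlPipeline's branching."""
    def __init__(self):
        self.movie_rows = []  # stands in for the Movie table
        self.book_rows = []   # stands in for the Book table

    def process_item(self, item, spider):
        # Branch on the name of the spider that yielded the item
        if spider.name == 'movie':
            self.movie_rows.append(item)
        elif spider.name == 'book':
            self.book_rows.append(item)
        return item

pipeline = DispatchPipeline()
pipeline.process_item({'movie_name': 'A'}, FakeSpider('movie'))
pipeline.process_item({'book_name': 'B'}, FakeSpider('book'))
print(len(pipeline.movie_rows), len(pipeline.book_rows))  # → 1 1
```

Each item ends up only in its own spider's store, which is exactly the behaviour the combined pipeline relies on.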