Learning Scrapy from Scratch (Python 3 Version), Part 1
阿新 • Published: 2019-01-29
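Environment:
Windows 10; Python 3.6.2; Scrapy 1.4.0
Python 2 and Python 3 are both installed on the system and coexist side by side.
Related post: Notes on running Python 2 and 3 side by side and the usage issues involved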
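- Create the project
The example site used in the official Scrapy tutorial, dmoz.org, now returns 403, so this walkthrough practices on the 美劇天堂 site (meijutt.com) instead.
My project directory is D:\workspaces\python\scrapy
Open the cmd command-line tool and run: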
cd /d D:\workspaces\python\scrapy
python3 -m scrapy startproject tutorial
cd tutorial
python3 -m scrapy genspider meijutt meijutt.com
- Write the spider script. At this point the following file has already been created automatically under the project path:
D:\workspaces\python\scrapy\tutorial\tutorial\spiders\meijutt.py
import scrapy

from tutorial.items import MeijuttItem


class MeijuttSpider(scrapy.Spider):
    name = 'meijutt'
    allowed_domains = ['meijutt.com']
    start_urls = ['http://www.meijutt.com/new100.html']

    def parse(self, response):
        items = []
        # Each <li> inside the "top-list" <ul> describes one show.
        for sel in response.xpath('//ul[@class="top-list fn-clear"]/li'):
            item = MeijuttItem()
            item['storyName'] = sel.xpath('./h5/a/text()').extract()
            # The state is sometimes wrapped in <font>; otherwise read the span text.
            item['storyState'] = sel.xpath('./span[1]/font/text()').extract()
            if not item['storyState']:
                item['storyState'] = sel.xpath('./span[1]/text()').extract()
            # Fall back to '未知' ("unknown") when no TV station is listed.
            item['tvStation'] = sel.xpath('./span[2]/text()').extract()
            if not item['tvStation']:
                item['tvStation'] = ['未知']
            # The update time is either plain text or wrapped in <font>.
            item['updateTime'] = sel.xpath('./div[2]/text()').extract()
            if not item['updateTime']:
                item['updateTime'] = sel.xpath('./div[2]/font/text()').extract()
            items.append(item)
        return items
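The parse method repeats one pattern three times: run a primary XPath query, then fall back to a second query or a default value when the first returns an empty list. A minimal sketch of that pattern as a standalone helper is shown here. `first_non_empty` is a hypothetical name, not part of Scrapy, and the plain lists stand in for the results of `.extract()`.

```python
def first_non_empty(*candidates, default=None):
    """Return the first non-empty list among candidates, else the default.

    Hypothetical helper (not a Scrapy API): each candidate is expected to be
    a list such as the result of sel.xpath(...).extract().
    """
    for values in candidates:
        if values:
            return values
    return default if default is not None else []


# Plain lists stand in for extracted XPath results.
print(first_non_empty([], ['Season 2, Episode 3']))
print(first_non_empty([], [], default=['未知']))
```

Note that every candidate is evaluated before the helper is called, so in a spider both XPath queries would run even when the first one succeeds. For a page of this size that cost is probably negligible.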
- Define the item fields to scrape
D:\workspaces\python\scrapy\tutorial\tutorial\items.py
import scrapy


class TutorialItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    pass


class MeijuttItem(scrapy.Item):
    # define the fields for your item here like:
    storyName = scrapy.Field()
    storyState = scrapy.Field()
    tvStation = scrapy.Field()
    updateTime = scrapy.Field()
- Process the scraped data
D:\workspaces\python\scrapy\tutorial\tutorial\pipelines.py
import time


class TutorialPipeline(object):
    def process_item(self, item, spider):
        return item


class MeijuttPipeline(object):
    def process_item(self, item, spider):
        # Append one tab-separated line per item to <YYYYMMDD>movie.txt.
        today = time.strftime('%Y%m%d', time.localtime())
        fileName = today + 'movie.txt'
        # An explicit encoding avoids depending on the Windows default code page.
        with open(fileName, 'a', encoding='utf-8') as fp:
            fields = [item['storyName'][0], item['storyState'][0],
                      item['tvStation'][0], item['updateTime'][0]]
            fp.write('\t'.join(str(f) for f in fields) + '\n')
        return item
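Scrapy only calls pipelines that are enabled in the project settings, a step the walkthrough does not show. A minimal sketch of the entry that would go in D:\workspaces\python\scrapy\tutorial\tutorial\settings.py follows, assuming the default layout generated by startproject. The value 300 is an arbitrary choice within the conventional 0 to 1000 range.

```python
# tutorial/settings.py (fragment)
# Enable the pipeline so that process_item is called for every scraped item.
# The integer sets the run order when several pipelines are enabled:
# lower values run first.
ITEM_PIPELINES = {
    'tutorial.pipelines.MeijuttPipeline': 300,
}
```

Without an entry like this the spider would still crawl and log the items, but no output file would be written.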
- Run the spider
D:\workspaces\python\scrapy\tutorial>python3 -m scrapy crawl meijutt
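View the scraped results: the pipeline appends them to a file named <YYYYMMDD>movie.txt in the directory where the crawl command was run.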
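To inspect the output programmatically, the tab-separated file written by MeijuttPipeline can be read back into dictionaries. This is a minimal sketch, assuming four tab-separated columns per line and a UTF-8 encoded file (pass a different `encoding` if the file was written with the platform default). `read_results` and `todays_file_name` are hypothetical helper names.

```python
import time

# Column order used by MeijuttPipeline when writing each line.
FIELDS = ['storyName', 'storyState', 'tvStation', 'updateTime']


def todays_file_name():
    """Build the file name the pipeline uses: <YYYYMMDD>movie.txt."""
    return time.strftime('%Y%m%d', time.localtime()) + 'movie.txt'


def read_results(path, encoding='utf-8'):
    """Return one dict per non-empty line of the tab-separated results file."""
    rows = []
    with open(path, encoding=encoding) as fp:
        for line in fp:
            line = line.rstrip('\n')
            if not line:
                continue
            rows.append(dict(zip(FIELDS, line.split('\t'))))
    return rows


# Example usage, run from the directory that contains the results file:
#     for row in read_results(todays_file_name()):
#         print(row['storyName'], row['updateTime'])
```

Because the pipeline opens the file in append mode, running the crawl twice on the same day would produce duplicate rows in this list.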
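Questions
If you have further questions, leave a comment, or follow my WeChat official account, where the full code for this project is available. I will continue this series of tutorials on Scrapy crawler projects.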