Scrapy in Action: Crawling the Novel 《宦海沈浮》
By 阿新 • Published 2019-02-17
Target site:
http://www.shushu8.com/huanhaichenfu/
Step 1: Create the project
KeysdeMacBook:Desktop keys$ scrapy startproject MyCrawl
New Scrapy project 'MyCrawl', using template directory '/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/scrapy/templates/project', created in:
    /Users/keys/Desktop/MyCrawl

You can start your first spider with:
    cd MyCrawl
    scrapy genspider example example.com
Step 2: Generate the spider
KeysdeMacBook:Desktop keys$ cd MyCrawl/
KeysdeMacBook:MyCrawl keys$ scrapy genspider FirstSpider www.shushu8.com/huanhaichenfu
Step 3: Configure items.py
import scrapy


class MycrawlItem(scrapy.Item):
    # One field per column we will store: the chapter URL, its title,
    # and the chapter text.
    url = scrapy.Field()
    title = scrapy.Field()
    text = scrapy.Field()

Step 4: Write the spider
# -*- coding: utf-8 -*-
import scrapy

from MyCrawl.items import MycrawlItem


class FirstspiderSpider(scrapy.Spider):
    name = 'FirstSpider'
    # Domain only here; the path that genspider copied in does not
    # belong in allowed_domains.
    allowed_domains = ['www.shushu8.com']
    # The novel has 502 chapters at .../huanhaichenfu/1 through /502.
    start_urls = ['http://www.shushu8.com/huanhaichenfu/' + str(i + 1)
                  for i in range(502)]

    def parse(self, response):
        url = response.url
        title = response.xpath(
            '//*[@id="main"]/div[2]/div/div[1]/h1/text()').extract_first('')
        text = response.css('#content::text').extract()
        myitem = MycrawlItem()
        myitem['url'] = url
        myitem['title'] = title
        myitem['text'] = ','.join(text)
        yield myitem
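The two non-obvious pieces of the spider are the list comprehension that builds the chapter URLs and the `','.join(...)` that flattens the extracted text nodes into one string. A minimal sketch in plain Python (no Scrapy needed; the sample fragments are made up for illustration):

```python
# Building the 502 chapter URLs: range(502) yields 0..501, so adding 1
# produces .../1 through .../502.
base = 'http://www.shushu8.com/huanhaichenfu/'
start_urls = [base + str(i + 1) for i in range(502)]

# response.css('#content::text').extract() returns one string per text
# node inside #content; joining with ',' collapses them into a single
# value that fits one database column.
fragments = ['第一段', '第二段', '第三段']  # hypothetical extracted text nodes
joined = ','.join(fragments)
```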
Step 5: Configure pipelines.py
# -*- coding: utf-8 -*-
import pymysql


class MysqlPipeline(object):
    # Writes each item to MySQL synchronously.
    def __init__(self):
        self.conn = pymysql.connect(
            '127.0.0.1', 'root', 'rootkeys', 'Article',
            charset='utf8', use_unicode=True)
        self.cursor = self.conn.cursor()

    def process_item(self, item, spider):
        # VALUES placeholders keep the query parameterized.
        insert_sql = """
            insert into huanhaichenfu(url, title, text)
            VALUES (%s, %s, %s)
        """
        self.cursor.execute(
            insert_sql, (item["url"], item["title"], item["text"]))
        self.conn.commit()
        return item  # pass the item along to any later pipelines
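Note that the pipeline assumes a `huanhaichenfu` table already exists in the `Article` database. The parameterized-insert pattern it uses can be sketched against the standard-library sqlite3 module, so it runs without a MySQL server; the table schema and sample item below are assumptions for illustration, and sqlite3 uses `?` placeholders where pymysql uses `%s`:

```python
import sqlite3

# In-memory stand-in for the MySQL 'Article' database.
conn = sqlite3.connect(':memory:')
cursor = conn.cursor()
# Assumed schema matching the columns the pipeline inserts.
cursor.execute(
    'CREATE TABLE huanhaichenfu (url TEXT, title TEXT, text TEXT)')

# A dict standing in for the MycrawlItem yielded by the spider.
item = {'url': 'http://www.shushu8.com/huanhaichenfu/1',
        'title': '第一章', 'text': '正文內容'}

# Same pattern as the pipeline: placeholders in the SQL, values passed
# separately, so the driver handles quoting and escaping.
insert_sql = 'INSERT INTO huanhaichenfu (url, title, text) VALUES (?, ?, ?)'
cursor.execute(insert_sql, (item['url'], item['title'], item['text']))
conn.commit()

rows = cursor.execute('SELECT url, title FROM huanhaichenfu').fetchall()
```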
Step 6: Configure settings.py
# -*- coding: utf-8 -*-
BOT_NAME = 'MyCrawl'

SPIDER_MODULES = ['MyCrawl.spiders']
NEWSPIDER_MODULE = 'MyCrawl.spiders'

USER_AGENT = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3497.100 Safari/537.36'

ROBOTSTXT_OBEY = False

ITEM_PIPELINES = {
    'MyCrawl.pipelines.MysqlPipeline': 1,
}
Step 7: Run the spider
import os
import sys

from scrapy.cmdline import execute

# Make sure the project root is on sys.path so 'scrapy crawl' can find it.
sys.path.append(os.path.dirname(os.path.abspath(__file__)))

run_spider = 'FirstSpider'

if __name__ == '__main__':
    print('Running Spider of ' + run_spider)
    execute(['scrapy', 'crawl', run_spider])