
A Scrapy example


1. Goal:

  Scrapy is a web crawling framework. This article walks through its basic usage steps with a simple example.

2. Create a Scrapy project:

  Create a project named firstSpider with the following command:

$ scrapy startproject firstSpider
New Scrapy project 'firstSpider', using template directory '/usr/local/python-3.6.2/lib/python3.6/site-packages/scrapy/templates/project', created in:
    /home/jianglexing/firstSpider

You can start your first spider with:
    cd firstSpider
    scrapy genspider example example.com

  

3. What the scrapy command does when creating a project:

  When creating a project, Scrapy creates a directory and adds a number of files to it:

$ tree firstSpider/
firstSpider/
├── firstSpider
│   ├── __init__.py
│   ├── items.py
│   ├── middlewares.py
│   ├── pipelines.py
│   ├── __pycache__
│   ├── settings.py
│   └── spiders
│       ├── __init__.py
│       └── __pycache__
└── scrapy.cfg

4 directories, 7 files

4. Enter the project directory and create a spider:

$ cd firstSpider/
$ scrapy genspider financeSpider www.financedatas.com
Created spider 'financeSpider' using template 'basic' in module:
  firstSpider.spiders.financeSpider
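
The class generated for this spider (shown in step 6) is named FinancespiderSpider rather than FinanceSpiderSpider. The sketch below reproduces that naming pattern. It is an assumption inferred from the name observed in this article, not a statement of Scrapy's actual implementation, and the helper `spider_class_name` is hypothetical.

```python
def spider_class_name(spider_name):
    """Approximate the class name seen in this article for a given spider name.

    Assumption: each underscore-separated part is passed through
    str.capitalize() (which lowercases everything after the first letter)
    and the suffix 'Spider' is appended.
    """
    capitalized = ''.join(part.capitalize() for part in spider_name.split('_'))
    return capitalized + 'Spider'


print(spider_class_name('financeSpider'))  # FinancespiderSpider
```

Under this assumption, a lowercase or underscore-separated spider name (for example `finance`) avoids the doubled "Spider" and the odd capitalization.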

5. Each spider corresponds to one file in the Scrapy project:

$ tree ./
./
├── firstSpider
│   ├── __init__.py
│   ├── items.py
│   ├── middlewares.py
│   ├── pipelines.py
│   ├── __pycache__
│   │   ├── __init__.cpython-36.pyc
│   │   └── settings.cpython-36.pyc
│   ├── settings.py
│   └── spiders
│       ├── financeSpider.py    # the spider file that was just created
│       ├── __init__.py
│       └── __pycache__
│           └── __init__.cpython-36.pyc
└── scrapy.cfg

6. Write the spider's processing logic:

  As an example, scrape the title of the home page of http://www.financedatas.com

# -*- coding: utf-8 -*-
import scrapy


class FinancespiderSpider(scrapy.Spider):
    name = 'financeSpider'
    allowed_domains = ['www.financedatas.com']
    start_urls = ['http://www.financedatas.com/']

    def parse(self, response):
        """Write the processing logic in the parse method."""
        print('*' * 64)
        # Extract the page title with an XPath expression
        title = response.xpath('//title/text()').extract()
        print(title)
        print('*' * 64)

7. Run the spider and check the result:

$ scrapy crawl financeSpider
2017-08-10 16:11:38 [scrapy.utils.log] INFO: Scrapy 1.4.0 started (bot: firstSpider)
2017-08-10 16:11:38 [scrapy.utils.log] INFO: Overridden settings: {'BOT_NAME': 'firstSpider', 'NEWSPIDER_MODULE': 'firstSpider.spiders', 'ROBOTSTXT_OBEY': True, 'SPIDER_MODULES': ['firstSpider.spiders']}
.... ....
2017-08-10 16:11:39 [scrapy.core.engine] DEBUG: Crawled (404) <GET http://www.financedatas.com/robots.txt> (referer: None)
2017-08-10 16:11:39 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://www.financedatas.com/> (referer: None)
****************************************************************
['歡迎來到 www.financedatas.com']   # the extracted data (the page title, "Welcome to www.financedatas.com")
****************************************************************
2017-08-10 16:11:39 [scrapy.core.engine] INFO: Spider closed (finished)
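
The log shows a request for robots.txt before the home page. This matches the ROBOTSTXT_OBEY value of True reported in the "Overridden settings" line. Because robots.txt returned 404 here, no restrictions applied and the home page was crawled. As a rough sketch, the corresponding lines in firstSpider/settings.py would look like the following. The values are copied from that log line, and the rest of the generated file is omitted.

```python
# Sketch of the relevant lines in firstSpider/settings.py, with values taken
# from the "Overridden settings" log line; other generated settings are omitted.
BOT_NAME = 'firstSpider'

SPIDER_MODULES = ['firstSpider.spiders']
NEWSPIDER_MODULE = 'firstSpider.spiders'

# When True, Scrapy requests /robots.txt first and respects the rules it finds.
ROBOTSTXT_OBEY = True
```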
