Scrapy Study Notes (Part 1)

Installing Scrapy with pip

Command: pip install Scrapy

Creating a Scrapy project

Command: scrapy startproject tutorial (where tutorial is the project name)

Scrapy's project directory structure

tutorial/
    scrapy.cfg            # deploy configuration file
    tutorial/             # the project's Python module; you'll import your code from here
        __init__.py
        items.py          # project items definition file
        middlewares.py    # project middlewares file
        pipelines.py      # project pipelines file
        settings.py       # project settings file
        spiders/          # a directory where you'll later put your spiders
            __init__.py

Our first Spider

Our first Spider goes inside the spiders/ folder of the project directory.

import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"

    def start_requests(self):
        urls = [
            'http://quotes.toscrape.com/page/1/',
            'http://quotes.toscrape.com/page/2/',
        ]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        page = response.url.split("/")[-2]
        filename = 'quotes-%s.html' % page
        with open(filename, 'wb') as f:
            f.write(response.body)
        self.log('Saved file %s' % filename)
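The filename logic in parse() can be checked in isolation with plain strings (no Scrapy needed), using one of the URLs from the spider above:

```python
# Stand-alone check of the filename logic in parse(): the page number is
# the second-to-last path segment of the URL.
url = "http://quotes.toscrape.com/page/1/"
page = url.split("/")[-2]            # -> "1"
filename = 'quotes-%s.html' % page
print(filename)  # quotes-1.html
```

To run the spider itself, execute `scrapy crawl quotes` from the project's top-level directory; it saves quotes-1.html and quotes-2.html to the current directory.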

As you can see, our Spider subclasses scrapy.Spider and defines some attributes and methods:


name: identifies the Spider. It must be unique within a project, that is, you can't set the same name for different Spiders.


start_requests(): must return an iterable of Requests (you can return a list of requests or write a generator function) which the Spider will begin to crawl from. Subsequent requests will be generated successively from these initial requests.
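The generator form of start_requests() can be illustrated in plain Python; the dicts below are simplified stand-ins for scrapy.Request objects:

```python
# Why start_requests() can be a generator: each `yield` produces one
# request-like object, and the whole function is an iterable of them.
def make_requests(urls):
    for url in urls:
        yield {"url": url, "callback": "parse"}  # stand-in for scrapy.Request

requests = list(make_requests([
    "http://quotes.toscrape.com/page/1/",
    "http://quotes.toscrape.com/page/2/",
]))
print(len(requests))  # 2
```

Scrapy also offers a shortcut: define a start_urls class attribute with the list of URLs and omit start_requests() entirely; the default implementation generates equivalent requests that are handled by parse().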


parse(): a method that will be called to handle the response downloaded for each of the requests made. The response parameter is an instance of TextResponse that holds the page content and has further helpful methods to handle it.


The parse() method usually parses the response, extracting the scraped data as dicts and also finding new URLs to follow and creating new requests (Request) from them.
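The "extract dicts, then follow new URLs" pattern can be sketched with the standard library alone. This is only an illustration of the idea: real Scrapy code would use response.css() or response.xpath() for extraction and yield new scrapy.Request objects (the HTML sample and regexes below are invented for the demo):

```python
import re

# Stdlib-only sketch of a typical parse(): yield scraped data as dicts,
# then yield the next URL to follow. Not real Scrapy API.
SAMPLE_HTML = '''
<div class="quote"><span class="text">Quote one</span></div>
<div class="quote"><span class="text">Quote two</span></div>
<a class="next" href="/page/2/">Next</a>
'''

def parse(html):
    # extract scraped items as dicts
    for text in re.findall(r'<span class="text">(.*?)</span>', html):
        yield {"text": text}
    # find a new URL to follow and emit it as a request stand-in
    m = re.search(r'<a class="next" href="(.*?)"', html)
    if m:
        yield {"follow": m.group(1)}

results = list(parse(SAMPLE_HTML))
print(results)
```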