
Easy Scrapy installation on Windows



Scrapy is a fast, high-level screen-scraping and web-crawling framework written in Python, used to crawl websites and extract structured data from their pages. It has a wide range of uses, including data mining, monitoring, and automated testing.
What makes Scrapy attractive is that it is a framework anyone can adapt to their own needs. It also provides base classes for several types of spiders, such as BaseSpider and sitemap spiders, and recent versions add support for crawling web 2.0 sites.

A note before we start: this article uses Python 3.7; readers may pick their own version. The system is 64-bit Windows.

Step 1: Check the environment requirements

Install Python yourself, then add the Python directory, as well as the Scripts folder under it, to the system PATH environment variable.


Note: if you ticked the option to add Python to PATH during installation, skip this step.
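If you are not sure which two directories need to go on PATH, a minimal stdlib sketch (the function name is illustrative) prints them for whichever interpreter runs it:

```python
import os
import sys
import sysconfig


def dirs_to_add():
    """Return the interpreter's directory and its scripts directory.

    These are the two entries this step asks you to put on PATH.
    """
    python_dir = os.path.dirname(sys.executable)
    scripts_dir = sysconfig.get_path("scripts")
    return python_dir, scripts_dir


if __name__ == "__main__":
    for d in dirs_to_add():
        print(d)
```

On a default Windows install these print something like `C:\software\python3.7` and `C:\software\python3.7\Scripts`; on other platforms the paths differ but the idea is the same.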
Prepare the two required files: **Twisted-18.7.0-cp37-cp37m-win_amd64.whl** and **lxml-4.2.3-cp37-cp37m-win_amd64.whl**.
Download link: https://pan.baidu.com/s/1TC2q_oC5h6Z4ymRpmpSxsA (includes builds for Python 3.5 and 3.7)
You can also download them yourself from the unofficial Windows binaries for Python extension packages: https://www.lfd.uci.edu/~gohlke/pythonlibs/
Note: Scrapy is built on the Twisted framework and uses lxml to parse HTML. A plain `pip install scrapy` often fails to build these two components correctly on Windows, so we install them separately from prebuilt wheels.
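The `cp37-cp37m-win_amd64` suffix in those filenames must match your interpreter exactly, or pip will refuse to install the wheel. A small sketch (the helper name is made up for illustration) shows the tag your own Python expects:

```python
import platform
import sys


def wheel_tag_hint():
    """Build the CPython version tag pip expects in a wheel
    filename, e.g. 'cp37' for Python 3.7."""
    return "cp{}{}".format(sys.version_info.major, sys.version_info.minor)


if __name__ == "__main__":
    print("Python tag:", wheel_tag_hint())
    # 'AMD64' on 64-bit Windows, which corresponds to win_amd64 wheels
    print("Machine:   ", platform.machine())
```

If the printed tag is `cp35`, download the 3.5 wheels from the link above instead.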

Step 2: Install Scrapy

Change into the directory holding Twisted-18.7.0-cp37-cp37m-win_amd64.whl and lxml-4.2.3-cp37-cp37m-win_amd64.whl, then install them with pip:

C:\Users\WU\Downloads\scrapyFile>pip install lxml-4.2.3-cp37-cp37m-win_amd64.whl

C:\Users\WU\Downloads\scrapyFile>pip install Twisted-18.7.0-cp37-cp37m-win_amd64.whl

Note: I placed the two files in C:\Users\WU\Downloads\scrapyFile.


Once lxml and Twisted have installed successfully, run the following commands to install Scrapy:

pip install pywin32
pip install scrapy

Note: pywin32 is needed because Python will later call Windows system APIs.

Step 3: Verify that Scrapy installed successfully

Run `scrapy` in cmd; you should see output like this:

C:\Users\WU\Downloads\scrapyFile>scrapy
Scrapy 1.5.1 - no active project

Usage:
  scrapy <command> [options] [args]

Available commands:
  bench         Run quick benchmark test
  fetch         Fetch a URL using the Scrapy downloader
  genspider     Generate new spider using pre-defined templates
  runspider     Run a self-contained spider (without creating a project)
  settings      Get settings values
  shell         Interactive scraping console
  startproject  Create new project
  version       Print Scrapy version
  view          Open URL in browser, as seen by Scrapy

  [ more ]      More commands available when run from project directory

Use "scrapy <command> -h" to see more info about a command

Step 4: Create a Scrapy project

Change into your working directory and run the following command to generate a project:

D:\pythonplace\scrapy>scrapy startproject helloworld
New Scrapy project 'helloworld', using template directory 'd:\\software\\python3.7\\lib\\site-packages\\scrapy\\templates\\project', created in:
    D:\pythonplace\scrapy\helloworld

You can start your first spider with:
    cd helloworld
    scrapy genspider example example.com
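For orientation, the standard project template that ships with Scrapy 1.5 generates roughly this layout (check your own tree, as templates vary between versions):

```
helloworld/
    scrapy.cfg            # deploy configuration
    helloworld/           # the project's Python module
        __init__.py
        items.py          # item definitions
        middlewares.py    # spider and downloader middlewares
        pipelines.py      # item pipelines
        settings.py       # project settings
        spiders/          # put your spiders here
            __init__.py
```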

Appendix:
1. If you see the following error when starting a crawler with `scrapy crawl xxx`:

Traceback (most recent call last):
  File "/usr/local/bin/scrapy", line 11, in <module>
    sys.exit(execute())
  File "/usr/local/lib/python3.7/site-packages/scrapy/cmdline.py", line 150, in execute
    _run_print_help(parser, _run_command, cmd, args, opts)
  File "/usr/local/lib/python3.7/site-packages/scrapy/cmdline.py", line 90, in _run_print_help
    func(*a, **kw)
  File "/usr/local/lib/python3.7/site-packages/scrapy/cmdline.py", line 157, in _run_command
    cmd.run(args, opts)
  File "/usr/local/lib/python3.7/site-packages/scrapy/commands/crawl.py", line 57, in run
    self.crawler_process.crawl(spname, **opts.spargs)
  File "/usr/local/lib/python3.7/site-packages/scrapy/crawler.py", line 170, in crawl
    crawler = self.create_crawler(crawler_or_spidercls)
  File "/usr/local/lib/python3.7/site-packages/scrapy/crawler.py", line 198, in create_crawler
    return self._create_crawler(crawler_or_spidercls)
  File "/usr/local/lib/python3.7/site-packages/scrapy/crawler.py", line 203, in _create_crawler
    return Crawler(spidercls, self.settings)
  File "/usr/local/lib/python3.7/site-packages/scrapy/crawler.py", line 55, in __init__
    self.extensions = ExtensionManager.from_crawler(self)
  File "/usr/local/lib/python3.7/site-packages/scrapy/middleware.py", line 58, in from_crawler
    return cls.from_settings(crawler.settings, crawler)
  File "/usr/local/lib/python3.7/site-packages/scrapy/middleware.py", line 34, in from_settings
    mwcls = load_object(clspath)
  File "/usr/local/lib/python3.7/site-packages/scrapy/utils/misc.py", line 44, in load_object
    mod = import_module(module)
  File "/usr/local/lib/python3.7/importlib/__init__.py", line 127, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
  File "<frozen importlib._bootstrap>", line 1006, in _gcd_import
  File "<frozen importlib._bootstrap>", line 983, in _find_and_load
  File "<frozen importlib._bootstrap>", line 967, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 677, in _load_unlocked
  File "<frozen importlib._bootstrap_external>", line 728, in exec_module
  File "<frozen importlib._bootstrap>", line 219, in _call_with_frames_removed
  File "/usr/local/lib/python3.7/site-packages/scrapy/extensions/telnet.py", line 12, in <module>
    from twisted.conch import manhole, telnet
  File "/usr/local/lib/python3.7/site-packages/twisted/conch/manhole.py", line 154
    def write(self, data, async=False):
                              ^
SyntaxError: invalid syntax

This happens because `async` became a reserved keyword in Python 3.7, while older Twisted releases still use it as a parameter name. Open Lib/site-packages/twisted/conch/manhole.py under your Python directory and rename `async` on lines 154, 155, 240, 241, and 247,
like so:

154    def write(self, data, async1=False):
155        self.handler.addOutput(data, async1)

       ........

240    def addOutput(self, data, async1=False):
241        if async1:
       ........

247        if async1:
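Rather than editing manhole.py by hand, the rename can be scripted. This is a sketch under the assumption that only the bare identifier `async` needs to change; it uses a word-boundary regex so names like `asyncio` are left untouched, and the fallback path is illustrative only:

```python
import re
import sys


def rename_async(source, new_name="async1"):
    """Replace the bare identifier `async` (a reserved word since
    Python 3.7) with `new_name`, leaving e.g. `asyncio` alone."""
    return re.sub(r"\basync\b", new_name, source)


if __name__ == "__main__" and len(sys.argv) > 1:
    # Pass the path to your own site-packages copy of manhole.py.
    path = sys.argv[1]
    with open(path, encoding="utf-8") as f:
        fixed = rename_async(f.read())
    with open(path, "w", encoding="utf-8") as f:
        f.write(fixed)
```

Run it as `python fix_manhole.py <path-to-manhole.py>`; re-running is harmless because already-renamed lines no longer match.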