
Setting up Scrapy and a first test run

Background: I've recently been working on a PHP server project. Content is the most important thing for a product, so getting content matters a lot, and I wanted to fetch it with a crawler. Scrapy seemed like the right tool, but, as so often for developers, the environment refused to cooperate and I wasted a lot of time. I'm writing this down for future reference.

Getting started: according to the official site, installing Scrapy is very simple.

Official guide: https://doc.scrapy.org/en/latest/intro/install.html

1. Make sure you have Python installed.

2. Run pip install scrapy

That should be it. The official docs list more prerequisites, but when I tried, most of them turned out to be unnecessary; pip downloads and installs them automatically.

But I ran into plenty of problems along the way.

The problems

1. I kept getting errors about symbols missing from etree.so.

Searching around, most answers said this was caused by a mismatch with the Python version bundled with macOS. I was stuck here for a long time; even virtualenv didn't solve it for me.

I tried all kinds of environment-variable tweaks suggested online, but none of them helped.

Trial and error

In desperation I completely removed the Python that ships with macOS. It was 2.7.10; in hindsight, that version should have been fine.

Installing 2.7.12 didn't help either.

Then I went into the lxml directory and inspected the extension module with

otool -L lxml/etree.so

All the dynamic libraries it links against were present, so it wasn't the mis-linked shared library problem described online either.

So I took another drastic step and deleted lxml entirely; reinstalling it still didn't work.

But with lxml half-deleted, running scrapy produced the error below. Solutions for the missing etree.so symbols had been scarce and unhelpful, but this error turned out to be well documented online:

wudideMacBook-Pro:~ xiepengchong$ scrapy

Traceback (most recent call last):

  File "/usr/local/bin/scrapy", line 7, in <module>

    from scrapy.cmdline import execute

  File "/usr/local/lib/python2.7/site-packages/scrapy/__init__.py", line 34, in <module>

    from scrapy.spiders import Spider

  File "/usr/local/lib/python2.7/site-packages/scrapy/spiders/__init__.py", line 10, in <module>

    from scrapy.http import Request

  File "/usr/local/lib/python2.7/site-packages/scrapy/http/__init__.py", line 11, in <module>

    from scrapy.http.request.form import FormRequest

  File "/usr/local/lib/python2.7/site-packages/scrapy/http/request/form.py", line 9, in <module>

    import lxml.html

ImportError: No module named lxml.html
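The ImportError above simply means that the interpreter behind /usr/local/bin/scrapy cannot import lxml. A minimal, version-agnostic way to reproduce that check from the same interpreter (a sketch that works on both Python 2 and 3):

```python
def has_module(name):
    # Try to import the module and report whether it succeeds --
    # the same check that Scrapy's import chain is failing on.
    try:
        __import__(name)
        return True
    except ImportError:
        return False

print(has_module("lxml"))       # False reproduces the error above
print(has_module("lxml.html"))  # Scrapy specifically needs lxml.html
```

If this prints False while `pip show lxml` claims lxml is installed, the scrapy script and pip are almost certainly using different Python installations.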

One suggested fix was to install Scrapy with

easy_install Scrapy

so I gave it a try:

wudideMacBook-Pro:ucarandroid xiepengchong$ easy_install Scrapy

Searching for Scrapy

Best match: Scrapy 1.2.1

Adding Scrapy 1.2.1 to easy-install.pth file

Installing scrapy script to /usr/local/bin

Using /usr/local/lib/python2.7/site-packages

Processing dependencies for Scrapy

Searching for lxml

Reading https://pypi.python.org/simple/lxml/

Best match: lxml 3.6.4

Downloading https://pypi.python.org/packages/4f/3f/cf6daac551fc36cddafa1a71ed48ea5fd642e5feabd3a0d83b8c3dfd0cb4/lxml-3.6.4.tar.gz#md5=6dd7314233029d9dab0156e7b1c7830b

Processing lxml-3.6.4.tar.gz

Writing /var/folders/b7/3fpgyn013wzfrsdmrcy0gp640000gn/T/easy_install-kjY8jp/lxml-3.6.4/setup.cfg

Running lxml-3.6.4/setup.py -q bdist_egg --dist-dir /var/folders/b7/3fpgyn013wzfrsdmrcy0gp640000gn/T/easy_install-kjY8jp/lxml-3.6.4/egg-dist-tmp-0VlZ05

Building lxml version 3.6.4.

Building without Cython.

Using build configuration of libxslt 1.1.28

creating /usr/local/lib/python2.7/site-packages/lxml-3.6.4-py2.7-macosx-10.10-x86_64.egg

Extracting lxml-3.6.4-py2.7-macosx-10.10-x86_64.egg to /usr/local/lib/python2.7/site-packages

Adding lxml 3.6.4 to easy-install.pth file

Installed /usr/local/lib/python2.7/site-packages/lxml-3.6.4-py2.7-macosx-10.10-x86_64.egg


And it worked (so the culprit was presumably the previously installed lxml build, even though that was also version 3.6.4; if anyone knows the exact cause, I'd appreciate an explanation):

wudideMacBook-Pro:opensource xiepengchong$ scrapy

Scrapy 1.2.1 - no active project

Usage:

  scrapy <command> [options] [args]

Available commands:

  bench         Run quick benchmark test

  commands      

  fetch         Fetch a URL using the Scrapy downloader

  genspider     Generate new spider using pre-defined templates

  runspider     Run a self-contained spider (without creating a project)

  settings      Get settings values

  shell         Interactive scraping console

  startproject  Create new project

  version       Print Scrapy version

  view          Open URL in browser, as seen by Scrapy

  [ more ]      More commands available when run from project directory

Use "scrapy <command> -h" to see more info about a command


I was a little thrilled that it finally worked. Next step: create and run a first project.

Official tutorial: http://scrapy-chs.readthedocs.io/zh_CN/0.24/intro/tutorial.html

wudideMacBook-Pro:myScrapy xiepengchong$ scrapy startproject tutorial

New Scrapy project 'tutorial', using template directory '/usr/local/lib/python2.7/site-packages/scrapy/templates/project', created in:

    /Users/xiepengchong/opensource/scrapy/myScrapy/tutorial

You can start your first spider with:

    cd tutorial

    scrapy genspider example example.com

wudideMacBook-Pro:myScrapy xiepengchong$ 
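For reference, a freshly generated project should look roughly like this (reconstructed from the Scrapy 1.2-era project template; file names may differ slightly across versions):

```
tutorial/
    scrapy.cfg            # deploy configuration
    tutorial/             # the project's Python module
        __init__.py
        items.py          # item definitions
        pipelines.py      # item pipelines
        settings.py       # project settings
        spiders/          # spiders live here
            __init__.py
```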

At this point we hardly need the tutorial anymore; the output tells us exactly what to do next.

wudideMacBook-Pro:myScrapy xiepengchong$ cd tutorial/

wudideMacBook-Pro:tutorial xiepengchong$ scrapy genspider example example.com

Created spider 'example' using template 'basic' in module:

  tutorial.spiders.example

wudideMacBook-Pro:tutorial xiepengchong$ 


It looks like the spider was created successfully. Let's run it and see:

wudideMacBook-Pro:tutorial xiepengchong$ scrapy crawl example

2016-11-09 20:47:21 [scrapy] INFO: Scrapy 1.2.1 started (bot: tutorial)

2016-11-09 20:47:21 [scrapy] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'tutorial.spiders', 'SPIDER_MODULES': ['tutorial.spiders'], 'ROBOTSTXT_OBEY': True, 'BOT_NAME': 'tutorial'}

2016-11-09 20:47:21 [scrapy] INFO: Enabled extensions:

['scrapy.extensions.logstats.LogStats',

 'scrapy.extensions.telnet.TelnetConsole',

 'scrapy.extensions.corestats.CoreStats']

2016-11-09 20:47:21 [scrapy] INFO: Enabled downloader middlewares:

['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',

 'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',

 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',

 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',

 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',

 'scrapy.downloadermiddlewares.retry.RetryMiddleware',

 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',

 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',

 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',

 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',

 'scrapy.downloadermiddlewares.chunked.ChunkedTransferMiddleware',

 'scrapy.downloadermiddlewares.stats.DownloaderStats']

2016-11-09 20:47:21 [scrapy] INFO: Enabled spider middlewares:

['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',

 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',

 'scrapy.spidermiddlewares.referer.RefererMiddleware',

 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',

 'scrapy.spidermiddlewares.depth.DepthMiddleware']

2016-11-09 20:47:21 [scrapy] INFO: Enabled item pipelines:

[]

2016-11-09 20:47:21 [scrapy] INFO: Spider opened

2016-11-09 20:47:21 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)

2016-11-09 20:47:21 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023

2016-11-09 20:47:22 [scrapy] DEBUG: Crawled (404) <GET http://example.com/robots.txt> (referer: None)

2016-11-09 20:47:23 [scrapy] DEBUG: Crawled (200) <GET http://example.com/> (referer: None)

2016-11-09 20:47:23 [scrapy] INFO: Closing spider (finished)

2016-11-09 20:47:23 [scrapy] INFO: Dumping Scrapy stats:

{'downloader/request_bytes': 428,

 'downloader/request_count': 2,

 'downloader/request_method_count/GET': 2,

 'downloader/response_bytes': 1899,

 'downloader/response_count': 2,

 'downloader/response_status_count/200': 1,

 'downloader/response_status_count/404': 1,

 'finish_reason': 'finished',

 'finish_time': datetime.datetime(2016, 11, 9, 12, 47, 23, 133884),

 'log_count/DEBUG': 3,

 'log_count/INFO': 7,

 'response_received_count': 2,

 'scheduler/dequeued': 1,

 'scheduler/dequeued/memory': 1,

 'scheduler/enqueued': 1,

 'scheduler/enqueued/memory': 1,

 'start_time': datetime.datetime(2016, 11, 9, 12, 47, 21, 424947)}

2016-11-09 20:47:23 [scrapy] INFO: Spider closed (finished)

wudideMacBook-Pro:tutorial xiepengchong$ 
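One detail worth noting in the stats dump: the log lines show local time (20:47) while start_time and finish_time are in UTC (12:47, consistent with UTC+8). Subtracting the two timestamps confirms the whole crawl took under two seconds:

```python
import datetime

# Timestamps copied from the stats dump above (UTC).
start = datetime.datetime(2016, 11, 9, 12, 47, 21, 424947)
finish = datetime.datetime(2016, 11, 9, 12, 47, 23, 133884)

elapsed = finish - start
print(elapsed.total_seconds())  # 1.708937
```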

I still don't fully understand how the crawler works under the hood, but the first step is done. As they say, the first step is the hardest; the rest is just steady learning and practice.