Scrapy: environment setup and a first test run
Background: I have recently been working on a PHP server project. Content is the single most important thing for a product, so acquiring content matters a great deal, and I decided to get it with a crawler. Scrapy seemed like the right tool, but as every developer knows, the environment never cooperates, and I wasted a lot of time on it. I'm writing the process down here for future reference.
Getting started: according to the official site, installing Scrapy is very simple.
Official guide: https://doc.scrapy.org/en/latest/intro/install.html
1. Make sure you have Python.
2. Run pip install scrapy.
That's it. The official docs list more prerequisites, but in my experience most of them were unnecessary; they get downloaded and installed automatically.
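After the install, a quick sanity check (plain stdlib, nothing Scrapy-specific) confirms whether Python can even find the two key packages:

```python
import importlib.util

def locate(module_name):
    """Return the file a module would load from, or None if it is not found."""
    spec = importlib.util.find_spec(module_name)
    return spec.origin if spec else None

# The two packages the install should have provided:
for name in ("scrapy", "lxml"):
    print(name, "->", locate(name) or "NOT FOUND")
```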
But I ran into a lot of problems along the way.
Problem phase
1. scrapy kept complaining that various symbols could not be found in etree.so.
Most of the answers online blamed a mismatch with the Python that ships with macOS. I was stuck here for a long time, and virtualenv did not solve it for me either.
I tried every environment-variable tweak the internet suggested, but none of them fixed it.
Exploration phase
Out of desperation I completely removed the macOS system Python (it was 2.7.10; in hindsight that version was probably fine).
Installing 2.7.12 did not help either.
Then I went into the lxml package directory
and ran otool -L lxml/etree.so.
All the dynamic libraries it links against actually exist, so it was not the mis-linked-library problem people describe online.
So I went a step further and deleted lxml itself; reinstalling it still did not help.
But with lxml half removed, running scrapy produced the error below. There is very little useful material about the missing-symbols-in-etree.so error, but this ImportError turns out to be all over the internet:
wudideMacBook-Pro:~ xiepengchong$ scrapy
Traceback (most recent call last):
  File "/usr/local/bin/scrapy", line 7, in <module>
    from scrapy.cmdline import execute
  File "/usr/local/lib/python2.7/site-packages/scrapy/__init__.py", line 34, in <module>
    from scrapy.spiders import Spider
  File "/usr/local/lib/python2.7/site-packages/scrapy/spiders/__init__.py", line 10, in <module>
    from scrapy.http import Request
  File "/usr/local/lib/python2.7/site-packages/scrapy/http/__init__.py", line 11, in <module>
    from scrapy.http.request.form import FormRequest
  File "/usr/local/lib/python2.7/site-packages/scrapy/http/request/form.py", line 9, in <module>
    import lxml.html
ImportError: No module named lxml.html
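Note that a module can exist on disk (so pip considers it installed) and still blow up at import time, which is exactly what the missing-symbols-in-etree.so errors were. Reproducing Scrapy's failing import in isolation makes the two cases easy to tell apart (a minimal sketch):

```python
def importable(module_name):
    """True if the module actually imports, not merely exists on sys.path."""
    try:
        __import__(module_name)
    except ImportError:
        return False
    return True

# The exact import the traceback above dies on:
print("lxml.html importable:", importable("lxml.html"))
```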
One of the suggested fixes was to install Scrapy with easy_install Scrapy instead.
Figuring I had nothing to lose, I ran it:
wudideMacBook-Pro:ucarandroid xiepengchong$ easy_install Scrapy
Searching for Scrapy
Best match: Scrapy 1.2.1
Adding Scrapy 1.2.1 to easy-install.pth file
Installing scrapy script to /usr/local/bin
Using /usr/local/lib/python2.7/site-packages
Processing dependencies for Scrapy
Searching for lxml
Reading https://pypi.python.org/simple/lxml/
Best match: lxml 3.6.4
Downloading https://pypi.python.org/packages/4f/3f/cf6daac551fc36cddafa1a71ed48ea5fd642e5feabd3a0d83b8c3dfd0cb4/lxml-3.6.4.tar.gz#md5=6dd7314233029d9dab0156e7b1c7830b
Processing lxml-3.6.4.tar.gz
Writing /var/folders/b7/3fpgyn013wzfrsdmrcy0gp640000gn/T/easy_install-kjY8jp/lxml-3.6.4/setup.cfg
Running lxml-3.6.4/setup.py -q bdist_egg --dist-dir /var/folders/b7/3fpgyn013wzfrsdmrcy0gp640000gn/T/easy_install-kjY8jp/lxml-3.6.4/egg-dist-tmp-0VlZ05
Building lxml version 3.6.4.
Building without Cython.
Using build configuration of libxslt 1.1.28
creating /usr/local/lib/python2.7/site-packages/lxml-3.6.4-py2.7-macosx-10.10-x86_64.egg
Extracting lxml-3.6.4-py2.7-macosx-10.10-x86_64.egg to /usr/local/lib/python2.7/site-packages
Adding lxml 3.6.4 to easy-install.pth file
Installed /usr/local/lib/python2.7/site-packages/lxml-3.6.4-py2.7-macosx-10.10-x86_64.egg
And it worked! So the culprit was presumably the previously installed copy of lxml, even though that was also version 3.6.4. If anyone knows the actual root cause, I'd appreciate a pointer.
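I can only guess at the cause, but one plausible explanation (an assumption on my part, not something I verified) is that two copies of lxml ended up installed: the old broken one in site-packages and the fresh .egg that easy_install registered in easy-install.pth. Setuptools moves egg entries toward the front of sys.path, so the fresh copy would shadow the broken one. The import search order is easy to inspect:

```python
import sys

# Python imports whichever copy of a package appears first on sys.path;
# .egg entries registered via easy-install.pth are placed ahead of the
# plain site-packages directory.
for index, entry in enumerate(sys.path):
    print(index, entry)
```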
wudideMacBook-Pro:opensource xiepengchong$ scrapy
Scrapy 1.2.1 - no active project
Usage:
scrapy <command> [options] [args]
Available commands:
bench Run quick benchmark test
commands
fetch Fetch a URL using the Scrapy downloader
genspider Generate new spider using pre-defined templates
runspider Run a self-contained spider (without creating a project)
settings Get settings values
shell Interactive scraping console
startproject Create new project
version Print Scrapy version
view Open URL in browser, as seen by Scrapy
[ more ] More commands available when run from project directory
Use "scrapy <command> -h" to see more info about a command
I was genuinely a little excited that it finally worked. Next up: create a first project and give it a spin.
Tutorial (Chinese translation of the official docs): http://scrapy-chs.readthedocs.io/zh_CN/0.24/intro/tutorial.html
wudideMacBook-Pro:myScrapy xiepengchong$ scrapy startproject tutorial
New Scrapy project 'tutorial', using template directory '/usr/local/lib/python2.7/site-packages/scrapy/templates/project', created in:
/Users/xiepengchong/opensource/scrapy/myScrapy/tutorial
You can start your first spider with:
cd tutorial
scrapy genspider example example.com
wudideMacBook-Pro:myScrapy xiepengchong$
From here you hardly need the tutorial at all; the output already tells you exactly what to do next.
wudideMacBook-Pro:myScrapy xiepengchong$ cd tutorial/
wudideMacBook-Pro:tutorial xiepengchong$ scrapy genspider example example.com
Created spider 'example' using template 'basic' in module:
tutorial.spiders.example
wudideMacBook-Pro:tutorial xiepengchong$
The spider was created successfully, so let's try running it:
wudideMacBook-Pro:tutorial xiepengchong$ scrapy crawl example
2016-11-09 20:47:21 [scrapy] INFO: Scrapy 1.2.1 started (bot: tutorial)
2016-11-09 20:47:21 [scrapy] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'tutorial.spiders', 'SPIDER_MODULES': ['tutorial.spiders'], 'ROBOTSTXT_OBEY': True, 'BOT_NAME': 'tutorial'}
2016-11-09 20:47:21 [scrapy] INFO: Enabled extensions:
['scrapy.extensions.logstats.LogStats',
'scrapy.extensions.telnet.TelnetConsole',
'scrapy.extensions.corestats.CoreStats']
2016-11-09 20:47:21 [scrapy] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
'scrapy.downloadermiddlewares.retry.RetryMiddleware',
'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
'scrapy.downloadermiddlewares.chunked.ChunkedTransferMiddleware',
'scrapy.downloadermiddlewares.stats.DownloaderStats']
2016-11-09 20:47:21 [scrapy] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
'scrapy.spidermiddlewares.referer.RefererMiddleware',
'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
'scrapy.spidermiddlewares.depth.DepthMiddleware']
2016-11-09 20:47:21 [scrapy] INFO: Enabled item pipelines:
[]
2016-11-09 20:47:21 [scrapy] INFO: Spider opened
2016-11-09 20:47:21 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2016-11-09 20:47:21 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
2016-11-09 20:47:22 [scrapy] DEBUG: Crawled (404) <GET http://example.com/robots.txt> (referer: None)
2016-11-09 20:47:23 [scrapy] DEBUG: Crawled (200) <GET http://example.com/> (referer: None)
2016-11-09 20:47:23 [scrapy] INFO: Closing spider (finished)
2016-11-09 20:47:23 [scrapy] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 428,
'downloader/request_count': 2,
'downloader/request_method_count/GET': 2,
'downloader/response_bytes': 1899,
'downloader/response_count': 2,
'downloader/response_status_count/200': 1,
'downloader/response_status_count/404': 1,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2016, 11, 9, 12, 47, 23, 133884),
'log_count/DEBUG': 3,
'log_count/INFO': 7,
'response_received_count': 2,
'scheduler/dequeued': 1,
'scheduler/dequeued/memory': 1,
'scheduler/enqueued': 1,
'scheduler/enqueued/memory': 1,
'start_time': datetime.datetime(2016, 11, 9, 12, 47, 21, 424947)}
2016-11-09 20:47:23 [scrapy] INFO: Spider closed (finished)
wudideMacBook-Pro:tutorial xiepengchong$
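The dumped stats are just a Python dict, so they are easy to work with. For instance, the crawl duration is simply finish_time minus start_time; plugging in the two timestamps from the log above:

```python
import datetime

# start_time and finish_time copied from the stats dump above
start = datetime.datetime(2016, 11, 9, 12, 47, 21, 424947)
finish = datetime.datetime(2016, 11, 9, 12, 47, 23, 133884)
print("crawl took", (finish - start).total_seconds(), "seconds")  # ~1.7 s
```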
I still don't fully understand how the crawler works under the hood, but the first step is done, and the first step is always the hardest. The rest I'll pick up bit by bit.