Python 2.7.9 + Scrapy + Selenium on Windows 7 and Linux
1. Install the Python 2.7.5 environment
On Windows you can simply download Python 2.7.5 and click Next through the installer.
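2. Install pip
Windows: before installing pip, make sure Python and the easy_install tool are already installed. If the installation succeeded, easy_install is located in C:\Python27\Scripts. Open a command prompt, change to the Scripts folder under the Python installation directory, and run easy_install pip. Once pip is installed, running pip list in cmd without errors confirms that it works.
Linux: if pip is already installed and needs upgrading, use: sudo pip install --upgrade pip
If pip is not installed:
wget https://bootstrap.pypa.io/get-pip.py
python get-pip.py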
3. Install Scrapy
sudo pip install scrapy
4. Install Selenium
sudo pip install selenium
5. Install pyOpenSSL
sudo pip install pyopenssl
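After steps 3 to 5 it can help to confirm that the three packages actually import. The snippet below is only a small sketch: the helper name check_modules is made up for this post, and it relies on the fact that pyopenssl is imported under the module name OpenSSL.

```python
import importlib


def check_modules(names):
    """Map each module name to its version string, 'unknown', or None if the import fails."""
    report = {}
    for name in names:
        try:
            module = importlib.import_module(name)
        except ImportError:
            report[name] = None
        else:
            report[name] = getattr(module, "__version__", "unknown")
    return report


if __name__ == "__main__":
    # pyopenssl is imported as "OpenSSL"
    results = check_modules(["scrapy", "selenium", "OpenSSL"])
    for name, version in sorted(results.items()):
        print("%s: %s" % (name, version if version else "NOT INSTALLED"))
```

A module reported as NOT INSTALLED is the one to install (or reinstall) with pip before going further.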
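6. On Windows, the win32api module also needs to be installed
In cmd:
pip install pywin32
(The win32api module is provided by the pywin32 package.)
If anything else reports an error, search Baidu for whatever is missing and install it.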
7. Install Firefox. The crawler drives a real browser, so Linux needs one installed:
yum install firefox
firefox --version
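8. Download the Firefox driver
How to install geckodriver
Download address: link
I downloaded geckodriver-v0.20.0-linux64.tar.gz.
Installation is similar to chromedriver:
In a terminal, change to the download directory and run the following commands to unpack the driver and move it into /usr/bin, the directory Firefox is launched from:
tar -xvzf geckodriver*
chmod +x geckodriver
sudo mv geckodriver /usr/bin/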
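Before writing any crawler code, it is worth checking that both firefox and geckodriver can be found on the PATH, because Selenium looks the driver up there. This is a rough sketch for Linux only; find_on_path is a hypothetical helper written for this post, not part of Selenium.

```python
import os


def find_on_path(name, path=None):
    """Return the full path of an executable called `name`, or None if it is not found."""
    if path is None:
        path = os.environ.get("PATH", "")
    for directory in path.split(os.pathsep):
        if not directory:
            continue  # skip empty PATH entries
        candidate = os.path.join(directory, name)
        if os.path.isfile(candidate) and os.access(candidate, os.X_OK):
            return candidate
    return None


if __name__ == "__main__":
    for tool in ("firefox", "geckodriver"):
        print("%s -> %s" % (tool, find_on_path(tool) or "not found on PATH"))
```

If geckodriver is reported as not found, the mv step above probably did not run, or /usr/bin is missing from the PATH of the user running the crawler.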
Now for the crawler itself:
I use the PyCharm IDE.
1. middlewares.py
from random import choice

from scrapy.http import HtmlResponse
from selenium import webdriver as wb
from selenium.webdriver import DesiredCapabilities
from selenium.webdriver import FirefoxOptions

ua_list = [
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Ubuntu Chromium/48.0.2564.82 Chrome/48.0.2564.82 Safari/537.36",
    "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.36",
    "Mozilla/5.0 (Windows NT 10.0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/40.0.2214.93 Safari/537.36",
    "Mozilla/5.0 (X11; OpenBSD i386) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0.1985.125 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/32.0.1664.3 Safari/537.36"
]

# Note: dcap is built here but is never passed to the driver below,
# so as written these settings have no effect.
dcap = dict(DesiredCapabilities.FIREFOX)
dcap["firefox.page.settings.resourceTimeout"] = 15
dcap["firefox.page.settings.loadImages"] = False
dcap["firefox.page.settings.userAgent"] = choice(ua_list)


class SeleniumMiddleware(object):
    def process_request(self, request, spider):
        # Headless mode keeps the browser window from popping up. It is better
        # to leave it off at first: a window popping up shows the request worked.
        opts = FirefoxOptions()
        opts.add_argument("--headless")
        # Newer Selenium releases take options=opts instead of firefox_options=opts
        driver = wb.Firefox(firefox_options=opts)
        driver.get(request.url)
        driver.implicitly_wait(3)
        page = driver.page_source  # .decode('utf-8','ignore')
        # quit() ends the whole session, including the geckodriver process
        driver.quit()
        return HtmlResponse(request.url, body=page, encoding='utf-8', request=request)
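The dcap keys in the middleware resemble PhantomJS-style page settings, and the dict is never handed to the Firefox driver. One possible way to get the same effect with Firefox (a random user agent and no image loading) is to set Firefox preferences on the options object. The sketch below is my own assumption about how that could look, not something the original project does: build_firefox_prefs and make_options are hypothetical helpers, while general.useragent.override and permissions.default.image are regular Firefox preference names.

```python
from random import choice


def build_firefox_prefs(ua_list, load_images=False):
    """Return Firefox preferences (about:config keys) as a plain dict."""
    prefs = {"general.useragent.override": choice(ua_list)}
    if not load_images:
        # 2 means "block all images"
        prefs["permissions.default.image"] = 2
    return prefs


def make_options(prefs, headless=True):
    """Build a FirefoxOptions object carrying the given preferences."""
    # Imported lazily so build_firefox_prefs can be tried without Selenium installed
    from selenium.webdriver import FirefoxOptions

    opts = FirefoxOptions()
    if headless:
        opts.add_argument("--headless")
    for key, value in prefs.items():
        opts.set_preference(key, value)
    return opts
```

Inside process_request the options would then be created with opts = make_options(build_firefox_prefs(ua_list)) and passed to wb.Firefox exactly as before.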
2. items.py
import scrapy


class TopWenItem(scrapy.Item):
    title = scrapy.Field()
    content = scrapy.Field()
3. pipelines.py
import codecs
import json


# Store the items as JSON, one object per line
class JsonWithEncodingCnblogsPipeline(object):
    def __init__(self):
        self.file = codecs.open('xx.json', 'w', encoding='utf-8')

    def process_item(self, item, spider):
        line = json.dumps(dict(item), ensure_ascii=False) + "\n"
        self.file.write(line)
        return item

    def close_spider(self, spider):
        # Scrapy calls close_spider on a pipeline when the spider finishes
        self.file.close()
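Because the pipeline writes one JSON object per line rather than a single JSON array, the output file cannot be loaded with a single json.load call. A minimal sketch for reading it back, assuming the file was written the way process_item does it (load_json_lines is a helper name I made up):

```python
import io
import json


def load_json_lines(path):
    """Read a UTF-8 file holding one JSON object per line and return a list of dicts."""
    items = []
    with io.open(path, "r", encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:  # ignore blank lines
                items.append(json.loads(line))
    return items


if __name__ == "__main__":
    # 'xx.json' is the file name used by the pipeline above
    for row in load_json_lines("xx.json"):
        print(row)
```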
4. scrapy.py (the spider file; giving it a different name is safer, since a module called scrapy.py can shadow the scrapy package on import)
import scrapy

# Assumed module path. MovieItem must be defined in items.py with the fields
# rank, title, link, rate and quote used below.
from spiderdemo.items import MovieItem


class GsSpider(scrapy.Spider):
    name = "gsspider"
    start_urls = [
        'http://movie.douban.com/top250/'
    ]
    # start_urls = []
    # file = open('C:/test-scrapy/tutorial/tutorial/url.txt')
    # for word in file:
    #     word = word.strip()
    #     url = 'http://www.gsdata.cn/query/wx?q=' + word
    #
    #     start_urls.append(url)

    def parse(self, response):
        items = []
        for info in response.xpath('//div[@class="item"]'):
            item = MovieItem()
            item['rank'] = info.xpath('div[@class="pic"]/em/text()').extract()
            item['title'] = info.xpath('div[@class="pic"]/a/img/@alt').extract()
            item['link'] = info.xpath('div[@class="pic"]/a/@href').extract()
            item['rate'] = info.xpath('div[@class="info"]/div[@class="bd"]/div[@class="star"]/span/text()').extract()
            item['quote'] = info.xpath('div[@class="info"]/div[@class="bd"]/p[@class="quote"]/span/text()').extract()
            items.append(item)
            yield item
        # Pagination
        next_page = response.xpath('//span[@class="next"]/a/@href')
        if next_page:
            url = response.urljoin(next_page[0].extract())
            # Crawl each following page
            yield scrapy.Request(url, self.parse)
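The pagination step depends on response.urljoin turning the relative href of the "next" link into an absolute URL. As far as I understand, it follows the same resolution rules as the standard library's urljoin, so the behaviour can be tried without running a crawl. The href value below is a made-up example of a query-only link, not something taken from the real page.

```python
try:
    from urllib.parse import urljoin  # Python 3
except ImportError:
    from urlparse import urljoin  # Python 2.7

page_url = "http://movie.douban.com/top250/"
next_href = "?start=25&filter="  # hypothetical relative href of the "next" link

# A query-only href keeps the path of the current page and replaces the query string
next_url = urljoin(page_url, next_href)
print(next_url)
```

This is why the spider can pass the raw href to response.urljoin without building the URL by hand.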
Run scrapy crawl gsspider from the project directory /usr/home/spiderdemo/spiderdemo/. The third argument is not a file name; it is the name you set in the spider.
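For the crawl to use the Selenium middleware and the JSON pipeline, both have to be enabled in the project's settings.py. The fragment below is a sketch under assumptions: the package name spiderdemo is guessed from the project path above, and the priority numbers are just the usual template defaults, not values from the original project.

```python
# settings.py (fragment)

# Route every request through the Selenium middleware.
# "spiderdemo" is an assumed package name; use your own project's name.
DOWNLOADER_MIDDLEWARES = {
    "spiderdemo.middlewares.SeleniumMiddleware": 543,
}

# Send every scraped item to the JSON pipeline.
ITEM_PIPELINES = {
    "spiderdemo.pipelines.JsonWithEncodingCnblogsPipeline": 300,
}
```

Lower numbers run earlier, which only matters once several middlewares or pipelines are enabled at the same time.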
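Troubleshooting summary:
1. This error is also caused by Linux having no graphical interface; fix it as described under problem 3.
2. from OpenSSL._util import lib as pyOpenSSLlib
ImportError: No module named _util
There are two possible causes: either your pyOpenSSL version is too old, or the installation is broken and some of its modules are missing. I suggest deleting the OpenSSL directory inside the Python directory and then installing pyopenssl again.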
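3. This is caused by Linux having no graphical interface. There are two ways to fix it:
Method one: install Xvfb
yum install -y Xvfb
Start Xvfb
Xvfb -ac :7 -screen 0 1280x1024x8 &
export DISPLAY=:7 (use the same display number as in the previous step)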
Method two: run Firefox in headless mode
from selenium import webdriver
from selenium.webdriver import FirefoxOptions
opts = FirefoxOptions()
opts.add_argument("--headless")
browser = webdriver.Firefox(firefox_options=opts)
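The two methods can be combined so that the same code runs on a desktop and on a server: use the display when one is configured and fall back to headless mode otherwise. This is just a sketch of that idea; needs_headless is a hypothetical helper, and it only looks at the DISPLAY environment variable that the Xvfb method exports.

```python
import os


def needs_headless(environ=None):
    """Return True when no X display is configured, i.e. DISPLAY is unset or empty."""
    if environ is None:
        environ = os.environ
    return not environ.get("DISPLAY")


if __name__ == "__main__":
    print("headless needed: %s" % needs_headless())
```

In the snippet above, opts.add_argument("--headless") would then be wrapped in if needs_headless(): so the browser window still pops up on machines that have a screen.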