
python2.7.9 + scrapy + selenium on Windows 7 and Linux

1. Install the Python 2.7.5 environment

On Windows, just download the Python 2.7.5 installer and click through the default steps.

2. Install pip

Windows: before installing pip, make sure Python and the easy_install tool are already installed. If Python installed correctly, easy_install lives under C:\Python27\Scripts. Open a command prompt, cd into the Scripts folder of the Python install directory, and run easy_install pip. Once that finishes, run pip list in cmd; if it succeeds, pip is working.

Linux: if pip is already present and only needs upgrading: sudo pip install --upgrade pip

If it is not installed yet:

wget https://bootstrap.pypa.io/get-pip.py  

python get-pip.py

3. Install scrapy

sudo pip install scrapy

4. Install selenium

sudo pip install selenium

5. Install pyopenssl

sudo pip install pyopenssl

6. On Windows you also need the win32api module

In cmd (the module is named win32api, but the PyPI package that provides it is pypiwin32 / pywin32):

pip install pypiwin32

If other imports error out later, just search for the missing module and pip install it the same way.

7. Install Firefox, since the spider drives a real browser; on Linux:

yum install firefox

firefox --version

8. Download the Firefox driver (geckodriver)

How to install geckodriver:
Download address: (link)
I downloaded geckodriver-v0.20.0-linux64.tar.gz.
Installation is similar to chromedriver:
In a terminal, cd to the download directory and run the following to extract the binary and move it onto the PATH (/usr/bin):

tar -xvzf geckodriver*
chmod +x geckodriver
sudo mv geckodriver /usr/bin/
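Before wiring geckodriver into Selenium, it can help to confirm the binary is actually visible on the PATH. A stdlib-only sanity check (written to run on both Python 2, as in this tutorial, and Python 3) might look like:

```python
# Locate the geckodriver binary on the PATH (standard library only).
try:
    from shutil import which                              # Python 3.3+
except ImportError:
    from distutils.spawn import find_executable as which  # Python 2

path = which("geckodriver")
print(path or "geckodriver not found on PATH")
```

If this prints a path such as /usr/bin/geckodriver, Selenium will be able to launch Firefox.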

Now on to the spider itself:

I use the PyCharm IDE.

1. middlewares.py

from random import choice

from selenium import webdriver as wb
from scrapy.http import HtmlResponse
from selenium.webdriver import DesiredCapabilities
from selenium.webdriver import FirefoxOptions

ua_list = [
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Ubuntu Chromium/48.0.2564.82 Chrome/48.0.2564.82 Safari/537.36",
    "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.36",
    "Mozilla/5.0 (Windows NT 10.0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/40.0.2214.93 Safari/537.36",
    "Mozilla/5.0 (X11; OpenBSD i386) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0.1985.125 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/32.0.1664.3 Safari/537.36"
]

# The "firefox.page.settings.*" capability keys that often appear in posts
# from this era are PhantomJS-style settings with no effect on Firefox, and
# the DesiredCapabilities dict was never passed to the driver anyway.
# A FirefoxProfile sets the user agent and disables images instead.


class SeleniumMiddleware(object):

    def process_request(self, request, spider):

        # Headless mode keeps the browser window from popping up. While
        # debugging, leave it off: a visible window confirms the request works.
        opts = FirefoxOptions()
        opts.add_argument("--headless")

        profile = wb.FirefoxProfile()
        profile.set_preference("general.useragent.override", choice(ua_list))
        profile.set_preference("permissions.default.image", 2)  # skip images

        driver = wb.Firefox(firefox_profile=profile, firefox_options=opts)
        try:
            driver.get(request.url)
            driver.implicitly_wait(3)
            page = driver.page_source
        finally:
            driver.quit()  # quit() (unlike close()) also ends the geckodriver process

        return HtmlResponse(request.url, body=page, encoding='utf-8', request=request)
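The middleware only takes effect once it is registered in the project's settings.py, a step the post skips. Assuming the project module is named spiderdemo (the run path mentioned later suggests so), the entry would look roughly like:

```python
# settings.py -- enable the Selenium download middleware.
# The module path 'spiderdemo.middlewares' is an assumption; match it
# to your own project name.
DOWNLOADER_MIDDLEWARES = {
    'spiderdemo.middlewares.SeleniumMiddleware': 543,
}
```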

2. items.py

import scrapy

class TopWenItem(scrapy.Item):
    title = scrapy.Field()
    content = scrapy.Field()

3. pipelines.py

import json
import codecs

# Store items as JSON lines
class JsonWithEncodingCnblogsPipeline(object):
    def __init__(self):
        self.file = codecs.open('xx.json', 'w', encoding='utf-8')

    def process_item(self, item, spider):
        line = json.dumps(dict(item), ensure_ascii=False) + "\n"
        self.file.write(line)
        return item

    def close_spider(self, spider):
        # Scrapy calls close_spider() automatically; the spider_closed()
        # name used in the original post is never invoked, so the file
        # handle would leak.
        self.file.close()
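Likewise, the pipeline must be activated in settings.py or Scrapy will never call it (the module path is again an assumption to adapt):

```python
# settings.py -- activate the JSON pipeline.
# Lower numbers run earlier in the pipeline chain (range 0-1000).
ITEM_PIPELINES = {
    'spiderdemo.pipelines.JsonWithEncodingCnblogsPipeline': 300,
}
```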

4. The spider, e.g. spiders/gsspider.py (do not actually name the file scrapy.py, as that would shadow the scrapy package itself)

import scrapy

# Assumed import path; note that the items.py shown above defines
# TopWenItem, while this spider uses a MovieItem -- define whichever
# Item class (with matching fields) your project needs.
from spiderdemo.items import MovieItem


class GsSpider(scrapy.Spider):

    name = "gsspider"

    start_urls = [
        'http://movie.douban.com/top250/'
    ]
    # Alternatively, build start_urls from a keyword file:
    # start_urls = []
    # file = open('C:/test-scrapy/tutorial/tutorial/url.txt')
    # for word in file:
    #     word = word.strip()
    #     url = 'http://www.gsdata.cn/query/wx?q=' + word
    #     start_urls.append(url)

    def parse(self, response):
        for info in response.xpath('//div[@class="item"]'):
            item = MovieItem()
            item['rank'] = info.xpath('div[@class="pic"]/em/text()').extract()
            item['title'] = info.xpath('div[@class="pic"]/a/img/@alt').extract()
            item['link'] = info.xpath('div[@class="pic"]/a/@href').extract()
            item['rate'] = info.xpath('div[@class="info"]/div[@class="bd"]/div[@class="star"]/span/text()').extract()
            item['quote'] = info.xpath('div[@class="info"]/div[@class="bd"]/p[@class="quote"]/span/text()').extract()
            yield item
        # Pagination
        next_page = response.xpath('//span[@class="next"]/a/@href')
        if next_page:
            url = response.urljoin(next_page[0].extract())
            # Crawl the next page
            yield scrapy.Request(url, self.parse)
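response.urljoin() resolves the relative href of the "next" link against the current page URL. The same resolution can be sketched with the standard library (the ?start=25 href is an assumption about what Douban's next-page link looks like):

```python
try:
    from urllib.parse import urljoin   # Python 3
except ImportError:
    from urlparse import urljoin       # Python 2, as used in this tutorial

base = 'http://movie.douban.com/top250/'
next_href = '?start=25&filter='        # hypothetical value of the next-page @href
print(urljoin(base, next_href))        # http://movie.douban.com/top250/?start=25&filter=
```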

Run scrapy crawl gsspider from the project directory (/usr/home/spiderdemo/spiderdemo/). The third argument is not a filename; it is the name attribute you set in the spider.

Problem summary:

1. This error is likewise caused by Linux having no graphical display; fix it as described under problem 3 below.

2. from OpenSSL._util import lib as pyOpenSSLlib

ImportError: No module named _util

This has two possible causes: your pyOpenSSL version is too old, or the OpenSSL package installed under Python's site-packages is incomplete. In the latter case, delete the OpenSSL directory inside the Python directory and reinstall pyopenssl.

3. This is caused by Linux having no graphical display. There are two fixes:

First: install Xvfb

yum install -y Xvfb

Start Xvfb:

Xvfb -ac :7 -screen 0 1280x1024x8 &

export DISPLAY=:7  (use the same display number as in the previous step)

Second: run Firefox in headless mode:

from selenium.webdriver import FirefoxOptions

opts = FirefoxOptions()
opts.add_argument("--headless")
browser = webdriver.Firefox(firefox_options=opts)