Scraping Dianping (大眾點評) with Scrapy
阿新 • Published 2018-04-18
I've been craving barbecue lately and wanted to find out where in Shenzhen the good BBQ places are, so I wrote a crawler. The listing pages are static HTML, but the site does have anti-scraping measures, so I set up counter-measures in settings.py and middlewares.py.
Settings:
# -*- coding: utf-8 -*-

# Scrapy settings for the dazhong project
#
# Only the settings that matter here are kept. You can find more settings
# consulting the documentation:
#
#     https://doc.scrapy.org/en/latest/topics/settings.html
#     https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
#     https://doc.scrapy.org/en/latest/topics/spider-middleware.html

BOT_NAME = 'dazhong'

SPIDER_MODULES = ['dazhong.spiders']
NEWSPIDER_MODULE = 'dazhong.spiders'

# Crawl responsibly by identifying yourself (and your website) on the user-agent
USER_AGENT = 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.87 UBrowser/6.2.3964.2 Safari/537.36'

# Obey robots.txt rules
ROBOTSTXT_OBEY = True

# Slow down so the anti-scraping checks are less likely to trigger
# See https://doc.scrapy.org/en/latest/topics/settings.html#download-delay
DOWNLOAD_DELAY = 10

# Replace the built-in User-Agent middleware with the custom rotating one
# See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
DOWNLOADER_MIDDLEWARES = {
    'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,
    'dazhong.middlewares.MyUserAgentMiddleware': 400,
}

# Pipeline that writes the scraped items to an xlsx file
# See https://doc.scrapy.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
    'dazhong.pipelines.DazhongPipeline': 200,
}

# Pool of user agents for MyUserAgentMiddleware to choose from
MY_USER_AGENT = [
    'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.87 UBrowser/6.2.3964.2 Safari/537.36',
]
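Two details worth noticing: mapping the built-in scrapy.downloadermiddlewares.useragent.UserAgentMiddleware to None disables it, so the custom MyUserAgentMiddleware becomes the only thing setting the User-Agent header; and with a single string in MY_USER_AGENT the "random" rotation always picks the same value. If you want real rotation, the pool can be widened, e.g. (the extra UA strings below are examples of common desktop browsers, not part of the original project):

# settings.py — a wider pool makes the random choice in MyUserAgentMiddleware meaningful
MY_USER_AGENT = [
    'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.87 UBrowser/6.2.3964.2 Safari/537.36',
    # the two strings below are illustrative additions
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/65.0.3325.181 Safari/537.36',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_4) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/11.1 Safari/605.1.15',
]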
Items:
import scrapy


class DazhongItem(scrapy.Item):
    name = scrapy.Field()         # shop name
    location = scrapy.Field()     # address
    people = scrapy.Field()       # number of reviews
    money = scrapy.Field()        # average spend per person
    taste = scrapy.Field()        # cuisine tag (e.g. 烤肉)
    envir = scrapy.Field()        # environment score
    taste_score = scrapy.Field()  # taste score
    service = scrapy.Field()      # service score
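A DazhongItem behaves like a dict restricted to the declared fields; assigning an undeclared key raises KeyError. A quick sketch (the values are made up):

from dazhong.items import DazhongItem

item = DazhongItem()
item['name'] = '某烤肉店'  # hypothetical shop name
item['people'] = '2048'    # counts come back from the page as strings
print(dict(item))          # {'name': '某烤肉店', 'people': '2048'}
# item['rating'] = '4.5'   # KeyError: DazhongItem does not support field: rating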
Spider:
# -*- coding: utf-8 -*-
import scrapy
from bs4 import BeautifulSoup
from scrapy.http import Request

from dazhong.items import DazhongItem


class DzSpider(scrapy.Spider):
    name = 'dz'
    allowed_domains = ['www.dianping.com']

    # Shenzhen BBQ listings; page n lives at .../g114p<n>
    first_url = 'http://www.dianping.com/shenzhen/ch10/g114'
    last_url = 'p'

    def start_requests(self):
        for i in range(1, 45):
            url = self.first_url + self.last_url + str(i)
            yield Request(url, self.parse)

    def parse(self, response):
        soup = BeautifulSoup(response.body.decode('UTF-8'), 'lxml')
        for site in soup.find_all('div', class_='txt'):
            item = DazhongItem()
            try:
                item['name'] = site.find('div', class_='tit').find('h4').get_text()
                item['location'] = site.find('div', class_='tag-addr').find('span', class_='addr').get_text()
                item['people'] = site.find('div', class_='comment').find('a').find('b').get_text()
                item['money'] = site.find('div', class_='comment').find_all('a')[1].find('b').get_text()
                item['taste'] = site.find('div', class_='tag-addr').find('a').find('span').get_text()
                item['envir'] = site.find('span', class_='comment-list').find_all('span')[1].find('b').get_text()
                item['taste_score'] = site.find('span', class_='comment-list').find_all('span')[0].find('b').get_text()
                item['service'] = site.find('span', class_='comment-list').find_all('span')[2].find('b').get_text()
                yield item
            except AttributeError:
                # shops missing any field are skipped (see note below)
                pass
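The selector chain is easiest to verify offline. Below is a minimal HTML fragment I wrote by hand to mirror the classes the spider looks for; the real Dianping markup is bulkier, so treat this as a sketch of the structure, not the actual page:

from bs4 import BeautifulSoup

# invented fragment matching the classes used in parse()
html = '''
<div class="txt">
  <div class="tit"><h4>某烤肉店</h4></div>
  <div class="comment">
    <a><b>1234</b></a><a><b>¥98</b></a>
    <span class="comment-list">
      <span>口味<b>8.9</b></span><span>環境<b>8.5</b></span><span>服務<b>8.7</b></span>
    </span>
  </div>
  <div class="tag-addr">
    <a><span>烤肉</span></a><span class="addr">南山區某某路</span>
  </div>
</div>
'''

site = BeautifulSoup(html, 'lxml').find('div', class_='txt')
print(site.find('div', class_='tit').find('h4').get_text())                               # 某烤肉店
print(site.find('div', class_='comment').find_all('a')[1].find('b').get_text())           # ¥98
print(site.find('span', class_='comment-list').find_all('span')[2].find('b').get_text())  # 8.7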
Pipeline:
from openpyxl import Workbook


class DazhongPipeline(object):
    def __init__(self):
        # create the workbook and write the header row
        self.wb = Workbook()
        self.ws = self.wb.active
        self.ws.append(['店鋪名稱', '地點', '評論人數', '平均消費', '口味', '環境評分', '口味評分', '服務評分'])

    def process_item(self, item, spider):
        # flatten one item into a row and append it to the sheet
        line = [item['name'], item['location'], item['people'], item['money'],
                item['taste'], item['envir'], item['taste_score'], item['service']]
        self.ws.append(line)
        self.wb.save('dazhong.xlsx')  # save after every item
        return item

    def close_spider(self, spider):
        # one final save when the spider finishes
        self.wb.save('dazhong.xlsx')
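Saving the workbook on every item is wasteful, but it means a crash mid-crawl loses nothing. Once the run ends you can spot-check the spreadsheet; a minimal sketch, assuming dazhong.xlsx is in the working directory:

from openpyxl import load_workbook

wb = load_workbook('dazhong.xlsx')
ws = wb.active
for row in ws.iter_rows(min_row=2):  # min_row=2 skips the header row
    print([cell.value for cell in row])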
Middlewares:
import random

from scrapy.downloadermiddlewares.useragent import UserAgentMiddleware


class MyUserAgentMiddleware(UserAgentMiddleware):
    def __init__(self, user_agent):
        self.user_agent = user_agent

    @classmethod
    def from_crawler(cls, crawler):
        # pull the UA pool out of settings.py
        return cls(user_agent=crawler.settings.get('MY_USER_AGENT'))

    def process_request(self, request, spider):
        # stamp a randomly chosen UA onto every outgoing request
        agent = random.choice(self.user_agent)
        request.headers['User-Agent'] = agent
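The middleware can also be exercised outside a crawl to confirm the rotation works; a quick sketch with two placeholder UA strings (Scrapy stores header values as bytes, hence the b'...' output):

from scrapy.http import Request
from dazhong.middlewares import MyUserAgentMiddleware

mw = MyUserAgentMiddleware(user_agent=['UA-one', 'UA-two'])  # placeholder pool
req = Request('http://www.dianping.com/shenzhen/ch10/g114p1')
mw.process_request(req, spider=None)
print(req.headers['User-Agent'])  # b'UA-one' or b'UA-two'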
Shops with no environment or service score get skipped by the except clause above; scraping them would be pointless anyway.
Looking over the results, I've decided to go eat at 姜虎東 (the Korean BBQ chain).