
Scraping Dianping Data

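Comparing the links for different cities shows that they differ mainly in the rankId parameter: each city has its own ID. Once the ID for a city is known, the rest of the scraping can proceed.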

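The city IDs obtained are listed below (in order: Shanghai, Beijing, Guangzhou, Shenzhen, Tianjin, Hangzhou, Nanjing, Suzhou, Chengdu, Wuhan, Chongqing, Xi'an):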

["上海", "fce2e3a36450422b7fad3f2b90370efd71862f838d1255ea693b953b1d49c7c0"],
["北京", "d5036cf54fcb57e9dceb9fefe3917fff71862f838d1255ea693b953b1d49c7c0"],
["廣州", "e749e3e04032ee6b165fbea6fe2dafab71862f838d1255ea693b953b1d49c7c0"],
["深圳", "e049aa251858f43d095fc4c61d62a9ec71862f838d1255ea693b953b1d49c7c0"],
["天津", "2e5d0080237ff3c8f5b5d3f315c7c4a508e25c702ab1b810071e8e2c39502be1"],
["杭州", "91621282e559e9fc9c5b3e816cb1619c71862f838d1255ea693b953b1d49c7c0"],
["南京", "d6339a01dbd98141f8e684e1ad8af5c871862f838d1255ea693b953b1d49c7c0"],
["蘇州", "536e0e568df850d1e6ba74b0cf72e19771862f838d1255ea693b953b1d49c7c0"],
["成都", "c950bc35ad04316c76e89bf2dc86bfe071862f838d1255ea693b953b1d49c7c0"],
["武漢", "d96a24c312ed7b96fcc0cedd6c08f68c08e25c702ab1b810071e8e2c39502be1"],
["重慶", "6229984ceb373efb8fd1beec7eb4dcfd71862f838d1255ea693b953b1d49c7c0"],
["西安", "ad66274c7f5f8d27ffd7f6b39ec447b608e25c702ab1b810071e8e2c39502be1"]

Fetching the Pages

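Scraping analysis
Inspecting the site in the browser shows that it loads its data through Ajax requests. All of the data comes from the following endpoint (the same URL used in the script later in this post):

http://www.dianping.com/mylist/ajax/shoprank?rankId=<rankId>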

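As with the earlier request, only the rankId needs to be replaced to fetch data for different cities. The response is JSON, so parsing it is enough to obtain the required fields. At the time of writing Dianping did not impose any particularly strict anti-scraping limits on this endpoint, so rotating the User-Agent was enough to get past them.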
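As a rough sketch of the two steps just described (substituting a rankId and parsing the JSON), the snippet below builds the request URL for one city and pulls a few fields out of a response body. The URL prefix and the `shopBeans`, `shopId`, `shopName` and `avgPrice` keys are taken from the full script in this post; the helper names `build_rank_url` and `extract_shops` and the sample payload are made up for illustration, and a real response carries more fields.

```python
import json

# URL prefix used by the script in this post; only rankId changes per city.
RANK_URL_PREFIX = "http://www.dianping.com/mylist/ajax/shoprank?rankId="


def build_rank_url(rank_id):
    """Return the ranking URL for one city (hypothetical helper)."""
    return RANK_URL_PREFIX + rank_id


def extract_shops(body):
    """Parse a JSON response body and keep a few fields per shop.

    Hypothetical helper: .get() is used so that a missing key yields
    None instead of raising KeyError.
    """
    shops = json.loads(body).get("shopBeans") or []
    return [
        {
            "shopId": shop.get("shopId"),
            "shopName": shop.get("shopName"),
            "avgPrice": shop.get("avgPrice"),
        }
        for shop in shops
    ]


# Made-up payload in the shape the script expects, for illustration only.
sample_body = '{"shopBeans": [{"shopId": "123", "shopName": "Demo", "avgPrice": 88}]}'
sample_shops = extract_shops(sample_body)
```

Keeping URL construction and parsing in small functions like these makes it possible to check the parsing logic against a saved response without sending any requests.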

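We fetched data for the top 100 shops in each of Shanghai, Beijing, Guangzhou, Shenzhen, Tianjin, Hangzhou, Nanjing, Suzhou, Chengdu, Wuhan, Chongqing and Xi'an, and then analyzed the resulting dataset; see 《大眾點評資料分析》 (Dianping Data Analysis).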

Request headers

import random

USER_AGENT_LIST = [
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/22.0.1207.1 Safari/537.1",
    "Mozilla/5.0 (X11; CrOS i686 2268.111.0) AppleWebKit/536.11 (KHTML, like Gecko) Chrome/20.0.1132.57 Safari/536.11",
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.6 (KHTML, like Gecko) Chrome/20.0.1092.0 Safari/536.6",
    "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.6 (KHTML, like Gecko) Chrome/20.0.1090.0 Safari/536.6",
    "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/19.77.34.5 Safari/537.1",
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.9 Safari/536.5",
]

head = {
    'User-Agent': random.choice(USER_AGENT_LIST)  # picked at random
}
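Note that a headers dict built as above is evaluated once, so every request in the same run sends the same User-Agent. If the intent is to rotate per request, one option is to build the headers inside a function, as in this minimal sketch. `random_headers` is a hypothetical helper name, and the list is shortened here only for illustration.

```python
import random

# Shortened list for illustration; the full list appears in the script.
USER_AGENT_LIST = [
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/22.0.1207.1 Safari/537.1",
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.9 Safari/536.5",
]


def random_headers(rng=random):
    """Build a fresh headers dict with a randomly chosen User-Agent."""
    return {"User-Agent": rng.choice(USER_AGENT_LIST)}


# Each call picks again, so consecutive requests may use different agents:
# requests.get(url, headers=random_headers(), timeout=10)
```

The optional `rng` argument only exists so that a seeded `random.Random` instance can be passed in when reproducible behavior is wanted.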

Full code:

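#!/usr/bin/env python
# encoding: utf-8
"""
@version: v1.0
@author: W_H_J
@license: Apache Licence
@contact: [email protected]
@site:
@software: PyCharm
@file: dazhongFood.py
@time: 2018/7/18 15:46
@describe: scrape food listings from Dianping
list_city: city ID values, in order: Shanghai, Beijing, Guangzhou, Shenzhen, Tianjin, Hangzhou, Nanjing, Suzhou, Chengdu, Wuhan, Chongqing, Xi'an
"""
import json
import random

import requests
# DBHelper is the author's own database helper module (not shown in this post)
from base.dbhelper import DBHelper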

# City list: [city name, rankId]
list_city = [
    ["上海", "fce2e3a36450422b7fad3f2b90370efd71862f838d1255ea693b953b1d49c7c0"],
    ["北京", "d5036cf54fcb57e9dceb9fefe3917fff71862f838d1255ea693b953b1d49c7c0"],
    ["廣州", "e749e3e04032ee6b165fbea6fe2dafab71862f838d1255ea693b953b1d49c7c0"],
    ["深圳", "e049aa251858f43d095fc4c61d62a9ec71862f838d1255ea693b953b1d49c7c0"],
    ["天津", "2e5d0080237ff3c8f5b5d3f315c7c4a508e25c702ab1b810071e8e2c39502be1"],
    ["杭州", "91621282e559e9fc9c5b3e816cb1619c71862f838d1255ea693b953b1d49c7c0"],
    ["南京", "d6339a01dbd98141f8e684e1ad8af5c871862f838d1255ea693b953b1d49c7c0"],
    ["蘇州", "536e0e568df850d1e6ba74b0cf72e19771862f838d1255ea693b953b1d49c7c0"],
    ["成都", "c950bc35ad04316c76e89bf2dc86bfe071862f838d1255ea693b953b1d49c7c0"],
    ["武漢", "d96a24c312ed7b96fcc0cedd6c08f68c08e25c702ab1b810071e8e2c39502be1"],
    ["重慶", "6229984ceb373efb8fd1beec7eb4dcfd71862f838d1255ea693b953b1d49c7c0"],
    ["西安", "ad66274c7f5f8d27ffd7f6b39ec447b608e25c702ab1b810071e8e2c39502be1"],
]

# Request headers
USER_AGENT_LIST = [
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/22.0.1207.1 Safari/537.1",
    "Mozilla/5.0 (X11; CrOS i686 2268.111.0) AppleWebKit/536.11 (KHTML, like Gecko) Chrome/20.0.1132.57 Safari/536.11",
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.6 (KHTML, like Gecko) Chrome/20.0.1092.0 Safari/536.6",
    "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.6 (KHTML, like Gecko) Chrome/20.0.1090.0 Safari/536.6",
    "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/19.77.34.5 Safari/537.1",
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.9 Safari/536.5",
]
head = {
    # Picked at random, once, when the module is loaded
    'User-Agent': random.choice(USER_AGENT_LIST)
}

flag = 0  # number of shops parsed
code = 0  # number of rows inserted

# Parse
def findFood(city, data):
    global flag, code
    mysql_db = DBHelper()
    for shop in json.loads(data)["shopBeans"]:
        flag += 1
        # Full address
        shopAddress = shop["address"]
        # Average spend per person
        avgPrice = shop["avgPrice"]
        # Shop picture
        defaultPic = shop["defaultPic"]
        # Category name
        mainCategoryName = shop["mainCategoryName"]
        # Region name
        mainRegionName = shop["mainRegionName"]
        # Taste score
        tasteScore = shop["score1"]
        # Environment score
        environmentScore = shop["score2"]
        # Service score
        serviceScore = shop["score3"]
        # Shop ID
        shopId = shop["shopId"]
        # Shop URL (str() guards against a numeric shopId)
        shopUrl = "http://www.dianping.com/shop/" + str(shopId)
        # Shop name
        shopName = shop["shopName"]
        # Shop star rating
        shopPower = shop["shopPower"]
        sql = '''insert into dazhongfood(shopUrl,shopName, shopId, shopPower, mainRegionName, mainCategoryName, tasteScore, environmentScore, serviceScore, avgPrice, shopAddress, defaultPic, city) VALUES (%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s)'''
        params = (shopUrl, shopName, shopId, shopPower, mainRegionName, mainCategoryName, tasteScore, environmentScore, serviceScore, avgPrice, shopAddress, defaultPic, city)
        try:
            mysql_db.insert(sql, *params)
            code += 1
            print("----- inserted:", code, "rows ------")
        except Exception:
            # Any insert failure is treated as a duplicate row
            print("Already exists, not inserting again!")
    print("Total rows:", flag)

# Fetch
def foodSpider(city_list):
    city = city_list[0]
    rank_id = city_list[1]
    base_url = "http://www.dianping.com/mylist/ajax/shoprank?rankId=" + rank_id
    html = requests.get(base_url, headers=head, timeout=10)
    findFood(city=city, data=html.text)


if __name__ == '__main__':
    for city_data in list_city:
        foodSpider(city_data)
The final results are stored in MySQL. (The complete dataset is in daZhongFood/data.)
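The script above treats any insert error as "already exists", which presumably relies on a unique key in the table. The post does not show `DBHelper` or the table schema, so the following is only a sketch of that idea under assumptions: it uses the standard-library `sqlite3` module as a stand-in for MySQL, a cut-down table with `shopId` as the primary key, and a hypothetical `insert_shop` helper. In MySQL the comparable statement would be `INSERT IGNORE`.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Assumed schema: a few of the script's columns, with shopId as the unique key.
conn.execute(
    "CREATE TABLE dazhongfood ("
    "shopId TEXT PRIMARY KEY, shopName TEXT, city TEXT)"
)


def insert_shop(conn, shop_id, shop_name, city):
    """Insert one row; return True if stored, False if shopId already existed."""
    cur = conn.execute(
        "INSERT OR IGNORE INTO dazhongfood (shopId, shopName, city) VALUES (?, ?, ?)",
        (shop_id, shop_name, city),
    )
    conn.commit()
    return cur.rowcount == 1


first = insert_shop(conn, "123", "Demo", "上海")
second = insert_shop(conn, "123", "Demo", "上海")  # duplicate, skipped
total = conn.execute("SELECT COUNT(*) FROM dazhongfood").fetchone()[0]
```

Letting the database skip duplicates explicitly keeps real errors (a lost connection, a malformed value) from being reported as "already exists".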

Final results

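A follow-up post, 《大眾點評熱門餐廳抓取與資料分析》 (Scraping and Data Analysis of Popular Dianping Restaurants), will cover the analysis of the scraped results. As with the dataset above, the analysis results are available on GitHub.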