Web Scraping: Slaying the Demons of Scraping Ctrip Hotel Information (Part 2)
阿新 • Published: 2019-02-06
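This post covers my experience scraping the hotel list pages. There were plenty of pitfalls here too: anti-scraping measures, and price data stored in a different place from the rest of the hotel data.
Analyzing the responses carefully up front gets you twice the result for half the effort.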
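1. Analysis shows that the hotel list is also loaded via AJAX and delivered as JSON. The prices are in the same JSON response but in a different location, and they are matched to the hotels by hotel id.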
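Matching prices to hotels by id can be sketched roughly as follows. This is only an illustration: the key names `id` and `hotelid` are assumptions (only `name` and `amount` appear in the script in this post), and the sample records are made up.

```python
def prices_by_hotel_id(price_entries):
    # Build a lookup table: hotel id -> price amount.
    # "hotelid" is a hypothetical key name, not confirmed by the response.
    return {str(entry["hotelid"]): entry["amount"] for entry in price_entries}


def attach_prices(hotels, price_entries):
    # Return copies of the hotel records with an "amount" field added.
    # Hotels without a matching price entry get None.
    lookup = prices_by_hotel_id(price_entries)
    return [{**hotel, "amount": lookup.get(str(hotel["id"]))} for hotel in hotels]


# Made-up sample data; note the two lists are in different orders.
hotels = [
    {"id": "101", "name": "Hotel A"},
    {"id": "102", "name": "Hotel B"},
    {"id": "103", "name": "Hotel C"},
]
price_entries = [
    {"hotelid": "102", "amount": "520"},
    {"hotelid": "101", "amount": "388"},
]

merged = attach_prices(hotels, price_entries)
for hotel in merged:
    print(hotel["name"], hotel["amount"])
```

A lookup keyed by id keeps working even if the two lists come back in different orders or with different lengths, which matching by list position does not.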
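2. Next, extract the fields you need and save them. Here they are written to both a CSV file and a MySQL database. Pay particular attention to the two points explained in the code comments. The MySQL connection is made with pymysql and the table was created in Navicat; remember that the table name and column names must match the inserted data one to one.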
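One possible way to keep the column names and the inserted values aligned is to build the statement and the parameter tuple from a single list of columns. The sketch below uses the `jdlist` table and column names from the script in this post; the helper names `build_insert` and `row_params` and the sample row are hypothetical. It only builds the SQL string and parameters, so it runs without a database.

```python
# Column names as used in the script's INSERT statement.
COLUMNS = ("jd_num", "jd_name", "jd_star", "jd_good", "jd_fen", "jd_jiage", "jd_link")


def build_insert(table, columns):
    # One %s placeholder per column; pymysql fills these in from the
    # parameter tuple passed to cursor.execute.
    placeholders = ", ".join(["%s"] * len(columns))
    return "INSERT INTO {} ({}) VALUES ({})".format(table, ", ".join(columns), placeholders)


def row_params(row, columns):
    # Raises KeyError if the row lacks a value for any column,
    # so a mismatch is caught before the query is sent.
    return tuple(row[column] for column in columns)


sql = build_insert("jdlist", COLUMNS)

# Made-up sample row for illustration only.
row = {
    "jd_num": 1,
    "jd_name": "Hotel A",
    "jd_star": "05",
    "jd_good": "Luxury",
    "jd_fen": "4.7",
    "jd_jiage": "388",
    "jd_link": "http://hotels.ctrip.com/hotel/0.html",
}
params = row_params(row, COLUMNS)
print(sql)
print(params)
```

With a live connection, the pair would be passed as `cursor.execute(sql, params)`, which also lets the driver handle quoting of values such as hotel names that contain apostrophes.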
import requests
import json
import re
import csv
import pymysql

# Connect to MySQL; rows are inserted and committed further down
conn = pymysql.Connect(host='localhost', port=3306, user='root', passwd='***', db='jiudian')
curor = conn.cursor()
lists = []
dicts = {}
ss = 0  # running count of hotels collected so far

url = "http://hotels.ctrip.com/Domestic/Tool/AjaxHotelList.aspx"
headers = {
    "Connection": "keep-alive",
    "origin": "http://hotels.ctrip.com",
    "Host": "hotels.ctrip.com",
    "referer": "http://hotels.ctrip.com/hotel/beijing1",
    "user-agent": "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3497.92 Safari/537.36",
}
# Matches a backslash that is not followed by '/', 'u' or '"'
regex = re.compile(r'\\(?![/u"])')
# Placeholders are left unquoted so that pymysql escapes the values itself
hot = "insert into jdlist(jd_num,jd_name,jd_star,jd_good,jd_fen,jd_jiage,jd_link) values(%s,%s,%s,%s,%s,%s,%s)"

for i in range(1, 20):
    data = {
        "StartTime": "2018-10-09",
        "DepTime": "2018-10-10",
        "RoomGuestCount": "1,1,0",
        "cityId": 1,
        "cityPY": " beijing",
        "cityCode": "010",
        "cityLat": 39.9105329229,
        "cityLng": 116.413784021,
        "page": i,
    }
    html = requests.post(url, headers=headers, data=data)
    # json.loads raised "ValueError: Invalid \escape: line 1 column 35442 (char 35441)".
    # The response contains sequences such as \xa0, and backslashes that are not
    # valid escapes (\http and the like) trigger this error. The fix is to double them:
    fixed = regex.sub(r"\\\\", html.text)
    aa = json.loads(fixed)
    # Python's eval function turns the list-like string into an actual list.
    # It is evaluated once per page rather than once per hotel.
    htllist = eval(aa["HotelMaiDianData"]["value"]["htllist"])
    # The script reads 25 hotels from each page
    for n in range(0, 25):
        hotel = aa["hotelPositionJSON"][n]
        dianming = hotel["name"]
        jiage = htllist[n]["amount"]
        xinji = hotel["star"][-2:]
        dangci = hotel["stardesc"]
        pingfen = hotel["score"]
        lianjie = "http://hotels.ctrip.com" + hotel["url"]
        ss += 1
        lists.append([ss, dianming, xinji, dangci, pingfen, jiage + " yuan", lianjie])
        # lists.append([s,"Hotel: "+name,"Stars: "+xinji,"Class: "+dangci,"Score: "+pingfen,"Price: "+jiage+" yuan"])
        dicts[ss] = ["Hotel: " + dianming, "Stars: " + xinji, "Class: " + dangci, "Score: " + pingfen, "Price: " + jiage + " yuan", "Link: " + lianjie]
        print("Fetching " + str(ss))
        curor.execute(hot, (ss, dianming, xinji, dangci, pingfen, jiage, lianjie))
        conn.commit()
        # mm=re.findall('.*?"amount":"(.*?)"}',jiage)

conn.close()

with open("bjjiudian.csv", "w", encoding="utf-8", newline="") as f:
    k = csv.writer(f, dialect="excel")
    k.writerow(["No.", "Hotel", "Stars", "Class", "Score", "Price", "Link"])
    for row in lists:
        k.writerow(row)
print(lists)
print(dicts)
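The two points flagged in the code comments (the backslash repair before `json.loads`, and turning the list-like string into a list) can be tried in isolation. The sketch below reuses the same regular expression as the script; the response fragment is made up, the helper name `repair_backslashes` is hypothetical, and `ast.literal_eval` is swapped in for `eval` as a stricter alternative, on the assumption that the string contains only Python literals.

```python
import ast
import json
import re

# Same pattern as in the script: a backslash not followed by '/', 'u' or '"'.
LONE_BACKSLASH = re.compile(r'\\(?![/u"])')


def repair_backslashes(text):
    # Double each lone backslash so json.loads reads it as a literal backslash.
    return LONE_BACKSLASH.sub(r"\\\\", text)


# Made-up response fragment: "\x" is not a legal JSON escape sequence.
raw = r'{"name": "A\xa0Hotel", "htllist": "[{\"amount\": \"388\"}]"}'

try:
    json.loads(raw)
    parsed_without_fix = True
except json.JSONDecodeError:
    parsed_without_fix = False

data = json.loads(repair_backslashes(raw))
# literal_eval accepts only literals (lists, dicts, strings, numbers),
# so it cannot run arbitrary code the way eval can.
prices = ast.literal_eval(data["htllist"])
print(parsed_without_fix, data["name"], prices)
```

Two things to keep in mind: after the repair, the `\xa0` sequence survives as literal text in the hotel name rather than becoming a non-breaking space, and the pattern also doubles valid JSON escapes such as `\n`, so those end up as literal backslash sequences too.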