To find a good, affordable place, I stayed up all night scraping more than 10,000 listings from a rental site

阿新 • Published: 2020-11-03

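The text and images in this article come from the internet and are for study and exchange only, with no commercial purpose. Copyright belongs to the original author; if there is any problem, please contact us promptly so we can handle it.

The following article is from python資料分析之禪, by 小dull鳥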

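Preface

I need to rent a place soon. To find one that is both good and affordable, I stayed up last night and scraped more than 7,000 listings from a rental site.

The hard part of this crawl was decoding the obfuscated digits. That got solved in the end, and the whole process is written up below.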

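I. Analyze the page and fetch the raw data

The URL is: https://bj.58.com/zufang/

The page contains two kinds of data:

The first kind is embedded directly in the page HTML
The second kind is JSON fetched via Ajax, then parsed and inserted into the page

For the first kind, the data is already in the page, so we only need to request the page, parse the HTML, and keep the fields we need.

In the raw page source, numeric fields such as layout and price do not display correctly: they have been obfuscated. Set that aside for now (decoding is covered later). The crawler code is as follows: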

import requests
from bs4 import BeautifulSoup

# Reuse one session so cookies from the first request carry over to the next
s = requests.Session()
s.headers = {
    'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
    'accept-language': 'en-US,zh;q=0.8,zh-CN;q=0.7,zh-TW;q=0.5,zh-HK;q=0.3,en;q=0.2',
    'referer': 'https://bj.58.com/zufang/',
    'upgrade-insecure-requests': '1',
    'user-agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:78.0) Gecko/20100101 Firefox/78.0',
}
# Visit the listing home page first (presumably to pick up cookies), then request page 1
s.get(url='https://bj.58.com/zufang/')
response = s.get('https://bj.58.com/zufang/pn1/?PGTID=0d300008-0000-1e0f-27d7-0238e53f2f24&ClickID=2')
soup = BeautifulSoup(response.text, 'html.parser')
# Listings appear under two <li> classes
strongbox = soup.find_all('li', class_='house-cell') + soup.find_all('li', class_='apartments')
for box in strongbox:
    # Strip newlines and spaces from each field
    name = box.find_all('a', class_="strongbox")[0].text.replace('\n', '').replace(' ', '').replace('~', '')
    room = box.find_all('p', class_='room')[0].text.replace('\n', '').replace(' ', '').replace('\xa0', '')
    # The room text is layout followed by area; '衛' ends the layout part
    layout = room.split('衛')[0] + '衛'
    area = room.split('衛')[1]
    infor = box.find_all('p', class_='infor')[0].text.replace('\n', '').replace(' ', '')
    money = box.find_all('div', class_='money')[0].text.replace('\n', '')
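
The `room` text packs layout and area into one string, and the loop above separates them with `split('衛')`. Here is a minimal sketch of that step as a helper, assuming the text has the form layout + `衛` + area. The name `split_room` and the sample strings are hypothetical, made up for illustration. It uses `str.partition` so that a string without `衛` does not raise an IndexError.

```python
def split_room(room):
    """Split a room string such as '2室1廳1衛58㎡' into (layout, area)."""
    layout, sep, area = room.partition('衛')
    if not sep:
        # No '衛' in the text: keep everything as layout, leave the area empty
        return room, ''
    return layout + sep, area


print(split_room('2室1廳1衛58㎡'))
print(split_room('開間'))
```

Whether the empty-area fallback is the right choice depends on how the rows are used later; skipping such listings would be an equally reasonable design.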

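For the second kind, we need to capture the network traffic to find the data endpoint.

Capturing the traffic makes the endpoint easy to find. It uses the pageNum parameter to select the page, has 54 pages in total, and returns JSON. The code is as follows: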

import json
import requests

for page in range(54):
    url = 'https://gongyu.58.com/guide/api_for_renting?displayLimitNum=15&basequery=room:j|cityId:1|areaId:1|cateId:8&cookie=e87rZl4Z2EYLG6ynBNNEAg==&pageNum={0}&_=1603797279731'.format(page)
    response = requests.get(url)
    response.encoding = 'utf-8'
    data = json.loads(response.text)['data']
    # Merge the four listing blocks, skipping the first entry of each
    data = (data['position1']['list'][1:] + data['position2']['list'][1:]
            + data['position3']['list'][1:] + data['position4']['list'][1:])
    for i in data:
        name = i['title'].replace(' ', '')
        layout = i['layout']
        area = i['rentRoomArea']
        infor = i['dispLocal']
        money = i['price']

As you can see, this part of the data is not obfuscated.
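
The merging step can be checked offline without hitting the endpoint. This is a minimal sketch, assuming the response has the shape the code above implies: a `data` object whose `position1` to `position4` entries each hold a `list`, with the first element of each list skipped as in the original `[1:]` slice. The function name `extract_listings` and the sample payload are hypothetical.

```python
import json

FIELDS = ('layout', 'rentRoomArea', 'dispLocal', 'price')


def extract_listings(payload_text):
    """Flatten the four position blocks of one API response into rows."""
    data = json.loads(payload_text)['data']
    rows = []
    for key in ('position1', 'position2', 'position3', 'position4'):
        # .get() tolerates a missing block; [1:] mirrors the original slice
        items = data.get(key, {}).get('list', [])[1:]
        for item in items:
            name = item['title'].replace(' ', '')
            rows.append([name] + [item[field] for field in FIELDS])
    return rows


# Invented payload, only to exercise the function
sample = json.dumps({'data': {
    'position1': {'list': [
        {'title': 'skipped entry'},
        {'title': 'Room A', 'layout': '1室', 'rentRoomArea': '20',
         'dispLocal': 'Chaoyang', 'price': '3000'},
    ]},
    'position2': {'list': []},
}})
print(extract_listings(sample))
```

Tolerating a missing block is a design choice of this sketch; the original code would raise a KeyError in that case, which may be preferable if a missing block signals a changed API.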

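II. Decode the first kind of data

One approach circulating online is to work out which obfuscated character stands for which digit and then substitute accordingly. That does not work here, because the correspondence changes every time the page is refreshed.

This is font-based obfuscation. Typically the page remaps the character codes and loads its own font file as the font style, so the browser renders the digits correctly. In the page source the same code points, without the custom font loaded, fall back to the default font and show up as garbled characters.

The general solution is to find the font file and analyze the mapping inside it. The font is usually applied as a style on the obfuscated elements.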
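
To make the idea concrete, here is a toy model of the mapping, not the site's real data. Each page load ships a font whose cmap table maps code points to glyph names, and it is the glyph name, not the code point, that identifies the digit. The code points and both cmaps below are invented. The rule "last two digits of the glyph name, minus one" is the one the decoding code later in this article relies on.

```python
def decode(text, cmap):
    """Replace every code point found in cmap by the digit its glyph stands for."""
    out = []
    for ch in text:
        name = cmap.get(ord(ch))
        # Glyph names are assumed to look like 'glyph00004', meaning digit 3
        out.append(str(int(name[-2:]) - 1) if name else ch)
    return ''.join(out)


# Two hypothetical page loads: same code points, different glyph assignment
cmap_load1 = {0x9FA4: 'glyph00004', 0x9FA5: 'glyph00001'}
cmap_load2 = {0x9FA4: 'glyph00001', 0x9FA5: 'glyph00004'}

price_load1 = '\u9fa4\u9fa5\u9fa5\u9fa5'  # what the source contains on load 1
price_load2 = '\u9fa5\u9fa4\u9fa4\u9fa4'  # what the source contains on load 2

print(decode(price_load1, cmap_load1))  # 3000
print(decode(price_load2, cmap_load2))  # 3000
print(decode(price_load2, cmap_load1))  # 0333, a stale mapping gives wrong digits
```

The last line shows why a fixed character-to-digit table fails: the mapping has to be read from the font that came with the same page.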

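A quick test confirms this: after unchecking the font-family rule in the browser's developer tools, the digits on the page turn into garbled characters.

So fangchan-secret is most likely the obfuscating font.

Press Ctrl+F in the page source and search for fangchan-secret to locate the font file.

The font file is Base64-encoded and embedded in the JS.

Now let's write the decoding code:

1. Use a regular expression to extract the encoded part, then Base64-decode it into binary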

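import base64, re
# The font is inlined as a data URI; capture the Base64 payload up to the closing ')
bs64Str = re.findall(r"charset=utf-8;base64,(.*?)'\)", response.text)[0]
binData = base64.decodebytes(bs64Str.encode())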
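
The extraction step can be tried on a synthetic page. This sketch assumes the font sits in a `url('data:...;charset=utf-8;base64,...')` expression, which is what the regular expression above implies; the sample HTML and the bytes in it are invented, and `extract_font_bytes` is a hypothetical helper. It returns None instead of raising when the page contains no inline font.

```python
import base64
import re

FONT_RE = re.compile(r"charset=utf-8;base64,(.*?)'\)")


def extract_font_bytes(html):
    """Return the decoded inline font, or None if the page has none."""
    match = FONT_RE.search(html)
    if match is None:
        return None
    return base64.b64decode(match.group(1))


# Invented stand-in for the real font bytes
fake_font = b'\x00\x01\x00\x00not-a-real-font'
html = ("<script>src:url('data:application/font-ttf;charset=utf-8;base64,"
        + base64.b64encode(fake_font).decode('ascii')
        + "') format('truetype')</script>")

print(extract_font_bytes(html) == fake_font)
print(extract_font_bytes('<html></html>'))
```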

2. Write it to an OTF font file

filePath01 = r'\jiemi.otf'
with open(filePath01, 'wb') as f:
    # The with-block closes the file, so no explicit f.close() is needed
    f.write(binData)
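
The fixed path `\jiemi.otf` points at the root of the current drive on Windows, which may not be writable everywhere. One portable alternative, sketched here as an assumption rather than the author's method, is to let the standard library pick a temporary location. `write_font_to_temp` is a hypothetical helper and the bytes are placeholders.

```python
import os
import tempfile


def write_font_to_temp(bin_data):
    """Write the font bytes to a temporary .otf file and return its path."""
    # delete=False keeps the file after the with-block so it can be reopened by path
    with tempfile.NamedTemporaryFile(suffix='.otf', delete=False) as f:
        f.write(bin_data)
        return f.name


path = write_font_to_temp(b'placeholder bytes')
with open(path, 'rb') as f:
    print(f.read())
os.remove(path)  # clean up once the font has been parsed
```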

3. Parse the font file

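from fontTools.ttLib import TTFont

font01 = TTFont(filePath01)
# cmap maps code point -> glyph name (font01.getBestCmap() is an alternative)
utfList = font01['cmap'].tables[0].cmap
retList = []
for i in getText:  # getText is the obfuscated string (the argument of convert in step 4)
    if ord(i) in utfList:
        # The code assumes glyph names end in a two-digit index; index minus 1 is the digit
        text = int(utfList[ord(i)][-2:]) - 1
    else:
        text = i
    retList.append(text)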
4. Build the decoding function: pass in the obfuscated text and get back the decoded digits

def convert(getText):
    # Relies on the global `response` holding the page the text came from
    bs64Str = re.findall(r"charset=utf-8;base64,(.*?)'\)", response.text)[0]
    binData = base64.decodebytes(bs64Str.encode())
    # Write the OTF font file
    filePath01 = r'\jiemi.otf'
    with open(filePath01, 'wb') as f:
        f.write(binData)
    # Parse the font
    font01 = TTFont(filePath01)
    utfList = font01['cmap'].tables[0].cmap  # or font01.getBestCmap()
    retList = []
    for i in getText:
        # ord() takes a character and returns its Unicode code point
        if ord(i) in utfList:
            text = int(utfList[ord(i)][-2:]) - 1
        else:
            text = i
        retList.append(text)
    return ''.join([str(j) for j in retList]).split('\n')
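
`convert` extracts and parses the font again on every call, although the font only changes once per page. One possible refinement, sketched under the assumption that the cmap dict for the current page is already available (for example the `utfList` above), is to build a translation table once and apply it with `str.translate`. `build_table` is a hypothetical helper, the cmap below is invented, and every glyph name in it is assumed to end in two digits.

```python
def build_table(cmap):
    """Map each obfuscated code point to the digit character it stands for."""
    return {cp: str(int(name[-2:]) - 1) for cp, name in cmap.items()}


# Invented cmap for one page load
page_cmap = {0x9FA4: 'glyph00004', 0x9FA5: 'glyph00001', 0x9F92: 'glyph00010'}
table = build_table(page_cmap)

# Characters outside the table (units, separators) pass through unchanged
print('\u9fa4\u9fa5\u9fa5\u9fa5元/月'.translate(table))  # 3000元/月
print('\u9f92室1廳'.translate(table))                    # 9室1廳
```

The table would be rebuilt for each fetched page and then reused for every field on that page.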

III. Apply the decoding function in the crawler and save the final data to a CSV file

import csv

for box in strongbox:
    name = box.find_all('a', class_="strongbox")[0].text.replace('\n', '').replace(' ', '').replace('~', '')
    room = box.find_all('p', class_='room')[0].text.replace('\n', '').replace(' ', '').replace('\xa0', '')
    # Decode the obfuscated digits before splitting into layout and area
    room = convert(room)[0]
    layout = room.split('衛')[0] + '衛'
    area = room.split('衛')[1]
    infor = box.find_all('p', class_='infor')[0].text.replace('\n', '').replace(' ', '')
    money = box.find_all('div', class_='money')[0].text.replace('\n', '')
    money = convert(money)[0]
    result = [name, layout, area, infor, money]
    # Append one row per listing
    with open('租房資料20201027.csv', 'a+', newline='', encoding='gb18030') as f:
        f_csv = csv.writer(f)
        f_csv.writerow(result)
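
The code above reopens the CSV file for every listing and writes no header row. A small variation, offered as a sketch rather than the author's approach, writes a header once and then all rows through a single writer. The column names in `HEADER` simply reuse the variable names from the crawler, `write_rows` is a hypothetical helper, and the sample row is invented. An in-memory buffer stands in for the file so the example runs anywhere.

```python
import csv
import io

HEADER = ['name', 'layout', 'area', 'infor', 'money']


def write_rows(f, rows):
    """Write the header followed by all rows to an open text file object."""
    writer = csv.writer(f)
    writer.writerow(HEADER)
    writer.writerows(rows)


# Invented row, only to exercise the function
buf = io.StringIO(newline='')
write_rows(buf, [['Sample listing', '2室1廳1衛', '58㎡', 'Chaoyang', '3000']])
print(buf.getvalue())
```

With a real file, the same function can be called on a handle opened once before the loop with `newline=''` and `encoding='gb18030'`, as in the original code.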