Data analysis and charts after a Python crawl

阿新 • Published: 2018-12-17

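It has been a long time since the crawl and I never got around to processing the data. What was I doing instead? Reading up on crawling asynchronously loaded pages, plus, emmm, campus chores. Today I am filling that hole.

The last crawl produced six fields. Three are numeric: price, floor area, and current popularity. Three are text: district, description, and layout.

This time I did not analyze layout or description. I am saving those for next time, when I learn how to draw word clouds.

Start with the numeric fields: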

df.describe()
Out[10]: 
               price       square      popular
count    8694.000000  8694.000000  8694.000000
mean     9463.443639    95.332610     2.725213
std      8854.171512    57.460427     4.084303
min       600.000000    11.630000     0.000000
25%      4800.000000    58.330000     0.000000
50%      6800.000000    81.610000     1.000000
75%     10500.000000   112.150000     4.000000
max    130000.000000   682.080000    44.000000

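After the recent turbulence in the rental market, 8,694 listings meet our criteria.

The mean rent is 9,463.4 yuan/month, and half of the listings cost 6,800 yuan or less. The range is wide: [600, 130,000].

The mean floor area is 95.3 m². Half of the listings are smaller than 81.6 m² and 75% are smaller than 112.15 m². The largest listing reaches 682 m², but anything above 112.15 m² is a minority (25%).

Popularity also has a wide range, [0, 44]. Later we will look at which listings are popular and which nobody views.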
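As a small sketch of how the individual numbers quoted above can be pulled out one by one instead of reading them off the `describe()` table: the toy frame below just reuses the min, quartile and max values from that table as five sample rows, which is an assumption made for illustration and not the real data set.

```python
import pandas as pd

# Toy sample: the five quantile values from the describe() table above,
# used as stand-in rows (NOT the real 8694 listings).
toy = pd.DataFrame({
    "price": [600, 4800, 6800, 10500, 130000],
    "square": [11.63, 58.33, 81.61, 112.15, 682.08],
})

median_price = toy["price"].median()                      # middle value
price_range = toy["price"].max() - toy["price"].min()     # max minus min
q75_square = toy["square"].quantile(0.75)                 # 75th percentile

print(median_price, price_range, q75_square)
```

On the real data the same three calls would be applied to `df` directly.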

We can add a column for the rent per square meter:

df['average_price']=df['price']/df['square']

df.describe()
Out[12]: 
               price       square      popular  average_price
count    8694.000000  8694.000000  8694.000000    8694.000000
mean     9463.443639    95.332610     2.725213      99.250649
std      8854.171512    57.460427     4.084303      65.722560
min       600.000000    11.630000     0.000000      19.686589
25%      4800.000000    58.330000     0.000000      64.365126
50%      6800.000000    81.610000     1.000000      92.093085
75%     10500.000000   112.150000     4.000000     123.314479
max    130000.000000   682.080000    44.000000    4135.338346

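The mean rent per unit area is 99.25 yuan/m² per month, and 75% of listings are below 123 yuan/m². The maximum is 4,135 yuan/m² for a single month's rent, which made me go and check what on earth that listing was.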
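One way to look up such an outlier is `idxmax()`, sketched below on made-up rows. The titles and numbers are hypothetical; only the column names follow the ones used in this post.

```python
import pandas as pd

# Hypothetical listings, invented for illustration only.
toy = pd.DataFrame({
    "title": ["listing A", "listing B", "listing C"],
    "price": [3000, 8000, 40000],
    "square": [30.0, 80.0, 10.0],
})
toy["average_price"] = toy["price"] / toy["square"]

# idxmax() returns the index label of the largest value;
# .loc then fetches the whole row for inspection.
top = toy.loc[toy["average_price"].idxmax()]
print(top)
```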

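The chart shows that in September the mean rent in every Beijing district was above 3,500 yuan/month. Chaoyang District is the most expensive, with a mean of 12,695 yuan/month. Chaoyang also has the most listings, more than twice as many as second-placed Haidian District.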
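The per-district numbers behind a chart like this come from a group-by. A minimal sketch on invented rows (district names are real, the prices are not from the crawl):

```python
import pandas as pd

# Invented rows for illustration; prices are NOT from the scraped data.
toy = pd.DataFrame({
    "area": ["Chaoyang", "Chaoyang", "Chaoyang", "Haidian", "Fengtai"],
    "price": [12000, 14000, 10000, 9000, 5000],
})

# Mean rent and number of listings per district,
# ordered so the district with the most listings comes first.
summary = (
    toy.groupby("area")["price"]
    .agg(["mean", "count"])
    .sort_values("count", ascending=False)
    .reset_index()
)
print(summary)
```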

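Chaoyang District borders Dongcheng, Fengtai and Haidian districts, adjoins Changping and Shunyi districts to the north, Tongzhou District to the east and Daxing District to the south. It covers 470.8 km², has an average elevation of 34 m, and is the largest of Beijing's central urban districts. Its resident population was 3.083 million (2008 figure; my guess is that it is about 1.5 times that now).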

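The chart above shows that listings priced at 8,000-10,000 yuan are the most numerous. The second tier is 4,000-8,000 and 10,000+, the third tier is 2,000-4,000, and very few listings are below 2,000 [if there is a place that cheap, tell me where and I will rent it...].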
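Counting listings per price band is a job for `pd.cut`. The sketch below uses made-up rents and the coarse tiers named in the paragraph above (an assumption; the plotting code may use finer bands). The point to note is that there must be exactly one label per interval, and an open-ended top band needs an infinite upper edge, otherwise rents above the last edge silently become NaN and drop out of the count.

```python
import numpy as np
import pandas as pd

# Made-up monthly rents, for illustration only.
prices = pd.Series([1800, 3500, 6800, 9000, 12000, 130000])

# Six edges give five intervals, so five labels.
# np.inf keeps everything above 10000 in the last band.
bins = [0, 2000, 4000, 8000, 10000, np.inf]
labels = ["<2000", "2000-4000", "4000-8000", "8000-10000", "10000+"]

# sort_index() restores band order (value_counts sorts by count by default).
counts = pd.cut(prices, bins=bins, labels=labels).value_counts().sort_index()
print(counts)
```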

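According to the Beijing Municipal Bureau of Statistics, per-capita monthly disposable income in the city was 4,769 yuan in 2017.

Many renters share a flat, so per-person rent is the fairer comparison. A report published by 58.com (58同城) and Ganji.com (趕集網) puts Beijing's per-capita monthly rent in 2017 at 2,795 yuan.

That gives a rent-to-income ratio of 2,795 / 4,769 ≈ 58.6%, startlingly close to 60%. Many people spend more than half of their income on rent.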
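The ratio is a one-line calculation from the two figures quoted above; the variable names are mine.

```python
# Figures quoted in the text (2017, yuan per person per month).
monthly_income = 4769   # per-capita disposable income
monthly_rent = 2795     # per-capita rent

ratio = monthly_rent / monthly_income
print(f"rent-to-income ratio: {ratio:.1%}")
```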

To break the price down further, look at the rent per unit area.

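The chart above shows that rent per unit area is most often 50-100 yuan/m², followed by 100-150 yuan/m². At the per-capita rent of 2,795 yuan, the dominant 50-100 band buys roughly 28-56 m² per person, and the 100-150 band only 18.6-27.9 m². Emmm, a lot of people really are renting just a bedroom. No wonder I have even seen a single bedroom split into beds and sublet ┗( ▔, ▔ )┛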
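The area a renter gets for a fixed budget is simply budget divided by unit price. A quick check of both bands, using the 2,795 yuan per-capita rent quoted earlier (the dictionary layout is just my way of organizing it):

```python
rent = 2795  # per-capita monthly rent quoted above, in yuan

# Unit-price bands in yuan per square meter: (low, high).
bands = [(50, 100), (100, 150)]

# A higher unit price buys less area, so the high edge gives the
# smallest area and the low edge gives the largest.
area_per_person = {
    band: (rent / band[1], rent / band[0]) for band in bands
}

for (low, high), (smallest, largest) in area_per_person.items():
    print(f"{low}-{high} yuan/m2 -> {smallest:.1f} to {largest:.1f} m2")
```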

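The chart above shows that the most viewed, most popular listings are mostly under 100 m². Most people on Lianjia (鏈家) are renting somewhere to live in, and large flats get few views.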
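The data behind this kind of chart is the smallest and largest floor area seen at each popularity value. A minimal sketch on invented rows (column names follow this post, the values do not come from the crawl):

```python
import pandas as pd

# Invented rows for illustration only.
toy = pd.DataFrame({
    "popular": [0, 0, 1, 1, 5],
    "square": [40.0, 300.0, 55.0, 90.0, 62.0],
})

# One row per popularity value, with the min and max floor area.
span = toy.groupby("popular")["square"].agg(["min", "max"])
print(span)
```

If the most-viewed groups show a narrow span at small areas while the zero-view group stretches to very large flats, that matches the reading above.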

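By floor area, 60-100 m² listings are the most common, followed by 30-60 m², so supply and demand match overall.

End of the Python data analysis and plotting exercise.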

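After finishing the charts I looked up some news:

The Beijing Real Estate Agency Industry Association did get some results in September and October.

News from October 21: the latest figures published by the Beijing Real Estate Agency Industry Association show that, as of October 20, ten rental service companies had released a cumulative 133,131 units (rooms), 1.1 times their publicly pledged total. Compared with September, both volume and prices in Beijing's housing rental market continued to fall, with transaction volume down about 8% month on month.

Finally, the code: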

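import numpy as np
import pandas as pd
# Top-level chart imports are the pyecharts 0.5.x API (1.x moved them to pyecharts.charts).
from pyecharts import Overlap, Bar, Line, Kline, Pie

# The CSV has no header row, so the column names are given explicitly.
df = pd.read_csv('house price.csv', sep=',', header=None, encoding='utf-8',
                 names=['area', 'title', 'price', 'rtype', 'square', 'popular'])
# dtype = {'area':str,'title':str,'price':int,'rtype':str,'square':float,'popular':int}
# Drop rows with any missing value, then fix the numeric dtypes.
df = df.dropna(axis=0)
# df = df[~df['popular'].isin(['nan'])]  # alternative
df['price'] = df['price'].astype(int)
df['square'] = df['square'].astype(float)
# Only 8694 valid rows remain after cleaning. I only crawled Lianjia (Ziroom is in the news again,
# this time for reporting a testing agency), but judging by this long-established site the Beijing
# rental market is not as big as I had imagined, and Ziroom and Danke have even fewer listings.
# Mean rent is 9463.4 yuan/month and the median is 6800: half of the listings cost 6800 or less,
# and the range over all listings is wide, [600, 130000].
df['average_price'] = df['price'] / df['square']

# Mean rent and number of listings per district, sorted by number of listings.
area = df.groupby(['area'])
house_com = area['price'].agg(['mean', 'count'])
house_com.reset_index(inplace=True)
area_main = house_com.sort_values('count', ascending=False)
attr = area_main['area']
v1 = area_main['count']
v2 = area_main['mean']

# Line: mean rent per district (right-hand axis in the overlap below).
line = Line("Mean rent by major Beijing district")
line.add("Mean rent", attr, v2, is_stack=True, xaxis_rotate=30,
    mark_point=['min', 'max'], xaxis_interval=0, line_color='#32CD32',
    line_width=4, mark_point_textcolor='red', mark_point_symbol="",)

# Bar: number of listings per district.
bar = Bar("Listings and mean rent by major Beijing district")
bar.add("Listings", attr, v1, is_stack=True, xaxis_rotate=30,
    xaxis_interval=0, is_splitline_show=False, color='green')

# Overlay the line on the bar chart with its own y-axis.
overlap = Overlap()
overlap.add(bar)
overlap.add(line, yaxis_index=1, is_add_yaxis=True)
overlap.render('北京路段_房屋均價分佈圖.html')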
# Distribution of listings by rent band

price_info = df[['area', 'price', 'average_price']]

# Bin the rent. Edges and labels must line up one-to-one;
# np.inf keeps rents above 10000 in the last band.
bins = [0, 1000, 1500, 2000, 3000, 4000, 5000, 6000, 8000, 10000, np.inf]
level = ['0-1000', '1000-1500', '1500-2000', '2000-3000', '3000-4000',
         '4000-5000', '5000-6000', '6000-8000', '8000-10000', '10000+']
price_stage = pd.cut(price_info['price'], bins=bins, labels=level).value_counts().sort_index()
attr = price_stage.index
v3 = price_stage.values

bar2 = Bar("Rent band vs. number of listings")
bar2.add("", attr, v3, is_stack=True, xaxis_rotate=30,
    xaxis_interval=0, is_splitline_show=False)
bar2.render('價格區間&房源數量分佈.html')

# Distribution of listings by rent per square meter (yuan/m² per month)
bins = [0, 50, 100, 150, 200, 300, 500, 1000, 10000]
level = ['0-50', '50-100', '100-150', '150-200', '200-300', '300-500', '500-1000', '1000+']
average_price_stage = pd.cut(df['average_price'], bins=bins, labels=level).value_counts().sort_index()
attr = average_price_stage.index
v4 = average_price_stage.values

bar3 = Bar("Rent per unit area vs. number of listings")
bar3.add("", attr, v4, is_stack=True, xaxis_rotate=30,
    xaxis_interval=0, is_splitline_show=False)
bar3.render('單位面積價格區間&房源數量分佈.html')

# Floor-area range (min to max) for each popularity value.
# Kline expects [open, close, lowest, highest] per item,
# so the min is used for open/close and min/max for lowest/highest.
popular_com = df['square'].groupby(df['popular'])
sq_min = popular_com.min()
sq_max = popular_com.max()
distribution = [[lo, lo, hi, hi] for lo, hi in zip(sq_min.tolist(), sq_max.tolist())]

kline = Kline("Floor-area range of listings by popularity")
kline.add("", list(sq_min.index), distribution)
kline.render('熱度面積.html')

# Floor area distribution

bins = [0, 30, 60, 90, 120, 150, 200, 300, 400, 700]
level = ['0-30', '30-60', '60-90', '90-120', '120-150', '150-200', '200-300', '300-400', '400+']
square_level = pd.cut(df['square'], bins=bins, labels=level)
s = square_level.value_counts()
attr = s.index
v5 = s.values

# Ring-shaped pie: inner radius 40%, outer radius 75%.
pie = Pie("Floor area distribution", title_pos='center')
pie.add(
    "",
    attr,
    v5,
    radius=[40, 75],
    label_text_color=None,
    is_label_show=True,
    legend_orient="vertical",
    legend_pos="left",
)
pie.render('房屋面積分佈.html')
