Scraping and Analyzing Nationwide New Housing Development Data with Python

阿新 • Published: 2021-06-20

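I. Topic Background

Q: Why choose this topic?

With the rapid growth of the internet, the World Wide Web has become the carrier of a vast amount of information. Extracting and using that information effectively is a major challenge.

Q: What is the expected goal?

Population inflow increases the demand for housing in cities with good development prospects. The goal is to collect more data on new housing developments and use it to analyze population movement.

Q: Project background

The fourth session of the 13th National People's Congress opened at 9 a.m. on the 5th in the Great Hall of the People. Its housing policy was "housing is for living in, not for speculation", with a focus on solving the prominent housing problems of large cities. This project analyzes the housing data visually against that background.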

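II. Design of the Topic-Focused Web Crawler

Q: Name of the topic-focused web crawler

Scraping and analyzing nationwide new housing development data

Q: What the crawler covers

It scrapes the latest data on new housing developments in China and visualizes the result.

Q: Design overview

The crawler uses the modules requests_html, requests_cache, bs4.BeautifulSoup, re, and others.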

III. Structural Features of the Target Page

This is the site's home page.

Structural features

The tag is div.

Structure analysis
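To make the div-based structure more concrete, here is a minimal sketch that runs BeautifulSoup over a hand-written HTML fragment. The fragment is hypothetical: only the class names nlcd_name, nhouse_price and fangyuan come from the crawler code in this article, while the wrapper class nlc_details, the nesting and the sample values are assumptions made for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment. The inner class names appear in the article's crawler;
# the wrapper div, the nesting and the values are invented for this example.
SAMPLE_HTML = """
<div class="nlc_details">
  <div class="nlcd_name"><a> Sample Garden </a></div>
  <div class="nhouse_price"><span>12000</span><em>元/㎡</em></div>
  <div class="fangyuan"><span>在售</span></div>
</div>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")


def texts_by_class(soup, class_name):
    """Return the stripped text of every tag that carries the given CSS class."""
    tags = soup.find_all(attrs={"class": class_name})
    return [tag.get_text(strip=True) for tag in tags]


names = texts_by_class(soup, "nlcd_name")
prices = texts_by_class(soup, "nhouse_price")
statuses = texts_by_class(soup, "fangyuan")
print(names, prices, statuses)
```

Each field of a listing sits in its own div, so one find_all call per class name yields one list per field, and the lists line up by position.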

IV. Analysis of the Crawler Program

Defining the functions

# Define one function per field to extract from each project listing.

import re

# Captures every run of non-whitespace characters that has whitespace on both sides.
TOKEN_RE = re.compile(r'\s(\S+)\s')


def get_house_status(soup):
    """Get the sales status of each listing."""
    house_status = []
    status = soup.find_all(attrs={'class': 'fangyuan'})
    for state in status:
        _status = state.span.text
        house_status.append(_status)
    return house_status


def get_house_price(soup):
    """Get the price of each listing."""
    house_price = []
    prices = soup.find_all(attrs={'class': 'nhouse_price'})
    for price in prices:
        _prices = TOKEN_RE.findall(price.text)
        _price = ''
        if _prices[0] == '價格待定':  # "price to be determined": leave the price empty
            pass
        else:
            p = _prices[0].split('元')[0]
            if '萬' in p:  # 萬 = 10,000, so this is a total price per unit
                _price = p + '元/套'
            else:  # otherwise it is a price per square metre
                _price = p + '元/m2'
        house_price.append(_price)
    return house_price


def get_house_address(soup, c_city):
    """Get the district and the street address of each listing."""
    house_address = []
    region = []
    addresses = soup.find_all(attrs={'class': 'address'})
    for address in addresses:
        _address = TOKEN_RE.findall(address.text)
        if len(_address) > 1:
            # The first token looks like "[district]"; keep the text inside the brackets.
            region.append(_address[0].split('[')[1].split(']')[0])
        else:
            region.append(c_city)  # no district given: fall back to the city name
        house_address.append(address.a['title'])
    return region, house_address


def get_house_type(soup):
    """Get the layout (unit type) of each listing."""
    house_type = []
    # Assumes the class attribute is 'house_type clearfix'; the space is missing in the source.
    house_types = soup.find_all(attrs={'class': 'house_type clearfix'})
    for _house_type in house_types:
        type_list = TOKEN_RE.findall(_house_type.text)
        house_type.append(''.join(type_list))
    return house_type


def get_house_name(soup):
    """Get the project name of each listing."""
    house_name = []
    nlcd_names = soup.find_all(attrs={'class': 'nlcd_name'})
    for nlcd_name in nlcd_names:
        names = TOKEN_RE.findall(nlcd_name.text)
        if len(names) > 1:
            house_name.append(''.join(names))
        else:
            house_name.extend(names)
    return house_name
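The helper functions above all tokenize tag text with the pattern \s(\S+)\s. The sketch below shows how that pattern behaves on two invented sample strings (the strings are assumptions, not scraped data). It also shows one property worth knowing: each match consumes the whitespace after the token, so a token separated from the previous one by a single whitespace character is skipped.

```python
import re

TOKEN_RE = re.compile(r"\s(\S+)\s")

# Tokens separated by two or more whitespace characters are all captured.
well_spaced = TOKEN_RE.findall("\n3居\n\n－120~150平米\n")

# With a single separator, the first match consumes the space that the
# second token would need in front of it, so the second token is lost.
single_spaced = TOKEN_RE.findall(" 3居 4居 ")


def tokens(text):
    """Alternative tokenizer: str.split() returns every whitespace-separated token."""
    return text.split()


print(well_spaced)
print(single_spaced)
print(tokens(" 3居 4居 "))
```

Whether this matters depends on how the real pages are spaced. If fields come back truncated, swapping the regex for str.split() is a low-risk change to try.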

The main data-fetching function

import json
import time

import numpy as np
import pandas as pd
import requests_cache
from bs4 import BeautifulSoup
from fake_useragent import UserAgent  # assumed source of UserAgent; the article does not show its imports

# make_throttle_hook, get_last_page, html_url and to_df are helpers that this
# article does not show.


def get_data(c_city, city, start_page, cache):
    """Fetch every listing page of one city and return the result as a DataFrame."""
    requests_cache.install_cache()
    requests_cache.clear()
    session = requests_cache.CachedSession()  # create a cached session
    # Configure a hook that throttles requests
    session.hooks = {'response': make_throttle_hook(np.random.randint(8, 12))}
    print(f'Now scraping {c_city}'.center(50, '*'))
    last_page = get_last_page(city)
    print(f'{c_city} has {last_page} pages')
    time.sleep(np.random.randint(15, 20))
    df_city = pd.DataFrame()
    user_agent = UserAgent().random
    for page in range(start_page, last_page):
        try:
            # Record the current page so an interrupted run can resume from it.
            cache['start_page'] = page
            print(cache)
            cache_json = json.dumps(cache, ensure_ascii=False)
            with open('cache.txt', 'w', encoding='utf-8') as fout:
                fout.write(cache_json)
            print(f'Now scraping page {page + 1} of {c_city}.')
            if page == 0:
                df_city = pd.DataFrame()
            else:
                df_city = pd.read_csv(f'df_{c_city}.csv', encoding='utf-8')
            url = html_url(city, page + 1)
            if page % 2 == 0:
                user_agent = UserAgent().random
            # Build a randomized request header
            header = {"User-Agent": user_agent}
            res = session.post(url, headers=header)
            if res.status_code == 200:
                res.encoding = 'gb18030'
                # Parse the HTML
                soup = BeautifulSoup(res.text, features='lxml')
                region, house_address = get_house_address(soup, c_city)
                house_name = get_house_name(soup)
                house_type = get_house_type(soup)
                house_price = get_house_price(soup)
                house_status = get_house_status(soup)
                df_page = to_df(c_city,
                                region,
                                house_name,
                                house_address,
                                house_type,
                                house_price,
                                house_status)
                df_city = pd.concat([df_city, df_page])
            time.sleep(np.random.randint(5, 10))
            df_city.to_csv(f'df_{c_city}.csv',
                           encoding='utf-8',
                           index=False)
        except Exception:
            # On error, save the data and the checkpoint so the crawl can continue later.
            df_city.to_csv(f'df_{c_city}.csv', encoding='utf-8', index=False)
            cache_json = json.dumps(cache, ensure_ascii=False)
            with open('cache.txt', 'w', encoding='utf-8') as fout:
                fout.write(cache_json)
    return df_city
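get_data installs a response hook built by make_throttle_hook, which the article does not define. The following is a minimal sketch of what such a helper could look like, assuming it is meant to pause after every request that actually reached the network. The from_cache attribute is the flag that requests_cache sets on its responses; FakeResponse is a hypothetical stand-in used only to exercise the hook without any network access.

```python
import time


def make_throttle_hook(timeout=1.0):
    """Build a response hook that sleeps after each non-cached response."""
    def hook(response, *args, **kwargs):
        # Cached responses cost the server nothing, so they are not delayed.
        if not getattr(response, "from_cache", False):
            time.sleep(timeout)
        return response
    return hook


class FakeResponse:
    """Hypothetical stand-in for a response object (demo only)."""
    def __init__(self, from_cache):
        self.from_cache = from_cache


hook = make_throttle_hook(timeout=0.05)

start = time.monotonic()
cached = FakeResponse(from_cache=True)
returned = hook(cached)
cached_elapsed = time.monotonic() - start

start = time.monotonic()
hook(FakeResponse(from_cache=False))
network_elapsed = time.monotonic() - start

print(f"cached: {cached_elapsed:.3f}s, network: {network_elapsed:.3f}s")
```

In get_data the timeout is drawn once per city with np.random.randint(8, 12), so every non-cached request for that city waits the same number of seconds.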

During scraping, each city is saved to its own CSV file.
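get_data writes the current page to cache.txt on every iteration, but the article does not show how that file is read back when a crawl is restarted. The sketch below is one way to do it, under the assumption that the checkpoint is a JSON object with a start_page key (the only key visible in the article). The function names load_checkpoint and save_checkpoint are hypothetical.

```python
import json
import os
import tempfile


def load_checkpoint(path, default_page=0):
    """Return the saved checkpoint dict, or a fresh one if no file exists yet."""
    if not os.path.exists(path):
        return {"start_page": default_page}
    with open(path, encoding="utf-8") as fin:
        return json.load(fin)


def save_checkpoint(path, cache):
    """Write the checkpoint dict as JSON, keeping non-ASCII text readable."""
    with open(path, "w", encoding="utf-8") as fout:
        json.dump(cache, fout, ensure_ascii=False)


# Demo in a temporary directory so no real cache.txt is touched.
with tempfile.TemporaryDirectory() as tmp:
    checkpoint_path = os.path.join(tmp, "cache.txt")
    first_run = load_checkpoint(checkpoint_path)
    save_checkpoint(checkpoint_path, {"start_page": 7})
    resumed = load_checkpoint(checkpoint_path)

print(first_run, resumed)
```

The start_page value of the loaded dict can then be passed to get_data as its start_page argument.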

Merging the data

import os
import pandas as pd

data_dir = './全國房價資料集'
df_total = pd.DataFrame()
# os.walk is the Python 3 function; os.path.walk no longer exists.
for root, dirs, files in os.walk(data_dir):
    for file in files:
        file_ext = os.path.splitext(file)[1]
        if file_ext == '.csv':
            path = os.path.join(root, file)
            df_city = pd.read_csv(path, encoding='utf-8')
            df_total = pd.concat([df_total, df_city])
df_total.to_csv(os.path.join(data_dir, '全國新房202102.csv'), encoding='utf-8', index=False)
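One thing to watch with the merge step: the merged file is written into the same directory tree that is being scanned, so a second run would read it back in as if it were a city file. A possible variant, sketched below, matches only the per-city file names. It assumes the files follow the df_<city>.csv pattern used by get_data; merge_city_csvs is a hypothetical name, and the demo runs on two tiny invented files in a temporary directory.

```python
import tempfile
from pathlib import Path

import pandas as pd


def merge_city_csvs(data_dir, pattern="df_*.csv"):
    """Concatenate every per-city CSV under data_dir into one DataFrame."""
    paths = sorted(Path(data_dir).rglob(pattern))
    frames = [pd.read_csv(path, encoding="utf-8") for path in paths]
    if not frames:
        return pd.DataFrame()
    # ignore_index gives the merged frame a clean 0..n-1 index.
    return pd.concat(frames, ignore_index=True)


# Demo with invented data.
with tempfile.TemporaryDirectory() as tmp:
    pd.DataFrame({"city": ["A"], "house_price": [1]}).to_csv(
        Path(tmp) / "df_A.csv", index=False)
    pd.DataFrame({"city": ["B", "B"], "house_price": [2, 3]}).to_csv(
        Path(tmp) / "df_B.csv", index=False)
    # A previously merged file in the same folder is ignored by the pattern.
    pd.DataFrame({"city": ["old"], "house_price": [0]}).to_csv(
        Path(tmp) / "merged.csv", index=False)
    merged = merge_city_csvs(tmp)

print(merged)
```

Collecting the frames in a list and calling pd.concat once also avoids re-copying the growing DataFrame on every iteration.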

Data cleaning

Import the required modules

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import missingno as msno

Read the data

raw_data = pd.read_csv('全國新房202102.csv', encoding='utf-8')
raw_data.sample(5)

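A quick look at the data

>>> raw_data.shape
(54733, 7)

>>> len(raw_data.city.drop_duplicates())
581

The crawl covered 581 cities nationwide, with a total of 54,733 housing projects on sale or in presale.

The scraped data contains missing values, outliers, and fields that cannot be used directly, so these have to be handled before the analysis.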

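Missing-value analysis

msno.matrix(raw_data)

Overall, house_price has missing values. This is because those developments are still in presale and have not announced a price yet.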
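The missingno matrix is a visual check. The same information can be read as numbers with plain pandas, as in the sketch below. The sample frame is invented for illustration; the column names and price formats follow the article, and the status value 待售 is an assumption.

```python
import numpy as np
import pandas as pd

# Invented sample rows; only the column names and price formats come from the article.
sample = pd.DataFrame({
    "city": ["A", "B", "C", "D"],
    "house_price": ["12000元/m2", np.nan, "150萬元/套", np.nan],
    "house_status": ["在售", "待售", "在售", "待售"],
})

missing_counts = sample.isna().sum()   # number of missing cells per column
missing_share = sample.isna().mean()   # fraction of missing cells per column

print(missing_counts)
print(missing_share)
```

Running the same two lines on raw_data gives the per-column counts behind the matrix plot.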

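house_type

On closer inspection, house_price comes in two forms.

Apart from the values that are missing because of presale, a price is given either per square metre or as a total per unit. To make the statistics comparable, each total price has to be divided by the floor area so that every price becomes an average unit price. That in turn requires processing the house_type field first, since it holds the floor area.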

def deal_house_type(data):
    """Split a house_type string into [layout, min_area, max_area]."""
    if pd.isna(data):
        return [np.nan, np.nan, np.nan]
    if '－' in data:
        types = data.split('－')[0]
        areas = data.split('－')[1]
        area = areas.split('~')
        if len(area) == 1:
            # A single area: strip the two-character unit and use it for both bounds.
            min_area = area[0][0:-2]
            max_area = area[0][0:-2]
        else:
            # A range "min~max": only the upper bound carries the unit.
            min_area = area[0]
            max_area = area[1][0:-2]
        return [types, int(min_area), int(max_area)]
    return [np.nan, np.nan, np.nan]

series_type = raw_data.house_type.map(deal_house_type)
df_type = pd.DataFrame(series_type.to_dict(), index=['house_type', 'min_area', 'max_area']).T
# data_copy is not defined in the article; it is assumed here to be a copy of raw_data.
data_copy = raw_data.copy()
data_type = pd.concat([data_copy.drop(labels='house_type', axis=1), df_type], axis=1)
data_type.head()

This gives the following table.
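As a worked example of the house_type parsing, here is a small standalone sketch that uses a regular expression instead of slicing off the last two characters. The input format is an assumption inferred from the code above: a layout, a full-width dash, then either one area or a min~max range followed by the two-character unit 平米. parse_house_type is a hypothetical name and the sample strings are invented.

```python
import re

# One area ("89平米") or a range ("120~150平米") at the end of the string.
AREA_RE = re.compile(r"(\d+)(?:~(\d+))?平米$")


def parse_house_type(text):
    """Return (layout, min_area, max_area), or three Nones if the text does not parse."""
    if not isinstance(text, str) or "－" not in text:
        return (None, None, None)
    layout, areas = text.split("－", 1)
    match = AREA_RE.search(areas)
    if match is None:
        return (None, None, None)
    min_area = int(match.group(1))
    # A single area serves as both the lower and the upper bound.
    max_area = int(match.group(2)) if match.group(2) else min_area
    return (layout, min_area, max_area)


ranged = parse_house_type("3居/4居－120~150平米")
single = parse_house_type("2居－89平米")
missing = parse_house_type(float("nan"))
print(ranged, single, missing)
```

Matching the unit explicitly means a string with an unexpected suffix yields missing values instead of raising inside int().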

house_price

With the floor areas extracted, the next step is to process the prices.

def deal_house_price(data):
    """Convert one row's house_price to an integer price per square metre."""
    try:
        if pd.isna(data.house_price):
            return np.nan
        if "價格待定" in data.house_price:  # price not announced yet
            return np.nan
        if "萬" not in data.house_price:
            # Already a price per square metre.
            price = int(data.house_price.split('元')[0])
        else:
            # A total price in units of 10,000 yuan: divide it by the floor area.
            price_total = int(float(data.house_price.split('萬')[0]) * 10000)
            if pd.isna(data.min_area) and pd.isna(data.max_area):
                return np.nan
            elif pd.isna(data.min_area):
                price = price_total / data.max_area
            elif pd.isna(data.max_area):
                price = price_total / data.min_area
            else:
                # Use the mean of the two bounds as the representative area.
                price = price_total / ((data.min_area + data.max_area) / 2)
        return int(price)
    except Exception:
        return np.nan

series_price = data_type.apply(deal_house_price, axis=1)
data_type['house_price'] = series_price
data_type.head()

This gives the result.
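To check the price normalization by hand, the sketch below repeats the arithmetic on plain values. The price strings follow the two formats produced by get_house_price (a unit price ending in 元/m2, or a total ending in 元/套 with 萬 for 10,000). Taking the mean of the two area bounds as the representative area is an assumption of this sketch, and unit_price is a hypothetical name.

```python
def unit_price(price_text, min_area, max_area):
    """Return the price per square metre as an int, or None if it cannot be computed."""
    if not isinstance(price_text, str) or not price_text or "價格待定" in price_text:
        return None
    if "萬" not in price_text:
        # Already a unit price, for example "12000元/m2".
        return int(price_text.split("元")[0])
    # A total price in units of 10,000 yuan, for example "150萬元/套".
    total = float(price_text.split("萬")[0]) * 10000
    areas = [a for a in (min_area, max_area) if a is not None]
    if not areas:
        return None
    mean_area = sum(areas) / len(areas)
    return int(total / mean_area)


per_metre = unit_price("12000元/m2", 90, 120)
per_unit = unit_price("150萬元/套", 100, 140)   # 1,500,000 / 120
undecided = unit_price("價格待定", 100, 140)
print(per_metre, per_unit, undecided)
```

For the second call, 150萬 is 1,500,000 yuan and the mean of 100 and 140 is 120 square metres, which works out to 12,500 yuan per square metre.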

Handling missing values

data = data_type.copy()
# Fill missing prices with 0 and missing layouts with '未知' ("unknown")
data['house_price'] = data_type.house_price.fillna(0)
data['house_type'] = data_type.house_type.fillna('未知')

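Outlier analysis

data.describe([.1, .25, .5, .75, .99]).T

There is clearly one abnormal value. Checking the original web page shows that this entry is a special case that was multiplied by an extra 100000 during cleaning, so the value is simply corrected directly.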
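Besides reading the describe() table by eye, a common rule of thumb flags values that lie more than 1.5 times the interquartile range outside the quartiles. The sketch below applies that rule to an invented price list. It is not the method used in this article, only an illustration; the 0 stands for a missing price as filled in above and is excluded before the quartiles are computed.

```python
from statistics import quantiles


def iqr_outliers(values, k=1.5):
    """Return the values lying more than k * IQR below Q1 or above Q3."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]


# Invented sample prices in yuan per square metre.
prices = [0, 8000, 9000, 10000, 11000, 12000, 1200000000]
valid = [p for p in prices if p > 0]   # drop the zero placeholders first

outliers = iqr_outliers(valid)
print(outliers)
```

On the real data the same function could be called on the non-zero entries of data.house_price to list candidate values for manual checking against the source page.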

Outliers can also be inspected visually with a box plot.

from pyecharts import options as opts
from pyecharts.charts import Boxplot

v = [int(i) for i in data.house_price]
c = Boxplot()
c.add_xaxis(["house_price"])
# prepare_data converts each raw series into the five-number summary a box plot needs.
c.add_yaxis("house_price", c.prepare_data([v]))
c.set_global_opts(title_opts=opts.TitleOpts(title="house_price"))
c.render_notebook()

Visual analysis

Bar chart of average new-home prices by city
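The chart code in this section reads from a variable named data_pivot, which the article never defines. From the way it is used (a city index and a house_price column, with the first 15 rows taken as the top 15), it is presumably the average price per city sorted from highest to lowest. The sketch below shows one way to build such a table. It is an assumption, not the author's code, and it runs on invented sample rows; excluding the rows whose price was filled with 0 is also a choice made here.

```python
import pandas as pd

# Invented sample rows standing in for the cleaned data.
data = pd.DataFrame({
    "city": ["A", "A", "B", "B", "C"],
    "house_price": [10000, 20000, 30000, 0, 5000],
})

# Rows with price 0 are presale projects without a price; leave them out
# so they do not drag the averages down.
priced = data[data["house_price"] > 0]

data_pivot = (
    priced.pivot_table(index="city", values="house_price", aggfunc="mean")
    .sort_values("house_price", ascending=False)
)
print(data_pivot)
```

With a table of this shape, data_pivot.index[0:15] and data_pivot.house_price.values[0:15] give the 15 most expensive cities and their averages.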

from pyecharts import options as opts
from pyecharts.charts import Bar
from pyecharts.globals import ThemeType

# data_pivot is not defined in this article; it is presumably the average
# house_price per city, sorted from highest to lowest.
x_axis = list(data_pivot.index[0:15])
y_axis = [round(float(i), 1) for i in data_pivot.house_price.values[0:15]]

c = (
    Bar({"theme": ThemeType.DARK})
    .add_xaxis(x_axis)
    .add_yaxis("house_price_avg", y_axis)
    .set_global_opts(
        title_opts=opts.TitleOpts(title="Average new-home price by city, top 15", subtitle="資料:STUDIO"),
        brush_opts=opts.BrushOpts(),
    )
)
c.render_notebook()

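The result is shown below. The top positions are consistently held by first-tier cities such as Shenzhen, Beijing, and Shanghai.

Geographic map of housing prices nationwide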

from pyecharts import options as opts
from pyecharts.charts import Geo
from pyecharts.globals import GeoType, ThemeType

# Custom longitude and latitude per city (left disabled)
# geo_cities_coords = {df.iloc[i]['城市']: [df.iloc[i]['經度'], df.iloc[i]['緯度']] for i in range(len(df))}

datas = [(i, int(j)) for i, j in zip(data_pivot.index, data_pivot.house_price.values)]
# print(datas)

geo = (
    Geo(
        init_opts=opts.InitOpts(width='1000px',
                                height='600px',
                                theme=ThemeType.PURPLE_PASSION),
        is_ignore_nonexistent_coord=True,
    )
    .add_schema(maptype='china',
                label_opts=opts.LabelOpts(is_show=True))  # show province names
    .add(
        'Average price',
        data_pair=datas,
        type_=GeoType.EFFECT_SCATTER,
        symbol_size=8,
        # geo_cities_coords=geo_cities_coords
    )
    .set_series_opts(label_opts=opts.LabelOpts(is_show=False))
    .set_global_opts(
        title_opts=opts.TitleOpts(title='Average new-home price by city', subtitle="製圖:資料STUDIO"),
        visualmap_opts=opts.VisualMapOpts(
            max_=550,
            is_piecewise=True,
            pieces=[
                {"max": 5000, "min": 1000, "label": "1000-5000", "color": "#708090"},
                {"max": 10000, "min": 5001, "label": "5001-10000", "color": "#00FFFF"},
                {"max": 20000, "min": 10001, "label": "10001-20000", "color": "#FF69B4"},
                {"max": 30000, "min": 20001, "label": "20001-30000", "color": "#FFD700"},
                {"max": 40000, "min": 30001, "label": "30001-40000", "color": "#FF0000"},
                {"max": 100000, "min": 40001, "label": "40001-100000", "color": "#228B22"},
            ],
        ),
    )
)

geo.render('全國城市在售新房均價.html')

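In recent years housing prices have soared in an overheated market, and local governments have issued one regulation measure after another to stabilize them. According to statistics, local governments across the country have issued housing-market regulation measures 97 times so far this year (close to 100). January alone accounted for 42 of them, and February had 3 more than January, for a total of 45.

Ranking of cities by total number of new-home projects

Next comes the top 20 of cities by total number of new-home projects on sale or in presale. The top five are Chengdu, Sichuan (1,000 projects), Chongqing (938), Wuhan, Hubei (859), Xi'an, Shaanxi (840), and Zhengzhou, Henan (822). All five are new first-tier cities (Chengdu, Hangzhou, Chongqing, Wuhan, Suzhou, Xi'an, Tianjin, Nanjing, Zhengzhou, Changsha, Shenyang, Qingdao, Ningbo, Dongguan, and Wuxi).

The new first-tier cities are developing quickly and have broad prospects, arguably second only to Beijing, Shanghai, Guangzhou, and Shenzhen. People keep moving in, which raises the demand for housing, and rising demand pushes housing prices steadily upward. These cities are also well worth investing in.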

from pyecharts import options as opts
from pyecharts.charts import Bar

city_counts = data.city.value_counts()[0:20]
x_values = city_counts.index.to_list()
y_values = [int(i) for i in city_counts.values]

bar = (
    Bar()
    .add_xaxis(x_values)
    .add_yaxis("", y_values, itemstyle_opts=opts.ItemStyleOpts(color="#749f83"))
    .set_global_opts(
        title_opts=opts.TitleOpts(title="New-home projects by city, top 20"),
        toolbox_opts=opts.ToolboxOpts(),
        legend_opts=opts.LegendOpts(is_show=False),
        datazoom_opts=opts.DataZoomOpts(),
    )
)
bar.render_notebook()

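V. Summary

Working through this project on scraping and analyzing nationwide new housing data with Python gave me a much deeper understanding of web scraping. There are still many shortcomings, but I now have a solid grasp of web scraping with Python, and the scraping process itself turned out to be a lot of fun.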
