基於 Python 和 Pandas 的數據分析(7) --- Pickling
上一節我們介紹了幾種合並數據的方法. 這一節, 我們將重新開始不動產的例子. 在第四節中我們寫了如下代碼:
import Quandl
import pandas as pd
fiddy_states = pd.read_html(‘https://simple.wikipedia.org/wiki/List_of_U.S._states‘)
for abbv in fiddy_states[0][1][2:]:
#print(abbv)
print("FMAC/HPI_"+str(abbv))
上面這段代碼是為了獲取 50 個州的簡稱, 進行遍歷, 並生成恰當的 Quandl 查詢語句, 用以查出每個州的房價. 如果單個處理, 將會有50個 dataframes. 大家可以查看一下第四節的教程, 發現房價的查詢結果都會返回 "Date"索引, 所以我們可以使用 "join" 來做數據合並. (通常情況下, 可能沒有現成的索引, 那麽可以使用 "concatenate", 並加上參數 on = ‘Date‘)
下面我們開始收集數據:
import Quandl
import pandas as pd
mytoken = ‘HK1FqyWqrmUwLCYnWcHa‘
fiddy_states = pd.read_html(‘https://simple.wikipedia.org/wiki/List_of_U.S._states‘)
main_df = pd.DataFrame()
for abbv in fiddy_states[0][1][2:]:
query = "FMAC/HPI_"+str(abbv)
df = Quandl.get(query, authtoken=mytoken)
if main_df.empty:
main_df = df
else:
main_df = main_df.join(df)
註意: 這裏可能會有一點問題, Quandl 在返回數據的時候, 如果是只有一個列, 它會自動把列名改為 "value". 那麽當我們循環的時候, 就會有問題, 變成每個列名都是 "value", 就會報類似這樣的錯:
ValueError: columns overlap but no suffix specified: Index([u‘Value‘], dtype=‘object‘)
所以這裏我們要改一下, 循環之後, 把列名"value" 改成每個州的縮寫, 也就是每個 "abbv" 的值.
df.rename(columns={‘NSA Value‘:str(abbv)+‘NSA Value‘, ‘SA Value‘:str(abbv)+‘SA Value‘}, inplace=True)
很好, 但是你可能發現, 每運行一次都至少要花費30秒鐘, 這是讓人很不爽的. 所以我們要想辦法把這個數據保存成一個文件, 這樣以後需要用的時候, 就基於這個文件就好了.
這裏就引出 python 的另一個模塊"Pickle", 它可以把對象轉成二進制字節流, 也可以通過 unpickling 進行反向轉換. 我們可以通過它保存任何 Python 對象, 不管是機器學習的 "classifier", 還是 "Dictionary", 又或者是 "dataframe" 都沒有問題. 另外, 除了 Python 的 pickle, Pandas 也有 pickle. 我們會分別學習這兩種.
首先, 我們來看下 Python 的 pickle. 我們可以用它來處理任何 Python 對象, 不一定非得是 dataframe. 但是在我們這個例子中, 我們還是用它來處理 dataframe.
第一步, 引入 pickle:
import pickle
第二步:
pickle_out = open(‘fiddy_states.pickle‘,‘wb‘)
pickle.dump(main_df, pickle_out)
pickle_out.close()
在上面的代碼中, 首先是以可寫的方式創建並打開一個 .pickle 的文件, 然後我們將數據 main_df 寫入這個文件. 最後保存關閉這個文件.
盡管我們不想重復地運行獲取每個州的房價的代碼, 但是我們仍然有偶爾需要運行它的可能, 所以我們還是要把代碼整理一下, 以提高它的復用性:
import Quandl
import pandas as pd
import pickle
mytoken = ‘HK1FqyWqrmUwLCYnWcHa‘
def state_list():
fiddy_states = pd.read_html(‘https://simple.wikipedia.org/wiki/List_of_U.S._states‘)
return fiddy_states[0][1][2:]
def grab_initial_state_data():
states = state_list()
main_df = pd.DataFrame()
for abbv in states:
query = "FMAC/HPI_"+str(abbv)
df = Quandl.get(query, authtoken=api_key)
print(query)
if main_df.empty:
main_df = df
else:
main_df = main_df.join(df)
pickle_out = open(‘fiddy_states.pickle‘,‘wb‘)
pickle.dump(main_df, pickle_out)
pickle_out.close()
grab_initial_state_data()
如上, 我們封裝了兩個函數, 這樣我們就可以很方便地隨時獲取所有州的列表, 也可以獲取所有房價的數據並把它們保存成 pickle 文件.
現在如果想要再次獲取房價的信息, 就只要運行下面的代碼就可以了:
pickle_in = open(‘fiddy_states.pickle‘,‘rb‘)
HPI_data = pickle.load(pickle_in)
print(HPI_data)
以可讀的方式打開剛剛保存的 pickle 文件, 把文件的內容保存到 HPI_data 變量中, 並打印出來.
ALNSA Value ALSA Value AKNSA Value AKSA Value AZNSA Value AZSA Value ARNSA Value ... WASA Value WVNSA Value WVSA Value WINSA Value WISA Value WYNSA Value WYSA Value
Date ...
1975-01-31 35.701696 35.757921 34.456361 34.655493 28.872050 28.841062 37.136508 ... 17.459678 40.675291 40.657172 27.742952 27.817227 31.322153 31.293386
1975-02-28 35.896549 35.961777 34.954360 35.123805 29.371317 29.365897 37.428949 ... 17.566117 41.759012 41.696684 28.178063 28.237985 31.808090 31.791372
1975-03-31 36.124717 36.138987 35.465174 35.513620 29.839679 29.825888 37.672883 ... 17.654450 42.890424 42.660165 28.598691 28.558429 32.319995 32.293365
1975-04-30 36.401656 36.355040 35.998432 35.890060 30.198725 30.147602 37.866182 ... 17.752318 44.062355 43.680733 28.968088 28.811228 32.845934 32.753322
1975-05-31 36.644220 36.536191 36.581805 36.298572 30.374603 30.269359 38.063496 ... 17.814398 45.245558 44.758631 29.267055 28.971560 33.354786 33.150798
1975-06-30 36.731053 36.581766 37.208749 36.714638 30.305563 30.135694 38.284160 ... 17.906027 46.371800 45.811578 29.506152 29.074535 33.840685 33.517736
1975-07-31 36.611401 36.455401 37.839987 37.179961 29.978801 29.815028 38.471020 ... 18.040211 47.354109 46.769160 29.713240 29.212413 34.309437 33.919358
1975-08-31 36.353737 36.193565 38.422661 37.742392 29.484860 29.308448 38.551793 ... 18.122491 48.125707 47.588758 29.918971 29.457899 34.741733 34.394430
1975-09-30 36.097182 35.948436 38.932931 38.391789 28.985826 28.809285 38.490112 ... 18.142710 48.614964 48.202548 30.189301 29.836924 35.110433 34.820695
1975-10-31 35.991355 35.853360 39.390670 39.092214 28.726736 28.562418 38.372037 ... 18.193266 48.775900 48.551512 30.500255 30.253867 35.379704 35.174103
1975-11-30 36.111149 36.005626 39.794554 39.774496 28.880050 28.758223 38.359030 ... 18.385560 48.679916 48.592545 30.720418 30.565281 35.553098 35.429515
1975-12-31 36.459427 36.435918 40.129427 40.291694 29.296251 29.189322 38.487981 ... 18.653890 48.456979 48.424852 30.775030 30.765919 35.667694 35.603243
1976-01-31 36.997043 37.055307 40.397424 40.630890 29.705911 29.674028 38.704100 ... 18.955271 48.212566 48.191089 30.711465 30.793688 35.763022 35.730177
1976-02-29 37.599405 37.667727 40.618981 40.815886 29.991147 29.985613 39.017508 ... 19.178024 48.050434 47.978716 30.615016 30.680120 35.889923 35.871059
1976-03-31 38.103297 38.118349 40.819723 40.875483 30.136340 30.122412 39.454330 ... 19.327694 48.016346 47.758569 30.568294 30.525259 36.078335 36.048608
1976-04-30 38.413508 38.364316 41.012457 40.888991 30.171771 30.120693 39.955964 ... 19.466310 48.064191 47.647910 30.655795 30.489796 36.330964 36.228526
1976-05-31 38.557345 38.443675 41.199138 40.880155 30.152499 30.048025 40.385980 ... 19.617460 48.133561 47.615554 30.868413 30.556750 36.666209 36.441969
1976-06-30 38.620795 38.463828 41.380588 40.831079 30.129364 29.960483 40.704702 ... 19.869510 48.182909 47.600807 31.074662 30.620101 37.123928 36.769646
1976-07-31 38.691612 38.526748 41.548439 40.823728 30.139181 29.974532 40.999260 ... 20.245008 48.190939 47.595652 31.221162 30.694918 37.695979 37.267397
1976-08-31 38.768830 38.598017 41.697225 40.958980 30.183743 30.003149 41.332880 ... 20.590821 48.177652 47.640123 31.376070 30.892543 38.319513 37.936444
1976-09-30 38.830829 38.670819 41.883560 41.301406 30.240191 30.056011 41.631820 ... 20.818679 48.185949 47.777173 31.573523 31.204989 38.962741 38.641213
1976-10-31 38.923251 38.774015 42.162554 41.843095 30.373628 30.199889 41.822286 ... 20.997060 48.231964 48.010079 31.783570 31.526815 39.616352 39.386130
1976-11-30 39.102867 38.988602 42.525372 42.503936 30.587154 30.458126 41.966578 ... 21.235578 48.329286 48.242544 32.048837 31.886991 40.227093 40.087263
1976-12-31 39.292385 39.267048 42.945530 43.119184 30.764663 30.652374 42.156294 ... 21.476291 48.512275 48.480111 32.415178 32.405580 40.718681 40.645104
1977-01-31 39.388110 39.450139 43.380501 43.631207 30.877957 30.844815 42.448026 ... 21.691517 48.804702 48.782961 32.849718 32.937665 41.087572 41.049836
1977-02-28 39.417217 39.488843 43.780287 43.992518 31.010112 31.004389 42.833127 ... 21.850019 49.189489 49.116071 33.361979 33.432924 41.393100 41.371344
1977-03-31 39.444205 39.459786 44.105283 44.165532 31.263171 31.248722 43.219935 ... 22.111569 49.644356 49.377838 33.935754 33.887978 41.675873 41.641534
1977-04-30 39.581335 39.530647 44.364021 44.230465 31.581495 31.528031 43.514299 ... 22.569938 50.169375 49.734861 34.454736 34.268166 41.956252 41.837952
1977-05-31 39.876794 39.759235 44.601709 44.256382 31.861477 31.751082 43.731107 ... 23.051615 50.689889 50.144370 34.893232 34.540933 42.276660 42.018108
1977-06-30 40.293017 40.129254 44.839057 44.243621 32.196370 32.015902 43.973704 ... 23.491496 51.108407 50.490962 35.381716 34.864151 42.736895 42.329047
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
2016-07-31 132.599826 130.608248 176.787750 173.850747 168.526273 166.780562 138.132931 ... 187.322124 145.667676 143.439483 134.177075 130.504647 187.363168 184.746506
2016-08-31 133.022075 131.238156 176.034537 173.649132 169.526923 168.253567 138.164521 ... 189.210701 145.909376 143.768963 134.448757 131.207161 187.940275 185.288656
2016-09-30 132.848388 131.801316 175.196564 173.477463 170.675230 169.760097 137.838737 ... 191.147989 146.013298 144.254722 134.210854 131.783168 187.979164 185.806246
2016-10-31 132.131421 132.018031 174.602969 173.724838 171.990541 171.351792 137.478133 ... 192.798667 145.928107 144.650738 133.892183 132.380145 187.244371 185.776781
2016-11-30 131.351051 132.197436 174.315989 174.198443 172.884633 172.595990 137.410025 ... 194.543839 145.638546 144.906261 133.143545 132.875691 186.143898 185.545706
2016-12-31 130.925074 132.470198 174.203049 174.724112 173.422643 173.669422 137.741628 ... 196.686728 145.135293 145.001205 132.223438 133.512318 185.352769 185.518939
2017-01-31 131.113598 132.746046 174.320504 175.062814 174.134166 174.813060 138.201868 ... 198.674350 144.575795 144.904867 132.083906 134.139242 185.271251 185.955447
2017-02-28 132.122403 133.351309 174.893424 175.479807 175.076866 175.555160 138.752477 ... 200.559705 144.319529 144.868353 132.729630 134.736105 185.958796 186.739416
2017-03-31 133.712535 134.032444 175.890924 175.846975 176.660680 176.455310 139.560338 ... 202.905037 144.621493 144.790175 134.271819 135.171768 187.160834 187.516561
2017-04-30 135.408734 134.708751 177.145475 176.135739 178.784398 177.812440 140.529965 ... 205.335079 145.377521 144.743952 136.681300 135.986929 188.595601 188.151366
2017-05-31 136.708666 135.217818 178.362239 176.359848 180.510333 179.065741 141.558524 ... 207.588036 146.219754 144.722067 139.017516 136.697898 190.032933 188.701542
2017-06-30 137.263876 135.351875 179.231632 176.375707 181.803532 180.178989 142.694403 ... 209.844259 146.920037 144.868521 140.820982 137.269649 191.127304 188.953937
2017-07-31 137.076476 135.037968 179.671911 176.745654 183.181473 181.405424 143.654029 ... 211.724453 147.320848 145.019306 141.847835 137.973854 191.666629 189.027898
2017-08-31 136.831085 135.064366 179.766361 177.458431 184.253238 182.943114 144.008325 ... 213.259237 147.344639 145.185812 141.930517 138.566212 191.847514 189.139049
2017-09-30 136.972719 135.966552 179.574158 177.838111 184.761422 183.787057 143.957040 ... 215.583300 147.047528 145.306474 141.725747 139.177291 192.017190 189.783158
2017-10-31 137.084720 136.932829 178.989936 178.045002 185.403227 184.702703 143.829009 ... 217.946751 146.774686 145.559958 141.486375 139.895949 192.242059 190.711517
2017-11-30 136.886310 137.733946 178.141870 177.993174 186.438127 186.098927 143.456254 ... 220.491188 146.816202 146.150690 140.823935 140.603174 192.402151 191.747904
2017-12-31 136.838747 138.486416 177.462500 177.994371 187.319646 187.524212 142.999699 ... 223.079771 147.152612 147.021966 139.899256 141.302010 192.480743 192.612389
2018-01-31 137.386527 139.162989 177.274582 177.986031 188.063391 188.760985 143.061276 ... 225.560577 147.623851 147.968792 139.601181 141.765811 192.432381 193.138089
2018-02-28 138.560315 139.883118 177.628707 178.165078 189.699826 190.199068 143.989044 ... 228.368460 148.279081 148.832500 140.758958 142.892707 192.392620 193.187242
2018-03-31 140.207132 140.532600 178.439853 178.343795 192.052473 191.799890 145.385364 ... 230.752912 149.224547 149.376685 143.383724 144.307930 192.866379 193.227071
2018-04-30 142.030988 141.234311 179.534347 178.499005 193.974311 192.895602 146.600819 ... 232.064224 150.265376 149.587593 146.410871 145.624758 193.963855 193.517263
2018-05-31 143.759420 142.113835 180.635290 178.656108 195.652644 194.099898 147.709272 ... 232.389332 151.117827 149.543416 148.752405 146.237262 195.499823 194.155389
2018-06-30 145.071132 143.030359 181.458619 178.590479 197.299826 195.586430 148.676113 ... 232.570205 151.704632 149.558389 150.387997 146.566876 197.230004 195.019059
2018-07-31 145.649750 143.479151 181.864196 178.924144 198.463708 196.597856 149.214379 ... 233.058889 152.059646 149.659148 151.169180 147.047478 198.758395 196.045439
2018-08-31 145.615858 143.784381 181.981395 179.702489 199.367515 197.968454 149.460333 ... 233.178394 152.202201 149.984611 151.230192 147.676558 200.044993 197.217841
2018-09-30 145.371929 144.351813 181.828577 180.063556 200.313523 199.245448 149.691593 ... 233.534676 152.105658 150.324990 151.142824 148.414725 201.069979 198.718684
2018-10-31 145.028104 144.868072 181.219464 180.220086 201.062186 200.275900 149.909376 ... 233.585392 151.784579 150.566075 150.583958 148.886498 201.945318 200.320199
2018-11-30 144.636785 145.521425 180.401487 180.242216 201.586444 201.205283 150.080574 ... 233.240225 151.494684 150.842738 149.500263 149.319695 203.000387 202.298014
2018-12-31 144.364052 146.109780 179.764783 180.331613 201.983817 202.192416 150.316729 ... 233.093139 151.254986 151.102280 148.266162 149.774247 203.983473 204.111784
[528 rows x 100 columns]
在終端輸出的時候, 會自動省略掉中間的很多數據. 但是最後可以看到一共輸出了100列, 因為我們每個州是有兩列數據, 所以是一共有50個州的數據都在這.
這裏我們也可以用 Pandas 的 pickle 來操作, 這裏是在已有 HPI_data 的數據的前提下, 也就是說把一個 dataframe 輸出成 pickle 文件, 以及從 pickle 文件讀出 dataframe 只需要各一行數據就可以了.
HPI_data.to_pickle(‘pickle.pickle‘)
HPI_data2 = pd.read_pickle(‘pickle.pickle‘)
print(HPI_data2)
這裏大家可以自己打印一下試試, 會發現非常簡單, 也非常快就可以得到跟上面輸出一樣的數據.
你可能會有這樣的疑問: 既然 Python 已經有 pickle 模塊了, 為什麽 Pandas 還要有呢? 我也真的不知道為什麽. 但是顯然, Pandas 的 pickle 在出來大型數據的時候會更快一些.
現在, 我們已經把房價數據保存成 pickle 文件了, 我們可以準備進入下一節課的進一步研究了.
後面會持續更新, 有任何問題或者錯誤, 歡迎留言, 希望和大家交流學習.
基於 Python 和 Pandas 的數據分析(7) --- Pickling