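Pandas Basics: Hierarchical Indexing and Level-Based Summary Statistics
A-Xin • • Published: 2018-11-14
Hierarchical indexing
Hierarchical indexing means that a single axis carries more than one index level.
Hierarchical indexing on a Series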
import numpy as np
import pandas as pd
from pandas import Series

data = Series(np.random.randn(10),
              index=[['a', 'a', 'a', 'b', 'b', 'b', 'c', 'c', 'd', 'd'],
                     [1, 2, 3, 1, 2, 3, 1, 2, 2, 3]])
data
a  1    0.965999
   2   -0.271733
   3    0.133910
b  1   -0.806885
   2   -0.622905
   3   -0.355330
c  1   -0.659194
   2   -1.082872
d  2   -0.043984
   3   -1.125324
dtype: float64

# Select a subset of the data by outer-level label
data['b']
1   -0.806885
2   -0.622905
3   -0.355330
dtype: float64

data['b':'c']  # label slices in pandas include both endpoints
b  1   -0.806885
   2   -0.622905
   3   -0.355330
c  1   -0.659194
   2   -1.082872
dtype: float64

# Select by a list of outer-level labels
# (.loc replaces .ix, which is no longer available in current pandas)
data.loc[['b', 'd']]
b  1   -0.806885
   2   -0.622905
   3   -0.355330
d  2   -0.043984
   3   -1.125324
dtype: float64

# Select on the inner level: every row whose inner label is 2
data[:, 2]
a   -0.271733
b   -0.622905
c   -1.082872
d   -0.043984
dtype: float64

# Hierarchical indexing plays an important role in reshaping data and in group-based operations.
# unstack() turns the hierarchically indexed Series into a DataFrame:
# the outer level becomes the row index and the inner level becomes the columns.
data.unstack()
          1         2         3
a  0.965999 -0.271733  0.133910
b -0.806885 -0.622905 -0.355330
c -0.659194 -1.082872       NaN
d       NaN -0.043984 -1.125324

# stack() is the inverse of unstack() and converts the result back
data.unstack().stack()
a  1    0.965999
   2   -0.271733
   3    0.133910
b  1   -0.806885
   2   -0.622905
   3   -0.355330
c  1   -0.659194
   2   -1.082872
d  2   -0.043984
   3   -1.125324
dtype: float64
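The selections above can also be written with `.loc` and `xs`, which name the targeted level explicitly. The following is a minimal sketch, not the author's code: it assumes fixed values instead of random ones so the results are reproducible, and the variable names (`s`, `outer_b`, `inner_2`, `wide`, `round_trip`) are illustrative.

```python
import numpy as np
import pandas as pd

# Same two-level index as above, but with fixed values 0.0 .. 9.0 (an assumption
# made here for reproducibility; the original uses random numbers).
s = pd.Series(
    np.arange(10, dtype=float),
    index=[list("aaabbbccdd"), [1, 2, 3, 1, 2, 3, 1, 2, 2, 3]],
)

outer_b = s.loc["b"]            # partial indexing on the outer level
outer_bd = s.loc[["b", "d"]]    # several outer labels at once
inner_2 = s.xs(2, level=1)      # every row whose inner label is 2

wide = s.unstack()              # inner level becomes columns; missing pairs become NaN
# dropna() removes the NaN placeholders introduced by unstack(), so the round trip
# matches the original regardless of how the installed pandas handles NaN in stack().
round_trip = wide.stack().dropna()

print(inner_2)
print(wide)
```

`xs` with `level=1` is one way to make inner-level selection readable when there are more than two levels; positional tuples such as `data[:, 2]` become harder to follow as the number of levels grows.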
Hierarchical indexing on a DataFrame
frame = pd.DataFrame(np.arange(12).reshape(4, 3),
                     index=[['a', 'a', 'b', 'b'], [1, 2, 1, 2]],
                     columns=[['ohio', 'ohio', 'color'], ['green', 'red', 'green']])
frame
     ohio     color
    green red green
a 1     0   1     2
  2     3   4     5
b 1     6   7     8
  2     9  10    11

# Name the row index levels
frame.index.names = ['key1', 'key2']
# Name the column index levels
frame.columns.names = ['state', 'color']
frame
state      ohio     color
color     green red green
key1 key2
a    1        0   1     2
     2        3   4     5
b    1        6   7     8
     2        9  10    11

# Selecting an outer column label returns the columns grouped under it
frame['ohio']
color      green  red
key1 key2
a    1         0    1
     2         3    4
b    1         6    7
     2         9   10
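The same frame can also be built with the level names attached up front, using `pd.MultiIndex.from_arrays`. This is a sketch under that assumption rather than the post's own approach; the names `rows`, `cols`, `ohio_red` and `greens` are illustrative.

```python
import numpy as np
import pandas as pd

# Build both axes as named MultiIndex objects instead of naming them afterwards.
rows = pd.MultiIndex.from_arrays(
    [list("aabb"), [1, 2, 1, 2]], names=["key1", "key2"]
)
cols = pd.MultiIndex.from_arrays(
    [["ohio", "ohio", "color"], ["green", "red", "green"]],
    names=["state", "color"],
)
frame = pd.DataFrame(np.arange(12).reshape(4, 3), index=rows, columns=cols)

ohio = frame["ohio"]                 # outer column label -> sub-frame
ohio_red = frame[("ohio", "red")]    # full tuple -> a single column (Series)
# Inner column label across all states: cross-section on the 'color' level.
greens = frame.xs("green", axis=1, level="color")

print(ohio)
print(greens)
```

Passing a full tuple selects exactly one column, while `xs` on the inner level collects matching columns from every outer group.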
Reordering levels
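frame
state      ohio     color
color     green red green
key1 key2
a    1        0   1     2
     2        3   4     5
b    1        6   7     8
     2        9  10    11

# swaplevel(0, 1) exchanges key1 and key2, so level 0 is now key2.
# Sorting on level 0 therefore sorts by key2.
# (sort_index(level=...) replaces sortlevel(), which is no longer available in current pandas)
frame.swaplevel(0, 1).sort_index(level=0)
state      ohio     color
color     green red green
key2 key1
1    a        0   1     2
     b        6   7     8
2    a        3   4     5
     b        9  10    11

# Sorting on level 1 sorts by key1
frame.swaplevel(0, 1).sort_index(level=1)
state      ohio     color
color     green red green
key2 key1
1    a        0   1     2
2    a        3   4     5
1    b        6   7     8
2    b        9  10    11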
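As a self-contained check of the reordering above, the sketch below swaps the two row levels and sorts with `sort_index(level=...)`, the method current pandas releases provide for sorting by an index level. It assumes a simplified frame with flat columns `x`, `y`, `z` (illustrative names, not from the original) and also shows `reorder_levels` as an alternative to `swaplevel`.

```python
import numpy as np
import pandas as pd

idx = pd.MultiIndex.from_arrays(
    [list("aabb"), [1, 2, 1, 2]], names=["key1", "key2"]
)
# Flat, hypothetical column names keep the example short.
frame = pd.DataFrame(np.arange(12).reshape(4, 3), index=idx, columns=["x", "y", "z"])

# swaplevel only changes the order of the levels; the rows stay where they are.
swapped = frame.swaplevel("key1", "key2")

by_key2 = swapped.sort_index(level=0)        # level 0 is now key2
by_key1 = swapped.sort_index(level="key1")   # levels can also be named

# reorder_levels takes the full target order, which scales to more than two levels.
same = frame.reorder_levels(["key2", "key1"])

print(by_key2)
print(by_key1)
```

Referring to levels by name rather than by position avoids having to remember which level is 0 after a swap.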
Summary statistics by index level
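frame
state      ohio     color
color     green red green
key1 key2
a    1        0   1     2
     2        3   4     5
b    1        6   7     8
     2        9  10    11

# Sum the rows that share a key2 label: the two 1 rows together, the two 2 rows together.
# (groupby(level=...) replaces sum(level=...), which is no longer accepted by current pandas)
frame.groupby(level='key2').sum()
state  ohio     color
color green red green
key2
1         6   8    10
2        12  14    16

# Sum the columns that share a color label: the two green columns are added together;
# red has only one column, so it is unchanged.
frame.T.groupby(level='color').sum().T
color      green  red
key1 key2
a    1         2    1
     2         8    4
b    1        14    7
     2        20   10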
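A self-contained sketch of level-wise aggregation through `groupby(level=...)`, the form that works on current pandas releases. Transposing before grouping is one way (an assumption of this sketch, not the only option) to aggregate over a column level, and other aggregations such as `mean` follow the same pattern. The result names are illustrative.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(
    np.arange(12).reshape(4, 3),
    index=pd.MultiIndex.from_arrays(
        [list("aabb"), [1, 2, 1, 2]], names=["key1", "key2"]
    ),
    columns=pd.MultiIndex.from_arrays(
        [["ohio", "ohio", "color"], ["green", "red", "green"]],
        names=["state", "color"],
    ),
)

# Aggregate rows that share the same label on a row level.
sum_by_key2 = frame.groupby(level="key2").sum()
mean_by_key1 = frame.groupby(level="key1").mean()

# Aggregate columns that share the same label on a column level:
# transpose, group the (now row) level, then transpose back.
sum_by_color = frame.T.groupby(level="color").sum().T

print(sum_by_key2)
print(sum_by_color)
```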
Using a DataFrame's columns as the index
frame1 = pd.DataFrame({'a':range(7),'b':range(7,0,-1),
'c':['one','one','one','two','two','two','two'],
'd':[0,1,2,0,1,2,3]
})
frame1
   a  b    c  d
0  0  7  one  0
1  1  6  one  1
2  2  5  one  2
3  3  4  two  0
4  4  3  two  1
5  5  2  two  2
6  6  1  two  3
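# Set columns c and d as the row index. By default these two columns are removed
# from the frame; pass drop=False to keep them as columns as well.
frame1.set_index(['c', 'd'])
       a  b
c   d
one 0  0  7
    1  1  6
    2  2  5
two 0  3  4
    1  4  3
    2  5  2
    3  6  1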
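# reset_index() moves index levels back into ordinary columns (the inverse of set_index).
# frame2 is not defined above; judging from the output, it still had a default
# integer index, which reset_index() turns into a new column named index.
frame2.reset_index()
   index  a  b    c  d
0      0  0  7  one  0
1      1  1  6  one  1
2      2  2  5  one  2
3      3  3  4  two  0
4      4  4  3  two  1
5      5  5  2  two  2
6      6  6  1  two  3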
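A sketch of the round trip between columns and index levels, assuming the `frame1` data shown above. The names `indexed`, `kept` and `restored` are illustrative and do not appear in the original.

```python
import pandas as pd

frame1 = pd.DataFrame({
    "a": range(7),
    "b": range(7, 0, -1),
    "c": ["one", "one", "one", "two", "two", "two", "two"],
    "d": [0, 1, 2, 0, 1, 2, 3],
})

# set_index returns a new frame; frame1 itself is left unchanged.
indexed = frame1.set_index(["c", "d"])            # c and d leave the columns
kept = frame1.set_index(["c", "d"], drop=False)   # c and d stay as columns too

# reset_index moves the index levels back in as the leading columns.
restored = indexed.reset_index()

print(indexed)
print(restored)
```

Because `set_index` does not modify `frame1` in place, the result has to be assigned to a name before `reset_index` can undo it.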