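pandas: Essential Functionality
Official pandas documentation:
1. Reindexing
Purpose: create a new object conformed to a new index. The original data is rearranged to match the new index, and any index label that was not present before gets a missing value (or the value given by fill_value).
Arguments of the reindex function: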
index | New sequence to use as index. Can be Index instance or any other sequence-like Python data structure. An Index will be used exactly as is without any copying |
method | Interpolation (fill) method, see table for options. |
fill_value | Substitute value to use when introducing missing data by reindexing |
limit | When forward- or backfilling, maximum size gap to fill |
level | Match simple Index on level of MultiIndex, otherwise select subset of |
copy | Do not copy underlying data if new index is equivalent to old index. True by default (i.e. always copy data). |
In [49]: obj = pd.Series([4.5, 7.2, -5.3, 3.6], index=['d', 'b', 'a', 'c'])
In [50]: obj
Out[50]:
d    4.5
b    7.2
a   -5.3
c    3.6
dtype: float64
In [51]: obj.reindex(['a','b','c','d','e'])   # or: obj.reindex(['a','b','c','d','e'], fill_value=0)
Out[51]:
a   -5.3
b    7.2
c    3.6
d    4.5
e    NaN    # with fill_value=0 this entry is 0 instead
dtype: float64
For an ordered index, the method option fills values forward or backward while reindexing:
In [56]: obj1 = pd.Series(['blue', 'purple', 'yellow'], index=[0, 2, 4])
In [57]: obj1.reindex(range(6), method='ffill')
Out[59]:
0      blue
1      blue
2    purple
3    purple
4    yellow
5    yellow
dtype: object
method (interpolation) options for reindex:
ffill or pad | Fill (or carry) values forward |
bfill or backfill | Fill (or carry) values backward |
For a DataFrame, you can reindex the index and the columns separately or both at once; by default the rows are reindexed.
Interpolation on a DataFrame can only be applied along the rows (axis 0).
In [64]: frame = pd.DataFrame(np.arange(9).reshape((3, 3)), index=['a', 'c', 'd'],
    ...:                      columns=['Ohio', 'Texas', 'California'])
In [66]: frame.reindex(['a','b','c','d'])
In [67]: states = ['Texas', 'Utah', 'California']
In [68]: frame.reindex(columns=states)
In [71]: frame.reindex(index=['a','b','c','d'], columns=states)
In [82]: frame.reindex(index=['a','b','c','d'], method='ffill').reindex(columns=states)
# The book Python for Data Analysis writes this line as:
# frame.reindex(index=['a', 'b', 'c', 'd'], method='ffill', columns=states)
# Depending on the pandas version, running that line may raise an error
The book Python for Data Analysis uses the label-indexing feature of ix here. ix is deprecated and was removed in pandas 1.0, and from that version on loc raises KeyError when a list contains missing labels, so reindex is the reliable choice:
In [87]: frame.ix[['a','b','c','d'], states]
W:\software\Python\Python35\Scripts\ipython:1: DeprecationWarning:
.ix is deprecated. Please use
.loc for label based indexing or
.iloc for positional indexing
See the documentation here:
http://pandas.pydata.org/pandas-docs/stable/indexing.html#ix-indexer-is-deprecated
W:\software\Python\Python35\Scripts\ipython:1: FutureWarning:
Passing list-likes to .loc or [] with any missing label will raise
KeyError in the future, you can use .reindex() as an alternative.
See the documentation here:
https://pandas.pydata.org/pandas-docs/stable/indexing.html#deprecate-loc-reindex-listlike
Out[87]:
   Texas  Utah  California
a    1.0   NaN         2.0
b    NaN   NaN         NaN
c    4.0   NaN         5.0
d    7.0   NaN         8.0
In [88]: frame.loc[['a','b','c','d'], states]
W:\software\Python\Python35\Scripts\ipython:1: FutureWarning:
Passing list-likes to .loc or [] with any missing label will raise
KeyError in the future, you can use .reindex() as an alternative.
See the documentation here:
https://pandas.pydata.org/pandas-docs/stable/indexing.html#deprecate-loc-reindex-listlike
Out[88]:
   Texas  Utah  California
a    1.0   NaN         2.0
b    NaN   NaN         NaN
c    4.0   NaN         5.0
d    7.0   NaN         8.0
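Since the ix and loc calls above depend on the pandas version, here is a minimal sketch of the same selection written only with reindex. The variable names result and filled are illustrative and not from the book.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(9).reshape((3, 3)),
                     index=['a', 'c', 'd'],
                     columns=['Ohio', 'Texas', 'California'])
states = ['Texas', 'Utah', 'California']

# reindex accepts labels that are absent and fills them with NaN
result = frame.reindex(index=['a', 'b', 'c', 'd'], columns=states)
print(result)

# Forward-fill along the rows first, then select the columns
filled = (frame.reindex(index=['a', 'b', 'c', 'd'], method='ffill')
               .reindex(columns=states))
print(filled)
```

In filled, row 'b' copies the values of row 'a', while the new 'Utah' column stays NaN because the fill only runs along the rows.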
2. Dropping entries from an axis
The drop method returns a new object with the specified entries removed from an axis.
In [96]: obj = pd.Series(np.arange(5.), index=['a', 'b', 'c', 'd', 'e'])
In [97]: new_obj = obj.drop('c')
In [98]: obj.drop(['d', 'c'])
Out[98]:
a 0.0
b 1.0
e 4.0
dtype: float64
For a DataFrame, index values can be dropped from either axis:
In [99]: data = pd.DataFrame(np.arange(16).reshape((4, 4)),
    ...:                     index=['Ohio', 'Colorado', 'Utah', 'New York'],
    ...:                     columns=['one', 'two', 'three', 'four'])
In [100]: data
Out[100]:
          one  two  three  four
Ohio        0    1      2     3
Colorado    4    5      6     7
Utah        8    9     10    11
New York   12   13     14    15
In [101]: data.drop(['Colorado', 'Ohio'])
Out[101]:
          one  two  three  four
Utah        8    9     10    11
New York   12   13     14    15
In [102]: data.drop('two', axis=1)
Out[102]:
          one  three  four
Ohio        0      2     3
Colorado    4      6     7
Utah        8     10    11
New York   12     14    15
In [103]: data.drop(['two', 'four'], axis=1)
Out[103]:
          one  three
Ohio        0      2
Colorado    4      6
Utah        8     10
New York   12     14
3. Indexing, selection, and filtering
Series indexing works much like NumPy array indexing:
In [102]: obj = pd.Series(np.arange(4.), index=['a', 'b', 'c', 'd'])
In [103]: obj['b']
In [104]: obj[1]
In [105]: obj[2:4]
In [106]: obj[['b', 'a', 'd']]
In [107]: obj[[1, 3]]
In [108]: obj[obj < 2]
Slicing with labels works for both selection and assignment (the endpoint is inclusive):
In [110]: obj['b':'c'] = 5
Indexing into a DataFrame retrieves one or more columns:
In [112]: data = pd.DataFrame(np.arange(16).reshape((4, 4)),
     ...:                     index=['Ohio', 'Colorado', 'Utah', 'New York'],
     ...:                     columns=['one', 'two', 'three', 'four'])
In [114]: data['two']
In [115]: data[['three', 'one']]
Rows can be selected by slicing or with a boolean array:
In [116]: data[:2]
In [117]: data[data['three'] > 5]
Indexing with a boolean DataFrame:
In [118]: data < 5
In [119]: data[data < 5] = 0
In [120]: data
Indexing rows and columns with ix (deprecated; use loc or iloc instead):
In [121]: data.ix['Colorado', ['two', 'three']]
In [122]: data.ix[['Colorado', 'Utah'], [3, 0, 1]]
In [123]: data.ix[2]
In [124]: data.ix[:'Utah', 'two']
In [125]: data.ix[data.three > 5, :3]
Indexing options for a DataFrame:
obj.ix[val] | Selects single row or subset of rows from the DataFrame. |
obj.ix[:, val] | Selects single column or subset of columns. |
obj.ix[val1, val2] | Select both rows and columns. |
reindex | Conform one or more axes to new indexes. |
xs | Select single row or column as a Series by label. |
icol, irow | Select single column or row, respectively, as a Series by integer location. |
get_value, set_value | Select single value by row and column label. |
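Because ix is no longer available in current pandas, the five ix calls above can be rewritten with loc (labels) and iloc (positions). The following is a sketch of one possible translation, not the book's code; the names a through e are only for illustration.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame(np.arange(16).reshape((4, 4)),
                    index=['Ohio', 'Colorado', 'Utah', 'New York'],
                    columns=['one', 'two', 'three', 'four'])

# data.ix['Colorado', ['two', 'three']]: labels on both axes
a = data.loc['Colorado', ['two', 'three']]

# data.ix[['Colorado', 'Utah'], [3, 0, 1]]: row labels, column positions
b = data.loc[['Colorado', 'Utah']].iloc[:, [3, 0, 1]]

# data.ix[2]: a single row by position
c = data.iloc[2]

# data.ix[:'Utah', 'two']: label slice, endpoint included
d = data.loc[:'Utah', 'two']

# data.ix[data.three > 5, :3]: boolean row mask, first three columns by position
e = data.loc[data['three'] > 5].iloc[:, :3]

print(a, b, c, d, e, sep='\n\n')
```

Mixing labels and positions in one call was the main convenience of ix; with loc and iloc the two kinds of selection are chained explicitly.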
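4. Arithmetic
When objects are added, the index of the result is the union of the indexes of the operands. Labels that do not overlap are filled with NA (a fill value can be specified instead):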
In [126]: s1 = pd.Series([7.3, -2.5, 3.4, 1.5], index=['a', 'c', 'd', 'e'])
In [127]: s2 = pd.Series([-2.1, 3.6, -1.5, 4, 3.1], index=['a', 'c', 'e', 'f', 'g'])
In [128]: s1 + s2
Out[128]:
a 5.2
c 1.1
d NaN
e 0.0
f NaN
g NaN
dtype: float64
In [129]: df1 = pd.DataFrame(np.arange(9.).reshape((3, 3)), columns=list('bcd'),
...: index=['Ohio', 'Texas', 'Colorado'])
In [130]: df2 = pd.DataFrame(np.arange(12.).reshape((4, 3)), columns=list('bde'),
...: index=['Utah', 'Ohio', 'Texas', 'Oregon'])
In [131]: df1 + df2
Out[131]:
b c d e
Colorado NaN NaN NaN NaN
Ohio 3.0 NaN 6.0 NaN
Oregon NaN NaN NaN NaN
Texas 9.0 NaN 12.0 NaN
Utah NaN NaN NaN NaN
In [132]: df1.add(df2,fill_value=0)
Out[132]:
b c d e
Colorado 6.0 7.0 8.0 NaN
Ohio 3.0 1.0 6.0 5.0
Oregon 9.0 NaN 10.0 11.0
Texas 9.0 4.0 12.0 8.0
Utah 0.0 NaN 1.0 2.0
Method | Description |
add | Method for addition (+) |
sub | Method for subtraction (-) |
div | Method for division (/) |
mul | Method for multiplication (*) |
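Operations between DataFrame and Series
Operations between the two rely on broadcasting, which will be covered in more detail later. Broadcasting in one and two dimensions is fairly easy to follow.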
In [153]: frame = pd.DataFrame(np.arange(12.).reshape((4, 3)), columns=list('bde'),
...: index=['Utah', 'Ohio', 'Texas', 'Oregon'])
In [154]: series = frame.iloc[0]
In [155]: frame - series   # the Series index is matched on the columns and the subtraction is broadcast down the rows
Out[155]:
b d e
Utah 0.0 0.0 0.0
Ohio 3.0 3.0 3.0
Texas 6.0 6.0 6.0
Oregon 9.0 9.0 9.0
If you want to match on the rows and broadcast across the columns, you must use one of the arithmetic methods:
In [160]: frame = pd.DataFrame(np.arange(12.).reshape((4, 3)), columns=list('bde'),
...: index=['Utah', 'Ohio', 'Texas', 'Oregon'])
In [161]: series1 = frame['d']
In [162]: frame.sub(series1,axis=0)
Out[162]:
b d e
Utah -1.0 0.0 1.0
Ohio -1.0 0.0 1.0
Texas -1.0 0.0 1.0
Oregon -1.0 0.0 1.0
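The two broadcasting directions can be placed side by side. This is a small sketch reusing the frame above; the names row, col, by_columns, and by_rows are illustrative.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(12.).reshape((4, 3)), columns=list('bde'),
                     index=['Utah', 'Ohio', 'Texas', 'Oregon'])

row = frame.iloc[0]   # Series indexed by the column labels b, d, e
col = frame['d']      # Series indexed by the row labels

# Default: match the Series index on the columns, broadcast down the rows
by_columns = frame - row

# axis='index' (same as axis=0): match on the row labels, broadcast across columns
by_rows = frame.sub(col, axis='index')

print(by_columns)
print(by_rows)
```

The operator form frame - row is equivalent to frame.sub(row, axis='columns'); only the method form lets you choose the other axis.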
5. Function application and mapping
NumPy ufuncs (element-wise array methods) also work on pandas objects:
In [164]: frame = pd.DataFrame(np.random.randn(4,3), columns=list('bde'),
...: index=['Utah', 'Ohio', 'Texas', 'Oregon'])
In [167]: frame
Out[167]:
b d e
Utah -0.637896 0.509292 -0.919939
Ohio -0.604495 0.298296 -0.377575
Texas -0.710751 -0.091902 0.607375
Oregon 0.576612 1.664728 0.264065
In [168]: np.abs(frame)
Out[168]:
b d e
Utah 0.637896 0.509292 0.919939
Ohio 0.604495 0.298296 0.377575
Texas 0.710751 0.091902 0.607375
Oregon 0.576612 1.664728 0.264065
The apply method applies a function to the one-dimensional arrays formed by each column or each row:
In [172]: f = lambda x: x.max() - x.min()
In [173]: frame.apply(f)
Out[173]:
b    1.287363
d    1.756631
e    1.527314
dtype: float64
The function can also return a Series made up of several values:
In [176]: def f(x):
     ...:     return pd.Series([x.min(), x.max()], index=['min', 'max'])
In [177]: frame.apply(f)
Out[177]:
            b         d         e
min -0.710751 -0.091902 -0.919939
max  0.576612  1.664728  0.607375
Element-wise Python functions work too, via the applymap method (renamed to DataFrame.map in pandas 2.1, where applymap is deprecated):
In [179]: format = lambda x: '%.2f' % x
In [180]: frame.applymap(format)
Out[180]:
            b      d      e
Utah    -0.64   0.51  -0.92
Ohio    -0.60   0.30  -0.38
Texas   -0.71  -0.09   0.61
Oregon   0.58   1.66   0.26
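The three levels of function application can be compared in one place. This is a sketch under two assumptions that are not in the original: the random data is seeded so that runs are repeatable, and the helper name spread is made up for the example.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # seeded only so the example is repeatable
frame = pd.DataFrame(rng.standard_normal((4, 3)), columns=list('bde'),
                     index=['Utah', 'Ohio', 'Texas', 'Oregon'])

def spread(x):
    """Range of a one-dimensional slice (hypothetical helper)."""
    return x.max() - x.min()

per_column = frame.apply(spread)        # axis=0: one result per column
per_row = frame.apply(spread, axis=1)   # axis=1: one result per row

# DataFrame.map exists only in newer pandas; fall back to applymap otherwise
elementwise = frame.map if hasattr(frame, 'map') else frame.applymap
formatted = elementwise(lambda v: '%.2f' % v)

print(per_column, per_row, formatted, sep='\n\n')
```

apply hands the function a whole column or row, while the element-wise call hands it one scalar at a time, so the latter is usually slower on large frames.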
6. Sorting and ranking
Use the sort_index method to sort by the row or column index.
To sort a Series by its values, use the sort_values method; missing values are placed at the end by default.
In [183]: obj = pd.Series(range(4), index=['d', 'a', 'b', 'c'])
In [187]: obj.sort_index()    # descending: obj.sort_index(ascending=False)
In [189]: obj.sort_values()   # descending: obj.sort_values(ascending=False)
A DataFrame can be sorted by the index on either axis; the order is ascending by default and can be made descending:
In [196]: frame = pd.DataFrame(np.arange(8).reshape((2, 4)), index=['three', 'one'],
     ...:                      columns=['d', 'a', 'b', 'c'])
In [197]: frame.sort_index()
Out[197]:
       d  a  b  c
one    4  5  6  7
three  0  1  2  3
In [198]: frame.sort_index(axis=1)
Out[198]:
       a  b  c  d
three  1  2  3  0
one    5  6  7  4
In [199]: frame.sort_index(axis=1, ascending=False)
Out[199]:
       d  c  b  a
three  0  3  2  1
one    4  7  6  5
On a DataFrame you can also use the by keyword to sort by the values of one or more columns:
# The frame used below is not the one defined above; its output implies pd.DataFrame({'a': [0, 1, 0, 1], 'b': [4, 7, -3, 2]})
In [203]: frame.sort_values(by='b')   # sort_index(by=...) gives FutureWarning: by argument to sort_index
                                      # is deprecated, please use .sort_values(by=...)
Out[203]:
a b
2 0 -3
3 1 2
0 0 4
1 1 7
In [204]: frame.sort_values(by=['a','b'])
Out[204]:
a b
2 0 -3
0 0 4
3 1 2
1 1 7
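Note: when sorting a DataFrame by value you must pass by to name one or more columns (or rows); omitting it raises TypeError: sort_values() missing 1 required positional argument: 'by'. When by lists several keys, the first key is the primary sort key and each later key only breaks ties among entries that are equal on the earlier keys, so a later key appears to have no effect when the earlier keys contain no ties. To sort by the values in a row, pass axis=1. With the default axis=0, by is looked up among the column labels, so a row label is not found and an error is raised; with axis=1, by refers to row labels and the columns are reordered.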
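The points in this note can be checked with a short sketch. The frame matches the one implied by the output above; the names by_a_then_b, mixed, and wide are illustrative.

```python
import pandas as pd

frame = pd.DataFrame({'b': [4, 7, -3, 2], 'a': [0, 1, 0, 1]})

# 'a' is the primary key; 'b' only orders rows that tie on 'a'
by_a_then_b = frame.sort_values(by=['a', 'b'])

# Each key can have its own direction
mixed = frame.sort_values(by=['a', 'b'], ascending=[True, False])

# axis=1: 'by' names a row label, and the columns are reordered
wide = frame.sort_values(by=0, axis=1)

print(by_a_then_b, mixed, wide, sep='\n\n')
```

In wide, row 0 holds b=4 and a=0, so column 'a' moves ahead of column 'b'.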
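Ranking:
Ranking is similar to sorting. It assigns each value a rank, from 1 up to the number of valid data points in the array. It is close to the indirect sort indices produced by numpy.argsort, except that ties are broken according to a rule.
In [214]: obj = pd.Series([7, -5, 7, 4, 2, 0, 4])   # positions (0, 1, 2, 3, 4, 5, 6)
In [215]: obj.rank()   # by default, tied values receive the average of their ranks
Out[215]:
0    6.5   # 7 is the largest value; the two 7s occupy ranks 6 and 7, so each gets 6.5
1    1.0   # -5 is the smallest value (position 1), so it ranks first
2    6.5
3    4.5
4    3.0
5    2.0
6    4.5
dtype: float64
Rank according to the order in which the values appear in the data:
In [216]: obj.rank(method='first')   # descending is also possible: obj.rank(ascending=False, method='max') gives tied values the maximum rank of their group
Out[216]:
0    6.0
1    1.0
2    7.0
3    4.0
4    3.0
5    2.0
6    5.0
dtype: float64
Ranks can also be computed along a chosen axis:
In [219]: frame = pd.DataFrame({'b': [4.3, 7, -3, 2], 'a': [0, 1, 0, 1],
     ...:                       'c': [-2, 5, 8, -2.5]})
In [220]: frame
Out[220]:
   a    b    c
0  0  4.3 -2.0
1  1  7.0  5.0
2  0 -3.0  8.0
3  1  2.0 -2.5
In [221]: frame.rank(axis=1)
Out[221]:
     a    b    c
0  2.0  3.0  1.0
1  1.0  3.0  2.0
2  2.0  1.0  3.0
3  2.0  3.0  1.0
method options for breaking ties when ranking: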
average | Default: assign the average rank to each entry in the equal group |
min | Use the minimum rank for the whole group |
max | Use the maximum rank for the whole group |
first | Assign ranks in the order the values appear in the data |
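To see how the four tie-breaking rules differ, the same Series can be ranked with each method and the results collected in one table. A minimal sketch; the column name 'max_desc' is made up for the example.

```python
import pandas as pd

obj = pd.Series([7, -5, 7, 4, 2, 0, 4])

# One column per tie-breaking method
ranks = pd.DataFrame({m: obj.rank(method=m)
                      for m in ['average', 'min', 'max', 'first']})

# Descending order, ties get the largest rank in their group
ranks['max_desc'] = obj.rank(ascending=False, method='max')

print(ranks)
```

Only the tied values (the two 7s and the two 4s) change between columns; values without ties get the same rank under every method.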
7. Summarizing and computing descriptive statistics
Statistics can be computed over the rows or over the columns. NA values are skipped by default; pass skipna=False to include them.
In [224]: df = pd.DataFrame([[1.4, np.nan], [7.1, -4.5],
     ...:                    [np.nan, np.nan], [0.75, -1.3]],
     ...:                   index=['a', 'b', 'c', 'd'],
     ...:                   columns=['one', 'two'])
In [225]: df
Out[225]:
    one  two
a  1.40  NaN
b  7.10 -4.5
c   NaN  NaN
d  0.75 -1.3
In [227]: df.sum()
Out[227]:
one    9.25
two   -5.80
dtype: float64
In [228]: df.sum(axis=1)
Out[228]:
a    1.40
b    2.60
c    0.00
d   -0.55
dtype: float64
In [229]: df.sum(axis=1, skipna=False)
Out[229]:
a     NaN
b    2.60
c     NaN
d   -0.55
dtype: float64
Options for reduction methods:
Argument | Description |
axis | Axis to reduce over. 0 for DataFrame’s rows and 1 for columns. |
skipna | Exclude missing values, True by default. |
level | Reduce grouped by level if the axis is hierarchically-indexed (MultiIndex). |
Method | Description |
count | Number of non-NA values |
describe | Compute set of summary statistics for Series or each DataFrame column |
min, max | Compute minimum and maximum values |
argmin, argmax | Compute index locations (integers) at which minimum or maximum value obtained, respectively |
idxmin, idxmax | Compute index values at which minimum or maximum value obtained, respectively |
quantile | Compute sample quantile ranging from 0 to 1 |
sum | Sum of values |
mean | Mean of values |
median | Arithmetic median (50% quantile) of values |
mad | Mean absolute deviation from mean value |
var | Sample variance of values |
std | Sample standard deviation of values |
skew | Sample skewness (3rd moment) of values |
kurt | Sample kurtosis (4th moment) of values |
cumsum | Cumulative sum of values |
cummin, cummax | Cumulative minimum or maximum of values, respectively |
cumprod | Cumulative product of values |
diff | Compute 1st arithmetic difference (useful for time series) |
pct_change | Compute percent changes |
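A few of the methods in the table, applied to the df defined earlier in this section. This is a sketch; min_count is a standard argument of sum that the text above does not mention, included here because it explains why the all-NA row 'c' sums to 0.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[1.4, np.nan], [7.1, -4.5],
                   [np.nan, np.nan], [0.75, -1.3]],
                  index=['a', 'b', 'c', 'd'],
                  columns=['one', 'two'])

row_sum = df.sum(axis=1)                        # NA skipped; all-NA row gives 0.0
row_sum_strict = df.sum(axis=1, min_count=1)    # needs at least one valid value
row_mean_all = df.mean(axis=1, skipna=False)    # any NA makes the result NA

best = df.idxmax()      # index label of the maximum in each column
running = df.cumsum()   # accumulation; NA positions stay NA

print(row_sum, row_sum_strict, row_mean_all, best, running, sep='\n\n')
```

Reductions such as sum and mean return one value per row or column, idxmax returns labels rather than values, and cumsum keeps the original shape.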
8. Unique values, value counts, and membership
In [231]: obj = pd.Series(['c', 'a', 'd', 'a', 'a', 'b', 'b', 'c', 'c'])
In [232]: uniques = obj.unique()   # unique values
In [233]: uniques
Out[233]: array(['c', 'a', 'd', 'b'], dtype=object)
In [234]: obj.value_counts()   # value counts
Out[234]:
c    3
a    3
b    2
d    1
dtype: int64
In [235]: mask = obj.isin(['b', 'c'])   # membership
In [236]: mask
Out[236]:
0     True
1    False
2    False
3    False
4    False
5     True
6     True
7     True
8     True
dtype: bool
In [237]: obj[mask]
Out[237]:
0    c
5    b
6    b
7    c
8    c
dtype: object
Method | Description |
isin | Compute boolean array indicating whether each Series value is contained in the passed sequence of values. |
unique | Compute array of unique values in a Series, returned in the order observed. |
value_counts | Return a Series containing unique values as its index and frequencies as its values, ordered count in descending order. |
Passing pandas.value_counts to the apply function of a DataFrame:
In [239]: data = pd.DataFrame({'Qu1': [1, 3, 4, 3, 4],
     ...:                      'Qu2': [2, 3, 1, 2, 3],
     ...:                      'Qu3': [1, 5, 2, 4, 4]})
In [240]: data
Out[240]:
   Qu1  Qu2  Qu3
0    1    2    1
1    3    3    5
2    4    1    2
3    3    2    4
4    4    3    4
In [241]: result = data.apply(pd.value_counts).fillna(0)
In [242]: result
Out[242]:
   Qu1  Qu2  Qu3
1  1.0  1.0  1.0
2  0.0  2.0  1.0
3  2.0  2.0  0.0
4  2.0  0.0  2.0
5  0.0  0.0  1.0
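The same histogram table can be built by calling the value_counts method on each column, which avoids the top-level pd.value_counts function (newer pandas releases mark it as deprecated). A sketch, with the explicit sort_index and astype(int) added here as assumptions about the desired output:

```python
import pandas as pd

data = pd.DataFrame({'Qu1': [1, 3, 4, 3, 4],
                     'Qu2': [2, 3, 1, 2, 3],
                     'Qu3': [1, 5, 2, 4, 4]})

# Count each column separately; the row labels become the distinct values
result = (data.apply(lambda col: col.value_counts())
              .fillna(0)       # a value absent from a column counts as 0
              .astype(int)     # counts are whole numbers
              .sort_index())

print(result)
```

Each column of result is one column's histogram, aligned on the union of all values seen in the frame.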
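9. Handling missing data
pandas uses the floating-point value NaN (Not a Number) to represent missing data in both floating-point and non-floating-point arrays. It is simply a sentinel that is convenient to detect.
Methods for handling NA: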
dropna | Filter axis labels based on whether values for each label have missing data, with varying thresholds for how much missing data to tolerate. |
fillna | Fill in missing data with some value or using an interpolation method such as 'ffill' or 'bfill'. |
isnull | Return like-type object containing boolean values indicating which values are missing / NA. |
notnull | Negation of isnull. |
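Filtering out missing data: dropna
On a Series, dropna() returns a Series containing only the non-null data and their index values:
In [6]: from numpy import nan as NA
In [7]: data = pd.Series([1, NA, 4, NA, 5])
In [8]: data.dropna()   # boolean indexing does the same: data[data.notnull()]
Out[8]:
0    1.0
2    4.0
4    5.0
dtype: float64
On a DataFrame, dropna by default drops every row that contains a missing value. Passing how='all' drops only the rows that are entirely NA. To drop columns instead of rows, pass axis=1. The thresh argument keeps only the rows that have at least that many non-NA values.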
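The DataFrame options described above have no example in the text, so here is a minimal sketch. The data values and variable names are made up for illustration.

```python
import numpy as np
import pandas as pd

NA = np.nan
data = pd.DataFrame([[1., 6.5, 3.],
                     [1., NA, NA],
                     [NA, NA, NA],
                     [NA, 6.5, 3.]])

any_na = data.dropna()             # drop rows with at least one NA
all_na = data.dropna(how='all')    # drop only rows that are entirely NA
enough = data.dropna(thresh=2)     # keep rows with at least 2 non-NA values

data[4] = NA                                 # add a column that is entirely NA
cols = data.dropna(axis=1, how='all')        # drop columns that are entirely NA

print(any_na, all_na, enough, cols, sep='\n\n')
```

Row 1 has a single valid value, so it survives how='all' but not thresh=2.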
Filling in missing data: fillna
In [9]: df = pd.DataFrame(np.random.randn(7, 3))
In [10]: df
Out[10]:
          0         1         2
0 -1.405991 -1.032070 -0.421601
1  0.878711 -0.786235  1.483527
2 -0.082090 -0.163028 -0.718293
3 -0.576532  0.229013  0.387237
4 -0.682892  0.547743  0.297142
5 -1.367772 -0.169607 -2.359635
6 -0.591433 -0.318911  0.449039
In [12]: df.iloc[:4, 1] = NA
In [13]: df.iloc[:2, 2] = NA
In [14]: df
Out[14]:
          0         1         2
0 -1.405991       NaN       NaN
1  0.878711       NaN       NaN
2 -0.082090       NaN -0.718293
3 -0.576532       NaN  0.387237
4 -0.682892  0.547743  0.297142
5 -1.367772 -0.169607 -2.359635
6 -0.591433 -0.318911  0.449039
In [15]: df.fillna(0)   # fill every missing value with a constant
Out[15]:
          0         1         2
0 -1.405991  0.000000  0.000000
1  0.878711  0.000000  0.000000
2 -0.082090  0.000000 -0.718293
3 -0.576532  0.000000  0.387237
4 -0.682892  0.547743  0.297142
5 -1.367772 -0.169607 -2.359635
6 -0.591433 -0.318911  0.449039
In [20]: df.fillna({1: 0.5, 2: 1.})   # a dict gives a different fill value for each column
Out[20]:
          0         1         2
0 -1.405991  0.500000  1.000000
1  0.878711  0.500000  1.000000
2 -0.082090  0.500000 -0.718293
3 -0.576532  0.500000  0.387237
4 -0.682892  0.547743  0.297142
5 -1.367772 -0.169607 -2.359635
6 -0.591433 -0.318911  0.449039
fillna returns a new object by default; pass inplace=True to modify the object in place.
Filling with the method argument (pandas 2.1 and later deprecate fillna(method=...) in favor of the ffill() and bfill() methods):
In [27]: df.fillna(method='bfill')
Out[27]:
          0         1         2
0 -1.405991  0.547743 -0.718293
1  0.878711  0.547743 -0.718293
2 -0.082090  0.547743 -0.718293
3 -0.576532  0.547743  0.387237
4 -0.682892  0.547743  0.297142
5 -1.367772 -0.169607 -2.359635
6 -0.591433 -0.318911  0.449039
In [28]: df.fillna(method='bfill', limit=2)
Out[28]:
          0         1         2
0 -1.405991       NaN -0.718293
1  0.878711       NaN -0.718293
2 -0.082090  0.547743 -0.718293
3 -0.576532  0.547743  0.387237
4 -0.682892  0.547743  0.297142
5 -1.367772 -0.169607 -2.359635
6 -0.591433 -0.318911  0.449039
fillna arguments:
value | Scalar value or dict-like object to use to fill missing values |
method | Interpolation, by default 'ffill' if function called with no other arguments |
axis | Axis to fill on, default axis=0 |
inplace | Modify the calling object without producing a copy |
limit | For forward and backward filling, maximum number of consecutive periods to fill |
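The backward fill and the limit argument are easier to follow on small, fixed data. The sketch below uses the bfill method directly, with made-up values and names, and adds a fill by column mean as one more common pattern.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'x': [1.0, np.nan, np.nan, np.nan, 5.0],
                   'y': [np.nan, 2.0, np.nan, 4.0, np.nan]})

back = df.bfill()              # take the next valid value below each gap
limited = df.bfill(limit=2)    # fill at most 2 consecutive NA per gap
by_mean = df.fillna(df.mean()) # fill each column with its own mean

print(back, limited, by_mean, sep='\n\n')
```

With limit=2, the three-row gap in 'x' is only filled in the two rows nearest to the value 5.0, and the trailing NA in 'y' is never filled because no later value exists.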
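10. Hierarchical indexing
Hierarchical indexing lets an axis carry several index levels, so that higher-dimensional data can be handled in a lower-dimensional form.
Create a Series with a hierarchical index: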
In [31]: data = pd.Series(np.random.randn(10),
...: index=[['a', 'a', 'a', 'b', 'b', 'b', 'c', 'c', 'd', 'd'],
...: [1, 2, 3, 1, 2, 3, 1, 2, 2, 3]])
In [32]: data
Out[32]:
a 1 -0.413827
2 0.660228
3 0.209686
b 1 -0.361603
2 -0.982985
3 -0.267620
c 1 -1.130506
2 -2.023760
d 2 0.989250
3 1.074886
dtype: float64
In [33]: data.index
Out[33]:
MultiIndex(levels=[['a', 'b', 'c', 'd'], [1, 2, 3]],
labels=[[0, 0, 0, 1, 1, 1, 2, 2, 3, 3], [0, 1, 2, 0, 1, 2, 0, 1, 1, 2]])
Indexing:
In [34]: data['a']
Out[34]:
1   -0.413827
2    0.660228
3    0.209686
dtype: float64
In [35]: data[:, 2]   # select on the inner level
Out[35]:
a    0.660228
b   -0.982985
c   -2.023760
d    0.989250
dtype: float64
In [36]: data['b':'d']   # slice on the outer level
Out[36]:
b  1   -0.361603
   2   -0.982985
   3   -0.267620
c  1   -1.130506
   2   -2.023760
d  2    0.989250
   3    1.074886
dtype: float64
In [37]: data.loc[['b', 'c']]
Out[37]:
b  1   -0.361603
   2   -0.982985
   3   -0.267620
c  1   -1.130506
   2   -2.023760
dtype: float64
A hierarchically indexed Series can be turned into a DataFrame with the unstack method:
In [38]: data.unstack()
Out[38]:
          1         2         3
a -0.413827  0.660228  0.209686
b -0.361603 -0.982985 -0.267620
c -1.130506 -2.023760       NaN
d       NaN  0.989250  1.074886
In [39]: data.unstack().unstack()
Out[39]:
1  a   -0.413827
   b   -0.361603
   c   -1.130506
   d         NaN
2  a    0.660228
   b   -0.982985
   c   -2.023760
   d    0.989250
3  a    0.209686
   b   -0.267620
   c         NaN
   d    1.074886
dtype: float64
In [42]: data.unstack().stack()   # stack is the inverse of unstack
Out[42]:
a  1   -0.413827
   2    0.660228
   3    0.209686
b  1   -0.361603
   2   -0.982985
   3   -0.267620
c  1   -1.130506
   2   -2.023760
d  2    0.989250
   3    1.074886
dtype: float64
On a DataFrame, either axis can have a hierarchical index, and each index level can be given a name:
In [44]: frame = pd.DataFrame(np.arange(12).reshape((4, 3)),
...: index=[['a', 'a', 'b', 'b'], [1, 2, 1, 2]],
...: columns=[['Ohio', 'Ohio', 'Colorado'],
...: ['Green', 'Red', 'Green']]
...: )
In [45]: frame
Out[45]:
Ohio Colorado
Green Red Green
a 1 0 1 2
2 3 4 5
b 1 6 7 8
2 9 10 11
In [46]: frame.index.names = ['key1', 'key2']
In [47]: frame.columns.names = ['state', 'color']
In [48]: frame
Out[48]:
state Ohio Colorado
color Green Red Green
key1 key2
a 1 0 1 2
2 3 4 5
b 1 6 7 8
2 9 10 11
swaplevel(): rearranges the order of the levels on an axis; sort_index(): sorts the data by the values in a given level
In [49]: frame.swaplevel('key1', 'key2')
Out[49]:
state      Ohio     Colorado
color     Green Red    Green
key2 key1
1    a        0   1        2
2    a        3   4        5
1    b        6   7        8
2    b        9  10       11
In [51]: frame.swaplevel(0, 1)
Out[51]:
state      Ohio     Colorado
color     Green Red    Green
key2 key1
1    a        0   1        2
2    a        3   4        5
1    b        6   7        8
2    b        9  10       11
In [53]: frame.swaplevel(0, 1, axis=1)
Out[53]:
color     Green  Red    Green
state      Ohio Ohio Colorado
key1 key2
a    1        0    1        2
     2        3    4        5
b    1        6    7        8
     2        9   10       11
In [54]: frame.sortlevel(1)
W:\software\Python\Python35\Scripts\ipython:1: FutureWarning: sortlevel is deprecated, use sort_index(level= ...)
Out[54]:
state      Ohio     Colorado
color     Green Red    Green
key1 key2
a    1        0   1        2
b    1        6   7        8
a    2        3   4        5
b    2        9  10       11
In [55]: frame.sort_index(level=1)
Out[55]:
state      Ohio     Colorado
color     Green Red    Green
key1 key2
a    1        0   1        2
b    1        6   7        8
a    2        3   4        5
b    2        9  10       11
In [56]: frame.sort_index(level=1, axis=1)
Out[56]:
state     Colorado  Ohio
color        Green Green Red
key1 key2
a    1           2     0   1
     2           5     3   4
b    1           8     6   7
     2          11     9  10
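Levels can be referred to by name as well as by number, and swaplevel is often followed by sort_index so that the new outer level is in order. A sketch reusing the frame above; the names by_key2 and swapped are illustrative.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(12).reshape((4, 3)),
                     index=[['a', 'a', 'b', 'b'], [1, 2, 1, 2]],
                     columns=[['Ohio', 'Ohio', 'Colorado'],
                              ['Green', 'Red', 'Green']])
frame.index.names = ['key1', 'key2']
frame.columns.names = ['state', 'color']

# Sort rows by the level named 'key2' (same as level=1)
by_key2 = frame.sort_index(level='key2')

# Swap the row levels, then sort so the new outer level is ordered
swapped = frame.swaplevel('key1', 'key2').sort_index(level=0)

print(by_key2)
print(swapped)
```

swaplevel only relabels the axis and leaves the row order alone, which is why the sort is a separate step.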
Sometimes we want to use one or more columns of a DataFrame as its index:
In [59]: frame = pd.DataFrame({'a': range(7), 'b': range(7, 0, -1),
    ...:                       'c': ['one', 'one', 'one', 'two', 'two', 'two', 'two'],
    ...:                       'd': [0, 1, 2, 0, 1, 2, 3]})
In [60]: frame
Out[60]:
   a  b    c  d
0  0  7  one  0
1  1  6  one  1
2  2  5  one  2
3  3  4  two  0
4  4  3  two  1
5  5  2  two  2
6  6  1  two  3
In [61]: frame1 = frame.set_index(['c', 'd'])
In [62]: frame1
Out[62]:
       a  b
c   d
one 0  0  7
    1  1  6
    2  2  5
two 0  3  4
    1  4  3
    2  5  2
    3  6  1
By default the columns that become the index are removed from the data; pass drop=False to keep them:
In [63]: frame.set_index(['c', 'd'], drop=False)
Out[63]:
       a  b    c  d
c   d
one 0  0  7  one  0
    1  1  6  one  1
    2  2  5  one  2
two 0  3  4  two  0
    1  4  3  two  1
    2  5  2  two  2
    3  6  1  two  3
reset_index does the opposite of set_index:
In [64]: frame1.reset_index()
Out[64]:
     c  d  a  b
0  one  0  0  7
1  one  1  1  6
2  one  2  2  5
3  two  0  3  4
4  two  1  4  3
5  two  2  5  2
6  two  3  6  1
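The round trip between set_index and reset_index can be verified directly. A small sketch; indexed and restored are illustrative names.

```python
import pandas as pd

frame = pd.DataFrame({'a': range(7), 'b': range(7, 0, -1),
                      'c': ['one', 'one', 'one', 'two', 'two', 'two', 'two'],
                      'd': [0, 1, 2, 0, 1, 2, 3]})

indexed = frame.set_index(['c', 'd'])   # columns c and d become index levels
restored = indexed.reset_index()        # levels move back in as leading columns

# A tuple addresses one row of the two-level index
row = indexed.loc[('two', 1)]

# Apart from column order, the round trip returns the original data
same = restored[frame.columns].equals(frame)
print(row, same, sep='\n\n')
```

reset_index places the former index levels first, so the columns come back as c, d, a, b rather than in their original order.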