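Basic Functionality of pandas

阿新 • Published: 2019-02-15

Official pandas documentation:

1. Reindexing

Purpose: create a new object that conforms to a new index. The original data is rearranged to follow the new index, and any label that did not exist before introduces a missing value (or the value given by fill_value).

Arguments of reindex: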

index	New sequence to use as index. Can be Index instance or any other sequence-like Python data structure. An Index will be used exactly as is without any copying
method	Interpolation (fill) method, see table for options.
fill_value	Substitute value to use when introducing missing data by reindexing
limit	When forward- or backfilling, maximum size gap to fill
level	Match simple Index on level of MultiIndex, otherwise select subset of
copy	Do not copy underlying data if new index is equivalent to old index. True by default (i.e. always copy data).

In [49]: obj = pd.Series([4.5, 7.2, -5.3, 3.6], index=['d', 'b', 'a', 'c'])
In [50]: obj
Out[50]:
d    4.5
b    7.2
a   -5.3
c    3.6
dtype: float64
In [51]: obj.reindex(['a','b','c','d','e'])  # with a fill value: obj.reindex(['a','b','c','d','e'], fill_value=0)
Out[51]:
a   -5.3
b    7.2
c    3.6
d    4.5
e    NaN     # e is 0.0 when fill_value=0 is passed
dtype: float64

For an ordered index, the method option fills values forward or backward while reindexing:

In [56]: obj1 = pd.Series(['blue', 'purple', 'yellow'], index=[0, 2, 4])
In [57]: obj1.reindex(range(6),method='ffill')
Out[57]:
0      blue
1      blue
2    purple
3    purple
4    yellow
5    yellow
dtype: object

The (interpolation) method options of reindex:

ffill or pad	Fill (or carry) values forward
bfill or backfill	Fill (or carry) values backward
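
As a minimal sketch of how the two fill methods differ (the second Series and the variable names are illustrative assumptions, not part of the original notes), forward and backward filling can be compared side by side, with limit capping how many consecutive new labels get filled:

```python
import pandas as pd

colors = pd.Series(['blue', 'purple', 'yellow'], index=[0, 2, 4])

# ffill: each new label takes the value of the closest earlier label.
forward = colors.reindex(range(6), method='ffill')
# bfill: each new label takes the value of the closest later label;
# label 5 has nothing after it, so it stays NaN.
backward = colors.reindex(range(6), method='bfill')

# limit=1 fills at most one consecutive new label after each known value.
sparse = pd.Series([1.0, 2.0], index=[0, 4])
limited = sparse.reindex(range(5), method='ffill', limit=1)
```

Both methods need the original index to be sorted, which is why the example uses increasing integer labels.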

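For a DataFrame, reindex can change the index (rows), the columns, or both at once; by default it reindexes the rows.

Interpolation in a DataFrame can only be applied along the rows (axis 0).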

In [64]: frame = pd.DataFrame(np.arange(9).reshape((3, 3)), index=['a', 'c', 'd'],
    ...:                      columns=['Ohio', 'Texas', 'California'])
In [66]: frame.reindex(['a','b','c','d'])
In [67]: states = ['Texas', 'Utah', 'California']
In [68]: frame.reindex(columns=states)
In [71]: frame.reindex(index=['a','b','c','d'],columns=states)
In [82]: frame.reindex(index=['a','b','c','d'],method='ffill').reindex(columns=states)
                                       # The book (Python for Data Analysis) writes this as:
                                       # frame.reindex(index=['a', 'b', 'c', 'd'], method='ffill', columns=states)
                                       # Depending on the pandas version, that single call may raise an error
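
The two-step form above can be sketched as a standalone script. One likely reason the single call fails on newer versions is that the fill method is then applied to the unsorted column labels as well; treat that explanation as an assumption rather than something the original post confirms. Splitting the call keeps the fill on the rows only:

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(9).reshape((3, 3)),
                     index=['a', 'c', 'd'],
                     columns=['Ohio', 'Texas', 'California'])
states = ['Texas', 'Utah', 'California']

# Step 1: forward-fill along the sorted row index; new row 'b' copies row 'a'.
filled = frame.reindex(index=['a', 'b', 'c', 'd'], method='ffill')
# Step 2: reindex the columns without a fill method, so 'Utah' is all NaN.
result = filled.reindex(columns=states)
print(result)
```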

The book uses the label-indexing feature of ix for this. ix is deprecated, as the warning below shows, and was removed in pandas 1.0:

In [87]: frame.ix[['a','b','c','d'],states]
W:\software\Python\Python35\Scripts\ipython:1: DeprecationWarning:
.ix is deprecated. Please use
.loc for label based indexing or
.iloc for positional indexing
See the documentation here:
http://pandas.pydata.org/pandas-docs/stable/indexing.html#ix-indexer-is-deprecated
W:\software\Python\Python35\Scripts\ipython:1: FutureWarning: Passing list-likes to
.loc or [] with any missing label will raise KeyError in the future, you can use .reindex() as an alternative.
See the documentation here:
https://pandas.pydata.org/pandas-docs/stable/indexing.html#deprecate-loc-reindex-listlike
Out[87]:
   Texas  Utah  California
a    1.0   NaN         2.0
b    NaN   NaN         NaN
c    4.0   NaN         5.0
d    7.0   NaN         8.0
In [88]: frame.loc[['a','b','c','d'],states]
W:\software\Python\Python35\Scripts\ipython:1: FutureWarning:
Passing list-likes to .loc or [] with any missing label will raise KeyError in the future,
you can use .reindex() as an alternative.
See the documentation here:
https://pandas.pydata.org/pandas-docs/stable/indexing.html#deprecate-loc-reindex-listlike
Out[88]:
   Texas  Utah  California
a    1.0   NaN         2.0
b    NaN   NaN         NaN
c    4.0   NaN         5.0
d    7.0   NaN         8.0

2. Dropping entries from an axis

The drop method returns a new object with the specified entries removed from an axis.

In [96]: obj = pd.Series(np.arange(5.), index=['a', 'b', 'c', 'd', 'e'])
In [97]: new_obj = obj.drop('c')
In [98]: obj.drop(['d', 'c'])
Out[98]:
a    0.0
b    1.0
e    4.0
dtype: float64

For a DataFrame, labels can be dropped from either axis:

In [99]: data = pd.DataFrame(np.arange(16).reshape((4, 4)),
    ...:                     index=['Ohio', 'Colorado', 'Utah', 'New York'],
    ...:                     columns=['one', 'two', 'three', 'four'])
In [100]: data
Out[100]:
          one  two  three  four
Ohio        0    1      2     3
Colorado    4    5      6     7
Utah        8    9     10    11
New York   12   13     14    15
In [101]: data.drop(['Colorado', 'Ohio'])
Out[101]:
          one  two  three  four
Utah        8    9     10    11
New York   12   13     14    15

In [102]: data.drop('two', axis=1)
Out[102]:
          one  three  four
Ohio        0      2     3
Colorado    4      6     7
Utah        8     10    11
New York   12     14    15
In [103]: data.drop(['two', 'four'], axis=1)
Out[103]:
          one  three
Ohio        0      2
Colorado    4      6
Utah        8     10
New York   12     14
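
As a small sketch (variable names are illustrative), the same drops can also be written with the index= and columns= keywords instead of axis, and the original frame is left untouched because drop returns a new object:

```python
import numpy as np
import pandas as pd

data = pd.DataFrame(np.arange(16).reshape((4, 4)),
                    index=['Ohio', 'Colorado', 'Utah', 'New York'],
                    columns=['one', 'two', 'three', 'four'])

by_axis = data.drop(['two', 'four'], axis=1)        # axis-style call
by_keyword = data.drop(columns=['two', 'four'])     # keyword-style call, same result
rows_gone = data.drop(index=['Colorado', 'Ohio'])   # drop row labels

# data itself still has all four rows and columns.
print(data.shape, by_keyword.shape, rows_gone.shape)
```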

3. Indexing, selection, and filtering

Series indexing works much like NumPy array indexing:

In [102]: obj = pd.Series(np.arange(4.), index=['a', 'b', 'c', 'd'])
In [103]: obj['b']
In [104]: obj[1]
In [105]: obj[2:4]
In [106]: obj[['b', 'a', 'd']]
In [107]: obj[[1, 3]]
In [108]: obj[obj < 2]

Slicing with labels includes the endpoint, and it works for assignment as well:

In [110]: obj['b':'c'] = 5
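
A minimal sketch of the difference between label slices and position slices, using the explicit loc and iloc accessors (an assumption made here for clarity; the original uses plain brackets):

```python
import numpy as np
import pandas as pd

obj = pd.Series(np.arange(4.), index=['a', 'b', 'c', 'd'])

# Label slice: the endpoint 'c' is included.
by_label = obj.loc['b':'c'].tolist()
# Position slice: the endpoint 2 is excluded, as in ordinary Python slicing.
by_position = obj.iloc[1:2].tolist()

# Assignment through a label slice also covers both endpoints.
obj.loc['b':'c'] = 5
print(by_label, by_position, obj.tolist())
```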

Indexing a DataFrame selects one or more columns:

In [112]: data = pd.DataFrame(np.arange(16).reshape((4, 4)),
     ...:                     index=['Ohio', 'Colorado', 'Utah', 'New York'],
     ...:                     columns=['one', 'two', 'three', 'four'])
In [114]: data['two']
In [115]: data[['three', 'one']]

Selecting rows with a slice or a boolean array:

In [116]: data[:2]    
In [117]: data[data['three'] > 5]

Indexing with a boolean DataFrame:

In [118]: data < 5
In [119]: data[data < 5] = 0
In [120]: data

Indexing rows and columns with ix (deprecated; use other accessors such as loc and iloc instead):

In [121]: data.ix['Colorado', ['two', 'three']]
In [122]: data.ix[['Colorado', 'Utah'], [3, 0, 1]]
In [123]: data.ix[2]
In [124]: data.ix[:'Utah', 'two']
In [125]: data.ix[data.three > 5, :3]

DataFrame indexing options:

obj.ix[val]	Selects single row or subset of rows from the DataFrame.
obj.ix[:, val]	Selects single column or subset of columns.
obj.ix[val1, val2]	Select both rows and columns.
reindex	Conform one or more axes to new indexes.
xs	Select single row or column as a Series by label.
icol, irow	Select single column or row, respectively, as a Series by integer location.
get_value, set_value	Select single value by row and column label.
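
Since ix is no longer available in current pandas, the ix calls above can be rewritten with loc (labels) and iloc (positions). This is a hedged sketch of one possible translation on a fresh copy of the frame, not the only way to write it:

```python
import numpy as np
import pandas as pd

data = pd.DataFrame(np.arange(16).reshape((4, 4)),
                    index=['Ohio', 'Colorado', 'Utah', 'New York'],
                    columns=['one', 'two', 'three', 'four'])

# data.ix['Colorado', ['two', 'three']]: labels on both axes
a = data.loc['Colorado', ['two', 'three']]
# data.ix[['Colorado', 'Utah'], [3, 0, 1]]: row labels, then column positions
b = data.loc[['Colorado', 'Utah']].iloc[:, [3, 0, 1]]
# data.ix[2]: a single row by position
c = data.iloc[2]
# data.ix[:'Utah', 'two']: label slice, endpoint included
d = data.loc[:'Utah', 'two']
# data.ix[data.three > 5, :3]: boolean mask for rows, then column positions
e = data.loc[data['three'] > 5].iloc[:, :3]
```

Chaining loc and iloc keeps each step unambiguous about whether it works with labels or positions, which is the ambiguity that ix left open.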

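4. Arithmetic

When objects are added, the index of the result is the union of the operands' indexes. Labels that do not overlap are filled with NA (a fill value can be specified):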

In [126]: s1 = pd.Series([7.3, -2.5, 3.4, 1.5], index=['a', 'c', 'd', 'e'])
In [127]: s2 = pd.Series([-2.1, 3.6, -1.5, 4, 3.1], index=['a', 'c', 'e', 'f', 'g'])
In [128]: s1 + s2
Out[128]:
a    5.2
c    1.1
d    NaN
e    0.0
f    NaN
g    NaN
dtype: float64

In [129]: df1 = pd.DataFrame(np.arange(9.).reshape((3, 3)), columns=list('bcd'),
     ...: index=['Ohio', 'Texas', 'Colorado'])
In [130]: df2 = pd.DataFrame(np.arange(12.).reshape((4, 3)), columns=list('bde'),
     ...: index=['Utah', 'Ohio', 'Texas', 'Oregon'])
In [131]: df1 + df2
Out[131]:
            b   c     d   e
Colorado  NaN NaN   NaN NaN
Ohio      3.0 NaN   6.0 NaN
Oregon    NaN NaN   NaN NaN
Texas     9.0 NaN  12.0 NaN
Utah      NaN NaN   NaN NaN

In [132]: df1.add(df2,fill_value=0)
Out[132]:
            b    c     d     e
Colorado  6.0  7.0   8.0   NaN
Ohio      3.0  1.0   6.0   5.0
Oregon    9.0  NaN  10.0  11.0
Texas     9.0  4.0  12.0   8.0
Utah      0.0  NaN   1.0   2.0

Method Description

add	Method for addition (+)
sub	Method for subtraction (-)
div	Method for division (/)
mul	Method for multiplication (*)
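
A short sketch of the method forms, reusing s1 and s2 from above. With fill_value=0 a label present in only one operand is treated as 0 in the other; a label missing from both operands would still give NaN, as Colorado/e does in Out[132]. The rsub call is an extra illustration, assumed here to show the reversed-operand variants:

```python
import pandas as pd

s1 = pd.Series([7.3, -2.5, 3.4, 1.5], index=['a', 'c', 'd', 'e'])
s2 = pd.Series([-2.1, 3.6, -1.5, 4, 3.1], index=['a', 'c', 'e', 'f', 'g'])

plain = s1 + s2                    # NaN at 'd', 'f', 'g'
filled = s1.add(s2, fill_value=0)  # missing side treated as 0

# Reversed variant: s1.rsub(10) computes 10 - s1.
reversed_sub = s1.rsub(10)
print(filled)
```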

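Operations between DataFrame and Series

Operations between the two rely on broadcasting, which a later post will cover in more detail. Broadcasting in one and two dimensions is fairly easy to follow.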

In [153]: frame = pd.DataFrame(np.arange(12.).reshape((4, 3)), columns=list('bde'),
     ...: index=['Utah', 'Ohio', 'Texas', 'Oregon'])
In [154]: series = frame.iloc[0]
In [155]: frame - series        # the first row is subtracted from every row
Out[155]:
          b    d    e
Utah    0.0  0.0  0.0
Ohio    3.0  3.0  3.0
Texas   6.0  6.0  6.0
Oregon  9.0  9.0  9.0

To match on the rows and broadcast across the columns instead, you must use one of the arithmetic methods:

In [160]: frame = pd.DataFrame(np.arange(12.).reshape((4, 3)), columns=list('bde'),
     ...: index=['Utah', 'Ohio', 'Texas', 'Oregon'])
In [161]: series1 = frame['d']
In [162]: frame.sub(series1,axis=0)
Out[162]:
          b    d    e
Utah   -1.0  0.0  1.0
Ohio   -1.0  0.0  1.0
Texas  -1.0  0.0  1.0
Oregon -1.0  0.0  1.0
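
To see why the method form is needed, here is a small sketch contrasting the two alignments. The naive variable is an illustrative name for what happens when a column is subtracted with the plain operator:

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(12.).reshape((4, 3)), columns=list('bde'),
                     index=['Utah', 'Ohio', 'Texas', 'Oregon'])
row = frame.iloc[0]
col = frame['d']

# Operator form: the Series index is matched against the frame's columns.
by_row = frame - row
# Method form with axis=0: the Series index is matched against the frame's rows.
by_col = frame.sub(col, axis=0)
# Operator form with a column: row labels never match column labels,
# so the result is the union of both label sets, filled with NaN.
naive = frame - col
print(naive.shape)
```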

5. Function application and mapping

NumPy ufuncs (element-wise array methods) also work on pandas objects:

In [164]: frame = pd.DataFrame(np.random.randn(4,3), columns=list('bde'),
     ...: index=['Utah', 'Ohio', 'Texas', 'Oregon'])
In [167]: frame
Out[167]:
               b         d         e
Utah   -0.637896  0.509292 -0.919939
Ohio   -0.604495  0.298296 -0.377575
Texas  -0.710751 -0.091902  0.607375
Oregon  0.576612  1.664728  0.264065
In [168]: np.abs(frame)
Out[168]:
               b         d         e
Utah    0.637896  0.509292  0.919939
Ohio    0.604495  0.298296  0.377575
Texas   0.710751  0.091902  0.607375
Oregon  0.576612  1.664728  0.264065

The apply method applies a function to the one-dimensional arrays formed by each column or each row:

In [172]: f = lambda x: x.max() - x.min()
In [173]: frame.apply(f)
Out[173]:
b    1.287363
d    1.756631
e    1.527314
dtype: float64

The function can also return a Series made up of several values:

In [176]: def f(x):
     ...:     return pd.Series([x.min(), x.max()], index=['min', 'max'])
In [177]: frame.apply(f)
Out[177]:
            b         d         e
min -0.710751 -0.091902 -0.919939
max  0.576612  1.664728  0.607375

Element-wise Python functions can be used too, through the applymap method:

In [179]: format = lambda x: '%.2f' % x
In [180]: frame.applymap(format)
Out[180]:
            b      d      e
Utah    -0.64   0.51  -0.92
Ohio    -0.60   0.30  -0.38
Texas   -0.71  -0.09   0.61
Oregon   0.58   1.66   0.26
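
A compact sketch of the three levels of application, on a small fixed frame (assumed here so the results are reproducible, unlike the random data above). Element-wise formatting is shown with Series.map on one column, because applymap has been deprecated in recent pandas releases in favour of DataFrame.map:

```python
import pandas as pd

frame = pd.DataFrame([[1.0, -2.0], [3.5, 0.5]],
                     index=['x', 'y'], columns=['b', 'd'])

def spread(values):
    """Difference between the largest and smallest value."""
    return values.max() - values.min()

per_column = frame.apply(spread)          # default axis=0: one result per column
per_row = frame.apply(spread, axis=1)     # axis=1: one result per row
formatted = frame['b'].map(lambda v: '%.2f' % v)  # element-wise on a Series
print(per_column, per_row, formatted, sep='\n')
```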

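6. Sorting and ranking

Use the sort_index method to sort by the row or column index.

To sort a Series by its values, use sort_values; any missing values are placed at the end.

In [183]: obj = pd.Series(range(4), index=['d', 'a', 'b', 'c'])
In [187]: obj.sort_index()    # obj.sort_index(ascending=False) for descending order
In [189]: obj.sort_values()   # obj.sort_values(ascending=False) for descending order

A DataFrame can be sorted by the index on either axis. The default order is ascending, and descending is also available: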

In [196]: frame = pd.DataFrame(np.arange(8).reshape((2, 4)), index=['three', 'one'],
     ...:  columns=['d', 'a', 'b', 'c'])
In [197]: frame.sort_index()
Out[197]:
       d  a  b  c
one    4  5  6  7
three  0  1  2  3
In [198]: frame.sort_index(axis=1)
Out[198]:
       a  b  c  d
three  1  2  3  0
one    5  6  7  4
In [199]: frame.sort_index(axis=1, ascending=False)
Out[199]:
       d  c  b  a
three  0  3  2  1
one    4  7  6  5

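On a DataFrame, the by keyword sorts by the values of one or more columns:

In [202]: frame = pd.DataFrame({'b': [4, 7, -3, 2], 'a': [0, 1, 0, 1]})
In [203]: frame.sort_values(by='b')    # the book's frame.sort_index(by='b') gives: FutureWarning: by argument to sort_index
                                       # is deprecated, please use .sort_values(by=...)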
Out[203]:
   a  b
2  0 -3
3  1  2
0  0  4
1  1  7
In [204]: frame.sort_values(by=['a','b'])
Out[204]:
   a  b
2  0 -3
0  0  4
3  1  2
1  1  7

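Note: when sorting a DataFrame by value, by must name one or more columns (or rows);
     calling sort_values() without it raises TypeError: sort_values() missing 1 required positional argument: 'by'.
     When by is a list, the first entry is the primary sort key and each later entry only orders the rows that tie on the earlier ones,
     so a later key may appear to have no effect. To sort by the values in a row, pass axis=1 so that by is looked up in the row index;
     with the default axis, by is looked up among the column labels, the row label is not found there, and an error is raised.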
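
The points in the note can be sketched on the same small frame (the variable names and the mixed ascending list are illustrative additions):

```python
import pandas as pd

frame = pd.DataFrame({'b': [4, 7, -3, 2], 'a': [0, 1, 0, 1]})

# 'a' is the primary key; 'b' only orders rows that share the same 'a'.
by_two = frame.sort_values(by=['a', 'b'])
# A separate direction per key: 'a' ascending, 'b' descending.
mixed = frame.sort_values(by=['a', 'b'], ascending=[True, False])
# axis=1 sorts the columns by the values found in the row labelled 0.
by_row = frame.sort_values(by=0, axis=1)

# Omitting by is an error.
try:
    frame.sort_values()
    missing_by_error = False
except TypeError:
    missing_by_error = True
```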

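Ranking:

Ranking is closely related to sorting: it assigns each value a rank, from 1 up to the number of valid data points in the array. It resembles the indirect sort indices produced by numpy.argsort, except that ties are broken according to a rule.

In [214]: obj = pd.Series([7, -5, 7, 4, 2, 0, 4])    # positions (0, 1, 2, 3, 4, 5, 6)
In [215]: obj.rank()        # by default, tied values receive the average of their ranks
Out[215]:
0    6.5                    # 7 is the largest; the two 7s occupy ranks 6 and 7, so each gets 6.5
1    1.0                    # -5 is the smallest (position 1), so it ranks first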
2    6.5
3    4.5
4    3.0
5    2.0
6    4.5
dtype: float64

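Ranking according to the order in which the values appear in the data:

In [216]: obj.rank(method='first')  # descending is also possible: obj.rank(ascending=False, method='max') gives each tied group its maximum rank
Out[216]: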
0    6.0
1    1.0
2    7.0
3    4.0
4    3.0
5    2.0
6    5.0
dtype: float64

Ranking can also be done along a given axis:

In [219]: frame = pd.DataFrame({'b': [4.3, 7, -3, 2], 'a': [0, 1, 0, 1],
     ...:                       'c': [-2, 5, 8, -2.5]})
In [220]: frame
Out[220]:
   a    b    c
0  0  4.3 -2.0
1  1  7.0  5.0
2  0 -3.0  8.0
3  1  2.0 -2.5
In [221]: frame.rank(axis=1)
Out[221]:
     a    b    c
0  2.0  3.0  1.0
1  1.0  3.0  2.0
2  2.0  1.0  3.0
3  2.0  3.0  1.0

The method options for breaking ties when ranking:

average	Default: assign the average rank to each entry in the equal group
min	Use the minimum rank for the whole group
max	Use the maximum rank for the whole group
first	Assign ranks in the order the values appear in the data
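
The four tie-breaking options can be compared on the Series used above. This sketch simply collects one column per method; the ranks table is an illustrative construct, not part of the original notes:

```python
import pandas as pd

obj = pd.Series([7, -5, 7, 4, 2, 0, 4])

# One column of ranks per tie-breaking method.
ranks = pd.DataFrame({m: obj.rank(method=m)
                      for m in ['average', 'min', 'max', 'first']})
# Descending order, with each tied group taking its maximum rank.
desc_max = obj.rank(ascending=False, method='max')
print(ranks)
```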

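7. Summarizing and computing descriptive statistics

Statistics can be computed along the rows or along the columns. NA values are skipped by default; pass skipna=False to include them.

In [224]: df = pd.DataFrame([[1.4, np.nan], [7.1, -4.5],
     ...: [np.nan, np.nan], [0.75, -1.3]],
     ...: index=['a', 'b', 'c', 'd'],
     ...: columns=['one', 'two'])
In [225]: df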
Out[225]:
    one  two
a  1.40  NaN
b  7.10 -4.5
c   NaN  NaN
d  0.75 -1.3
In [227]: df.sum()
Out[227]:
one    9.25
two   -5.80
dtype: float64
In [228]: df.sum(axis=1)
Out[228]:
a    1.40
b    2.60
c    0.00
d   -0.55
dtype: float64
In [229]: df.sum(axis=1,skipna=False)
Out[229]:
a     NaN
b    2.60
c     NaN
d   -0.55
dtype: float64

Options for reduction methods
Method Description

axis	Axis to reduce over. 0 for DataFrame’s rows and 1 for columns.
skipna	Exclude missing values, True by default.
level	Reduce grouped by level if the axis is hierarchically-indexed (MultiIndex).

Descriptive and summary statistics
Method Description

count	Number of non-NA values
describe	Compute set of summary statistics for Series or each DataFrame column
min, max	Compute minimum and maximum values
argmin, argmax	Compute index locations (integers) at which minimum or maximum value obtained, respectively
idxmin, idxmax	Compute index values at which minimum or maximum value obtained, respectively
quantile	Compute sample quantile ranging from 0 to 1
sum	Sum of values
mean	Mean of values
median	Arithmetic median (50% quantile) of values
mad	Mean absolute deviation from mean value
var	Sample variance of values
std	Sample standard deviation of values
skew	Sample skewness (3rd moment) of values
kurt	Sample kurtosis (4th moment) of values
cumsum	Cumulative sum of values
cummin, cummax	Cumulative minimum or maximum of values, respectively
cumprod	Cumulative product of values
diff	Compute 1st arithmetic difference (useful for time series)
pct_change	Compute percent changes
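
A few of the methods in the table, sketched on the df from this section (the variable names are illustrative). Note how count and idxmax ignore NaN, while skipna=False lets NaN propagate:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[1.4, np.nan], [7.1, -4.5], [np.nan, np.nan], [0.75, -1.3]],
                  index=['a', 'b', 'c', 'd'], columns=['one', 'two'])

non_na = df.count()                        # non-NA values per column
where_max = df.idxmax()                    # row label of each column's maximum
running = df.cumsum()                      # NaN cells stay NaN; the total carries on
row_means = df.mean(axis=1, skipna=False)  # any NaN in a row makes its mean NaN
summary = df.describe()                    # several statistics at once
print(summary)
```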

8. Unique values, value counts, and membership

In [231]: obj = pd.Series(['c', 'a', 'd', 'a', 'a', 'b', 'b', 'c', 'c'])
In [232]: uniques = obj.unique()      # unique values
In [233]: uniques
Out[233]: array(['c', 'a', 'd', 'b'], dtype=object)
In [234]: obj.value_counts()        # value counts
Out[234]:
c    3
a    3
b    2
d    1
dtype: int64
In [235]: mask = obj.isin(['b', 'c'])         # membership
In [236]: mask
Out[236]:
0     True
1    False
2    False
3    False
4    False
5     True
6     True
7     True
8     True
dtype: bool
In [237]: obj[mask]
Out[237]:
0    c
5    b
6    b
7    c
8    c
dtype: object

Method Description

isin	Compute boolean array indicating whether each Series value is contained in the passed sequence of values.
unique	Compute array of unique values in a Series, returned in the order observed.
value_counts	Return a Series containing unique values as its index and frequencies as its values, ordered count in descending order.

Passing pandas.value_counts to a DataFrame's apply method:

In [239]: data = pd.DataFrame({'Qu1': [1, 3, 4, 3, 4],
     ...: 'Qu2': [2, 3, 1, 2, 3],
     ...: 'Qu3': [1, 5, 2, 4, 4]})
In [240]: data
Out[240]:
   Qu1  Qu2  Qu3
0    1    2    1
1    3    3    5
2    4    1    2
3    3    2    4
4    4    3    4
In [241]: result = data.apply(pd.value_counts).fillna(0)
In [242]: result
Out[242]:
   Qu1  Qu2  Qu3
1  1.0  1.0  1.0
2  0.0  2.0  1.0
3  2.0  2.0  0.0
4  2.0  0.0  2.0
5  0.0  0.0  1.0

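9. Handling missing data

pandas uses the floating-point value NaN (Not a Number) to represent missing data in both floating-point and non-floating-point arrays; it is simply a sentinel that is easy to detect.

NA handling methods: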

dropna	Filter axis labels based on whether values for each label have missing data, with varying thresholds for how much missing data to tolerate.
fillna	Fill in missing data with some value or using an interpolation method such as 'ffill' or 'bfill'.
isnull	Return like-type object containing boolean values indicating which values are missing / NA.
notnull	Negation of isnull.

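Filtering out missing data: dropna

For a Series, dropna() returns a Series containing only the non-null values and their index labels:
In [6]: from numpy import nan as NA
In [7]: data = pd.Series([1,NA,4,NA,5])
In [8]: data.dropna()           # boolean indexing gives the same result: data[data.notnull()]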
Out[8]: 
0    1.0
2    4.0
4    5.0
dtype: float64

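For a DataFrame, dropna by default drops every row that contains a missing value; passing how='all' drops only the rows that are entirely NA. To drop columns instead of rows, pass axis=1. The thresh argument keeps only the rows that have at least that many non-NA values.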
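
The options described above can be sketched on a small frame with known gaps (the data and variable names are assumptions made for illustration):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[1.0, 6.5, 3.0],
                   [1.0, np.nan, np.nan],
                   [np.nan, np.nan, np.nan],
                   [np.nan, 6.5, 3.0]])

any_na = df.dropna()                # drops rows 1, 2 and 3 (each has a NaN)
all_na = df.dropna(how='all')       # drops only row 2 (entirely NaN)
at_least_two = df.dropna(thresh=2)  # keeps rows with two or more non-NA values

wide = df.copy()
wide[3] = np.nan                    # add a column that is entirely NaN
cols = wide.dropna(axis=1, how='all')  # drops that column, keeps the rest
```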

Filling in missing data: fillna

In [9]: df = pd.DataFrame(np.random.randn(7,3))
In [10]: df
Out[10]: 
          0         1         2
0 -1.405991 -1.032070 -0.421601
1  0.878711 -0.786235  1.483527
2 -0.082090 -0.163028 -0.718293
3 -0.576532  0.229013  0.387237
4 -0.682892  0.547743  0.297142
5 -1.367772 -0.169607 -2.359635
6 -0.591433 -0.318911  0.449039
In [12]: df.iloc[:4,1] = NA
In [13]: df.iloc[:2,2] = NA
In [14]: df
Out[14]: 
          0         1         2
0 -1.405991       NaN       NaN
1  0.878711       NaN       NaN
2 -0.082090       NaN -0.718293
3 -0.576532       NaN  0.387237
4 -0.682892  0.547743  0.297142
5 -1.367772 -0.169607 -2.359635
6 -0.591433 -0.318911  0.449039

In [15]: df.fillna(0)                 # fill every missing value with a single number
Out[15]: 
          0         1         2
0 -1.405991  0.000000  0.000000
1  0.878711  0.000000  0.000000
2 -0.082090  0.000000 -0.718293
3 -0.576532  0.000000  0.387237
4 -0.682892  0.547743  0.297142
5 -1.367772 -0.169607 -2.359635
6 -0.591433 -0.318911  0.449039

In [20]: df.fillna({1:0.5,2:1.})    # a dict gives each column its own fill value
Out[20]: 
          0         1         2
0 -1.405991  0.500000  1.000000
1  0.878711  0.500000  1.000000
2 -0.082090  0.500000 -0.718293
3 -0.576532  0.500000  0.387237
4 -0.682892  0.547743  0.297142
5 -1.367772 -0.169607 -2.359635
6 -0.591433 -0.318911  0.449039

fillna returns a new object by default; pass inplace=True to modify the existing object in place.

Filling with the method argument:

In [27]: df.fillna(method='bfill')
Out[27]: 
          0         1         2
0 -1.405991  0.547743 -0.718293
1  0.878711  0.547743 -0.718293
2 -0.082090  0.547743 -0.718293
3 -0.576532  0.547743  0.387237
4 -0.682892  0.547743  0.297142
5 -1.367772 -0.169607 -2.359635
6 -0.591433 -0.318911  0.449039

In [28]: df.fillna(method='bfill',limit=2)
Out[28]: 
          0         1         2
0 -1.405991       NaN -0.718293
1  0.878711       NaN -0.718293
2 -0.082090  0.547743 -0.718293
3 -0.576532  0.547743  0.387237
4 -0.682892  0.547743  0.297142
5 -1.367772 -0.169607 -2.359635
6 -0.591433 -0.318911  0.449039

fillna arguments:

value	Scalar value or dict-like object to use to fill missing values
method	Interpolation, by default 'ffill' if function called with no other arguments
axis	Axis to fill on, default axis=0
inplace	Modify the calling object without producing a copy
limit	For forward and backward filling, maximum number of consecutive periods to fill
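
Recent pandas releases deprecate fillna(method=...) in favour of the dedicated ffill() and bfill() methods. The sketch below uses those instead, on a small Series chosen for illustration, and also shows filling with a computed value such as the mean:

```python
import numpy as np
import pandas as pd

s = pd.Series([np.nan, np.nan, 1.0, np.nan, 3.0])

back = s.bfill()                 # each gap takes the next valid value
back_limited = s.bfill(limit=1)  # at most one consecutive gap is filled
forward = s.ffill()              # leading gaps have nothing before them
with_mean = s.fillna(s.mean())   # mean of the valid values (1.0 and 3.0) is 2.0
print(back.tolist(), forward.tolist())
```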

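10. Hierarchical indexing

Hierarchical indexing lets an axis carry multiple index levels, so higher-dimensional data can be handled in a lower-dimensional form.

Creating a Series with a hierarchical index:

In [31]: data = pd.Series(np.random.randn(10),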
    ...: index=[['a', 'a', 'a', 'b', 'b', 'b', 'c', 'c', 'd', 'd'],
    ...: [1, 2, 3, 1, 2, 3, 1, 2, 2, 3]])

In [32]: data
Out[32]: 
a  1   -0.413827
   2    0.660228
   3    0.209686
b  1   -0.361603
   2   -0.982985
   3   -0.267620
c  1   -1.130506
   2   -2.023760
d  2    0.989250
   3    1.074886
dtype: float64

In [33]: data.index
Out[33]: 
MultiIndex(levels=[['a', 'b', 'c', 'd'], [1, 2, 3]],
           labels=[[0, 0, 0, 1, 1, 1, 2, 2, 3, 3], [0, 1, 2, 0, 1, 2, 0, 1, 1, 2]])

Indexing:

In [34]: data['a']
Out[34]: 
1   -0.413827
2    0.660228
3    0.209686
dtype: float64
In [35]: data[:,2]              # select on the inner level
Out[35]: 
a    0.660228
b   -0.982985
c   -2.023760
d    0.989250
dtype: float64
In [36]: data['b':'d']         # slice on the outer level
Out[36]: 
b  1   -0.361603
   2   -0.982985
   3   -0.267620
c  1   -1.130506
   2   -2.023760
d  2    0.989250
   3    1.074886
dtype: float64
In [37]: data.loc[['b','c']]
Out[37]: 
b  1   -0.361603
   2   -0.982985
   3   -0.267620
c  1   -1.130506
   2   -2.023760
dtype: float64

A hierarchically indexed Series can be turned into a DataFrame with the unstack method:

In [38]: data.unstack()
Out[38]: 
          1         2         3
a -0.413827  0.660228  0.209686
b -0.361603 -0.982985 -0.267620
c -1.130506 -2.023760       NaN
d       NaN  0.989250  1.074886

In [39]: data.unstack().unstack()
Out[39]: 
1  a   -0.413827
   b   -0.361603
   c   -1.130506
   d         NaN
2  a    0.660228
   b   -0.982985
   c   -2.023760
   d    0.989250
3  a    0.209686
   b   -0.267620
   c         NaN
   d    1.074886
dtype: float64

In [42]: data.unstack().stack()        # stack is the inverse of unstack
Out[42]: 
a  1   -0.413827
   2    0.660228
   3    0.209686
b  1   -0.361603
   2   -0.982985
   3   -0.267620
c  1   -1.130506
   2   -2.023760
d  2    0.989250
   3    1.074886
dtype: float64
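
A minimal sketch of the round trip on a small, fully populated Series (an assumption made so that no NaN cells appear and the result does not depend on how a given pandas version treats them in stack):

```python
import pandas as pd

data = pd.Series([1.0, 2.0, 3.0, 4.0],
                 index=[['a', 'a', 'b', 'b'], [1, 2, 1, 2]])

wide = data.unstack()               # inner level becomes the columns
wide_outer = data.unstack(level=0)  # outer level becomes the columns instead
roundtrip = wide.stack()            # back to a two-level Series
print(wide)
```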

In a DataFrame, either axis can have a hierarchical index, and each level can be given a name:

In [44]: frame = pd.DataFrame(np.arange(12).reshape((4, 3)),
    ...:  index=[['a', 'a', 'b', 'b'], [1, 2, 1, 2]],
    ...:  columns=[['Ohio', 'Ohio', 'Colorado'],
    ...:  ['Green', 'Red', 'Green']]
    ...: )
  
In [45]: frame
Out[45]: 
     Ohio     Colorado
    Green Red    Green
a 1     0   1        2
  2     3   4        5
b 1     6   7        8
  2     9  10       11
In [46]: frame.index.names = ['key1', 'key2']
In [47]: frame.columns.names = ['state', 'color']
In [48]: frame
Out[48]: 
state      Ohio     Colorado
color     Green Red    Green
key1 key2
a    1        0   1        2
     2        3   4        5
b    1        6   7        8
     2        9  10       11

swaplevel() changes the order of the levels on an axis; sort_index() sorts the data by the values in a given level:

In [49]: frame.swaplevel('key1','key2')
Out[49]: 
state      Ohio     Colorado
color     Green Red    Green
key2 key1
1    a        0   1        2
2    a        3   4        5
1    b        6   7        8
2    b        9  10       11

In [51]: frame.swaplevel(0,1)
Out[51]: 
state      Ohio     Colorado
color     Green Red    Green
key2 key1
1    a        0   1        2
2    a        3   4        5
1    b        6   7        8
2    b        9  10       11

In [53]: frame.swaplevel(0,1,axis=1)
Out[53]: 
color     Green  Red    Green
state      Ohio Ohio Colorado
key1 key2
a    1        0    1        2
     2        3    4        5
b    1        6    7        8
     2        9   10       11

In [54]: frame.sortlevel(1)
W:\software\Python\Python35\Scripts\ipython:1: FutureWarning: sortlevel is
 deprecated, use sort_index(level= ...)
Out[54]: 
state      Ohio     Colorado
color     Green Red    Green
key1 key2
a    1        0   1        2
b    1        6   7        8
a    2        3   4        5
b    2        9  10       11

In [55]: frame.sort_index(level=1)
Out[55]: 
state      Ohio     Colorado
color     Green Red    Green
key1 key2
a    1        0   1        2
b    1        6   7        8
a    2        3   4        5
b    2        9  10       11

In [56]: frame.sort_index(level=1,axis=1)
Out[56]: 
state     Colorado  Ohio
color        Green Green Red
key1 key2
a    1           2     0   1
     2           5     3   4
b    1           8     6   7
     2          11     9  10

Sometimes we want to use one or more of a DataFrame's columns as its index:

In [59]: frame = pd.DataFrame({'a': range(7), 'b': range(7, 0, -1),
    ...:  'c': ['one', 'one', 'one', 'two', 'two', 'two', 'two'],
    ...: 'd': [0, 1, 2, 0, 1, 2, 3]})

In [60]: frame
Out[60]: 
   a  b    c  d
0  0  7  one  0
1  1  6  one  1
2  2  5  one  2
3  3  4  two  0
4  4  3  two  1
5  5  2  two  2
6  6  1  two  3

In [61]: frame1 = frame.set_index(['c','d'])

In [62]: frame1
Out[62]: 
       a  b
c   d
one 0  0  7
    1  1  6
    2  2  5
two 0  3  4
    1  4  3
    2  5  2
    3  6  1

By default, the columns used as the index are removed from the DataFrame; pass drop=False to keep them:

In [63]: frame.set_index(['c','d'],drop=False)
Out[63]: 
       a  b    c  d
c   d
one 0  0  7  one  0
    1  1  6  one  1
    2  2  5  one  2
two 0  3  4  two  0
    1  4  3  two  1
    2  5  2  two  2
    3  6  1  two  3

reset_index does the opposite of set_index:

In [64]: frame1.reset_index()
Out[64]: 
     c  d  a  b
0  one  0  0  7
1  one  1  1  6
2  one  2  2  5
3  two  0  3  4
4  two  1  4  3
5  two  2  5  2
6  two  3  6  1
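
The round trip between columns and index levels can be sketched as follows, on a shortened version of the frame (the shorter data and the partial reset are illustrative assumptions):

```python
import pandas as pd

frame = pd.DataFrame({'a': range(4),
                      'c': ['one', 'one', 'two', 'two'],
                      'd': [0, 1, 0, 1]})

indexed = frame.set_index(['c', 'd'])     # 'c' and 'd' become index levels
value = indexed.loc[('two', 1), 'a']      # look up one cell by both level values
restored = indexed.reset_index()          # levels return as leading columns
partial = indexed.reset_index(level='d')  # move back only the 'd' level

# After restoring the original column order, the frame matches the input.
same = restored[frame.columns].equals(frame)
```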
