Indexing and selecting data in pandas

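阿新 • Published: 2020-08-15
The axis labeling information in pandas objects serves many purposes:

Identifies data (i.e. provides metadata) using known indicators, which is important for analysis, visualization, and interactive console display.
Enables automatic and explicit data alignment.
Allows intuitive getting and setting of subsets of the data set.
In this section, we will focus on the final point: namely, how to slice, dice, and generally get and set subsets of pandas objects. The primary focus will be on Series and DataFrame, as they have received more development attention in this area.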

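Note: The Python and NumPy indexing operators [] and the attribute operator . provide quick and easy access to pandas data structures across a wide range of use cases. This makes interactive work intuitive, as there is little new to learn if you already know how to deal with Python dictionaries and NumPy arrays. However, since the type of the data to be accessed is not known in advance, directly using the standard operators has some optimization limits. For production code, we recommend that you take advantage of the optimized pandas data access methods described in this chapter.
Warning: Whether a copy or a reference is returned for a setting operation may depend on the context. This is sometimes called chained assignment and should be avoided. See Returning a view versus a copy.
Warning: Indexing on an integer-based Index with floats was clarified in 0.18.0; for a summary of the changes, see the 0.18.0 release notes.
See the MultiIndex / Advanced Indexing documentation for MultiIndex and more advanced indexing.

See the cookbook for some advanced strategies.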

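Different choices for indexing
Object selection has had a number of user-requested additions in order to support more explicit location-based indexing. pandas now supports three types of multi-axis indexing.

.loc is primarily label based, but may also be used with a boolean array. .loc will raise KeyError when the items are not found. Allowed inputs are:

A single label, e.g. 5 or 'a' (note that 5 is interpreted as a label of the index; this use is not an integer position along the index).

A list or array of labels, e.g. ['a', 'b', 'c'].

A slice object with labels, e.g. 'a':'f' (note that, contrary to usual Python slices, both the start and the stop are included when present in the index; see Slicing with labels).

A boolean array.

A callable function with one argument (the calling Series or DataFrame) that returns valid output for indexing (one of the above).

New in version 0.18.1.

See more at Selection by Label.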

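.iloc is primarily integer position based (from 0 to length-1 of the axis), but may also be used with a boolean array. .iloc will raise IndexError if a requested indexer is out-of-bounds, except for slice indexers, which allow out-of-bounds indexing (this conforms with Python/NumPy slice semantics). Allowed inputs are:

An integer, e.g. 5.

A list or array of integers, e.g. [4, 3, 0].

A slice object with ints, e.g. 1:7.

A boolean array.

A callable function with one argument (the calling Series or DataFrame) that returns valid output for indexing (one of the above).

New in version 0.18.1.

See more at Selection by Position, Advanced Indexing and Advanced Hierarchical.

.loc, .iloc, and also [] indexing can accept a callable as indexer. See more at Selection By Callable.

Getting values from an object with multi-axes selection uses the following notation (using .loc as an example, but the following applies to .iloc as well). Any of the axes accessors may be the null slice :. Axes left out of the specification are assumed to be :, e.g. p.loc['a'] is equivalent to p.loc['a', :, :].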

Object Type    Indexers
Series    s.loc[indexer]
DataFrame    df.loc[row_indexer,column_indexer]
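
The difference between label-based and position-based access is easiest to see when the labels are themselves integers. The following is a minimal sketch with made-up data (the names s and t are illustrative and not part of the original text):

```python
import pandas as pd

# Integer labels deliberately stored out of positional order (illustrative data).
s = pd.Series([1.0, 2.0, 3.0], index=[2, 1, 0])

by_label = s.loc[0]      # the element whose label is 0 (stored last)
by_position = s.iloc[0]  # the element at position 0 (stored first)

t = pd.Series(range(5), index=list("abcde"))
label_slice = t.loc["b":"d"]   # label slice: both endpoints are included
position_slice = t.iloc[1:3]   # position slice: the stop is excluded

print(by_label, by_position)
print(list(label_slice.index), list(position_slice.index))
```

Here .loc[0] and .iloc[0] return different elements, and the label slice is one element longer than the positional slice that starts at the same place.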
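Basics
As mentioned when introducing the data structures in the last section, the primary function of indexing with [] (a.k.a. __getitem__ for those familiar with implementing class behavior in Python) is selecting out lower-dimensional slices. The following table shows return type values when indexing pandas objects with []:

Object Type    Selection    Return Value Type
Series    series[label]    scalar value
DataFrame    frame[colname]    Series corresponding to colname
Here we construct a simple time series data set to use for illustrating the indexing functionality: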

In [1]: dates = pd.date_range('1/1/2000', periods=8)

In [2]: df = pd.DataFrame(np.random.randn(8, 4),
   ...:                   index=dates, columns=['A', 'B', 'C', 'D'])
   ...: 

In [3]: df
Out[3]: 
                   A         B         C         D
2000-01-01  0.469112 -0.282863 -1.509059 -1.135632
2000-01-02  1.212112 -0.173215  0.119209 -1.044236
2000-01-03 -0.861849 -2.104569 -0.494929  1.071804
2000-01-04  0.721555 -0.706771 -1.039575  0.271860
2000-01-05 -0.424972  0.567020  0.276232 -1.087401
2000-01-06 -0.673690  0.113648 -1.478427  0.524988
2000-01-07  0.404705  0.577046 -1.715002 -1.039268
2000-01-08 -0.370647 -1.157892 -1.344312  0.844885
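Note: None of the indexing functionality is time series specific unless specifically stated.
Thus, as per above, we have the most basic indexing using []: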

In [4]: s = df['A']

In [5]: s[dates[5]]
Out[5]: -0.6736897080883706
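You can pass a list of columns to [] to select columns in that order. If a column is not contained in the DataFrame, an exception will be raised. Multiple columns can also be set in this manner: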

In [6]: df
Out[6]: 
                   A         B         C         D
2000-01-01  0.469112 -0.282863 -1.509059 -1.135632
2000-01-02  1.212112 -0.173215  0.119209 -1.044236
2000-01-03 -0.861849 -2.104569 -0.494929  1.071804
2000-01-04  0.721555 -0.706771 -1.039575  0.271860
2000-01-05 -0.424972  0.567020  0.276232 -1.087401
2000-01-06 -0.673690  0.113648 -1.478427  0.524988
2000-01-07  0.404705  0.577046 -1.715002 -1.039268
2000-01-08 -0.370647 -1.157892 -1.344312  0.844885

In [7]: df[['B', 'A']] = df[['A', 'B']]

In [8]: df
Out[8]: 
                   A         B         C         D
2000-01-01 -0.282863  0.469112 -1.509059 -1.135632
2000-01-02 -0.173215  1.212112  0.119209 -1.044236
2000-01-03 -2.104569 -0.861849 -0.494929  1.071804
2000-01-04 -0.706771  0.721555 -1.039575  0.271860
2000-01-05  0.567020 -0.424972  0.276232 -1.087401
2000-01-06  0.113648 -0.673690 -1.478427  0.524988
2000-01-07  0.577046  0.404705 -1.715002 -1.039268
2000-01-08 -1.157892 -0.370647 -1.344312  0.844885
You may find this useful for applying a transform (in-place) to a subset of the columns.

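Warning: pandas aligns all axes when setting Series and DataFrame from .loc and .iloc.
This will not modify df because the column alignment happens before value assignment.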

In [9]: df[['A', 'B']]
Out[9]: 
                   A         B
2000-01-01 -0.282863  0.469112
2000-01-02 -0.173215  1.212112
2000-01-03 -2.104569 -0.861849
2000-01-04 -0.706771  0.721555
2000-01-05  0.567020 -0.424972
2000-01-06  0.113648 -0.673690
2000-01-07  0.577046  0.404705
2000-01-08 -1.157892 -0.370647

In [10]: df.loc[:, ['B', 'A']] = df[['A', 'B']]

In [11]: df[['A', 'B']]
Out[11]: 
                   A         B
2000-01-01 -0.282863  0.469112
2000-01-02 -0.173215  1.212112
2000-01-03 -2.104569 -0.861849
2000-01-04 -0.706771  0.721555
2000-01-05  0.567020 -0.424972
2000-01-06  0.113648 -0.673690
2000-01-07  0.577046  0.404705
2000-01-08 -1.157892 -0.370647
The correct way to swap column values is by using raw values:

In [12]: df.loc[:, ['B', 'A']] = df[['A', 'B']].to_numpy()

In [13]: df[['A', 'B']]
Out[13]: 
                   A         B
2000-01-01  0.469112 -0.282863
2000-01-02  1.212112 -0.173215
2000-01-03 -0.861849 -2.104569
2000-01-04  0.721555 -0.706771
2000-01-05 -0.424972  0.567020
2000-01-06 -0.673690  0.113648
2000-01-07  0.404705  0.577046
2000-01-08 -0.370647 -1.157892
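
The alignment behavior can be reproduced on a small frame. This is a sketch with illustrative data; the names base, aligned and swapped are assumptions made for the example:

```python
import pandas as pd

base = pd.DataFrame({"A": [1, 2], "B": [3, 4]})

# Assigning a DataFrame through .loc aligns on column labels first,
# so each column receives its own values back and nothing changes.
aligned = base.copy()
aligned.loc[:, ["B", "A"]] = aligned[["A", "B"]]

# Converting to a NumPy array drops the labels, so values are assigned
# purely by position and the two columns really are swapped.
swapped = base.copy()
swapped.loc[:, ["B", "A"]] = swapped[["A", "B"]].to_numpy()

print(aligned)
print(swapped)
```

Dropping the labels with to_numpy() is what turns the assignment from label-aligned into positional.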
Attribute access
You may access an index on a Series or a column on a DataFrame directly as an attribute:

In [14]: sa = pd.Series([1, 2, 3], index=list('abc'))

In [15]: dfa = df.copy()
In [16]: sa.b
Out[16]: 2

In [17]: dfa.A
Out[17]: 
2000-01-01    0.469112
2000-01-02    1.212112
2000-01-03   -0.861849
2000-01-04    0.721555
2000-01-05   -0.424972
2000-01-06   -0.673690
2000-01-07    0.404705
2000-01-08   -0.370647
Freq: D, Name: A, dtype: float64
In [18]: sa.a = 5

In [19]: sa
Out[19]: 
a    5
b    2
c    3
dtype: int64

In [20]: dfa.A = list(range(len(dfa.index)))  # ok if A already exists

In [21]: dfa
Out[21]: 
            A         B         C         D
2000-01-01  0 -0.282863 -1.509059 -1.135632
2000-01-02  1 -0.173215  0.119209 -1.044236
2000-01-03  2 -2.104569 -0.494929  1.071804
2000-01-04  3 -0.706771 -1.039575  0.271860
2000-01-05  4  0.567020  0.276232 -1.087401
2000-01-06  5  0.113648 -1.478427  0.524988
2000-01-07  6  0.577046 -1.715002 -1.039268
2000-01-08  7 -1.157892 -1.344312  0.844885

In [22]: dfa['A'] = list(range(len(dfa.index)))  # use this form to create a new column

In [23]: dfa
Out[23]: 
            A         B         C         D
2000-01-01  0 -0.282863 -1.509059 -1.135632
2000-01-02  1 -0.173215  0.119209 -1.044236
2000-01-03  2 -2.104569 -0.494929  1.071804
2000-01-04  3 -0.706771 -1.039575  0.271860
2000-01-05  4  0.567020  0.276232 -1.087401
2000-01-06  5  0.113648 -1.478427  0.524988
2000-01-07  6  0.577046 -1.715002 -1.039268
2000-01-08  7 -1.157892 -1.344312  0.844885
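Warning
You can use this access only if the index element is a valid Python identifier, e.g. s.1 is not allowed. See the Python language reference for an explanation of valid identifiers.
The attribute will not be available if it conflicts with an existing method name, e.g. s.min is not allowed.
Similarly, the attribute will not be available if it conflicts with any of the following list: index, major_axis, minor_axis, items.
In any of these cases, standard indexing will still work, e.g. s['1'], s['min'], and s['index'] will access the corresponding element or column.
If you are using the IPython environment, you may also use tab-completion to see these accessible attributes.

You can also assign a dict to a row of a DataFrame: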

In [24]: x = pd.DataFrame({'x': [1, 2, 3], 'y': [3, 4, 5]})

In [25]: x.iloc[1] = {'x': 9, 'y': 99}

In [26]: x
Out[26]: 
   x   y
0  1   3
1  9  99
2  3   5
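You can use attribute access to modify an existing element of a Series or column of a DataFrame, but be careful; if you try to use attribute access to create a new column, it creates a new attribute rather than a new column. In 0.21.0 and later, this will raise a UserWarning: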

In [1]: df = pd.DataFrame({'one': [1., 2., 3.]})
In [2]: df.two = [4, 5, 6]
UserWarning: Pandas doesn't allow Series to be assigned into nonexistent columns - see https://pandas.pydata.org/pandas-docs/stable/indexing.html#attribute_access
In [3]: df
Out[3]:
   one
0  1.0
1  2.0
2  3.0
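
A short sketch of the same pitfall, next to the bracket form that does create a column. The column name "three" is made up for illustration, and the warning is silenced only to keep the output clean:

```python
import warnings

import pandas as pd

df = pd.DataFrame({"one": [1.0, 2.0, 3.0]})

with warnings.catch_warnings():
    warnings.simplefilter("ignore", UserWarning)
    df.two = [4, 5, 6]       # sets a plain Python attribute, not a column

df["three"] = [7, 8, 9]      # bracket assignment creates a real column

columns = list(df.columns)
print(columns)
print(df.two)                # the list is still reachable as an attribute
```

Only "one" and "three" appear in the column list; "two" lives on the object as an ordinary attribute.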
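Slicing ranges
The most robust and consistent way of slicing ranges along arbitrary axes is described in the Selection by Position section detailing the .iloc method. For now, we explain the semantics of slicing using the [] operator.

With Series, the syntax works exactly as with an ndarray, returning a slice of the values and the corresponding labels: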

In [27]: s[:5]
Out[27]: 
2000-01-01    0.469112
2000-01-02    1.212112
2000-01-03   -0.861849
2000-01-04    0.721555
2000-01-05   -0.424972
Freq: D, Name: A, dtype: float64

In [28]: s[::2]
Out[28]: 
2000-01-01    0.469112
2000-01-03   -0.861849
2000-01-05   -0.424972
2000-01-07    0.404705
Freq: 2D, Name: A, dtype: float64

In [29]: s[::-1]
Out[29]: 
2000-01-08   -0.370647
2000-01-07    0.404705
2000-01-06   -0.673690
2000-01-05   -0.424972
2000-01-04    0.721555
2000-01-03   -0.861849
2000-01-02    1.212112
2000-01-01    0.469112
Freq: -1D, Name: A, dtype: float64
Note that setting works as well:

In [30]: s2 = s.copy()

In [31]: s2[:5] = 0

In [32]: s2
Out[32]: 
2000-01-01    0.000000
2000-01-02    0.000000
2000-01-03    0.000000
2000-01-04    0.000000
2000-01-05    0.000000
2000-01-06   -0.673690
2000-01-07    0.404705
2000-01-08   -0.370647
Freq: D, Name: A, dtype: float64
With DataFrame, slicing inside of [] slices the rows. This is provided largely as a convenience since it is such a common operation.

In [33]: df[:3]
Out[33]: 
                   A         B         C         D
2000-01-01  0.469112 -0.282863 -1.509059 -1.135632
2000-01-02  1.212112 -0.173215  0.119209 -1.044236
2000-01-03 -0.861849 -2.104569 -0.494929  1.071804

In [34]: df[::-1]
Out[34]: 
                   A         B         C         D
2000-01-08 -0.370647 -1.157892 -1.344312  0.844885
2000-01-07  0.404705  0.577046 -1.715002 -1.039268
2000-01-06 -0.673690  0.113648 -1.478427  0.524988
2000-01-05 -0.424972  0.567020  0.276232 -1.087401
2000-01-04  0.721555 -0.706771 -1.039575  0.271860
2000-01-03 -0.861849 -2.104569 -0.494929  1.071804
2000-01-02  1.212112 -0.173215  0.119209 -1.044236
2000-01-01  0.469112 -0.282863 -1.509059 -1.135632
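Selection by label
Warning: Whether a copy or a reference is returned for a setting operation may depend on the context. This is sometimes called chained assignment and should be avoided. See Returning a view versus a copy.
Warning
.loc is strict when you present slicers that are not compatible (or convertible) with the index type, for example using integers in a DatetimeIndex. These will raise a TypeError.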
In [35]: dfl = pd.DataFrame(np.random.randn(5, 4),
   ....:                    columns=list('ABCD'),
   ....:                    index=pd.date_range('20130101', periods=5))
   ....: 

In [36]: dfl
Out[36]: 
                   A         B         C         D
2013-01-01  1.075770 -0.109050  1.643563 -1.469388
2013-01-02  0.357021 -0.674600 -1.776904 -0.968914
2013-01-03 -1.294524  0.413738  0.276662 -0.472035
2013-01-04 -0.013960 -0.362543 -0.006154 -0.923061
2013-01-05  0.895717  0.805244 -1.206412  2.565646
In [4]: dfl.loc[2:3]
TypeError: cannot do slice indexing on <class 'pandas.tseries.index.DatetimeIndex'> with these indexers [2] of <type 'int'>
String-likes in slicing can be converted to the type of the index and lead to natural slicing.

In [37]: dfl.loc['20130102':'20130104']
Out[37]: 
                   A         B         C         D
2013-01-02  0.357021 -0.674600 -1.776904 -0.968914
2013-01-03 -1.294524  0.413738  0.276662 -0.472035
2013-01-04 -0.013960 -0.362543 -0.006154 -0.923061
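
The strictness of .loc on a DatetimeIndex can be checked with a small script. This is a sketch using deterministic illustrative data instead of random numbers; the variable names are assumptions:

```python
import numpy as np
import pandas as pd

dfl = pd.DataFrame(np.arange(20).reshape(5, 4),
                   columns=list("ABCD"),
                   index=pd.date_range("20130101", periods=5))

# Integer slice bounds are not convertible to timestamps, so .loc refuses them.
try:
    dfl.loc[2:3]
    int_slice_error = None
except TypeError as exc:
    int_slice_error = exc

by_string = dfl.loc["20130102":"20130104"]  # strings are parsed as dates
by_position = dfl.iloc[2:3]                 # positional slicing belongs to .iloc

print(type(int_slice_error).__name__)
print(len(by_string), len(by_position))
```

When positions are what you mean, .iloc is the accessor to reach for; .loc only accepts values that make sense as labels of the index.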
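Warning: Starting in 0.21.0, pandas will show a FutureWarning if indexing with a list with missing labels. In the future this will raise a KeyError. See Indexing with list with missing labels is deprecated.
pandas provides a suite of methods in order to have purely label based indexing. This is a strict inclusion based protocol. Every label asked for must be in the index, or a KeyError will be raised. When slicing, both the start bound and the stop bound are included, if present in the index. Integers are valid labels, but they refer to the label and not the position.

The .loc attribute is the primary access method. The following are valid inputs:

A single label, e.g. 5 or 'a' (note that 5 is interpreted as a label of the index; this use is not an integer position along the index).
A list or array of labels, e.g. ['a', 'b', 'c'].
A slice object with labels, e.g. 'a':'f' (note that, contrary to usual Python slices, both the start and the stop are included when present in the index; see Slicing with labels).
A boolean array.
A callable, see Selection By Callable.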
In [38]: s1 = pd.Series(np.random.randn(6), index=list('abcdef'))

In [39]: s1
Out[39]: 
a    1.431256
b    1.340309
c   -1.170299
d   -0.226169
e    0.410835
f    0.813850
dtype: float64

In [40]: s1.loc['c':]
Out[40]: 
c   -1.170299
d   -0.226169
e    0.410835
f    0.813850
dtype: float64

In [41]: s1.loc['b']
Out[41]: 1.3403088497993827
Note that setting works as well:

In [42]: s1.loc['c':] = 0

In [43]: s1
Out[43]: 
a    1.431256
b    1.340309
c    0.000000
d    0.000000
e    0.000000
f    0.000000
dtype: float64
With a DataFrame:

In [44]: df1 = pd.DataFrame(np.random.randn(6, 4),
   ....:                    index=list('abcdef'),
   ....:                    columns=list('ABCD'))
   ....: 

In [45]: df1
Out[45]: 
          A         B         C         D
a  0.132003 -0.827317 -0.076467 -1.187678
b  1.130127 -1.436737 -1.413681  1.607920
c  1.024180  0.569605  0.875906 -2.211372
d  0.974466 -2.006747 -0.410001 -0.078638
e  0.545952 -1.219217 -1.226825  0.769804
f -1.281247 -0.727707 -0.121306 -0.097883

In [46]: df1.loc[['a', 'b', 'd'], :]
Out[46]: 
          A         B         C         D
a  0.132003 -0.827317 -0.076467 -1.187678
b  1.130127 -1.436737 -1.413681  1.607920
d  0.974466 -2.006747 -0.410001 -0.078638
Accessing via label slices:

In [47]: df1.loc['d':, 'A':'C']
Out[47]: 
          A         B         C
d  0.974466 -2.006747 -0.410001
e  0.545952 -1.219217 -1.226825
f -1.281247 -0.727707 -0.121306
For getting a cross section using a label (equivalent to df.xs('a')):

In [48]: df1.loc['a']
Out[48]: 
A    0.132003
B   -0.827317
C   -0.076467
D   -1.187678
Name: a, dtype: float64
For getting values with a boolean array:

In [49]: df1.loc['a'] > 0
Out[49]: 
A     True
B    False
C    False
D    False
Name: a, dtype: bool

In [50]: df1.loc[:, df1.loc['a'] > 0]
Out[50]: 
          A
a  0.132003
b  1.130127
c  1.024180
d  0.974466
e  0.545952
f -1.281247
For getting a value explicitly (equivalent to the deprecated df.get_value('a','A')):

# this is also equivalent to ``df1.at['a','A']``
In [51]: df1.loc['a', 'A']
Out[51]: 0.13200317033032932
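Slicing with labels
When using .loc with slices, if both the start and the stop labels are present in the index, then elements located between the two (including them) are returned: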

In [52]: s = pd.Series(list('abcde'), index=[0, 3, 2, 5, 4])

In [53]: s.loc[3:5]
Out[53]: 
3    b
2    c
5    d
dtype: object
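If at least one of the two is absent, but the index is sorted and can be compared against the start and stop labels, then slicing will still work as expected, by selecting labels which rank between the two: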

In [54]: s.sort_index()
Out[54]: 
0    a
2    c
3    b
4    e
5    d
dtype: object

In [55]: s.sort_index().loc[1:6]
Out[55]: 
2    c
3    b
4    e
5    d
dtype: object
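However, if at least one of the two is absent and the index is not sorted, an error will be raised (since doing otherwise would be computationally expensive, as well as potentially ambiguous for mixed type indexes). For instance, in the above example, s.loc[1:6] would raise KeyError.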
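
The three cases above can be put side by side in one script. This is a sketch that reuses the Series from the example; the variable names are illustrative:

```python
import pandas as pd

s = pd.Series(list("abcde"), index=[0, 3, 2, 5, 4])

# Case 1: both labels exist, so everything stored between them is returned.
present = s.loc[3:5]

# Case 2: labels 1 and 6 are absent, but a sorted index can still rank them.
ranked = s.sort_index().loc[1:6]

# Case 3: absent labels on an unsorted index cannot be resolved.
try:
    s.loc[1:6]
    raised = False
except KeyError:
    raised = True

print(list(present), list(ranked.index), raised)
```

Sorting the index first is usually the simplest remedy when the slice bounds are not guaranteed to exist as labels.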

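Selection by position
Warning: Whether a copy or a reference is returned for a setting operation may depend on the context. This is sometimes called chained assignment and should be avoided. See Returning a view versus a copy.
pandas provides a suite of methods in order to get purely integer based indexing. The semantics follow closely Python and NumPy slicing. These are 0-based indexing. When slicing, the start bound is included, while the upper bound is excluded. Trying to use a non-integer, even a valid label, will raise an IndexError.

The .iloc attribute is the primary access method. The following are valid inputs:

An integer, e.g. 5.
A list or array of integers, e.g. [4, 3, 0].
A slice object with ints, e.g. 1:7.
A boolean array.
A callable, see Selection By Callable.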
In [56]: s1 = pd.Series(np.random.randn(5), index=list(range(0, 10, 2)))

In [57]: s1
Out[57]: 
0    0.695775
2    0.341734
4    0.959726
6   -1.110336
8   -0.619976
dtype: float64

In [58]: s1.iloc[:3]
Out[58]: 
0    0.695775
2    0.341734
4    0.959726
dtype: float64

In [59]: s1.iloc[3]
Out[59]: -1.110336102891167
Note that setting works as well:

In [60]: s1.iloc[:3] = 0

In [61]: s1
Out[61]: 
0    0.000000
2    0.000000
4    0.000000
6   -1.110336
8   -0.619976
dtype: float64
With a DataFrame:

In [62]: df1 = pd.DataFrame(np.random.randn(6, 4),
   ....:                    index=list(range(0, 12, 2)),
   ....:                    columns=list(range(0, 8, 2)))
   ....: 

In [63]: df1
Out[63]: 
           0         2         4         6
0   0.149748 -0.732339  0.687738  0.176444
2   0.403310 -0.154951  0.301624 -2.179861
4  -1.369849 -0.954208  1.462696 -1.743161
6  -0.826591 -0.345352  1.314232  0.690579
8   0.995761  2.396780  0.014871  3.357427
10 -0.317441 -1.236269  0.896171 -0.487602
Select via integer slicing:

In [64]: df1.iloc[:3]
Out[64]: 
          0         2         4         6
0  0.149748 -0.732339  0.687738  0.176444
2  0.403310 -0.154951  0.301624 -2.179861
4 -1.369849 -0.954208  1.462696 -1.743161

In [65]: df1.iloc[1:5, 2:4]
Out[65]: 
          4         6
2  0.301624 -2.179861
4  1.462696 -1.743161
6  1.314232  0.690579
8  0.014871  3.357427
Select via integer list:

In [66]: df1.iloc[[1, 3, 5], [1, 3]]
Out[66]: 
           2         6
2  -0.154951 -2.179861
6  -0.345352  0.690579
10 -1.236269 -0.487602
In [67]: df1.iloc[1:3, :]
Out[67]: 
          0         2         4         6
2  0.403310 -0.154951  0.301624 -2.179861
4 -1.369849 -0.954208  1.462696 -1.743161
In [68]: df1.iloc[:, 1:3]
Out[68]: 
           2         4
0  -0.732339  0.687738
2  -0.154951  0.301624
4  -0.954208  1.462696
6  -0.345352  1.314232
8   2.396780  0.014871
10 -1.236269  0.896171
# this is also equivalent to ``df1.iat[1,1]``
In [69]: df1.iloc[1, 1]
Out[69]: -0.1549507744249032
For getting a cross section using an integer position (equivalent to df.xs(1)):

In [70]: df1.iloc[1]
Out[70]: 
0    0.403310
2   -0.154951
4    0.301624
6   -2.179861
Name: 2, dtype: float64
Out of range slice indexes are handled gracefully, just as in Python/NumPy.

# these are allowed in python/numpy.
In [71]: x = list('abcdef')

In [72]: x
Out[72]: ['a', 'b', 'c', 'd', 'e', 'f']

In [73]: x[4:10]
Out[73]: ['e', 'f']

In [74]: x[8:10]
Out[74]: []

In [75]: s = pd.Series(x)

In [76]: s
Out[76]: 
0    a
1    b
2    c
3    d
4    e
5    f
dtype: object

In [77]: s.iloc[4:10]
Out[77]: 
4    e
5    f
dtype: object

In [78]: s.iloc[8:10]
Out[78]: Series([], dtype: object)
Note that using slices that go out of bounds can result in an empty axis (e.g. an empty DataFrame being returned).

In [79]: dfl = pd.DataFrame(np.random.randn(5, 2), columns=list('AB'))

In [80]: dfl
Out[80]: 
          A         B
0 -0.082240 -2.182937
1  0.380396  0.084844
2  0.432390  1.519970
3 -0.493662  0.600178
4  0.274230  0.132885

In [81]: dfl.iloc[:, 2:3]
Out[81]: 
Empty DataFrame
Columns: []
Index: [0, 1, 2, 3, 4]

In [82]: dfl.iloc[:, 1:3]
Out[82]: 
          B
0 -2.182937
1  0.084844
2  1.519970
3  0.600178
4  0.132885

In [83]: dfl.iloc[4:6]
Out[83]: 
         A         B
4  0.27423  0.132885
A single indexer that is out of bounds will raise an IndexError. A list of indexers where any element is out of bounds will raise an IndexError.

>>> dfl.iloc[[4, 5, 6]]
IndexError: positional indexers are out-of-bounds

>>> dfl.iloc[:, 4]
IndexError: single positional indexer is out-of-bounds
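
The contrast between forgiving slices and strict single or list indexers can be sketched as follows, using illustrative data and assumed variable names:

```python
import numpy as np
import pandas as pd

dfl = pd.DataFrame(np.arange(10).reshape(5, 2), columns=list("AB"))

clipped = dfl.iloc[4:6]    # slice runs past the end: silently truncated
empty = dfl.iloc[:, 2:3]   # slice entirely out of range: empty column axis

# A list with out-of-range positions and a single out-of-range column
# position both raise IndexError.
errors = []
for indexer in ([4, 5, 6], (slice(None), 4)):
    try:
        dfl.iloc[indexer]
    except IndexError as exc:
        errors.append(str(exc))

print(clipped.shape, empty.shape)
print(errors)
```

The tuple (slice(None), 4) is just the explicit spelling of dfl.iloc[:, 4].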
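Selection by callable
New in version 0.18.1.

.loc, .iloc, and also [] indexing can accept a callable as indexer. The callable must be a function with one argument (the calling Series or DataFrame) that returns valid output for indexing.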

In [84]: df1 = pd.DataFrame(np.random.randn(6, 4),
   ....:                    index=list('abcdef'),
   ....:                    columns=list('ABCD'))
   ....: 

In [85]: df1
Out[85]: 
          A         B         C         D
a -0.023688  2.410179  1.450520  0.206053
b -0.251905 -2.213588  1.063327  1.266143
c  0.299368 -0.863838  0.408204 -1.048089
d -0.025747 -0.988387  0.094055  1.262731
e  1.289997  0.082423 -0.055758  0.536580
f -0.489682  0.369374 -0.034571 -2.484478

In [86]: df1.loc[lambda df: df.A > 0, :]
Out[86]: 
          A         B         C         D
c  0.299368 -0.863838  0.408204 -1.048089
e  1.289997  0.082423 -0.055758  0.536580

In [87]: df1.loc[:, lambda df: ['A', 'B']]
Out[87]: 
          A         B
a -0.023688  2.410179
b -0.251905 -2.213588
c  0.299368 -0.863838
d -0.025747 -0.988387
e  1.289997  0.082423
f -0.489682  0.369374

In [88]: df1.iloc[:, lambda df: [0, 1]]
Out[88]: 
          A         B
a -0.023688  2.410179
b -0.251905 -2.213588
c  0.299368 -0.863838
d -0.025747 -0.988387
e  1.289997  0.082423
f -0.489682  0.369374

In [89]: df1[lambda df: df.columns[0]]
Out[89]: 
a   -0.023688
b   -0.251905
c    0.299368
d   -0.025747
e    1.289997
f   -0.489682
Name: A, dtype: float64
You can use callable indexing in Series.

In [90]: df1.A.loc[lambda s: s > 0]
Out[90]: 
c    0.299368
e    1.289997
Name: A, dtype: float64
Using these methods / indexers, you can chain data selection operations without using a temporary variable.

In [91]: bb = pd.read_csv('data/baseball.csv', index_col='id')

In [92]: (bb.groupby(['year', 'team']).sum()
   ....:    .loc[lambda df: df.r > 100])
   ....: 
Out[92]: 
           stint    g    ab    r    h  X2b  ...     so   ibb   hbp    sh    sf  gidp
year team                                   ...                                     
2007 CIN       6  379   745  101  203   35  ...  127.0  14.0   1.0   1.0  15.0  18.0
     DET       5  301  1062  162  283   54  ...  176.0   3.0  10.0   4.0   8.0  28.0
     HOU       4  311   926  109  218   47  ...  212.0   3.0   9.0  16.0   6.0  17.0
     LAN      11  413  1021  153  293   61  ...  141.0   8.0   9.0   3.0   8.0  29.0
     NYN      13  622  1854  240  509  101  ...  310.0  24.0  23.0  18.0  15.0  48.0
     SFN       5  482  1305  198  337   67  ...  188.0  51.0   8.0  16.0   6.0  41.0
     TEX       2  198   729  115  200   40  ...  140.0   4.0   5.0   2.0   8.0  16.0
     TOR       4  459  1408  187  378   96  ...  265.0  16.0  12.0   4.0  16.0  38.0

[8 rows x 18 columns]
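
Since data/baseball.csv may not be at hand, the same chaining pattern can be tried on a small hand-made frame. The figures below are invented for illustration and are not taken from the baseball data set:

```python
import pandas as pd

# Hypothetical stand-in for the baseball data: runs (r) per player stint.
bb = pd.DataFrame({
    "year": [2007, 2007, 2007, 2007],
    "team": ["CIN", "CIN", "DET", "HOU"],
    "r": [60, 41, 162, 90],
})

# The callable receives the grouped-and-summed frame, so no temporary
# variable is needed to refer to it inside the filter.
high_scoring = (bb.groupby(["year", "team"]).sum()
                  .loc[lambda df: df["r"] > 100])

print(high_scoring)
```

Without the callable, the aggregated frame would have to be stored in an intermediate variable before it could be filtered on its own column.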
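IX indexer is deprecated
Warning: Starting in 0.20.0, the .ix indexer is deprecated, in favor of the more strict .iloc and .loc indexers.
.ix offers a lot of magic on the inference of what the user wants to do. To wit, .ix can decide to index positionally or via labels depending on the data type of the index. This has caused quite a bit of user confusion over the years.

The recommended methods of indexing are:

.loc if you want to label index.
.iloc if you want to positionally index.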
In [93]: dfd = pd.DataFrame({'A': [1, 2, 3],
   ....:                     'B': [4, 5, 6]},
   ....:                    index=list('abc'))
   ....: 

In [94]: dfd
Out[94]: 
   A  B
a  1  4
b  2  5
c  3  6
Previous behavior, where you wish to get the 0th and the 2nd elements from the index in the 'A' column.

In [3]: dfd.ix[[0, 2], 'A']
Out[3]:
a    1
c    3
Name: A, dtype: int64
Using .loc. Here we will select the appropriate indexes from the index, then use label indexing.

In [95]: dfd.loc[dfd.index[[0, 2]], 'A']
Out[95]: 
a    1
c    3
Name: A, dtype: int64
This can also be expressed using .iloc, by explicitly getting locations on the indexers, and using positional indexing to select things.

In [96]: dfd.iloc[[0, 2], dfd.columns.get_loc('A')]
Out[96]: 
a    1
c    3
Name: A, dtype: int64
For getting multiple indexers, use .get_indexer:

In [97]: dfd.iloc[[0, 2], dfd.columns.get_indexer(['A', 'B'])]
Out[97]: 
   A  B
a  1  4
c  3  6
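Indexing with list with missing labels is deprecated
Warning: Starting in 0.21.0, using .loc or [] with a list with one or more missing labels is deprecated, in favor of .reindex.
In prior versions, using .loc[list-of-labels] would work as long as at least one of the keys was found (otherwise it would raise a KeyError). This behavior is deprecated and will show a warning message pointing to this section. The recommended alternative is to use .reindex().

For example.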

In [98]: s = pd.Series([1, 2, 3])

In [99]: s
Out[99]: 
0    1
1    2
2    3
dtype: int64
Selection with all keys found is unchanged.

In [100]: s.loc[[1, 2]]
Out[100]: 
1    2
2    3
dtype: int64
Previous behavior

In [4]: s.loc[[1, 2, 3]]
Out[4]:
1    2.0
2    3.0
3    NaN
dtype: float64
Current behavior

In [4]: s.loc[[1, 2, 3]]
Passing list-likes to .loc with any non-matching elements will raise
KeyError in the future, you can use .reindex() as an alternative.

See the documentation here:
http://pandas.pydata.org/pandas-docs/stable/indexing.html#deprecate-loc-reindex-listlike

Out[4]:
1    2.0
2    3.0
3    NaN
dtype: float64
Reindexing
The idiomatic way to achieve selecting potentially not-found elements is via .reindex(). See also the section on reindexing.

In [101]: s.reindex([1, 2, 3])
Out[101]: 
1    2.0
2    3.0
3    NaN
dtype: float64
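Alternatively, if you want to select only valid keys, the following is idiomatic and efficient; it is guaranteed to preserve the dtype of the selection.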

In [102]: labels = [1, 2, 3]

In [103]: s.loc[s.index.intersection(labels)]
Out[103]: 
1    2
2    3
dtype: int64
Having a duplicated index will raise for a .reindex():

In [104]: s = pd.Series(np.arange(4), index=['a', 'a', 'b', 'c'])

In [105]: labels = ['c', 'd']
In [17]: s.reindex(labels)
ValueError: cannot reindex from a duplicate axis
Generally, you can intersect the desired labels with the current axis, and then reindex.

In [106]: s.loc[s.index.intersection(labels)].reindex(labels)
Out[106]: 
c    3.0
d    NaN
dtype: float64
However, this would still raise if your resulting index is duplicated.

In [41]: labels = ['a', 'd']

In [42]: s.loc[s.index.intersection(labels)].reindex(labels)
ValueError: cannot reindex from a duplicate axis
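
The two recommended alternatives can be compared in a version-tolerant sketch. It assumes, as the deprecation notice announces, that newer pandas releases raise KeyError for the .loc lookup while 0.21-era releases only warn; the variable names are illustrative:

```python
import pandas as pd

s = pd.Series([1, 2, 3])
labels = [1, 2, 3]          # label 3 is not in the index

# Depending on the installed version this either warns and returns NaN
# for the missing label, or raises KeyError.
try:
    s.loc[labels]
    loc_outcome = "returned"
except KeyError:
    loc_outcome = "KeyError"

reindexed = s.reindex(labels)                        # missing label becomes NaN
present_only = s.loc[s.index.intersection(labels)]   # missing label is dropped

print(loc_outcome)
print(reindexed)
print(present_only)
```

reindex() upcasts to float to make room for NaN, while the intersection form keeps the original integer dtype.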
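Selecting random samples
A random selection of rows or columns from a Series or DataFrame can be drawn with the sample() method. The method will sample rows by default, and accepts a specific number of rows/columns to return, or a fraction of rows.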

In [107]: s = pd.Series([0, 1, 2, 3, 4, 5])

# When no arguments are passed, returns 1 row.
In [108]: s.sample()
Out[108]: 
4    4
dtype: int64

# One may specify either a number of rows:
In [109]: s.sample(n=3)
Out[109]: 
0    0
4    4
1    1
dtype: int64

# Or a fraction of the rows:
In [110]: s.sample(frac=0.5)
Out[110]: 
5    5
3    3
1    1
dtype: int64
By default, sample will return each row at most once, but one can also sample with replacement using the replace option:

In [111]: s = pd.Series([0, 1, 2, 3, 4, 5])

# Without replacement (default):
In [112]: s.sample(n=6, replace=False)
Out[112]: 
0    0
1    1
5    5
3    3
2    2
4    4
dtype: int64

# With replacement:
In [113]: s.sample(n=6, replace=True)
Out[113]: 
0    0
4    4
3    3
2    2
4    4
4    4
dtype: int64
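By default, each row has an equal probability of being selected, but if you want rows to have different probabilities, you can pass the sample function sampling weights as weights. These weights can be a list, a NumPy array, or a Series, but they must be of the same length as the object you are sampling. Missing values will be treated as a weight of zero, and inf values are not allowed. If weights do not sum to 1, they will be re-normalized by dividing all weights by the sum of the weights. For example: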

In [114]: s = pd.Series([0, 1, 2, 3, 4, 5])

In [115]: example_weights = [0, 0, 0.2, 0.2, 0.2, 0.4]

In [116]: s.sample(n=3, weights=example_weights)
Out[116]: 
5    5
4    4
3    3
dtype: int64

# Weights will be re-normalized automatically
In [117]: example_weights2 = [0.5, 0, 0, 0, 0, 0]

In [118]: s.sample(n=1, weights=example_weights2)
Out[118]: 
0    0
dtype: int64
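When applied to a DataFrame, you can use a column of the DataFrame as sampling weights (provided you are sampling rows and not columns) by simply passing the name of the column as a string.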

In [119]: df2 = pd.DataFrame({'col1': [9, 8, 7, 6],
   .....:                     'weight_column': [0.5, 0.4, 0.1, 0]})
   .....: 

In [120]: df2.sample(n=3, weights='weight_column')
Out[120]: 
   col1  weight_column
1     8            0.4
0     9            0.5
2     7            0.1
sample also allows users to sample columns instead of rows using the axis argument.

In [121]: df3 = pd.DataFrame({'col1': [1, 2, 3], 'col2': [2, 3, 4]})

In [122]: df3.sample(n=1, axis=1)
Out[122]: 
   col1
0     1
1     2
2     3
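Finally, one can also set a seed for sample's random number generator using the random_state argument, which will accept either an integer (as a seed) or a NumPy RandomState object.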

In [123]: df4 = pd.DataFrame({'col1': [1, 2, 3], 'col2': [2, 3, 4]})

# With a given seed, the sample will always draw the same rows.
In [124]: df4.sample(n=2, random_state=2)
Out[124]: 
   col1  col2
2     3     4
1     2     3

In [125]: df4.sample(n=2, random_state=2)
Out[125]: 
   col1  col2
2     3     4
1     2     3
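
Weights and seeding combine naturally. The sketch below, with an arbitrarily chosen seed of 0, checks two properties described above: zero-weight rows are never drawn, and a fixed random_state makes the draw repeatable:

```python
import pandas as pd

s = pd.Series([0, 1, 2, 3, 4, 5])
weights = [0, 0, 0.2, 0.2, 0.2, 0.4]   # rows 0 and 1 can never be picked

first = s.sample(n=3, weights=weights, random_state=0)
second = s.sample(n=3, weights=weights, random_state=0)

print(sorted(first.index))
print(first.equals(second))   # same seed, same sample
```

Which three rows are drawn depends on the seed and the installed versions, so no specific output is shown here.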
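Setting with enlargement
The .loc/[] operations can perform enlargement when setting a non-existent key for that axis.

In the Series case this is effectively an appending operation.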

In [126]: se = pd.Series([1, 2, 3])

In [127]: se
Out[127]: 
0    1
1    2
2    3
dtype: int64

In [128]: se[5] = 5.

In [129]: se
Out[129]: 
0    1.0
1    2.0
2    3.0
5    5.0
dtype: float64
A DataFrame can be enlarged on either axis via .loc.

In [130]: dfi = pd.DataFrame(np.arange(6).reshape(3, 2),
   .....:                    columns=['A', 'B'])
   .....: 

In [131]: dfi
Out[131]: 
   A  B
0  0  1
1  2  3
2  4  5

In [132]: dfi.loc[:, 'C'] = dfi.loc[:, 'A']

In [133]: dfi
Out[133]: 
   A  B  C
0  0  1  0
1  2  3  2
2  4  5  4
This is like an append operation on the DataFrame.

In [134]: dfi.loc[3] = 5

In [135]: dfi
Out[135]: 
   A  B  C
0  0  1  0
1  2  3  2
2  4  5  4
3  5  5  5
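
Enlargement on both axes can be condensed into one script. This is a sketch with illustrative values; the derived column C is an assumption made for the example:

```python
import numpy as np
import pandas as pd

se = pd.Series([1, 2, 3])
se.loc[5] = 5.0                        # label 5 does not exist: appended

dfi = pd.DataFrame(np.arange(6).reshape(3, 2), columns=["A", "B"])
dfi.loc[:, "C"] = dfi["A"] * 10        # new column label: frame grows sideways
dfi.loc[3] = 5                         # new row label: frame grows downward

print(se)
print(dfi)
```

Each enlargement allocates new data, so growing an object one label at a time inside a long loop tends to be slow; collecting the values first and building the object once is usually cheaper.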
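Fast scalar value getting and setting
Since indexing with [] must handle a lot of cases (single-label access, slicing, boolean indexing, etc.), it has a bit of overhead in order to figure out what you are asking for. If you only want to access a scalar value, the fastest way is to use the at and iat methods, which are implemented on all of the data structures.

Similarly to loc, at provides label based scalar lookups, while iat provides integer based lookups analogously to iloc.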

In [136]: s.iat[5]
Out[136]: 5

In [137]: df.at[dates[5], 'A']
Out[137]: -0.6736897080883706

In [138]: df.iat[3, 0]
Out[138]: 0.7215551622443669
You can also set using these same indexers.

In [139]: df.at[dates[5], 'E'] = 7

In [140]: df.iat[3, 0] = 7
at may enlarge the object in-place as above if the indexer is missing.

In [141]: df.at[dates[-1] + pd.Timedelta('1 day'), 0] = 7

In [142]: df
Out[142]: 
                   A         B         C         D    E    0
2000-01-01  0.469112 -0.282863 -1.509059 -1.135632  NaN  NaN
2000-01-02  1.212112 -0.173215  0.119209 -1.044236  NaN  NaN
2000-01-03 -0.861849 -2.104569 -0.494929  1.071804  NaN  NaN
2000-01-04  7.000000 -0.706771 -1.039575  0.271860  NaN  NaN
2000-01-05 -0.424972  0.567020  0.276232 -1.087401  NaN  NaN
2000-01-06 -0.673690  0.113648 -1.478427  0.524988  7.0  NaN
2000-01-07  0.404705  0.577046 -1.715002 -1.039268  NaN  NaN
2000-01-08 -0.370647 -1.157892 -1.344312  0.844885  NaN  NaN
2000-01-09       NaN       NaN       NaN       NaN  NaN  7.0
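
A compact sketch of the scalar accessors on a small frame; the data and labels are made up for illustration:

```python
import pandas as pd

df = pd.DataFrame({"A": [1.0, 2.0, 3.0], "B": [4.0, 5.0, 6.0]},
                  index=list("xyz"))

by_label = df.at["y", "B"]     # label-based scalar lookup, like .loc
by_position = df.iat[1, 1]     # position-based scalar lookup, like .iloc

df.at["z", "A"] = 30.0         # scalar assignment by label
df.iat[0, 1] = 40.0            # scalar assignment by position

print(by_label, by_position)
print(df)
```

Both lookups address the same cell here, once through its labels and once through its integer coordinates.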
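Boolean indexing
Another common operation is the use of boolean vectors to filter the data. The operators are: | for or, & for and, and ~ for not. These must be grouped by using parentheses, since by default Python will evaluate an expression such as df.A > 2 & df.B < 3 as df.A > (2 & df.B) < 3, while the desired evaluation order is (df.A > 2) & (df.B < 3).

Using a boolean vector to index a Series works exactly as in a NumPy ndarray: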

In [143]: s = pd.Series(range(-3, 4))

In [144]: s
Out[144]: 
0   -3
1   -2
2   -1
3    0
4    1
5    2
6    3
dtype: int64

In [145]: s[s > 0]
Out[145]: 
4    1
5    2
6    3
dtype: int64

In [146]: s[(s < -1) | (s > 0.5)]
Out[146]: 
0   -3
1   -2
4    1
5    2
6    3
dtype: int64

In [147]: s[~(s < 0)]
Out[147]: 
3    0
4    1
5    2
6    3
dtype: int64
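You may select rows from a DataFrame using a boolean vector the same length as the DataFrame's index (for example, something derived from one of the columns of the DataFrame):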

In [148]: df[df['A'] > 0]
Out[148]: 
                   A         B         C         D   E   0
2000-01-01  0.469112 -0.282863 -1.509059 -1.135632 NaN NaN
2000-01-02  1.212112 -0.173215  0.119209 -1.044236 NaN NaN
2000-01-04  7.000000 -0.706771 -1.039575  0.271860 NaN NaN
2000-01-07  0.404705  0.577046 -1.715002 -1.039268 NaN NaN
List comprehensions and the map method of Series can also be used to produce more complex criteria:

In [149]: df2 = pd.DataFrame({'a': ['one', 'one', 'two', 'three', 'two', 'one', 'six'],
   .....:                     'b': ['x', 'y', 'y', 'x', 'y', 'x', 'x'],
   .....:                     'c': np.random.randn(7)})
   .....: 

# only want 'two' or 'three'
In [150]: criterion = df2['a'].map(lambda x: x.startswith('t'))

In [151]: df2[criterion]
Out[151]: 
       a  b         c
2    two  y  0.041290
3  three  x  0.361719
4    two  y -0.238075

# equivalent but slower
In [152]: df2[[x.startswith('t') for x in df2['a']]]
Out[152]: 
       a  b         c
2    two  y  0.041290
3  three  x  0.361719
4    two  y -0.238075

# Multiple criteria
In [153]: df2[criterion & (df2['b'] == 'x')]
Out[153]: 
       a  b         c
3  three  x  0.361719
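With the choice methods Selection by Label, Selection by Position, and Advanced Indexing you may select along more than one axis using boolean vectors combined with other indexing expressions.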

In [154]: df2.loc[criterion & (df2['b'] == 'x'), 'b':'c']
Out[154]: 
   b         c
3  x  0.361719
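
The operator-precedence pitfall mentioned at the start of this section can be demonstrated directly. This is a sketch with illustrative integer data; the unparenthesized form is expected to fail because Python turns it into a chained comparison that needs a single truth value:

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 3, 5, 7], "B": [0, 2, 4, 6]})

# Parenthesized: each comparison yields a boolean Series, then & combines them.
both = df[(df["A"] > 2) & (df["B"] < 5)]

# Unparenthesized: parsed as df["A"] > (2 & df["B"]) < 3, a chained comparison.
try:
    df[df["A"] > 2 & df["B"] < 3]
    unparenthesized_ok = True
except ValueError:
    unparenthesized_ok = False

not_large = df[~(df["A"] > 2)]   # ~ negates a boolean vector

print(list(both.index), unparenthesized_ok, list(not_large.index))
```

Writing every comparison in its own pair of parentheses avoids the problem regardless of the operand types.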
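Indexing with isin
Consider the isin() method of Series, which returns a boolean vector that is true wherever the Series elements exist in the passed list. This allows you to select rows where one or more columns have values you want: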

In [155]: s = pd.Series(np.arange(5), index=np.arange(5)[::-1], dtype='int64')

In [156]: s
Out[156]: 
4    0
3    1
2    2
1    3
0    4
dtype: int64

In [157]: s.isin([2, 4, 6])
Out[157]: 
4    False
3    False
2     True
1    False
0     True
dtype: bool

In [158]: s[s.isin([2, 4, 6])]
Out[158]: 
2    2
0    4
dtype: int64
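The same method is available for Index objects and is useful for the cases when you do not know which of the sought labels are in fact present: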

In [159]: s[s.index.isin([2, 4, 6])]
Out[159]: 
4    0
2    2
dtype: int64

# compare it to the following
In [160]: s.reindex([2, 4, 6])
Out[160]: 
2    2.0
4    0.0
6    NaN
dtype: float64
In addition to that, MultiIndex allows selecting a separate level to use in the membership check:

In [161]: s_mi = pd.Series(np.arange(6),
   .....:                  index=pd.MultiIndex.from_product([[0, 1], ['a', 'b', 'c']]))
   .....: 

In [162]: s_mi
Out[162]: 
0  a    0
   b    1
   c    2
1  a    3
   b    4
   c    5
dtype: int64

In [163]: s_mi.iloc[s_mi.index.isin([(1, 'a'), (2, 'b'), (0, 'c')])]
Out[163]: 
0  c    2
1  a    3
dtype: int64

In [164]: s_mi.iloc[s_mi.index.isin(['a', 'c', 'e'], level=1)]
Out[164]: 
0  a    0
   c    2
1  a    3
   c    5
dtype: int64
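DataFrame also has an isin() method. When calling isin, pass a set of values as either an array or dict. If values is an array, isin returns a DataFrame of booleans that is the same shape as the original DataFrame, with True wherever the element is in the sequence of values.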

In [165]: df = pd.DataFrame({'vals': [1, 2, 3, 4], 'ids': ['a', 'b', 'f', 'n'],
   .....:                    'ids2': ['a', 'n', 'c', 'n']})
   .....: 

In [166]: values = ['a', 'b', 1, 3]

In [167]: df.isin(values)
Out[167]: 
    vals    ids   ids2
0   True   True   True
1  False   True  False
2   True  False  False
3  False  False  False
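Oftentimes you will want to match certain values with certain columns. Just make values a dict where the key is the column, and the value is a list of items you want to check for.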

In [168]: values = {'ids': ['a', 'b'], 'vals': [1, 3]}

In [169]: df.isin(values)
Out[169]: 
    vals    ids   ids2
0   True   True  False
1  False   True  False
2   True  False  False
3  False  False  False
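Combine DataFrame's isin with the any() and all() methods to quickly select subsets of your data that meet a given criteria. To select a row where each column meets its own criterion: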

In [170]: values = {'ids': ['a', 'b'], 'ids2': ['a', 'c'], 'vals': [1, 3]}

In [171]: row_mask = df.isin(values).all(1)

In [172]: df[row_mask]
Out[172]: 
   vals ids ids2
0     1   a    a
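
When the dict names only some of the columns, the unnamed columns are all False in the mask, so all() over every column would select nothing. One possible way around this, sketched here as an assumption rather than taken from the original text, is to reduce only over the named columns, or to use any():

```python
import pandas as pd

df = pd.DataFrame({"vals": [1, 2, 3, 4],
                   "ids": ["a", "b", "f", "n"],
                   "ids2": ["a", "n", "c", "n"]})

values = {"ids": ["a", "b"], "vals": [1, 3]}   # ids2 is deliberately left out
mask = df.isin(values)

# Rows where every *named* column matches its own list.
all_named_match = df[mask[list(values)].all(axis=1)]

# Rows where at least one column matches.
any_match = df[mask.any(axis=1)]

print(all_named_match)
print(any_match)
```

Restricting the mask to list(values) keeps the unnamed ids2 column from vetoing every row.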
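The where() Method and Masking
Selecting values from a Series with a boolean vector generally returns a subset of the data. To guarantee that selection output has the same shape as the original data, you can use the where method in Series and DataFrame.

To return only the selected rows: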

In [173]: s[s > 0]
Out[173]: 
3    1
2    2
1    3
0    4
dtype: int64
To return a Series of the same shape as the original:

In [174]: s.where(s > 0)
Out[174]: 
4    NaN
3    1.0
2    2.0
1    3.0
0    4.0
dtype: float64
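Selecting values from a DataFrame with a boolean criterion now also preserves input data shape. where is used under the hood as the implementation. The code below is equivalent to df.where(df < 0).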

In [175]: df[df < 0]
Out[175]: 
                   A         B         C         D
2000-01-01 -2.104139 -1.309525       NaN       NaN
2000-01-02 -0.352480       NaN -1.192319       NaN
2000-01-03 -0.864883       NaN -0.227870       NaN
2000-01-04       NaN -1.222082       NaN -1.233203
2000-01-05       NaN -0.605656 -1.169184       NaN
2000-01-06       NaN -0.948458       NaN -0.684718
2000-01-07 -2.670153 -0.114722       NaN -0.048048
2000-01-08       NaN       NaN -0.048788 -0.808838
In addition, where takes an optional other argument for replacement of values where the condition is False, in the returned copy.

In [176]: df.where(df < 0, -df)
Out[176]: 
                   A         B         C         D
2000-01-01 -2.104139 -1.309525 -0.485855 -0.245166
2000-01-02 -0.352480 -0.390389 -1.192319 -1.655824
2000-01-03 -0.864883 -0.299674 -0.227870 -0.281059
2000-01-04 -0.846958 -1.222082 -0.600705 -1.233203
2000-01-05 -0.669692 -0.605656 -1.169184 -0.342416
2000-01-06 -0.868584 -0.948458 -2.297780 -0.684718
2000-01-07 -2.670153 -0.114722 -0.168904 -0.048048
2000-01-08 -0.801196 -1.392071 -0.048788 -0.808838
You may wish to set values based on some boolean criteria. This can be done intuitively like so:

In [177]: s2 = s.copy()

In [178]: s2[s2 < 0] = 0

In [179]: s2
Out[179]: 
4    0
3    1
2    2
1    3
0    4
dtype: int64

In [180]: df2 = df.copy()

In [181]: df2[df2 < 0] = 0

In [182]: df2
Out[182]: 
                   A         B         C         D
2000-01-01  0.000000  0.000000  0.485855  0.245166
2000-01-02  0.000000  0.390389  0.000000  1.655824
2000-01-03  0.000000  0.299674  0.000000  0.281059
2000-01-04  0.846958  0.000000  0.600705  0.000000
2000-01-05  0.669692  0.000000  0.000000  0.342416
2000-01-06  0.868584  0.000000  2.297780  0.000000
2000-01-07  0.000000  0.000000  0.168904  0.000000
2000-01-08  0.801196  1.392071  0.000000  0.000000
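By default, where returns a modified copy of the data. There is an optional parameter inplace so that the original data can be modified without creating a copy: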

In [183]: df_orig = df.copy()

In [184]: df_orig.where(df > 0, -df, inplace=True)

In [185]: df_orig
Out[185]: 
                   A         B         C         D
2000-01-01  2.104139  1.309525  0.485855  0.245166
2000-01-02  0.352480  0.390389  1.192319  1.655824
2000-01-03  0.864883  0.299674  0.227870  0.281059
2000-01-04  0.846958  1.222082  0.600705  1.233203
2000-01-05  0.669692  0.605656  1.169184  0.342416
2000-01-06  0.868584  0.948458  2.297780  0.684718
2000-01-07  2.670153  0.114722  0.168904  0.048048
2000-01-08  0.801196  1.392071  0.048788  0.808838
Note: The signature of DataFrame.where() differs from numpy.where(). Roughly, df1.where(m, df2) is equivalent to np.where(m, df1, df2).
In [186]: df.where(df < 0, -df) == np.where(df < 0, df, -df)
Out[186]: 
               A     B     C     D
2000-01-01  True  True  True  True
2000-01-02  True  True  True  True
2000-01-03  True  True  True  True
2000-01-04  True  True  True  True
2000-01-05  True  True  True  True
2000-01-06  True  True  True  True
2000-01-07  True  True  True  True
2000-01-08  True  True  True  True
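Alignment

Furthermore, where aligns the input boolean condition (ndarray or DataFrame), so that partial selection with setting is possible. This is analogous to partial setting via .loc (but on the contents rather than the axis labels).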

In [187]: df2 = df.copy()

In [188]: df2[df2[1:4] > 0] = 3

In [189]: df2
Out[189]: 
                   A         B         C         D
2000-01-01 -2.104139 -1.309525  0.485855  0.245166
2000-01-02 -0.352480  3.000000 -1.192319  3.000000
2000-01-03 -0.864883  3.000000 -0.227870  3.000000
2000-01-04  3.000000 -1.222082  3.000000 -1.233203
2000-01-05  0.669692 -0.605656 -1.169184  0.342416
2000-01-06  0.868584 -0.948458  2.297780 -0.684718
2000-01-07 -2.670153 -0.114722  0.168904 -0.048048
2000-01-08  0.801196  1.392071 -0.048788 -0.808838
where can also accept axis and level parameters to align the input when performing the where.

In [190]: df2 = df.copy()

In [191]: df2.where(df2 > 0, df2['A'], axis='index')
Out[191]: 
                   A         B         C         D
2000-01-01 -2.104139 -2.104139  0.485855  0.245166
2000-01-02 -0.352480  0.390389 -0.352480  1.655824
2000-01-03 -0.864883  0.299674 -0.864883  0.281059
2000-01-04  0.846958  0.846958  0.600705  0.846958
2000-01-05  0.669692  0.669692  0.669692  0.342416
2000-01-06  0.868584  0.868584  2.297780  0.868584
2000-01-07 -2.670153 -2.670153  0.168904 -2.670153
2000-01-08  0.801196  1.392071  0.801196  0.801196
This is equivalent to (but faster than) the following.

In [192]: df2 = df.copy()

In [193]: df.apply(lambda x, y: x.where(x > 0, y), y=df['A'])
Out[193]: 
                   A         B         C         D
2000-01-01 -2.104139 -2.104139  0.485855  0.245166
2000-01-02 -0.352480  0.390389 -0.352480  1.655824
2000-01-03 -0.864883  0.299674 -0.864883  0.281059
2000-01-04  0.846958  0.846958  0.600705  0.846958
2000-01-05  0.669692  0.669692  0.669692  0.342416
2000-01-06  0.868584  0.868584  2.297780  0.868584
2000-01-07 -2.670153 -2.670153  0.168904 -2.670153
2000-01-08  0.801196  1.392071  0.801196  0.801196
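New in version 0.18.1.

where can accept a callable as the condition and other arguments. The function must take one argument (the calling Series or DataFrame) and return valid output as the condition and other argument.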

In [194]: df3 = pd.DataFrame({'A': [1, 2, 3],
   .....:                     'B': [4, 5, 6],
   .....:                     'C': [7, 8, 9]})
   .....: 

In [195]: df3.where(lambda x: x > 4, lambda x: x + 10)
Out[195]: 
    A   B  C
0  11  14  7
1  12   5  8
2  13   6  9
Mask
mask() is the inverse boolean operation of where.

In [196]: s.mask(s >= 0)
Out[196]: 
4   NaN
3   NaN
2   NaN
1   NaN
0   NaN
dtype: float64

In [197]: df.mask(df >= 0)
Out[197]: 
                   A         B         C         D
2000-01-01 -2.104139 -1.309525       NaN       NaN
2000-01-02 -0.352480       NaN -1.192319       NaN
2000-01-03 -0.864883       NaN -0.227870       NaN
2000-01-04       NaN -1.222082       NaN -1.233203
2000-01-05       NaN -0.605656 -1.169184       NaN
2000-01-06       NaN -0.948458       NaN -0.684718
2000-01-07 -2.670153 -0.114722       NaN -0.048048
2000-01-08       NaN       NaN -0.048788 -0.808838
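The relationship between mask and where can be sketched on a small frame. The frame and its values below are hypothetical and chosen only for illustration; the point is that masking on a condition should match where on the inverted condition.

```python
import pandas as pd

# Small hypothetical frame; the values are chosen only for illustration.
frame = pd.DataFrame({"A": [-1.0, 2.0, -3.0], "B": [4.0, -5.0, 6.0]})

# mask() replaces values where the condition is True ...
masked = frame.mask(frame >= 0)

# ... while where() replaces values where the condition is False,
# so inverting the condition gives the same result.
inverse = frame.where(~(frame >= 0))

print(masked.equals(inverse))
```

DataFrame.equals treats NaN values in the same positions as equal, which is why it is used here instead of ==.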
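The query() Method
DataFrame objects have a query() method that allows selection using an expression.

You can get the rows of the frame where column b has values between the values of columns a and c. For example: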

In [198]: n = 10

In [199]: df = pd.DataFrame(np.random.rand(n, 3), columns=list('abc'))

In [200]: df
Out[200]: 
          a         b         c
0  0.438921  0.118680  0.863670
1  0.138138  0.577363  0.686602
2  0.595307  0.564592  0.520630
3  0.913052  0.926075  0.616184
4  0.078718  0.854477  0.898725
5  0.076404  0.523211  0.591538
6  0.792342  0.216974  0.564056
7  0.397890  0.454131  0.915716
8  0.074315  0.437913  0.019794
9  0.559209  0.502065  0.026437

# pure python
In [201]: df[(df.a < df.b) & (df.b < df.c)]
Out[201]: 
          a         b         c
1  0.138138  0.577363  0.686602
4  0.078718  0.854477  0.898725
5  0.076404  0.523211  0.591538
7  0.397890  0.454131  0.915716

# query
In [202]: df.query('(a < b) & (b < c)')
Out[202]: 
          a         b         c
1  0.138138  0.577363  0.686602
4  0.078718  0.854477  0.898725
5  0.076404  0.523211  0.591538
7  0.397890  0.454131  0.915716
Do the same thing, but fall back on a named index if there is no column with the name a.

In [203]: df = pd.DataFrame(np.random.randint(n / 2, size=(n, 2)), columns=list('bc'))

In [204]: df.index.name = 'a'

In [205]: df
Out[205]: 
   b  c
a      
0  0  4
1  0  1
2  3  4
3  4  3
4  1  4
5  0  3
6  0  1
7  3  4
8  2  3
9  1  1

In [206]: df.query('a < b and b < c')
Out[206]: 
   b  c
a      
2  3  4
If instead you don't want to or cannot name your index, you can use the name index in your query expression:

In [207]: df = pd.DataFrame(np.random.randint(n, size=(n, 2)), columns=list('bc'))

In [208]: df
Out[208]: 
   b  c
0  3  1
1  3  0
2  5  6
3  5  2
4  7  4
5  0  1
6  2  5
7  0  1
8  6  0
9  7  9

In [209]: df.query('index < b < c')
Out[209]: 
   b  c
2  5  6
Note: If the name of your index overlaps with a column name, the column name is given precedence. For example,
In [210]: df = pd.DataFrame({'a': np.random.randint(5, size=5)})

In [211]: df.index.name = 'a'

In [212]: df.query('a > 2')  # uses the column 'a', not the index
Out[212]: 
   a
a   
1  3
3  3
You can still use the index in a query expression by using the special identifier 'index':

In [213]: df.query('index > 2')
Out[213]: 
   a
a   
3  3
4  2
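If for some reason you have a column named index, then you can refer to the index as ilevel_0 as well, but at this point you should consider renaming your columns to something less ambiguous.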
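A minimal sketch of that situation, assuming a hypothetical frame with a column literally named "index" and a default, unnamed row index (ilevel_0 is only available when the index level has no name):

```python
import pandas as pd

# Hypothetical frame with a column literally named "index" and a default,
# unnamed RangeIndex on the rows.
frame = pd.DataFrame({"index": [10, 20, 30, 40], "value": [1, 2, 3, 4]})

# "index" resolves to the column, because column names take precedence.
by_column = frame.query("index > 10")

# "ilevel_0" still refers to level 0 of the row index.
by_row_label = frame.query("ilevel_0 >= 2")

print(by_column)
print(by_row_label)
```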

MultiIndex query() Syntax
You can also use the levels of a DataFrame with a MultiIndex as if they were columns in the frame:

In [214]: n = 10

In [215]: colors = np.random.choice(['red', 'green'], size=n)

In [216]: foods = np.random.choice(['eggs', 'ham'], size=n)

In [217]: colors
Out[217]: 
array(['red', 'red', 'red', 'green', 'green', 'green', 'green', 'green',
       'green', 'green'], dtype='<U5')

In [218]: foods
Out[218]: 
array(['ham', 'ham', 'eggs', 'eggs', 'eggs', 'ham', 'ham', 'eggs', 'eggs',
       'eggs'], dtype='<U4')

In [219]: index = pd.MultiIndex.from_arrays([colors, foods], names=['color', 'food'])

In [220]: df = pd.DataFrame(np.random.randn(n, 2), index=index)

In [221]: df
Out[221]: 
                   0         1
color food                    
red   ham   0.194889 -0.381994
      ham   0.318587  2.089075
      eggs -0.728293 -0.090255
green eggs -0.748199  1.318931
      eggs -2.029766  0.792652
      ham   0.461007 -0.542749
      ham  -0.305384 -0.479195
      eggs  0.095031 -0.270099
      eggs -0.707140 -0.773882
      eggs  0.229453  0.304418

In [222]: df.query('color == "red"')
Out[222]: 
                   0         1
color food                    
red   ham   0.194889 -0.381994
      ham   0.318587  2.089075
      eggs -0.728293 -0.090255
If the levels of the MultiIndex are unnamed, you can refer to them using special names:

In [223]: df.index.names = [None, None]

In [224]: df
Out[224]: 
                   0         1
red   ham   0.194889 -0.381994
      ham   0.318587  2.089075
      eggs -0.728293 -0.090255
green eggs -0.748199  1.318931
      eggs -2.029766  0.792652
      ham   0.461007 -0.542749
      ham  -0.305384 -0.479195
      eggs  0.095031 -0.270099
      eggs -0.707140 -0.773882
      eggs  0.229453  0.304418

In [225]: df.query('ilevel_0 == "red"')
Out[225]: 
                 0         1
red ham   0.194889 -0.381994
    ham   0.318587  2.089075
    eggs -0.728293 -0.090255
The convention is ilevel_0, which means "index level 0" for the 0th level of the index.
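Following the same convention, deeper unnamed levels should be reachable as ilevel_1, ilevel_2, and so on. The sketch below uses a small hypothetical frame (the column name x and the values are assumptions for illustration) to filter on the second level and on both levels at once.

```python
import pandas as pd

# Hypothetical two-level index without level names.
index = pd.MultiIndex.from_arrays(
    [["red", "red", "green", "green"], ["ham", "eggs", "ham", "eggs"]]
)
frame = pd.DataFrame({"x": [1, 2, 3, 4]}, index=index)

# Level 1 of the index is addressed as ilevel_1.
eggs = frame.query('ilevel_1 == "eggs"')

# Both levels can be combined in a single expression.
red_eggs = frame.query('ilevel_0 == "red" and ilevel_1 == "eggs"')

print(eggs)
print(red_eggs)
```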

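query() Use Cases
A use case for query() is when you have a collection of DataFrame objects that have a subset of column names (or index levels/names) in common. You can pass the same query to both frames without having to specify which frame you're interested in querying: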

In [226]: df = pd.DataFrame(np.random.rand(n, 3), columns=list('abc'))

In [227]: df
Out[227]: 
          a         b         c
0  0.224283  0.736107  0.139168
1  0.302827  0.657803  0.713897
2  0.611185  0.136624  0.984960
3  0.195246  0.123436  0.627712
4  0.618673  0.371660  0.047902
5  0.480088  0.062993  0.185760
6  0.568018  0.483467  0.445289
7  0.309040  0.274580  0.587101
8  0.258993  0.477769  0.370255
9  0.550459  0.840870  0.304611

In [228]: df2 = pd.DataFrame(np.random.rand(n + 2, 3), columns=df.columns)

In [229]: df2
Out[229]: 
           a         b         c
0   0.357579  0.229800  0.596001
1   0.309059  0.957923  0.965663
2   0.123102  0.336914  0.318616
3   0.526506  0.323321  0.860813
4   0.518736  0.486514  0.384724
5   0.190804  0.505723  0.614533
6   0.891939  0.623977  0.676639
7   0.480559  0.378528  0.460858
8   0.420223  0.136404  0.141295
9   0.732206  0.419540  0.604675
10  0.604466  0.848974  0.896165
11  0.589168  0.920046  0.732716

In [230]: expr = '0.0 <= a <= c <= 0.5'

In [231]: map(lambda frame: frame.query(expr), [df, df2])
Out[231]: <map at 0x7fb06bd71cf8>
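In Python 3, map returns a lazy iterator, which is why the output above is a map object rather than the filtered frames. A minimal sketch that materializes the results with a list comprehension, using two small hypothetical frames in place of the random ones:

```python
import pandas as pd

# Two hypothetical frames that share the column names used in the query.
first = pd.DataFrame({"a": [0.1, 0.6, 0.2], "c": [0.3, 0.7, 0.1]})
second = pd.DataFrame({"a": [0.4, 0.0, 0.25], "c": [0.45, 0.9, 0.5]})

expr = "0.0 <= a <= c <= 0.5"

# A list comprehension evaluates the query on each frame immediately.
results = [frame.query(expr) for frame in (first, second)]

for result in results:
    print(result)
```

Wrapping the original call as list(map(...)) would have the same effect; the comprehension is just the more common spelling.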
query() Python versus pandas Syntax Comparison
Full numpy-like syntax:

In [232]: df = pd.DataFrame(np.random.randint(n, size=(n, 3)), columns=list('abc'))

In [233]: df
Out[233]: 
   a  b  c
0  7  8  9
1  1  0  7
2  2  7  2
3  6  2  2
4  2  6  3
5  3  8  2
6  1  7  2
7  5  1  5
8  9  8  0
9  1  5  0

In [234]: df.query('(a < b) & (b < c)')
Out[234]: 
   a  b  c
0  7  8  9

In [235]: df[(df.a < df.b) & (df.b < df.c)]
Out[235]: 
   a  b  c
0  7  8  9
Slightly nicer by removing the parentheses (inside query(), comparison operators bind tighter than & and |).

In [236]: df.query('a < b & b < c')
Out[236]: 
   a  b  c
0  7  8  9
Use English instead of symbols:

In [237]: df.query('a < b and b < c')
Out[237]: 
   a  b  c
0  7  8  9
Pretty close to how you might write it on paper:

In [238]: df.query('a < b < c')
Out[238]: 
   a  b  c
0  7  8  9
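The in and not in operators
query() also supports special use of Python's in and not in comparison operators, providing a succinct syntax for calling the isin method of a Series or DataFrame.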

# get all rows where columns "a" and "b" have overlapping values
In [239]: df = pd.DataFrame({'a': list('aabbccddeeff'), 'b': list('aaaabbbbcccc'),
   .....:                    'c': np.random.randint(5, size=12),
   .....:                    'd': np.random.randint(9, size=12)})
   .....: 

In [240]: df
Out[240]: 
    a  b  c  d
0   a  a  2  6
1   a  a  4  7
2   b  a  1  6
3   b  a  2  1
4   c  b  3  6
5   c  b  0  2
6   d  b  3  3
7   d  b  2  1
8   e  c  4  3
9   e  c  2  0
10  f  c  0  6
11  f  c  1  2

In [241]: df.query('a in b')
Out[241]: 
   a  b  c  d
0  a  a  2  6
1  a  a  4  7
2  b  a  1  6
3  b  a  2  1
4  c  b  3  6
5  c  b  0  2

# How you'd do it in pure Python
In [242]: df[df.a.isin(df.b)]
Out[242]: 
   a  b  c  d
0  a  a  2  6
1  a  a  4  7
2  b  a  1  6
3  b  a  2  1
4  c  b  3  6
5  c  b  0  2

In [243]: df.query('a not in b')
Out[243]: 
    a  b  c  d
6   d  b  3  3
7   d  b  2  1
8   e  c  4  3
9   e  c  2  0
10  f  c  0  6
11  f  c  1  2

# pure Python
In [244]: df[~df.a.isin(df.b)]
Out[244]: 
    a  b  c  d
6   d  b  3  3
7   d  b  2  1
8   e  c  4  3
9   e  c  2  0
10  f  c  0  6
11  f  c  1  2
You can combine this with other expressions for very succinct queries:

# rows where cols a and b have overlapping values
# and col c's values are less than col d's
In [245]: df.query('a in b and c < d')
Out[245]: 
   a  b  c  d
0  a  a  2  6
1  a  a  4  7
2  b  a  1  6
4  c  b  3  6
5  c  b  0  2

# pure Python
In [246]: df[df.a.isin(df.b) & (df.c < df.d)]
Out[246]: 
   a  b  c  d
0  a  a  2  6
1  a  a  4  7
2  b  a  1  6
4  c  b  3  6
5  c  b  0  2
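Note: in and not in are evaluated in Python, since numexpr has no equivalent of this operation. However, only the in / not in expression itself is evaluated in vanilla Python. For example, in the expression
df.query('a in b + c + d')
(b + c + d) is evaluated by numexpr and then the in operation is evaluated in plain Python. In general, any operations that can be evaluated using numexpr will be.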

Special use of the == operator with list objects
Comparing a list of values to a column using == / != works similarly to in / not in.

In [247]: df.query('b == ["a", "b", "c"]')
Out[247]: 
    a  b  c  d
0   a  a  2  6
1   a  a  4  7
2   b  a  1  6
3   b  a  2  1
4   c  b  3  6
5   c  b  0  2
6   d  b  3  3
7   d  b  2  1
8   e  c  4  3
9   e  c  2  0
10  f  c  0  6
11  f  c  1  2

# pure Python
In [248]: df[df.b.isin(["a", "b", "c"])]
Out[248]: 
    a  b  c  d
0   a  a  2  6
1   a  a  4  7
2   b  a  1  6
3   b  a  2  1
4   c  b  3  6
5   c  b  0  2
6   d  b  3  3
7   d  b  2  1
8   e  c  4  3
9   e  c  2  0
10  f  c  0  6
11  f  c  1  2

In [249]: df.query('c == [1, 2]')
Out[249]: 
    a  b  c  d
0   a  a  2  6
2   b  a  1  6
3   b  a  2  1
7   d  b  2  1
9   e  c  2  0
11  f  c  1  2

In [250]: df.query('c != [1, 2]')
Out[250]: 
    a  b  c  d
1   a  a  4  7
4   c  b  3  6
5   c  b  0  2
6   d  b  3  3
8   e  c  4  3
10  f  c  0  6

# using in/not in
In [251]: df.query('[1, 2] in c')
Out[251]: 
    a  b  c  d
0   a  a  2  6
2   b  a  1  6
3   b  a  2  1
7   d  b  2  1
9   e  c  2  0
11  f  c  1  2

In [252]: df.query('[1, 2] not in c')
Out[252]: 
    a  b  c  d
1   a  a  4  7
4   c  b  3  6
5   c  b  0  2
6   d  b  3  3
8   e  c  4  3
10  f  c  0  6

# pure Python
In [253]: df[df.c.isin([1, 2])]
Out[253]: 
    a  b  c  d
0   a  a  2  6
2   b  a  1  6
3   b  a  2  1
7   d  b  2  1
9   e  c  2  0
11  f  c  1  2
Boolean operators
You can negate boolean expressions with the word not or the ~ operator.

In [254]: df = pd.DataFrame(np.random.rand(n, 3), columns=list('abc'))

In [255]: df['bools'] = np.random.rand(len(df)) > 0.5

In [256]: df.query('~bools')
Out[256]: 
          a         b         c  bools
2  0.697753  0.212799  0.329209  False
7  0.275396  0.691034  0.826619  False
8  0.190649  0.558748  0.262467  False

In [257]: df.query('not bools')
Out[257]: 
          a         b         c  bools
2  0.697753  0.212799  0.329209  False
7  0.275396  0.691034  0.826619  False
8  0.190649  0.558748  0.262467  False

In [258]: df.query('not bools') == df[~df.bools]
Out[258]: 
      a     b     c  bools
2  True  True  True   True
7  True  True  True   True
8  True  True  True   True
Of course, expressions can be arbitrarily complex too:

# short query syntax
In [259]: shorter = df.query('a < b < c and (not bools) or bools > 2')

# equivalent in pure Python
In [260]: longer = df[(df.a < df.b) & (df.b < df.c) & (~df.bools) | (df.bools > 2)]

In [261]: shorter
Out[261]: 
          a         b         c  bools
7  0.275396  0.691034  0.826619  False

In [262]: longer
Out[262]: 
          a         b         c  bools
7  0.275396  0.691034  0.826619  False

In [263]: shorter == longer
Out[263]: 
      a     b     c  bools
7  True  True  True   True
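Performance of query()
DataFrame.query() using numexpr is faster than Python for large frames.

../_images/query-perf.png
Note: You will only see the performance benefits of using the numexpr engine with DataFrame.query() if your frame has more than approximately 200,000 rows.
../_images/query-perf-small.png
This plot was created using a DataFrame with 3 columns, each containing floating point values generated using numpy.random.randn().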
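Duplicate data
If you want to identify and remove duplicate rows in a DataFrame, there are two methods that will help: duplicated and drop_duplicates. Each takes as an argument the columns to use to identify duplicated rows.

duplicated returns a boolean vector whose length is the number of rows, and which indicates whether a row is duplicated.
drop_duplicates removes duplicate rows.
By default, the first observed row of a duplicate set is considered unique, but each method has a keep parameter to specify which rows to keep.

keep='first' (default): mark / drop duplicates except for the first occurrence.
keep='last': mark / drop duplicates except for the last occurrence.
keep=False: mark / drop all duplicates.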
In [264]: df2 = pd.DataFrame({'a': ['one', 'one', 'two', 'two', 'two', 'three', 'four'],
   .....:                     'b': ['x', 'y', 'x', 'y', 'x', 'x', 'x'],
   .....:                     'c': np.random.randn(7)})
   .....: 

In [265]: df2
Out[265]: 
       a  b         c
0    one  x -1.067137
1    one  y  0.309500
2    two  x -0.211056
3    two  y -1.842023
4    two  x -0.390820
5  three  x -1.964475
6   four  x  1.298329

In [266]: df2.duplicated('a')
Out[266]: 
0    False
1     True
2    False
3     True
4     True
5    False
6    False
dtype: bool

In [267]: df2.duplicated('a', keep='last')
Out[267]: 
0     True
1    False
2     True
3     True
4    False
5    False
6    False
dtype: bool

In [268]: df2.duplicated('a', keep=False)
Out[268]: 
0     True
1     True
2     True
3     True
4     True
5    False
6    False
dtype: bool

In [269]: df2.drop_duplicates('a')
Out[269]: 
       a  b         c
0    one  x -1.067137
2    two  x -0.211056
5  three  x -1.964475
6   four  x  1.298329

In [270]: df2.drop_duplicates('a', keep='last')
Out[270]: 
       a  b         c
1    one  y  0.309500
4    two  x -0.390820
5  three  x -1.964475
6   four  x  1.298329

In [271]: df2.drop_duplicates('a', keep=False)
Out[271]: 
       a  b         c
5  three  x -1.964475
6   four  x  1.298329
Also, you can pass a list of columns to identify duplications.

In [272]: df2.duplicated(['a', 'b'])
Out[272]: 
0    False
1    False
2    False
3    False
4     True
5    False
6    False
dtype: bool

In [273]: df2.drop_duplicates(['a', 'b'])
Out[273]: 
       a  b         c
0    one  x -1.067137
1    one  y  0.309500
2    two  x -0.211056
3    two  y -1.842023
5  three  x -1.964475
6   four  x  1.298329
To drop duplicates by index value, use Index.duplicated and then perform slicing. The same set of options is available for the keep parameter.

In [274]: df3 = pd.DataFrame({'a': np.arange(6),
   .....:                     'b': np.random.randn(6)},
   .....:                    index=['a', 'a', 'b', 'c', 'b', 'a'])
   .....: 

In [275]: df3
Out[275]: 
   a         b
a  0  1.440455
a  1  2.456086
b  2  1.038402
c  3 -0.894409
b  4  0.683536
a  5  3.082764

In [276]: df3.index.duplicated()
Out[276]: array([False,  True, False, False,  True,  True])

In [277]: df3[~df3.index.duplicated()]
Out[277]: 
   a         b
a  0  1.440455
b  2  1.038402
c  3 -0.894409

In [278]: df3[~df3.index.duplicated(keep='last')]
Out[278]: 
   a         b
c  3 -0.894409
b  4  0.683536
a  5  3.082764

In [279]: df3[~df3.index.duplicated(keep=False)]
Out[279]: 
   a         b
c  3 -0.894409
Dictionary-like get() method
Each Series or DataFrame has a get method which can return a default value.

In [280]: s = pd.Series([1, 2, 3], index=['a', 'b', 'c'])

In [281]: s.get('a')  # equivalent to s['a']
Out[281]: 1

In [282]: s.get('x', default=-1)
Out[282]: -1
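The Series example above carries over to a DataFrame, where get looks up a column. A short sketch with a hypothetical two-column frame:

```python
import pandas as pd

# Hypothetical frame used only to illustrate DataFrame.get.
frame = pd.DataFrame({"a": [1, 2], "b": [3, 4]})

# An existing column is returned as a Series, like frame["b"].
col = frame.get("b")

# A missing column returns None by default instead of raising KeyError ...
missing = frame.get("z")

# ... or whatever is passed as default.
fallback = frame.get("z", default="no such column")

print(col.tolist(), missing, fallback)
```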
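The lookup() method
Sometimes you want to extract a set of values given a sequence of row labels and column labels. The lookup method allows for this and returns a NumPy array. For instance: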

In [283]: dflookup = pd.DataFrame(np.random.rand(20, 4), columns = ['A', 'B', 'C', 'D'])

In [284]: dflookup.lookup(list(range(0, 10, 2)), ['B', 'C', 'A', 'B', 'D'])
Out[284]: array([0.3506, 0.4779, 0.4825, 0.9197, 0.5019])
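DataFrame.lookup was deprecated in later pandas releases and removed in pandas 2.0, so the call above may not be available in a current installation. One possible replacement, sketched here on a hypothetical integer frame, converts the labels to positions with get_indexer and then uses NumPy fancy indexing; this is an alternative technique, not the lookup implementation itself.

```python
import numpy as np
import pandas as pd

# Hypothetical frame whose value at (row r, column k) is 4 * r + k.
frame = pd.DataFrame(np.arange(20).reshape(5, 4), columns=list("ABCD"))

row_labels = [0, 2, 4]
col_labels = ["B", "C", "A"]

# Translate labels into integer positions on each axis.
row_pos = frame.index.get_indexer(row_labels)
col_pos = frame.columns.get_indexer(col_labels)

# Pairwise selection: (0, "B"), (2, "C"), (4, "A").
values = frame.to_numpy()[row_pos, col_pos]

print(values)
```

get_indexer returns -1 for labels that are not found, so production code would likely want to check for that before indexing.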
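Index objects
The pandas Index class and its subclasses can be viewed as implementing an ordered multiset. Duplicates are allowed. However, if you try to convert an Index object with duplicate entries into a set, an exception will be raised.

Index also provides the infrastructure necessary for lookups, data alignment, and reindexing. The easiest way to create an Index directly is to pass a list or other sequence to Index: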

In [285]: index = pd.Index(['e', 'd', 'a', 'b'])

In [286]: index
Out[286]: Index(['e', 'd', 'a', 'b'], dtype='object')

In [287]: 'd' in index
Out[287]: True
You can also pass a name to be stored in the index:

In [288]: index = pd.Index(['e', 'd', 'a', 'b'], name='something')

In [289]: index.name
Out[289]: 'something'
The name, if set, will be shown in the console display:

In [290]: index = pd.Index(list(range(5)), name='rows')

In [291]: columns = pd.Index(['A', 'B', 'C'], name='cols')

In [292]: df = pd.DataFrame(np.random.randn(5, 3), index=index, columns=columns)

In [293]: df
Out[293]: 
cols         A         B         C
rows                              
0     1.295989  0.185778  0.436259
1     0.678101  0.311369 -0.528378
2    -0.674808 -1.103529 -0.656157
3     1.889957  2.076651 -1.102192
4    -1.211795 -0.791746  0.634724

In [294]: df['A']
Out[294]: 
rows
0    1.295989
1    0.678101
2   -0.674808
3    1.889957
4   -1.211795
Name: A, dtype: float64
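Setting metadata
Indexes are "mostly immutable", but it is possible to set and change their metadata, like the index name (or, for MultiIndex, levels and codes).

You can use rename, set_names, set_levels, and set_codes to set these attributes directly. They default to returning a copy; however, you can specify inplace=True to have the data change in place.

See Advanced Indexing for usage of MultiIndexes.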

In [295]: ind = pd.Index([1, 2, 3])

In [296]: ind.rename("apple")
Out[296]: Int64Index([1, 2, 3], dtype='int64', name='apple')

In [297]: ind
Out[297]: Int64Index([1, 2, 3], dtype='int64')

In [298]: ind.set_names(["apple"], inplace=True)

In [299]: ind.name = "bob"

In [300]: ind
Out[300]: Int64Index([1, 2, 3], dtype='int64', name='bob')
set_names, set_levels, and set_codes also take an optional level argument:

In [301]: index = pd.MultiIndex.from_product([range(3), ['one', 'two']], names=['first', 'second'])

In [302]: index
Out[302]: 
MultiIndex(levels=[[0, 1, 2], ['one', 'two']],
           codes=[[0, 0, 1, 1, 2, 2], [0, 1, 0, 1, 0, 1]],
           names=['first', 'second'])

In [303]: index.levels[1]
Out[303]: Index(['one', 'two'], dtype='object', name='second')

In [304]: index.set_levels(["a", "b"], level=1)
Out[304]: 
MultiIndex(levels=[[0, 1, 2], ['a', 'b']],
           codes=[[0, 0, 1, 1, 2, 2], [0, 1, 0, 1, 0, 1]],
           names=['first', 'second'])
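Set operations on Index objects
The two main operations are union (|) and intersection (&). These can be directly called as instance methods or used via overloaded operators. Difference is provided via the .difference() method.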

In [305]: a = pd.Index(['c', 'b', 'a'])

In [306]: b = pd.Index(['c', 'e', 'd'])

In [307]: a | b
Out[307]: Index(['a', 'b', 'c', 'd', 'e'], dtype='object')

In [308]: a & b
Out[308]: Index(['c'], dtype='object')

In [309]: a.difference(b)
Out[309]: Index(['a', 'b'], dtype='object')
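Also available is the symmetric_difference (^) operation, which returns elements that appear in either idx1 or idx2, but not in both. This is equivalent to the Index created by idx1.difference(idx2).union(idx2.difference(idx1)), with duplicates dropped.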

In [310]: idx1 = pd.Index([1, 2, 3, 4])

In [311]: idx2 = pd.Index([2, 3, 4, 5])

In [312]: idx1.symmetric_difference(idx2)
Out[312]: Int64Index([1, 5], dtype='int64')

In [313]: idx1 ^ idx2
Out[313]: Int64Index([1, 5], dtype='int64')
Note: The resulting index from a set operation will be sorted in ascending order.
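In newer pandas releases the |, & and ^ operators on Index objects no longer perform set operations, so the method forms are the more portable spelling. A minimal sketch using the same values as the examples above:

```python
import pandas as pd

a = pd.Index(["c", "b", "a"])
b = pd.Index(["c", "e", "d"])

# Method equivalents of a | b, a & b and the difference shown above.
union = a.union(b)
common = a.intersection(b)
only_in_a = a.difference(b)

idx1 = pd.Index([1, 2, 3, 4])
idx2 = pd.Index([2, 3, 4, 5])

# Method equivalent of idx1 ^ idx2.
sym = idx1.symmetric_difference(idx2)

print(list(union), list(common), list(only_in_a), list(sym))
```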
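Missing values
Important: Even though Index can hold missing values (NaN), this should be avoided if you do not want any unexpected results. For example, some operations exclude missing values implicitly.
Index.fillna fills missing values with a specified scalar value.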

In [314]: idx1 = pd.Index([1, np.nan, 3, 4])

In [315]: idx1
Out[315]: Float64Index([1.0, nan, 3.0, 4.0], dtype='float64')

In [316]: idx1.fillna(2)
Out[316]: Float64Index([1.0, 2.0, 3.0, 4.0], dtype='float64')

In [317]: idx2 = pd.DatetimeIndex([pd.Timestamp('2011-01-01'),
   .....:                          pd.NaT,
   .....:                          pd.Timestamp('2011-01-03')])
   .....: 

In [318]: idx2
Out[318]: DatetimeIndex(['2011-01-01', 'NaT', '2011-01-03'], dtype='datetime64[ns]', freq=None)

In [319]: idx2.fillna(pd.Timestamp('2011-01-02'))
Out[319]: DatetimeIndex(['2011-01-01', '2011-01-02', '2011-01-03'], dtype='datetime64[ns]', freq=None)
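Set / reset index
Occasionally you will load or create a data set into a DataFrame and want to add an index after you've already done so. There are a couple of different ways.

Set an index
DataFrame has a set_index() method which takes a column name (for a regular Index) or a list of column names (for a MultiIndex). To create a new, re-indexed DataFrame: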

In [320]: data
Out[320]: 
     a    b  c    d
0  bar  one  z  1.0
1  bar  two  y  2.0
2  foo  one  x  3.0
3  foo  two  w  4.0

In [321]: indexed1 = data.set_index('c')

In [322]: indexed1
Out[322]: 
     a    b    d
c               
z  bar  one  1.0
y  bar  two  2.0
x  foo  one  3.0
w  foo  two  4.0

In [323]: indexed2 = data.set_index(['a', 'b'])

In [324]: indexed2
Out[324]: 
         c    d
a   b          
bar one  z  1.0
    two  y  2.0
foo one  x  3.0
    two  w  4.0
The append keyword option allows you to keep the existing index and append the given columns to a MultiIndex:

In [325]: frame = data.set_index('c', drop=False)

In [326]: frame = frame.set_index(['a', 'b'], append=True)

In [327]: frame
Out[327]: 
           c    d
c a   b          
z bar one  z  1.0
y bar two  y  2.0
x foo one  x  3.0
w foo two  w  4.0
Other options in set_index allow you to not drop the index columns or to add the index in place (without creating a new object):

In [328]: data.set_index('c', drop=False)
Out[328]: 
     a    b  c    d
c                  
z  bar  one  z  1.0
y  bar  two  y  2.0
x  foo  one  x  3.0
w  foo  two  w  4.0

In [329]: data.set_index(['a', 'b'], inplace=True)

In [330]: data
Out[330]: 
         c    d
a   b          
bar one  z  1.0
    two  y  2.0
foo one  x  3.0
    two  w  4.0
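Reset the index
As a convenience, there is a function on DataFrame called reset_index() which transfers the index values into the DataFrame's columns and sets a simple integer index. This is the inverse operation of set_index().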

In [331]: data
Out[331]: 
         c    d
a   b          
bar one  z  1.0
    two  y  2.0
foo one  x  3.0
    two  w  4.0

In [332]: data.reset_index()
Out[332]: 
     a    b  c    d
0  bar  one  z  1.0
1  bar  two  y  2.0
2  foo  one  x  3.0
3  foo  two  w  4.0
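The output is more similar to a SQL table or a record array. The names for the columns derived from the index are the ones stored in the names attribute.

You can use the level keyword to remove only a portion of the index: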

In [333]: frame
Out[333]: 
           c    d
c a   b          
z bar one  z  1.0
y bar two  y  2.0
x foo one  x  3.0
w foo two  w  4.0

In [334]: frame.reset_index(level=1)
Out[334]: 
         a  c    d
c b               
z one  bar  z  1.0
y two  bar  y  2.0
x one  foo  x  3.0
w two  foo  w  4.0
reset_index takes an optional parameter drop which, if true, simply discards the index instead of putting the index values in the DataFrame's columns.
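The effect of drop can be sketched by comparing the two calls side by side. The frame below is a hypothetical reconstruction of the data frame printed earlier in this section.

```python
import pandas as pd

# Hypothetical reconstruction of the small frame used in the examples above.
data = pd.DataFrame({"a": ["bar", "bar", "foo", "foo"],
                     "b": ["one", "two", "one", "two"],
                     "c": ["z", "y", "x", "w"],
                     "d": [1.0, 2.0, 3.0, 4.0]})
indexed = data.set_index("c")

# Default: the index values come back as a regular column named "c".
kept = indexed.reset_index()

# drop=True: the index values are discarded entirely.
dropped = indexed.reset_index(drop=True)

print(list(kept.columns))
print(list(dropped.columns))
```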

Adding an ad hoc index
If you create an index yourself, you can just assign it to the index field:

data.index = index
Returning a view versus a copy
When setting values in a pandas object, care must be taken to avoid what is called chained indexing. Here is an example.

In [335]: dfmi = pd.DataFrame([list('abcd'),
   .....:                      list('efgh'),
   .....:                      list('ijkl'),
   .....:                      list('mnop')],
   .....:                     columns=pd.MultiIndex.from_product([['one', 'two'],
   .....:                                                         ['first', 'second']]))
   .....: 

In [336]: dfmi
Out[336]: 
    one          two       
  first second first second
0     a      b     c      d
1     e      f     g      h
2     i      j     k      l
3     m      n     o      p
Compare these two access methods:

In [337]: dfmi['one']['second']
Out[337]: 
0    b
1    f
2    j
3    n
Name: second, dtype: object
In [338]: dfmi.loc[:, ('one', 'second')]
Out[338]: 
0    b
1    f
2    j
3    n
Name: (one, second), dtype: object
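These both yield the same results, so which should you use? It is instructive to understand the order of operations on these and why method 2 (.loc) is much preferred over method 1 (chained []).

dfmi['one'] selects the first level of the columns and returns a DataFrame that is singly-indexed. Then another Python operation dfmi_with_one['second'] selects the series indexed by 'second'. This is indicated by the variable dfmi_with_one because pandas sees these operations as separate events, e.g. separate calls to __getitem__, so it has to treat them as linear operations that happen one after another.

Contrast this with df.loc[:,('one','second')], which passes a nested tuple of (slice(None),('one','second')) to a single call to __getitem__. This allows pandas to deal with this as a single entity. Furthermore, this order of operations can be significantly faster, and allows one to index both axes if so desired.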

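Why does assignment fail when using chained indexing?
The problem in the previous section is just a performance issue. What's up with the SettingWithCopy warning? We don't usually throw warnings around when you do something that might cost a few extra milliseconds!

But it turns out that assigning to the product of chained indexing has inherently unpredictable results. To see this, think about how the Python interpreter executes this code: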

dfmi.loc[:, ('one', 'second')] = value
# becomes
dfmi.loc.__setitem__((slice(None), ('one', 'second')), value)
But this code is handled differently:

dfmi['one']['second'] = value
# becomes
dfmi.__getitem__('one').__setitem__('second', value)
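See that __getitem__ in there? Outside of simple cases, it's very hard to predict whether it will return a view or a copy (it depends on the memory layout of the array, about which pandas makes no guarantees), and therefore whether the __setitem__ will modify dfmi or a temporary object that gets thrown out immediately afterward. That's what SettingWithCopy is warning you about!

Note: You may be wondering whether we should be concerned about the loc property in the first example. But dfmi.loc is guaranteed to be dfmi itself with modified indexing behavior, so dfmi.loc.__getitem__ / dfmi.loc.__setitem__ operate on dfmi directly. Of course, dfmi.loc.__getitem__(idx) may be a view or a copy of dfmi.
Sometimes a SettingWithCopy warning will arise when there's no obvious chained indexing going on. These are the bugs that SettingWithCopy is designed to catch! pandas is probably trying to warn you that you've done this: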

def do_something(df):
    foo = df[['bar', 'baz']]  # Is foo a view? A copy? Nobody knows!
    # ... many lines here ...
    # We don't know whether this will modify df or not!
    foo['quux'] = value
    return foo
Yikes!
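One common way to make the intent of such a function explicit is to take a copy of the selection before assigning to it. The sketch below assumes that a separate frame is what the function should return; the sample frame, the value argument and the column names are hypothetical.

```python
import pandas as pd

def do_something(df, value):
    # .copy() makes foo an independent frame, so the assignment below
    # can never write back into df.
    foo = df[["bar", "baz"]].copy()
    foo["quux"] = value
    return foo

# Hypothetical input used only to exercise the function.
source = pd.DataFrame({"bar": [1, 2], "baz": [3, 4], "other": [5, 6]})
result = do_something(source, 0)

print(result)
print(list(source.columns))
```

If the goal were instead to modify df itself, assigning through df.loc[:, "quux"] = value on the original frame would state that intent directly.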

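Evaluation order matters
When you use chained indexing, the order and type of the indexing operations partially determine whether the result is a slice into the original object or a copy of the slice.

pandas has the SettingWithCopyWarning because assigning to a copy of a slice is frequently not intentional, but a mistake caused by chained indexing returning a copy where a slice was expected.

If you would like pandas to be more or less trusting about assignment to a chained indexing expression, you can set the option mode.chained_assignment to one of these values:

'warn', the default, means a SettingWithCopyWarning is printed.
'raise' means pandas will raise a SettingWithCopyException that you have to deal with.
None will suppress the warnings entirely.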
In [339]: dfb = pd.DataFrame({'a': ['one', 'one', 'two',
   .....:                           'three', 'two', 'one', 'six'],
   .....:                     'c': np.arange(7)})
   .....: 

# This will show the SettingWithCopyWarning
# but the frame values will be set
In [340]: dfb['c'][dfb.a.str.startswith('o')] = 42
This, however, is operating on a copy and will not work.

>>> pd.set_option('mode.chained_assignment','warn')
>>> dfb[dfb.a.str.startswith('o')]['c'] = 42
Traceback (most recent call last)
     ...
SettingWithCopyWarning:
     A value is trying to be set on a copy of a slice from a DataFrame.
     Try using .loc[row_index,col_indexer] = value instead
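As the warning text suggests, the same update can be written as a single .loc call that passes the row mask and the column label together. A minimal sketch using the dfb frame defined above:

```python
import numpy as np
import pandas as pd

dfb = pd.DataFrame({"a": ["one", "one", "two", "three", "two", "one", "six"],
                    "c": np.arange(7)})

# One indexing operation: the boolean row mask and the column label go to .loc
# together, so the assignment targets dfb itself rather than a temporary.
dfb.loc[dfb.a.str.startswith("o"), "c"] = 42

print(dfb)
```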
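A chained assignment can also crop up when setting in a mixed dtype frame.

Note: These setting rules apply to all of .loc/.iloc.
This is the correct access method: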

In [341]: dfc = pd.DataFrame({'A': ['aaa', 'bbb', 'ccc'], 'B': [1, 2, 3]})

In [342]: dfc.loc[0, 'A'] = 11

In [343]: dfc
Out[343]: 
     A  B
0   11  1
1  bbb  2
2  ccc  3
This can work at times, but it is not guaranteed to, and therefore should be avoided:

In [344]: dfc = dfc.copy()

In [345]: dfc['A'][0] = 111

In [346]: dfc
Out[346]: 
     A  B
0  111  1
1  bbb  2
2  ccc  3
This will not work at all, and so should be avoided:

>>> pd.set_option('mode.chained_assignment','raise')
>>> dfc.loc[0]['A'] = 1111
Traceback (most recent call last)
     ...
SettingWithCopyException:
     A value is trying to be set on a copy of a slice from a DataFrame.
     Try using .loc[row_index,col_indexer] = value instead
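Warning: The chained assignment warnings / exceptions are intended to inform the user of a possibly invalid assignment. There may be false positives: situations where a chained assignment is inadvertently reported.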