
Pandas 100-Exercise Notes

1. Import Pandas:

import pandas as pd

2. Check the Pandas version:

print(pd.__version__)  ==> 1.0.1

Pandas data structures: Pandas is built around Series (one-dimensional) and DataFrame (two-dimensional), which are by far the most widely used. Older releases also offered Panel (three-dimensional), Panel4D (four-dimensional) and PanelND (higher-dimensional) containers, but these have since been deprecated and removed; as of pandas 1.0 only Series and DataFrame remain.

# A Series is a one-dimensional labelled array that can hold any data type: integers, strings, floats, Python objects, and so on. Series elements can be located by label.

# A DataFrame is a two-dimensional labelled data structure. Data can be located by label, something a plain NumPy array does not offer.
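
As a quick illustration of label-based access (a minimal sketch; the small Series and DataFrame below are made up just for this example):

s = pd.Series([10, 20, 30], index=['x', 'y', 'z'])
s['y']  # 20, looked up by label rather than by position
df = pd.DataFrame({'col1': [1, 2], 'col2': [3, 4]}, index=['r1', 'r2'])
df.loc['r1', 'col2']  # 3, row and column both addressed by label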

Creating a Series

The basic syntax is s = pd.Series(data, index=index). A Series can be created in several ways; three common methods are shown below.

3. Create a Series from a list:

arr = [0, 1, 2, 3, 4]
s1 = pd.Series(arr)  # with no index given, a default integer index starting at 0 is used
s1  
==> 
0    0
1    1
2    2
3    3
4    4
dtype: int64

4. Create a Series from an ndarray:

import numpy as np

n = np.random.randn(5)  # create a random ndarray

index = ['a', 'b', 'c', 'd', 'e']
s2 = pd.Series(n, index=index)
s2
==>
a   -0.766282
b    0.134975
c    0.175090
d    0.298047
e    0.171916
dtype: float64

5. Create a Series from a dictionary:

d = {'a': 1, 'b': 2, 'c': 3, 'd': 4, 'e': 5}  # define a sample dictionary
s3 = pd.Series(d)
s3
==>
a    1
b    2
c    3
d    4
e    5
dtype: int64

Basic Series operations

6. Modify the index of a Series:

print(s1)  # take s1 as an example

s1.index = ['A', 'B', 'C', 'D', 'E']  # assign the new index
s1
==>
0    0
1    1
2    2
3    3
4    4
dtype: int64
A    0
B    1
C    2
D    3
E    4
dtype: int64

7. Concatenate Series vertically:

s4 = s3.append(s1)  # append s1 to s3
s4
==>
a    1
b    2
c    3
d    4
e    5
A    0
B    1
C    2
D    3
E    4
dtype: int64
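
Series.append still works in the pandas 1.0.x used here, but it was deprecated in pandas 1.4 and removed in 2.0; pd.concat is the forward-compatible equivalent (a minimal sketch):

pd.concat([s3, s1])  # same result as s3.append(s1)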

8. Delete a Series element by index label:

print(s4)
s4 = s4.drop('e')  # drop the value whose index label is 'e'
s4
==>
a    1
b    2
c    3
d    4
e    5
A    0
B    1
C    2
D    3
E    4
dtype: int64
a    1
b    2
c    3
d    4
A    0
B    1
C    2
D    3
E    4
dtype: int64

9. Modify a Series element by index label:

s4['A'] = 6  # set the value at label 'A' to 6
s4
==>
a    1
b    2
c    3
d    4
A    6
B    1
C    2
D    3
E    4
dtype: int64

10. Look up a Series element by index label:

s4['B']
==>  1

11. Series slicing:

For example, access the first 3 values of s4:
s4[:3]
==>
a    1
b    2
c    3
dtype: int64

Series arithmetic

12. Series addition:

Series addition is aligned on the index; labels that do not appear in both operands are filled with NaN (missing values).

s4.add(s3)
==>
A    NaN
B    NaN
C    NaN
D    NaN
E    NaN
a    2.0
b    4.0
c    6.0
d    8.0
e    NaN
dtype: float64
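
If NaN for unmatched labels is not what you want, the arithmetic methods accept a fill_value argument that substitutes that value for the missing side before computing (a minimal sketch):

s4.add(s3, fill_value=0)  # labels present in only one Series are treated as 0 instead of producing NaN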

13. Series subtraction:

Subtraction is likewise aligned on the index; non-matching labels are filled with NaN.

s4.sub(s3)
==>
A    NaN
B    NaN
C    NaN
D    NaN
E    NaN
a    0.0
b    0.0
c    0.0
d    0.0
e    NaN
dtype: float64

14. Series multiplication:

Multiplication is aligned on the index; non-matching labels are filled with NaN.

s4.mul(s3)
==>
A     NaN
B     NaN
C     NaN
D     NaN
E     NaN
a     1.0
b     4.0
c     9.0
d    16.0
e     NaN
dtype: float64

15. Series division:

Division is aligned on the index; non-matching labels are filled with NaN.

s4.div(s3)
==>
A    NaN
B    NaN
C    NaN
D    NaN
E    NaN
a    1.0
b    1.0
c    1.0
d    1.0
e    NaN
dtype: float64

16. Series median:

s4.median() ==> 3.0

17. Series sum:

s4.sum()  ==> 26

18. Series maximum:

s4.max()  ==>  6

19. Series minimum:

s4.min()  ==> 1

Creating a DataFrame

Unlike a Series, a DataFrame can hold multiple columns of data, and in practice it is the more commonly used structure.

20. Create a DataFrame from a NumPy array:

dates = pd.date_range('today', periods=6)  # a date range used as the index
num_arr = np.random.randn(6, 4)  # random NumPy values
columns = ['A', 'B', 'C', 'D']  # column names
df1 = pd.DataFrame(num_arr, index=dates, columns=columns)
df1
==>
                                   A         B         C         D
2020-07-05 13:58:34.723797 -0.820141  0.205872 -0.928024 -1.828410
2020-07-06 13:58:34.723797  0.750014 -0.340494  1.190786 -0.204266
2020-07-07 13:58:34.723797 -2.062106 -1.520711  1.414341  1.057326
2020-07-08 13:58:34.723797 -0.821653  0.564271 -1.274913  2.340385
2020-07-09 13:58:34.723797 -1.936687  0.447897 -0.108420  0.133166
2020-07-10 13:58:34.723797  0.707222 -1.251812 -0.235982  0.340147

21. Create a DataFrame from a dictionary of lists:

data = {'animal': ['cat', 'cat', 'snake', 'dog', 'dog', 'cat', 'snake', 'cat', 'dog', 'dog'],
        'age': [2.5, 3, 0.5, np.nan, 5, 2, 4.5, np.nan, 7, 3],
        'visits': [1, 3, 2, 3, 2, 3, 1, 1, 2, 1],
        'priority': ['yes', 'yes', 'no', 'yes', 'no', 'no', 'no', 'yes', 'no', 'no']}

labels = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j']
df2 = pd.DataFrame(data, index=labels)
df2
==>
  animal  age  visits priority
a    cat  2.5       1      yes
b    cat  3.0       3      yes
c  snake  0.5       2       no
d    dog  NaN       3      yes
e    dog  5.0       2       no
f    cat  2.0       3       no
g  snake  4.5       1       no
h    cat  NaN       1      yes
i    dog  7.0       2       no
j    dog  3.0       1       no
# the dictionary keys become the column names

22. View the data types of a DataFrame:

df2.dtypes
==>
animal       object
age         float64
visits        int64
priority     object
dtype: object

Basic DataFrame operations

23. Preview the first 5 rows of a DataFrame:

This is very useful for quickly getting a feel for an unfamiliar dataset.

df2.head()  # shows 5 rows by default; pass the number of rows you want to preview
==>
    animal  age    visits priority
a    cat    2.5       1      yes
b    cat    3.0       3      yes
c  snake    0.5       2       no
d    dog    NaN       3      yes
e    dog    5.0       2       no

24. View the last 3 rows of a DataFrame:

df2.tail(3)
==>
    animal  age  visits priority
h    cat  NaN       1      yes
i    dog  7.0       2       no
j    dog  3.0       1       no

25. View the index of a DataFrame:

df2.index
==>  Index(['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j'], dtype='object')

26. View the column names of a DataFrame:

df2.columns  
==>Index(['animal', 'age', 'visits', 'priority'], dtype='object')

27. View the values of a DataFrame:

df2.values
==>
array([['cat', 2.5, 1, 'yes'],
       ['cat', 3.0, 3, 'yes'],
       ['snake', 0.5, 2, 'no'],
       ['dog', nan, 3, 'yes'],
       ['dog', 5.0, 2, 'no'],
       ['cat', 2.0, 3, 'no'],
       ['snake', 4.5, 1, 'no'],
       ['cat', nan, 1, 'yes'],
       ['dog', 7.0, 2, 'no'],
       ['dog', 3.0, 1, 'no']], dtype=object)

28. View the summary statistics of a DataFrame:

df2.describe()
==>
            age     visits
count  8.000000  10.000000
mean   3.437500   1.900000
std    2.007797   0.875595
min    0.500000   1.000000
25%    2.375000   1.000000
50%    3.000000   2.000000
75%    4.625000   2.750000
max    7.000000   3.000000
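
describe() summarizes only the numeric columns by default; passing include='all' also reports count, unique, top and freq for the object columns (a minimal sketch):

df2.describe(include='all')  # adds summary rows for 'animal' and 'priority' as well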

29. Transpose a DataFrame:

df2.T
==>
            a    b      c    d    e    f      g    h    i    j
animal    cat  cat  snake  dog  dog  cat  snake  cat  dog  dog
age       2.5    3    0.5  NaN    5    2    4.5  NaN    7    3
visits      1    3      2    3    2    3      1    1    2    1
priority  yes  yes     no  yes   no   no     no  yes   no   no

30. Sort a DataFrame by a column:

df2.sort_values(by='age')  # sort by age in ascending order
==>
  animal  age  visits priority
c  snake  0.5       2       no
f    cat  2.0       3       no
a    cat  2.5       1      yes
b    cat  3.0       3      yes
j    dog  3.0       1       no
g  snake  4.5       1       no
e    dog  5.0       2       no
i    dog  7.0       2       no
d    dog  NaN       3      yes
h    cat  NaN       1      yes

31. Slice DataFrame rows:

df2[1:3]
==>
    animal  age  visits priority
b    cat  3.0       3      yes
c  snake  0.5       2       no

32. Select a single column by label:

df2['age']
==>
a    2.5
b    3.0
c    0.5
d    NaN
e    5.0
f    2.0
g    4.5
h    NaN
i    7.0
j    3.0
Name: age, dtype: float64

df2.age  # equivalent to df2['age']

33. Select multiple columns by label:

df2[['age', 'animal']]  # pass a list of column names
==>
   age   animal
a  2.5    cat
b  3.0    cat
c  0.5  snake
d  NaN    dog
e  5.0    dog
f  2.0    cat
g  4.5  snake
h  NaN    cat
i  7.0    dog
j  3.0    dog

34. Select rows by position:

df2.iloc[1:3]  # rows 2 and 3, i.e. positions 1 and 2
==>
    animal  age  visits priority
b    cat  3.0       3      yes
c  snake  0.5       2       no

35. Copy a DataFrame:

Making a copy lets several different workflows use the dataset without interfering with one another.

df3 = df2.copy()
df3
==>
  animal  age  visits priority
a    cat  2.5       1      yes
b    cat  3.0       3      yes
c  snake  0.5       2       no
d    dog  NaN       3      yes
e    dog  5.0       2       no
f    cat  2.0       3       no
g  snake  4.5       1       no
h    cat  NaN       1      yes
i    dog  7.0       2       no
j    dog  3.0       1       no
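
copy() makes a deep copy by default, so later edits to df3 leave df2 untouched; copy(deep=False) would instead share the underlying data. A minimal sketch with throwaway names (df_view and df_deep are hypothetical, used only here):

df_view = df2.copy(deep=False)  # shares the underlying data with df2
df_deep = df2.copy()            # independent deep copy (the default)
df_deep.loc['a', 'age'] = 99    # df2 is not affected by this change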

36. Check DataFrame elements for missing values:

df3.isnull()  # returns True where a value is missing
==>
      animal    age  visits  priority
a   False  False   False     False
b   False  False   False     False
c   False  False   False     False
d   False   True   False     False
e   False  False   False     False
f   False  False   False     False
g   False  False   False     False
h   False   True   False     False
i   False  False   False     False
j   False  False   False     False

37. Add a new column:

num = pd.Series([0, 1, 2, 3, 4, 5, 6, 7, 8, 9], index=df3.index)

df3['No.'] = num  # add a new column named 'No.'
df3
==>
  animal  age  visits priority  No.
a    cat  2.5       1      yes    0
b    cat  3.0       3      yes    1
c  snake  0.5       2       no    2
d    dog  NaN       3      yes    3
e    dog  5.0       2       no    4
f    cat  2.0       3       no    5
g  snake  4.5       1       no    6
h    cat  NaN       1      yes    7
i    dog  7.0       2       no    8
j    dog  3.0       1       no    9

38. Modify a value by integer position:

Change the value in the 2nd row and 2nd column from 3.0 to 2.0.

df3.iat[1, 1] = 2  # positions are 0-based, so this is row 1, column 1
df3
==>
  animal  age  visits priority  No.
a    cat  2.5       1      yes    0
b    cat  2.0       3      yes    1
c  snake  0.5       2       no    2
d    dog  NaN       3      yes    3
e    dog  5.0       2       no    4
f    cat  2.0       3       no    5
g  snake  4.5       1       no    6
h    cat  NaN       1      yes    7
i    dog  7.0       2       no    8
j    dog  3.0       1       no    9

39. Modify a value by label:

df3.loc['f', 'age'] = 1.5
df3
==>
    animal  age  visits priority  No.
a    cat  2.5       1      yes    0
b    cat  2.0       3      yes    1
c  snake  0.5       2       no    2
d    dog  NaN       3      yes    3
e    dog  5.0       2       no    4
f    cat  1.5       3       no    5
g  snake  4.5       1       no    6
h    cat  NaN       1      yes    7
i    dog  7.0       2       no    8
j    dog  3.0       1       no    9

40. Compute the column means of a DataFrame:

df3.mean()
==>
age       3.25
visits    1.90
No.       4.50
dtype: float64

41. Sum a single column of a DataFrame:

df3['visits'].sum()  
==> 19

String operations

42. Convert strings to lowercase:

string = pd.Series(['A', 'B', 'C', 'Aaba', 'Baca',
                    np.nan, 'CABA', 'dog', 'cat'])
print(string)
string.str.lower()
==>
0       A
1       B
2       C
3    Aaba
4    Baca
5     NaN
6    CABA
7     dog
8     cat
dtype: object
0       a
1       b
2       c
3    aaba
4    baca
5     NaN
6    caba
7     dog
8     cat
dtype: object

43. Convert strings to uppercase:

string.str.upper()
==>
0       A
1       B
2       C
3    AABA
4    BACA
5     NaN
6    CABA
7     DOG
8     CAT
dtype: object

Handling missing values in a DataFrame

44. Fill missing values:

df4 = df3.copy()
print(df4)
df4.fillna(value=3)
==>
    animal  age  visits priority  No.
a    cat  2.5       1      yes    0
b    cat  2.0       3      yes    1
c  snake  0.5       2       no    2
d    dog  NaN       3      yes    3
e    dog  5.0       2       no    4
f    cat  1.5       3       no    5
g  snake  4.5       1       no    6
h    cat  NaN       1      yes    7
i    dog  7.0       2       no    8
j    dog  3.0       1       no    9

  animal  age  visits priority  No.
a    cat  2.5       1      yes    0
b    cat  2.0       3      yes    1
c  snake  0.5       2       no    2
d    dog  3.0       3      yes    3
e    dog  5.0       2       no    4
f    cat  1.5       3       no    5
g  snake  4.5       1       no    6
h    cat  3.0       1      yes    7
i    dog  7.0       2       no    8
j    dog  3.0       1       no    9
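
Filling every NaN with the same constant is often too crude; fillna also accepts per-column values, for example the mean of the column (a minimal sketch that does not modify df4 in place):

df4['age'].fillna(df4['age'].mean())  # fill missing ages with the mean age instead of a constant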

45. Drop rows containing missing values:

df5 = df3.copy()
print(df5)
df5.dropna(how='any')  # any row containing NaN is dropped
==>
    animal  age  visits priority  No.
a    cat  2.5       1      yes    0
b    cat  2.0       3      yes    1
c  snake  0.5       2       no    2
d    dog  NaN       3      yes    3
e    dog  5.0       2       no    4
f    cat  1.5       3       no    5
g  snake  4.5       1       no    6
h    cat  NaN       1      yes    7
i    dog  7.0       2       no    8
j    dog  3.0       1       no    9

  animal  age  visits priority  No.
a    cat  2.5       1      yes    0
b    cat  2.0       3      yes    1
c  snake  0.5       2       no    2
e    dog  5.0       2       no    4
f    cat  1.5       3       no    5
g  snake  4.5       1       no    6
i    dog  7.0       2       no    8
j    dog  3.0       1       no    9

46. Join DataFrames on a specified column:

left = pd.DataFrame({'key': ['foo1', 'foo2'], 'one': [1, 2]})
right = pd.DataFrame({'key': ['foo2', 'foo3'], 'two': [4, 5]})

print(left)
print(right)

The join aligns on the key column; only foo2 appears in both frames, so the result is a single row.

pd.merge(left, right, on='key')
==>
    key  one
0  foo1    1
1  foo2    2

    key  two
0  foo2    4
1  foo3    5

    key  one  two
0  foo2    2    4
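
pd.merge performs an inner join by default; the how argument keeps unmatched keys too (a minimal sketch):

pd.merge(left, right, on='key', how='outer')  # also keeps foo1 and foo3, with NaN where a side has no data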

DataFrame file I/O

47. Write to a CSV file:

df3.to_csv('animal.csv')
print("Write succeeded.")
==>  Write succeeded.

48. Read a CSV file:

df_animal = pd.read_csv('animal.csv')
df_animal
==>
    Unnamed: 0 animal  age  visits priority  No.
0          a    cat  2.5       1      yes    0
1          b    cat  2.0       3      yes    1
2          c  snake  0.5       2       no    2
3          d    dog  NaN       3      yes    3
4          e    dog  5.0       2       no    4
5          f    cat  1.5       3       no    5
6          g  snake  4.5       1       no    6
7          h    cat  NaN       1      yes    7
8          i    dog  7.0       2       no    8
9          j    dog  3.0       1       no    9
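
The Unnamed: 0 column is simply the old index, which to_csv wrote out as an ordinary column; passing index_col=0 to read_csv restores it as the index instead (a minimal sketch):

pd.read_csv('animal.csv', index_col=0)  # use the first column ('a'..'j') as the index again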

49. Write to an Excel file:

df3.to_excel('animal.xlsx', sheet_name='Sheet1')
print("Write succeeded.")
==>  Write succeeded.

50. Read an Excel file:

pd.read_excel('animal.xlsx', 'Sheet1', index_col=None, na_values=['NA'])  
==>
    Unnamed: 0 animal  age  visits priority  No.
0          a    cat  2.5       1      yes    0
1          b    cat  2.0       3      yes    1
2          c  snake  0.5       2       no    2
3          d    dog  NaN       3      yes    3
4          e    dog  5.0       2       no    4
5          f    cat  1.5       3       no    5
6          g  snake  4.5       1       no    6
7          h    cat  NaN       1      yes    7
8          i    dog  7.0       2       no    8
9          j    dog  3.0       1       no    9

Advanced topics

Time series indexing

51. Create a Series indexed by every day of 2018, with random values:

dti = pd.date_range(start='2018-01-01', end='2018-12-31', freq='D')
s = pd.Series(np.random.rand(len(dti)), index=dti)
s 
==>
2018-01-01    0.441330
2018-01-02    0.182571
2018-01-03    0.141348
2018-01-04    0.604700
2018-01-05    0.300351
                ...   
2018-12-27    0.499318
2018-12-28    0.530867
2018-12-29    0.183895
2018-12-30    0.163899
2018-12-31    0.173812
Freq: D, Length: 365, dtype: float64

52. Sum the values in s that fall on a Wednesday:

Weekday numbering starts with Monday as 0, so Wednesday is 2.

s[s.index.weekday == 2].sum()
==>  22.592391213957054

53. Compute the mean of s for each calendar month:

s.resample('M').mean()
==>
2018-01-31    0.441100
2018-02-28    0.506476
2018-03-31    0.501672
2018-04-30    0.510073
2018-05-31    0.416773
2018-06-30    0.525039
2018-07-31    0.433221
2018-08-31    0.472530
2018-09-30    0.388529
2018-10-31    0.550011
2018-11-30    0.486513
2018-12-31    0.443012
Freq: M, dtype: float64

54. Resample a time series (seconds to minutes):

s = pd.date_range('today', periods=100, freq='S')

ts = pd.Series(np.random.randint(0, 500, len(s)), index=s)

ts.resample('Min').sum()
==>
2020-07-05 14:48:00    15836
2020-07-05 14:49:00     9298
Freq: T, dtype: int64

55. Localize to UTC:

s = pd.date_range('today', periods=1, freq='D')  # the current timestamp
ts = pd.Series(np.random.randn(len(s)), s)  # random values
ts_utc = ts.tz_localize('UTC')  # localize to UTC
ts_utc
==>
2020-07-05 14:48:38.609382+00:00   -0.348899
Freq: D, dtype: float64

56. Convert to the Asia/Shanghai time zone:

ts_utc.tz_convert('Asia/Shanghai')
==>
2020-07-05 22:48:38.609382+08:00   -0.348899
Freq: D, dtype: float64

57. Convert between timestamp and period representations:

rng = pd.date_range('1/1/2018', periods=5, freq='M')
ts = pd.Series(np.random.randn(len(rng)), index=rng)
print(ts)
ps = ts.to_period()
print(ps)
ps.to_timestamp()
==>
2018-01-31    0.621688
2018-02-28   -1.937715
2018-03-31    0.081314
2018-04-30   -1.308769
2018-05-31   -0.075345
Freq: M, dtype: float64
2018-01    0.621688
2018-02   -1.937715
2018-03    0.081314
2018-04   -1.308769
2018-05   -0.075345
Freq: M, dtype: float64
2018-01-01    0.621688
2018-02-01   -1.937715
2018-03-01    0.081314
2018-04-01   -1.308769
2018-05-01   -0.075345
Freq: MS, dtype: float64

Series with a MultiIndex

58. Create a MultiIndex Series:

Build a MultiIndex Series indexed by the product of letters = ['A', 'B', 'C'] and numbers = list(range(10)), with random values.

letters = ['A', 'B', 'C']
numbers = list(range(10))

mi = pd.MultiIndex.from_product([letters, numbers])  # build the MultiIndex
s = pd.Series(np.random.rand(30), index=mi)  # random values
s
==>
A  0    0.698046
   1    0.380276
   2    0.873395
   3    0.628864
   4    0.528025
   5    0.677856
   6    0.194495
   7    0.164484
   8    0.018238
   9    0.747468
B  0    0.623616
   1    0.560504
   2    0.731296
   3    0.760307
   4    0.807663
   5    0.347980
   6    0.005892
   7    0.807262
   8    0.650353
   9    0.803976
C  0    0.387503
   1    0.943305
   2    0.215817
   3    0.128086
   4    0.252103
   5    0.048908
   6    0.779633
   7    0.825234
   8    0.624257
   9    0.263373
dtype: float64

59. Query a MultiIndex Series:

Select the values whose second-level index is 1, 3 or 6.

s.loc[:, [1, 3, 6]]
==>
A  1    0.380276
   3    0.628864
   6    0.194495
B  1    0.560504
   3    0.760307
   6    0.005892
C  1    0.943305
   3    0.128086
   6    0.779633
dtype: float64

60. Slice a MultiIndex Series:

s.loc[pd.IndexSlice[:'B', 5:]]
==>
A  5    0.677856
   6    0.194495
   7    0.164484
   8    0.018238
   9    0.747468
B  5    0.347980
   6    0.005892
   7    0.807262
   8    0.650353
   9    0.803976
dtype: float64

DataFrame with a MultiIndex

61. Create a DataFrame with a MultiIndex:

Create a DataFrame with a two-level index, letters 'A' and 'B' on the outer level and numbers 1-3 on the inner level; here the values are simply np.arange(12) reshaped to 6 rows and 2 columns.

frame = pd.DataFrame(np.arange(12).reshape(6, 2),
                     index=[list('AAABBB'), list('123123')],
                     columns=['hello', 'shiyanlou'])
frame
==>
     hello  shiyanlou
A 1      0          1
  2      2          3
  3      4          5
B 1      6          7
  2      8          9
  3     10         11

62. Name the index levels of a MultiIndex:

frame.index.names = ['first', 'second']
frame
==>
              hello  shiyanlou
first second                  
A     1           0          1
      2           2          3
      3           4          5
B     1           6          7
      2           8          9
      3          10         11

63. Group by a MultiIndex level and sum:

frame.groupby('first').sum()
==>
       hello  shiyanlou
first                  
A          6          9
B         24         27

64. Stack DataFrame columns into the index:

print(frame)
frame.stack()
==>
              hello  shiyanlou
first second                  
A     1           0          1
      2           2          3
      3           4          5
B     1           6          7
      2           8          9
      3          10         11
first  second           
A      1       hello         0
               shiyanlou     1
       2       hello         2
               shiyanlou     3
       3       hello         4
               shiyanlou     5
B      1       hello         6
               shiyanlou     7
       2       hello         8
               shiyanlou     9
       3       hello        10
               shiyanlou    11
dtype: int64

65. Unstack an index level back into columns:

print(frame)
frame.unstack()
==>
              hello  shiyanlou
first second                  
A     1           0          1
      2           2          3
      3           4          5
B     1           6          7
      2           8          9
      3          10         11
       hello        shiyanlou       
second     1  2   3         1  2   3
first                               
A          0  2   4         1  3   5
B          6  8  10         7  9  11

66. Conditional selection on a DataFrame:

Sample data:

data = {'animal': ['cat', 'cat', 'snake', 'dog', 'dog', 'cat', 'snake', 'cat', 'dog', 'dog'],
        'age': [2.5, 3, 0.5, np.nan, 5, 2, 4.5, np.nan, 7, 3],
        'visits': [1, 3, 2, 3, 2, 3, 1, 1, 2, 1],
        'priority': ['yes', 'yes', 'no', 'yes', 'no', 'no', 'no', 'yes', 'no', 'no']}

labels = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j']
df = pd.DataFrame(data, index=labels)

Select all rows where age is greater than 3.

df[df['age'] > 3]
==>
  animal  age  visits priority
e    dog  5.0       2       no
g  snake  4.5       1       no
i    dog  7.0       2       no

67. Slice by row and column positions:

df.iloc[2:4, 1:3]
==>
      age  visits
c  0.5       2
d  NaN       3

68. Selection with multiple conditions:

Select all rows where age < 3 and the animal is a cat.

df = pd.DataFrame(data, index=labels)

df[(df['animal'] == 'cat') & (df['age'] < 3)]
==>
  animal  age  visits priority
a    cat  2.5       1      yes
f    cat  2.0       3       no

69. Select rows by membership with isin:

df3[df3['animal'].isin(['cat', 'dog'])]
==>
    animal  age  visits priority  No.
a    cat  2.5       1      yes    0
b    cat  2.0       3      yes    1
d    dog  NaN       3      yes    3
e    dog  5.0       2       no    4
f    cat  1.5       3       no    5
h    cat  NaN       1      yes    7
i    dog  7.0       2       no    8
j    dog  3.0       1       no    9

70. Select by row labels and column names:

df.loc[df.index[[3, 4, 8]], ['animal', 'age']]
==>
    animal  age
d    dog  NaN
e    dog  5.0
i    dog  7.0

71. Sort by multiple columns:

Sort by age in descending order, then by visits in ascending order.

df.sort_values(by=['age', 'visits'], ascending=[False, True])
==>
    animal  age  visits priority
i    dog  7.0       2       no
e    dog  5.0       2       no
g  snake  4.5       1       no
j    dog  3.0       1       no
b    cat  3.0       3      yes
a    cat  2.5       1      yes
f    cat  2.0       3       no
c  snake  0.5       2       no
h    cat  NaN       1      yes
d    dog  NaN       3      yes

72. Replace multiple values:

Map the priority column's yes to True and no to False.

df['priority'].map({'yes': True, 'no': False})
==>
a     True
b     True
c    False
d     True
e    False
f    False
g    False
h     True
i    False
j    False
Name: priority, dtype: bool

73. Group a DataFrame and sum:

df4.groupby('animal').sum()
==>
         age  visits  No.
animal                   
cat      6.0       8   13
dog     15.0       8   24
snake    5.0       3    8

74. Concatenate several DataFrames from a list:

temp_df1 = pd.DataFrame(np.random.randn(5, 4))  # DataFrame 1 of random values
temp_df2 = pd.DataFrame(np.random.randn(5, 4))  # DataFrame 2 of random values
temp_df3 = pd.DataFrame(np.random.randn(5, 4))  # DataFrame 3 of random values

print(temp_df1)
print(temp_df2)
print(temp_df3)

pieces = [temp_df1, temp_df2, temp_df3]
pd.concat(pieces)
==>
       0         1         2         3
0  1.061349  0.927805 -0.270724  0.232218
1 -2.049875 -0.896899 -0.738298  0.547709
2  0.084709 -1.801844  0.610220 -1.304246
3  1.384591  0.872657 -0.829547 -0.332316
4 -0.255004  2.177881  0.615079  0.767592

          0         1         2         3
0  0.009016  1.181569 -1.403829 -0.745604
1 -0.270313 -0.258377 -1.067346  1.465726
2 -1.619676 -0.324374 -0.433600  0.211323
3  0.163223  0.144191  0.717129 -0.555298
4 -0.718321  1.688866 -0.607994  1.731248

          0         1         2         3
0 -1.178622  0.415409  0.496004  1.368869
1  0.724433 -0.262059  0.514689 -1.666051
2 -0.325606  0.013015  1.010961  2.075196
3  2.212960 -0.132432 -1.603347 -1.182487
4  0.102536  1.384535  0.411434 -0.175592

          0         1         2         3
0  1.061349  0.927805 -0.270724  0.232218
1 -2.049875 -0.896899 -0.738298  0.547709
2  0.084709 -1.801844  0.610220 -1.304246
3  1.384591  0.872657 -0.829547 -0.332316
4 -0.255004  2.177881  0.615079  0.767592
0  0.009016  1.181569 -1.403829 -0.745604
1 -0.270313 -0.258377 -1.067346  1.465726
2 -1.619676 -0.324374 -0.433600  0.211323
3  0.163223  0.144191  0.717129 -0.555298
4 -0.718321  1.688866 -0.607994  1.731248
0 -1.178622  0.415409  0.496004  1.368869
1  0.724433 -0.262059  0.514689 -1.666051
2 -0.325606  0.013015  1.010961  2.075196
3  2.212960 -0.132432 -1.603347 -1.182487
4  0.102536  1.384535  0.411434 -0.175592

75. Find the column with the smallest sum:

df = pd.DataFrame(np.random.random(size=(5, 10)), columns=list('abcdefghij'))
print(df)
df.sum().idxmin()  # idxmin()/idxmax() return the index label of the Series minimum/maximum
==>
          a         b         c         d         e         f         g  \
0  0.931149  0.641776  0.758608  0.630512  0.170375  0.211306  0.973363   
1  0.730186  0.682949  0.554609  0.356089  0.399012  0.939087  0.908047   
2  0.261405  0.434525  0.490395  0.368307  0.832568  0.571115  0.936016   
3  0.161993  0.132176  0.852158  0.140710  0.165902  0.564976  0.656718   
4  0.810233  0.385639  0.127849  0.166585  0.302643  0.947498  0.164274   

          h         i         j  
0  0.223378  0.115285  0.161207  
1  0.765946  0.206518  0.951096  
2  0.891956  0.430530  0.045640  
3  0.955571  0.962989  0.123037  
4  0.391810  0.696404  0.561719  
'd'

76. Subtract the row mean from every element of a DataFrame:

df = pd.DataFrame(np.random.random(size=(5, 3)))
print(df)
df.sub(df.mean(axis=1), axis=0)
==>
          0         1         2
0  0.028539  0.555065  0.166588
1  0.781335  0.086089  0.616780
2  0.022462  0.047383  0.476410
3  0.796853  0.850955  0.765398
4  0.208298  0.858031  0.264920

          0         1         2
0 -0.221525  0.305001 -0.083476
1  0.286600 -0.408646  0.122046
2 -0.159623 -0.134702  0.294325
3 -0.007549  0.046553 -0.039004
4 -0.235452  0.414281 -0.178830

  

77. Group a DataFrame and sum the three largest values in each group:

df = pd.DataFrame({'A': list('aaabbcaabcccbbc'),
                   'B': [12, 345, 3, 1, 45, 14, 4, 52, 54, 23, 235, 21, 57, 3, 87]})
print(df)
df.groupby('A')['B'].nlargest(3).sum(level=0)
==>
    A    B
0   a   12
1   a  345
2   a    3
3   b    1
4   b   45
5   c   14
6   a    4
7   a   52
8   b   54
9   c   23
10  c  235
11  c   21
12  b   57
13  b    3
14  c   87
A
a    409
b    156
c    345
Name: B, dtype: int64
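
Summing with sum(level=0) works in pandas 1.0 but was deprecated later and removed in pandas 2.0; grouping by the index level is the forward-compatible spelling (a minimal sketch):

df.groupby('A')['B'].nlargest(3).groupby(level=0).sum()  # same result without sum(level=...)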

  

Pivot tables

When analysing large datasets, a pivot table (pivot_table) lets you explore the relationships between features without modifying the original data.

78. Create a pivot table:

Build a new table aggregated with the A and B columns as the index.

df = pd.DataFrame({'A': ['one', 'one', 'two', 'three'] * 3,
                   'B': ['A', 'B', 'C'] * 4,
                   'C': ['foo', 'foo', 'foo', 'bar', 'bar', 'bar'] * 2,
                   'D': np.random.randn(12),
                   'E': np.random.randn(12)})

print(df)

print(pd.pivot_table(df, index=['A', 'B']))
==>
        A  B    C         D         E
0     one  A  foo -2.718717  1.749056
1     one  B  foo -0.710776  0.442023
2     two  C  foo -0.824951  2.244523
3   three  A  bar  0.300916  1.709200
4     one  B  bar -2.590790  0.292709
5     one  C  bar  0.908543 -0.598258
6     two  A  foo -0.521278  0.204491
7   three  B  foo -3.302320 -1.762640
8     one  C  foo -1.311013 -0.722187
9     one  A  bar  0.785471 -0.231635
10    two  B  bar -1.758329 -0.031603
11  three  C  bar  1.236829  1.235032

                D         E
A     B                    
one   A -0.966623  0.758711
      B -1.650783  0.367366
      C -0.201235 -0.660222
three A  0.300916  1.709200
      B -3.302320 -1.762640
      C  1.236829  1.235032
two   A -0.521278  0.204491
      B -1.758329 -0.031603
      C -0.824951  2.244523

79. Aggregate a specific column in a pivot table:

Aggregate the D column with A and B as the index; the default aggregation is the mean.

pd.pivot_table(df, values=['D'], index=['A', 'B'])
==>
                D
A     B          
one   A -0.966623
      B -1.650783
      C -0.201235
three A  0.300916
      B -3.302320
      C  1.236829
two   A -0.521278
      B -1.758329
      C -0.824951

80. Specify the aggregation functions of a pivot table:

The previous exercise aggregated D with the default mean; other aggregations can be supplied through aggfunc.

pd.pivot_table(df, values=['D'], index=['A', 'B'], aggfunc=[np.sum, len])
==>
              sum  len
                D    D
A     B               
one   A -1.933246  2.0
      B -3.301567  2.0
      C -0.402470  2.0
three A  0.300916  1.0
      B -3.302320  1.0
      C  1.236829  1.0
two   A -0.521278  1.0
      B -1.758329  1.0
      C -0.824951  1.0

81. Split a pivot table by an additional column:

When aggregating D by A and B, pass C through the columns argument to see how C influences D.

pd.pivot_table(df, values=['D'], index=['A', 'B'],
               columns=['C'], aggfunc=np.sum)
==>
                D          
C             bar       foo
A     B                    
one   A  0.785471 -2.718717
      B -2.590790 -0.710776
      C  0.908543 -1.311013
three A  0.300916       NaN
      B       NaN -3.302320
      C  1.236829       NaN
two   A       NaN -0.521278
      B -1.758329       NaN
      C       NaN -0.824951

82. Handle missing values in a pivot table:

Depending on the aggregation, combinations that never occur in the data show up as missing values in the pivot table; fill_value replaces them.

pd.pivot_table(df, values=['D'], index=['A', 'B'],
               columns=['C'], aggfunc=np.sum, fill_value=0)
==>
                D          
C             bar       foo
A     B                    
one   A  0.785471 -2.718717
      B -2.590790 -0.710776
      C  0.908543 -1.311013
three A  0.300916  0.000000
      B  0.000000 -3.302320
      C  1.236829  0.000000
two   A  0.000000 -0.521278
      B -1.758329  0.000000
      C  0.000000 -0.824951

Categorical data

Data broadly divides into quantitative and qualitative types: quantitative data ranges over an open-ended set of numeric values, while qualitative data is restricted to a fixed set of possible values. Categorical data is a kind of qualitative data.

83. Define categorical data:

df = pd.DataFrame({"id": [1, 2, 3, 4, 5, 6], "raw_grade": [
                  'a', 'b', 'b', 'a', 'a', 'e']})
df["grade"] = df["raw_grade"].astype("category")
df
==>
      id raw_grade grade
0   1         a     a
1   2         b     b
2   3         b     b
3   4         a     a
4   5         a     a
5   6         e     e

84. Rename the categories:

df["grade"].cat.categories = ["very good", "good", "very bad"]
df
==>
      id raw_grade      grade
0   1         a  very good
1   2         b       good
2   3         b       good
3   4         a  very good
4   5         a  very good
5   6         e   very bad
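
Assigning directly to .cat.categories also stops working in later pandas releases; rename_categories is the safer, version-stable way to do the same renaming (a minimal sketch):

df["grade"] = df["grade"].cat.rename_categories(["very good", "good", "very bad"])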

85. Reorder the categories and add the missing ones:

df["grade"] = df["grade"].cat.set_categories(
    ["very bad", "bad", "medium", "good", "very good"])
df
==>
      id raw_grade      grade
0   1         a  very good
1   2         b       good
2   3         b       good
3   4         a  very good
4   5         a  very good
5   6         e   very bad

86. Sort by the categorical column:

df.sort_values(by="grade")
==>
      id raw_grade      grade
5   6         e   very bad
1   2         b       good
2   3         b       good
0   1         a  very good
3   4         a  very good
4   5         a  very good

87. Group by the categorical column:

df.groupby("grade").size()
==>
grade
very bad     1
bad          0
medium       0
good         2
very good    3
dtype: int64

Data cleaning

Raw data often fails to meet the requirements of the final analysis: it contains missing values and malformed entries, so it has to be cleaned first.

88. Fill missing values by interpolation:

The FlightNumber column has missing values; the numbers increase by 10, so fill the gaps accordingly and convert the column to int.

df = pd.DataFrame({'From_To': ['LoNDon_paris', 'MAdrid_miLAN', 'londON_StockhOlm',
                               'Budapest_PaRis', 'Brussels_londOn'],
                   'FlightNumber': [10045, np.nan, 10065, np.nan, 10085],
                   'RecentDelays': [[23, 47], [], [24, 43, 87], [13], [67, 32]],
                   'Airline': ['KLM(!)', '<Air France> (12)', '(British Airways. )',
                               '12. Air France', '"Swiss Air"']})
df['FlightNumber'] = df['FlightNumber'].interpolate().astype(int)
df
==>
            From_To  FlightNumber  RecentDelays              Airline
0      LoNDon_paris         10045      [23, 47]               KLM(!)
1      MAdrid_miLAN         10055            []    <Air France> (12)
2  londON_StockhOlm         10065  [24, 43, 87]  (British Airways. )
3    Budapest_PaRis         10075          [13]       12. Air France
4   Brussels_londOn         10085      [67, 32]          "Swiss Air"

89. Split a column into two:

From_To should really be two separate columns, From and To; split it on the underscore and build the result as a new temporary table.

temp = df.From_To.str.split('_', expand=True)
temp.columns = ['From', 'To']
temp
==>
       From         To
0    LoNDon      paris
1    MAdrid      miLAN
2    londON  StockhOlm
3  Budapest      PaRis
4  Brussels     londOn

90. Standardize the strings:

The place names are inconsistently cased (e.g. londON should be London), so standardize them.

temp['From'] = temp['From'].str.capitalize()
temp['To'] = temp['To'].str.capitalize()

91. Drop the raw column and join the cleaned columns:

Drop the original From_To column and join the cleaned From and To columns.

df = df.drop('From_To', axis=1)
df = df.join(temp)
print(df)
==>
      FlightNumber  RecentDelays              Airline      From         To
0         10045      [23, 47]               KLM(!)    London      Paris
1         10055            []    <Air France> (12)    Madrid      Milan
2         10065  [24, 43, 87]  (British Airways. )    London  Stockholm
3         10075          [13]       12. Air France  Budapest      Paris
4         10085      [67, 32]          "Swiss Air"  Brussels     London

92. Remove extraneous characters:

Many entries in the Airline column contain stray characters that would get in the way of later analysis, so extract just the airline name.

df['Airline'] = df['Airline'].str.extract(
    r'([a-zA-Z\s]+)', expand=False).str.strip()
df
==>
   FlightNumber  RecentDelays          Airline      From         To
0         10045      [23, 47]              KLM    London      Paris
1         10055            []       Air France    Madrid      Milan
2         10065  [24, 43, 87]  British Airways    London  Stockholm
3         10075          [13]       Air France  Budapest      Paris
4         10085      [67, 32]        Swiss Air  Brussels     London

93. Normalize the format:

RecentDelays stores lists of varying length, which is awkward for later analysis. Expand the lists so that the n-th element of each list becomes its own delay_n column, filling missing positions with NaN.

delays = df['RecentDelays'].apply(pd.Series)

delays.columns = ['delay_{}'.format(n)
                  for n in range(1, len(delays.columns)+1)]

df = df.drop('RecentDelays', axis=1).join(delays)
df
==>
   FlightNumber          Airline      From         To  delay_1  delay_2  delay_3
0         10045              KLM    London      Paris     23.0     47.0      NaN
1         10055       Air France    Madrid      Milan      NaN      NaN      NaN
2         10065  British Airways    London  Stockholm     24.0     43.0     87.0
3         10075       Air France  Budapest      Paris     13.0      NaN      NaN
4         10085        Swiss Air  Brussels     London     67.0     32.0      NaN

Data preprocessing

94. Recode values into intervals:

Here are the maths scores of some students in a class:

df=pd.DataFrame({'name':['Alice','Bob','Candy','Dany','Ella','Frank','Grace','Jenny'],
                 'grades':[58,83,79,65,93,45,61,88]})

What we actually care about is whether each student passed, so recode the score according to whether it is above 60.

df = pd.DataFrame({'name': ['Alice', 'Bob', 'Candy', 'Dany', 'Ella',
                            'Frank', 'Grace', 'Jenny'],
                   'grades': [58, 83, 79, 65, 93, 45, 61, 88]})


def choice(x):
    if x > 60:
        return 1
    else:
        return 0


df.grades = pd.Series(map(lambda x: choice(x), df.grades))
df
==>
    name  grades
0  Alice       0
1    Bob       1
2  Candy       1
3   Dany       1
4   Ella       1
5  Frank       0
6  Grace       1
7  Jenny       1
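
An equivalent, vectorized way to recode the original scores without the helper function (a minimal sketch, assuming df still holds the raw grades; exactly like choice above, a score of 60 itself counts as a fail):

df['grades'] = (df['grades'] > 60).astype(int)  # 1 = pass, 0 = fail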

95. Remove consecutive duplicates:

Given a DataFrame with a single column A:

df = pd.DataFrame({'A': [1, 2, 2, 3, 4, 5, 5, 5, 6, 7, 7]})

Remove the values in column A that repeat consecutively.

df = pd.DataFrame({'A': [1, 2, 2, 3, 4, 5, 5, 5, 6, 7, 7]})
df.loc[df['A'].shift() != df['A']]
==>
      A
0  1
1  2
3  3
4  4
5  5
8  6
9  7
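
For comparison, drop_duplicates removes every repeated value, not just consecutive repeats, whereas the shift trick keeps a value that reappears after a gap (a minimal sketch):

df.drop_duplicates(subset='A')       # one row per distinct value of A
df.loc[df['A'].shift() != df['A']]   # only drops rows equal to the row directly above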

96. Normalize data (Max-Min scaling):

Sometimes the columns of a DataFrame are on very different scales and need to be normalized. Max-Min normalization is a simple and common choice: Y = (X - X_min) / (X_max - X_min).

def normalization(df):
    numerator = df.sub(df.min())
    denominator = (df.max()).sub(df.min())
    Y = numerator.div(denominator)
    return Y

df = pd.DataFrame(np.random.random(size=(5, 3)))
print(df)
normalization(df)
==>
          0         1         2
0  0.923325  0.925392  0.203170
1  0.770389  0.050410  0.605788
2  0.146447  0.542584  0.056240
3  0.161917  0.841527  0.547914
4  0.948175  0.814426  0.980268

          0         1         2
0  0.969004  1.000000  0.159009
1  0.778247  0.000000  0.594731
2  0.000000  0.562496  0.000000
3  0.019297  0.904153  0.532098
4  1.000000  0.873179  1.000000

  


Plotting with Pandas

The most intuitive way to understand the information a dataset contains is to plot it.

97. Plot a Series:

%matplotlib inline
ts = pd.Series(np.random.randn(100), index=pd.date_range('today', periods=100))
ts = ts.cumsum()
ts.plot()

==> output: plot image

98. DataFrame line plot:

df = pd.DataFrame(np.random.randn(100, 4), index=ts.index,
                  columns=['A', 'B', 'C', 'D'])
df = df.cumsum()
df.plot()

==> output: plot image

99. DataFrame scatter plot:

df = pd.DataFrame({"xs": [1, 5, 2, 8, 1], "ys": [4, 2, 1, 9, 6]})
df = df.cumsum()
df.plot.scatter("xs", "ys", color='red', marker="*")

==> output: plot image

100. DataFrame bar chart:

df = pd.DataFrame({"revenue": [57, 68, 63, 71, 72, 90, 80, 62, 59, 51, 47, 52],
                   "advertising": [2.1, 1.9, 2.7, 3.0, 3.6, 3.2, 2.7, 2.4, 1.8, 1.6, 1.3, 1.9],
                   "month": range(12)
                   })

ax = df.plot.bar("month", "revenue", color="yellow")
df.plot("month", "advertising", secondary_y=True, ax=ax)

==> output: plot image