Chapter 2: pandas Basics

Tags: Pandas, Machine Learning

I. Reading and Writing Files

1. Reading files

【a】Reading csv, excel, and txt files with pandas:

import numpy as np
import pandas as pd
df_csv = pd.read_csv('../data/my_csv.csv')
df_csv
   col1 col2  col3    col4      col5
0     2    a   1.4   apple  2020/1/1
1     3    b   3.4  banana  2020/1/2
2     6    c   2.5  orange  2020/1/5
3     5    d   3.2   lemon  2020/1/7
df_txt = pd.read_table('../data/my_table.txt')
df_txt
   col1 col2  col3             col4
0     2    a   1.4   apple 2020/1/1
1     3    b   3.4  banana 2020/1/2
2     6    c   2.5  orange 2020/1/5
3     5    d   3.2   lemon 2020/1/7
df_excel = pd.read_excel('../data/my_excel.xlsx')
df_excel
   col1 col2  col3    col4      col5
0     2    a   1.4   apple  2020/1/1
1     3    b   3.4  banana  2020/1/2
2     6    c   2.5  orange  2020/1/5
3     5    d   3.2   lemon  2020/1/7

【b】Common parameters: header, usecols, and nrows. They apply to txt, csv, and excel files alike.

header=None means the first row is not used as the column names:

pd.read_table('../data/my_table.txt', header=None)
      0     1     2                3
0  col1  col2  col3             col4
1     2     a   1.4   apple 2020/1/1
2     3     b   3.4  banana 2020/1/2
3     6     c   2.5  orange 2020/1/5
4     5     d   3.2   lemon 2020/1/7

usecols is the set of columns to read; by default all columns are read:

pd.read_table('../data/my_table.txt', usecols=['col1', 'col2'])
   col1 col2
0     2    a
1     3    b
2     6    c
3     5    d

nrows is the number of data rows to read:

pd.read_table('../data/my_table.txt', nrows=2)
   col1 col2  col3             col4
0     2    a   1.4   apple 2020/1/1
1     3    b   3.4  banana 2020/1/2
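
The three parameters can be tried without any file on disk. The sketch below is only an illustration: the in-memory string `raw` is a hypothetical stand-in for `../data/my_table.txt`, assumed to be tab-separated.

```python
import io

import pandas as pd

# Hypothetical stand-in for ../data/my_table.txt (assumed tab-separated).
raw = (
    "col1\tcol2\tcol3\tcol4\n"
    "2\ta\t1.4\tapple 2020/1/1\n"
    "3\tb\t3.4\tbanana 2020/1/2\n"
    "6\tc\t2.5\torange 2020/1/5\n"
    "5\td\t3.2\tlemon 2020/1/7\n"
)

# header=None: the first line is kept as data and columns are numbered 0..3.
no_header = pd.read_table(io.StringIO(raw), header=None)

# usecols + nrows: keep two columns and only the first two data rows.
subset = pd.read_table(io.StringIO(raw), usecols=["col1", "col2"], nrows=2)

print(no_header.shape)
print(subset)
```

Because the header line becomes an ordinary row under `header=None`, `no_header` has one more row than the default read.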

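【c】For txt files that use a delimiter other than the default (a tab for read_table), the sep parameter lets you specify a custom delimiter.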

pd.read_table('../data/my_table_special_sep.txt')
              col1 |||| col2
0  TS |||| This is an apple.
1    GQ |||| My name is Bob.
2         WT |||| Well done!
3    PT |||| May I help you?

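Set the delimiter with sep, and specify the python engine:

PS: In read_table, a sep longer than one character is interpreted as a regular expression, so each | must be escaped as \|; otherwise the file will not be parsed correctly. A raw string keeps the backslashes intact.

pd.read_table('../data/my_table_special_sep.txt', sep=r' \|\|\|\| ', engine='python')
  col1               col2
0   TS  This is an apple.
1   GQ    My name is Bob.
2   WT         Well done!
3   PT    May I help you?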
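
As a small self-contained sketch of the escaping rule, the snippet below parses a two-line sample held in memory (the string `raw` is made up for illustration). It also shows `re.escape` from the standard library as one possible way to avoid escaping the delimiter by hand.

```python
import io
import re

import pandas as pd

# Hypothetical sample in the same " |||| "-delimited layout.
raw = (
    "col1 |||| col2\n"
    "TS |||| This is an apple.\n"
    "GQ |||| My name is Bob.\n"
)

# Hand-escaped regular expression, written as a raw string.
df_sep = pd.read_table(io.StringIO(raw), sep=r" \|\|\|\| ", engine="python")

# Same delimiter, escaped programmatically.
df_escaped = pd.read_table(io.StringIO(raw), sep=re.escape(" |||| "), engine="python")

print(df_sep)
```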

2. Writing files

【a】index=False removes the index when saving a file:

df_csv.to_csv('../data/my_csv_saved.csv', index=False)
df_excel.to_excel('../data/my_excel_saved.xlsx', index=False)

【b】pandas does not define a to_table function; txt files are also saved with to_csv:

df_txt.to_csv('../data/my_txt_saved.txt', sep='\t', index=False)
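
To see what index=False and sep='\t' change in the written text, the sketch below writes to an in-memory buffer instead of a file. The small frame is made up for the example.

```python
import io

import pandas as pd

df_small = pd.DataFrame({"col1": [2, 3], "col2": ["a", "b"]})

# Write tab-separated text without the index into a buffer.
buf = io.StringIO()
df_small.to_csv(buf, sep="\t", index=False)
text_no_index = buf.getvalue()

# With no path given, to_csv returns the text; here the index is kept.
text_with_index = df_small.to_csv(sep="\t")

# Reading the index-free text back recovers the original frame.
restored = pd.read_table(io.StringIO(text_no_index))

print(text_no_index)
print(text_with_index)
```

When the index is kept, the header line starts with an empty field for the unnamed index column, which is why reading such a file back yields an extra column.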

【c】Converting a table to markdown or latex (to_markdown requires the tabulate package):

df_csv
   col1 col2  col3    col4      col5
0     2    a   1.4   apple  2020/1/1
1     3    b   3.4  banana  2020/1/2
2     6    c   2.5  orange  2020/1/5
3     5    d   3.2   lemon  2020/1/7
print(df_csv.to_markdown())
|    |   col1 | col2   |   col3 | col4   | col5     |
|---:|-------:|:-------|-------:|:-------|:---------|
|  0 |      2 | a      |    1.4 | apple  | 2020/1/1 |
|  1 |      3 | b      |    3.4 | banana | 2020/1/2 |
|  2 |      6 | c      |    2.5 | orange | 2020/1/5 |
|  3 |      5 | d      |    3.2 | lemon  | 2020/1/7 |
print(df_csv.to_latex())
\begin{tabular}{lrlrll}
\toprule
{} &  col1 & col2 &  col3 &    col4 &      col5 \\
\midrule
0 &     2 &    a &   1.4 &   apple &  2020/1/1 \\
1 &     3 &    b &   3.4 &  banana &  2020/1/2 \\
2 &     6 &    c &   2.5 &  orange &  2020/1/5 \\
3 &     5 &    d &   3.2 &   lemon &  2020/1/7 \\
\bottomrule
\end{tabular}

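II. Basic Data Structures

1. Series

【a】A Series stores one-dimensional values and consists of the sequence values data, the index index, the storage type dtype, and the sequence name name: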

s = pd.Series(data = [['dawang','yiyi'],{'dawang':21},'yiyi_19'],
              index = ['list','dict','str'],
              # index = pd.Index(['list','dict','str'], name='index_name'),
              dtype = 'object',
              name = 'dawang_yiyi')  # `object` is a mixed type; in this example it holds a list, a dict, and a string.
s
list    [dawang, yiyi]
dict    {'dawang': 21}
str            yiyi_19
Name: dawang_yiyi, dtype: object

【b】Getting Series attributes:

s.values
array([list(['dawang', 'yiyi']), {'dawang': 21}, 'yiyi_19'], dtype=object)
s.index
Index(['list', 'dict', 'str'], dtype='object')
s.dtype
dtype('O')
s.name
'dawang_yiyi'
s.shape  # the shape of the Series, a one-element tuple holding its length
(3,)
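
The four components can also be read back from a purely numeric Series. The following is a minimal sketch with made-up values; the names `scores`, `key`, and `score` are illustrative only.

```python
import pandas as pd

scores = pd.Series(
    data=[1.5, 2.5, 4.0],
    index=pd.Index(["a", "b", "c"], name="key"),  # the index may carry its own name
    name="score",
)

print(scores.values)      # the underlying one-dimensional array
print(scores.index.name)  # name of the index
print(scores.dtype)       # float64, since every value is a float
print(scores.shape)       # one-element tuple with the length
print(scores["b"])        # label-based access to a single value
```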

2. DataFrame

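A DataFrame stores two-dimensional values; compared with a Series it adds a column index, columns.

【a】Constructing a DataFrame from a two-dimensional array plus row and column indexes: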

data = [['dawang',21,'image identification'], ['yiyi',19,'clinical medicine']]
df = pd.DataFrame(data = data,
                  index = ['row_%d'%i for i in range(2)],
                  columns=['col_0', 'col_1', 'col_2'])
df
        col_0  col_1                 col_2
row_0  dawang     21  image identification
row_1    yiyi     19     clinical medicine

【b】Constructing a DataFrame from a mapping of column names to data, together with a row index:

df = pd.DataFrame(data = {'col_0': ['dawang','yiyi'],
                          'col_1': [21, 19],
                          'col_2': ['imageidentification', 'clinicalmedicine']},
                  index = ['row_%d'%i for i in range(2)])
df
        col_0  col_1                col_2
row_0  dawang     21  imageidentification
row_1    yiyi     19     clinicalmedicine

【c】DataFrame + [col_name] selects the corresponding column and returns a Series:

df['col_0']
row_0    dawang
row_1      yiyi
Name: col_0, dtype: object

【d】DataFrame + [col_list] selects multiple columns and returns a DataFrame:

df[['col_0', 'col_1']]
        col_0  col_1
row_0  dawang     21
row_1    yiyi     19

【e】Getting DataFrame attributes:

df.values
array([['dawang', 21, 'imageidentification'],
       ['yiyi', 19, 'clinicalmedicine']], dtype=object)
df.index
Index(['row_0', 'row_1'], dtype='object')
df.columns
Index(['col_0', 'col_1', 'col_2'], dtype='object')
df.dtypes  # returns a Series whose values are the dtypes of the corresponding columns
col_0    object
col_1     int64
col_2    object
dtype: object
df.shape
(2, 3)
df.T  # transpose of the `DataFrame`
                     row_0             row_1
col_0               dawang              yiyi
col_1                   21                19
col_2  imageidentification  clinicalmedicine
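
The two construction routes and the two selection forms can be compared directly. This is a small sketch with made-up column names (`name`, `age`), not the exact frames used above.

```python
import pandas as pd

index = ["row_0", "row_1"]

# Route 1: two-dimensional data plus row and column labels.
from_rows = pd.DataFrame([["dawang", 21], ["yiyi", 19]],
                         index=index, columns=["name", "age"])

# Route 2: mapping from column name to column data.
from_dict = pd.DataFrame({"name": ["dawang", "yiyi"], "age": [21, 19]},
                         index=index)

same = from_rows.equals(from_dict)  # both routes build the same table

one_column = from_dict["age"]       # single label  -> Series
sub_frame = from_dict[["age"]]      # list of labels -> DataFrame, even with one label

print(same, type(one_column).__name__, type(sub_frame).__name__)
print(from_dict.T.shape)
```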

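III. Commonly Used Basic Functions

To illustrate, the examples below use a synthetic dataset, learn_pandas.csv, which records personal physical-fitness test information for students from four schools.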

df = pd.read_csv('../data/learn_pandas.csv')
df.columns
Index(['School', 'Grade', 'Name', 'Gender', 'Height', 'Weight', 'Transfer',
       'Test_Number', 'Test_Date', 'Time_Record'],
      dtype='object')

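The column names stand for, in order: school, grade, name, gender, height, weight, whether the student is a transfer student, test session number, test date, and 1000-metre time. This chapter uses only the first seven columns.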

df = df[df.columns[:7]]
df
                            School      Grade            Name  Gender  Height  Weight Transfer
0    Shanghai Jiao Tong University   Freshman    Gaopeng Yang  Female   158.9    46.0        N
1                Peking University   Freshman  Changqiang You    Male   166.5    70.0        N
2    Shanghai Jiao Tong University     Senior         Mei Sun    Male   188.9    89.0        N
3                 Fudan University  Sophomore    Xiaojuan Sun  Female     NaN    41.0        N
4                 Fudan University  Sophomore     Gaojuan You    Male   174.0    74.0        N
..                             ...        ...             ...     ...     ...     ...      ...
195               Fudan University     Junior    Xiaojuan Sun  Female   153.9    46.0        N
196            Tsinghua University     Senior         Li Zhao  Female   160.9    50.0        N
197  Shanghai Jiao Tong University     Senior  Chengqiang Chu  Female   153.9    45.0        N
198  Shanghai Jiao Tong University     Senior   Chengmei Shen    Male   175.3    71.0        N
199            Tsinghua University  Sophomore     Chunpeng Lv    Male   155.7    51.0        N

[200 rows x 7 columns]

1. Summary functions

【a】head and tail return the first n and last n rows of the table respectively; n defaults to 5:

df.head(2)
                          School     Grade            Name  Gender  Height  Weight Transfer
0  Shanghai Jiao Tong University  Freshman    Gaopeng Yang  Female   158.9    46.0        N
1              Peking University  Freshman  Changqiang You    Male   166.5    70.0        N
df.tail(2)
                            School      Grade           Name Gender  Height  Weight Transfer
198  Shanghai Jiao Tong University     Senior  Chengmei Shen   Male   175.3    71.0        N
199            Tsinghua University  Sophomore    Chunpeng Lv   Male   155.7    51.0        N

【b】info returns a summary of the table:

df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 200 entries, 0 to 199
Data columns (total 7 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   School    200 non-null    object 
 1   Grade     200 non-null    object 
 2   Name      200 non-null    object 
 3   Gender    200 non-null    object 
 4   Height    183 non-null    float64
 5   Weight    189 non-null    float64
 6   Transfer  188 non-null    object 
dtypes: float64(2), object(5)
memory usage: 11.1+ KB

【c】describe returns the main statistics for the numeric columns of the table; here the only numeric columns are Height and Weight:

df.describe()
           Height      Weight
count  183.000000  189.000000
mean   163.218033   55.015873
std      8.608879   12.824294
min    145.400000   34.000000
25%    157.150000   46.000000
50%    161.900000   51.000000
75%    167.500000   65.000000
max    193.900000   89.000000
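
The summary functions can be tried on a tiny frame where the numbers are easy to check by hand. The data below is invented for the example; it includes one missing height to show that count skips missing values.

```python
import numpy as np
import pandas as pd

people = pd.DataFrame({
    "Name": ["A", "B", "C", "D"],
    "Height": [150.0, 160.0, np.nan, 170.0],
    "Weight": [40.0, 50.0, 60.0, 70.0],
})

first_two = people.head(2)   # first two rows
last_one = people.tail(1)    # last row

# By default describe keeps only the numeric columns; Name is left out.
summary = people.describe()

print(summary)
```

Here the Height count is 3 rather than 4 because the missing value is excluded, which mirrors the 183 versus 200 seen in the real dataset.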

2. Feature statistics functions

Common examples are sum, mean, median, var, std, max, and min:

df_demo = df[['Height', 'Weight']]
df_demo
     Height  Weight
0     158.9    46.0
1     166.5    70.0
2     188.9    89.0
3       NaN    41.0
4     174.0    74.0
..      ...     ...
195   153.9    46.0
196   160.9    50.0
197   153.9    45.0
198   175.3    71.0
199   155.7    51.0

[200 rows x 2 columns]

df_demo.mean()  # mean
Height    163.218033
Weight     55.015873
dtype: float64
df_demo.max()  # maximum
Height    193.9
Weight     89.0
dtype: float64
df_demo.quantile(0.75)  # 75% quantile
Height    167.5
Weight     65.0
Name: 0.75, dtype: float64
df_demo.count()  # number of non-missing values
Height    183
Weight    189
dtype: int64
df_demo.idxmax() # index label of the maximum value
Height    193
Weight      2
dtype: int64

All of the functions above take an aggregation parameter axis: axis=0 aggregates column by column, axis=1 aggregates row by row, and the default is 0:

df_demo.mean(axis=1).head(2) # row-wise means, showing the first two rows
0    102.45
1    118.25
dtype: float64
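
The effect of axis is easiest to see on a frame small enough to verify by hand. The values below are made up for the illustration.

```python
import pandas as pd

body = pd.DataFrame({"Height": [160.0, 170.0], "Weight": [50.0, 70.0]})

col_means = body.mean()         # axis=0 (default): one value per column
row_means = body.mean(axis=1)   # axis=1: one value per row
max_labels = body.idxmax()      # row label where each column peaks

print(col_means)
print(row_means)
print(max_labels)
```

With axis=0 the result is indexed by column name, and with axis=1 it is indexed by the original row labels.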

3. Unique-value functions

【a】unique returns the unique values of a single attribute; nunique counts them.

df['School']  # the School attribute of every student
0      Shanghai Jiao Tong University
1                  Peking University
2      Shanghai Jiao Tong University
3                   Fudan University
4                   Fudan University
                   ...              
195                 Fudan University
196              Tsinghua University
197    Shanghai Jiao Tong University
198    Shanghai Jiao Tong University
199              Tsinghua University
Name: School, Length: 200, dtype: object
df['School'].unique()  # the unique values of the School attribute
array(['Shanghai Jiao Tong University', 'Peking University',
       'Fudan University', 'Tsinghua University'], dtype=object)
df['School'].nunique()  # the number of unique values of the School attribute
4

【b】value_counts returns the unique values together with their frequencies.

df['School'].value_counts()
Tsinghua University              69
Shanghai Jiao Tong University    57
Fudan University                 40
Peking University                34
Name: School, dtype: int64
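
The three functions relate to each other in a simple way, sketched below on a short made-up Series: unique lists values in order of first appearance, nunique is the length of that list, and value_counts sorts by frequency.

```python
import pandas as pd

schools = pd.Series(["Fudan", "Peking", "Fudan", "Tsinghua", "Fudan", "Peking"])

distinct = list(schools.unique())   # order of first appearance
n_distinct = schools.nunique()      # how many distinct values
counts = schools.value_counts()     # most frequent value first

print(distinct)
print(n_distinct)
print(counts)
```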

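【c】drop_duplicates returns the unique values of a single attribute or the unique combinations of several attributes.

The key parameter is keep: 'first' (the default) keeps the row where each combination first appears, 'last' keeps the row where it last appears, and False drops every row belonging to a repeated combination.

For the unique values of a single attribute: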

df['School'].drop_duplicates(keep='first')
0    Shanghai Jiao Tong University
1                Peking University
3                 Fudan University
5              Tsinghua University
Name: School, dtype: object

For the unique combinations of several attributes:

df_demo = df[['Gender','Transfer','Name']]
df_demo
     Gender Transfer            Name
0    Female        N    Gaopeng Yang
1      Male        N  Changqiang You
2      Male        N         Mei Sun
3    Female        N    Xiaojuan Sun
4      Male        N     Gaojuan You
..      ...      ...             ...
195  Female        N    Xiaojuan Sun
196  Female        N         Li Zhao
197  Female        N  Chengqiang Chu
198    Male        N   Chengmei Shen
199    Male        N     Chunpeng Lv

[200 rows x 3 columns]

df_demo.drop_duplicates(['Gender', 'Transfer'], keep='first')
    Gender Transfer            Name
0   Female        N    Gaopeng Yang
1     Male        N  Changqiang You
12  Female      NaN        Peng You
21    Male      NaN   Xiaopeng Shen
36    Male        Y    Xiaojuan Qin
43  Female        Y      Gaoli Feng
df_demo.drop_duplicates(['Gender', 'Transfer'], keep='last')
     Gender Transfer            Name
147    Male      NaN        Juan You
150    Male        Y   Chengpeng You
169  Female        Y   Chengquan Qin
194  Female      NaN     Yanmei Qian
197  Female        N  Chengqiang Chu
199    Male        N     Chunpeng Lv
df_demo.drop_duplicates(['Gender', 'Transfer'], keep=False)
Empty DataFrame
Columns: [Gender, Transfer, Name]
Index: []
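
The three keep settings are easier to compare on a frame with only five rows. The data below is invented; the combination (F, N) appears twice and every other combination once.

```python
import pandas as pd

students = pd.DataFrame({
    "Gender":   ["F", "M", "F", "M", "F"],
    "Transfer": ["N", "N", "N", "Y", "Y"],
    "Name":     ["a", "b", "c", "d", "e"],
})
keys = ["Gender", "Transfer"]

kept_first = students.drop_duplicates(keys, keep="first")  # keeps a, drops c
kept_last = students.drop_duplicates(keys, keep="last")    # keeps c, drops a
kept_none = students.drop_duplicates(keys, keep=False)     # drops both a and c

print(kept_first["Name"].tolist())
print(kept_last["Name"].tolist())
print(kept_none["Name"].tolist())
```

In the real dataset every Gender/Transfer combination occurs more than once, which is why keep=False leaves nothing there.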

4. Replacement functions

Replacement usually targets a single attribute, so the examples below all use a Series.

【a】Mapping replacement: replace

df['Grade'].head()
0     Freshman
1     Freshman
2       Senior
3    Sophomore
4    Sophomore
Name: Grade, dtype: object
df['Grade'].replace({'Freshman':1, 'Sophomore':2, 'Junior':3, 'Senior':4}).head()  # passing a dict
0    1
1    1
2    4
3    2
4    2
Name: Grade, dtype: int64
df['Grade'].replace(['Freshman','Sophomore','Junior','Senior'], [1,2,3,4]).head()  # passing two lists
0    1
1    1
2    4
3    2
4    2
Name: Grade, dtype: int64
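
One detail worth checking is what happens to values that the mapping does not mention. In the sketch below the value 'Graduate' is a made-up extra category, and the Series is created with dtype=object so the mixed result is unambiguous.

```python
import pandas as pd

grades = pd.Series(["Freshman", "Senior", "Sophomore", "Graduate"], dtype=object)
mapping = {"Freshman": 1, "Sophomore": 2, "Junior": 3, "Senior": 4}

# Values found in the mapping are swapped; anything else is left untouched.
replaced = grades.replace(mapping)

print(replaced.tolist())
```

Because 'Graduate' has no entry in the mapping it stays as a string, so the result remains an object Series rather than an integer one.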

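Special parameter: method (recent pandas 2.x releases deprecate this parameter of replace)

s = pd.Series(['dawang','man','yiyi','woman','woman','dawang'])  # below, 'man' and 'woman' are replaced by the neighbouring 'dawang' and 'yiyi' values
s.replace(['man','woman'], method='ffill')  # method='ffill': replace with the nearest preceding value that is not itself being replaced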
0    dawang
1    dawang
2      yiyi
3      yiyi
4      yiyi
5    dawang
dtype: object
s.replace(['man','woman'], method='bfill')  # method='bfill': replace with the nearest following value that is not itself being replaced
0    dawang
1      yiyi
2      yiyi
3    dawang
4    dawang
5    dawang
dtype: object

【b】Logical replacement: where and mask

s = pd.Series([-1,1,-1,1,1,-1])

where: replaces the rows where the condition passed in is False

s.where(s<0)  # no replacement value given, so NaN is used
0   -1.0
1    NaN
2   -1.0
3    NaN
4    NaN
5   -1.0
dtype: float64
s.where(s<0, 'greater than 0')  # replacement value given
0                -1
1    greater than 0
2                -1
3    greater than 0
4    greater than 0
5                -1
dtype: object

mask: replaces the rows where the condition passed in is True

s.mask(s<0)
0    NaN
1    1.0
2    NaN
3    1.0
4    1.0
5    NaN
dtype: float64
s.mask(s<0, 'less than 0')
0    less than 0
1              1
2    less than 0
3              1
4              1
5    less than 0
dtype: object
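
Since where replaces on False and mask replaces on True, each one is the other with the condition negated. A minimal check with a numeric replacement value:

```python
import pandas as pd

signs = pd.Series([-1, 1, -1, 1])
is_negative = signs < 0

kept_negative = signs.where(is_negative, 0)   # non-negative entries become 0
kept_positive = signs.mask(is_negative, 0)    # negative entries become 0

# where(cond, other) gives the same result as mask(~cond, other).
equivalent = signs.where(is_negative, 0).equals(signs.mask(~is_negative, 0))

print(kept_negative.tolist(), kept_positive.tolist(), equivalent)
```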

【c】Numeric replacement: round, abs, and clip

s = 0.666 * pd.Series(range(-2,3))
s
0   -1.332
1   -0.666
2    0.000
3    0.666
4    1.332
dtype: float64
s.clip(-0.5, 1) # values below -0.5 become -0.5, values above 1 become 1, values between -0.5 and 1 are unchanged
0   -0.500
1   -0.500
2    0.000
3    0.666
4    1.000
dtype: float64
s.round(2)  # round to two decimal places
0   -1.33
1   -0.67
2    0.00
3    0.67
4    1.33
dtype: float64
s.abs()  # absolute value
0    1.332
1    0.666
2    0.000
3    0.666
4    1.332
dtype: float64
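
clip can be read as two logical replacements in a row, one per bound. The sketch below rebuilds it from mask, purely as an illustration of the semantics rather than of how pandas implements it.

```python
import pandas as pd

values = pd.Series([-1.332, -0.666, 0.0, 0.666, 1.332])

clipped = values.clip(-0.5, 1)

# Same effect, spelled out: raise everything below the lower bound,
# then lower everything above the upper bound.
manual = values.mask(values < -0.5, -0.5).mask(values > 1, 1.0)

print(clipped.tolist())
print(manual.equals(clipped))
```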

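5. Sorting functions

Sorting comes in two kinds, sorting by value and sorting by index. To demonstrate, first use set_index to make the Grade and Name columns the index: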

df_demo = df[['Grade', 'Name', 'Height', 'Weight']].set_index(['Grade','Name'])
df_demo.head()
                          Height  Weight
Grade     Name
Freshman  Gaopeng Yang     158.9    46.0
          Changqiang You   166.5    70.0
Senior    Mei Sun          188.9    89.0
Sophomore Xiaojuan Sun       NaN    41.0
          Gaojuan You      174.0    74.0

【a】Sorting by value: sort_values

df_demo.sort_values('Height').head()  # sort by height in ascending order
                         Height  Weight
Grade     Name
Junior    Xiaoli Chu      145.4    34.0
Senior    Gaomei Lv       147.3    34.0
Sophomore Peng Han        147.8    34.0
Senior    Changli Lv      148.7    41.0
Sophomore Changjuan You   150.5    40.0
df_demo.sort_values('Height', ascending=False).head()  # sort by height in descending order
                        Height  Weight
Grade    Name
Senior   Xiaoqiang Qin   193.9    79.0
         Mei Sun         188.9    89.0
         Gaoli Zhao      186.5    83.0
Freshman Qiang Han       185.3    87.0
Senior   Qiang Zheng     183.9    87.0

Sorting by multiple columns: sort by weight in descending order, and break ties in weight by height in ascending order.

df_demo.sort_values(['Weight','Height'],ascending=[False,True]).head()
                         Height  Weight
Grade    Name
Senior   Mei Sun          188.9    89.0
         Qiang Zheng      183.9    87.0
Freshman Qiang Han        185.3    87.0
         Chunli Zhao      180.2    83.0
         Changpeng Zhao   181.3    83.0

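【b】Sorting by index: sort_index

Index sorting works like value sorting, except that the sort keys live in the index; the level parameter gives the index levels to sort on, by name or by position. PS: strings are ordered alphabetically.

df_demo.sort_index(level=['Grade','Name'],ascending=[False,True]).head()  # 'Grade' in descending order (z→a), 'Name' in ascending order (a→z)
                           Height  Weight
Grade     Name
Sophomore Changjuan You     150.5    40.0
          Changmei Xu       151.6    43.0
          Changqiang Qian   167.6    64.0
          Chengli You       164.1    57.0
          Chengqiang Lv     166.8    53.0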
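
Both kinds of sorting can be checked on four invented rows, where the expected order is easy to work out by hand. The shortened names are placeholders, not rows from the dataset.

```python
import pandas as pd

small = pd.DataFrame({
    "Grade":  ["Senior", "Freshman", "Senior", "Freshman"],
    "Name":   ["Mei", "Qiang", "Li", "Gao"],
    "Height": [188.9, 185.3, 160.9, 158.9],
    "Weight": [89.0, 87.0, 87.0, 46.0],
}).set_index(["Grade", "Name"])

# Weight descending; the two rows tied at 87.0 are ordered by Height ascending.
by_value = small.sort_values(["Weight", "Height"], ascending=[False, True])

# Grade descending (Senior before Freshman), Name ascending within each grade.
by_index = small.sort_index(level=["Grade", "Name"], ascending=[False, True])

print(by_value.index.get_level_values("Name").tolist())
print(by_index.index.tolist())
```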

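6. The apply method

With apply, a custom function that chains several operations can be applied to each column:

df_demo = df[['Height', 'Weight']]  # back to the two numeric columns with the default integer index
df_demo.apply(lambda x:(x-x.mean()).std())  # subtract the column mean from each value, then take the standard deviation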
Height     8.608879
Weight    12.824294
dtype: float64

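If axis=1 is specified, each call to the function receives a Series made up of the elements of one row; here the result is the standard deviation of each row's Height and Weight values.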

df_demo.apply(lambda x:(x-x.mean()).std(), axis=1).head()
0    79.832356
1    68.235804
2    70.639967
3          NaN
4    70.710678
dtype: float64
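
To make the difference between the two directions concrete, the sketch below applies one function per column and another per row on made-up data. The row-wise function (max minus min) is chosen only for illustration.

```python
import numpy as np
import pandas as pd

body = pd.DataFrame({"Height": [160.0, 170.0, 180.0],
                     "Weight": [50.0, 60.0, 70.0]})

# Default axis=0: the lambda receives one column at a time.
# Centering does not change the spread, so this matches body.std().
per_column = body.apply(lambda col: (col - col.mean()).std())

# axis=1: the lambda receives one row at a time.
per_row = body.apply(lambda row: row.max() - row.min(), axis=1)

print(per_column)
print(per_row)
```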

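IV. Window Objects

pandas has three kinds of windows. The two used most often are the rolling window, rolling, and the expanding window, expanding; the third is the exponentially weighted window, ewm.

1. Rolling windows

.rolling: returns a rolling-window object; the window parameter sets the window size.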

s = pd.Series([1,2,3,4,5])
roller = s.rolling(window = 2)
roller
Rolling [window=2,center=False,axis=0]

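Once you have a rolling-window object, you can compute with the corresponding aggregation functions. The number of elements taken each time equals the value of window. The steps for the mean are as follows:

At the first position the window is not yet full, so the result is a missing value.

At the second position the window holds the first and second elements, so the mean is (1+2)/2=1.5.

At the third position the window holds the second and third elements, so the mean is (2+3)/2=2.5.

And so on.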

roller.mean()
0    NaN
1    1.5
2    2.5
3    3.5
4    4.5
dtype: float64
roller.apply(lambda x:x.mean())  # equivalent to taking the mean
0    NaN
1    1.5
2    2.5
3    3.5
4    4.5
dtype: float64
roller.sum()  # follows the same procedure as the mean
0    NaN
1    3.0
2    5.0
3    7.0
4    9.0
dtype: float64
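
The steps above can be reproduced by hand and compared with the rolling result. The sketch also shows the min_periods parameter of rolling, which allows a result even when the window is not yet full.

```python
import pandas as pd

s_roll = pd.Series([1, 2, 3, 4, 5])

rolled = s_roll.rolling(window=2).mean()

# Manual version: average each element with the one before it.
manual = [(s_roll[i - 1] + s_roll[i]) / 2 for i in range(1, len(s_roll))]

# With min_periods=1 the first position uses the single value available.
partial = s_roll.rolling(window=2, min_periods=1).mean()

print(rolled.tolist())
print(manual)
print(partial.tolist())
```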

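2. Expanding windows

An expanding window, also called a cumulative window, runs from the start of the sequence to the position currently being processed, and the aggregation function is applied to each of these progressively growing windows. The steps for the mean are as follows:

At the first position the window size is 1 and the mean is 1/1=1.

At the second position the window size is 2 and the mean is (1+3)/2=2.

At the third position the window size is 3 and the mean is (1+3+6)/3=3.3333.

And so on.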

s = pd.Series([1, 3, 6, 10])
s.expanding().mean()
0    1.000000
1    2.000000
2    3.333333
3    5.000000
dtype: float64
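
Because each expanding window starts at the beginning of the sequence, the expanding mean is the running total divided by the number of elements seen so far. A short check of that reading:

```python
import numpy as np
import pandas as pd

s_exp = pd.Series([1, 3, 6, 10])

expanded = s_exp.expanding().mean()

# Running total divided by the running count 1, 2, 3, 4.
manual = s_exp.cumsum() / np.arange(1, len(s_exp) + 1)

print(expanded.tolist())
print(manual.tolist())
```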
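The third window type, the exponentially weighted window ewm, gives recent observations more weight; alpha is the smoothing factor:

s.ewm(alpha=0.2).mean().head()
0    1.000000
1    2.111111
2    3.704918
3    5.837398
dtype: float64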
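
For readers who want to see where exponentially weighted values come from, the sketch below recomputes them by hand. It assumes the pandas default adjust=True, under which the value at position t is a weighted average of all elements up to t, with weight (1 - alpha)**i for the element i steps back. The helper `manual_ewm` is a hypothetical name written for this example.

```python
import numpy as np
import pandas as pd

alpha = 0.2
s_ewm = pd.Series([1.0, 3.0, 6.0, 10.0])

ewm_mean = s_ewm.ewm(alpha=alpha).mean()  # adjust=True is the default

def manual_ewm(values, alpha):
    """Weighted average with weights (1 - alpha)**i, newest element first."""
    out = []
    for t in range(len(values)):
        recent_first = values[t::-1]                       # x_t, x_{t-1}, ..., x_0
        weights = [(1 - alpha) ** i for i in range(t + 1)]
        total = sum(w * x for w, x in zip(weights, recent_first))
        out.append(total / sum(weights))
    return out

manual = manual_ewm(s_ewm.tolist(), alpha)

print(ewm_mean.tolist())
print(manual)
```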

This article is based on the Datawhale team-learning materials.