Chapter 2: pandas Basics
I. Reading and Writing Files
1. Reading files
【a】`pandas` can read csv, excel, and txt files:
import numpy as np
import pandas as pd
df_csv = pd.read_csv('../data/my_csv.csv')
df_csv
 | col1 | col2 | col3 | col4 | col5 |
---|---|---|---|---|---|
0 | 2 | a | 1.4 | apple | 2020/1/1 |
1 | 3 | b | 3.4 | banana | 2020/1/2 |
2 | 6 | c | 2.5 | orange | 2020/1/5 |
3 | 5 | d | 3.2 | lemon | 2020/1/7 |
df_txt = pd.read_table('../data/my_table.txt')
df_txt
 | col1 | col2 | col3 | col4 |
---|---|---|---|---|
0 | 2 | a | 1.4 | apple 2020/1/1 |
1 | 3 | b | 3.4 | banana 2020/1/2 |
2 | 6 | c | 2.5 | orange 2020/1/5 |
3 | 5 | d | 3.2 | lemon 2020/1/7 |
df_excel = pd.read_excel('../data/my_excel.xlsx')
df_excel
 | col1 | col2 | col3 | col4 | col5 |
---|---|---|---|---|---|
0 | 2 | a | 1.4 | apple | 2020/1/1 |
1 | 3 | b | 3.4 | banana | 2020/1/2 |
2 | 6 | c | 2.5 | orange | 2020/1/5 |
3 | 5 | d | 3.2 | lemon | 2020/1/7 |
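【b】Common parameters: `header`, `usecols`, and `nrows` apply to txt, csv, and excel files alike.
`header=None` means the first line of the file is not used as the column names: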
pd.read_table('../data/my_table.txt', header=None)
 | 0 | 1 | 2 | 3 |
---|---|---|---|---|
0 | col1 | col2 | col3 | col4 |
1 | 2 | a | 1.4 | apple 2020/1/1 |
2 | 3 | b | 3.4 | banana 2020/1/2 |
3 | 6 | c | 2.5 | orange 2020/1/5 |
4 | 5 | d | 3.2 | lemon 2020/1/7 |
`usecols` is the set of columns to read; by default all columns are read:
pd.read_table('../data/my_table.txt', usecols=['col1', 'col2'])
 | col1 | col2 |
---|---|---|
0 | 2 | a |
1 | 3 | b |
2 | 6 | c |
3 | 5 | d |
`nrows` is the number of data rows to read:
pd.read_table('../data/my_table.txt', nrows=2)
 | col1 | col2 | col3 | col4 |
---|---|---|---|---|
0 | 2 | a | 1.4 | apple 2020/1/1 |
1 | 3 | b | 3.4 | banana 2020/1/2 |
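The three parameters above can be tried without any file on disk. The following is a minimal sketch, assuming a small in-memory table (the `raw` string is made up for illustration and is not one of the chapter's data files):

```python
import io
import pandas as pd

# Hypothetical in-memory stand-in for a small csv file.
raw = "col1,col2,col3\n2,a,1.4\n3,b,3.4\n6,c,2.5\n"

# header=None: the first line is treated as data, so columns are named 0, 1, 2.
no_header = pd.read_csv(io.StringIO(raw), header=None)

# usecols + nrows: keep two columns and only the first two data rows.
subset = pd.read_csv(io.StringIO(raw), usecols=["col1", "col2"], nrows=2)

print(no_header)
print(subset)
```

Combining `usecols` and `nrows` is a cheap way to peek at a large file before loading all of it.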
【c】For txt files whose delimiter is not the default tab character, the `sep` parameter sets a custom delimiter.
pd.read_table('../data/my_table_special_sep.txt')
 | col1 |||| col2 |
---|---|
0 | TS |||| This is an apple. |
1 | GQ |||| My name is Bob. |
2 | WT |||| Well done! |
3 | PT |||| May I help you? |
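Set the delimiter with `sep` and specify `python` as the engine.
Note: `read_table` treats a multi-character `sep` as a regular expression, so each `|` must be escaped as `\|`; otherwise the file is not parsed correctly. A raw string keeps the backslashes intact.
pd.read_table('../data/my_table_special_sep.txt', sep=r' \|\|\|\| ', engine='python')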
 | col1 | col2 |
---|---|---|
0 | TS | This is an apple. |
1 | GQ | My name is Bob. |
2 | WT | Well done! |
3 | PT | May I help you? |
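The escaping rule can be checked on an in-memory copy of the data. This is a small sketch, assuming a two-row excerpt of the table shown above held in a string rather than in the original file:

```python
import io
import pandas as pd

# Two rows in the same " |||| " layout as my_table_special_sep.txt (in-memory excerpt).
raw = "col1 |||| col2\nTS |||| This is an apple.\nGQ |||| My name is Bob.\n"

# The separator is a regular expression, so each "|" is escaped;
# regex separators require the python engine.
df_sep = pd.read_table(io.StringIO(raw), sep=r" \|\|\|\| ", engine="python")

print(df_sep)
```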
2. Writing files
【a】Pass `index=False` to drop the index when saving a file:
df_csv.to_csv('../data/my_csv_saved.csv', index=False)
df_excel.to_excel('../data/my_excel_saved.xlsx', index=False)
【b】`pandas` does not define a `to_table` function; txt files are also saved with `to_csv`:
df_txt.to_csv('../data/my_txt_saved.txt', sep='\t', index=False)
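A quick round trip shows that a tab-separated file written with `to_csv` can be read back with `read_table`. This is a sketch using an in-memory buffer instead of a file path; the `df_small` frame is invented for illustration:

```python
import io
import pandas as pd

# Small illustrative frame (not one of the chapter's data files).
df_small = pd.DataFrame({"col1": [2, 3], "col2": ["a", "b"]})

# Write tab-separated text without the index, as in the txt example above.
buf = io.StringIO()
df_small.to_csv(buf, sep="\t", index=False)
text = buf.getvalue()

# read_table uses a tab separator by default, so it reads the text straight back.
restored = pd.read_table(io.StringIO(text))

print(text)
print(restored)
```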
【c】Converting a table to markdown or latex (`to_markdown` needs the optional `tabulate` package):
df_csv
 | col1 | col2 | col3 | col4 | col5 |
---|---|---|---|---|---|
0 | 2 | a | 1.4 | apple | 2020/1/1 |
1 | 3 | b | 3.4 | banana | 2020/1/2 |
2 | 6 | c | 2.5 | orange | 2020/1/5 |
3 | 5 | d | 3.2 | lemon | 2020/1/7 |
print(df_csv.to_markdown())
| | col1 | col2 | col3 | col4 | col5 |
|---:|-------:|:-------|-------:|:-------|:---------|
| 0 | 2 | a | 1.4 | apple | 2020/1/1 |
| 1 | 3 | b | 3.4 | banana | 2020/1/2 |
| 2 | 6 | c | 2.5 | orange | 2020/1/5 |
| 3 | 5 | d | 3.2 | lemon | 2020/1/7 |
print(df_csv.to_latex())
\begin{tabular}{lrlrll}
\toprule
{} & col1 & col2 & col3 & col4 & col5 \\
\midrule
0 & 2 & a & 1.4 & apple & 2020/1/1 \\
1 & 3 & b & 3.4 & banana & 2020/1/2 \\
2 & 6 & c & 2.5 & orange & 2020/1/5 \\
3 & 5 & d & 3.2 & lemon & 2020/1/7 \\
\bottomrule
\end{tabular}
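II. Basic Data Structures
1. Series
【a】A `Series` stores one-dimensional `values` and consists of the data `data`, the index `index`, the storage type `dtype`, and the name of the sequence `name`: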
s = pd.Series(data = [['dawang','yiyi'],{'dawang':21},'yiyi_19'],
              index = ['list','dict','str'],
              # index = pd.Index(['list','dict','str'], name='index_name'),
              dtype = 'object',
              name = 'dawang_yiyi') # `object` is a mixed type; this example stores a list, a dict, and a string.
s
list [dawang, yiyi]
dict {'dawang': 21}
str yiyi_19
Name: dawang_yiyi, dtype: object
【b】Accessing `Series` attributes:
s.values
array([list(['dawang', 'yiyi']), {'dawang': 21}, 'yiyi_19'], dtype=object)
s.index
Index(['list', 'dict', 'str'], dtype='object')
s.dtype
dtype('O')
s.name
'dawang_yiyi'
s.shape # shape of the Series; its single entry is the length
(3,)
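2. DataFrame
A `DataFrame` stores two-dimensional `values`; on top of a `Series` it adds a column index, `columns`.
【a】Construct a `DataFrame` from a two-dimensional array plus row and column indexes: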
data = [['dawang',21,'image identification'], ['yiyi',19,'clinical medicine']]
df = pd.DataFrame(data = data,
                  index = ['row_%d'%i for i in range(2)],
                  columns=['col_0', 'col_1', 'col_2'])
df
 | col_0 | col_1 | col_2 |
---|---|---|---|
row_0 | dawang | 21 | image identification |
row_1 | yiyi | 19 | clinical medicine |
【b】Construct a `DataFrame` from a mapping of column name → data, adding a row index at the same time:
df = pd.DataFrame(data = {'col_0': ['dawang','yiyi'],
                          'col_1': [21, 19],
                          'col_2': ['imageidentification', 'clinicalmedicine']},
                  index = ['row_%d'%i for i in range(2)])
df
 | col_0 | col_1 | col_2 |
---|---|---|---|
row_0 | dawang | 21 | imageidentification |
row_1 | yiyi | 19 | clinicalmedicine |
【c】`DataFrame` + `[col_name]` selects one column and returns a `Series`:
df['col_0']
row_0 dawang
row_1 yiyi
Name: col_0, dtype: object
【d】`DataFrame` + `[col_list]` selects several columns and returns a `DataFrame`:
df[['col_0', 'col_1']]
 | col_0 | col_1 |
---|---|---|
row_0 | dawang | 21 |
row_1 | yiyi | 19 |
【e】Accessing `DataFrame` attributes:
df.values
array([['dawang', 21, 'imageidentification'],
['yiyi', 19, 'clinicalmedicine']], dtype=object)
df.index
Index(['row_0', 'row_1'], dtype='object')
df.columns
Index(['col_0', 'col_1', 'col_2'], dtype='object')
df.dtypes # returns a Series whose values are the dtypes of the corresponding columns
col_0 object
col_1 int64
col_2 object
dtype: object
df.shape
(2, 3)
df.T # transpose of the `DataFrame`
 | row_0 | row_1 |
---|---|---|
col_0 | dawang | yiyi |
col_1 | 21 | 19 |
col_2 | imageidentification | clinicalmedicine |
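The difference between selecting with a column name and selecting with a list of names is easy to miss when the list has a single entry. A minimal sketch, assuming a two-column frame modelled on the one above:

```python
import pandas as pd

df_small = pd.DataFrame({"col_0": ["dawang", "yiyi"], "col_1": [21, 19]},
                        index=["row_0", "row_1"])

one_col = df_small["col_0"]      # a column name gives a Series
two_dim = df_small[["col_0"]]    # a list of names gives a DataFrame, even with one name

print(type(one_col).__name__, one_col.shape)
print(type(two_dim).__name__, two_dim.shape)
```

Keeping the result two-dimensional is useful when later code expects `columns` to exist.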
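III. Common Basic Functions
For illustration, the following uses a synthetic dataset, `learn_pandas.csv`, which records the physical fitness test data of students from four schools.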
df = pd.read_csv('../data/learn_pandas.csv')
df.columns
Index(['School', 'Grade', 'Name', 'Gender', 'Height', 'Weight', 'Transfer',
'Test_Number', 'Test_Date', 'Time_Record'],
dtype='object')
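The column names stand for, in order: school, grade, name, gender, height, weight, whether the student is a transfer student, test session number, test date, and 1000 m time. This chapter uses only the first seven columns.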
df = df[df.columns[:7]]
df
 | School | Grade | Name | Gender | Height | Weight | Transfer |
---|---|---|---|---|---|---|---|
0 | Shanghai Jiao Tong University | Freshman | Gaopeng Yang | Female | 158.9 | 46.0 | N |
1 | Peking University | Freshman | Changqiang You | Male | 166.5 | 70.0 | N |
2 | Shanghai Jiao Tong University | Senior | Mei Sun | Male | 188.9 | 89.0 | N |
3 | Fudan University | Sophomore | Xiaojuan Sun | Female | NaN | 41.0 | N |
4 | Fudan University | Sophomore | Gaojuan You | Male | 174.0 | 74.0 | N |
... | ... | ... | ... | ... | ... | ... | ... |
195 | Fudan University | Junior | Xiaojuan Sun | Female | 153.9 | 46.0 | N |
196 | Tsinghua University | Senior | Li Zhao | Female | 160.9 | 50.0 | N |
197 | Shanghai Jiao Tong University | Senior | Chengqiang Chu | Female | 153.9 | 45.0 | N |
198 | Shanghai Jiao Tong University | Senior | Chengmei Shen | Male | 175.3 | 71.0 | N |
199 | Tsinghua University | Sophomore | Chunpeng Lv | Male | 155.7 | 51.0 | N |
200 rows × 7 columns
1. Summary functions
【a】`head` and `tail` return the first `n` and last `n` rows of the table respectively; `n` defaults to 5:
df.head(2)
 | School | Grade | Name | Gender | Height | Weight | Transfer |
---|---|---|---|---|---|---|---|
0 | Shanghai Jiao Tong University | Freshman | Gaopeng Yang | Female | 158.9 | 46.0 | N |
1 | Peking University | Freshman | Changqiang You | Male | 166.5 | 70.0 | N |
df.tail(2)
 | School | Grade | Name | Gender | Height | Weight | Transfer |
---|---|---|---|---|---|---|---|
198 | Shanghai Jiao Tong University | Senior | Chengmei Shen | Male | 175.3 | 71.0 | N |
199 | Tsinghua University | Sophomore | Chunpeng Lv | Male | 155.7 | 51.0 | N |
【b】`info` returns an overview of the table:
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 200 entries, 0 to 199
Data columns (total 7 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 School 200 non-null object
1 Grade 200 non-null object
2 Name 200 non-null object
3 Gender 200 non-null object
4 Height 183 non-null float64
5 Weight 189 non-null float64
6 Transfer 188 non-null object
dtypes: float64(2), object(5)
memory usage: 11.1+ KB
【c】`describe` returns the main statistics of the numeric columns; here the only numeric columns are "Height" and "Weight":
df.describe()
 | Height | Weight |
---|---|---|
count | 183.000000 | 189.000000 |
mean | 163.218033 | 55.015873 |
std | 8.608879 | 12.824294 |
min | 145.400000 | 34.000000 |
25% | 157.150000 | 46.000000 |
50% | 161.900000 | 51.000000 |
75% | 167.500000 | 65.000000 |
max | 193.900000 | 89.000000 |
2. Descriptive statistics functions
Common examples are `sum`, `mean`, `median`, `var`, `std`, `max`, and `min`:
df_demo = df[['Height', 'Weight']]
df_demo
 | Height | Weight |
---|---|---|
0 | 158.9 | 46.0 |
1 | 166.5 | 70.0 |
2 | 188.9 | 89.0 |
3 | NaN | 41.0 |
4 | 174.0 | 74.0 |
... | ... | ... |
195 | 153.9 | 46.0 |
196 | 160.9 | 50.0 |
197 | 153.9 | 45.0 |
198 | 175.3 | 71.0 |
199 | 155.7 | 51.0 |
200 rows × 2 columns
df_demo.mean() # mean
Height 163.218033
Weight 55.015873
dtype: float64
df_demo.max() # maximum
Height 193.9
Weight 89.0
dtype: float64
df_demo.quantile(0.75) # 75% quantile
Height 167.5
Weight 65.0
Name: 0.75, dtype: float64
df_demo.count() # number of non-missing values
Height 183
Weight 189
dtype: int64
df_demo.idxmax() # index label of the maximum
Height 193
Weight 2
dtype: int64
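All of the functions above take an aggregation parameter `axis`: `axis=0` aggregates down each column, `axis=1` aggregates across each row, and the default is 0:
df_demo.mean(axis=1).head(2) # row-wise means, showing the first two rows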
0 102.45
1 118.25
dtype: float64
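The effect of `axis` is easiest to see on a frame small enough to check by hand. A minimal sketch, assuming a frame built from the first two Height and Weight rows shown earlier:

```python
import pandas as pd

# First two rows of the Height/Weight columns, typed in by hand.
df_small = pd.DataFrame({"Height": [158.9, 166.5], "Weight": [46.0, 70.0]})

col_means = df_small.mean(axis=0)  # one value per column
row_means = df_small.mean(axis=1)  # one value per row

print(col_means)
print(row_means)
```

The row means agree with the two values printed above, (158.9 + 46.0) / 2 and (166.5 + 70.0) / 2.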
3. Unique-value functions
【a】`unique` returns the unique values of a single attribute; `nunique` counts them.
df['School'] # the School attribute of every student
0 Shanghai Jiao Tong University
1 Peking University
2 Shanghai Jiao Tong University
3 Fudan University
4 Fudan University
...
195 Fudan University
196 Tsinghua University
197 Shanghai Jiao Tong University
198 Shanghai Jiao Tong University
199 Tsinghua University
Name: School, Length: 200, dtype: object
df['School'].unique() # unique values of the School attribute
array(['Shanghai Jiao Tong University', 'Peking University',
'Fudan University', 'Tsinghua University'], dtype=object)
df['School'].nunique() # number of unique values of the School attribute
4
【b】`value_counts` returns each unique value together with its frequency.
df['School'].value_counts()
Tsinghua University 69
Shanghai Jiao Tong University 57
Fudan University 40
Peking University 34
Name: School, dtype: int64
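【c】`drop_duplicates` returns the unique values of a single attribute, or the unique combinations of several attributes.
The key parameter is `keep`: `first` keeps the row where each combination first appears, `last` keeps the row where it last appears, and `False` drops every row whose combination occurs more than once.
Unique values of a single attribute: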
df['School'].drop_duplicates(keep='first')
0 Shanghai Jiao Tong University
1 Peking University
3 Fudan University
5 Tsinghua University
Name: School, dtype: object
Unique combinations of several attributes:
df_demo = df[['Gender','Transfer','Name']]
df_demo
 | Gender | Transfer | Name |
---|---|---|---|
0 | Female | N | Gaopeng Yang |
1 | Male | N | Changqiang You |
2 | Male | N | Mei Sun |
3 | Female | N | Xiaojuan Sun |
4 | Male | N | Gaojuan You |
... | ... | ... | ... |
195 | Female | N | Xiaojuan Sun |
196 | Female | N | Li Zhao |
197 | Female | N | Chengqiang Chu |
198 | Male | N | Chengmei Shen |
199 | Male | N | Chunpeng Lv |
200 rows × 3 columns
df_demo.drop_duplicates(['Gender', 'Transfer'], keep='first')
 | Gender | Transfer | Name |
---|---|---|---|
0 | Female | N | Gaopeng Yang |
1 | Male | N | Changqiang You |
12 | Female | NaN | Peng You |
21 | Male | NaN | Xiaopeng Shen |
36 | Male | Y | Xiaojuan Qin |
43 | Female | Y | Gaoli Feng |
df_demo.drop_duplicates(['Gender', 'Transfer'], keep='last')
 | Gender | Transfer | Name |
---|---|---|---|
147 | Male | NaN | Juan You |
150 | Male | Y | Chengpeng You |
169 | Female | Y | Chengquan Qin |
194 | Female | NaN | Yanmei Qian |
197 | Female | N | Chengqiang Chu |
199 | Male | N | Chunpeng Lv |
df_demo.drop_duplicates(['Gender', 'Transfer'], keep=False)
 | Gender | Transfer | Name |
---|---|---|---|
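The three `keep` settings are easier to compare on a handful of rows. A minimal sketch, assuming a made-up five-row frame in which only the (F, N) combination repeats:

```python
import pandas as pd

# Hypothetical data: (F, N) appears at rows 0, 2, 4; (M, N) and (M, Y) appear once.
df_small = pd.DataFrame({
    "Gender":   ["F", "M", "F", "M", "F"],
    "Transfer": ["N", "N", "N", "Y", "N"],
    "Name":     ["a", "b", "c", "d", "e"],
})
keys = ["Gender", "Transfer"]

first = df_small.drop_duplicates(keys, keep="first")  # earliest row of each combination
last = df_small.drop_duplicates(keys, keep="last")    # latest row of each combination
none = df_small.drop_duplicates(keys, keep=False)     # only combinations that never repeat

print(first.index.tolist(), last.index.tolist(), none.index.tolist())
```

With `keep=False` a combination survives only if it occurs exactly once, which is why the result on the full dataset above is empty.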
4. Replacement functions
Replacement usually targets a single attribute, so the examples below all use a `Series`.
【a】Mapping replacement: `replace`
df['Grade'].head()
0 Freshman
1 Freshman
2 Senior
3 Sophomore
4 Sophomore
Name: Grade, dtype: object
df['Grade'].replace({'Freshman':1, 'Sophomore':2, 'Junior':3, 'Senior':4}).head() # passing a dict
0 1
1 1
2 4
3 2
4 2
Name: Grade, dtype: int64
df['Grade'].replace(['Freshman','Sophomore','Junior','Senior'], [1,2,3,4]).head() # passing two lists
0 1
1 1
2 4
3 2
4 2
Name: Grade, dtype: int64
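Special parameter: `method` (newer pandas releases deprecate this keyword of `replace`)
s = pd.Series(['dawang','man','yiyi','woman','woman','dawang']) # below, 'man' and 'woman' are replaced by neighbouring 'dawang' and 'yiyi' values
s.replace(['man','woman'], method='ffill') # `method='ffill'` replaces with the nearest preceding value that is not itself being replaced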
0 dawang
1 dawang
2 yiyi
3 yiyi
4 yiyi
5 dawang
dtype: object
s.replace(['man','woman'], method='bfill') # `method='bfill'` replaces with the nearest following value that is not itself being replaced
0 dawang
1 yiyi
2 yiyi
3 dawang
4 dawang
5 dawang
dtype: object
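The same forward- and backward-fill results can be obtained without the `method` keyword of `replace`. The following is one possible sketch, not the only way to do it: blank out the unwanted values with `mask` and `isin`, then fill with `ffill` or `bfill`:

```python
import pandas as pd

s = pd.Series(['dawang', 'man', 'yiyi', 'woman', 'woman', 'dawang'])

# Turn the values to be replaced into missing values first.
blanked = s.mask(s.isin(['man', 'woman']))

forward = blanked.ffill()   # fill from the nearest earlier surviving value
backward = blanked.bfill()  # fill from the nearest later surviving value

print(forward.tolist())
print(backward.tolist())
```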
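【b】Logical replacement: `where` and `mask`
s = pd.Series([-1,1,-1,1,1,-1])
`where` replaces the rows where the condition passed in is `False`:
s.where(s<0) # no replacement value given, so replaced entries become the missing value NaN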
0 -1.0
1 NaN
2 -1.0
3 NaN
4 NaN
5 -1.0
dtype: float64
s.where(s<0,'greater than 0') # with an explicit replacement value
0 -1
1 greater than 0
2 -1
3 greater than 0
4 greater than 0
5 -1
dtype: object
`mask` replaces the rows where the condition passed in is `True`:
s.mask(s<0)
0 NaN
1 1.0
2 NaN
3 1.0
4 1.0
5 NaN
dtype: float64
s.mask(s<0, 'less than 0')
0 less than 0
1 1
2 less than 0
3 1
4 1
5 less than 0
dtype: object
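Since `where` replaces where the condition is `False` and `mask` replaces where it is `True`, each can be written in terms of the other by negating the condition. A short sketch of that relationship (the replacement value 0 is chosen only for illustration):

```python
import pandas as pd

s = pd.Series([-1, 1, -1, 1, 1, -1])
cond = s < 0

kept_negative = s.where(cond, 0)   # keep values where cond is True, replace the rest
same_result = s.mask(~cond, 0)     # replace values where the negated cond is True

print(kept_negative.tolist())
print(kept_negative.equals(same_result))
```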
【c】Numeric replacement: `round`, `abs`, and `clip`
s = 0.666 * pd.Series(range(-2,3))
s
0 -1.332
1 -0.666
2 0.000
3 0.666
4 1.332
dtype: float64
s.clip(-0.5, 1) # values below -0.5 become -0.5, values above 1 become 1, values between -0.5 and 1 are unchanged
0 -0.500
1 -0.500
2 0.000
3 0.666
4 1.000
dtype: float64
s.round(2) # round to two decimal places
0 -1.33
1 -0.67
2 0.00
3 0.67
4 1.33
dtype: float64
s.abs() # absolute value
0 1.332
1 0.666
2 0.000
3 0.666
4 1.332
dtype: float64
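`clip` can be read as two logical replacements applied in a row. The sketch below rebuilds it from `where` to make the two bounds explicit; it is meant as an illustration of the semantics, not as a faster alternative:

```python
import numpy as np
import pandas as pd

s = 0.666 * pd.Series(range(-2, 3))

clipped = s.clip(-0.5, 1)

# Same effect with two where calls: raise values below -0.5, lower values above 1.
manual = s.where(s >= -0.5, -0.5).where(s <= 1, 1)

print(clipped.tolist())
print(np.allclose(clipped, manual))
```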
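5. Sorting functions
Sorting comes in two forms, sorting by value and sorting by index. First, use `set_index` to make the grade and name columns the index: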
df_demo = df[['Grade', 'Name', 'Height', 'Weight']].set_index(['Grade','Name'])
df_demo.head()
Grade | Name | Height | Weight |
---|---|---|---|
Freshman | Gaopeng Yang | 158.9 | 46.0 |
 | Changqiang You | 166.5 | 70.0 |
Senior | Mei Sun | 188.9 | 89.0 |
Sophomore | Xiaojuan Sun | NaN | 41.0 |
 | Gaojuan You | 174.0 | 74.0 |
【a】Sorting by value: `sort_values`
df_demo.sort_values('Height').head() # sort by height in ascending order
Grade | Name | Height | Weight |
---|---|---|---|
Junior | Xiaoli Chu | 145.4 | 34.0 |
Senior | Gaomei Lv | 147.3 | 34.0 |
Sophomore | Peng Han | 147.8 | 34.0 |
Senior | Changli Lv | 148.7 | 41.0 |
Sophomore | Changjuan You | 150.5 | 40.0 |
df_demo.sort_values('Height', ascending=False).head() # sort by height in descending order
Grade | Name | Height | Weight |
---|---|---|---|
Senior | Xiaoqiang Qin | 193.9 | 79.0 |
 | Mei Sun | 188.9 | 89.0 |
 | Gaoli Zhao | 186.5 | 83.0 |
Freshman | Qiang Han | 185.3 | 87.0 |
Senior | Qiang Zheng | 183.9 | 87.0 |
Sorting by several columns: rows are ordered by weight, descending; rows with equal weight are ordered by height, ascending.
df_demo.sort_values(['Weight','Height'],ascending=[False,True]).head()
Grade | Name | Height | Weight |
---|---|---|---|
Senior | Mei Sun | 188.9 | 89.0 |
 | Qiang Zheng | 183.9 | 87.0 |
Freshman | Qiang Han | 185.3 | 87.0 |
 | Chunli Zhao | 180.2 | 83.0 |
 | Changpeng Zhao | 181.3 | 83.0 |
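The tie-breaking rule for multi-column sorting can be verified on four rows. A minimal sketch, assuming invented names and measurements in which two people share the same weight:

```python
import pandas as pd

# Hypothetical data: A and B both weigh 80.0, so height decides their order.
df_small = pd.DataFrame({
    "Name":   ["A", "B", "C", "D"],
    "Height": [180.0, 170.0, 175.0, 160.0],
    "Weight": [80.0, 80.0, 70.0, 90.0],
})

# Weight descending first; within equal weights, Height ascending.
ordered = df_small.sort_values(["Weight", "Height"], ascending=[False, True])

print(ordered)
```

Each entry of the `ascending` list applies to the column at the same position in the list of sort keys.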
【b】Sorting by index: `sort_index`
Index sorting works like value sorting, except that the values being sorted live in the index; the `level` parameter gives the name or number of the index level to sort on. Note: strings are ordered alphabetically.
df_demo.sort_index(level=['Grade','Name'],ascending=[False,True]).head() # 'Grade' descending (z→a), 'Name' ascending (a→z)
Grade | Name | Height | Weight |
---|---|---|---|
Sophomore | Changjuan You | 150.5 | 40.0 |
 | Changmei Xu | 151.6 | 43.0 |
 | Changqiang Qian | 167.6 | 64.0 |
 | Chengli You | 164.1 | 57.0 |
 | Chengqiang Lv | 166.8 | 53.0 |
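6. The apply method
`apply` passes each column to a function of your own, so several operations can be combined in one call. The examples below go back to the two numeric columns:
df_demo = df[['Height', 'Weight']]
df_demo.apply(lambda x:(x-x.mean()).std()) # subtract the column mean from each value, then take the standard deviation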
Height 8.608879
Weight 12.824294
dtype: float64
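With `axis=1`, each call to the function instead receives a `Series` made up of the elements of one row, so the result holds one value per row (here, the standard deviation of that row's Height and Weight).
df_demo.apply(lambda x:(x-x.mean()).std(), axis=1).head()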
0 79.832356
1 68.235804
2 70.639967
3 NaN
4 70.710678
dtype: float64
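What the lambda receives under each `axis` setting can be checked on two rows. A minimal sketch, assuming a frame built from the first two Height and Weight rows of the dataset:

```python
import numpy as np
import pandas as pd

df_small = pd.DataFrame({"Height": [158.9, 166.5], "Weight": [46.0, 70.0]})

def centered_std(x):
    # x is one column (axis=0) or one row (axis=1), passed in as a Series.
    return (x - x.mean()).std()

by_col = df_small.apply(centered_std)          # one value per column
by_row = df_small.apply(centered_std, axis=1)  # one value per row

print(by_col)
print(by_row)
```

Subtracting the mean does not change the spread, so the column-wise result equals `df_small.std()`; the row-wise values agree with the first two rows of the output above.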
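IV. Window Objects
`pandas` has three kinds of windows; the most commonly used are the rolling window `rolling` and the expanding window `expanding`.
1. Rolling windows
`.rolling` returns a rolling-window object; the `window` parameter is the window size.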
s = pd.Series([1,2,3,4,5])
roller = s.rolling(window = 2)
roller
Rolling [window=2,center=False,axis=0]
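Once the rolling-window object exists, aggregation functions can be applied to it. Each step takes `window` elements. Taking the mean as an example:
At the first position the window is not yet full, so the result is a missing value.
At the second position the window covers the first and second elements, so the mean is (1+2)/2=1.5.
At the third position the window covers the second and third elements, so the mean is (2+3)/2=2.5.
And so on.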
roller.mean()
0 NaN
1 1.5
2 2.5
3 3.5
4 4.5
dtype: float64
roller.apply(lambda x:x.mean()) # equivalent to taking the mean
0 NaN
1 1.5
2 2.5
3 3.5
4 4.5
dtype: float64
roller.sum() # follows the same procedure as the mean
0 NaN
1 3.0
2 5.0
3 7.0
4 9.0
dtype: float64
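The step-by-step description of the rolling mean translates directly into a loop. This is a sketch of the idea for checking understanding, not how pandas computes it internally:

```python
import pandas as pd

s = pd.Series([1, 2, 3, 4, 5])
window = 2

manual = []
for i in range(len(s)):
    if i + 1 < window:
        # The window is not yet full, so the result is missing.
        manual.append(float("nan"))
    else:
        # Average the `window` elements ending at position i.
        chunk = s.iloc[i + 1 - window : i + 1]
        manual.append(chunk.sum() / window)

manual = pd.Series(manual)
builtin = s.rolling(window=window).mean()

print(manual.tolist())
print(builtin.tolist())
```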
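2. Expanding windows
An expanding window, also called a cumulative window, runs from the start of the sequence up to the current position, and the aggregation function is applied to each of these progressively larger windows. Taking the mean as an example again:
At the first position the window size is 1 and the mean is 1/1=1.
At the second position the window size is 2 and the mean is (1+3)/2=2.
At the third position the window size is 3 and the mean is (1+3+6)/3=3.3333.
And so on.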
s = pd.Series([1, 3, 6, 10])
s.expanding().mean()
0 1.000000
1 2.000000
2 3.333333
3 5.000000
dtype: float64
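Because each expanding window starts at the first element, the expanding mean is just the running sum divided by the number of elements seen so far. A short sketch of that equivalence:

```python
import numpy as np
import pandas as pd

s = pd.Series([1, 3, 6, 10])

# Running total divided by the window size 1, 2, 3, 4.
manual = s.cumsum() / np.arange(1, len(s) + 1)
builtin = s.expanding().mean()

print(manual.tolist())
print(np.allclose(manual, builtin))
```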
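The third kind, the exponentially weighted window `ewm`, gives more weight to recent observations; `alpha` is the smoothing factor:
s = pd.Series([-1, -1, -2, -2, -2]) # short illustrative series that yields the output below
s.ewm(alpha=0.2).mean().head()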
0 -1.000000
1 -1.000000
2 -1.409836
3 -1.609756
4 -1.725845
dtype: float64
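The exponentially weighted mean can be reproduced by hand. The sketch below assumes the default `adjust=True` weighting, in which the observation k steps back gets weight (1 - alpha)^k, and uses a short series chosen for illustration; `ewm_mean` is a helper written for this example, not a pandas function:

```python
import numpy as np
import pandas as pd

def ewm_mean(values, alpha):
    """Weighted mean with weights (1 - alpha)**k for the value k steps back."""
    out, num, den = [], 0.0, 0.0
    for x in values:
        num = x + (1 - alpha) * num   # weighted sum of values so far
        den = 1 + (1 - alpha) * den   # sum of the weights so far
        out.append(num / den)
    return out

s = pd.Series([-1, -1, -2, -2, -2])  # illustrative series (an assumption)

manual = ewm_mean(s.tolist(), alpha=0.2)
builtin = s.ewm(alpha=0.2).mean()

print(manual)
print(np.allclose(manual, builtin))
```

For the third element the formula gives (-2 - 0.8 - 0.64) / (1 + 0.8 + 0.64), which is about -1.409836.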