Hands-On Data Analysis Task 3 Study Notes

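Review: the earlier chapters covered the basics of pandas. Chapter 2 moves on to the practical side of data analysis. Section 1 of Chapter 2 covered data cleaning, which matters because later analysis is only as sound as the data is clean. This section covers data restructuring, which still belongs to the data understanding (preparation) stage.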

Before starting, import numpy and pandas and load the data.

# Import the basic libraries
import numpy as np
import pandas as pd
# Load train-left-up.csv from the data folder
data=pd.read_csv("data/train-left-up.csv")

2 Chapter 2: Data Restructuring

2.4 Merging Data

2.4.1 Task 1: Load all the files in the data folder and observe how they relate to one another

# Write your code here
dleftup=pd.read_csv('data/train-left-up.csv')
dleftdown=pd.read_csv('data/train-left-down.csv')
drightup=pd.read_csv('data/train-right-up.csv')
drightdown=pd.read_csv('data/train-right-down.csv')

drightup
Sex Age SibSp Parch Ticket Fare Cabin Embarked
0 male 22.0 1 0 A/5 21171 7.2500 NaN S
1 female 38.0 1 0 PC 17599 71.2833 C85 C
2 female 26.0 0 0 STON/O2. 3101282 7.9250 NaN S
3 female 35.0 1 0 113803 53.1000 C123 S
4 male 35.0 0 0 373450 8.0500 NaN S
... ... ... ... ... ... ... ... ...
434 male 50.0 1 0 13507 55.9000 E44 S
435 female 14.0 1 2 113760 120.0000 B96 B98 S
436 female 21.0 2 2 W./C. 6608 34.3750 NaN S
437 female 24.0 2 3 29106 18.7500 NaN S
438 male 64.0 1 4 19950 263.0000 C23 C25 C27 S

439 rows × 8 columns

[Hint] Compare with the train.csv data we loaded earlier and guess what each of the files above contains.

2.4.2 Task 2: Use the concat method to merge train-left-up.csv and train-right-up.csv horizontally into one table, and save that table as result_up

# Write your code here
result_up=pd.concat([dleftup,drightup],axis=1)

result_up
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
0 1 0 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.2500 NaN S
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 0 PC 17599 71.2833 C85 C
2 3 1 3 Heikkinen, Miss. Laina female 26.0 0 0 STON/O2. 3101282 7.9250 NaN S
3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 0 113803 53.1000 C123 S
4 5 0 3 Allen, Mr. William Henry male 35.0 0 0 373450 8.0500 NaN S
... ... ... ... ... ... ... ... ... ... ... ... ...
434 435 0 1 Silvey, Mr. William Baird male 50.0 1 0 13507 55.9000 E44 S
435 436 1 1 Carter, Miss. Lucile Polk female 14.0 1 2 113760 120.0000 B96 B98 S
436 437 0 3 Ford, Miss. Doolina Margaret "Daisy" female 21.0 2 2 W./C. 6608 34.3750 NaN S
437 438 1 2 Richards, Mrs. Sidney (Emily Hocking) female 24.0 2 3 29106 18.7500 NaN S
438 439 0 1 Fortune, Mr. Mark male 64.0 1 4 19950 263.0000 C23 C25 C27 S

439 rows × 12 columns
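
The horizontal merge above works because both halves share the same row index. A minimal sketch of that alignment, using small made-up frames (the names left, right and right_shifted are illustrative, not from the data set):

```python
import pandas as pd

# Hypothetical toy frames standing in for the left and right halves.
left = pd.DataFrame({"PassengerId": [1, 2, 3]})
right = pd.DataFrame({"Sex": ["male", "female", "female"]})

# Same index (0, 1, 2) on both sides: rows pair up one to one.
aligned = pd.concat([left, right], axis=1)

# Shift the index of the right half. concat matches index labels,
# not row positions, so labels present on only one side get NaN
# (the default is join='outer').
right_shifted = right.copy()
right_shifted.index = [1, 2, 3]
shifted = pd.concat([left, right_shifted], axis=1)

print(aligned.shape)
print(shifted.shape)
```

If the two files had been read with different indexes, the second case is what would happen, so it is worth checking the row count after a horizontal concat.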

2.4.3 Task 3: Use the concat method to merge train-left-down and train-right-down horizontally into one table, and save it as result_down. Then merge result_up and result_down vertically into result.

# Write your code here
result_down=pd.concat([dleftdown,drightdown],axis=1)
result=pd.concat([result_up,result_down])
result

PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
0 1 0 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.2500 NaN S
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 0 PC 17599 71.2833 C85 C
2 3 1 3 Heikkinen, Miss. Laina female 26.0 0 0 STON/O2. 3101282 7.9250 NaN S
3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 0 113803 53.1000 C123 S
4 5 0 3 Allen, Mr. William Henry male 35.0 0 0 373450 8.0500 NaN S
... ... ... ... ... ... ... ... ... ... ... ... ...
447 887 0 2 Montvila, Rev. Juozas male 27.0 0 0 211536 13.0000 NaN S
448 888 1 1 Graham, Miss. Margaret Edith female 19.0 0 0 112053 30.0000 B42 S
449 889 0 3 Johnston, Miss. Catherine Helen "Carrie" female NaN 1 2 W./C. 6607 23.4500 NaN S
450 890 1 1 Behr, Mr. Karl Howell male 26.0 0 0 111369 30.0000 C148 C
451 891 0 3 Dooley, Mr. Patrick male 32.0 0 0 370376 7.7500 NaN Q

891 rows × 12 columns

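pandas.concat(objs, # the objects to concatenate
axis=0, # direction of concatenation; the default 0 stacks along the rows (vertically)
join='outer', # take the intersection ('inner') or the union ('outer') of the other axis
ignore_index=False, # whether to discard the original index and renumber the result
keys=None, # labels for the pieces, added as the outermost index level; mainly used to build a hierarchical index; any list, array or tuple of labels
levels=None, # the levels to use for the hierarchical index when keys is set
names=None, # names for the levels of the resulting hierarchical index, as a list
verify_integrity=False, # check whether the new concatenated axis contains duplicates; raise an error if it does
sort=False, # sort the non-concatenation axis
copy=True # whether to copy the data
)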
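
The result of Task 3 keeps the original row labels, which is why its last rows are numbered 447 to 451 rather than 886 to 890. A small sketch of how ignore_index, keys and verify_integrity change that, on made-up frames (top and bottom are illustrative names):

```python
import pandas as pd

# Hypothetical stand-ins for result_up and result_down.
top = pd.DataFrame({"PassengerId": [1, 2]})
bottom = pd.DataFrame({"PassengerId": [3, 4]})

# Default: each piece keeps its own labels, so labels repeat.
stacked = pd.concat([top, bottom])

# ignore_index=True renumbers the rows from 0.
renumbered = pd.concat([top, bottom], ignore_index=True)

# keys adds an outer index level naming the source of each row.
labelled = pd.concat([top, bottom], keys=["up", "down"])

# verify_integrity=True refuses to produce duplicate labels.
try:
    pd.concat([top, bottom], verify_integrity=True)
    duplicate_detected = False
except ValueError:
    duplicate_detected = True

print(list(stacked.index))
print(list(renumbered.index))
print(labelled.loc["down"])
print(duplicate_detected)
```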

2.4.4 Task 4: Use DataFrame's own join and append methods to redo Task 2 and Task 3

# Write your code here
result_up_test=dleftup.join(drightup)
result_down_test=dleftdown.join(drightdown)
result_2=result_up_test.append(result_down_test,ignore_index=True)
result_2
C:\Users\ThinkPad\AppData\Local\Temp\ipykernel_4824\2842206337.py:4: FutureWarning: The frame.append method is deprecated and will be removed from pandas in a future version. Use pandas.concat instead.
  result_2=result_up_test.append(result_down_test,ignore_index=True)
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
0 1 0 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.2500 NaN S
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 0 PC 17599 71.2833 C85 C
2 3 1 3 Heikkinen, Miss. Laina female 26.0 0 0 STON/O2. 3101282 7.9250 NaN S
3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 0 113803 53.1000 C123 S
4 5 0 3 Allen, Mr. William Henry male 35.0 0 0 373450 8.0500 NaN S
... ... ... ... ... ... ... ... ... ... ... ... ...
886 887 0 2 Montvila, Rev. Juozas male 27.0 0 0 211536 13.0000 NaN S
887 888 1 1 Graham, Miss. Margaret Edith female 19.0 0 0 112053 30.0000 B42 S
888 889 0 3 Johnston, Miss. Catherine Helen "Carrie" female NaN 1 2 W./C. 6607 23.4500 NaN S
889 890 1 1 Behr, Mr. Karl Howell male 26.0 0 0 111369 30.0000 C148 C
890 891 0 3 Dooley, Mr. Patrick male 32.0 0 0 370376 7.7500 NaN Q

891 rows × 12 columns

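dataframe.join(other, # the other DataFrame to join
on=None, # column(s) of the caller to match against the index of other; by default the two indexes are matched
how='left', # join type: 'left', 'right', 'outer', 'inner'; the default is 'left'
lsuffix='', # suffix added to overlapping column names from the left (calling) DataFrame
rsuffix='', # suffix added to overlapping column names from the right DataFrame
sort=False) # whether to sort the result by the join key; default False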
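
In this task the two halves have no column names in common, so the suffix parameters never come into play. A short sketch of what happens when they do overlap, with made-up frames (left, right and the suffix strings are illustrative):

```python
import pandas as pd

# Hypothetical frames that both contain an "Age" column.
left = pd.DataFrame({"Age": [22.0, 38.0], "Fare": [7.25, 71.28]})
right = pd.DataFrame({"Age": [23.0, 39.0], "Cabin": [None, "C85"]})

# With overlapping column names and no suffix, join raises ValueError.
try:
    left.join(right)
    overlap_error = False
except ValueError:
    overlap_error = True

# Suffixes disambiguate the overlapping names.
joined = left.join(right, lsuffix="_left", rsuffix="_right")

print(overlap_error)
print(list(joined.columns))
```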

DataFrame.append(other,
ignore_index=False,
verify_integrity=False,
sort=False)

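Parameters:
other: the data to append. It can be a pandas DataFrame or Series, or a Python structure such as a dict or a list
ignore_index: whether to discard the original index and generate a new integer index starting from 0
verify_integrity: default False; if True, an error is raised when the result would contain duplicate index values
sort: boolean, default False in the signature above (None in older versions). If the columns of self and other are not aligned, sort the columns. This parameter was added in version 0.23.0.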
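
The FutureWarning in the output above recommends pandas.concat instead of append. A minimal sketch of the replacement, assuming two made-up frames with identical columns (up and down are illustrative names):

```python
import pandas as pd

# Hypothetical stand-ins for result_up_test and result_down_test.
up = pd.DataFrame({"PassengerId": [1, 2], "Survived": [0, 1]})
down = pd.DataFrame({"PassengerId": [3, 4], "Survived": [1, 0]})

# Plays the role of up.append(down, ignore_index=True)
# without calling the deprecated method.
combined = pd.concat([up, down], ignore_index=True)

print(combined)
```

Passing ignore_index=True gives the continuous 0 to 890 numbering seen in result_2; leaving it out keeps each piece's own labels.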

2.4.5 Task 5: Use the pandas merge method and DataFrame's append method to redo Task 2 and Task 3

# Write your code here
dup=dleftup.merge(drightup,left_index=True,right_index=True)
ddown=dleftdown.merge(drightdown,left_index=True,right_index=True)
result_3=dup.append(ddown)
result_3
C:\Users\ThinkPad\AppData\Local\Temp\ipykernel_4824\3296784267.py:4: FutureWarning: The frame.append method is deprecated and will be removed from pandas in a future version. Use pandas.concat instead.
  result_3=dup.append(ddown)
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
0 1 0 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.2500 NaN S
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 0 PC 17599 71.2833 C85 C
2 3 1 3 Heikkinen, Miss. Laina female 26.0 0 0 STON/O2. 3101282 7.9250 NaN S
3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 0 113803 53.1000 C123 S
4 5 0 3 Allen, Mr. William Henry male 35.0 0 0 373450 8.0500 NaN S
... ... ... ... ... ... ... ... ... ... ... ... ...
447 887 0 2 Montvila, Rev. Juozas male 27.0 0 0 211536 13.0000 NaN S
448 888 1 1 Graham, Miss. Margaret Edith female 19.0 0 0 112053 30.0000 B42 S
449 889 0 3 Johnston, Miss. Catherine Helen "Carrie" female NaN 1 2 W./C. 6607 23.4500 NaN S
450 890 1 1 Behr, Mr. Karl Howell male 26.0 0 0 111369 30.0000 C148 C
451 891 0 3 Dooley, Mr. Patrick male 32.0 0 0 370376 7.7500 NaN Q

891 rows × 12 columns

merge(
left,
right,
how="inner",
on=None,
left_on=None,
right_on=None,
left_index=False,
right_index=False,
sort=False,
suffixes=("_x", "_y"),
copy=True,
indicator=False,
validate=None,
)
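
Task 5 only uses left_index and right_index. As a sketch of two other parameters in the signature, how and indicator, here is a key-based merge on made-up frames (the frames and their values are illustrative):

```python
import pandas as pd

# Hypothetical frames sharing a PassengerId key column,
# with only partly overlapping key values.
left = pd.DataFrame({"PassengerId": [1, 2, 3], "Survived": [0, 1, 1]})
right = pd.DataFrame({"PassengerId": [2, 3, 4],
                      "Sex": ["female", "female", "male"]})

# Default how="inner": only keys present on both sides survive.
inner = left.merge(right, on="PassengerId")

# how="outer" keeps every key; indicator=True adds a _merge column
# recording whether each row came from the left, the right, or both.
outer = left.merge(right, on="PassengerId", how="outer", indicator=True)

print(inner)
print(outer)
```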

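[Think] Compare merge, join and concat: how do they differ and what do they share? Why do Task 4 and Task 5 both require DataFrame's append method? If only merge or only join were allowed, could Task 4 and Task 5 still be completed?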

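DataFrame has an instance method join which, by default, matches on the index like merge with left_index=True and right_index=True. The difference is that join defaults to how='left' while merge defaults to how='inner'.
append only adds rows (a top-bottom merge). As the warnings above show, it is deprecated in favour of pandas.concat.
merge and join have no axis parameter. Both combine columns side by side, matching rows by key or by index, so they are not designed to stack one table under another. That is why Task 4 and Task 5 pair them with append.
concat is the method that takes axis: axis=1 merges left-right and axis=0 merges top-bottom.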
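
To make the comparison concrete, here is a minimal sketch that does the same left-right merge three ways on made-up frames that share an index and have no overlapping column names (the frames are illustrative):

```python
import pandas as pd

# Hypothetical left and right halves with the same index 0, 1.
left = pd.DataFrame({"PassengerId": [1, 2], "Survived": [0, 1]})
right = pd.DataFrame({"Sex": ["male", "female"], "Age": [22.0, 38.0]})

# Three routes to the same side-by-side table.
via_concat = pd.concat([left, right], axis=1)
via_join = left.join(right)
via_merge = left.merge(right, left_index=True, right_index=True)

print(via_concat)
print(via_join)
print(via_merge)
```

On index-aligned input like this the three results hold the same columns and values. They start to differ once the indexes do not match, because concat defaults to an outer join, join to a left join, and merge to an inner join.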

2.4.6 Task 6: Save the completed data as result.csv

# Write your code here
result_3.to_csv('data/result.csv')
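
Note that to_csv writes the row index as an extra first column by default, and result_3 still carries the repeated labels from its two halves. A small sketch of the difference index=False makes, using an in-memory buffer instead of a file (frame and the buffer names are illustrative):

```python
import io
import pandas as pd

# Hypothetical frame with a repeated index label, like result_3.
frame = pd.DataFrame({"PassengerId": [1, 2], "Fare": [7.25, 71.2833]},
                     index=[0, 0])

# Default: the index is written out and comes back as an unnamed column.
with_index = io.StringIO()
frame.to_csv(with_index)
with_index.seek(0)
reloaded_with_index = pd.read_csv(with_index)

# index=False: only the data columns are written.
without_index = io.StringIO()
frame.to_csv(without_index, index=False)
without_index.seek(0)
reloaded_without_index = pd.read_csv(without_index)

print(list(reloaded_with_index.columns))
print(list(reloaded_without_index.columns))
```

Whether to keep the index column depends on how the file will be read later; the original call above keeps it.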

2.5 Looking at the Data from Another Angle

2.5.1 Task 1: Convert our data into a Series

# Write your code here

result_stack=result_3.stack()
result_stack
0    PassengerId                          1
     Survived                             0
     Pclass                               3
     Name           Braund, Mr. Owen Harris
     Sex                               male
                             ...           
451  SibSp                                0
     Parch                                0
     Ticket                          370376
     Fare                              7.75
     Embarked                             Q
Length: 9826, dtype: object

stack() "stacks" the data: it pivots the columns into the row index.
unstack() is the inverse of stack(): it pivots an index level back into columns.
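
A minimal round-trip sketch of stack() and unstack() on a made-up two-column frame (frame, stacked and restored are illustrative names). The stacked output above has length 9826, fewer than 891 × 12 = 10692, presumably because stack() in the pandas version used here drops NaN cells by default.

```python
import pandas as pd

# Hypothetical frame with two rows and two columns, no missing values.
frame = pd.DataFrame({"Survived": [0, 1], "Pclass": [3, 1]})

# stack: the column labels become the inner level of a MultiIndex,
# and the result is a Series with one entry per cell.
stacked = frame.stack()

# unstack: the inner index level moves back out into columns.
restored = stacked.unstack()

print(type(stacked))
print(stacked)
print(restored)
```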

result_3
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
0 1 0 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.2500 NaN S
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 0 PC 17599 71.2833 C85 C
2 3 1 3 Heikkinen, Miss. Laina female 26.0 0 0 STON/O2. 3101282 7.9250 NaN S
3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 0 113803 53.1000 C123 S
4 5 0 3 Allen, Mr. William Henry male 35.0 0 0 373450 8.0500 NaN S
... ... ... ... ... ... ... ... ... ... ... ... ...
447 887 0 2 Montvila, Rev. Juozas male 27.0 0 0 211536 13.0000 NaN S
448 888 1 1 Graham, Miss. Margaret Edith female 19.0 0 0 112053 30.0000 B42 S
449 889 0 3 Johnston, Miss. Catherine Helen "Carrie" female NaN 1 2 W./C. 6607 23.4500 NaN S
450 890 1 1 Behr, Mr. Karl Howell male 26.0 0 0 111369 30.0000 C148 C
451 891 0 3 Dooley, Mr. Patrick male 32.0 0 0 370376 7.7500 NaN Q

891 rows × 12 columns

# Write your code here
type(result_stack)

pandas.core.series.Series