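Hands-On Data Analysis: Task 3 Study Notes
Review: Earlier we covered the basics of pandas. Chapter 2 moves into the working part of data analysis. In Section 1 of Chapter 2 we studied data cleaning, which matters a great deal: only once the data are reasonably clean can the later analysis carry weight. In this section we turn to data restructuring, which still belongs to the data understanding (preparation) stage.
Before starting, import the numpy and pandas packages and load the data.
# Import the basic libraries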
import numpy as np
import pandas as pd
# Load train-left-up.csv from the data folder
data=pd.read_csv("data/train-left-up.csv")
2 Chapter 2: Data Restructuring
2.4 Merging data
2.4.1 Task 1: Load all the files in the data folder and observe how they relate to one another
# Write your code here
dleftup=pd.read_csv('data/train-left-up.csv')
dleftdown=pd.read_csv('data/train-left-down.csv')
drightup=pd.read_csv('data/train-right-up.csv')
drightdown=pd.read_csv('data/train-right-down.csv')
drightup
| | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
---|---|---|---|---|---|---|---|---|
0 | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S |
1 | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
2 | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S |
3 | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S |
4 | male | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | S |
... | ... | ... | ... | ... | ... | ... | ... | ... |
434 | male | 50.0 | 1 | 0 | 13507 | 55.9000 | E44 | S |
435 | female | 14.0 | 1 | 2 | 113760 | 120.0000 | B96 B98 | S |
436 | female | 21.0 | 2 | 2 | W./C. 6608 | 34.3750 | NaN | S |
437 | female | 24.0 | 2 | 3 | 29106 | 18.7500 | NaN | S |
438 | male | 64.0 | 1 | 4 | 19950 | 263.0000 | C23 C25 C27 | S |
439 rows × 8 columns
[Hint] Compare with the train.csv data we loaded earlier and make a rough guess at what the data above are.
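One way to test the guess is to compare the shape, column names, and index of each piece. The sketch below does this on a made-up four-row table that stands in for train.csv; the variable names and values are illustrative assumptions, not the real files.

```python
import pandas as pd

# Toy stand-in for train.csv (column names follow the tables above, values are made up).
full = pd.DataFrame({
    "PassengerId": [1, 2, 3, 4],
    "Survived": [0, 1, 1, 1],
    "Sex": ["male", "female", "female", "female"],
    "Fare": [7.25, 71.2833, 7.925, 53.1],
})

# Cut the table into four blocks: left/right by columns, up/down by rows.
# The "down" blocks get a fresh 0-based index, as the real down files appear to have.
left_up = full.iloc[:2, :2]
right_up = full.iloc[:2, 2:]
left_down = full.iloc[2:, :2].reset_index(drop=True)
right_down = full.iloc[2:, 2:].reset_index(drop=True)

pieces = {
    "left_up": left_up,
    "right_up": right_up,
    "left_down": left_down,
    "right_down": right_down,
}
for name, piece in pieces.items():
    # Same row count + same index -> side-by-side halves; same columns -> stacked halves.
    print(name, piece.shape, list(piece.columns), list(piece.index))
```

Pieces that share an index and row count but have different columns belong side by side; pieces that share columns belong one above the other.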
2.4.2 Task 2: Using the concat method, merge train-left-up.csv and train-right-up.csv horizontally into one table and save it as result_up
# Write your code here
result_up=pd.concat([dleftup,drightup],axis=1)
result_up
| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S |
1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S |
3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S |
4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | S |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
434 | 435 | 0 | 1 | Silvey, Mr. William Baird | male | 50.0 | 1 | 0 | 13507 | 55.9000 | E44 | S |
435 | 436 | 1 | 1 | Carter, Miss. Lucile Polk | female | 14.0 | 1 | 2 | 113760 | 120.0000 | B96 B98 | S |
436 | 437 | 0 | 3 | Ford, Miss. Doolina Margaret "Daisy" | female | 21.0 | 2 | 2 | W./C. 6608 | 34.3750 | NaN | S |
437 | 438 | 1 | 2 | Richards, Mrs. Sidney (Emily Hocking) | female | 24.0 | 2 | 3 | 29106 | 18.7500 | NaN | S |
438 | 439 | 0 | 1 | Fortune, Mr. Mark | male | 64.0 | 1 | 4 | 19950 | 263.0000 | C23 C25 C27 | S |
439 rows × 12 columns
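The horizontal merge works here because both files carry the same 0-based index: with axis=1, concat lines rows up by index label, not by position. The following is a small sketch on made-up frames (the names left, right, aligned, and positional are illustrative) showing what happens when the labels do not match.

```python
import pandas as pd

left = pd.DataFrame({"PassengerId": [1, 2, 3]})                # index 0, 1, 2
right = pd.DataFrame({"Sex": ["male", "female", "female"]},
                     index=[1, 2, 3])                          # index shifted by one

# Rows are matched on index labels; the default join='outer' keeps the
# union of labels (0..3) and fills the gaps with NaN.
aligned = pd.concat([left, right], axis=1)
print(aligned)

# Resetting the index first gives a purely positional side-by-side merge.
positional = pd.concat([left, right.reset_index(drop=True)], axis=1)
print(positional)
```

If a horizontal concat produces more rows than either input, mismatched index labels are a likely cause.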
2.4.3 Task 3: Using the concat method, merge train-left-down and train-right-down horizontally into one table and save it as result_down. Then merge result_up and result_down vertically into result.
# Write your code here
result_down=pd.concat([dleftdown,drightdown],axis=1)
result=pd.concat([result_up,result_down])
result
| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S |
1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S |
3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S |
4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | S |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
447 | 887 | 0 | 2 | Montvila, Rev. Juozas | male | 27.0 | 0 | 0 | 211536 | 13.0000 | NaN | S |
448 | 888 | 1 | 1 | Graham, Miss. Margaret Edith | female | 19.0 | 0 | 0 | 112053 | 30.0000 | B42 | S |
449 | 889 | 0 | 3 | Johnston, Miss. Catherine Helen "Carrie" | female | NaN | 1 | 2 | W./C. 6607 | 23.4500 | NaN | S |
450 | 890 | 1 | 1 | Behr, Mr. Karl Howell | male | 26.0 | 0 | 0 | 111369 | 30.0000 | C148 | C |
451 | 891 | 0 | 3 | Dooley, Mr. Patrick | male | 32.0 | 0 | 0 | 370376 | 7.7500 | NaN | Q |
891 rows × 12 columns
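pandas.concat(objs, # objects to concatenate
axis=0, # axis to concatenate along; the default 0 stacks vertically (along the rows)
join='outer', # how to handle the other axis: intersection ('inner') or union ('outer')
ignore_index=False, # whether to discard the existing index and renumber the result
keys=None, # labels that tag each input along the concatenation axis; mainly used to build a hierarchical index; may be any list, array, or sequence of tuples
levels=None, # specific levels to use for the hierarchical index when keys is set
names=None, # names for the levels of the resulting hierarchical index, given as a list
verify_integrity=False, # check whether the concatenated axis contains duplicate labels; raise an error if it does
sort=False, # whether to sort the non-concatenation axis
copy=True # whether to copy the data
)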
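The vertical merge in Task 3 keeps the original row labels, which is why the last row of result is labelled 451 rather than 890. Here is a brief sketch, on made-up two-row frames, of how ignore_index, keys, and verify_integrity change that; the frame names are illustrative.

```python
import pandas as pd

up = pd.DataFrame({"PassengerId": [1, 2], "Fare": [7.25, 71.2833]})
down = pd.DataFrame({"PassengerId": [3, 4], "Fare": [7.925, 53.1]})

# Default: original labels are kept, so the index repeats (0, 1, 0, 1).
stacked = pd.concat([up, down])

# ignore_index=True renumbers the rows 0..n-1.
renumbered = pd.concat([up, down], ignore_index=True)

# keys= adds an outer index level recording which input each row came from.
labelled = pd.concat([up, down], keys=["up", "down"])
print(labelled)

# verify_integrity=True refuses to build an index with duplicate labels.
try:
    pd.concat([up, down], verify_integrity=True)
    duplicate_detected = False
except ValueError:
    duplicate_detected = True
print("duplicate labels detected:", duplicate_detected)
```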
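2.4.4 Task 4: Using DataFrame's own join and append methods, complete Tasks 2 and 3
# Write your code here
result_up_test=dleftup.join(drightup)
result_down_test=dleftdown.join(drightdown)
# Note: DataFrame.append is deprecated (see the warning below) and was removed in pandas 2.0, so this line needs an older pandas
result_2=result_up_test.append(result_down_test,ignore_index=True)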
result_2
C:\Users\ThinkPad\AppData\Local\Temp\ipykernel_4824\2842206337.py:4: FutureWarning: The frame.append method is deprecated and will be removed from pandas in a future version. Use pandas.concat instead.
result_2=result_up_test.append(result_down_test,ignore_index=True)
| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S |
1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S |
3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S |
4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | S |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
886 | 887 | 0 | 2 | Montvila, Rev. Juozas | male | 27.0 | 0 | 0 | 211536 | 13.0000 | NaN | S |
887 | 888 | 1 | 1 | Graham, Miss. Margaret Edith | female | 19.0 | 0 | 0 | 112053 | 30.0000 | B42 | S |
888 | 889 | 0 | 3 | Johnston, Miss. Catherine Helen "Carrie" | female | NaN | 1 | 2 | W./C. 6607 | 23.4500 | NaN | S |
889 | 890 | 1 | 1 | Behr, Mr. Karl Howell | male | 26.0 | 0 | 0 | 111369 | 30.0000 | C148 | C |
890 | 891 | 0 | 3 | Dooley, Mr. Patrick | male | 32.0 | 0 | 0 | 370376 | 7.7500 | NaN | Q |
891 rows × 12 columns
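dataframe.join(other, # the other DataFrame to join
on=None, # column(s) of the calling DataFrame to join on; None joins index on index
how='left', # join type: 'left', 'right', 'outer', 'inner'; the default is 'left'
lsuffix='', # suffix added to overlapping column names from the left (calling) DataFrame
rsuffix='', # suffix added to overlapping column names from the right DataFrame
sort=False) # whether to sort the result by the join key; the default is False
DataFrame.append(other,
ignore_index=False,
verify_integrity=False,
sort=False)
Parameters:
other: the data to append. It can be a pandas DataFrame or Series, or a Python dict or list.
ignore_index: whether to discard the original index and generate a new integer index (0, 1, 2, ...).
verify_integrity: False by default; if True, an exception is raised when the result would contain duplicate index labels.
sort: boolean. If the columns of self and other are not aligned, sort the columns. This parameter was added in version 0.23.0, where it defaulted to None; the signature above shows the later default of False.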
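Because the warning above recommends pandas.concat in place of append, the same two steps can be written so that they also run on pandas 2.0 and later. This is a minimal sketch on made-up two-row pieces, not the original files; join does the horizontal step and concat stands in for append.

```python
import pandas as pd

# Toy versions of the four pieces (values borrowed from the first rows shown above).
left_up = pd.DataFrame({"PassengerId": [1, 2], "Survived": [0, 1]})
right_up = pd.DataFrame({"Sex": ["male", "female"], "Fare": [7.25, 71.2833]})
left_down = pd.DataFrame({"PassengerId": [3, 4], "Survived": [1, 1]})
right_down = pd.DataFrame({"Sex": ["female", "female"], "Fare": [7.925, 53.1]})

# Horizontal step: join aligns the two halves on their shared index.
up = left_up.join(right_up)
down = left_down.join(right_down)

# Vertical step: stands in for up.append(down, ignore_index=True).
result = pd.concat([up, down], ignore_index=True)
print(result)
```

Dropping ignore_index=True corresponds to a plain append call, which keeps the original row labels.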
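2.4.5 Task 5: Using pandas' merge method and DataFrame's append method, complete Tasks 2 and 3
# Write your code here
dup=dleftup.merge(drightup,left_index=True,right_index=True)
ddown=dleftdown.merge(drightdown,left_index=True,right_index=True)
# Note: DataFrame.append was removed in pandas 2.0; pd.concat([dup, ddown]) is the replacement
result_3=dup.append(ddown)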
result_3
C:\Users\ThinkPad\AppData\Local\Temp\ipykernel_4824\3296784267.py:4: FutureWarning: The frame.append method is deprecated and will be removed from pandas in a future version. Use pandas.concat instead.
result_3=dup.append(ddown)
| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S |
1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S |
3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S |
4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | S |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
447 | 887 | 0 | 2 | Montvila, Rev. Juozas | male | 27.0 | 0 | 0 | 211536 | 13.0000 | NaN | S |
448 | 888 | 1 | 1 | Graham, Miss. Margaret Edith | female | 19.0 | 0 | 0 | 112053 | 30.0000 | B42 | S |
449 | 889 | 0 | 3 | Johnston, Miss. Catherine Helen "Carrie" | female | NaN | 1 | 2 | W./C. 6607 | 23.4500 | NaN | S |
450 | 890 | 1 | 1 | Behr, Mr. Karl Howell | male | 26.0 | 0 | 0 | 111369 | 30.0000 | C148 | C |
451 | 891 | 0 | 3 | Dooley, Mr. Patrick | male | 32.0 | 0 | 0 | 370376 | 7.7500 | NaN | Q |
891 rows × 12 columns
merge(
left,
right,
how="inner",
on=None,
left_on=None,
right_on=None,
left_index=False,
right_index=False,
sort=False,
suffixes=("_x", "_y"),
copy=True,
indicator=False,
validate=None,
)
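Two parameters in this signature are worth trying on small data: how, whose default of "inner" differs from join's default of "left", and the indicator/validate pair, which helps check a merge. The sketch below uses made-up frames in which the right table is one row short; all names are illustrative.

```python
import pandas as pd

left = pd.DataFrame({"PassengerId": [1, 2, 3]})        # index 0, 1, 2
right = pd.DataFrame({"Sex": ["male", "female"]})      # index 0, 1 only

# merge defaults to how='inner': the unmatched row (index 2) is dropped.
inner = left.merge(right, left_index=True, right_index=True)

# join defaults to how='left': every left row is kept, gaps become NaN.
joined = left.join(right)

# indicator=True adds a _merge column saying where each row came from;
# validate='one_to_one' raises an error if either index has duplicate keys.
checked = left.merge(right, left_index=True, right_index=True,
                     how="left", indicator=True, validate="one_to_one")
print(checked)
```

In Task 5 both halves have identical indexes, so the inner default drops nothing there.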
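[Think] Compare the merge, join, and concat methods: how do they differ and what do they share? In Tasks 4 and 5, why are we asked to use DataFrame's append method? If only merge or only join were allowed, could Tasks 4 and 5 still be completed?
DataFrame has an instance method join that aligns on the index. It behaves like merge with left_index=True and right_index=True, except that join defaults to how='left' while merge defaults to how='inner'.
append adds rows, that is, it merges top to bottom. join and merge add columns, that is, they merge left to right; neither takes an axis parameter.
concat is the method with an axis parameter: axis=0 merges top to bottom and axis=1 merges left to right, so it can do both steps on its own.
merge or join alone therefore covers only the left-right half of Tasks 4 and 5, which is why the tasks also ask for append. An outer merge on all shared columns can imitate the top-bottom step when the two tables share no rows, but that is a workaround rather than the intended use.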
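To make the comparison concrete, here is a rough sketch on made-up frames of the top-bottom step done two ways: with concat, and with an outer merge on all shared columns. The outer-merge route is shown only to illustrate the workaround, under the assumption that the two tables have identical columns and no rows in common.

```python
import pandas as pd

up = pd.DataFrame({"PassengerId": [1, 2], "Sex": ["male", "female"]})
down = pd.DataFrame({"PassengerId": [3, 4], "Sex": ["female", "male"]})

# The direct way: concat along axis=0.
by_concat = pd.concat([up, down], ignore_index=True)

# The workaround: with no 'on' given, merge uses every shared column as a key.
# No key matches between the two tables, so how='outer' keeps all rows from both.
by_merge = up.merge(down, how="outer")

print(by_concat)
print(by_merge)
```

The outer merge may reorder rows by key, and it would collapse any row that appears in both tables, so concat remains the safer choice for stacking.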
2.4.6 Task 6: Save the finished data as result.csv
# Write your code here
result_3.to_csv('data/result.csv')
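By default to_csv also writes the row index as an unnamed first column, and result_3 still carries the repeated labels from its two halves. The sketch below, which writes to a string instead of a file and uses a made-up frame, shows the effect and the index=False alternative.

```python
import io
import pandas as pd

# Toy frame with a repeated index, like result_3 after the vertical merge.
result = pd.concat([pd.DataFrame({"PassengerId": [1, 2]}),
                    pd.DataFrame({"PassengerId": [3, 4]})])

# With no path argument, to_csv returns the CSV text as a string.
with_index = result.to_csv()
without_index = result.to_csv(index=False)

# Reading the default output back yields an extra "Unnamed: 0" column.
reloaded = pd.read_csv(io.StringIO(with_index))
clean = pd.read_csv(io.StringIO(without_index))
print(list(reloaded.columns))
print(list(clean.columns))
```

Whether to keep the index in the file depends on whether those row labels are needed later.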
2.5 Looking at the data from another angle
2.5.1 Task 1: Convert our data into a Series
# Write your code here
result_stack=result_3.stack()
result_stack
0 PassengerId 1
Survived 0
Pclass 3
Name Braund, Mr. Owen Harris
Sex male
...
451 SibSp 0
Parch 0
Ticket 370376
Fare 7.75
Embarked Q
Length: 9826, dtype: object
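stack() "stacks" the data: it rotates the columns into the rows, so the column labels become the innermost level of the row index.
unstack() is the inverse of stack(): it rotates that index level back into columns.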
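A small round trip makes the two operations easier to see. This sketch uses a made-up frame with two numeric columns and no missing values. That choice matters: in the output above, row 451 has no Cabin entry and the Series has 9826 elements rather than 891 × 12 = 10692, which suggests the pandas version used there dropped NaN cells while stacking.

```python
import pandas as pd

df = pd.DataFrame({"Age": [22.0, 38.0, 26.0],
                   "Fare": [7.25, 71.2833, 7.925]})

# stack(): each (row label, column label) pair becomes one entry of a Series
# whose index has two levels.
stacked = df.stack()
print(stacked)

# unstack(): the inner index level goes back to being the columns.
restored = stacked.unstack()
print(restored)
```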
result_3
| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S |
1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S |
3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S |
4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | S |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
447 | 887 | 0 | 2 | Montvila, Rev. Juozas | male | 27.0 | 0 | 0 | 211536 | 13.0000 | NaN | S |
448 | 888 | 1 | 1 | Graham, Miss. Margaret Edith | female | 19.0 | 0 | 0 | 112053 | 30.0000 | B42 | S |
449 | 889 | 0 | 3 | Johnston, Miss. Catherine Helen "Carrie" | female | NaN | 1 | 2 | W./C. 6607 | 23.4500 | NaN | S |
450 | 890 | 1 | 1 | Behr, Mr. Karl Howell | male | 26.0 | 0 | 0 | 111369 | 30.0000 | C148 | C |
451 | 891 | 0 | 3 | Dooley, Mr. Patrick | male | 32.0 | 0 | 0 | 370376 | 7.7500 | NaN | Q |
891 rows × 12 columns
# Write your code here
type(result_stack)
pandas.core.series.Series