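Titanic Survival Prediction (Logistic and KNN)
Download the data from the Kaggle website: train and test.
Competition description:
- The sinking of the Titanic is one of the most infamous shipwrecks in history. On April 15, 1912, during her maiden voyage, the Titanic sank after colliding with an iceberg, killing 1502 of the 2224 passengers and crew. The tragedy shocked the international community and led to better ship safety regulations. One reason the wreck cost so many lives was that there were not enough lifeboats for the passengers and crew. Although surviving the sinking involved some luck, some groups of people were more likely to survive than others, such as women, children, and the upper class.
- In this competition, participants are asked to work out what sorts of people were more likely to survive. In particular, we are asked to use machine learning tools to predict which passengers survived the disaster.
Contents
- Posing the problem
- Understanding the data
- Data processing (preprocessing and feature engineering)
- Model building and evaluation
- Summary
I. Posing the problem:
Using the available information, predict whether each of the 418 passengers in test survived, and submit the predictions.
Problem analysis:
This means predicting a categorical outcome (binary classification) from a set of predictor variables. Supervised machine learning offers many methods for classification: logistic regression, KNN, decision trees, random forests, support vector machines, neural networks, and so on. This article uses logistic regression and KNN for the prediction.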
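As a rough illustration of what a KNN classifier does, the sketch below labels a point by majority vote among its k nearest training points. It is not the library implementation used for modelling; the function name `knn_predict` and the toy data are made up for this example.

```python
import numpy as np

def knn_predict(X_train, y_train, X_new, k=3):
    """Predict 0/1 labels by majority vote among the k nearest training points."""
    X_train = np.asarray(X_train, dtype=float)
    y_train = np.asarray(y_train)
    X_new = np.asarray(X_new, dtype=float)
    preds = []
    for x in X_new:
        # Euclidean distance from x to every training point
        dist = np.sqrt(((X_train - x) ** 2).sum(axis=1))
        nearest = np.argsort(dist)[:k]
        # majority vote; with binary labels and an odd k there are no ties
        preds.append(int(y_train[nearest].mean() > 0.5))
    return np.array(preds)

# hypothetical toy data: two features, binary label
X_train = [[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]]
y_train = [0, 0, 0, 1, 1, 1]
pred = knn_predict(X_train, y_train, [[0.5, 0.5], [5.5, 5.5]], k=3)
print(pred)
```

Because the vote depends on raw distances, features on very different scales would dominate each other, which is one reason the features are later reduced to small integer codes or standardized.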
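II. Understanding the data:
First get a rough idea of the number of variables, their data types, distributions, and missing values, and form some hypotheses.
#Import the required modules
#Data processing
import numpy as np
import pandas as pd
import re
#Plotting
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
#Set the plotting style
sns.set_style("darkgrid")
OK, first take a look at the data:
#Read the data
train = pd.read_csv(r"G:\Kaggle\Titanic\train.csv")
test = pd.read_csv(r"G:\Kaggle\Titanic\test.csv")
#Look at the first 6 rows of the training set
train.head(6)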
| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S |
1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th… | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S |
3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S |
4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | S |
5 | 6 | 0 | 3 | Moran, Mr. James | male | NaN | 0 | 0 | 330877 | 8.4583 | NaN | Q |
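Training set fields: passenger ID, survival flag, ticket class, name, sex, age, number of siblings and spouses aboard, number of parents and children aboard, ticket number, fare, cabin, and port of embarkation.
#Look at a random sample of the test set
test.sample(6)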
| | PassengerId | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
---|---|---|---|---|---|---|---|---|---|---|---|
417 | 1309 | 3 | Peter, Master. Michael J | male | NaN | 1 | 1 | 2668 | 22.3583 | NaN | C |
224 | 1116 | 1 | Candee, Mrs. Edward (Helen Churchill Hungerford) | female | 53.0 | 0 | 0 | PC 17606 | 27.4458 | NaN | C |
99 | 991 | 3 | Nancarrow, Mr. William Henry | male | 33.0 | 0 | 0 | A./5. 3338 | 8.0500 | NaN | S |
410 | 1302 | 3 | Naughton, Miss. Hannah | female | NaN | 0 | 0 | 365237 | 7.7500 | NaN | Q |
41 | 933 | 1 | Franklin, Mr. Thomas Parham | male | NaN | 0 | 0 | 113778 | 26.5500 | D34 | S |
70 | 962 | 3 | Mulvihill, Miss. Bertha E | female | 24.0 | 0 | 0 | 382653 | 7.7500 | NaN | Q |
Compared with the training set, only the target variable Survived is missing; all other fields are the same.
train.info()
print("==" * 50)
test.info()
#Summary of the numeric columns:
train.describe()
| | PassengerId | Survived | Pclass | Age | SibSp | Parch | Fare |
---|---|---|---|---|---|---|---|
count | 891.000000 | 891.000000 | 891.000000 | 714.000000 | 891.000000 | 891.000000 | 891.000000 |
mean | 446.000000 | 0.383838 | 2.308642 | 29.699118 | 0.523008 | 0.381594 | 32.204208 |
std | 257.353842 | 0.486592 | 0.836071 | 14.526497 | 1.102743 | 0.806057 | 49.693429 |
min | 1.000000 | 0.000000 | 1.000000 | 0.420000 | 0.000000 | 0.000000 | 0.000000 |
25% | 223.500000 | 0.000000 | 2.000000 | 20.125000 | 0.000000 | 0.000000 | 7.910400 |
50% | 446.000000 | 0.000000 | 3.000000 | 28.000000 | 0.000000 | 0.000000 | 14.454200 |
75% | 668.500000 | 1.000000 | 3.000000 | 38.000000 | 1.000000 | 0.000000 | 31.000000 |
max | 891.000000 | 1.000000 | 3.000000 | 80.000000 | 8.000000 | 6.000000 | 512.329200 |
#Summary of the string columns:
train.describe(include=['O'])
| | Name | Sex | Ticket | Cabin | Embarked |
---|---|---|---|---|---|
count | 891 | 891 | 891 | 204 | 889 |
unique | 891 | 2 | 681 | 147 | 3 |
top | Kink-Heilmann, Miss. Luise Gretchen | male | 1601 | C23 C25 C27 | S |
freq | 1 | 577 | 7 | 4 | 644 |
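A. Basic description:
Categorical variables: Survived, Pclass (ordinal), Sex, Embarked. Numeric variables: Age, SibSp (discrete), Parch (discrete), Fare.
Four fields have missing values, to different degrees (Age and Cabin are missing many values; Fare and Embarked only a few).
In the training set:
- (1) There are 891 passengers, with a survival rate of 38%.
- (2) The youngest passenger is 0.42 years old and the oldest is 80. Excluding missing values, the mean age is 29.7, and there are few elderly passengers.
- (3) At least 25% of passengers travelled with one or more siblings or spouses, and at least 75% travelled without parents or children.
- (4) The mean fare is about 32 and the maximum about 512, a wide spread.
- (5) Every name is unique.
- (6) 577 passengers are male, so men outnumber women.
- (7) Ticket has 681 distinct values.
- (8) Cabin is mostly missing: only 204 of the 891 passengers have a record.
- (9) Embarked has missing values; 644 passengers boarded at port S, the largest share.
B. Hypotheses:
The target variable is Survived; every other variable is a candidate feature for modelling. Below we examine how much each variable affects survival, keep the useful ones, drop the useless ones, and see whether new information can be derived to help build the model. We can make the following hypotheses:
1. Pclass and Fare reflect a passenger's status and wealth. In a crisis, passengers of higher social standing probably had a higher survival rate than those of lower standing.
2. In a disaster, the social norms of protecting the old and the young and putting women first are likely to apply, so the elderly, children, and women should have higher survival rates.
3. Travelling with several relatives may raise the survival rate, since a group can help each other.
4. Name and Ticket do not obviously carry information and may be dropped.
5. The Id is useful for record keeping but not for analysis, so drop it.
C. Missing data:
Missing data must be handled according to the situation.
Ways to handle missing values (most scikit-learn estimators raise an error when fitted on data that contains missing values):
Deletion (simple and crude, dropna)
- Delete whole instances, i.e. drop rows (suitable when the sample is large and few cases are missing)
- Delete features with missing values (when the column is mostly missing and the feature matters little to the model)
Imputation (infer the missing values from the known data; the estimates are not guaranteed to be correct, but this usually gives a better model than deleting columns)
- Estimate with the feature's mean, median, or mode (basic version)
- Estimate the missing values from other known numeric data (advanced version)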
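The deletion and basic imputation options listed above might look like this in pandas. This is a minimal sketch on a made-up four-row frame, not the actual Titanic data.

```python
import numpy as np
import pandas as pd

# hypothetical toy frame with gaps
df = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, 35.0],
    "Cabin": [np.nan, "C85", np.nan, np.nan],
    "Embarked": ["S", "C", np.nan, "S"],
})

dropped_rows = df.dropna(subset=["Age"])   # delete rows where Age is missing
dropped_cols = df.drop(columns=["Cabin"])  # delete a mostly-missing column

imputed = df.copy()
# numeric column: fill with the median of the observed values
imputed["Age"] = imputed["Age"].fillna(imputed["Age"].median())
# categorical column: fill with the most frequent value (mode)
imputed["Embarked"] = imputed["Embarked"].fillna(imputed["Embarked"].mode()[0])
print(imputed)
```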
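D. Data type conversion:
All string data must be converted to numeric data.
# III. Data processing (preprocessing and feature engineering)
First put train and test in one list, so that later code can process both data sets in the same loop:
combination_data = [train,test]
**Below, the variables are discussed in two groups, numeric and string, according to their current data type. Along the way we handle missing values, select variables based on their relationship to survival, and drop variables or create new ones where needed to help build the model. In the end every column is converted to a numeric type.**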
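A small sketch of why looping over a list of the two frames works, and when the list has to be rebuilt. The frames here are made-up stand-ins for train and test, and the `extra` column is purely illustrative.

```python
import pandas as pd

train = pd.DataFrame({"SibSp": [1, 0], "Parch": [0, 2]})
test = pd.DataFrame({"SibSp": [0, 3], "Parch": [1, 1]})
combination_data = [train, test]

# the list holds references, so in-place column assignment changes train and test
for dataset in combination_data:
    dataset["Family"] = dataset["SibSp"] + dataset["Parch"] + 1

# rebinding the name train creates a new object; the list still points at the old one
train = pd.concat([train, pd.DataFrame({"extra": [0, 0]})], axis=1)
same_object = combination_data[0] is train
combination_data = [train, test]  # rebuild the list so the loop sees the new frame
print(same_object)
```

Any step that reassigns `train` or `test` (for example with `pd.concat`) therefore needs the list to be rebuilt before the next loop.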
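## Numeric:
- PassengerId
The passenger ID only distinguishes records and has no predictive value, so delete it.
del train["PassengerId"]
- Pclass
The ticket class has three levels and to some extent represents a passenger's status and social standing. Let us look at the effect of Pclass: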
train[["Pclass","Survived"]].groupby("Pclass",as_index=False).mean().sort_values(by="Survived",ascending=False)
| | Pclass | Survived |
---|---|---|
0 | 1 | 0.629630 |
1 | 2 | 0.472826 |
2 | 3 | 0.242363 |
sns.barplot(x="Pclass",y="Survived",data=train)
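The groupby pattern used throughout this section relies on the fact that the mean of a 0/1 column is the fraction of ones. The sketch below shows this on made-up data and also reports the group sizes, since a rate computed from a handful of passengers is much noisier than one computed from hundreds.

```python
import pandas as pd

# hypothetical toy data
toy = pd.DataFrame({
    "Pclass":   [1, 1, 2, 2, 3, 3, 3, 3],
    "Survived": [1, 1, 1, 0, 0, 0, 1, 0],
})
# mean of a 0/1 column = share of survivors in each group
rate = toy.groupby("Pclass")["Survived"].mean()
size = toy.groupby("Pclass")["Survived"].size()
print(pd.DataFrame({"rate": rate, "n": size}))
```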
- SibSp
train[["SibSp","Survived"]].groupby("SibSp",as_index=False).mean().sort_values(by="Survived",ascending=False)
| | SibSp | Survived |
---|---|---|
1 | 1 | 0.535885 |
2 | 2 | 0.464286 |
0 | 0 | 0.345395 |
3 | 3 | 0.250000 |
4 | 4 | 0.166667 |
5 | 5 | 0.000000 |
6 | 8 | 0.000000 |
When SibSp is 3, 4, 5, or 8 the survival rate is low, even 0. There is an effect, but it is not pronounced.
- Parch
train[["Parch","Survived"]].groupby("Parch",as_index=False).mean().sort_values(by="Survived",ascending=False)
| | Parch | Survived |
---|---|---|
3 | 3 | 0.600000 |
1 | 1 | 0.550847 |
2 | 2 | 0.500000 |
0 | 0 | 0.343658 |
5 | 5 | 0.200000 |
4 | 4 | 0.000000 |
6 | 6 | 0.000000 |
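Survival is also low when Parch is 4, 5, or 6, and again the effect is not pronounced. This resembles SibSp, so we add the two counts together and look at the combined effect on survival:
for dataset in combination_data:
    dataset["Family"] = dataset["SibSp"] + dataset["Parch"] + 1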
train[["Family","Survived"]].groupby("Family",as_index=False).mean().sort_values(by="Survived",ascending=False)
| | Family | Survived |
---|---|---|
3 | 4 | 0.724138 |
2 | 3 | 0.578431 |
1 | 2 | 0.552795 |
6 | 7 | 0.333333 |
0 | 1 | 0.303538 |
4 | 5 | 0.200000 |
5 | 6 | 0.136364 |
7 | 8 | 0.000000 |
8 | 11 | 0.000000 |
sns.countplot(x="Family",hue="Survived",data=train)
for dataset in combination_data:
    dataset["Family_size"] = 0 #create a new column
    dataset.loc[dataset["Family"] == 1,"Family_size"] = 1 #small family (travelling alone)
    dataset.loc[(dataset["Family"] > 1) & (dataset["Family"] <= 4),"Family_size"] = 2 #medium family (2-4)
    dataset.loc[dataset["Family"] > 4,"Family_size"] = 3 #large family (5-11)
    dataset["Family_size"] = dataset["Family_size"].astype(int)
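The same three-way grouping can also be written with `pd.cut` and explicit bin edges. This is an alternative sketch on made-up family sizes, not the code used above; the edges are chosen to reproduce the 1 / 2-4 / 5+ split.

```python
import pandas as pd

family = pd.Series([1, 2, 4, 5, 11], name="Family")  # hypothetical values
# right-closed bins (0, 1], (1, 4], (4, inf] map to the codes 1, 2, 3
family_size = pd.cut(family, bins=[0, 1, 4, float("inf")], labels=[1, 2, 3]).astype(int)
print(family_size.tolist())
```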
We can also ask whether the company of family members affects survival, to decide whether another new feature is worth building:
for dataset in combination_data:
    dataset["Alone"] = dataset["Family"].map(lambda x : 1 if x==1 else 0)
train[["Alone","Survived"]].groupby("Alone",as_index=False).mean().sort_values("Survived",ascending=False)
| | Alone | Survived |
---|---|---|
0 | 0 | 0.505650 |
1 | 1 | 0.303538 |
sns.barplot(x="Alone",y="Survived",data=train)
for dataset in combination_data:
    dataset.drop(["SibSp","Parch","Family"],axis=1,inplace=True)
Now bring Pclass into the picture:
#point plot of the survival rate by Pclass, split by Alone
sns.catplot(x="Pclass",y="Survived",hue="Alone",kind="point",data=train)
- Age
train.Age.describe()
count 714.000000
mean 29.699118
std 14.526497
min 0.420000
25% 20.125000
50% 28.000000
75% 38.000000
max 80.000000
Name: Age, dtype: float64
#Look at the distribution of Age
sns.violinplot(y="Age",data=train)
#Compare the age distributions of survivors and non-survivors
sns.violinplot(y="Age",x="Survived",data=train)
train["Age_group"] = pd.cut(train.Age,5)
train[["Age_group","Survived"]].groupby("Age_group",as_index=False).mean().sort_values("Survived",ascending=False)
| | Age_group | Survived |
---|---|---|
0 | (0.34, 16.336] | 0.550000 |
3 | (48.168, 64.084] | 0.434783 |
2 | (32.252, 48.168] | 0.404255 |
1 | (16.336, 32.252] | 0.369942 |
4 | (64.084, 80.0] | 0.090909 |
sns.barplot(x="Age_group",y="Survived",data=train)
del train["Age_group"]
Next we fill in the missing Age values. First check the Age column:
train.Age.isnull().sum()
177
Of the 891 passengers in the train data set, 177 (close to 20%) have no age. The mean age is 29.7, the standard deviation 14.5, and the median 28.
For now the missing ages are filled with random integers drawn from the interval between the mean minus one standard deviation and the mean plus one standard deviation, which introduces some noise. I will come back and revise this after learning more advanced imputation methods.
for dataset in combination_data:
    Age_avg = dataset["Age"].mean()
    Age_std = dataset["Age"].std()
    missing = dataset["Age"].isnull()
    #assign through .loc so the values are written to the frame itself, not to a copy
    dataset.loc[missing,"Age"] = np.random.randint(int(Age_avg - Age_std), int(Age_avg + Age_std), missing.sum())
    dataset["Age"] = dataset["Age"].astype(int)
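Because the fill is random, every run produces slightly different ages. One way to make it repeatable is a seeded generator, sketched here on made-up ages; the seed value 0 is an arbitrary assumption.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # fixed seed (arbitrary choice) makes the fill repeatable
ages = pd.DataFrame({"Age": [22.0, np.nan, 38.0, np.nan, 54.0]})  # hypothetical values

mean, std = ages["Age"].mean(), ages["Age"].std()
missing = ages["Age"].isnull()
low, high = int(mean - std), int(mean + std)
# integers drawn uniformly from [low, high); high itself is excluded
ages.loc[missing, "Age"] = rng.integers(low, high, size=missing.sum())
ages["Age"] = ages["Age"].astype(int)
print(ages)
```

Note that this draw is uniform over the interval, so the filled values do not follow the shape of the observed age distribution.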
#Again use 5 groups:
for dataset in combination_data:
    dataset["Age_group"] = pd.cut(dataset.Age, 5)
#Now record each passenger's group with a numeric code:
for dataset in combination_data:
    dataset.loc[dataset["Age"] <= 16,"Age"] = 0
    dataset.loc[(dataset["Age"] > 16) & (dataset["Age"] <= 32), "Age"] = 1
    dataset.loc[(dataset["Age"] > 32) & (dataset["Age"] <= 48), "Age"] = 2
    dataset.loc[(dataset["Age"] > 48) & (dataset["Age"] <= 64), "Age"] = 3
    dataset.loc[dataset["Age"] > 64, "Age"] = 4
for dataset in combination_data:
    dataset.drop("Age_group",axis=1,inplace=True)
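The chain of `.loc` assignments above can be expressed in one call with `pd.cut` and `labels=False`, which returns the bin number directly. This is a sketch of the alternative on made-up ages, using the same cut points 16, 32, 48, and 64.

```python
import numpy as np
import pandas as pd

age = pd.Series([0, 16, 17, 32, 33, 48, 49, 64, 65, 80])  # hypothetical ages
# right-closed bins reproduce the <= 16, <= 32, <= 48, <= 64 boundaries
edges = [-np.inf, 16, 32, 48, 64, np.inf]
age_code = pd.cut(age, bins=edges, labels=False)
print(age_code.tolist())
```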
- Fare
train.Fare.describe()
count 891.000000
mean 32.204208
std 49.693429
min 0.000000
25% 7.910400
50% 14.454200
75% 31.000000
max 512.329200
Name: Fare, dtype: float64
sns.violinplot(y="Fare",data=train)
#Compare the fares of survivors and non-survivors
sns.violinplot(y="Fare",x="Survived",data=train)
train["Fare_group"] = pd.qcut(train["Fare"],4) #split into quartiles
train[["Fare_group","Survived"]].groupby("Fare_group",as_index=False).mean()
| | Fare_group | Survived |
---|---|---|
0 | (-0.001, 7.91] | 0.197309 |
1 | (7.91, 14.454] | 0.303571 |
2 | (14.454, 31.0] | 0.454955 |
3 | (31.0, 512.329] | 0.581081 |
The survival rate rises steadily with the fare, so Fare is kept as a feature.
Fare has one missing value in the test set, which we fill with the median:
test["Fare"] = test["Fare"].fillna(test["Fare"].median())
for dataset in combination_data:
    dataset.loc[dataset["Fare"] <= 7.91,"Fare"] = 0
    dataset.loc[(dataset["Fare"] > 7.91) & (dataset["Fare"] <= 14.454), "Fare"] = 1
    dataset.loc[(dataset["Fare"] > 14.454) & (dataset["Fare"] <= 31.0), "Fare"] = 2
    dataset.loc[dataset["Fare"] > 31.0, "Fare"] = 3
    dataset["Fare"] = dataset["Fare"].astype(int)
del train["Fare_group"]
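Instead of copying the quartile boundaries by hand, the edges can be taken from `pd.qcut` with `retbins=True` and then reused on the test set. The sketch below assumes made-up fares; opening the outer edges to infinity is my own choice so that test fares outside the training range still receive a code.

```python
import numpy as np
import pandas as pd

train_fare = pd.Series([0.0, 7.25, 8.05, 13.0, 26.0, 31.0, 71.28, 512.33])  # hypothetical
test_fare = pd.Series([7.0, 14.5, 30.0, 600.0])                              # hypothetical

# learn the quartile edges on the training data only
train_code, edges = pd.qcut(train_fare, 4, labels=False, retbins=True)
# open the outer edges so values outside the training range still get a code
edges[0], edges[-1] = -np.inf, np.inf
test_code = pd.cut(test_fare, bins=edges, labels=False)
print(train_code.tolist(), test_code.tolist())
```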
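## String
### Name
No name is repeated, so the column could simply be dropped. However, other people's write-ups suggest that the length of a Western name and the title it contains can also reflect a person's status, so we examine how these two factors affect survival:
(1) Name length
#the length is measured as the number of space-separated words in the name
for dataset in combination_data:
    dataset["The_length_of_name"] = dataset["Name"].map(lambda x:len(re.split(" ",x)))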
train[["The_length_of_name","Survived"]].groupby("The_length_of_name",as_index=False).mean().sort_values("Survived",ascending=False)
| | The_length_of_name | Survived |
---|---|---|
6 | 9 | 1.000000 |
7 | 14 | 1.000000 |
4 | 7 | 0.842105 |
3 | 6 | 0.773585 |
5 | 8 | 0.555556 |
2 | 5 | 0.427083 |
1 | 4 | 0.340206 |
0 | 3 | 0.291803 |
sns.barplot(x="The_length_of_name",y="Survived",data=train)
from sklearn.preprocessing import StandardScaler
Stdsca = StandardScaler()
name_length1 = Stdsca.fit_transform(train[["The_length_of_name"]])
name_length1 = pd.DataFrame(name_length1,columns=["name_length"])
train = pd.concat([train,name_length1],axis=1)
#Standardize test the same way, reusing the mean and standard deviation fitted on train
name_length2 = Stdsca.transform(test[["The_length_of_name"]])
name_length2 = pd.DataFrame(name_length2,columns=["name_length"])
test = pd.concat([test,name_length2],axis=1)
#Rebuild the list, because train and test are now new objects
combination_data = [train,test]
#Delete the original name length
for dataset in combination_data:
    del dataset["The_length_of_name"]
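To make the standardization step concrete, here is the same computation written out by hand on made-up word counts. The statistics are taken from the training values only and then applied to the test values; `ddof=0` is used because StandardScaler divides by the population standard deviation.

```python
import pandas as pd

train_len = pd.Series([3, 4, 4, 5, 9], dtype=float)  # hypothetical word counts
test_len = pd.Series([3, 7], dtype=float)

# statistics come from train only
mu = train_len.mean()
sigma = train_len.std(ddof=0)

train_scaled = (train_len - mu) / sigma
test_scaled = (test_len - mu) / sigma  # reuse the train statistics, do not refit
print(train_scaled.round(3).tolist(), test_scaled.round(3).tolist())
```

Scaling test with its own mean and standard deviation would put the two sets on slightly different scales, so the same raw name length could map to different feature values.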
(2) Title
#Look at the format of the names
train.Name.head(7)
0 Braund, Mr. Owen Harris
1 Cumings, Mrs. John Bradley (Florence Briggs Th…
2 Heikkinen, Miss. Laina
3 Futrelle, Mrs. Jacques Heath (Lily May Peel)
4 Allen, Mr. William Henry
5 Moran, Mr. James
6 McCarthy, Mr. Timothy J
Name: Name, dtype: object
#Extract the title into a new column
for dataset in combination_data:
    dataset["Title"] = dataset["Name"].str.extract(r"([A-Za-z]+)\.",expand=False)
train.sample(4)
| | Survived | Pclass | Name | Sex | Age | Ticket | Fare | Cabin | Embarked | Family_size | Alone | name_length | Title |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
271 | 1 | 3 | Tornquist, Mr. William Henry | male | 1 | LINE | 0 | NaN | S | 1 | 1 | -0.059474 | Mr |
389 | 1 | 2 | Lehmann, Miss. Bertha | female | 1 | SC 1748 | 1 | NaN | C | 1 | 1 | -0.914177 | Miss |
40 | 0 | 3 | Ahlin, Mrs. Johan (Johanna Persdotter Larsson) | female | 2 | 7546 | 1 | NaN | S | 2 | 0 | 1.649930 | Mrs |
709 | 1 | 3 | Moubarek, Master. Halim Gonios (“William George”) | male | 1 | 2661 | 2 | NaN | C | 2 | 0 | 1.649930 | Master |
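The title extraction relies on a single capture group: a run of letters immediately followed by a literal dot. A small sketch on three of the names shown earlier:

```python
import pandas as pd

names = pd.Series([
    "Braund, Mr. Owen Harris",
    "Futrelle, Mrs. Jacques Heath (Lily May Peel)",
    "Heikkinen, Miss. Laina",
])
# the first run of letters that is directly followed by "." is the title
titles = names.str.extract(r"([A-Za-z]+)\.", expand=False)
print(titles.tolist())
```

The surname is skipped because it is followed by a comma rather than a dot, and `expand=False` returns a Series instead of a one-column DataFrame.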
#Title is related to Sex, so analyze the two together
pd.crosstab(train.Title,train.Sex)
Title | female | male |
---|---|---|
Capt | 0 | 1 |
Col | 0 | 2 |
Countess | 1 | 0 |
Don | 0 | 1 |
Dr | 1 | 6 |
Jonkheer | 0 | 1 |
Lady | 1 | 0 |
Major | 0 | 2 |
Master | 0 | 40 |
Miss | 182 | 0 |
Mlle | 2 | 0 |
Mme | 1 | 0 |
Mr | 0 | 517 |
Mrs | 125 | 0 |
Ms | 1 | 0 |
Rev | 0 | 6 |
Sir | 0 | 1 |
#Most titles are Master, Miss, Mr, or Mrs; group the rarer ones:
for dataset in combination_data:
    dataset['Title'] = dataset['Title'].replace(['Lady', 'Countess','Capt', 'Col','Don', 'Dr', 'Major', 'Rev', 'Sir', 'Jonkheer', 'Dona'], 'Rare')
    dataset['Title'] = dataset['Title'].replace('Mlle', 'Miss')
    dataset['Title'] = dataset['Title'].replace('Ms', 'Miss')
    dataset['Title'] = dataset['Title'].replace('Mme', 'Mrs')
#Explore the relationship between title and survival
train[["Title","Survived"]].groupby("Title",as_index=False).mean().sort_values("Survived",ascending=False)
| | Title | Survived |
---|---|---|
3 | Mrs | 0.793651 |
1 | Miss | 0.702703 |
0 | Master | 0.575000 |
4 | Rare | 0.347826 |
2 | Mr | 0.156673 |
sns.barplot(x="Title",y="Survived",data=train)
#Convert the titles to numeric codes
for dataset in combination_data:
    dataset["Title"] = dataset["Title"].map({"Mr":1,"Mrs":2,"Miss":3,"Master":4,"Rare":5})
    dataset["Title"] = dataset["Title"].fillna(0)
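The grouping and encoding steps can be traced on a few values. In this sketch the title "Unknown" is a made-up value standing in for any title that is missing from the dictionary, which shows why the `fillna(0)` is there.

```python
import pandas as pd

title = pd.Series(["Mr", "Mlle", "Dr", "Mrs", "Unknown"])  # "Unknown" is a made-up unseen value
rare = ["Lady", "Countess", "Capt", "Col", "Don", "Dr", "Major", "Rev", "Sir", "Jonkheer", "Dona"]

# collapse rare titles and map the French variants to their English equivalents
title = title.replace(rare, "Rare").replace({"Mlle": "Miss", "Ms": "Miss", "Mme": "Mrs"})
codes = {"Mr": 1, "Mrs": 2, "Miss": 3, "Master": 4, "Rare": 5}
# map() yields NaN for anything missing from the dict; fillna(0) turns that into code 0
title_code = title.map(codes).fillna(0).astype(int)
print(title_code.tolist())
```

These integer codes carry an arbitrary order (Mr < Mrs < Miss, and so on), which distance-based models such as KNN will treat as meaningful; one-hot encoding would be an alternative.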
#Delete the original Name feature
for dataset in combination_data:
    del dataset["Name"]
#Look at the data as it stands now
train.head(3)
| | Survived | Pclass | Sex | Age | Ticket | Fare | Cabin | Embarked | Family_size | Alone | name_length | Title |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0 | 3 | male | 1 | A/5 21171 | 0 | NaN | S | 2 | 0 | -0.059474 | 1 |
1 | 1 | 1 | female | 2 | PC 17599 | 3 | C85 | C | 2 | 0 | 2.504633 | 2 |
2 | 1 | 3 | female | 1 | STON/O2. 3101282 | 1 | NaN | S | 1 | 1 | -0.914177 | 3 |
- Sex
While analyzing Title we already saw that sex affects survival. Now we look at Sex on its own:
train[["Sex","Survived"]].groupby("Sex",as_index=False).mean().sort_values("Survived",ascending=False)
| | Sex | Survived |
---|---|---|
0 | female | 0.742038 |
1 | male | 0.188908 |
sns.countplot(x="Sex",hue="Survived",data=train)
train[["Pclass","Sex","Survived"]].groupby(["Pclass","Sex"],as_index=False).mean().sort_values(by="Survived",ascending=False)
| | Pclass | Sex | Survived |
---|---|---|---|
0 | 1 | female | 0.968085 |
2 | 2 | female | 0.921053 |
4 | 3 | female | 0.500000 |
1 | 1 | male | 0.368852 |
3 | 2 | male | 0.157407 |
5 | 3 | male | 0.135447 |
sns.catplot(x="Pclass",y="Survived",hue="Sex",kind="point",data=train)
#Convert the strings to numbers: 0 for male, 1 for female.
for dataset in combination_data:
    dataset["Sex"] = dataset["Sex"].map({"male":0,"female":1})
train.head(4)
| | Survived | Pclass | Sex | Age | Ticket | Fare | Cabin | Embarked | Family_size | Alone | name_length | Title |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0 | 3 | 0 | 1 | A/5 21171 | 0 | NaN | S | 2 | 0 | -0.059474 | 1 |
1 | 1 | 1 | 1 | 2 | PC 17599 | 3 | C85 | C | 2 | 0 | 2.504633 | 2 |
2 | 1 | 3 | 1 | 1 | STON/O2. 3101282 | 1 | NaN | S | 1 | 1 | -0.914177 | 3 |
3 | 1 | 1 | 1 | 2 | 113803 | 3 | C123 | S | 2 | 0 | 2.504633 | 2 |
- Cabin
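#describe() already showed that Cabin is mostly missing
a = train.Cabin.isnull().sum()
print("Missing count: %d" % a)
Missing count: 687
More than 75% of the values are missing, so we do not try to fill them. Instead we build a new feature from whether Cabin is recorded and see whether it affects survival. If it does not, the column is deleted.
train["Cabin_exist"] = train.Cabin.map(lambda x : "Yes" if isinstance(x, str) else "No")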
train[["Cabin_exist", "Survived"]].groupby("Cabin_exist",as_index=False).mean()
| | Cabin_exist | Survived |
---|---|---|
0 | No | 0.299854 |
1 | Yes | 0.666667 |
sns.barplot(x="Cabin_exist",y="Survived",data=train)
#The column must be numeric, so delete it and build it again
del train["Cabin_exist"]
#1 means a cabin is recorded, 0 means it is missing
for dataset in combination_data:
    dataset["Cabin_exist"] = dataset["Cabin"].map(lambda x : 1 if isinstance(x, str) else 0)
#Delete the original Cabin column
for dataset in combination_data:
    del dataset["Cabin"]
train.head(3)
| | Survived | Pclass | Sex | Age | Ticket | Fare | Embarked | Family_size | Alone | name_length | Title | Cabin_exist |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0 | 3 | 0 | 1 | A/5 21171 | 0 | S | 2 | 0 | -0.059474 | 1 | 0 |
1 | 1 | 1 | 1 | 2 | PC 17599 | 3 | C | 2 | 0 | 2.504633 | 2 | 1 |
2 | 1 | 3 | 1 | 1 | STON/O2. 3101282 | 1 | S | 1 | 1 | -0.914177 | 3 | 0 |
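The Cabin indicator can also be built without a lambda, using the vectorized `notnull()`. A minimal sketch on the Cabin values of the first four training rows shown earlier:

```python
import numpy as np
import pandas as pd

cabin = pd.Series([np.nan, "C85", np.nan, "C123"])  # Cabin of the first four training rows
# notnull() is True where a cabin is recorded; casting to int gives the 0/1 indicator
cabin_exist = cabin.notnull().astype(int)
print(cabin_exist.tolist())
```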
- Embarked
This column has missing values. First look at whether the port of embarkation affects survival:
train[["Embarked","Survived"]].groupby("Embarked",as_index=False).count().sort_values("Survived",ascending=False)
| | Embarked | Survived |
---|---|---|
2 | S | 644 |
0 | C | 168 |
1 | Q | 77 |
train[["Embarked","Survived"]].groupby("Embarked",as_index=False).mean().sort_values("Survived",ascending=False)
| | Embarked | Survived |
---|---|---|
0 | C | 0.553571 |
1 | Q | 0.389610 |
2 | S | 0.336957 |
sns.barplot(x="Embarked",y="Survived",data=train)
sns.catplot(x="Pclass",y="Survived",hue="Embarked",kind="point",data=train)
train[["Sex","Survived","Embarked"]].groupby(["Sex","Embarked"],as_index=False).count().sort_values("Survived",ascending=False)
| | Sex | Embarked | Survived |
---|---|---|---|
2 | 0 | S | 441 |
5 | 1 | S | 203 |
0 | 0 | C | 95 |
3 | 1 | C | 73 |
1 | 0 | Q | 41 |
4 | 1 | Q | 36 |
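At port S, 644 passengers boarded and about 32% of them were women; at port C, 168 boarded and about 43% were women; at port Q, 77 boarded and about 47% were women (the female-to-male ratios are roughly 0.46, 0.77, and 0.88). Since women survived at a much higher rate than men, part of the difference between ports, in particular the low survival rate at S, may come from the sex composition.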
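The shares can be recomputed from the counts in the table above (Sex 0 is male, 1 is female). The sketch reports both the share of women among all passengers at each port and the female-to-male ratio, since the two are easy to confuse.

```python
import pandas as pd

# counts copied from the Sex / Embarked table above
counts = pd.DataFrame(
    {"male": [441, 95, 41], "female": [203, 73, 36]},
    index=["S", "C", "Q"],
)
total = counts.sum(axis=1)
female_share = (counts["female"] / total).round(3)              # women / all passengers
female_to_male = (counts["female"] / counts["male"]).round(3)   # women / men
print(pd.DataFrame({"total": total, "female_share": female_share, "female_to_male": female_to_male}))
```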
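Handling the missing values: as seen when summarizing the data, most passengers boarded at port S, and Embarked is missing in only 2 rows of train. So we fill those with S:
train["Embarked"] = train.Embarked.fillna("S")
#Convert Embarked to numeric data:
for dataset in combination_data:
    dataset["Embarked"] = dataset["Embarked"].map({"C":0,"Q":1,"S":2}).astype(int)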
train.head(2)
Survived | Pclass | Sex | Age | Ticke |
---|