
Titanic Survival Prediction (Logistic and KNN)

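Download the data from the Kaggle website: train and test.

Competition description:

  • The sinking of the Titanic is one of the most infamous shipwrecks in history. On April 15, 1912, during her maiden voyage, the Titanic sank after colliding with an iceberg, killing 1502 of the 2224 passengers and crew. The tragedy shocked the international community and led to better safety regulations for ships. One reason the wreck cost so many lives was that there were not enough lifeboats for the passengers and crew. Although surviving the sinking involved some luck, some groups of people were more likely to survive than others, such as women, children, and the upper class.
  • In this competition, participants are asked to work out what sorts of people were more likely to survive. In particular, you are asked to use machine learning tools to predict which passengers survived the disaster.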

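Contents

  • Define the problem
  • Understand the data
  • Data processing (preprocessing and feature engineering)
  • Model building and evaluation
  • Summary

I. Define the problem:

Using the information available, predict whether each of the 418 passengers in test survived, and submit the predictions.

Problem analysis:

The task is to predict a categorical outcome (binary classification) from a set of predictor variables. Supervised machine learning offers many methods for classification: logistic regression, KNN, decision trees, random forests, support vector machines, neural networks, and others. This article uses logistic regression and KNN.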
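To make the idea of KNN classification concrete, here is a minimal NumPy sketch of majority voting among the k nearest neighbours. The function name `knn_predict` and the toy data are made up for illustration; this is not the model fitted later in the article.

```python
import numpy as np

def knn_predict(X_train, y_train, X_new, k=3):
    """Predict 0/1 labels by majority vote among the k nearest training points."""
    X_train = np.asarray(X_train, dtype=float)
    y_train = np.asarray(y_train)
    X_new = np.asarray(X_new, dtype=float)
    preds = []
    for x in X_new:
        # Euclidean distance from x to every training point
        dist = np.sqrt(((X_train - x) ** 2).sum(axis=1))
        nearest = np.argsort(dist)[:k]          # indices of the k closest points
        votes = y_train[nearest]
        preds.append(int(votes.sum() * 2 > k))  # majority vote for binary labels
    return np.array(preds)

# Toy data (made up): label 0 clusters near (0, 0), label 1 near (1, 1)
X_toy = [[0.0, 0.0], [0.1, 0.2], [0.2, 0.1], [1.0, 1.0], [0.9, 1.1], [1.1, 0.9]]
y_toy = [0, 0, 0, 1, 1, 1]
toy_pred = knn_predict(X_toy, y_toy, [[0.05, 0.05], [1.0, 0.95]], k=3)
print(toy_pred)  # [0 1]
```

Because the vote depends on distances, KNN is sensitive to the scale of each feature, which is one reason to put features on comparable scales before fitting.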

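II. Understand the data:

Start with a first look at the number of variables, the data types, the distributions, and the missing values, and form some hypotheses.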

# Import the required modules
# Data handling
import numpy as np
import pandas as pd
import re

# Plotting
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
# Set the plotting style
sns.set_style("darkgrid")
OK, let's take a first look at the data:
# Read the data
train = pd.read_csv(r"G:\Kaggle\Titanic\train.csv")
test = pd.read_csv(r"G:\Kaggle\Titanic\test.csv")
# Look at the first 6 rows of the training set
train.head(6)
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
0 1 0 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.2500 NaN S
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th… female 38.0 1 0 PC 17599 71.2833 C85 C
2 3 1 3 Heikkinen, Miss. Laina female 26.0 0 0 STON/O2. 3101282 7.9250 NaN S
3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 0 113803 53.1000 C123 S
4 5 0 3 Allen, Mr. William Henry male 35.0 0 0 373450 8.0500 NaN S
5 6 0 3 Moran, Mr. James male NaN 0 0 330877 8.4583 NaN Q

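Training set fields: passenger ID, survived or not, ticket class, name, sex, age, number of siblings and spouses aboard (SibSp), number of parents and children aboard (Parch), ticket number, fare, cabin, and port of embarkation.

# Look at a random sample of the test set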
test.sample(6)
PassengerId Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
417 1309 3 Peter, Master. Michael J male NaN 1 1 2668 22.3583 NaN C
224 1116 1 Candee, Mrs. Edward (Helen Churchill Hungerford) female 53.0 0 0 PC 17606 27.4458 NaN C
99 991 3 Nancarrow, Mr. William Henry male 33.0 0 0 A./5. 3338 8.0500 NaN S
410 1302 3 Naughton, Miss. Hannah female NaN 0 0 365237 7.7500 NaN Q
41 933 1 Franklin, Mr. Thomas Parham male NaN 0 0 113778 26.5500 D34 S
70 962 3 Mulvihill, Miss. Bertha E female 24.0 0 0 382653 7.7500 NaN Q

Compared with the training set, only the target variable Survived is missing; the other fields are the same.

train.info()
print("==" * 50)
test.info()
# Summary of the numeric columns:
train.describe()
PassengerId Survived Pclass Age SibSp Parch Fare
count 891.000000 891.000000 891.000000 714.000000 891.000000 891.000000 891.000000
mean 446.000000 0.383838 2.308642 29.699118 0.523008 0.381594 32.204208
std 257.353842 0.486592 0.836071 14.526497 1.102743 0.806057 49.693429
min 1.000000 0.000000 1.000000 0.420000 0.000000 0.000000 0.000000
25% 223.500000 0.000000 2.000000 20.125000 0.000000 0.000000 7.910400
50% 446.000000 0.000000 3.000000 28.000000 0.000000 0.000000 14.454200
75% 668.500000 1.000000 3.000000 38.000000 1.000000 0.000000 31.000000
max 891.000000 1.000000 3.000000 80.000000 8.000000 6.000000 512.329200
# Summary of the string columns:
train.describe(include=['O'])
Name Sex Ticket Cabin Embarked
count 891 891 891 204 889
unique 891 2 681 147 3
top Kink-Heilmann, Miss. Luise Gretchen male 1601 C23 C25 C27 S
freq 1 577 7 4 644

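A. Basic description:

  • Categorical variables: Survived, Pclass (ordinal), Sex, Embarked. Numeric variables: Age, SibSp (discrete), Parch (discrete), Fare.

  • Four fields have missing values, to different degrees (Age and Cabin are missing many values; Fare and Embarked only a few).

  • In the training set:

    • (1) There are 891 passengers, with a survival rate of 38%.
    • (2) The youngest passenger is 0.42 years old and the oldest is 80; excluding missing values, the mean age is 29.7, and there are few elderly passengers.
    • (3) SibSp has a 75th percentile of 1, so at least 25% of passengers travelled with one or more siblings or spouses; Parch has a 75th percentile of 0, so at least 75% travelled without parents or children.
    • (4) The mean fare is about 32 and the maximum about 512, a wide spread.
    • (5) Every name is unique.
    • (6) There are 577 men, so male passengers outnumber female passengers.
    • (7) Ticket has 681 distinct values.
    • (8) Cabin is mostly missing: only 204 of the 891 passengers have a record.
    • (9) Embarked has missing values; 644 passengers boarded at port S, by far the largest share.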
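The missing-value counts summarized above can be read off per column with `isnull()`. The sketch below uses a small made-up frame named `toy` rather than the real train data, so the numbers are illustrative only.

```python
import numpy as np
import pandas as pd

# Toy frame (made-up values) with the same kind of gaps as the Titanic data
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, np.nan],
    "Cabin": [np.nan, "C85", np.nan, np.nan],
    "Fare": [7.25, 71.28, 7.92, 8.05],
})

missing_count = toy.isnull().sum()    # number of missing values per column
missing_share = toy.isnull().mean()   # fraction of missing values per column
print(missing_count)
print(missing_share)
```

Running the same two lines on train and on test separately shows which columns need attention in each set.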

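B. Hypotheses:

The target variable is Survived, and every other variable is a candidate feature. Below we examine how much each existing variable affects survival, keep the useful ones, drop the rest, and look for new information that could help the model. Some initial hypotheses:

1. Pclass and Fare reflect a person's status and wealth; in a crisis, passengers of higher social class survive at a higher rate than those of lower class.

2. When disaster strikes, the social norms of protecting the old and the young and putting women first should come into play, so the old, the young, and women survive at a higher rate.

3. Travelling with several relatives means strength in numbers, so the survival rate may be higher.

4. Name and Ticket do not obviously reflect anything and may be dropped.

5. The ID is useful for record keeping but not for analysis, so it is dropped.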

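C. Missing data:

Missing data has to be handled case by case.

Ways to handle missing values (scikit-learn estimators such as logistic regression and KNN raise an error if the data passed to them contains missing values):

  • Deletion (crude but simple, dropna)

    • Drop incomplete cases, that is, delete rows (suitable when the sample is large and few cases have missing values)
    • Drop the feature that has missing values (suitable when the column is mostly missing and the feature matters little to the model)
  • Imputation (infer the missing values from the data that are known; the estimates are not guaranteed to be correct, but this usually gives a better model than dropping the column)

    • Estimate from the feature's mean, median, mode, etc. (basic)
    • Estimate the missing values from the other known variables (advanced)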
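The deletion and basic imputation options listed above can be sketched as follows. The frame `toy` and its values are made up for illustration; which option suits which Titanic column is decided later in the article.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, 35.0],
    "Embarked": ["S", "C", None, "S"],
    "Cabin": [None, "C85", None, None],
})

# 1) Delete a feature that is mostly missing
col_dropped = toy.drop(columns=["Cabin"])

# 2) Delete the rows that still contain a missing value
rows_dropped = col_dropped.dropna()

# 3) Impute instead: median for a numeric column, mode for a categorical one
imputed = col_dropped.copy()
imputed["Age"] = imputed["Age"].fillna(imputed["Age"].median())
imputed["Embarked"] = imputed["Embarked"].fillna(imputed["Embarked"].mode()[0])
```

Here row deletion keeps only 2 of 4 rows, while imputation keeps all 4, which is the trade-off the list describes.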

D. Data type conversion:

All string columns must be converted to numeric data.
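Two common ways to turn a string column into numbers are sketched below on a made-up frame: an explicit dictionary of integer codes, and one-hot (dummy) columns. The one-hot variant is shown only as an alternative for comparison.

```python
import pandas as pd

toy = pd.DataFrame({"Sex": ["male", "female", "female"],
                    "Embarked": ["S", "C", "Q"]})

# Option 1: explicit integer codes via a dictionary
toy["Sex_code"] = toy["Sex"].map({"male": 0, "female": 1})

# Option 2: one-hot (dummy) columns, which avoid implying an order between ports
embarked_dummies = pd.get_dummies(toy["Embarked"], prefix="Embarked")
print(toy["Sex_code"].tolist())        # [0, 1, 1]
print(list(embarked_dummies.columns))  # ['Embarked_C', 'Embarked_Q', 'Embarked_S']
```

Integer codes are compact but impose an ordering (C < Q < S, say) that distance-based models such as KNN will take literally; dummy columns avoid that at the cost of more features.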

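III. Data processing (preprocessing and feature engineering)

First put train and test into one list, so that later code can process both datasets in a single loop: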
combination_data = [train,test]
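In what follows, the variables are handled by their current type, numeric or string. Along the way we deal with missing values and decide, from each variable's relationship with the survival rate, whether to keep it, drop it, or derive a new variable from it. In the end every column is converted to a numeric type.

Numeric variables:

  • PassengerId

The passenger ID only distinguishes records and has no predictive value, so it is dropped.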
del train["PassengerId"]
  • Pclass

There are three ticket classes, which to some extent stand for a passenger's status and social position. Let's look at the effect of Pclass:
train[["Pclass","Survived"]].groupby("Pclass",as_index=False).mean().sort_values(by="Survived",ascending=False)
Pclass Survived
0 1 0.629630
1 2 0.472826
2 3 0.242363
sns.barplot(x="Pclass",y="Survived",data=train)
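The groupby pattern used throughout this section relies on Survived being coded 0/1, so the mean within a group is that group's survival rate. A minimal sketch on made-up rows:

```python
import pandas as pd

toy = pd.DataFrame({"Pclass":   [1, 1, 2, 3, 3, 3],
                    "Survived": [1, 1, 0, 0, 1, 0]})

# Mean of a 0/1 column within each group = share of survivors in that group
rate = toy.groupby("Pclass", as_index=False)["Survived"].mean()
rate = rate.sort_values(by="Survived", ascending=False)
print(rate)
```

In this toy frame class 1 has rate 1.0, class 3 has 1/3, and class 2 has 0.0.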
  • SibSp
train[["SibSp","Survived"]].groupby("SibSp",as_index=False).mean().sort_values(by="Survived",ascending=False)
SibSp Survived
1 1 0.535885
2 2 0.464286
0 0 0.345395
3 3 0.250000
4 4 0.166667
5 5 0.000000
6 8 0.000000

For SibSp values of 3, 4, 5 and 8 the survival rate is low, in some cases 0. There is an effect, but it is not clear-cut.

  • Parch
train[["Parch","Survived"]].groupby("Parch",as_index=False).mean().sort_values(by="Survived",ascending=False)
Parch Survived
3 3 0.600000
1 1 0.550847
2 2 0.500000
0 0 0.343658
5 5 0.200000
4 4 0.000000
6 6 0.000000

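The survival rate is also low for Parch values of 4, 5 and 6, and the effect is again not very clear. This resembles SibSp above, so let's add the two counts together and see how the total relates to survival: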

for dataset in combination_data:
    dataset["Family"] = dataset["SibSp"] + dataset["Parch"] + 1
train[["Family","Survived"]].groupby("Family",as_index=False).mean().sort_values(by="Survived",ascending=False)
Family Survived
3 4 0.724138
2 3 0.578431
1 2 0.552795
6 7 0.333333
0 1 0.303538
4 5 0.200000
5 6 0.136364
7 8 0.000000
8 11 0.000000
sns.countplot(x="Family",hue="Survived",data=train)
for dataset in combination_data:
    dataset["Family_size"] = 0    # create the new column
    dataset.loc[dataset["Family"] == 1,"Family_size"] = 1                              # small (travelling alone)
    dataset.loc[(dataset["Family"] > 1) & (dataset["Family"] <= 4),"Family_size"] = 2  # medium (2-4)
    dataset.loc[dataset["Family"] > 4,"Family_size"] = 3                                # large (5-11)
    dataset["Family_size"] = dataset["Family_size"].astype(int)
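The same three-way grouping can also be written with `pd.cut`, shown here as an alternative sketch on a made-up Series rather than a replacement for the loop above. The bin edges mirror the thresholds 1, 4 and 11 used in the text.

```python
import pandas as pd

family = pd.Series([1, 2, 4, 5, 11], name="Family")

# Right-closed bins: (0, 1] -> 1, (1, 4] -> 2, (4, 11] -> 3
family_size = pd.cut(family, bins=[0, 1, 4, 11], labels=[1, 2, 3]).astype(int)
print(family_size.tolist())  # [1, 2, 2, 3, 3]
```

Keeping the edges in one list makes it easier to see and change the grouping in a single place.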
We can also ask whether travelling with any family members at all affects survival, to decide whether to build another new feature:
for dataset in combination_data:
    dataset["Alone"] = dataset["Family"].map(lambda x : 1 if x==1 else 0)
train[["Alone","Survived"]].groupby("Alone",as_index=False).mean().sort_values("Survived",ascending=False)
Alone Survived
0 0 0.505650
1 1 0.303538
sns.barplot(x="Alone",y="Survived",data=train)
for dataset in combination_data:
    dataset.drop(["SibSp","Parch","Family"],axis=1,inplace=True)
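Let's bring Pclass into the picture:
# factorplot was renamed catplot in newer seaborn; kind="point" keeps the original point plot
sns.catplot(x="Pclass",y="Survived",hue="Alone",kind="point",data=train)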
  • Age
train.Age.describe()
count    714.000000
mean      29.699118
std       14.526497
min        0.420000
25%       20.125000
50%       28.000000
75%       38.000000
max       80.000000
Name: Age, dtype: float64
# Look at the distribution of Age
sns.violinplot(y="Age",data=train)
# Compare the age distributions of passengers who survived and who died
sns.violinplot(y="Age",x="Survived",data=train)
train["Age_group"] = pd.cut(train.Age,5)
train[["Age_group","Survived"]].groupby("Age_group",as_index=False).mean().sort_values("Survived",ascending=False)
Age_group Survived
0 (0.34, 16.336] 0.550000
3 (48.168, 64.084] 0.434783
2 (32.252, 48.168] 0.404255
1 (16.336, 32.252] 0.369942
4 (64.084, 80.0] 0.090909
sns.barplot(x="Age_group",y="Survived",data=train)
del train["Age_group"]
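Next we fill in the missing Age values. First check the column:
train.Age.isnull().sum()
177

Of the 891 passengers in train, 177 (close to 20%) have no recorded age. The mean age is 29.7, the standard deviation 14.5, and the median 28. For now, the missing ages are filled with random integers drawn between mean - std and mean + std, which introduces some noise. Once I have learned more advanced imputation methods, I will come back and revise this.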
for dataset in combination_data:
    Age_avg = dataset.Age.mean()
    Age_std = dataset["Age"].std()
    missing_number = dataset["Age"].isnull().sum()
    # Assign through .loc with a boolean mask; chained indexing triggers SettingWithCopyWarning
    dataset.loc[dataset["Age"].isnull(), "Age"] = np.random.randint(int(Age_avg - Age_std), int(Age_avg + Age_std), missing_number)
    dataset["Age"] = dataset["Age"].astype(int)
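Because the fill values are random, every run gives slightly different ages. The sketch below shows the same mean ± std idea with a seeded generator so the result is repeatable. The Series `age` is made up, and the seed value is an arbitrary choice for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)   # fixed seed (arbitrary) so the fill is repeatable

age = pd.Series([22.0, np.nan, 26.0, 35.0, np.nan, 54.0], name="Age")
low = int(age.mean() - age.std())    # mean and std ignore missing values
high = int(age.mean() + age.std())

missing = age.isnull()
filled = age.copy()
# Draw one integer in [low, high) for every missing entry
filled.loc[missing] = rng.integers(low, high, size=int(missing.sum()))
print(filled.tolist())
```

The known ages are left untouched; only the two gaps receive drawn values.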
# Again use 5 groups:
for dataset in combination_data:
    dataset["Age_group"] = pd.cut(dataset.Age, 5)
# Now record each passenger's group with an integer code:
for dataset in combination_data:
    dataset.loc[dataset["Age"]  <= 16,"Age"] = 0
    dataset.loc[(dataset["Age"] > 16) & (dataset["Age"] <= 32), "Age"] = 1
    dataset.loc[(dataset["Age"] > 32) & (dataset["Age"] <= 48), "Age"] = 2
    dataset.loc[(dataset["Age"] > 48) & (dataset["Age"] <= 64), "Age"] = 3
    dataset.loc[dataset["Age"]  > 64, "Age"] = 4
for dataset in combination_data:
    dataset.drop("Age_group",axis=1,inplace=True)
  • Fare
train.Fare.describe()
count    891.000000
mean      32.204208
std       49.693429
min        0.000000
25%        7.910400
50%       14.454200
75%       31.000000
max      512.329200
Name: Fare, dtype: float64
sns.violinplot(y="Fare",data=train)
# Compare the fares of passengers who survived and who died
sns.violinplot(y="Fare",x="Survived",data=train)
train["Fare_group"] = pd.qcut(train["Fare"],4) # split into quartile bins
train[["Fare_group","Survived"]].groupby("Fare_group",as_index=False).mean()
Fare_group Survived
0 (-0.001, 7.91] 0.197309
1 (7.91, 14.454] 0.303571
2 (14.454, 31.0] 0.454955
3 (31.0, 512.329] 0.581081

The survival rate rises steadily with the fare, so Fare is kept as a feature.

The test set has one missing Fare value, which we fill with the median:

test["Fare"] = test["Fare"].fillna(test["Fare"].median())
for dataset in combination_data:
    dataset.loc[dataset["Fare"]  <= 7.91,"Fare"] = 0
    dataset.loc[(dataset["Fare"] >  7.91)   & (dataset["Fare"] <= 14.454), "Fare"] = 1
    dataset.loc[(dataset["Fare"] >  14.454) & (dataset["Fare"] <= 31.0),   "Fare"] = 2
    dataset.loc[dataset["Fare"]  >  31.0, "Fare"] = 3
    dataset["Fare"] = dataset["Fare"].astype(int)
del train["Fare_group"]
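The quartile edges hard-coded above (7.91, 14.454, 31.0) can also be captured programmatically and reused on another dataset. This is a sketch on made-up fares: `pd.qcut` with `retbins=True` returns the edges, and `pd.cut` applies them elsewhere.

```python
import pandas as pd

fare = pd.Series([0.0, 7.25, 7.925, 8.05, 14.4542, 26.55, 53.1, 512.3292])

# Quartile bins learned from this data; retbins=True also returns the bin edges
fare_code, edges = pd.qcut(fare, 4, labels=False, retbins=True)
print(fare_code.tolist())  # [0, 0, 1, 1, 2, 2, 3, 3]

# Apply the same edges to new values (for example, a test set)
new_fare = pd.Series([7.0, 30.0])
new_code = pd.cut(new_fare, bins=edges, labels=False, include_lowest=True)
print(new_code.tolist())   # [0, 2]
```

One caveat with this approach: a new value outside the learned range (above the largest edge) becomes NaN, so the outer edges may need widening before use.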
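String variables:

  • Name

Passenger names are all distinct, so the column could simply be dropped. Other write-ups, however, point out that the length of a Western name and the title it contains can also reflect a person's status, so let's examine how these two factors relate to survival.

(1) Name length (counted here as the number of space-separated words in the name)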
for dataset in combination_data:
    dataset["The_length_of_name"] = dataset["Name"].map(lambda x:len(re.split(" ",x)))
train[["The_length_of_name","Survived"]].groupby("The_length_of_name",as_index=False).mean().sort_values("Survived",ascending=False)
The_length_of_name Survived
6 9 1.000000
7 14 1.000000
4 7 0.842105
3 6 0.773585
5 8 0.555556
2 5 0.427083
1 4 0.340206
0 3 0.291803
sns.barplot(x="The_length_of_name",y="Survived",data=train)
from sklearn.preprocessing import StandardScaler
Stdsca = StandardScaler()
name_length1 = Stdsca.fit_transform(train[["The_length_of_name"]])
name_length1 = pd.DataFrame(name_length1,columns=["name_length"])
train = pd.concat([train,name_length1],axis=1)
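# Standardize test with the scaler fitted on train (transform, not fit_transform), so both sets share one scale
name_length2 = Stdsca.transform(test[["The_length_of_name"]])
name_length2 = pd.DataFrame(name_length2,columns=["name_length"])
test = pd.concat([test,name_length2],axis=1)
# Rebuild the list so that it points at the new DataFrames
combination_data = [train,test]
# Drop the original name-length column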
for dataset in combination_data:
    del dataset["The_length_of_name"]
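Standardization subtracts the mean and divides by the standard deviation. The NumPy sketch below spells that out on made-up numbers and follows one common convention, stated here as an assumption: the statistics come from the training data and are reused for the test data.

```python
import numpy as np

train_len = np.array([3.0, 4.0, 4.0, 7.0])   # made-up name lengths
test_len = np.array([3.0, 5.0])

mu = train_len.mean()
sigma = train_len.std()   # population std (ddof=0), the definition StandardScaler uses

train_scaled = (train_len - mu) / sigma
test_scaled = (test_len - mu) / sigma        # reuse the training statistics
print(train_scaled)
print(test_scaled)
```

After scaling, the training values have mean 0 and standard deviation 1; the test values generally do not, which is expected when the training statistics are reused.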
(2) Title
# Look at the format of the names
train.Name.head(7)
0                            Braund, Mr. Owen Harris
1    Cumings, Mrs. John Bradley (Florence Briggs Th…
2                             Heikkinen, Miss. Laina
3       Futrelle, Mrs. Jacques Heath (Lily May Peel)
4                           Allen, Mr. William Henry
5                                   Moran, Mr. James
6                            McCarthy, Mr. Timothy J
Name: Name, dtype: object
# Extract the title into a new column (raw string so that \. is a literal dot)
for dataset in combination_data:
    dataset["Title"] = dataset["Name"].str.extract(r"([A-Za-z]+)\.",expand=False)
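To see what the regular expression captures, here is a small sketch on three names taken from the tables in this article. The pattern matches the first run of letters that is immediately followed by a dot.

```python
import pandas as pd

names = pd.Series([
    "Braund, Mr. Owen Harris",
    "Heikkinen, Miss. Laina",
    "Moubarek, Master. Halim Gonios",
])

# One or more letters immediately followed by a literal dot
titles = names.str.extract(r"([A-Za-z]+)\.", expand=False)
print(titles.tolist())  # ['Mr', 'Miss', 'Master']
```

A name with no dot at all would yield NaN, which is why the later mapping step fills unmatched titles with 0.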
train.sample(4)
Survived Pclass Name Sex Age Ticket Fare Cabin Embarked Family_size Alone name_length Title
271 1 3 Tornquist, Mr. William Henry male 1 LINE 0 NaN S 1 1 -0.059474 Mr
389 1 2 Lehmann, Miss. Bertha female 1 SC 1748 1 NaN C 1 1 -0.914177 Miss
40 0 3 Ahlin, Mrs. Johan (Johanna Persdotter Larsson) female 2 7546 1 NaN S 2 0 1.649930 Mrs
709 1 3 Moubarek, Master. Halim Gonios (“William George”) male 1 2661 2 NaN C 2 0 1.649930 Master
# Title is related to Sex, so analyze the two together
pd.crosstab(train.Title,train.Sex)
Sex female male
Title
Capt 0 1
Col 0 2
Countess 1 0
Don 0 1
Dr 1 6
Jonkheer 0 1
Lady 1 0
Major 0 2
Master 0 40
Miss 182 0
Mlle 2 0
Mme 1 0
Mr 0 517
Mrs 125 0
Ms 1 0
Rev 0 6
Sir 0 1
# Most titles are Master, Miss, Mr or Mrs; group the rarer ones:
for dataset in combination_data:
    dataset['Title'] = dataset['Title'].replace(['Lady', 'Countess','Capt', 'Col','Don', 'Dr', 'Major', 'Rev', 'Sir', 'Jonkheer', 'Dona'], 'Rare')
    dataset['Title'] = dataset['Title'].replace('Mlle', 'Miss')
    dataset['Title'] = dataset['Title'].replace('Ms', 'Miss')
    dataset['Title'] = dataset['Title'].replace('Mme', 'Mrs')
# Explore the relationship between Title and survival
train[["Title","Survived"]].groupby("Title",as_index=False).mean().sort_values("Survived",ascending=False)
Title Survived
3 Mrs 0.793651
1 Miss 0.702703
0 Master 0.575000
4 Rare 0.347826
2 Mr 0.156673
sns.barplot(x="Title",y="Survived",data=train)
# Convert the titles to numeric codes
for dataset in combination_data:
    dataset["Title"] = dataset["Title"].map({"Mr":1,"Mrs":2,"Miss":3,"Master":4,"Rare":5})
    dataset["Title"] = dataset["Title"].fillna(0)
# Drop the original Name feature
for dataset in combination_data:
    del dataset["Name"]
# Look at the data as it stands now
train.head(3)
Survived Pclass Sex Age Ticket Fare Cabin Embarked Family_size Alone name_length Title
0 0 3 male 1 A/5 21171 0 NaN S 2 0 -0.059474 1
1 1 1 female 2 PC 17599 3 C85 C 2 0 2.504633 2
2 1 3 female 1 STON/O2. 3101282 1 NaN S 1 1 -0.914177 3
  • Sex

The analysis of Title already showed that sex affects survival. Now let's look at Sex on its own:

train[["Sex","Survived"]].groupby("Sex",as_index=False).mean().sort_values("Survived",ascending=False)
Sex Survived
0 female 0.742038
1 male 0.188908
sns.countplot(x="Sex",hue="Survived",data=train)
train[["Pclass","Sex","Survived"]].groupby(["Pclass","Sex"],as_index=False).mean().sort_values(by="Survived",ascending=False)
Pclass Sex Survived
0 1 female 0.968085
2 2 female 0.921053
4 3 female 0.500000
1 1 male 0.368852
3 2 male 0.157407
5 3 male 0.135447
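# factorplot was renamed catplot in newer seaborn; kind="point" keeps the original point plot
sns.catplot(x="Pclass",y="Survived",hue="Sex",kind="point",data=train)
# Convert the strings to numbers: 0 for male, 1 for female.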
for dataset in combination_data:
    dataset["Sex"] = dataset["Sex"].map({"male":0,"female":1})
train.head(4)
Survived Pclass Sex Age Ticket Fare Cabin Embarked Family_size Alone name_length Title
0 0 3 0 1 A/5 21171 0 NaN S 2 0 -0.059474 1
1 1 1 1 2 PC 17599 3 C85 C 2 0 2.504633 2
2 1 3 1 1 STON/O2. 3101282 1 NaN S 1 1 -0.914177 3
3 1 1 1 2 113803 3 C123 S 2 0 2.504633 2
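  • Cabin
# describe() already showed that Cabin is mostly missing
a = train.Cabin.isnull().sum()
print("Missing count: %d" % a)
Missing count: 687

More than 75% of the values are missing, so we do not try to fill them in. Instead, we build a new feature from whether Cabin is present and check whether it relates to survival. If it does not, the column is dropped.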
train["Cabin_exist"] = train.Cabin.map(lambda x : "Yes" if type(x)==str else "No")
train[["Cabin_exist", "Survived"]].groupby("Cabin_exist",as_index=False).mean()
Cabin_exist Survived
0 No 0.299854
1 Yes 0.666667
sns.barplot(x="Cabin_exist",y="Survived",data=train)
# The column needs to be numeric, so drop it and build it again
del train["Cabin_exist"]
# 1 if a cabin is recorded, 0 if it is missing
for dataset in combination_data:
    dataset["Cabin_exist"] = dataset["Cabin"].map(lambda x : 1 if type(x)==str else 0)
# Drop the original Cabin column
for dataset in combination_data:
    del dataset["Cabin"]
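The presence indicator can also be built without a lambda, using `notnull()`. A sketch on a made-up Series, offered as an equivalent alternative for a column that holds only strings and missing values:

```python
import numpy as np
import pandas as pd

cabin = pd.Series([np.nan, "C85", np.nan, "C123"], name="Cabin")

# True where a cabin is recorded, then cast to 1/0
cabin_exist = cabin.notnull().astype(int)
print(cabin_exist.tolist())  # [0, 1, 0, 1]
```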
train.head(3)
Survived Pclass Sex Age Ticket Fare Embarked Family_size Alone name_length Title Cabin_exist
0 0 3 0 1 A/5 21171 0 S 2 0 -0.059474 1 0
1 1 1 1 2 PC 17599 3 C 2 0 2.504633 2 1
2 1 3 1 1 STON/O2. 3101282 1 S 1 1 -0.914177 3 0
  • Embarked

This column has missing values. First, let's see whether the port of embarkation affects the survival rate:

train[["Embarked","Survived"]].groupby("Embarked",as_index=False).count().sort_values("Survived",ascending=False)
Embarked Survived
2 S 644
0 C 168
1 Q 77
train[["Embarked","Survived"]].groupby("Embarked",as_index=False).mean().sort_values("Survived",ascending=False)
Embarked Survived
0 C 0.553571
1 Q 0.389610
2 S 0.336957
sns.barplot(x="Embarked",y="Survived",data=train)
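# factorplot was renamed catplot in newer seaborn; kind="point" keeps the original point plot
sns.catplot(x="Pclass",y="Survived",hue="Embarked",kind="point",data=train)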
train[["Sex","Survived","Embarked"]].groupby(["Sex","Embarked"],as_index=False).count().sort_values("Survived",ascending=False)
Sex Embarked Survived
2 0 S 441
5 1 S 203
0 0 C 95
3 1 C 73
1 0 Q 41
4 1 Q 36

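Port S: 644 passengers boarded, of whom 203 (about 32%) were women. Port C: 168 boarded, 73 (about 43%) women. Port Q: 77 boarded, 36 (about 47%) women. (The female-to-male ratios are about 46%, 77% and 88% respectively.) We already know that women survived at a much higher rate than men, so part of the difference between ports may come from the sex mix; S has both the lowest share of women and the lowest survival rate.

Missing values: from the summary of the data we know that most passengers boarded at port S, and Embarked is missing only 2 values in train, so we fill them with S: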
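The share of women at each port follows directly from the counts in the table above (Sex 0 = male, 1 = female). The sketch below recomputes it and also shows the female-to-male ratio, a different quantity that is easy to confuse with the share.

```python
import pandas as pd

# Passenger counts by port and sex, copied from the table above
counts = pd.DataFrame(
    {"male": [441, 95, 41], "female": [203, 73, 36]},
    index=["S", "C", "Q"],
)

total = counts.sum(axis=1)
female_share = counts["female"] / total             # women as a fraction of all boarders
female_to_male = counts["female"] / counts["male"]  # women per man, not a share
print(female_share.round(3))
print(female_to_male.round(3))
```

The shares come out at roughly 0.315, 0.435 and 0.468 for S, C and Q, while the ratios are roughly 0.460, 0.768 and 0.878.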

train["Embarked"] = train.Embarked.fillna("S")
# Convert Embarked to numeric codes:
for dataset in combination_data:
    dataset["Embarked"] = dataset["Embarked"].map({"C":0,"Q":1,"S":2}).astype(int)
train.head(2)
Survived Pclass Sex Age Ticke