
Titanic Survival Prediction (Logistic and KNN)

Download the data from the Kaggle website: train and test.


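  • The sinking of the Titanic is one of the most infamous shipwrecks in history. On April 15, 1912, during her maiden voyage, the Titanic sank after colliding with an iceberg, and 1502 of the 2224 passengers and crew died. This sensational tragedy shocked the international community and led to better safety regulations for ships. One reason the wreck cost so many lives was that there were not enough lifeboats for the passengers and crew. Although surviving the sinking involved some luck, some groups of people were more likely to survive than others, such as women, children, and the upper class.
  • In this competition, participants are asked to work out which kinds of people were more likely to survive. In particular, they are asked to use machine learning tools to predict which passengers survived the disaster.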


  • Define the problem
  • Understand the data
  • Data processing (preprocessing and feature engineering)
  • Model building and evaluation
  • Summary




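In other words, the task is to predict a categorical outcome (binary classification) from a set of predictor variables. Supervised machine learning offers many methods for classification: logistic regression, KNN, decision trees, random forests, support vector machines, neural networks, and others. This article uses logistic regression and KNN for the prediction.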
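As a rough illustration of what fitting the two chosen classifiers looks like, here is a minimal sketch with scikit-learn. It assumes synthetic data from `make_classification` rather than the Titanic tables, and the settings `max_iter=1000` and `n_neighbors=5` are illustrative choices, not values taken from this article.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for a binary-outcome table (not the Titanic data)
X, y = make_classification(n_samples=400, n_features=6, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

models = {
    "logistic": LogisticRegression(max_iter=1000),
    "knn": KNeighborsClassifier(n_neighbors=5),
}

scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    # score() returns the mean accuracy on the held-out rows
    scores[name] = model.score(X_test, y_test)

print(scores)
```

Both estimators share the same `fit` / `score` interface, which is why they can be compared in a single loop.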



import numpy as np
import pandas as pd
import re

import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
# Set the plotting style
sns.set_style("darkgrid")
train = pd.read_csv(r"G:\Kaggle\Titanic\train.csv")
test = pd.read_csv(r"G:\Kaggle\Titanic\test.csv")
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
0 1 0 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.2500 NaN S
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th… female 38.0 1 0 PC 17599 71.2833 C85 C
2 3 1 3 Heikkinen, Miss. Laina female 26.0 0 0 STON/O2. 3101282 7.9250 NaN S
3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 0 113803 53.1000 C123 S
4 5 0 3 Allen, Mr. William Henry male 35.0 0 0 373450 8.0500 NaN S
5 6 0 3 Moran, Mr. James male NaN 0 0 330877 8.4583 NaN Q


PassengerId Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
417 1309 3 Peter, Master. Michael J male NaN 1 1 2668 22.3583 NaN C
224 1116 1 Candee, Mrs. Edward (Helen Churchill Hungerford) female 53.0 0 0 PC 17606 27.4458 NaN C
99 991 3 Nancarrow, Mr. William Henry male 33.0 0 0 A./5. 3338 8.0500 NaN S
410 1302 3 Naughton, Miss. Hannah female NaN 0 0 365237 7.7500 NaN Q
41 933 1 Franklin, Mr. Thomas Parham male NaN 0 0 113778 26.5500 D34 S
70 962 3 Mulvihill, Miss. Bertha E female 24.0 0 0 382653 7.7500 NaN Q


print("==" * 50)
PassengerId Survived Pclass Age SibSp Parch Fare
count 891.000000 891.000000 891.000000 714.000000 891.000000 891.000000 891.000000
mean 446.000000 0.383838 2.308642 29.699118 0.523008 0.381594 32.204208
std 257.353842 0.486592 0.836071 14.526497 1.102743 0.806057 49.693429
min 1.000000 0.000000 1.000000 0.420000 0.000000 0.000000 0.000000
25% 223.500000 0.000000 2.000000 20.125000 0.000000 0.000000 7.910400
50% 446.000000 0.000000 3.000000 28.000000 0.000000 0.000000 14.454200
75% 668.500000 1.000000 3.000000 38.000000 1.000000 0.000000 31.000000
max 891.000000 1.000000 3.000000 80.000000 8.000000 6.000000 512.329200
Name Sex Ticket Cabin Embarked
count 891 891 891 204 889
unique 891 2 681 147 3
top Kink-Heilmann, Miss. Luise Gretchen male 1601 C23 C25 C27 S
freq 1 577 7 4 644


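  • Categorical variables: Survived, Pclass (ordinal), Sex, Embarked. Numeric variables: Age, SibSp (discrete), Parch (discrete), Fare.

  • Four fields have missing values, to differing degrees (Age and Cabin are missing many values; Fare and Embarked are missing only a few).

  • In the training set:

    • (1) There are 891 passengers, and the survival rate is 38%.
    • (2) Ages range from 0.42 to 80. Ignoring missing values, the mean age is 29.7, and elderly passengers are few.
    • (3) At least 25% of passengers travelled with one or more siblings or spouses (the 75th percentile of SibSp is 1), and at least 75% travelled without parents or children.
    • (4) The mean fare is about 32 and the maximum about 512, a wide gap.
    • (5) Every name is unique.
    • (6) 577 passengers are male, so men outnumber women.
    • (7) Ticket has 681 distinct values.
    • (8) Cabin is missing for most passengers: only 204 of the 891 have a record.
    • (9) Embarked has missing values; 644 passengers boarded at port S, the largest share.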










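Ways to handle missing values (most scikit-learn estimators raise an error when the data used to build a model contain missing values):

  • Deletion (simple and blunt, dropna)

    • Delete whole instances, i.e. drop rows (suitable when the sample is large and few cases are missing)
    • Delete the features that have missing values (suitable when the column is mostly missing and matters little to the model)
  • Imputation (infer the missing values from the known part of the data; the estimates are not guaranteed to be 100% correct, but this usually models somewhat better than deleting the column)

    • Estimate with the mean, median, or mode of the feature (basic version)
    • Estimate the missing values from other known numeric data (advanced version)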
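The deletion and basic imputation options listed above can be sketched with pandas as follows. The toy frame and its values are made up for illustration and are not taken from train.csv.

```python
import numpy as np
import pandas as pd

# Toy frame with gaps; the values are illustrative only
df = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, 35.0, np.nan],
    "Embarked": ["S", "C", None, "S", "S"],
    "Cabin": [None, "C85", None, None, None],
})

# Deletion, option 1: drop every row that has at least one gap
dropped_rows = df.dropna()
# Deletion, option 2: drop a feature that is mostly empty
dropped_cols = df.drop(columns=["Cabin"])

# Basic imputation: median for a numeric column, mode for a categorical one
filled = df.copy()
filled["Age"] = filled["Age"].fillna(filled["Age"].median())
filled["Embarked"] = filled["Embarked"].fillna(filled["Embarked"].mode()[0])

print(len(dropped_rows), list(dropped_cols.columns))
print(filled[["Age", "Embarked"]])
```

In this toy frame every row has some gap, so row deletion leaves nothing at all, which shows why it only suits data with few incomplete rows.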



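# 3. Data processing (preprocessing and feature engineering)

First, collect train and test in one list so that later code can process both datasets in a single loop: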
combination_data = [train,test]
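**The variables are discussed below by their current data type, numeric and string. Along the way, missing values are handled and each variable is kept or dropped according to its relationship with the survival rate; where necessary, variables are deleted or new ones are created to help the model. In the end, every column is converted to a numeric type.**

## Numeric

- PassengerId: the passenger identifier, used only to tell passengers apart. It has no predictive value, so it is dropped.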
del train["PassengerId"]
- Pclass: the cabin class has three levels and to some extent reflects a passenger's status and social standing. Its effect is explored below:
Pclass Survived
0 1 0.629630
1 2 0.472826
2 3 0.242363
SibSp Survived
1 1 0.535885
2 2 0.464286
0 0 0.345395
3 3 0.250000
4 4 0.166667
5 5 0.000000
6 8 0.000000


  • Parch
Parch Survived
3 3 0.600000
1 1 0.550847
2 2 0.500000
0 0 0.343658
5 5 0.200000
4 4 0.000000
6 6 0.000000
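Survival-rate tables like the ones above can be produced with a `groupby` followed by `mean`, the same pattern this article uses later for `Cabin_exist`. The sketch below assumes a small made-up frame called `toy`; with the real data the `train` frame would take its place.

```python
import pandas as pd

# Toy rows for illustration only
toy = pd.DataFrame({
    "Pclass":   [1, 1, 2, 2, 3, 3, 3, 3],
    "Survived": [1, 1, 1, 0, 0, 0, 1, 0],
})

# The mean of a 0/1 column within each group is that group's survival rate
rate = (
    toy[["Pclass", "Survived"]]
    .groupby("Pclass", as_index=False)
    .mean()
    .sort_values("Survived", ascending=False)
)
print(rate)
```

Swapping `"Pclass"` for `"SibSp"` or `"Parch"` gives the corresponding table for those columns.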


for dataset in combination_data:
    dataset["Family"] = dataset["SibSp"] + dataset["Parch"] + 1
Family Survived
3 4 0.724138
2 3 0.578431
1 2 0.552795
6 7 0.333333
0 1 0.303538
4 5 0.200000
5 6 0.136364
7 8 0.000000
8 11 0.000000
for dataset in combination_data:
    dataset["Family_size"] = 0    # create the new column
    dataset.loc[dataset["Family"] == 1,"Family_size"] = 1                              # small family (travelling alone)
    dataset.loc[(dataset["Family"] > 1) & (dataset["Family"] <= 4),"Family_size"] = 2  # medium family (2-4)
    dataset.loc[dataset["Family"] > 4,"Family_size"] = 3                                # large family (5-11)
    dataset["Family_size"] = dataset["Family_size"].astype(int)
for dataset in combination_data:
    dataset["Alone"] = dataset["Family"].map(lambda x : 1 if x==1 else 0)
Alone Survived
0 0 0.505650
1 1 0.303538
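The three family features can also be derived in a more compact form. The sketch below is an alternative formulation, not the code used in this article: it assumes `pd.cut` with explicit right-closed bins in place of the three `.loc` assignments, and the toy values are chosen to hit every bucket.

```python
import pandas as pd

# Toy SibSp/Parch values chosen to cover every bucket
toy = pd.DataFrame({"SibSp": [0, 1, 1, 3, 8], "Parch": [0, 0, 2, 1, 2]})

# Family = siblings/spouses + parents/children + the passenger
toy["Family"] = toy["SibSp"] + toy["Parch"] + 1

# Right-closed bins (0, 1], (1, 4], (4, 11] map to the codes 1, 2, 3
toy["Family_size"] = pd.cut(
    toy["Family"], bins=[0, 1, 4, 11], labels=[1, 2, 3]
).astype(int)

# Alone is 1 only when the family consists of the passenger
toy["Alone"] = (toy["Family"] == 1).astype(int)
print(toy)
```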
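for dataset in combination_data:
    # Assumption: the loop body did not survive in the source. Later column listings
    # no longer show SibSp, Parch or Family, so they are presumably dropped here.
    for column in ["SibSp", "Parch", "Family"]:
        del dataset[column]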
count    714.000000
mean      29.699118
std       14.526497
min        0.420000
25%       20.125000
50%       28.000000
75%       38.000000
max       80.000000
Name: Age, dtype: float64
train["Age_group"] = pd.cut(train.Age,5)
Age_group Survived
0 (0.34, 16.336] 0.550000
3 (48.168, 64.084] 0.434783
2 (32.252, 48.168] 0.404255
1 (16.336, 32.252] 0.369942
4 (64.084, 80.0] 0.090909
del train["Age_group"]
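177

Of the 891 passengers in the train dataset, 177 (close to 20%) have no recorded age. The mean age is 29.7, the standard deviation is 14.5, and the median is 28. For now, the missing ages are filled with random integers drawn from the range between the mean minus one standard deviation and the mean plus one standard deviation, which introduces some noise. This can be revisited later with a more advanced imputation method.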
for dataset in combination_data:
    Age_avg = dataset["Age"].mean()
    Age_std = dataset["Age"].std()
    missing = dataset["Age"].isnull()
    # Fill the gaps with random integers from [mean - std, mean + std);
    # .loc avoids chained assignment, and int() gives randint integer bounds
    dataset.loc[missing, "Age"] = np.random.randint(int(Age_avg - Age_std), int(Age_avg + Age_std), missing.sum())
    dataset["Age"] = dataset["Age"].astype(int)
for dataset in combination_data:
    dataset["Age_group"] = pd.cut(dataset.Age, 5)
for dataset in combination_data:
    dataset.loc[dataset["Age"]  <= 16,"Age"] = 0
    dataset.loc[(dataset["Age"] > 16) & (dataset["Age"] <= 32), "Age"] = 1
    dataset.loc[(dataset["Age"] > 32) & (dataset["Age"] <= 48), "Age"] = 2
    dataset.loc[(dataset["Age"] > 48) & (dataset["Age"] <= 64), "Age"] = 3
    dataset.loc[dataset["Age"]  > 64, "Age"] = 4
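for dataset in combination_data:
    # Assumption: the loop body did not survive in the source. Age_group is absent
    # from later column listings, so the helper column is presumably dropped here.
    del dataset["Age_group"]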
- Fare
count    891.000000
mean      32.204208
std       49.693429
min        0.000000
25%        7.910400
50%       14.454200
75%       31.000000
max      512.329200
Name: Fare, dtype: float64
train["Fare_group"] = pd.qcut(train["Fare"],4)  # split into four quantile bins
Fare_group Survived
0 (-0.001, 7.91] 0.197309
1 (7.91, 14.454] 0.303571
2 (14.454, 31.0] 0.454955
3 (31.0, 512.329] 0.581081
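Age was grouped with `pd.cut` and Fare with `pd.qcut`, and the difference matters for a skewed column such as Fare. The sketch below uses made-up fares to contrast the two; `labels=False` is an illustrative choice that returns the integer bin codes directly.

```python
import pandas as pd

# Skewed toy fares (illustrative values, one extreme outlier)
fare = pd.Series([0.0, 7.0, 8.0, 10.0, 13.0, 26.0, 30.0, 512.0])

# pd.cut: four intervals of equal width over the range 0 to 512
equal_width = pd.cut(fare, 4, labels=False)

# pd.qcut: four intervals holding roughly equal numbers of rows
equal_count = pd.qcut(fare, 4, labels=False)

print(equal_width.tolist())
print(equal_count.tolist())
```

With equal-width bins the single outlier pushes almost every fare into the first bin, while quantile bins keep the groups balanced, which is a plausible reason to prefer `pd.qcut` for Fare.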



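for dataset in combination_data:
    # test.csv has a missing Fare, and astype(int) fails on NaN.
    # Filling it with the median is an assumption, not a step shown in the source.
    dataset["Fare"] = dataset["Fare"].fillna(dataset["Fare"].median())
    dataset.loc[dataset["Fare"]  <= 7.91,"Fare"] = 0
    dataset.loc[(dataset["Fare"] >  7.91)   & (dataset["Fare"] <= 14.454), "Fare"] = 1
    dataset.loc[(dataset["Fare"] >  14.454) & (dataset["Fare"] <= 31.0),   "Fare"] = 2
    dataset.loc[dataset["Fare"]  >  31.0, "Fare"] = 3
    dataset["Fare"] = dataset["Fare"].astype(int)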
del train["Fare_group"]
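## String

### Name

No two passengers share a name, so the column could simply be dropped. However, other write-ups suggest that the length of a Western name and the title it contains can also reflect a person's status, so the effect of both on the survival rate is explored here:

(1) Name length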
for dataset in combination_data:
    dataset["The_length_of_name"] = dataset["Name"].map(lambda x:len(re.split(" ",x)))
The_length_of_name Survived
6 9 1.000000
7 14 1.000000
4 7 0.842105
3 6 0.773585
5 8 0.555556
2 5 0.427083
1 4 0.340206
0 3 0.291803
from sklearn.preprocessing import StandardScaler
Stdsca = StandardScaler()
# Fit the scaler on train and append the standardized column
name_length1 = Stdsca.fit_transform(train[["The_length_of_name"]])
name_length1 = pd.DataFrame(name_length1,columns=["name_length"])
train = pd.concat([train,name_length1],axis=1)
# Reuse the scaler fitted on train so that test is put on the same scale
name_length2 = Stdsca.transform(test[["The_length_of_name"]])
name_length2 = pd.DataFrame(name_length2,columns=["name_length"])
test = pd.concat([test,name_length2],axis=1)
combination_data = [train,test]
for dataset in combination_data:
    del dataset["The_length_of_name"]
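Standardization subtracts the mean and divides by the standard deviation. The NumPy sketch below spells out that arithmetic on made-up word counts, under the assumption that the statistics are computed on the training values and then reused for the test values.

```python
import numpy as np

# Toy name lengths (word counts); illustrative values only
train_len = np.array([3.0, 4.0, 4.0, 5.0, 9.0])
test_len = np.array([4.0, 7.0])

mu = train_len.mean()
# Population standard deviation (ddof=0), which is what StandardScaler uses
sigma = train_len.std()

train_scaled = (train_len - mu) / sigma
# Reuse the training statistics so both sets share one scale
test_scaled = (test_len - mu) / sigma

print(train_scaled.round(3), test_scaled.round(3))
```

After scaling, the training column has mean 0 and standard deviation 1; the test column generally does not, because it is measured against the training statistics.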
0 Braund, Mr. Owen Harris
1 Cumings, Mrs. John Bradley (Florence Briggs Th…
2 Heikkinen, Miss. Laina
3 Futrelle, Mrs. Jacques Heath (Lily May Peel)
4 Allen, Mr. William Henry
5 Moran, Mr. James
6 McCarthy, Mr. Timothy J
Name: Name, dtype: object
for dataset in combination_data:
    # Raw string so that \. is passed to the regex engine as an escaped period
    dataset["Title"] = dataset["Name"].str.extract(r"([A-Za-z]+)\.",expand=False)
Survived Pclass Name Sex Age Ticket Fare Cabin Embarked Family_size Alone name_length Title
271 1 3 Tornquist, Mr. William Henry male 1 LINE 0 NaN S 1 1 -0.059474 Mr
389 1 2 Lehmann, Miss. Bertha female 1 SC 1748 1 NaN C 1 1 -0.914177 Miss
40 0 3 Ahlin, Mrs. Johan (Johanna Persdotter Larsson) female 2 7546 1 NaN S 2 0 1.649930 Mrs
709 1 3 Moubarek, Master. Halim Gonios (“William George”) male 1 2661 2 NaN C 2 0 1.649930 Master
Sex female male
Capt 0 1
Col 0 2
Countess 1 0
Don 0 1
Dr 1 6
Jonkheer 0 1
Lady 1 0
Major 0 2
Master 0 40
Miss 182 0
Mlle 2 0
Mme 1 0
Mr 0 517
Mrs 125 0
Ms 1 0
Rev 0 6
Sir 0 1
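A title-by-sex count like the one above can be built from the extracted titles with `pd.crosstab`. The sketch below assumes a four-row toy frame whose names are taken from the sample rows shown earlier; the frame name `toy` is hypothetical.

```python
import pandas as pd

# A few names in the dataset's "Surname, Title. Given names" layout
toy = pd.DataFrame({
    "Name": [
        "Braund, Mr. Owen Harris",
        "Heikkinen, Miss. Laina",
        "Allen, Mr. William Henry",
        "Peter, Master. Michael J",
    ],
    "Sex": ["male", "female", "male", "male"],
})

# Capture the first run of letters that is immediately followed by a period
toy["Title"] = toy["Name"].str.extract(r"([A-Za-z]+)\.", expand=False)

# Rows are titles, columns are sexes, cells are passenger counts
counts = pd.crosstab(toy["Title"], toy["Sex"])
print(counts)
```

The surname is skipped because it is followed by a comma rather than a period, so the first match is the title.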
for dataset in combination_data:
    dataset['Title'] = dataset['Title'].replace(['Lady', 'Countess','Capt', 'Col','Don', 'Dr', 'Major', 'Rev', 'Sir', 'Jonkheer', 'Dona'], 'Rare')
    dataset['Title'] = dataset['Title'].replace('Mlle', 'Miss')
    dataset['Title'] = dataset['Title'].replace('Ms', 'Miss')
    dataset['Title'] = dataset['Title'].replace('Mme', 'Mrs')
Title Survived
3 Mrs 0.793651
1 Miss 0.702703
0 Master 0.575000
4 Rare 0.347826
2 Mr 0.156673
for dataset in combination_data:
    dataset["Title"] = dataset["Title"].map({"Mr":1,"Mrs":2,"Miss":3,"Master":4,"Rare":5})
    dataset["Title"] = dataset["Title"].fillna(0)
for dataset in combination_data:
    del dataset["Name"]
Survived Pclass Sex Age Ticket Fare Cabin Embarked Family_size Alone name_length Title
0 0 3 male 1 A/5 21171 0 NaN S 2 0 -0.059474 1
1 1 1 female 2 PC 17599 3 C85 C 2 0 2.504633 2
2 1 3 female 1 STON/O2. 3101282 1 NaN S 1 1 -0.914177 3
  • Sex


Sex Survived
0 female 0.742038
1 male 0.188908
Pclass Sex Survived
0 1 female 0.968085
2 2 female 0.921053
4 3 female 0.500000
1 1 male 0.368852
3 2 male 0.157407
5 3 male 0.135447
for dataset in combination_data:
    dataset["Sex"] = dataset["Sex"].map({"male":0,"female":1})
Survived Pclass Sex Age Ticket Fare Cabin Embarked Family_size Alone name_length Title
0 0 3 0 1 A/5 21171 0 NaN S 2 0 -0.059474 1
1 1 1 1 2 PC 17599 3 C85 C 2 0 2.504633 2
2 1 3 1 1 STON/O2. 3101282 1 NaN S 1 1 -0.914177 3
3 1 1 1 2 113803 3 C123 S 2 0 2.504633 2
  • Cabin
a = train.Cabin.isnull().sum()
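print("Missing count: %d" % a)
Missing count: 687

More than 75% of the values are missing, so no imputation is attempted. Instead, a new feature is built from whether Cabin is present, to see whether it relates to survival. If it does not, the column is dropped.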
train["Cabin_exist"] = train.Cabin.map(lambda x : "Yes" if type(x)==str else "No")
train[["Cabin_exist", "Survived"]].groupby("Cabin_exist",as_index=False).mean()
Cabin_exist Survived
0 No 0.299854
1 Yes 0.666667
del train["Cabin_exist"]
for dataset in combination_data:
    dataset["Cabin_exist"] = dataset["Cabin"].map(lambda x : 1 if type(x)==str else 0)
for dataset in combination_data:
    del dataset["Cabin"]
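The presence indicator can also be written without a lambda. The sketch below is an equivalent formulation on made-up rows, assuming `Series.notna()` in place of the `type(x)==str` check.

```python
import numpy as np
import pandas as pd

# Toy rows; NaN marks a passenger with no recorded cabin
toy = pd.DataFrame({
    "Cabin": [np.nan, "C85", np.nan, "C123"],
    "Survived": [0, 1, 1, 1],
})

# 1 if a cabin is recorded, 0 otherwise
toy["Cabin_exist"] = toy["Cabin"].notna().astype(int)

# Survival rate within each indicator value
rate = toy.groupby("Cabin_exist")["Survived"].mean()
print(rate)
```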
Survived Pclass Sex Age Ticket Fare Embarked Family_size Alone name_length Title Cabin_exist
0 0 3 0 1 A/5 21171 0 S 2 0 -0.059474 1 0
1 1 1 1 2 PC 17599 3 C 2 0 2.504633 2 1
2 1 3 1 1 STON/O2. 3101282 1 S 1 1 -0.914177 3 0
  • Embarked


Embarked Survived
2 S 644
0 C 168
1 Q 77
Embarked Survived
0 C 0.553571
1 Q 0.389610
2 S 0.336957
Sex Embarked Survived
2 0 S 441
5 1 S 203
0 0 C 95
3 1 C 73
1 0 Q 41
4 1 Q 36



train["Embarked"] = train.Embarked.fillna("S")
for dataset in combination_data:
    dataset["Embarked"] = dataset["Embarked"].map({"C":0,"Q":1,"S":2}).astype(int)
Survived Pclass Sex Age Ticke