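A First Look at Kaggle: Titanic Survival Prediction
Continuing my study of data mining, I tried the Titanic survival prediction competition on Kaggle.
Titanic for Machine Learning
Imports and loading the data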
# data processing
import numpy as np
import pandas as pd
import re
# visualization
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
plt.style.use('ggplot')
train = pd.read_csv('D:/data/titanic/train.csv' )
test = pd.read_csv('D:/data/titanic/test.csv')
train.head()
PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S |
1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th… | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S |
3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S |
4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | S |
train.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
PassengerId 891 non-null int64
Survived 891 non-null int64
Pclass 891 non-null int64
Name 891 non-null object
Sex 891 non-null object
Age 714 non-null float64
SibSp 891 non-null int64
Parch 891 non-null int64
Ticket 891 non-null object
Fare 891 non-null float64
Cabin 204 non-null object
Embarked 889 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.6+ KB
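The features are: PassengerId, which carries no particular meaning.
Pclass: ticket class. Does it affect survival? Did passengers in higher classes have a better chance?
Name: helps us infer sex and approximate age.
Sex: do women have a higher survival rate?
Age: does survival differ across age groups?
SibSp and Parch: the number of siblings/spouses and of parents/children aboard. Does having relatives aboard raise or lower the survival rate?
Fare: ticket price. Do higher fares mean a better chance?
Cabin, Embarked: cabin number and port of embarkation. Intuitively these should have no effect on survival.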
train.describe()
PassengerId | Survived | Pclass | Age | SibSp | Parch | Fare | |
---|---|---|---|---|---|---|---|
count | 891.000000 | 891.000000 | 891.000000 | 714.000000 | 891.000000 | 891.000000 | 891.000000 |
mean | 446.000000 | 0.383838 | 2.308642 | 29.699118 | 0.523008 | 0.381594 | 32.204208 |
std | 257.353842 | 0.486592 | 0.836071 | 14.526497 | 1.102743 | 0.806057 | 49.693429 |
min | 1.000000 | 0.000000 | 1.000000 | 0.420000 | 0.000000 | 0.000000 | 0.000000 |
25% | 223.500000 | 0.000000 | 2.000000 | 20.125000 | 0.000000 | 0.000000 | 7.910400 |
50% | 446.000000 | 0.000000 | 3.000000 | 28.000000 | 0.000000 | 0.000000 | 14.454200 |
75% | 668.500000 | 1.000000 | 3.000000 | 38.000000 | 1.000000 | 0.000000 | 31.000000 |
max | 891.000000 | 1.000000 | 3.000000 | 80.000000 | 8.000000 | 6.000000 | 512.329200 |
train.describe(include=['O'])  # 'O' selects object-dtype (string/categorical) columns
Name | Sex | Ticket | Cabin | Embarked | |
---|---|---|---|---|---|
count | 891 | 891 | 891 | 204 | 889 |
unique | 891 | 2 | 681 | 147 | 3 |
top | Hippach, Mrs. Louis Albert (Ida Sophia Fischer) | male | 1601 | C23 C25 C27 | S |
freq | 1 | 577 | 7 | 4 | 644 |
The target feature: Survived
survive_num = train.Survived.value_counts()
survive_num.plot.pie(explode=[0,0.1],autopct='%1.1f%%',labels=['died','survived'],shadow=True)
plt.show()
x=[0,1]
plt.bar(x,survive_num,width=0.35)
plt.xticks(x,('died','survived'))
plt.show()
Feature analysis
num_f = [f for f in train.columns if train.dtypes[f] != 'object']
cat_f = [f for f in train.columns if train.dtypes[f]=='object']
print('there are %d numerical features:'%len(num_f),num_f)
print('there are %d category features:'%len(cat_f),cat_f)
there are 7 numerical features: ['PassengerId', 'Survived', 'Pclass', 'Age', 'SibSp', 'Parch', 'Fare']
there are 5 category features: ['Name', 'Sex', 'Ticket', 'Cabin', 'Embarked']
Feature types:
- Numerical
- Categorical: ordinal (orderable) / nominal (not orderable)
- Nominal categorical features: Sex, Embarked
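The numerical/categorical split above can also be done without a manual loop over dtypes. The following is a minimal sketch on a small made-up frame (the `sample` data is hypothetical, not the real training set), using `select_dtypes`:

```python
import pandas as pd

# Tiny hypothetical frame that mimics a few Titanic columns.
sample = pd.DataFrame({
    "Survived": [0, 1, 1],
    "Age": [22.0, 38.0, 26.0],
    "Sex": ["male", "female", "female"],
    "Embarked": ["S", "C", "S"],
})

# select_dtypes splits the columns by dtype in one call each.
num_cols = sample.select_dtypes(include="number").columns.tolist()
cat_cols = sample.select_dtypes(exclude="number").columns.tolist()

print("numerical:", num_cols)
print("categorical:", cat_cols)
```

On the real data the same two calls on `train` should give the same two lists that the list comprehensions produce.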
Categorical features
Sex
train.groupby(['Sex'])['Survived'].count()
Sex
female 314
male 577
Name: Survived, dtype: int64
f,ax = plt.subplots(figsize=(8,6))
fig = sns.countplot(x='Sex',hue='Survived',data=train)
fig.set_title('Sex:Survived vs Dead')
plt.show()
train.groupby(['Sex'])['Survived'].sum()/train.groupby(['Sex'])['Survived'].count()
Sex
female 0.742038
male 0.188908
Name: Survived, dtype: float64
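There were far more men than women on board, but the survival rate for women was about 74%, far above the roughly 19% for men. Sex is clearly an important feature.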
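Because Survived is coded 0/1, the ratio of `sum()` to `count()` used above is simply the group mean. A small sketch on made-up rows (the `sample` frame is hypothetical) shows that both forms agree:

```python
import pandas as pd

# Hypothetical rows: three men (one survived) and two women (both survived).
sample = pd.DataFrame({
    "Sex": ["male", "male", "male", "female", "female"],
    "Survived": [0, 0, 1, 1, 1],
})

by_sex = sample.groupby("Sex")["Survived"]

# Long form, as in the notebook: survivors divided by passengers.
rate_long = by_sex.sum() / by_sex.count()

# Short form: the mean of a 0/1 column is the survival rate.
rate_short = by_sex.mean()

print(rate_short)
```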
Embarked
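# factorplot (renamed catplot in seaborn 0.9) drew a point plot by default; pointplot is the axes-level equivalent
sns.pointplot(x='Embarked',y='Survived',data=train)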
plt.show()
f,ax = plt.subplots(1,3,figsize=(24,6))
sns.countplot(x='Embarked',data=train,ax=ax[0])
ax[0].set_title('No. Of Passengers Boarded')
sns.countplot(x='Embarked',hue='Survived',data=train,ax=ax[1])
ax[1].set_title('Embarked vs Survived')
sns.countplot(x='Embarked',hue='Pclass',data=train,ax=ax[2])
ax[2].set_title('Embarked vs Pclass')
#plt.subplots_adjust(wspace=0.2,hspace=0.5)
plt.show()
#pd.pivot_table(train,index='Embarked',columns='Pclass',values='Fare')
sns.boxplot(x='Embarked',y='Fare',hue='Pclass',data=train)
plt.show()
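The plots show that most passengers boarded at port S. Most of them were in class 3, although S also had the largest number of class 1 passengers of the three ports. Port C has the highest survival rate, 0.55, because a larger share of its passengers were in class 1, while the vast majority of passengers from port Q were in class 3. Fares for classes 1 and 2 are higher on average at port C, which may hint that passengers boarding there had higher social status. Logically, though, the port of embarkation itself should not affect survival, so this feature can be converted to dummy variables or dropped.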
Pclass
train.groupby('Pclass')['Survived'].value_counts()
Pclass Survived
1 1 136
0 80
2 0 97
1 87
3 0 372
1 119
Name: Survived, dtype: int64
plt.subplots(figsize=(8,6))
sns.countplot(x='Pclass',hue='Survived',data=train)
plt.show()
# pointplot is the axes-level plot that factorplot drew by default
sns.pointplot(x='Pclass',y='Survived',hue='Sex',data=train)
plt.show()
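Survival rates in classes 1 and 2 are clearly higher: more than half of class 1 survived (136 of 216), class 2 was close to even (87 of 184), and for women in classes 1 and 2 the rate is close to 1. Passenger class therefore has a strong effect on survival.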
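The class-by-sex pattern read off the plot can also be tabulated directly. Here is a minimal sketch with `pivot_table` on a made-up mini-sample (the `sample` values are hypothetical and do not reproduce the real rates):

```python
import pandas as pd

# Hypothetical mini-sample; the real analysis would use the full training set.
sample = pd.DataFrame({
    "Pclass":   [1, 1, 1, 3, 3, 3, 3],
    "Sex":      ["female", "female", "male", "female", "female", "male", "male"],
    "Survived": [1, 1, 0, 1, 0, 0, 0],
})

# The mean of a 0/1 column is the survival rate for each (Pclass, Sex) cell.
rate = sample.pivot_table(values="Survived", index="Pclass",
                          columns="Sex", aggfunc="mean")
print(rate)
```

Applied to `train`, the same call would give one survival rate per class and sex, which is easier to quote than values estimated from a chart.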
SibSp
train[["SibSp", "Survived"]].groupby(['SibSp'], as_index=False).mean().sort_values(by='Survived', ascending=False)
SibSp | Survived | |
---|---|---|
1 | 1 | 0.535885 |
2 | 2 | 0.464286 |
0 | 0 | 0.345395 |
3 | 3 | 0.250000 |
4 | 4 | 0.166667 |
5 | 5 | 0.000000 |
6 | 8 | 0.000000 |
sns.pointplot(x='SibSp',y='Survived',data=train)
plt.show()
#pd.pivot_table(train,values='Survived',index='SibSp',columns='Pclass')
sns.countplot(x='SibSp',hue='Pclass',data=train)
plt.show()
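With no siblings or spouse aboard, the survival rate is about 0.35. It is highest, above 0.5, with exactly one, possibly because a larger share of those passengers were in classes 1 and 2. After that it falls as the count grows, most likely because passengers with more than 3 were mainly in class 3, where survival for such groups was very low.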
Parch
#pd.pivot_table(train,values='Survived',index='Parch',columns='Pclass')
sns.countplot(x='Parch',hue='Pclass',data=train)
plt.show()
sns.pointplot(x='Parch',y='Survived',data=train)
plt.show()
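The trend is similar to SibSp: passengers travelling alone have a lower survival rate, those with 1 to 3 parents/children have a higher one, and beyond that it drops quickly because most of those passengers were in class 3.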
Age
train.groupby('Survived')['Age'].describe()
Survived | count | mean | std | min | 25% | 50% | 75% | max |
---|---|---|---|---|---|---|---|---|
0 | 424.0 | 30.626179 | 14.172110 | 1.00 | 21.0 | 28.0 | 39.0 | 74.0 |
1 | 290.0 | 28.343690 | 14.950952 | 0.42 | 19.0 | 28.0 | 36.0 | 80.0 |
f,ax = plt.subplots(1,2,figsize=(16,6))
sns.violinplot(x='Pclass',y='Age',hue='Survived',data=train,split=True,ax=ax[0])
ax[0].set_title('Pclass Age & Survived')
sns.violinplot(x='Sex',y='Age',hue='Survived',data=train,split=True,ax=ax[1])
ax[1].set_title('Sex Age & Survived')
plt.show()
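In class 1 the survivors are younger overall than those who died, yet they span a wide age range, with survival especially high between 20 and 50, which may be because class 1 passengers were older overall. In classes 2 and 3, survival rises noticeably for children around age 10. The same holds for males: boys show a clear increase. Female survivors are concentrated among young and middle-aged adults. Deaths are most numerous among adults aged roughly 20 to 40.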
Name
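The main use of Name is that it helps identify sex and lets us fill in missing ages from passengers who share the same title.
# use a regular expression to pull out the title in each name, which hints at age
def getTitle(data):
    name_sal = []
    # iterate over the values so the function works for any index
    for full_name in data['Name']:
        name_sal.append(re.findall(r'.\w*\.', full_name))
    Salut = []
    for found in name_sal:
        # turn the list of matches into a plain string and strip quotes, dots and spaces
        name = str(found)
        name = name[1:-1].replace("'","")
        name = name.replace(".","").strip()
        name = name.replace(" ","")
        Salut.append(name)
    data['Title'] = Salut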
getTitle(train)
train.head(2)
PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | Title | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S | Mr |
1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th… | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C | Mrs |
pd.crosstab(train['Title'],train['Sex'])
Sex | female | male |
---|---|---|
Title | ||
Capt | 0 | 1 |
Col | 0 | 2 |
Countess | 1 | 0 |
Don | 0 | 1 |
Dr | 1 | 6 |
Jonkheer | 0 | 1 |
Lady | 1 | 0 |
Major | 0 | 2 |
Master | 0 | 40 |
Miss | 182 | 0 |
Mlle | 2 | 0 |
Mme | 1 | 0 |
Mr | 0 | 517 |
Mrs | 124 | 0 |
Mrs,L | 1 | 0 |
Ms | 1 | 0 |
Rev | 0 | 6 |
Sir | 0 | 1 |
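A quick English refresher on the titles: Mme is used for married "upper-class" women from non-English-speaking countries and for professional women, equivalent to Mrs; Jonkheer: squire; Capt: captain; Lady: a noblewoman; Don: a Spanish honorific for nobles and men of standing; the Countess: a countess; Ms (or Mz): a woman whose marital status is unspecified; Col: colonel; Major: major; Mlle: miss; Rev: reverend (clergyman).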
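The `Mrs,L` row in the cross-tab appears to come from the regular expression matching every word followed by a period, including a later initial. As an alternative sketch (not the method used in this post), `Series.str.extract` keeps only the first such word. The last name in the sample below is hypothetical and only there to show the initial case:

```python
import pandas as pd

names = pd.Series([
    "Braund, Mr. Owen Harris",
    "Heikkinen, Miss. Laina",
    "Futrelle, Mrs. Jacques Heath (Lily May Peel)",
    "Doe, Mrs. Jane L. (Hypothetical)",  # made-up name with a trailing initial
])

# Capture the first run of letters that follows a space and ends with a period.
titles = names.str.extract(r" ([A-Za-z]+)\.", expand=False)

print(titles.tolist())
```

With this pattern the extra `replace('Mrs,L','Mrs')` clean-up step would no longer be needed, although the printed cross-tabs in this post reflect the original function.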
Fare
train.groupby('Pclass')['Fare'].mean()
Pclass
1 84.154687
2 20.662183
3 13.675550
Name: Fare, dtype: float64
sns.distplot(train['Fare'].dropna())
plt.xlim((0,200))
plt.xticks(np.arange(0,200,10))
plt.show()
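Summary of the initial analysis:
- Women have a clearly higher survival rate than men
- First class has a high survival rate and third class a low one; for women in classes 1 and 2 it is close to 1
- Children around age 10 show a clear increase in survival
- SibSp and Parch behave similarly: passengers travelling alone have a lower survival rate, those with 1-2 SibSp or 1-3 Parch a higher one, and beyond that the rate drops sharply
- Name and Age can be processed across the whole data set: extract the title from Name, then fill missing Age values with the per-title mean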
Data processing
# merge the training and test sets
passID = test['PassengerId']
all_data = pd.concat([train,test],keys=["train","test"])
all_data.shape
#all_data.head()
(1309, 13)
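Since the frames are concatenated with `keys=["train","test"]`, each part can later be recovered by its key instead of by row count. A minimal sketch on two tiny made-up frames (`train_s` and `test_s` are hypothetical stand-ins):

```python
import pandas as pd

# Hypothetical stand-ins for the real train and test frames.
train_s = pd.DataFrame({"Age": [22.0, 38.0], "Survived": [0, 1]})
test_s = pd.DataFrame({"Age": [34.5]})

# keys= adds an outer index level that remembers where each row came from.
combined = pd.concat([train_s, test_s], keys=["train", "test"])

# Select each part back by its key; test rows have no Survived value.
train_part = combined.loc["train"]
test_part = combined.loc["test"]

print(combined)
```

`all_data.loc['train']` would select the same rows as the `all_data[:train.shape[0]]` slices used later in this post.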
# count missing values
NAs = pd.concat([train.isnull().sum(),train.isnull().sum()/train.isnull().count(),test.isnull().sum(),test.isnull().sum()/test.isnull().count()],axis=1,keys=["train","percent_train","test","percent"])
NAs[NAs.sum(axis=1)>1].sort_values(by="percent",ascending=False)
train | percent_train | test | percent | |
---|---|---|---|---|
Cabin | 687 | 0.771044 | 327.0 | 0.782297 |
Age | 177 | 0.198653 | 86.0 | 0.205742 |
Fare | 0 | 0.000000 | 1.0 | 0.002392 |
Embarked | 2 | 0.002245 | 0.0 | 0.000000 |
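The missing-value table can be built a little more compactly, because the mean of a boolean mask is the fraction of missing entries. A sketch on a small made-up frame (the `sample` values are hypothetical):

```python
import numpy as np
import pandas as pd

# Hypothetical frame with gaps in Age and Cabin.
sample = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, np.nan],
    "Cabin": [np.nan, "C85", np.nan, np.nan],
    "Fare": [7.25, 71.2833, 7.925, 53.1],
})

# isnull().sum() counts the gaps; isnull().mean() gives their share per column.
missing = pd.DataFrame({
    "count": sample.isnull().sum(),
    "percent": sample.isnull().mean(),
})

# Keep only columns that actually have gaps, worst first.
missing = missing[missing["count"] > 0].sort_values("percent", ascending=False)
print(missing)
```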
# drop features that carry no useful information
all_data.drop(['PassengerId','Cabin'],axis=1,inplace=True)
all_data.head(2)
Age | Embarked | Fare | Name | Parch | Pclass | Sex | SibSp | Survived | Ticket | Title | ||
---|---|---|---|---|---|---|---|---|---|---|---|---|
train | 0 | 22.0 | S | 7.2500 | Braund, Mr. Owen Harris | 0 | 3 | male | 1 | 0.0 | A/5 21171 | Mr |
1 | 38.0 | C | 71.2833 | Cumings, Mrs. John Bradley (Florence Briggs Th… | 0 | 1 | female | 1 | 1.0 | PC 17599 | Mrs |
Handling Age
# first extract the title from Name
getTitle(all_data)
pd.crosstab(all_data['Title'], all_data['Sex'])
Sex | female | male |
---|---|---|
Title | ||
Capt | 0 | 1 |
Col | 0 | 4 |
Countess | 1 | 0 |
Don | 0 | 1 |
Dona | 1 | 0 |
Dr | 1 | 7 |
Jonkheer | 0 | 1 |
Lady | 1 | 0 |
Major | 0 | 2 |
Master | 0 | 61 |
Miss | 260 | 0 |
Mlle | 2 | 0 |
Mme | 1 | 0 |
Mr | 0 | 757 |
Mrs | 196 | 0 |
Mrs,L | 1 | 0 |
Ms | 2 | 0 |
Rev | 0 | 8 |
Sir | 0 | 1 |
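# note: 'Dr' is folded into 'Mrs' here, although the cross-tab above lists 7 of the 8 doctors as male
all_data['Title'] = all_data['Title'].replace(
    ['Lady','Dr','Dona','Mme','Countess'],'Mrs')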
all_data['Title'] =all_data['Title'].replace('Mlle','Miss')
all_data['Title'] =all_data['Title'].replace('Mrs,L','Mrs')
all_data['Title'] = all_data['Title'].replace('Ms', 'Miss')
#all_data['Title'] = all_data['Title'].replace('Mme', 'Mrs')
all_data['Title'] = all_data['Title'].replace(['Capt','Col','Don','Major','Rev','Jonkheer','Sir'],'Mr')
'''
all_data['Title'] = all_data.Title.replace({'Mlle':'Miss','Mme':'Mrs','Ms':'Miss','Dr':'Mrs',
'Major':'Mr','Lady':'Mrs','Countess':'Mrs',
'Jonkheer':'Mr','Col':'Mr','Rev':'Mr',
'Capt':'Mr','Sir':'Mr','Don':'Mr','Mrs,L':'Mrs'})
'''
all_data.Title.isnull().sum()
0
all_data[:train.shape[0]].groupby('Title')['Age'].mean()
Title
Master 4.574167
Miss 21.845638
Mr 32.891990
Mrs 36.188034
Name: Age, dtype: float64
# fill missing ages with the mean age of the same title in the training set
all_data.loc[(all_data.Age.isnull()) & (all_data.Title=='Mr'),'Age']=32
all_data.loc[(all_data.Age.isnull())&(all_data.Title=='Mrs'),'Age']=36
all_data.loc[(all_data.Age.isnull())&(all_data.Title=='Master'),'Age']=5
all_data.loc[(all_data.Age.isnull())&(all_data.Title=='Miss'),'Age']=22
#all_data.loc[(all_data.Age.isnull())&(all_data.Title=='other'),'Age']=46
all_data.Age.isnull().sum()
0
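The four hard-coded fills can be expressed as one step with `groupby(...).transform("mean")`. This is only a sketch on made-up rows (the `sample` frame is hypothetical), and unlike the code above it takes the means from all rows of the sample rather than from the training rows only, and it does not round them:

```python
import numpy as np
import pandas as pd

# Hypothetical rows with two missing ages.
sample = pd.DataFrame({
    "Title": ["Mr", "Mr", "Mr", "Miss", "Miss", "Master"],
    "Age": [30.0, 34.0, np.nan, 20.0, np.nan, 4.0],
})

# transform("mean") returns, for every row, the mean age of that row's title.
title_mean = sample.groupby("Title")["Age"].transform("mean")

# Fill only the gaps; known ages are left untouched.
sample["Age_filled"] = sample["Age"].fillna(title_mean)
print(sample)
```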
all_data[:train.shape[0]][['Title', 'Survived']].groupby(['Title'], as_index=False).mean()
Title | Survived | |
---|---|---|
0 | Master | 0.575000 |
1 | Miss | 0.702703 |
2 | Mr | 0.158192 |
3 | Mrs | 0.777778 |
f,ax = plt.subplots(1,2,figsize=(16,6))
sns.distplot(all_data[:train.shape[0]].loc[all_data[:train.shape[0]].Sex=='female','Age'],color='red',ax=ax[0])
sns.distplot(all_data[:train.shape[0]].loc[all_data[:train.shape[0]].Sex=='male','Age'],color='blue',ax=ax[0])
sns.distplot(all_data[:train.shape[0]].loc[all_data[:train.shape[0]].Survived==0,'Age' ],
color='red', label='Not Survived', ax=ax[1])
sns.distplot(all_data[:train.shape[0]].loc[all_data[:train.shape[0]].Survived==1,'Age' ],
color='blue', label='Survived', ax=ax[1])
plt.legend(loc='best')
plt.show()
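- Children under about 16 have a higher survival rate, and the oldest passenger (age 80) survived
- A large number of passengers aged 16 to 40 did not survive
- Most passengers were between 16 and 40
- To help classification, bin age into bands as a new feature and also add a child feature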
add isChild
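def male_female_child(passenger):
    # unpack age and sex
    age,sex = passenger
    # flag children as their own category
    if age < 16:
        return 'child'
    else:
        return sex
# create the new feature
all_data['person'] = all_data[['Age','Sex']].apply(male_female_child,axis=1)
# ages run from 0 to 80; split them into 3 bands: children, young and middle-aged adults, older adults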
all_data['Age_band']=0
all_data.loc[all_data['Age']<=16,'Age_band']=0
all_data.loc[(all_data['Age']>16)&(all_data['Age']<=40),'Age_band']=1
all_data.loc[all_data['Age']>40,'Age_band']=2
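The three `.loc` assignments above implement right-closed bins at 16 and 40. As a sketch of the same banding in a single call, `pd.cut` can be used on a few illustrative ages (the `ages` values are made up):

```python
import numpy as np
import pandas as pd

# Illustrative ages, including the two boundary values 16 and 40.
ages = pd.Series([5.0, 16.0, 17.0, 40.0, 41.0, 80.0])

# Bins are right-closed by default: (-inf, 16], (16, 40], (40, inf).
age_band = pd.cut(ages, bins=[-np.inf, 16, 40, np.inf], labels=[0, 1, 2]).astype(int)

print(age_band.tolist())
```

The boundary ages 16 and 40 fall into bands 0 and 1, matching the `<=` comparisons in the original code.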
Handling Name
df = pd.get_dummies(all_data['Title'],prefix='Title')
all_data = pd.concat([all_data,df],axis=1)
all_data.drop('Title',axis=1,inplace=True)
#drop name
all_data.drop('Name',axis=1,inplace=True)
fillna Embarked
all_data.loc[all_data.Embarked.isnull()]
Age | Embarked | Fare | Parch | Pclass | Sex | SibSp | Survived | Ticket | Title | person | Age_band | ||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
train | 61 | 38.0 | NaN | 80.0 | 0 | 1 | female | 0 | 1.0 | 113572 | 2 | female | 1 |
829 | 62.0 | NaN | 80.0 | 0 | 1 | female | 0 | 1.0 | 113572 | 3 | female | 2 |
With a fare of 80 in first class, these passengers most likely boarded at port C.
all_data['Embarked'] = all_data['Embarked'].fillna('C')
all_data.Embarked.isnull().any()
False
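One way to check a guess like this is to compare the known fare with the typical first-class fare at each port. The sketch below uses made-up fares (the `sample` frame is hypothetical, so its numbers say nothing about the real data) and picks the port whose median is closest to 80:

```python
import pandas as pd

# Hypothetical fares; the real check would use the combined data set.
sample = pd.DataFrame({
    "Embarked": ["C", "C", "S", "S", "Q"],
    "Pclass": [1, 1, 1, 1, 3],
    "Fare": [80.0, 90.0, 50.0, 60.0, 8.0],
})

# Median first-class fare per port.
median_fare = sample[sample["Pclass"] == 1].groupby("Embarked")["Fare"].median()

# Port whose typical first-class fare is nearest to the known fare of 80.
closest_port = (median_fare - 80).abs().idxmin()
print(median_fare)
print(closest_port)
```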
embark_dummy = pd.get_dummies(all_data.Embarked)
all_data = pd.concat([all_data,embark_dummy],axis=1)
all_data.head(2)
Age | Embarked | Fare | Parch | Pclass | Sex | SibSp | Survived | Ticket | person | Age_band | Title_Master | Title_Miss | Title_Mr | Title_Mrs | C | Q | S | ||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
train | 0 | 22.0 | S | 7.2500 | 0 | 3 | male | 1 | 0.0 | A/5 21171 | male | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 1 |
1 | 38.0 | C | 71.2833 | 0 | 1 | female | 1 | 1.0 | PC 17599 | female | 1 | 0 | 0 | 0 | 1 | 1 | 0 | 0 |
add SibSp and Parch
# create two new features, Family_size and alone
all_data['Family_size'] = all_data['SibSp']+all_data['Parch']  # total number of relatives aboard
all_data['alone'] = 0  # not travelling alone
all_data.loc[all_data.Family_size==0,'alone']=1  # travelling alone
f,ax=plt.subplots(1,2,figsize=(16,6))
# pointplot is the axes-level plot that factorplot drew by default, so it can draw on ax directly
sns.pointplot(x='Family_size',y='Survived',data=all_data[:train.shape[0]],ax=ax[0])
ax[0].set_title('Family_size vs Survived')
sns.pointplot(x='alone',y='Survived',data=all_data[:train.shape[0]],ax=ax[1])
ax[1].set_title('alone vs Survived')
plt.show()
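Passengers travelling alone have a low survival rate, around 0.3. With 1 to 3 family members the rate rises, but beyond 4 it drops sharply again.
# then bin family size into groups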
all_data['Family_size'] = np.where(all_data['Family_size']==0, 'solo',
np.where(all_data['Family_size']<=3, 'normal', 'big'))
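# factorplot was renamed catplot in seaborn 0.9; kind='point' keeps the old default plot type
sns.catplot(x='alone',y='Survived',hue='Sex',col='Pclass',kind='point',data=all_data[:train.shape[0]])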
plt.show()
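For women in classes 1 and 2, travelling alone makes little difference to survival, but for women in class 3 the survival rate is actually higher when they travel alone.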
all_data['poor_girl'] = 0
all_data.loc[(all_data['Sex']=='female')&(all_data['Pclass']==3)&(all_data['alone']==1),'poor_girl']=1
Filling and binning the continuous variable Fare
# fill in the missing values
all_data.loc[(all_data.Fare.isnull()) & (all_data.Pclass==1),'Fare']=84
all_data.loc[(all_data.Fare.isnull()) & (all_data.Pclass==2),'Fare']=21
all_data.loc[(all_data.Fare.isnull()) & (all_data.Pclass==3),'Fare']=14
sns.distplot(all_data[:train.shape[0]].loc[all_data[:train.shape[0]].Survived==0,'Fare' ],
color='red', label='Not Survived')
sns.distplot(all_data[:train.shape[0]].loc[all_data[:train.shape[0]].Survived==1,'Fare' ],
color='blue', label='Survived')
plt.xlim((0,100))
(0, 100)
sns.lmplot(x='Fare',y='Survived',data=all_data[:train.shape[0]])
plt.show()
# split Fare into 3 equal-frequency bins and take the mean survival rate of each
all_data['Fare_band'] = pd.qcut(all_data['Fare'],3)
all_data[:train.shape[0]].groupby('Fare_band')['Survived'].mean()
Fare_band
(-0.001, 8.662] 0.198052
(8.662, 26.0] 0.402778
(26.0, 512.329] 0.559322
Name: Survived, dtype: float64
# discretize the continuous variable Fare into bins
all_data['Fare_cut'] = 0
all_data.loc[all_data['Fare']<=8.662,'Fare_cut'] = 0
all_data.loc[((all_data['Fare']>8.662) & (all_data['Fare']<=26)),'Fare_cut'] = 1
#all_data.loc[((all_data['Fare']>14.454) & (all_data['Fare']<=31.275)),'Fare_cut'] = 2
all_data.loc[((all_data['Fare']>26) & (all_data['Fare']<513)),'Fare_cut'] = 2
sns.pointplot(x='Fare_cut',y='Survived',hue='Sex',data=all_data[:train.shape[0]])
plt.show()
As the fare rises the survival rate increases, most noticeably for men.
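The Fare_cut codes above were written by copying the bin edges printed by `pd.qcut`. As a sketch of an alternative, `qcut` can return the integer codes and the edges directly, which avoids hard-coding the thresholds. The `fares` values below are illustrative only:

```python
import pandas as pd

# Illustrative fares, sorted for readability.
fares = pd.Series([7.25, 7.925, 8.05, 13.0, 26.0, 30.0, 53.1, 71.2833, 512.3292])

# labels=False returns the bin number (0, 1, 2) for each fare;
# retbins=True also returns the quantile edges that were used.
codes, edges = pd.qcut(fares, 3, labels=False, retbins=True)

print(codes.tolist())
print(edges)
```

Each bin receives an equal share of the values, so the thresholds adapt to the data they are computed on.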
# create a feature that flags rich men
all_data['rich_man'] = 0
all_data.loc[((all_data['Fare']>=80) & (all_data['Sex']=='male')),'rich_man'] = 1
Encoding categorical features as numbers
all_data.head()
Age | Embarked | Fare | Parch | Pclass | Sex | SibSp | Survived | Ticket | person | … | Title_Mrs | C | Q | S | Family_size | alone | poor_girl | Fare_band | Fare_cut | rich_man | ||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
train | 0 | 22.0 | S | 7.2500 | 0 | 3 | male | 1 | 0.0 | A/5 21171 | male | … | 0 | 0 | 0 | 1 | normal | 0 | 0 | (-0.001, 8.662] | 0 | 0 |
1 | 38.0 | C | 71.2833 | 0 | 1 | female | 1 | 1.0 | PC 17599 | female | … | 1 | 1 | 0 | 0 | normal | 0 | 0 | (26.0, 512.329] | 2 | 0 | |
2 | 26.0 | S | 7.9250 | 0 | 3 | female | 0 | 1.0 | STON/O2. 3101282 | female | … | 0 | 0 | 0 | 1 | solo | 1 | 1 | (-0.001, 8.662] | 0 | 0 | |
3 | 35.0 | S | 53.1000 | 0 | 1 | female | 1 | 1.0 | 113803 | female | … | 1 | 0 | 0 | 1 | normal | 0 | 0 | (26.0, 512.329] | 2 | 0 | |
4 | 35.0 | S | 8.0500 | 0 | 3 | male | 0 | 0.0 | 373450 | male | … | 0 | 0 | 0 | 1 | solo | 1 | 0 | (-0.001, 8.662] | 0 | 0 |
5 rows × 24 columns
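Features to drop: Embarked (already one-hot encoded), Fare and Fare_band (replaced by Fare_cut), Sex (superseded by person), Age (replaced by Age_band), Ticket, one of the Embarked dummy columns (C in the code below), SibSp and Parch.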
'''
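Drop features that are no longer needed: Age, replaced by the Age_band bins;
Fare and Fare_band, replaced by the Fare_cut bins;
Ticket, which carries no useful information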
'''
#all_data.drop(['Age','Fare','Fare_band','Ticket'],axis=1,inplace=True)
#all_data.drop(['Age','Fare','Fare_band','Ticket','Embarked','C'],axis=1,inplace=True)
all_data.drop(['Age','Fare','Ticket','Embarked','C','Fare_band','SibSp','Parch'],axis=1,inplace=True)
all_data.head(2)
Pclass | Sex | Survived | person | Age_band | Title_Master | Title_Miss | Title_Mr | Title_Mrs | Q | S | Family_size | alone | poor_girl | Fare_cut | rich_man | ||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
train | 0 | 3 | male | 0.0 | male | 1 | 0 | 0 | 1 | 0 | 0 | 1 | normal | 0 | 0 | 0 | 0 |
1 | 1 | female | 1.0 | female | 1 | 0 | 0 | 0 | 1 | 0 | 0 | normal | 0 | 0 | 2 | 0 |
df1 = pd.get_dummies(all_data['Family_size'],prefix='Family_size')
df2 = pd.get_dummies(all_data['person'],prefix='person')
df3 = pd.get_dummies(all_data['Age_band'],prefix='age')
all_data = pd.concat([all_data,df1,df2,df3],axis=1)
all_data.head()
Pclass | Sex | Survived | person | Age_band | Title_Master | Title_Miss | Title_Mr | Title_Mrs | Q | … | rich_man | Family_size_big | Family_size_normal | Family_size_solo | person_child | person_female | person_male | age_0 | age_1 | age_2 | ||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
train | 0 | 3 | male | 0.0 | male | 1 | 0 | 0 | 1 | 0 | 0 | … | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 1 | 0 |
1 | 1 | female | 1.0 | female | 1 | 0 | 0 | 0 | 1 | 0 | … | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | |
2 | 3 | female | 1.0 | female | 1 | 0 | 1 | 0 | 0 | 0 | … | 0 | 0 |