Analyzing and Mining the Titanic Dataset with R (Part 2)
阿新 • Published: 2019-01-07
6. Exploring and Visualizing the Data
1) Start exploring the data visually with a bar plot of survival counts
> barplot(table(train.data$Survived), main = "passenger survival", names.arg = c("perished", "survived"))
2) Plot the distribution of passenger classes
> barplot(table(train.data$Pclass), main = "passenger class", names.arg = c("first", "second", "third"))
3) Show passenger gender with a bar plot
> barplot(table(train.data$Sex), main = "passenger gender", names.arg = c("F", "M"))
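Hard-coding the bar labels through names.arg only works while the label order matches the order of the table's levels. As a minimal sketch of an alternative, the codes can be turned into labeled factors so that table() carries the labels itself. The data frame `toy` below is made up for illustration and only stands in for train.data.

```r
# Toy data standing in for train.data (values are made up for illustration)
toy <- data.frame(
  Survived = c(0, 1, 1, 0, 0, 1),
  Pclass   = c(3, 1, 2, 3, 3, 1)
)

# Attach labels to the codes once, so every table and plot reuses them
toy$Survived <- factor(toy$Survived, levels = c(0, 1),
                       labels = c("perished", "survived"))
toy$Pclass <- factor(toy$Pclass, levels = c(1, 2, 3),
                     labels = c("first", "second", "third"))

survival.counts <- table(toy$Survived)
class.counts <- table(toy$Pclass)

# The bar names now come from the factor labels, so names.arg is not needed
barplot(survival.counts, main = "passenger survival")
barplot(class.counts, main = "passenger class")
```

With this approach a label can no longer end up on the wrong bar if a level is missing or sorted differently.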
4) Use hist to plot a histogram of passenger ages
> hist(train.data$Age, main = "passenger age", xlab = "Age")
5) Plot the number of siblings or spouses traveling with each passenger
> barplot(table(train.data$SibSp), main = "passenger SibSp")
6) Plot the number of parents or children traveling with each passenger
> barplot(table(train.data$Parch), main = "passenger parch")
7) Plot a histogram of passenger fares
> hist(train.data$Fare, main = "passenger fare", xlab = "Fare")
8) Plot the port of embarkation
> barplot(table(train.data$Embarked), main = "port of embarkation")
9) Use barplot to find out which gender was more likely to perish in the sinking
> counts = table(train.data$Survived, train.data$Sex)
> barplot(counts, col = c("darkblue", "red"), legend.text = c("Perished", "Survived"), main = "passenger survived by sex")
10) Check whether passenger class (Pclass) affects the chance of survival
> counts = table(train.data$Survived, train.data$Pclass)
> barplot(counts, col = c("darkblue", "red"), legend.text = c("Perished", "Survived"), main = "passenger survived by pclass")
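Stacked counts make it hard to compare survival between groups of different sizes. A minimal sketch of one way around this is to convert the same kind of contingency table into column proportions with prop.table. The data frame `toy` and its values are assumptions made up for illustration, not the Titanic data.

```r
# Toy data standing in for train.data (values are made up for illustration)
toy <- data.frame(
  Survived = c(0, 0, 0, 1, 1, 1, 1, 0),
  Sex      = c("male", "male", "male", "male",
               "female", "female", "female", "female")
)

counts <- table(toy$Survived, toy$Sex)

# margin = 2 normalizes each column, so the shares within each sex sum to 1
rates <- prop.table(counts, margin = 2)
print(rates)

# The survival rate per sex is the row where Survived == 1
survival.rate <- rates["1", ]
```

Passing `rates` to barplot instead of `counts` would draw bars of equal height, which makes the survived share directly comparable across groups.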
11) Examine the gender distribution within each passenger class
> counts = table(train.data$Sex, train.data$Pclass)
> barplot(counts, col = c("darkblue", "red"), legend.text = rownames(counts), main = "passenger Gender by pclass")
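Because gender and class are entangled, it can help to look at the survival rate for each combination of the two rather than one variable at a time. The following is a minimal sketch using tapply on a made-up data frame (`toy` is hypothetical and its values are not taken from the Titanic data).

```r
# Toy data standing in for train.data (values are made up for illustration)
toy <- data.frame(
  Survived = c(1, 1, 0, 1, 0, 0, 1, 0, 0, 0),
  Sex      = c("female", "female", "male", "female", "male",
               "male", "female", "male", "male", "female"),
  Pclass   = c(1, 1, 1, 2, 2, 2, 3, 3, 3, 3)
)

# The mean of a 0/1 column is the share of 1s, i.e. the survival rate per cell
rate.by.group <- tapply(toy$Survived, list(toy$Sex, toy$Pclass), mean)
print(rate.by.group)
```

Each cell of the result is the survival rate for one sex within one class, so an apparent class effect can be checked separately for women and men.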
12) Use overlaid histograms to compare the age distribution of passengers who perished (blue) and who survived (red)
> hist(train.data$Age[which(train.data$Survived == "0")], main = "Passenger Age Histogram", xlab = "Age", ylab = "count", col = "blue", breaks = seq(0, 80, by = 2))
> hist(train.data$Age[which(train.data$Survived == "1")], col = "red", add = TRUE, breaks = seq(0, 80, by = 2))
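When two histograms are drawn on top of each other with solid colors, the second one hides the first wherever the bars overlap. A minimal sketch of one workaround is to use semi-transparent fill colors. The age vectors below are made up for illustration, and the 10-year bin width is an arbitrary choice for the toy data.

```r
# Toy ages standing in for train.data$Age (values are made up for illustration)
age.perished <- c(22, 35, 28, 40, 19, 54, 31, 26)
age.survived <- c(4, 27, 38, 2, 30, 24)

bins <- seq(0, 80, by = 10)

# plot = FALSE computes the bin counts without drawing anything
h.perished <- hist(age.perished, breaks = bins, plot = FALSE)
h.survived <- hist(age.survived, breaks = bins, plot = FALSE)

# Semi-transparent fills keep both sets of bars visible where they overlap
blue <- rgb(0, 0, 1, alpha = 0.5)
red <- rgb(1, 0, 0, alpha = 0.5)
plot(h.perished, col = blue,
     main = "Passenger Age Histogram", xlab = "Age", ylab = "count")
plot(h.survived, col = red, add = TRUE)
legend("topright", legend = c("Perished", "Survived"), fill = c(blue, red))
```

Note that the y-axis range is set by the first histogram, so if the second group has taller bars a ylim argument would be needed on the first plot call.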
13) To see the relationship between age and survival in more detail, use boxplot to draw a box plot:
> boxplot(train.data$Age ~ train.data$Survived,
+ main = "passenger survival by age",
+ xlab = "survived",ylab = "age")
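The numbers behind a box plot can also be inspected directly. As a minimal sketch, fivenum returns the minimum, lower hinge, median, upper hinge, and maximum, and tapply applies it to each survival group. The data frame `toy` is made up for illustration.

```r
# Toy data standing in for train.data (values are made up for illustration)
toy <- data.frame(
  Age      = c(22, 35, 28, 40, 54, 4, 27, 38, 2, 30),
  Survived = c(0, 0, 0, 0, 0, 1, 1, 1, 1, 1)
)

# Five-number summary (min, lower hinge, median, upper hinge, max) per group
age.summary <- tapply(toy$Age, toy$Survived, fivenum)
print(age.summary)

# The third element of each summary is the median drawn inside the box
median.perished <- age.summary[["0"]][3]
median.survived <- age.summary[["1"]][3]
```

The hinges and median match the box itself, while the whiskers of boxplot may stop short of the minimum and maximum when there are outliers.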
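14) Split the passengers into age groups, for example children (under 13), youth (13 to 24), adults (25 to 64), and older passengers (65 and over), and compute the survival rate of each group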
> train.child = train.data$Survived[train.data$Age<13]
> length(train.child[which(train.child == 1)])/length(train.child)
[1] 0.57525
> train.youth = train.data$Survived[train.data$Age >= 13 & train.data$Age < 25]
> length(train.youth[which(train.youth == 1)])/length(train.youth)
[1] 0.408133
> train.adult = train.data$Survived[train.data$Age >= 25 & train.data$Age < 65]
> length(train.adult[which(train.adult == 1)])/length(train.adult)
[1] 0.3540925
> train.older = train.data$Survived[train.data$Age>=65]
> length(train.older[which(train.older == 1)])/length(train.older)
[1] 0.09090909
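The four group calculations above repeat the same pattern, and if Age still contains missing values, logical indexing returns NA entries that length() counts as well. A minimal sketch of a more compact alternative uses cut to assign each passenger to a group and tapply to compute the rate per group. The data frame `toy`, the column name AgeGroup, and the group labels are assumptions made up for illustration.

```r
# Toy data standing in for train.data (values are made up; one Age is missing)
toy <- data.frame(
  Age      = c(4, 10, 15, 22, 30, 45, 64, 70, NA, 12),
  Survived = c(1, 1, 0, 1, 0, 1, 0, 0, 1, 0)
)

# right = FALSE closes intervals on the left: [0,13), [13,25), [25,65), [65,Inf)
toy$AgeGroup <- cut(toy$Age,
                    breaks = c(0, 13, 25, 65, Inf),
                    labels = c("child", "youth", "adult", "older"),
                    right = FALSE)

# Rows with a missing Age get an NA group and are left out by tapply
group.rate <- tapply(toy$Survived, toy$AgeGroup, mean)
print(group.rate)
```

The breaks mirror the boundaries used in the code above (13, 25, and 65), so changing the grouping only requires editing one vector.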
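15) Analysis
Plot 1 shows that more passengers perished than survived.
Plot 2 shows that third class accounts for the largest share of passengers.
Plot 3 shows that there were more male than female passengers.
Plot 4 shows that most passengers were between 20 and 40 years old.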
Plot 5 shows that most passengers traveled without siblings or a spouse (SibSp = 0 is by far the tallest bar).
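Plot 6 shows that most passengers had between 0 and 2 parents or children on board.
Plot 7 shows a wide spread of fares, which hints at differences in cabin class.
Plot 8 shows that the ship picked up passengers at three ports.
Plot 9 shows that women were more likely to survive than men.
Plot 10 suggests that the higher the class, the greater the chance of survival, but is that really the whole story?
Plot 11 shows that most third-class passengers were male, which helps explain why the death rate in third class was somewhat higher.
Plot 12 shows survival across ages, but it does not clearly tell us how the chance of survival differs between age groups, nor which group was more likely to be rescued.
Plot 13 shows the age distribution of the passengers who perished and of those who survived.
The detailed age-group breakdown in step 14 shows that the younger the passenger, the higher the chance of survival.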