Data Analysis: Predicting Employee Attrition with R (Part 1)
阿新 • Published: 2019-02-08
The data can be downloaded directly. All field names are in English; some of the fields are described below:
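Variable type | Variable | Description | Values |
---|---|---|---|
Outcome | Attrition | Whether the employee left | Yes, No |
Predictor | Age | Age | Numeric |
Predictor | BusinessTravel | Business travel | Non-Travel, Travel_Rarely, Travel_Frequently |
Predictor | Department | Department | Sales, Research & Development, Human Resources |
Predictor | DistanceFromHome | Distance from home to the company | Numeric |
Predictor | Education | Education level | 1 'Below College' 2 'College' 3 'Bachelor' 4 'Master' 5 'Doctor' |
Predictor | EducationField | Field of education | |
Predictor | EnvironmentSatisfaction | Satisfaction with the work environment | 1 'Low' 2 'Medium' 3 'High' 4 'Very High' |
Predictor | Gender | Gender | Male, Female |
Predictor | JobInvolvement | Job involvement | 1 'Low' 2 'Medium' 3 'High' 4 'Very High' |
Predictor | JobLevel | Job level | |
Predictor | JobRole | Job role | |
Predictor | JobSatisfaction | Job satisfaction | 1 'Low' 2 'Medium' 3 'High' 4 'Very High' |
Predictor | MaritalStatus | Marital status | Single, Married, Divorced |
Predictor | MonthlyIncome | Monthly income | Numeric |
Predictor | NumCompaniesWorked | Number of companies worked at | Numeric |
Predictor | OverTime | Whether the employee works overtime | Yes, No |
Predictor | PercentSalaryHike | Percentage salary increase | Numeric |
Predictor | PerformanceRating | Performance rating | 1 'Low' 2 'Good' 3 'Excellent' 4 'Outstanding' |
Predictor | RelationshipSatisfaction | Relationship satisfaction | 1 'Low' 2 'Medium' 3 'High' 4 'Very High' |
Predictor | StockOptionLevel | Employee stock option level | Numeric |
Predictor | TotalWorkingYears | Total years of work experience | Numeric |
Predictor | TrainingTimesLastYear | Number of trainings last year | Numeric |
Predictor | WorkLifeBalance | Work-life balance | 1 'Bad' 2 'Good' 3 'Better' 4 'Best' |
Predictor | YearsAtCompany | Years at the company | Numeric |
Predictor | YearsInCurrentRole | Years in the current role | Numeric |
Predictor | YearsSinceLastPromotion | Years since the last promotion | Numeric |
Predictor | YearsWithCurrManager | Years with the current manager | Numeric |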
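Several fields in the table (Education, the satisfaction scores, WorkLifeBalance) are stored as integer codes. The sketch below shows one way to attach the labels from the table using base R's `factor()`. The vector `education_codes` is made-up toy data, not values from the dataset, and the article itself does not perform this conversion.

```r
# Toy integer codes standing in for the Education column (hypothetical values).
education_codes <- c(1, 3, 3, 5, 2, 4)

# Labels as listed for Education in the field table.
education_labels <- c("Below College", "College", "Bachelor", "Master", "Doctor")

# ordered = TRUE keeps the natural ranking of the levels,
# so comparisons such as "Doctor" > "Bachelor" are meaningful.
education <- factor(education_codes,
                    levels = 1:5,
                    labels = education_labels,
                    ordered = TRUE)

print(table(education))
```

Whether to treat such columns as ordered factors or leave them numeric depends on the model used later; this is only an illustration of the mapping.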
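Reading the data
After reading the data, run summary() on it to inspect the variables.
(Note: set stringsAsFactors = TRUE when reading, because the data contains string columns that should be treated as factors.)
> attr.df <- read.csv("HR-Employee-Attrition.csv", header = TRUE, stringsAsFactors = TRUE)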
> summary(attr.df)
Age            Attrition  BusinessTravel          DailyRate       Department                  DistanceFromHome
Min.   :18.00  No :1233   Non-Travel       : 150  Min.   : 102.0  Human Resources       : 63  Min.   : 1.000
1st Qu.:30.00  Yes: 237   Travel_Frequently: 277  1st Qu.: 465.0  Research & Development:961  1st Qu.: 2.000
Median :36.00             Travel_Rarely    :1043  Median : 802.0  Sales                 :446  Median : 7.000
Mean   :36.92                                     Mean   : 802.5                              Mean   : 9.193
3rd Qu.:43.00                                     3rd Qu.:1157.0                              3rd Qu.:14.000
Max.   :60.00                                     Max.   :1499.0                              Max.   :29.000
Education      EducationField        EmployeeCount  EmployeeNumber  EnvironmentSatisfaction  Gender
Min.   :1.000  Human Resources : 27  Min.   :1      Min.   :   1.0  Min.   :1.000            Female:588
1st Qu.:2.000  Life Sciences   :606  1st Qu.:1      1st Qu.: 491.2  1st Qu.:2.000            Male  :882
Median :3.000  Marketing       :159  Median :1      Median :1020.5  Median :3.000
Mean   :2.913  Medical         :464  Mean   :1      Mean   :1024.9  Mean   :2.722
3rd Qu.:4.000  Other           : 82  3rd Qu.:1      3rd Qu.:1555.8  3rd Qu.:4.000
Max.   :5.000  Technical Degree:132  Max.   :1      Max.   :2068.0  Max.   :4.000
HourlyRate      JobInvolvement  JobLevel       JobRole                        JobSatisfaction  MaritalStatus
Min.   : 30.00  Min.   :1.00    Min.   :1.000  Sales Executive          :326  Min.   :1.000    Divorced:327
1st Qu.: 48.00  1st Qu.:2.00    1st Qu.:1.000  Research Scientist       :292  1st Qu.:2.000    Married :673
Median : 66.00  Median :3.00    Median :2.000  Laboratory Technician    :259  Median :3.000    Single  :470
Mean   : 65.89  Mean   :2.73    Mean   :2.064  Manufacturing Director   :145  Mean   :2.729
3rd Qu.: 83.75  3rd Qu.:3.00    3rd Qu.:3.000  Healthcare Representative:131  3rd Qu.:4.000
Max.   :100.00  Max.   :4.00    Max.   :5.000  Manager                  :102  Max.   :4.000
                                               (Other)                  :215
MonthlyIncome  MonthlyRate    NumCompaniesWorked  Over18  OverTime  PercentSalaryHike  PerformanceRating
Min.   : 1009  Min.   : 2094  Min.   :0.000       Y:1470  No :1054  Min.   :11.00      Min.   :3.000
1st Qu.: 2911  1st Qu.: 8047  1st Qu.:1.000               Yes: 416  1st Qu.:12.00      1st Qu.:3.000
Median : 4919  Median :14236  Median :2.000                         Median :14.00      Median :3.000
Mean   : 6503  Mean   :14313  Mean   :2.693                         Mean   :15.21      Mean   :3.154
3rd Qu.: 8379  3rd Qu.:20462  3rd Qu.:4.000                         3rd Qu.:18.00      3rd Qu.:3.000
Max.   :19999  Max.   :26999  Max.   :9.000                         Max.   :25.00      Max.   :4.000
RelationshipSatisfaction  StandardHours  StockOptionLevel  TotalWorkingYears  TrainingTimesLastYear  WorkLifeBalance
Min.   :1.000             Min.   :80     Min.   :0.0000    Min.   : 0.00      Min.   :0.000          Min.   :1.000
1st Qu.:2.000             1st Qu.:80     1st Qu.:0.0000    1st Qu.: 6.00      1st Qu.:2.000          1st Qu.:2.000
Median :3.000             Median :80     Median :1.0000    Median :10.00      Median :3.000          Median :3.000
Mean   :2.712             Mean   :80     Mean   :0.7939    Mean   :11.28      Mean   :2.799          Mean   :2.761
3rd Qu.:4.000             3rd Qu.:80     3rd Qu.:1.0000    3rd Qu.:15.00      3rd Qu.:3.000          3rd Qu.:3.000
Max.   :4.000             Max.   :80     Max.   :3.0000    Max.   :40.00      Max.   :6.000          Max.   :4.000
YearsAtCompany  YearsInCurrentRole  YearsSinceLastPromotion  YearsWithCurrManager
Min.   : 0.000  Min.   : 0.000      Min.   : 0.000           Min.   : 0.000
1st Qu.: 3.000  1st Qu.: 2.000      1st Qu.: 0.000           1st Qu.: 2.000
Median : 5.000  Median : 3.000      Median : 1.000           Median : 3.000
Mean   : 7.008  Mean   : 4.229      Mean   : 2.188           Mean   : 4.123
3rd Qu.: 9.000  3rd Qu.: 7.000      3rd Qu.: 3.000           3rd Qu.: 7.000
Max.   :40.000  Max.   :18.000      Max.   :15.000           Max.   :17.000
The dataset has 1,470 rows and 35 columns.
Attrition is the variable we are studying: it indicates whether the employee left.
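The note about stringsAsFactors and the row/column count can both be checked on a small scale. The sketch below reads a tiny inline CSV through the `text` argument of `read.csv()` instead of the real file; the rows are hypothetical and only the column names follow the dataset.

```r
# A tiny inline CSV with hypothetical rows; only the column names follow the dataset.
csv_text <- "Age,Attrition,OverTime
41,Yes,Yes
49,No,No
37,Yes,Yes
33,No,Yes"

as_factor <- read.csv(text = csv_text, header = TRUE, stringsAsFactors = TRUE)
as_char   <- read.csv(text = csv_text, header = TRUE, stringsAsFactors = FALSE)

# Number of rows and columns, as reported for the full data.
print(dim(as_factor))

# With factors, summary() reports a count per level (as in the output above);
# with plain character columns it only reports length and class.
print(class(as_factor$Attrition))
print(class(as_char$Attrition))
print(summary(as_factor$Attrition))
```

On the real data frame the equivalent checks would be `dim(attr.df)` and `str(attr.df)`.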
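From the summary above we can see that:
1. About 16% of employees left (237 of 1,470).
2. Mean monthly income is 6503 and the median is 4919. Because the mean is pulled up by high earners, the median is the better indicator of typical pay.
3. About 28% of employees work overtime (the OverTime field: 416 of 1,470).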
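The percentages above follow directly from the level counts printed by summary(). A minimal sketch of the arithmetic, using only those counts (the variable names are illustrative):

```r
# Counts copied from the summary() output above.
attrition_counts <- c(No = 1233, Yes = 237)
overtime_counts  <- c(No = 1054, Yes = 416)

# Share of each level, in percent, rounded to one decimal place.
attrition_pct <- round(100 * attrition_counts / sum(attrition_counts), 1)
overtime_pct  <- round(100 * overtime_counts / sum(overtime_counts), 1)

print(attrition_pct)
print(overtime_pct)
```

On the full data frame the same shares could be obtained with `prop.table(table(attr.df$Attrition))` and `prop.table(table(attr.df$OverTime))`.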
Data analysis and visualization
Next, let's look at how attrition relates to each variable:
> library(ggplot2)
> library(gridExtra)
> g1 <- ggplot(attr.df, aes(x=Age,fill=Attrition))+
+ geom_density(alpha = 0.7)
> g2 <- ggplot(attr.df, aes(x=DistanceFromHome, fill=Attrition))+
+ geom_density(alpha = 0.7)
> g3 <- ggplot(attr.df, aes(x=MonthlyIncome, fill=Attrition))+
+ geom_density(alpha = 0.7)
> g4 <- ggplot(attr.df, aes(x=NumCompaniesWorked, fill= Attrition))+
+ geom_density(alpha = 0.7)
> g5 <- ggplot(attr.df, aes(x=TotalWorkingYears, fill= Attrition))+
+ geom_density(alpha = 0.7)
> g6 <- ggplot(attr.df, aes(x=TrainingTimesLastYear, fill= Attrition))+
+ geom_density(alpha = 0.7)
> g7 <- ggplot(attr.df, aes(x=YearsAtCompany, fill= Attrition))+
+ geom_density(alpha = 0.7)
> g8 <- ggplot(attr.df, aes(x=YearsInCurrentRole, fill= Attrition))+
+ geom_density(alpha = 0.7)
> g9 <- ggplot(attr.df, aes(x=YearsWithCurrManager, fill= Attrition))+
+ geom_density(alpha = 0.7)
> grid.arrange(g1,g2,g3,g4,g5,g6,g7,g8,g9, ncol = 3, nrow = 3)
Here nine variables are selected, and a kernel density curve is drawn for each one, split by Attrition.
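To make clear what these curves represent, the sketch below computes a separate kernel density estimate for each Attrition group with base R's `density()`, which is roughly what a filled `geom_density()` does for each group. The data frame `toy` is randomly generated for illustration and is not drawn from the real dataset.

```r
set.seed(1)  # reproducible toy data

# Hypothetical ages for two groups; NOT taken from the real dataset.
toy <- data.frame(
  Age = c(rnorm(200, mean = 38, sd = 9), rnorm(50, mean = 32, sd = 8)),
  Attrition = factor(rep(c("No", "Yes"), times = c(200, 50)))
)

# One kernel density estimate per Attrition level.
dens <- lapply(split(toy$Age, toy$Attrition), density)

# Each curve integrates to roughly 1, regardless of how large the group is.
area <- sapply(dens, function(d) sum(d$y) * diff(d$x[1:2]))
print(round(area, 2))

# Location of each curve's peak (the mode of the estimate).
peak <- sapply(dens, function(d) d$x[which.max(d$y)])
print(round(peak, 1))
```

Because each group's curve is normalized separately, the plots compare the shape of the two distributions rather than the number of employees in each group.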
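From the plots we can see that:
1. By age, attrition peaks at around 30.
2. By distance from home, employees who live more than 10 miles away are more likely to leave.
3. Lower-income employees are more likely to leave.
4. Employees who have worked at more than 5 companies are more likely to leave.
5. Employees with fewer than 5 years of total working experience have a higher attrition rate.
A possible reason is that younger employees are more inclined to try different things and are relatively unsure about their future goals. The high turnover also means that such employees are unlikely to form a lasting identification with the company's values in a short time.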
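Observations like the ones above can also be checked numerically by binning a variable and computing the share of leavers in each bin. The sketch below does this for total working years with `cut()` and `tapply()`. The rows in `toy` are hypothetical, so the resulting rates say nothing about the real data.

```r
# Hypothetical toy rows; only the column names follow the dataset.
toy <- data.frame(
  TotalWorkingYears = c(1, 2, 3, 4, 6, 8, 10, 12, 15, 20),
  Attrition = c("Yes", "Yes", "No", "Yes", "No", "No", "Yes", "No", "No", "No")
)

# Bin tenure into "under 5 years" and "5 years or more".
# right = FALSE makes the intervals [-Inf, 5) and [5, Inf).
toy$TenureBand <- cut(toy$TotalWorkingYears,
                      breaks = c(-Inf, 5, Inf),
                      labels = c("<5", ">=5"),
                      right = FALSE)

# Share of leavers within each band.
rate <- tapply(toy$Attrition == "Yes", toy$TenureBand, mean)
print(round(rate, 2))
```

Applied to `attr.df`, the same pattern would give an attrition rate per band that can be set against what the density plots suggest.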