1. 程式人生 > 實用技巧 >Lending Club 貸款資料 - 探索性分析(附程式碼)

Lending Club 貸款資料 - 探索性分析(附程式碼)

《python風控建模實戰lendingClub》視訊教程

騰訊課堂入口:

https://ke.qq.com/course/3063950?tuin=dcbf0ba

網易雲課堂報名入口:

https://study.163.com/course/introduction.htm?courseId=1005988013&_trace_c_p_k2_=89165cf6a75a4945bf15d85adbe3d63c

(騰訊課堂新營業,報名可領取20元優惠券)

Introduction

Lending Clubis a peer-to-peer lending company that matches borrowers with investors through an online platform. It services people that need personal loans between $1,000 and $40,000. Borrowers receive the full amount of the issued loan minus the origination fee, which is paid to the company. Investors purchase notes backed by the personal loans and pay Lending Club a service fee. The company shares data about all loans issued through its platform during certain time periods.

This analysis will focus on the Lending ClubLoan Datafrom the first quarter of 2017. This document is generated usingR Markdown. The code that powers the analysis is hidden by default but you can expand any section by clicking theCodebutton, like the one in the top right corner of this section. So far I have loaded in the actual data file and the data dictionary. I also performed some minor formatting to prepare for the rest of the analysis.

We’ll start off by running some broad summary statistics and using this information to clean up the data set. Once the data are reasonably formatted we will move on to visualizing the relationships between the variables.

Lending Club是個人對個人的借貸公司,通過線上平臺將借貸者與投資者配對。它為需要1,000至40,000美元之間個人貸款的人們提供服務。借款人將獲得已發放貸款的全額減去向公司支付的啟動費。投資者購買由個人貸款支援的票據,並向Lending Club支付服務費。該公司共享在特定時間段內通過其平臺發行的所有貸款的資料。

該分析將重點關注2017年第一季度以來的Lending Club貸款資料。該文件使用R Markdown生成。預設情況下,用於分析的程式碼是隱藏的,但是您可以通過單擊“程式碼”按鈕展開任何部分,例如本部分右上角的程式碼。到目前為止,我已經載入了實際的資料檔案和資料字典。我還進行了一些次要格式化,以準備其餘的分析。

我們將從執行一些廣泛的摘要統計資訊開始,並使用此資訊來清理資料集。合理格式化資料後,我們將繼續視覺化變數之間的關係。

Broad Summary Statistics and Scrubbing

Before we start any analysis or data scrubbing, let’s join in thedata dicionaryso we can have a quick reference to what the variables actually mean:

廣泛彙總統計和清理

在開始任何分析或資料清理之前,讓我們加入資料分類,以便快速瞭解變數的實際含義:

# Extracting all data column names and joining to data dictionary
loan_data_cols <- data.frame(Variable = colnames(loan_data),
                             stringsAsFactors = F)
data_dict$var <- trimws(data_dict$var)
mapping <- loan_data_cols %>% left_join(data_dict, by = c("Variable" = "var"))
colnames(mapping)[2] <- "Full Description"
# Formatting into interactive HTML table
lib_load("DT")
mapping <- sapply(mapping, trimws)
datatable(mapping)


我們將從廣泛瞭解資料集中的不同變數型別開始。在刪除了一些空列和編校列之後,我們總共只剩下113個變數。有94個連續變數和22個分類變數。讓我們對連續變數執行一些摘要統計資訊:
We’ll start by taking a broad look at the different variable types in the data set. After dropping some empty and redacted columns we are left with 113 variables in total. There appear to be 94 continuous variables and 22 categorical variables. Let’s run some summary statistics on thecontinuous variables:

# Grabbing column types to see if they are categorical or contiunous
all_vars <- unlist(lapply(loan_data,class))
# Enters zero NAs for summary when there are none so the summary data structures can be combined
# Borrowed from: https://stackoverflow.com/questions/32011873/force-summary-to-report-the-number-of-nas-even-if-none
custom_summary <- function(var) {
  
  if(!any(is.na(var))) {
    res <- c(summary(var),"NA's"=0)
  } else {
    res <- summary(var)
  }
  return(res)
}
# Extracting continuous variables
cont_info <- lapply(loan_data[,all_vars == "numeric"], custom_summary) 
# Formatting summaries into uniform data structure and combining
cont_names <- names(all_vars[all_vars == "numeric"])
cont_info <- lapply(1:length(cont_info), function(inx) {
  
  new_vect <- c(cont_names[inx],round(cont_info[[inx]],2))
  
  names(new_vect)[1] <- "Var Name"
  
  new_vect
  
}); cont_info <- do.call(rbind,cont_info)
# Formatting into interactive HTML table
datatable(cont_info)

我們可以看到一些描述貸款的變數,例如金額、付款、利率和期限。我們還可以看到一些關於借款人的描述性資訊,如年收入、債務收入比(DTI)、抵押賬戶數量和總信貸限額。看起來dti、回收和回收費用有一些資料問題。後兩者在整個資料集中都是空白的。dti上的最小值和最大值似乎完全關閉。我們可以繪製各種dti範圍的直方圖,以獲得更合理的資料界限:

We see several variables that describe the loan such as the amount, payment, interest rate, and term. We also see some descriptive information on the borrower such as annual income, debt-to-income ratio (DTI), number of mortgage accounts, and total credit limit. It looks likedti,recoveries, andcollection_recovery_feehave some data issues. The latter two are blank throughout the entire data set. The minimum and maximum values ondtiseem completely off. We can plot the histograms of variousdtiranges to get more reasonable bounds for the data:

# First cleaning up the blank variables
loan_data[,c("recoveries","collection_recovery_fee")] = NULL
# Plotting the density of dti under various cut-offs
dti_raw <- loan_data$dti
lib_load("ggplot2")
dens1 <- qplot(dti_raw, fill = I("dodgerblue4"), 
               alpha = I(0.4), col = I("grey29")) + xlab("dti full range") + ylab("Count")
dens2 <- qplot(dti_raw[dti_raw > 0 & dti_raw < 15], fill = I("dodgerblue4"), 
               alpha = I(0.4), col = I("grey29")) + xlab("0 < dti < 15") + ylab("Count")
dens3 <- qplot(dti_raw[dti_raw > 0 & dti_raw < 50], fill = I("dodgerblue4"), 
               alpha = I(0.7), col = I("grey29")) + xlab("0 < dti < 50") + ylab("Count")
dens4 <- qplot(dti_raw[dti_raw > 50 & dti_raw < 9999], fill = I("dodgerblue4"), 
               alpha = I(0.4), col = I("grey29")) + xlab("50 < dti < 9999") + ylab("Count")
# Combining density plots
lib_load("gridExtra")
lib_load("grid")
# Subjectively clipping range of dti
grid.arrange(dens1, dens2, dens3, dens4,
             top = textGrob("DTI Histograms (30 bins)"), 
             widths = c(4,4), heights = c(4,4))

# Clipping the range for dti
loan_data$dti[loan_data$dti < 0 | loan_data$dti > 50] = NA

保持dti的完整範圍似乎沒有意義。下限應該始終為零,因為你不能有少於零的債務。上限有點值得商榷。似乎大部分密度是在0到50之間捕獲的,這似乎是合理的。我們可以主觀地刪除這個範圍之外的所有內容,並將這些條目轉換為缺失的值,在R語言中這些值表示為NA。這導致411個條目丟失。

接下來,讓我們看一下範疇變數。我們將為每個變數計算前四個類別的頻率,並將其他所有變數歸為第五個類別“其他”。如果少於四個類別,那麼我們將只顯示所有的計數。

Keeping the full range fordtidoesn’t seem to make sense. The lower bound should always be zero since you can’t have less than no debt. The upper bound is a bit debatable. It appears that the majority of the density is captured between 0 and 50, which seems to be reasonable. We can subjectively drop everything outside of this range and convert those entries to missing values, which are represented asNAin the R language. This results in 411 missing entries.

Next, let’s take a look at thecategorical variables. We’ll count the frequencies for the top four categories for each variable and lump everything else into a fifth category calledOther. If there are less than four categories then we’ll just show all of the counts.

# Extracting categorical variables
cat_info <- lapply(1:sum(all_vars == "character"), function(inx) {
  
  Category <- loan_data[,names(all_vars[all_vars == "character"])[inx]]
  
  # Getting frequency counts and sorting in decreasing order
  counts_df <- data.frame(table(Category)) %>% arrange(desc(Freq))
  counts_df$Category <- as.character(counts_df$Category)
  
  # Summarizing only top 4 counts and lumping everything into a fifth category, Other
  if(nrow(counts_df) > 5) {
    
   counts_df$Freq[5] <- sum(counts_df$Freq[5:nrow(counts_df)])
   counts_df$Category[5] <- "Other"
   
   counts_df <- counts_df[1:5,]
  } 
    
  df <- data.frame(Name = names(all_vars[all_vars == "character"])[inx],
                   counts_df, stringsAsFactors = F)
  df$`Freq %` <- round(100*df$Freq/sum(df$Freq))
  
  df
  
}) %>% bind_rows()
# Saving example of an incorrectly tagged NA
bad_row <- which(loan_data$emp_title == ".")
# Formatting into interactive HTML table
datatable(cat_info)

我們看到一些關於貸款的描述性資訊,比如期限、等級和用途。還有一些變數可以描述借款人,如就業頭銜、居住狀態和當前拖欠的賬戶數量。像earliest_cr_line這樣的列可以轉換為表示自該日期起的年份的整數,這對於以後的建模更有用。還有一些條目被標記為空格或點,它們實際上應該是NA。一個例子是emp_title的22990行,它被標記為句點。讓我們把這些清理乾淨。

We see some descriptive information on the loans such as the term, grade, and purpose. There are also some variables that describe the borrower such as employment title, state of residence, and number of accounts currently delinquent. Columns likeearliest_cr_linecan be converted to an integer that represents the years since that date, which would be more useful for modeling later. There are also entries that are tagged as empty spaces or dots that should really beNA. An example is row 22990 foremp_title, which is tagged as a period. Let’s clean these up.

# Fixing date columns
lib_load("lubridate")
lib_load("zoo")
# Earliest credit line is now the years since the given date
loan_data$earliest_cr_line <- difftime(as.yearmon(loan_data$issue_d, format = "%b-%Y"),
                                       as.yearmon(loan_data$earliest_cr_line, format = "%b-%Y"), 
                                       unit = "weeks")/52.25
loan_data$earliest_cr_line <- as.numeric(loan_data$earliest_cr_line)
loan_data$sec_app_earliest_cr_line <- difftime(as.yearmon(loan_data$issue_d, format = "%b-%Y"),
                                               as.yearmon(loan_data$sec_app_earliest_cr_line, format = "%b-%Y"), 
                                               unit = "weeks")/52.25
loan_data$sec_app_earliest_cr_line <- as.numeric(loan_data$sec_app_earliest_cr_line)
# Dropping columns that don't really add any info
loan_data[,c("pymnt_plan")] = NULL
# Coercing various entries to NA
loan_data[loan_data == "" | loan_data == "."] <- NA

  

Now that we have cleaned up the variables and correctly tagged missing values asNA, let’s take a look at the sparsity of various columns. Missing data can have strong impacts on predictive and inferential analysis. It’s important to understand any patterns in the sparsity, sometimes dropping incomplete observations can lead to a biased understanding of the data.

現在我們已經清理了變數,並正確地將缺失的值標記為NA,讓我們看看各種列的稀疏性。缺失的資料對預測和推理分析有很大的影響。理解稀疏性中的任何模式是很重要的,有時丟棄不完整的觀測值會導致對資料的理解有偏差。

# Summarizing variables with lots of NAs
sparse_count <- lapply(1:ncol(loan_data), function(inx) {
  temp <- loan_data[,inx]
  
  Variable = colnames(loan_data)[inx]
  
  `NA %` = round(sum(is.na(temp))/length(temp)*100,2)
  
  `Full Name` = data_dict$desc[which(Variable == data_dict$var)[1]]
  df <- data.frame(Variable,`NA %`,`Full Name`,
                   check.names = F,
                   stringsAsFactors = F)
  
  return(df)
}) %>% bind_rows() %>% arrange(desc(`NA %`))
# Grabbing any variable with at least one NA
sparse_count <- sparse_count[sparse_count$`NA %` > 0,]
datatable(sparse_count)

  

t looks like a lot of the missing values are related to variables that deal with a second applicant. It doesn’t seem like these are critical to exploratory analysis or inference but we should still keep them in mind. We could explore different ways to impute some of the missing values, especially if we want to use the variables as part of a model for certain types of borrowers.

看起來很多缺失的值都和處理第二個申請者的變數有關。這些似乎對探索性分析或推理並不重要,但我們仍應牢記在心。我們可以探索不同的方法來填補一些缺失的價值,特別是如果我們想把這些變數作為某類借款人模型的一部分。

Visualzing Distributions視覺化分佈

# Getting raw counts of continuous and categorical vars, then getting fully complete ones (no NAs)
cont <- sapply(loan_data, class) == "numeric"
cat <- sapply(loan_data, class) == "character"
cont_full <- sum(!(names(cont[cont == TRUE]) %in% sparse_count$Variable))
cat_full <- sum(!(names(cat[cat == TRUE]) %in% sparse_count$Variable))

At this point, we’re left with 65 continuous variables and 14 categorical variables with noNAs. The full data set contains 94 continuous variables and 19 categorical variables. We’ve also got 96779 observations, with each row representing a unique loan. It’s a bit difficult to think about this much data at once. A good first step is to create some plots to better understand the most important variables.

First, let’s try to break out the total loan volume in the first quarter of 2017. Let’s get a feel forwhois borrowing the money,whatthey’re using it for,wherethey live, and theirriskprofiles.

此時,我們只剩下65個連續變數和14個沒有NAs的分類變數。整個資料集包含94個連續變數和19個分類變數。我們還有96779個觀察值,每一行代表一個獨特的貸款。一次考慮這麼多資料有點困難。一個好的第一步是建立一些圖,以便更好地理解最重要的變數。首先,我們試著把2017年一季度的貸款總額分解出來。讓我們瞭解一下誰在借錢,他們用它做什麼,他們住在哪裡,以及他們的風險狀況。
# Cleaning up registered nurse double counting across loan_data
loan_data$emp_title[loan_data$emp_title %in% c("RN","Rn","rn","nurse","Nurse")] <- "Registered Nurse"
loan_data$emp_title[is.na(loan_data$emp_title)] <- "Not Available"
# Aggregating up total loans by emp_title
loan_by_emp <- loan_data %>% 
               group_by(emp_title) %>% 
               summarize(`Total Loans ($)` = sum(loan_amnt)) %>%
               arrange(desc(`Total Loans ($)`))
# Getting percentage information since we can only plot a subset
loan_by_emp$emp_title <- paste0(loan_by_emp$emp_title," - ",paste0(round(100*loan_by_emp$`Total Loans ($)`/sum(loan_by_emp$`Total Loans ($)`),1),"%"))
loan_by_emp_plot <- ggplot(loan_by_emp[1:10,], aes(x = reorder(emp_title,-`Total Loans ($)`), 
                                                   y = (`Total Loans ($)`)/1e6, 
                                                   fill = I("dodgerblue4"),
                                                   alpha = I(rep(0.7,10)),
                                                   col = I("grey29"))) + 
                    geom_bar(stat = "identity") +
                    theme(axis.text.x = element_text(angle = 55, hjust = 1)) +
                    xlab("Job Title - % of Total") +
                    ylab("Total Loans - Millions of $")
# Aggregating up by purpose
loan_by_purp <- loan_data %>% 
                group_by(title) %>% 
                summarize(`Total Loans ($)` = sum(loan_amnt)) %>%
                arrange(desc(`Total Loans ($)`))
# Getting percentage information
loan_by_purp$title <- paste0(loan_by_purp$title," - ",paste0(round(100*loan_by_purp$`Total Loans ($)`/sum(loan_by_purp$`Total Loans ($)`),1),"%"))
loan_by_purp_plot <- ggplot(loan_by_purp, aes(x = reorder(title,-`Total Loans ($)`), 
                                              y = (`Total Loans ($)`)/1e6, 
                                              fill = I("dodgerblue4"),
                                              alpha = I(rep(0.7,12)),
                                              col = I("grey29"))) + 
                     geom_bar(stat = "identity") +
                     theme(axis.text.x = element_text(angle = 55, hjust = 1)) +
                     xlab("Purpose") +
                     ylab(NULL)
grid.arrange(loan_by_emp_plot, loan_by_purp_plot,
             top = textGrob("Total Loans by Job Title and Purpose"),
             ncol = 2)

我們可以看到,很多職位實際上都不見了,這可能是因為貸款俱樂部選擇隱藏這些資訊來保持借款人的匿名性。註冊護士、經理、教師和企業主是第二大類別。然而,這些只佔貸款總額的7.5%左右。這個分佈有一個強大的右尾,橫跨33052個不同的職位。按用途劃分的貸款量分佈實際上與此相反。絕大多數貸款都是用來鞏固債務的。只有11個其他的目的,都顯示在圖中。

現在讓我們來看看借款人居住的州:We can see that many of the job titles are actually missing, which could be because Lending Club chooses to hide that information to maintain the borrowers’ anonymity. Registered Nurse, Manager, Teacher, and Business Owner form the next largest categories. However, these only account for about 7.5% of the loan volume. This distribution has a strong right tail that stretches across the 33052 different job titles. The distribution of loan volume bypurposeis actually the opposite of this. The vast majority of the loans have been taken out to consolidate debt. There are only 11 other purposes, which are all shown in the plot.

Now let’s take a look at what states the borrowers live in:

# Aggregating up by state
loan_by_state <- loan_data %>% 
                 group_by(addr_state) %>%
                 summarize(`Total Loans ($)` = sum(loan_amnt)/1e6) %>%
                 arrange(desc(`Total Loans ($)`))
colnames(loan_by_state) <- c("region","value")
# Getting summary percentage of top 4 regions
top4_states <- round(100*sum(loan_by_state$value[1:4])/sum(loan_by_state$value),1)
# Replacing out the state codes with their full names for plotting
lib_load("rgdal")
lib_load("choroplethrMaps")
lib_load("choroplethr")
data("state.regions")
loan_by_state$region <- sapply(loan_by_state$region, function(state_code) {
  
  inx <- grep(pattern = state_code, x = state.regions$abb)
  
  state.regions$region[inx]
  
})
# Plotting US map with values
state_choropleth(loan_by_state, title = "           Total Loan Volume by State - Millions $")

2017年第一季度,通過貸款俱樂部借貸的大部分資金流向了加利福尼亞、德克薩斯、紐約和佛羅里達的居民。這些地區佔同期成交量的38.2%。這個排名實際上模仿了這些州以GDP衡量的經濟產出排名。經濟規模較大的國家往往會有更多的人借錢。值得注意的是,貸款俱樂部貸款目前在愛荷華州或西弗吉尼亞州不可用。

最後,我們可以想象出借款人風險狀況的各種度量。我們可以先看看每個等級的利率分佈。

Most of the funds borrowed through Lending Club in the first quarter for 2017 went to people in California, Texas, New York, and Florida. These regions accounted for 38.2% of the volume during the period. This ranking actually mimics the ranking of those states’ economic output asmeasured by GDP. States with larger economies tend to have people who borrow more. It’s also interesting to note that Lending Club loans are currently not available in Iowa or West Virginia.

Finally, we can visualize various measures of the borrowers’ risk profiles. We can start off by taking a look at the distribution of the interest rate charged for eachgraderating.

# Grabbing the means
cdat <- data.frame(tapply(loan_data$int_rate, loan_data$grade, mean))
rate_grade_dens <- ggplot(loan_data, aes(x = int_rate, fill = grade)) + 
                    geom_density(alpha = 0.6) +
                      geom_vline(data = cdat, aes(xintercept = cdat, colour =  factor(rownames(cdat))),
                                 linetype = "dashed", size = 1, show.legend = F) +
                      ylab(NULL) +
                      xlab("Interest Rate") + 
                      guides( fill = guide_legend(title = "Loan Grade")) + 
                      theme(axis.ticks.y = element_blank(), plot.title = element_text(hjust = 0.5)) + 
                      ggtitle("Interest Rate Distribution by Grade")
rate_grade_dens

利率通常會隨著貸款等級的降低而增加,這是意料之中的。然而,這些分佈似乎相當雜亂,這表明每個年級組中都有不同的風險組。讓我們更深入地瞭解這些年級分組。

The interest rate generally increases as the loan’s grade decreases, which is expected. However, these distributions appear to be quite lumpy, which points to the fact that there are various risk groups within each grade group. Let’s try to get a deeper look at thesegradesubgroups.

lib_load("moments")
## Generic function to create four descriptive plots for each loan grade --> Employment, State, Purpose, Amount
grade_plotter <- function(grade) {
  
  # Filtering for grade
  loan_data_tmp <- loan_data[loan_data$grade == grade,]
  ## Aggregating up total loans by job ##
  loan_by_emp <- loan_data_tmp %>% 
                 group_by(emp_title) %>% 
                 summarize(`Total Loans ($)` = sum(loan_amnt)) %>%
                 arrange(desc(`Total Loans ($)`))
  
  # Getting percentage information since we can only plot a subset
  loan_by_emp$emp_title <- paste0(loan_by_emp$emp_title," - ",paste0(round(100*loan_by_emp$`Total Loans ($)`/sum(loan_by_emp$`Total Loans ($)`),1),"%"))
  
  loan_by_emp_plot <- ggplot(loan_by_emp[1:10,], aes(x = reorder(emp_title,-`Total Loans ($)`), 
                                                     y = (`Total Loans ($)`)/1e6, 
                                                     fill = I("dodgerblue4"),
                                                     alpha = I(rep(0.7,10)),
                                                     col = I("grey29"))) + 
                      geom_bar(stat = "identity") +
                      theme(axis.text.x = element_text(angle = 55, hjust = 1)) +
                      xlab("Job Title - % of Total") +
                      ylab("Total Loans - Millions of $")
  
  ## Aggregating up total loans by purpose ##
  loan_by_purp <- loan_data_tmp %>% 
                  group_by(title) %>% 
                  summarize(`Total Loans ($)` = sum(loan_amnt)) %>%
                  arrange(desc(`Total Loans ($)`))
  # Getting percentage information
  loan_by_purp$title <- paste0(loan_by_purp$title," - ",paste0(round(100*loan_by_purp$`Total Loans ($)`/sum(loan_by_purp$`Total Loans ($)`),1),"%"))
  
  loan_by_purp_plot <- ggplot(loan_by_purp, aes(x = reorder(title,-`Total Loans ($)`), 
                                                y = (`Total Loans ($)`)/1e6, 
                                                fill = I("dodgerblue4"),
                                                alpha = I(rep(0.7,12)),
                                                col = I("grey29"))) + 
                     geom_bar(stat = "identity") +
                     theme(axis.text.x = element_text(angle = 55, hjust = 1)) +
                     xlab("Purpose") +
                     ylab(NULL)
  
  ## Aggregating up total loans by state ##
  loan_by_state <- loan_data_tmp %>% 
                   group_by(addr_state) %>% 
                   summarize(`Total Loans ($)` = sum(loan_amnt)) %>%
                   arrange(desc(`Total Loans ($)`))  
  
  # Finding full state names, capitalizing first letter of each one
  loan_by_state$addr_state <- sapply(loan_by_state$addr_state, function(state_code) {
    
    inx <- grep(pattern = state_code, x = state.regions$abb)
    
    state.regions$region[inx]
  
  }, USE.NAMES = F)
  
  # Borrowed from: https://stackoverflow.com/questions/6364783/capitalize-the-first-letter-of-both-words-in-a-two-word-string
  loan_by_state$addr_state <- sapply(loan_by_state$addr_state, function(state_name) {
    
    split <- strsplit(x=state_name, " ")[[1]]
    
      paste(toupper(substring(split, 1,1)), 
            substring(split, 2), sep="", collapse=" ")
  }, USE.NAMES = F)
  
  # Getting percentage information especially since we can only plot a subset
  loan_by_state$addr_state <- paste0(loan_by_state$addr_state," - ",paste0(round(100*loan_by_state$`Total Loans ($)`/sum(loan_by_state$`Total Loans ($)`),1),"%"))
  
  loan_by_state_plot <- ggplot(loan_by_state[1:10,], aes(x = reorder(addr_state,-`Total Loans ($)`), 
                                                         y = (`Total Loans ($)`)/1e6, 
                                                         fill = I("dodgerblue4"),
                                                         alpha = I(rep(0.7,10)),
                                                         col = I("grey29"))) + 
                    geom_bar(stat = "identity") +
                    theme(axis.text.x = element_text(angle = 55, hjust = 1)) +
                    xlab("State - % of Total") +
                    ylab("Total Loans - Millions of $")
  
  ## Aggregating up by loan amount ##
  loan_amnt_tmp <- loan_data_tmp$loan_amnt
  
  loan_amnt_hist <- qplot(loan_amnt_tmp, fill = I("dodgerblue4"), 
                           alpha = I(0.7), col = I("grey29")) + xlab("Loan Amount") + ylab("Count") + 
                            geom_vline(aes(xintercept = mean(loan_amnt_tmp)), 
                                       color = "dodgerblue4", 
                                       linetype = "dashed", 
                                       size = 2) +
                            annotate("text", x = Inf, y = Inf, 
                                     label = sprintf("\n Mean: %s  \n Average Deviation: %s   \n Skewness: %s   \n Kurtosis: %s   ",
                                                     round(mean(loan_amnt_tmp)),
                                                     round(mean(abs(loan_amnt_tmp-mean(loan_amnt_tmp)))),
                                                     round(skewness(loan_amnt_tmp),2),
                                                     round(kurtosis(loan_amnt_tmp),2)), 
                                     vjust = 1, hjust = 1)
  
  # Arranging plots into grid
  grid.arrange(loan_by_emp_plot, loan_by_purp_plot, loan_by_state_plot, loan_amnt_hist,
               widths = c(4,4), heights = c(4,3),   
               top = textGrob(sprintf("Grade %s Loan Volume by Employment, Purpose, State, and Counts",grade)))
  
  return(NULL)
}
# This seems to be a bug, but R errors out due on an obscure dplyr issue that is resolved with:
# 1) Restarting the R session at this point
# 2) Running library(ggplot2); library(dplyr); library(gridExtra); library(moments); library(grid)
# 3) Running the rest of the chunks

目的、狀態、職稱的相對順序在年級間變化不大。貸款規模的分佈確實會隨著年級的降低而向右移動。也就是說,風險較高的人往往比風險較小的人借款更多。隨著等級的降低,將貸款用於債務合併的趨勢更大。整個資料集中的變數之間可能存在複雜的互動作用,可以更好地解釋年級組內的利率差異。我們可能需要建立各種模型,試圖捕捉是什麼讓借款人跌入某個級別,以及什麼決定了利率。

The relative orders of the purpose, state, and job title don’t change much between grades. The distribution of the loan size does shift to the right as the grade decreases. That is, people that are riskier tend to borrow more than those that are less risky. As the grade decreases there is a greater tendency to use the loan for debt consolidation. There are likely complex interactions amongst the variables in the full data set that better explain the differences in interest rates withingradegroups. We’d likely need to build various models to try to capture what makes a borrower fall under a certain grade and what dictates the interest rate.

Summary

We’ve broken out the data set into continuous and categorical variables. These were scrubbed and analyzed for sparsity. Once we were comfortable with the data set, we moved on to visualizing the relationships between the variables. We broke out loan volume by state, job, and purpose. We also looked at the distribution of interest rate by loan grade and dived deeper into each grade’s statistics. Now that we have a reasonably clean data set and some aggregate information on the variables, we can move on to building models for inference and prediction.

總結

我們已經將資料集分解成連續的和分類的變數。這些被擦洗和分析稀疏。一旦我們熟悉了資料集,我們就開始視覺化變數之間的關係。我們按州、工作和用途列出了貸款量。我們還研究了按貸款級別劃分的利率分佈情況,並深入研究了每個級別的統計資料。現在我們有了一個相當乾淨的資料集和一些關於變數的聚合資訊,我們可以繼續構建用於推斷和預測的模型。

python金融風控評分卡模型和資料分析微專業課

騰訊課堂報名入口

https://ke.qq.com/course/package/31250?tuin=dcbf0ba

網易雲課堂報名入口

https://study.163.com/series/1202875601.htm?share=2&shareId=400000000398149

(騰訊課堂新營業,報名可領取20元優惠券)