Logistic Regression: A Worked Example

阿新 • Published: 2018-08-26

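Introduction

Logistic regression is a machine learning classification algorithm used to predict the probability of a categorical dependent variable. In logistic regression, the dependent variable is binary, coded as 1 (yes, success, etc.) or 0 (no, failure, etc.). In other words, the logistic regression model predicts P(Y = 1) as a function of X.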
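
To make "P(Y = 1) as a function of X" concrete, here is a minimal sketch of the logistic function applied to a linear combination of the inputs. The weights, bias and input values are made-up numbers for illustration, not coefficients fitted to the bank data.

```python
import math

def sigmoid(z):
    """Logistic function: maps any real number into the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba(weights, bias, x):
    """P(Y = 1 | x) for one observation x under the given (hypothetical) parameters."""
    z = bias + sum(w * xi for w, xi in zip(weights, x))
    return sigmoid(z)

# Hypothetical coefficients, chosen only for illustration.
weights = [0.8, -0.5]
bias = -1.0
p = predict_proba(weights, bias, [2.0, 1.0])  # z = -1.0 + 1.6 - 0.5 = 0.1
print(round(p, 4))  # 0.525
```

Fitting the model means choosing the weights and bias so that these probabilities match the observed 0/1 outcomes as closely as possible.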

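Data

The dataset comes from the UCI Machine Learning Repository and relates to the direct marketing campaigns (phone calls) of a Portuguese banking institution. The classification goal is to predict whether a client will subscribe to a term deposit (variable y). The dataset can be downloaded from the UCI repository.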

import pandas as pd
import numpy as np
from sklearn import preprocessing
import matplotlib.pyplot as plt
plt.rc("font", size=14)
from sklearn.linear_model import LogisticRegression
# sklearn.cross_validation was removed in scikit-learn 0.20; use model_selection instead
from sklearn.model_selection import train_test_split
import seaborn as sns
sns.set(style="whitegrid", color_codes=True)

data = pd.read_csv('F:/wd.jupyter/datasets/log_data/bank.csv', delimiter=';')
data = data.dropna()

print(data.shape)
print(list(data.columns))

data.head()

(41188, 21)
['age', 'job', 'marital', 'education', 'default', 'housing', 'loan', 'contact', 'month', 'day_of_week', 'duration', 'campaign', 'pdays', 'previous', 'poutcome', 'emp.var.rate', 'cons.price.idx', 'cons.conf.idx', 'euribor3m', 'nr.employed', 'y']

The dataset provides information on the bank's customers. It contains 41,188 records and 21 fields.

Variables

age (numeric)
job : type of job (categorical: “admin”, “blue-collar”, “entrepreneur”, “housemaid”, “management”, “retired”, “self-employed”, “services”, “student”, “technician”, “unemployed”, “unknown”)
marital : marital status (categorical: “divorced”, “married”, “single”, “unknown”)
education (categorical: “basic.4y”, “basic.6y”, “basic.9y”, “high.school”, “illiterate”, “professional.course”, “university.degree”, “unknown”)
default: has credit in default? (categorical: “no”, “yes”, “unknown”)
housing: has housing loan? (categorical: “no”, “yes”, “unknown”)
loan: has personal loan? (categorical: “no”, “yes”, “unknown”)
contact: contact communication type (categorical: “cellular”, “telephone”)
month: last contact month of year (categorical: “jan”, “feb”, “mar”, …, “nov”, “dec”)
day_of_week: last contact day of the week (categorical: “mon”, “tue”, “wed”, “thu”, “fri”)
duration: last contact duration, in seconds (numeric). Important note: this attribute highly affects the output target (e.g., if duration=0 then y=’no’). The duration is not known before a call is performed, also, after the end of the call, y is obviously known. Thus, this input should only be included for benchmark purposes and should be discarded if the intention is to have a realistic predictive model
campaign: number of contacts performed during this campaign and for this client (numeric, includes last contact)
pdays: number of days that passed by after the client was last contacted from a previous campaign (numeric; 999 means client was not previously contacted)
previous: number of contacts performed before this campaign and for this client (numeric)
poutcome: outcome of the previous marketing campaign (categorical: “failure”, “nonexistent”, “success”)
emp.var.rate: employment variation rate (numeric)
cons.price.idx: consumer price index (numeric)
cons.conf.idx: consumer confidence index (numeric)
euribor3m: euribor 3 month rate (numeric)
nr.employed: number of employees (numeric)

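Target variable

y: has the client subscribed to a term deposit? (binary: “1” means “yes”, “0” means “no”)

The education column has many categories, and reducing them makes modelling easier. The column contains the following categories: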

data['education'].unique()

array(['basic.4y', 'high.school', 'basic.6y', 'basic.9y',
       'professional.course', 'unknown', 'university.degree',
       'illiterate'], dtype=object)

Let us group “basic.4y”, “basic.9y” and “basic.6y” together and call them “basic”.

# Collapse the three basic.* levels into a single 'basic' level
basic_levels = ['basic.4y', 'basic.6y', 'basic.9y']
data['education'] = np.where(data['education'].isin(basic_levels), 'basic', data['education'])
data['education'].unique()

array(['basic', 'high.school', 'professional.course', 'unknown',
       'university.degree', 'illiterate'], dtype=object)

If np.where is unfamiliar: np.where(condition, a, b) returns a wherever the condition holds and b everywhere else.
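
The element-wise selection that np.where performs can be sketched in plain Python. The helper name collapse_basic and the sample list are illustrative only and do not come from the dataset.

```python
BASIC_LEVELS = {"basic.4y", "basic.6y", "basic.9y"}

def collapse_basic(levels):
    """Replace every basic.* level with 'basic'; leave all other levels untouched."""
    return ["basic" if level in BASIC_LEVELS else level for level in levels]

education = ["basic.4y", "high.school", "basic.9y", "unknown", "basic.6y"]
collapsed = collapse_basic(education)
print(collapsed)
```

Each element is tested against the condition independently, which is the same behaviour np.where applies to a whole column at once.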

Data exploration

Overview

data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 41188 entries, 0 to 41187
Data columns (total 21 columns):
age               41188 non-null int64
job               41188 non-null object
marital           41188 non-null object
education         41188 non-null object
default           41188 non-null object
housing           41188 non-null object
loan              41188 non-null object
contact           41188 non-null object
month             41188 non-null object
day_of_week       41188 non-null object
duration          41188 non-null int64
campaign          41188 non-null int64
pdays             41188 non-null int64
previous          41188 non-null int64
poutcome          41188 non-null object
emp.var.rate      41188 non-null float64
cons.price.idx    41188 non-null float64
cons.conf.idx     41188 non-null float64
euribor3m         41188 non-null float64
nr.employed       41188 non-null float64
y                 41188 non-null object
dtypes: float64(5), int64(5), object(11)
memory usage: 6.9+ MB

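As shown, there are 5 float64 columns, 5 int64 columns and 11 object columns.

Convert y to a numeric column and count the classes.

# Map the labels to integers so that y has a numeric dtype
data['y'] = data['y'].map({'yes': 1, 'no': 0})
data['y'].value_counts()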

0    36548
1     4640
Name: y, dtype: int64

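sns.countplot(x='y', data=data, palette='hls')

The target variable has 36,548 “no” values and 4,640 “yes” values.

Let us look more closely at the two classes.

# numeric_only=True skips the text columns, which recent pandas versions refuse to average
data.groupby('y').mean(numeric_only=True)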

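Observations:

The average age of customers who bought the term deposit is higher than that of customers who did not.
For customers who bought it, pdays (days since the customer was last contacted) is, understandably, lower. The lower pdays is, the fresher the memory of the last call and hence the better the chance of a sale. Note that pdays uses 999 for clients who were never contacted before, so these group means are heavily influenced by that sentinel value.
Surprisingly, campaign (the number of contacts or calls made during the current campaign) is lower for customers who bought the term deposit.

We can compute the group means for other categorical variables, such as education and marital status, to get a more detailed picture of the data.

data.groupby('education').mean(numeric_only=True)

data.groupby('marital').mean(numeric_only=True)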

Visualization

Job and y

pd.crosstab(data.job, data.y).plot(kind='bar')
plt.title('Purchase Frequency for Job Title')
plt.xlabel('Job')
plt.ylabel('Frequency of Purchase')
#plt.savefig('purchase_fre_job')

The frequency of purchase depends heavily on the job title, so job title may be a good predictor of the outcome variable.

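Marital status and y:

table = pd.crosstab(data.marital, data.y)
# Divide each row by its total so the bars show proportions rather than counts
table.div(table.sum(axis=1).astype(float), axis=0).plot(kind='bar', stacked=True)
plt.title('Stacked Bar Chart of Marital Status vs Purchase')
plt.xlabel('Marital Status')
plt.ylabel('Proportion of Customers')

Marital status does not appear to be a strong predictor of the outcome variable.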

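Education and y

table = pd.crosstab(data.education, data.y)
# stacked=True added so that the plot matches its "Stacked Bar Chart" title
table.div(table.sum(axis=1).astype(float), axis=0).plot(kind='bar', stacked=True)
plt.title('Stacked Bar Chart of Education Status vs Purchase')
plt.xlabel('Education Status')
plt.ylabel('Proportion of Customers')

Education appears to be a good predictor of the outcome variable.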

day_of_week and y:

pd.crosstab(data.day_of_week, data.y).plot(kind='bar')
plt.title('Purchase Frequency for Day of Week')
plt.xlabel('Day of Week')
plt.ylabel('Frequency of Purchase')
plt.savefig('pur_dayofweek_bar')

Day of week may not be a good predictor of the outcome.

month and y

pd.crosstab(data.month, data.y).plot(kind='bar')
plt.title('Purchase Frequency for Month')
plt.xlabel('Month')
plt.ylabel('Frequency of Purchase')
plt.savefig('pur_fre_month_bar')

Month appears to be a good predictor of the outcome.

Distribution of age:

data.age.hist()
plt.title('Histogram of Age')
plt.xlabel('Age')
plt.ylabel('Frequency')

Most of the bank's customers in this dataset are between 30 and 40 years old.

poutcome and y:

pd.crosstab(data.poutcome, data.y).plot(kind='bar')
plt.title('Purchase Frequency for Poutcome')
plt.xlabel('Poutcome')
plt.ylabel('Frequency of Purchase')

poutcome appears to be a good predictor of the outcome.

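Creating dummy variables

Dummy variables take only two values, 0 and 1.

Recall from the dataset overview that there are 11 object columns. y has already been converted, so 10 categorical columns remain to be encoded.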
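
As a rough illustration of what dummy encoding produces, the sketch below builds one 0/1 indicator column per category by hand. The helper one_hot and the four sample values are hypothetical; the article itself relies on pd.get_dummies for this step.

```python
def one_hot(values, prefix):
    """Return {column_name: [0 or 1, ...]} with one indicator column per category."""
    categories = sorted(set(values))
    return {
        f"{prefix}_{cat}": [1 if v == cat else 0 for v in values]
        for cat in categories
    }

marital = ["married", "single", "married", "divorced"]
dummies = one_hot(marital, prefix="marital")
for name, column in dummies.items():
    print(name, column)
```

Every row has exactly one indicator set to 1, and the column names follow the prefix_category pattern.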

cat_vars = ['job', 'marital', 'education', 'default', 'housing', 'loan', 'contact', 'month', 'day_of_week', 'poutcome']
for var in cat_vars:
    # One indicator column per category, named <var>_<category>
    cat_list = pd.get_dummies(data[var], prefix=var)
    data = data.join(cat_list)

# Drop the original categorical columns; keep the numeric and dummy columns
data_vars = data.columns.values.tolist()
to_keep = [i for i in data_vars if i not in cat_vars]

data_final = data[to_keep]
data_final.columns.values

These are the final columns.

Separating features and target

data_final_vars = data_final.columns.values.tolist()
y = ['y']
X = [i for i in data_final_vars if i not in y]

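Feature selection

Recursive Feature Elimination (RFE) works as follows. First, an estimator is trained on the initial feature set, and the importance of each feature is read from its coef_ or feature_importances_ attribute. Then the least important feature is removed from the current set. The procedure is repeated on the pruned set until the desired number of features remains.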
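
The elimination loop described above can be sketched in a few lines of plain Python. This is a toy illustration, not scikit-learn's implementation: the importance score (absolute covariance with the target) stands in for the coef_ values a fitted estimator would provide, and the feature names and data are invented.

```python
def abs_covariance(column, target):
    """Toy importance score: |cov(feature, target)|, a stand-in for |coef_|."""
    n = len(target)
    mean_c = sum(column) / n
    mean_t = sum(target) / n
    return abs(sum((c - mean_c) * (t - mean_t) for c, t in zip(column, target)) / n)

def recursive_elimination(features, target, n_keep, importance=abs_covariance):
    """Drop the least important feature one at a time until n_keep remain."""
    remaining = dict(features)
    eliminated = []
    while len(remaining) > n_keep:
        # A real RFE refits the estimator here, so importances can change each round.
        scores = {name: importance(col, target) for name, col in remaining.items()}
        weakest = min(scores, key=scores.get)
        eliminated.append(weakest)
        del remaining[weakest]
    return list(remaining), eliminated

target = [0, 0, 1, 1, 1, 0]
features = {
    "strong": [0, 0, 1, 1, 1, 0],  # identical to the target
    "weak":   [0, 1, 1, 0, 1, 0],  # partly related to the target
    "noise":  [1, 1, 1, 1, 1, 1],  # constant, carries no information
}
kept, eliminated = recursive_elimination(features, target, n_keep=1)
print(kept, eliminated)
```

The order in which features are eliminated corresponds to what RFE reports in ranking_, with the surviving features ranked 1.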

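from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

logreg = LogisticRegression()
# n_features_to_select must be passed by keyword in recent scikit-learn releases
rfe = RFE(logreg, n_features_to_select=18)
# Pass the target as a 1-D integer Series rather than a one-column DataFrame
rfe = rfe.fit(data_final[X], data_final['y'].astype(int))
print(rfe.support_)
print(rfe.ranking_)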

Select the features we want using the boolean mask:

from itertools import compress

cols = list(compress(X, rfe.support_))
cols

Or:
cols = [col for col, keep in zip(X, rfe.support_) if keep]

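Implementing the model

import statsmodels.api as sm

# statsmodels expects numeric input; dummy columns may be boolean in recent pandas
X = data_final[cols].astype(float)
y = data_final['y'].astype(int)


logit_model = sm.Logit(y, X)
logit_model.raise_on_perfect_prediction = False
result = logit_model.fit()
# Printing the Summary object renders the text table
print(result.summary())

Most of the variables have p-values below 0.05, so most of them are statistically significant in the model.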

Fitting the logistic regression model

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn import metrics

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
logreg = LogisticRegression()
logreg.fit(X_train, y_train)
y_pred = logreg.predict(X_test)
print('Accuracy of logistic regression classifier on test set: {:.2f}'.format(logreg.score(X_test, y_test)))

The accuracy on the test set is about 0.90.
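
An accuracy figure is easier to judge next to the score of a trivial classifier that always predicts the majority class. The sketch below derives that baseline from the class counts printed earlier by data['y'].value_counts() (36548 and 4640); the variable names are illustrative.

```python
# Class counts reported by data['y'].value_counts() in the article.
n_no, n_yes = 36548, 4640
total = n_no + n_yes

# Accuracy of always predicting the more frequent class.
baseline_accuracy = max(n_no, n_yes) / total
positive_rate = n_yes / total

print(f"majority-class accuracy: {baseline_accuracy:.3f}")  # 0.887
print(f"share of 'yes':          {positive_rate:.3f}")      # 0.113
```

Because the classes are this imbalanced, an accuracy near 0.90 is only modestly above the baseline, which is why precision, recall and F1 are worth checking as well.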

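Cross-validation

Cross-validation tries to guard against overfitting while still producing a prediction for every observation in the dataset. We use 10-fold cross-validation to train our logistic regression model.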
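
To show how k-fold splitting works, here is a minimal sketch that partitions sample indices into consecutive folds without shuffling. The function name kfold_indices and the sample size of 23 are assumptions made for the example; in practice scikit-learn's KFold does this job.

```python
def kfold_indices(n_samples, n_splits):
    """Yield (train_idx, test_idx) pairs; each sample lands in exactly one test fold."""
    # The first n_samples % n_splits folds receive one extra sample.
    fold_sizes = [
        n_samples // n_splits + (1 if i < n_samples % n_splits else 0)
        for i in range(n_splits)
    ]
    start = 0
    for size in fold_sizes:
        test_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        yield train_idx, test_idx
        start += size

folds = list(kfold_indices(n_samples=23, n_splits=10))
for train_idx, test_idx in folds[:2]:
    print(len(train_idx), test_idx)
```

The model is trained on each train_idx and scored on the matching test_idx, and the ten scores are averaged.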

from sklearn import model_selection
from sklearn.model_selection import cross_val_score

# random_state only takes effect with shuffle=True; recent scikit-learn raises an error otherwise
kfold = model_selection.KFold(n_splits=10, shuffle=True, random_state=7)
modelCV = LogisticRegression()
scoring = 'accuracy'
results = model_selection.cross_val_score(modelCV, X_train, y_train, cv=kfold, scoring=scoring)
print("10-fold cross validation average accuracy: %.3f" % (results.mean()))

The average accuracy stays very close to the accuracy of the logistic regression model on the test set, so the model appears to generalize well.

Confusion Matrix

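from sklearn.metrics import confusion_matrix

# Use a separate name so the imported function is not overwritten
cm = confusion_matrix(y_test, y_pred)
print(cm)

The result tells us that we have 10848 + 2564 correct predictions and 1124 + 121 incorrect predictions. These counts sum to 14,657, whereas a 30% test split of 41,188 records holds 12,357 rows, so they do not match the split used above and should be re-read from the printed matrix.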
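
The four cells of a binary confusion matrix can be counted by hand, as in the sketch below. It uses the same layout scikit-learn uses for labels 0 and 1 (rows are true classes, columns are predicted classes); the label lists are invented for illustration.

```python
def confusion_counts(y_true, y_pred):
    """Return [[tn, fp], [fn, tp]] for binary labels 0 and 1."""
    tn = fp = fn = tp = 0
    for truth, pred in zip(y_true, y_pred):
        if truth == 0 and pred == 0:
            tn += 1
        elif truth == 0 and pred == 1:
            fp += 1
        elif truth == 1 and pred == 0:
            fn += 1
        else:
            tp += 1
    return [[tn, fp], [fn, tp]]

y_true = [0, 0, 0, 0, 1, 1, 1, 0]
y_hat  = [0, 0, 1, 0, 1, 0, 1, 0]
matrix = confusion_counts(y_true, y_hat)
print(matrix)  # [[4, 1], [1, 2]]
```

The diagonal holds the correct predictions (tn and tp), and the off-diagonal cells hold the errors (fp and fn).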

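Precision, recall, F-measure and support

Precision is the ratio tp / (tp + fp), where tp is the number of true positives and fp the number of false positives. Intuitively, precision is the ability of the classifier not to label a negative sample as positive.

Recall is the ratio tp / (tp + fn), where tp is the number of true positives and fn the number of false negatives. Intuitively, recall is the ability of the classifier to find all the positive samples.

The F-beta score can be interpreted as a weighted harmonic mean of precision and recall; it reaches its best value at 1 and its worst at 0.

The F-beta score weights recall more than precision by a factor of beta. beta = 1.0 means recall and precision are equally important.

Support is the number of occurrences of each class in y_test.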
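
These definitions translate directly into arithmetic. The sketch below computes precision, recall and F-beta from raw counts; the counts (tp=30, fp=10, fn=20) are made up for the example, and the function assumes the denominators are non-zero.

```python
def precision_recall_fbeta(tp, fp, fn, beta=1.0):
    """Precision, recall and F-beta from counts (assumes non-zero denominators)."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    b2 = beta ** 2
    fbeta = (1 + b2) * precision * recall / (b2 * precision + recall)
    return precision, recall, fbeta

precision, recall, f1 = precision_recall_fbeta(tp=30, fp=10, fn=20)
_, _, f2 = precision_recall_fbeta(tp=30, fp=10, fn=20, beta=2.0)
print(precision, recall, round(f1, 4), round(f2, 4))
```

With beta = 2 the score moves toward the recall value, reflecting the heavier weight placed on recall.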

from sklearn.metrics import classification_report

print(classification_report(y_test, y_pred))

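Interpretation: across the test set, 88% of the term deposits that were promoted were ones the customers wanted (precision), and 90% of the term deposits the customers wanted were promoted (recall).

Macro F1 Score

from sklearn.metrics import f1_score

print(f1_score(y_test, y_pred, average='macro'))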

0.6217653450907061
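
Macro averaging computes F1 separately for each class and then takes the unweighted mean, so the minority class counts as much as the majority class. The sketch below shows this on an invented, imbalanced example; the helper names are illustrative.

```python
def f1_for_class(y_true, y_pred, label):
    """F1 score treating `label` as the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == label and p != label)
    if tp == 0:
        return 0.0
    return 2 * tp / (2 * tp + fp + fn)

def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores."""
    labels = sorted(set(y_true))
    return sum(f1_for_class(y_true, y_pred, c) for c in labels) / len(labels)

y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
y_hat  = [0, 0, 0, 0, 0, 0, 0, 1, 1, 0]
accuracy = sum(t == p for t, p in zip(y_true, y_hat)) / len(y_true)
score = macro_f1(y_true, y_hat)
print(accuracy, score)  # 0.8 0.6875
```

This is a plausible reason why the macro F1 above (about 0.62) sits well below the 0.90 accuracy: weak performance on the rare "yes" class pulls the average down.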

ROC curve

from sklearn.metrics import roc_auc_score
from sklearn.metrics import roc_curve

# Use the predicted probability of class 1 for both the curve and the AUC;
# hard 0/1 predictions would give the area of a single-threshold curve instead
y_score = logreg.predict_proba(X_test)[:, 1]
logit_roc_auc = roc_auc_score(y_test, y_score)
fpr, tpr, thresholds = roc_curve(y_test, y_score)
plt.figure()
plt.plot(fpr, tpr, label='Logistic Regression (area = %0.2f)' % logit_roc_auc)
plt.plot([0, 1], [0, 1], 'r--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver operating characteristic')
plt.legend(loc="lower right")
#plt.savefig('Log_ROC')
plt.show()

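The ROC curve is another common tool used with binary classifiers. The dotted line represents the ROC curve of a purely random classifier; a good classifier stays as far from that line as possible (toward the top-left corner).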
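
To see where the points of an ROC curve come from, the sketch below sweeps a decision threshold over a handful of scores and records the false positive rate and true positive rate at each step, then integrates the curve with the trapezoid rule. The labels, scores and function names are invented for the example.

```python
def roc_points(y_true, scores):
    """(fpr, tpr) pairs obtained by lowering the threshold over each distinct score."""
    pos = sum(y_true)
    neg = len(y_true) - pos
    points = [(0.0, 0.0)]
    for thr in sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(y_true, scores) if s >= thr and y == 1)
        fp = sum(1 for y, s in zip(y_true, scores) if s >= thr and y == 0)
        points.append((fp / neg, tp / pos))
    return points

def area_under_curve(points):
    """Trapezoid rule over consecutive (fpr, tpr) points."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))

labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
points = roc_points(labels, scores)
auc_value = area_under_curve(points)
print(points)
print(auc_value)  # 0.75
```

A random classifier would trace the diagonal with an area of 0.5, while a perfect one would reach the top-left corner with an area of 1.0.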
