Python Data Mining - Regression - Logistic Regression

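Concept

Logistic regression is a statistical method for regression analysis when the dependent variable is categorical. It is a probabilistic, nonlinear form of regression.

  Advantages: the algorithm is easy to implement and deploy, and it is efficient and accurate.

  Disadvantages: discrete independent variables can only be used after being converted into dummy variables.

In linear regression the dependent variable is continuous, so a regression equation can be built from the linear relationship between the dependent and independent variables. Once the dependent variable is categorical, that relationship no longer holds. The model instead treats the log-odds (logit) of the class probability as a linear function of the independent variables; the Sigmoid function, which is the inverse of the logit, maps the linear combination back to a probability.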
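The relationship between the linear part, the Sigmoid function, and the log-odds can be sketched in a few lines. This is only an illustration: the names `sigmoid` and `logit` and the coefficient values are hypothetical and are not taken from the course data.

```python
import math

def sigmoid(z):
    """Map a linear predictor z to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def logit(p):
    """Log-odds of a probability p; the inverse of sigmoid."""
    return math.log(p / (1.0 - p))

# Hypothetical coefficients, chosen only for illustration
intercept, coef = -1.0, 0.5
x = 4.0

z = intercept + coef * x       # linear part, as in linear regression
p = sigmoid(z)                 # probability of the positive class
label = 1 if p >= 0.5 else 0   # classify with a 0.5 threshold
```

Applying `logit` to `p` returns `z`, which is why the model is still linear on the log-odds scale even though it is nonlinear on the probability scale.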

Steps:

1. Read the data;

import pandas
from pandas import read_csv

data = read_csv(
    "C:\\Users\\Jw\\Desktop\\python_work\\Python數據挖掘實戰課程課件\\4.4\\data.csv",
    encoding="utf-8")

# Drop every row that contains a missing value
data = data.dropna()

# (number of rows, number of columns)
data.shape
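To see what `dropna()` does before running it on the real file, here is a small sketch on an inline CSV. The rows and the name `csv_text` are made up for illustration; only the column names echo the ones used later in this post.

```python
import io
import pandas

# Hypothetical inline CSV standing in for data.csv
csv_text = """Age,Gender,Num Cars
34,Male,2
,Female,1
51,,0
28,Female,1
"""

sample = pandas.read_csv(io.StringIO(csv_text))
cleaned = sample.dropna()   # keeps only rows with no missing value

rows_before = sample.shape[0]
rows_after = cleaned.shape[0]
```

Comparing `rows_before` and `rows_after` shows how many rows are lost, which is worth checking on the real data because logistic regression is fitted only on the rows that remain.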

  

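2. Process the string fields. Fields whose values have no natural order are converted into dummy variables. Fields whose values are comparable (they can be ranked) are mapped one-to-one to numbers with map, which turns the discrete data into numeric data.

  First, process the string fields whose values have no natural order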

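# First, process the string fields whose values have no natural order
dummyColumns = [
    "Gender", "Home Ownership",
    "Internet Connection", "Marital Status",
    "Movie Selector", "Prerec Format",
    "TV Signal"]

for column in dummyColumns:
    data[column] = data[column].astype("category")

dummiesData = pandas.get_dummies(
    data,
    columns=dummyColumns,
    prefix=dummyColumns,
    prefix_sep=" ",    # separator between the column name and the category value
    drop_first=True)   # drop the first level of each field to avoid collinearity among the dummies

data.Gender.unique()   # distinct values of the field

dummiesData.columns    # all column names after encoding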
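The effect of `prefix_sep=" "` and `drop_first=True` is easier to see on a tiny frame. The sketch below uses made-up rows in a frame called `toy`; the category values are borrowed from the column names listed later in this post.

```python
import pandas

# Made-up rows, for illustration only
toy = pandas.DataFrame({
    "Age": [30, 45, 52],
    "TV Signal": ["Cable", "Analog antennae", "Digital Satellite"],
})
toy["TV Signal"] = toy["TV Signal"].astype("category")

encoded = pandas.get_dummies(
    toy,
    columns=["TV Signal"],
    prefix=["TV Signal"],
    prefix_sep=" ",
    drop_first=True)

columns = list(encoded.columns)
```

The categories are sorted alphabetically, so "Analog antennae" is the first level and gets dropped. A row with that value has zeros in both remaining dummy columns, which makes it the baseline the other levels are compared against.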

  Then process the string fields whose values have a natural order, using map for a one-to-one mapping

# These values are comparable, so they can be ranked
educationLevelDict = {
    "Post-Doc": 9,
    "Doctorate": 8,
    "Master's Degree": 7,
    "Bachelor's Degree": 6,
    "Associate's Degree": 5,
    "Some College": 4,
    "Trade School": 3,
    "High School": 2,
    "Grade School": 1
    }

# Use map for a one-to-one mapping that turns the discrete data into numeric data
dummiesData["Education Level Map"] = dummiesData["Education Level"].map(educationLevelDict)


freqMap = {
    "Never": 0,
    "Rarely": 1,
    "Monthly": 2,
    "Weekly": 3,
    "Daily": 4}

dummiesData["PPV Freq Map"] = dummiesData["PPV Freq"].map(freqMap)
dummiesData["Theater Freq Map"] = dummiesData["Theater Freq"].map(freqMap)
dummiesData["TV Movie Freq Map"] = dummiesData["TV Movie Freq"].map(freqMap)
dummiesData["Prerec Buying Freq Map"] = dummiesData["Prerec Buying Freq"].map(freqMap)
dummiesData["Prerec Renting Freq Map"] = dummiesData["Prerec Renting Freq"].map(freqMap)
dummiesData["Prerec Viewing Freq Map"] = dummiesData["Prerec Viewing Freq"].map(freqMap)
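One thing to watch with `map` is that any value missing from the dictionary becomes NaN, and scikit-learn's `LogisticRegression` does not accept NaN inputs. A minimal check, assuming a made-up answer "Sometimes" that is not in the mapping:

```python
import pandas

freq_map = {"Never": 0, "Rarely": 1, "Monthly": 2, "Weekly": 3, "Daily": 4}

# "Sometimes" is a made-up value that the dictionary does not cover
answers = pandas.Series(["Daily", "Never", "Sometimes", "Weekly"])

mapped = answers.map(freq_map)

# Values that failed to map show up as NaN; list them so the dictionary can be fixed
unmapped = answers[mapped.isna()].unique()
```

If `unmapped` is not empty, either extend the dictionary or drop those rows before fitting the model.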

3. Select the independent and dependent variables: list all the columns first, then inspect them one by one and choose

# Select the independent and dependent variables
dummiesData.columns

# List all the columns first, then inspect them one by one and choose
dummiesSelect = [
    "Age", "Num Bathrooms", "Num Bedrooms", "Num Cars", "Num Children", "Num TVs",
    "Education Level Map", "PPV Freq Map", "Theater Freq Map", "TV Movie Freq Map",
    "Prerec Buying Freq Map", "Prerec Renting Freq Map", "Prerec Viewing Freq Map",
    "Gender Male",
    "Internet Connection DSL", "Internet Connection Dial-Up",
    "Internet Connection IDSN", "Internet Connection No Internet Connection",
    "Internet Connection Other",
    "Marital Status Married", "Marital Status Never Married",
    "Marital Status Other", "Marital Status Separated",
    "Movie Selector Me", "Movie Selector Other", "Movie Selector Spouse/Partner",
    "Prerec Format DVD", "Prerec Format Laserdisk", "Prerec Format Other",
    "Prerec Format VHS", "Prerec Format Video CD",
    "TV Signal Analog antennae", "TV Signal Cable",
    "TV Signal Digital Satellite", "TV Signal Don't watch TV"
]

inputData = dummiesData[dummiesSelect]   # independent variables

# Dependent variable, taken as a 1-D Series, which is the shape scikit-learn expects for the target
outputData = dummiesData["Home Ownership Rent"]

4. Build, train, and score the model

# Build and train the model
from sklearn import linear_model

lrModel = linear_model.LogisticRegression()

lrModel.fit(inputData, outputData)

# Mean accuracy, measured on the same data the model was trained on
lrModel.score(inputData, outputData)
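A score computed on the training data tends to be optimistic. A common alternative, which is not part of the original walkthrough, is to hold out part of the data and score on that. The sketch below uses synthetic data from `make_classification` as a stand-in for `inputData` and `outputData`; `max_iter=1000` is an assumption to give the solver room to converge.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for inputData / outputData
X, y = make_classification(n_samples=500, n_features=8, random_state=0)

# Keep 25% of the rows aside for evaluation
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

train_accuracy = model.score(X_train, y_train)   # accuracy on data the model has seen
test_accuracy = model.score(X_test, y_test)      # accuracy on held-out data

# Class probabilities from the Sigmoid: one column per class, each row sums to 1
probabilities = model.predict_proba(X_test)
```

`predict_proba` exposes the probabilities behind each prediction, which is useful when a threshold other than 0.5 is needed.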

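5. Predict (the logistic regression was fitted on dummy-encoded variables, so new data must go through the same processing before it can be used for prediction)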

# The model was fitted on dummy-encoded variables, so new data must be processed the same way before predicting
newData = read_csv(
    "C:\\Users\\Jw\\Desktop\\python_work\\Python數據挖掘實戰課程課件\\4.4\\newData.csv",
    encoding="utf-8")

# Reuse the categories seen in the training data so the dummy columns line up.
# astype("category", categories=...) was removed from pandas; CategoricalDtype replaces it.
for column in dummyColumns:
    newData[column] = newData[column].astype(
        pandas.CategoricalDtype(categories=data[column].cat.categories))

newData = newData.dropna()

newData["Education Level Map"] = newData["Education Level"].map(educationLevelDict)
newData["PPV Freq Map"] = newData["PPV Freq"].map(freqMap)
newData["Theater Freq Map"] = newData["Theater Freq"].map(freqMap)
newData["TV Movie Freq Map"] = newData["TV Movie Freq"].map(freqMap)
newData["Prerec Buying Freq Map"] = newData["Prerec Buying Freq"].map(freqMap)
newData["Prerec Renting Freq Map"] = newData["Prerec Renting Freq"].map(freqMap)
newData["Prerec Viewing Freq Map"] = newData["Prerec Viewing Freq"].map(freqMap)

dummiesNewData = pandas.get_dummies(
    newData,
    columns=dummyColumns,
    prefix=dummyColumns,
    prefix_sep=" ",
    drop_first=True)

inputNewData = dummiesNewData[dummiesSelect]

# Predict on the new data, not on the training data
lrModel.predict(inputNewData)
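The reason for reusing the training categories is that `get_dummies` only creates columns for the levels it knows about. The toy example below (made-up frames `train` and `new`) shows a new batch that is missing one level yet still ends up with the same columns as the training data.

```python
import pandas

# Made-up frames; "Female" never appears in the new batch
train = pandas.DataFrame({"Gender": ["Male", "Female", "Female"]})
new = pandas.DataFrame({"Gender": ["Male", "Male"]})

train["Gender"] = train["Gender"].astype("category")

# Reuse the training categories so both frames encode to the same columns
gender_type = pandas.CategoricalDtype(categories=train["Gender"].cat.categories)
new["Gender"] = new["Gender"].astype(gender_type)

train_dummies = pandas.get_dummies(
    train, columns=["Gender"], prefix_sep=" ", drop_first=True)
new_dummies = pandas.get_dummies(
    new, columns=["Gender"], prefix_sep=" ", drop_first=True)

same_columns = list(train_dummies.columns) == list(new_dummies.columns)
```

Without the shared `CategoricalDtype`, the new frame would have "Male" as its only level, `drop_first=True` would drop it, and the "Gender Male" column the model expects would be missing.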
