Predicting with Decision Trees in Python: A Classification Algorithm

Preparation:
Install pandas

pip3 install pandas

Loading and cleaning the data

import os
import numpy as np
import pandas as pd
home_folder = os.path.expanduser("~")  # os.path.expanduser(path) expands "~" and "~user" in path to the user's home directory
data_folder = os.path.join(home_folder, "Data", "basketball")
data_filename = os.path.join(data_folder, "leagues_NBA_2014_games_games.csv")  # path to the data file
results = pd.read_csv(data_filename)  # load the dataset
results.iloc[:5]  # show the first 5 rows (.ix was removed in pandas 1.0; use .iloc for positions)


# results = pd.read_csv(data_filename, parse_dates=["Date"], skiprows=[0,])
# parse_dates parses the "Date" column as dates; skiprows is a list of row numbers to skip (counted from 0)
results = pd.read_csv(data_filename, parse_dates=["Date"])  # the first line of this file is not empty, so nothing needs to be skipped
# Rename the columns
results.columns = ["Date", "Start", "Visitor Team", "VisitorPts", "Home Team", "HomePts", "Score Type", "OT?", "Notes"]
results.iloc[:5]
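To see what parse_dates and skiprows do without the real file, here is a minimal sketch on an in-memory CSV. The junk first line and the two games are made up for illustration and do not come from the NBA dataset.

```python
import io
import pandas as pd

# Hypothetical sample: a junk first line, then a header and two made-up games.
raw = (
    "exported by some tool\n"
    "Date,Visitor,VisitorPts,Home,HomePts\n"
    "2013-10-29,Team A,87,Team B,97\n"
    "2013-10-30,Team C,95,Team D,107\n"
)

# skiprows=[0] drops the first physical line; parse_dates converts "Date" to datetime64.
sample = pd.read_csv(io.StringIO(raw), skiprows=[0], parse_dates=["Date"])

# Assigning to .columns renames every column at once; the list length must match.
sample.columns = ["Date", "Visitor Team", "VisitorPts", "Home Team", "HomePts"]

print(sample.dtypes)
print(sample.iloc[:1])  # position-based slice: exactly the first row
```

If the file has no junk line, as in the dataset used here, skiprows can simply be left out.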


Extracting new features

results["HomeWin"] = results["VisitorPts"] < results["HomePts"]  # True where the home team won
y_true = results["HomeWin"].values  # class labels as a NumPy array, ready for scikit-learn
# count() returns the number of rows
print("Home Win percentage: {0:.1f}%".format(100 * results["HomeWin"].sum() / results["HomeWin"].count()))

results["HomeLastWin"] = False  # new feature: did the home team win its previous game?
results["VisitorLastWin"] = False  # new feature: did the visiting team win its previous game?

from collections import defaultdict
won_last = defaultdict(bool)  # team -> result of its last game; False for teams not seen yet

# sort_values orders the games by date; iterrows yields each row's index and contents
for index, row in results.sort_values("Date").iterrows():
    home_team = row["Home Team"]
    visitor_team = row["Visitor Team"]
    # Update the current row in place
    results.loc[index, "HomeLastWin"] = won_last[home_team]
    results.loc[index, "VisitorLastWin"] = won_last[visitor_team]
    # Record the current result for each team's next game
    won_last[home_team] = bool(row["HomeWin"])
    won_last[visitor_team] = not row["HomeWin"]

results.loc[20:25]  # rows labelled 20 through 25
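The loop above is easier to check on a handful of games. The sketch below repeats the same bookkeeping in plain Python on four made-up games (team names and scores are hypothetical), so the two features can be verified by hand.

```python
from collections import defaultdict

# Hypothetical games, already in date order: (visitor, visitor_pts, home, home_pts)
games = [
    ("A", 90, "B", 100),  # B wins at home
    ("B", 95, "C", 85),   # B wins away
    ("C", 80, "A", 70),   # C wins away
    ("A", 99, "B", 98),   # A wins away
]

won_last = defaultdict(bool)  # teams not seen yet default to False
features = []                 # one (HomeLastWin, VisitorLastWin) pair per game

for visitor, visitor_pts, home, home_pts in games:
    # Read the features before updating, so they only depend on earlier games.
    features.append((won_last[home], won_last[visitor]))
    home_win = visitor_pts < home_pts
    won_last[home] = home_win
    won_last[visitor] = not home_win

for game, feat in zip(games, features):
    print(game, feat)
```

The order of operations matters: reading won_last after the update would leak the current game's result into its own features.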


Decision trees
By default, scikit-learn builds decision trees with the Classification and Regression Trees (CART) algorithm.
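For classification, CART grows the tree by choosing at each node the split that most reduces an impurity measure; scikit-learn's default criterion is Gini impurity. The sketch below illustrates that single step in plain Python. It is a toy version, not scikit-learn's implementation, and the function names gini and split_impurity as well as the sample data are made up for illustration.

```python
def gini(labels):
    """Gini impurity of a list of boolean class labels."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)  # fraction of True labels
    return 1.0 - p * p - (1.0 - p) * (1.0 - p)


def split_impurity(feature, labels):
    """Weighted Gini impurity after splitting on a boolean feature."""
    left = [y for x, y in zip(feature, labels) if not x]
    right = [y for x, y in zip(feature, labels) if x]
    n = len(labels)
    return len(left) / n * gini(left) + len(right) / n * gini(right)


# Hypothetical data: did the home team win its last game, and did it win this one?
home_last_win = [True, True, True, False, False, False, True, False]
home_win = [True, True, False, False, False, True, True, False]

print("impurity before split:", gini(home_win))
print("impurity after split: ", split_impurity(home_last_win, home_win))
```

A split is worth making when the weighted impurity of the two child nodes is lower than the impurity of the parent, as it is for this made-up sample.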