Python Data Analysis and Machine Learning: Analyzing Kobe Bryant's Career Data with Python Libraries

阿新 • Published: 2019-01-03

Source code download: http://download.csdn.net/download/adam_zs/10222492

import matplotlib.pyplot as plt
import pandas as pd
import numpy as np

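# Kobe Bryant career shot data
# Note: the 'display.height' option has been removed from pandas and raises an
# error if set, so only the options that still exist are configured here.
pd.set_option('display.max_rows', 9999)
pd.set_option('display.max_columns', 9999)
pd.set_option('display.width', 9999)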

raw = pd.read_csv("data.csv")
# print(raw.shape) #(30697, 25)
# print(raw.head())
'''
         action_type combined_shot_type  game_event_id   game_id      lat  loc_x  loc_y       lon  minutes_remaining  period  playoffs   season  seconds_remaining  shot_distance  shot_made_flag       shot_type         shot_zone_area  shot_zone_basic  shot_zone_range     team_id           team_name   game_date    matchup opponent  shot_id
0          Jump Shot          Jump Shot             10  20000012  33.9723    167     72 -118.1028                 10       1         0  2000-01                 27             18             NaN  2PT Field Goal          Right Side(R)        Mid-Range        16-24 ft.  1610612747  Los Angeles Lakers  2000-10-31  LAL @ POR      POR        1
1          Jump Shot          Jump Shot             12  20000012  34.0443   -157      0 -118.4268                 10       1         0  2000-01                 22             15             0.0  2PT Field Goal           Left Side(L)        Mid-Range         8-16 ft.  1610612747  Los Angeles Lakers  2000-10-31  LAL @ POR      POR        2
2          Jump Shot          Jump Shot             35  20000012  33.9093   -101    135 -118.3708                  7       1         0  2000-01                 45             16             1.0  2PT Field Goal   Left Side Center(LC)        Mid-Range        16-24 ft.  1610612747  Los Angeles Lakers  2000-10-31  LAL @ POR      POR        3
3          Jump Shot          Jump Shot             43  20000012  33.8693    138    175 -118.1318                  6       1         0  2000-01                 52             22             0.0  2PT Field Goal  Right Side Center(RC)        Mid-Range        16-24 ft.  1610612747  Los Angeles Lakers  2000-10-31  LAL @ POR      POR        4
4  Driving Dunk Shot               Dunk            155  20000012  34.0443      0      0 -118.2698                  6       2         0  2000-01                 19              0             1.0  2PT Field Goal              Center(C)  Restricted Area  Less Than 8 ft.  1610612747  Los Angeles Lakers  2000-10-31  LAL @ POR      POR        5
'''
# shot_made_flag: whether the shot was made (rows with NaN are split off as the test set later)
kobe = raw[pd.notnull(raw["shot_made_flag"])]
# print(kobe.shape)  # (25697, 25)

plt.figure(figsize=(10, 10))

# alpha = 0.02  # transparency
# plt.subplot(121)
# plt.scatter(kobe["loc_x"], kobe["loc_y"], color="r", alpha=alpha)  # court coordinates
# plt.title("loc_x and loc_y")
# plt.subplot(122)
# plt.scatter(kobe["lon"], kobe["lat"], color="b", alpha=alpha)  # longitude and latitude
# plt.title("lon and lat")
# plt.show()

# Polar coordinates: distance from the origin plus the angle to the x-axis
# raw['dist'] = np.sqrt(raw['loc_x'] ** 2 + raw['loc_y'] ** 2)
# loc_x_zero = raw['loc_x'] == 0
# raw['angle'] = np.zeros(len(raw))
# raw.loc[~loc_x_zero, 'angle'] = np.arctan(raw.loc[~loc_x_zero, 'loc_y'] / raw.loc[~loc_x_zero, 'loc_x'])
# raw.loc[loc_x_zero, 'angle'] = np.pi / 2
# raw['remaining_time'] = raw['minutes_remaining'] * 60 + raw['seconds_remaining']

# print(kobe["action_type"].unique())
# print(kobe["combined_shot_type"].unique())
# print(kobe["shot_type"].unique())
# print(kobe["shot_type"].value_counts())

# season: keep only the integer after the hyphen, e.g. '2000-01' becomes 1
# print(kobe['season'].unique())
raw['season'] = raw['season'].apply(lambda x: int(x.split("-")[1]))
# print(raw['season'].unique())
#
# print(kobe['team_id'].unique())
# print(kobe['team_name'].unique())

gs = kobe.groupby("shot_zone_area")
# print(kobe["shot_zone_area"].value_counts())
# print(len(gs))

# print(kobe["shot_zone_area"].unique())
'''
['Left Side(L)' 'Left Side Center(LC)' 'Right Side Center(RC)' 'Center(C)'
 'Right Side(R)' 'Back Court(BC)']
'''
# print(kobe["shot_zone_basic"].unique())
'''
['Mid-Range' 'Restricted Area' 'In The Paint (Non-RA)' 'Above the Break 3'
 'Right Corner 3' 'Backcourt' 'Left Corner 3']
'''
# print(kobe["shot_zone_range"].unique())
'''['8-16 ft.' '16-24 ft.' 'Less Than 8 ft.' '24+ ft.' 'Back Court Shot']'''

import matplotlib.cm as cm

plt.figure(figsize=(20, 10))


def scatter_plot_by_category(feat):
    alpha = 0.1
    gs = kobe.groupby(feat)
    cs = cm.rainbow(np.linspace(0, 1, len(gs)))
    for g, c in zip(gs, cs):
        plt.scatter(g[1]["loc_x"], g[1]["loc_y"], color=c, alpha=alpha)


# # shot_zone_area
# plt.subplot(131)
# scatter_plot_by_category('shot_zone_area')
# plt.title('shot_zone_area')
#
# # shot_zone_basic
# plt.subplot(132)
# scatter_plot_by_category('shot_zone_basic')
# plt.title('shot_zone_basic')
#
# # shot_zone_range
# plt.subplot(133)
# scatter_plot_by_category('shot_zone_range')
# plt.title('shot_zone_range')
#
# plt.show()

drops = ['shot_id', 'team_id', 'team_name', 'shot_zone_area', 'shot_zone_range', 'shot_zone_basic',
         'matchup', 'lon', 'lat', 'seconds_remaining', 'minutes_remaining',
         'shot_distance', 'loc_x', 'loc_y', 'game_event_id', 'game_id', 'game_date']
for drop in drops:
    raw.drop(drop, axis=1, inplace=True)
# print(raw.head())
'''
         action_type combined_shot_type  period  playoffs  season  shot_made_flag       shot_type opponent
0          Jump Shot          Jump Shot       1         0       1             NaN  2PT Field Goal      POR
1          Jump Shot          Jump Shot       1         0       1             0.0  2PT Field Goal      POR
2          Jump Shot          Jump Shot       1         0       1             1.0  2PT Field Goal      POR
3          Jump Shot          Jump Shot       1         0       1             0.0  2PT Field Goal      POR
4  Driving Dunk Shot               Dunk       2         0       1             1.0  2PT Field Goal      POR
'''
# print(raw['combined_shot_type'].value_counts())
# dummies_cs_type = pd.get_dummies(raw['combined_shot_type'], prefix='cs_type')
# raw = pd.concat([raw, dummies_cs_type], axis=1)
# raw = raw.drop("combined_shot_type", axis=1)
# print(raw.head())
categorical_vars = ['action_type', 'combined_shot_type', 'shot_type', 'opponent', 'period', 'season']
for var in categorical_vars:
    raw = pd.concat([raw, pd.get_dummies(raw[var], prefix=var)], axis=1)
    raw = raw.drop(var, axis=1)
print(raw.head())

train_kobe = raw[pd.notnull(raw['shot_made_flag'])]
train_label = train_kobe['shot_made_flag']
train_kobe = train_kobe.drop('shot_made_flag', axis=1)
test_kobe = raw[pd.isnull(raw['shot_made_flag'])]
test_kobe = test_kobe.drop('shot_made_flag', axis=1)

import time

# find the best n_estimators for RandomForestClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import log_loss
# sklearn.cross_validation was removed from scikit-learn; KFold now lives in model_selection
from sklearn.model_selection import KFold

print('Finding best n_estimators for RandomForestClassifier...')
min_score = 100000
best_n = 0
scores_n = []
range_n = np.logspace(0, 2, num=3).astype(int)  # 1, 10, 100
for n in range_n:  # number of trees
    print("the number of trees : {0}".format(n))
    t1 = time.time()

    rfc_score = 0.
    rfc = RandomForestClassifier(n_estimators=n)
    for train_k, test_k in KFold(n_splits=10, shuffle=True).split(train_kobe):
        rfc.fit(train_kobe.iloc[train_k], train_label.iloc[train_k])
        # rfc_score += rfc.score(train_kobe.iloc[test_k], train_label.iloc[test_k]) / 10
        # log loss is defined on predicted probabilities, not on hard 0/1 labels
        pred = rfc.predict_proba(train_kobe.iloc[test_k])
        rfc_score += log_loss(train_label.iloc[test_k], pred, labels=rfc.classes_) / 10
    scores_n.append(rfc_score)
    if rfc_score < min_score:
        min_score = rfc_score
        best_n = n

    t2 = time.time()
    print('Done processing {0} trees ({1:.3f}sec)'.format(n, t2 - t1))
print(best_n, min_score)

# find best max_depth for RandomForestClassifier
print('Finding best max_depth for RandomForestClassifier...')
min_score = 100000
best_m = 0
scores_m = []
range_m = np.logspace(0, 2, num=3).astype(int)  # 1, 10, 100
for m in range_m:  # maximum tree depth
    print("the max depth : {0}".format(m))
    t1 = time.time()

    rfc_score = 0.
    rfc = RandomForestClassifier(max_depth=m, n_estimators=best_n)
    for train_k, test_k in KFold(n_splits=10, shuffle=True).split(train_kobe):
        rfc.fit(train_kobe.iloc[train_k], train_label.iloc[train_k])
        # rfc_score += rfc.score(train_kobe.iloc[test_k], train_label.iloc[test_k]) / 10
        # log loss is defined on predicted probabilities, not on hard 0/1 labels
        pred = rfc.predict_proba(train_kobe.iloc[test_k])
        rfc_score += log_loss(train_label.iloc[test_k], pred, labels=rfc.classes_) / 10
    scores_m.append(rfc_score)
    if rfc_score < min_score:
        min_score = rfc_score
        best_m = m

    t2 = time.time()
    print('Done processing max depth {0} ({1:.3f}sec)'.format(m, t2 - t1))
print(best_m, min_score)
plt.figure(figsize=(10, 5))
plt.subplot(121)
plt.plot(range_n, scores_n)
plt.ylabel('score')
plt.xlabel('number of trees')

plt.subplot(122)
plt.plot(range_m, scores_m)
plt.ylabel('score')
plt.xlabel('max depth')
plt.show()

model = RandomForestClassifier(n_estimators=best_n, max_depth=best_m)
model.fit(train_kobe, train_label)
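
The script ends after fitting the final model and never uses test_kobe. A minimal sketch of one plausible next step follows, assuming the goal is a made-shot probability for each held-out row. The synthetic frame, its column names (f0 to f4) and the helper predict_made_probability are hypothetical stand-ins so the block runs without data.csv; they are not part of the original code.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier


def predict_made_probability(model, features):
    """Return P(label == 1) for each row of `features` (hypothetical helper)."""
    # classes_ gives the label order of the predict_proba columns.
    positive_col = list(model.classes_).index(1)
    return model.predict_proba(features)[:, positive_col]


# Hypothetical stand-in for the one-hot encoded frame built in the script.
rng = np.random.default_rng(0)
demo = pd.DataFrame(rng.integers(0, 2, size=(200, 5)),
                    columns=["f0", "f1", "f2", "f3", "f4"])
labels = pd.Series(rng.integers(0, 2, size=200).astype(float))
labels.iloc[:40] = np.nan  # pretend the first 40 shots are unlabeled

# Same split idea as the script: labeled rows train, NaN rows are the test set.
train_X = demo[labels.notnull()]
train_y = labels[labels.notnull()]
test_X = demo[labels.isnull()]

model = RandomForestClassifier(n_estimators=10, max_depth=5, random_state=0)
model.fit(train_X, train_y)

proba = predict_made_probability(model, test_X)
print(proba[:5])
```

With the real data, the same helper could be called as predict_made_probability(model, test_kobe) after the final fit, since test_kobe has the same columns as train_kobe.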
