PCA and Content-Based Modelling for Champion Recommendation (League of Legends)

Note: All the code for the below can be found here.

Previously I wrote an article on how we can use graph networks to help provide Champion recommendations in the game League of Legends (LoL). The technique is known as “User-user collaborative filtering”, where we utilise the information we know about a person to find similar users and then base our recommendation on what we know they like.

To help illustrate this, we’ll use the classic Amazon example. Imagine that you have added a PS4 and the latest FIFA game to your Amazon basket. The algorithm looks at all users who have previously bought a PS4 and FIFA together and finds which other items they tend to have in their basket, e.g. the latest NFL game, Madden, which is then recommended to you.

Today, we’re looking at a different form of recommendation algorithm known as a “Content Based Model”. This technique instead connects items based on their similarities, e.g. if you’re buying a PS4 sports game produced by EA, then here are some other PS4 sports games produced by EA. It is particularly useful when you have no information about user preferences, such as when a product has just launched.

However, there are almost 150 LoL Champions and we don’t want to spend all our time labeling them with all the various attributes we would need to make this work. So instead, what we are going to do is “describe” the Champions using their in-game statistics, such as their average kills per game or how much objective damage they do.

To do this, we can analyse 150,000 Diamond games. Note that I’ve limited this to Top, Middle and ADC players only, given how inherently different the statistics of supports and junglers are (e.g. low gold from minions).
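The filtering and per-Champion averaging could look something like the sketch below. It is only an illustration: the column names (`champion`, `lane`, `kills`, `goldEarned`), the lane labels and the handful of rows are all made up here, and the real dataset would contain far more statistics.

```python
import pandas as pd

# Hypothetical per-player match rows; real data would cover ~150,000 games.
matches = pd.DataFrame({
    "champion": ["Fiora", "Fiora", "Zilean", "Zilean", "Taric"],
    "lane": ["TOP", "TOP", "MIDDLE", "SUPPORT", "SUPPORT"],
    "kills": [8, 6, 2, 0, 1],
    "goldEarned": [14000, 12000, 9000, 6000, 5500],
})

# Keep only the farming lanes (these lane labels are an assumption).
farming = matches[matches["lane"].isin(["TOP", "MIDDLE", "ADC"])]

# One row per Champion: the mean of every numeric statistic.
champion_stats = farming.groupby("champion").mean(numeric_only=True)
print(champion_stats)
```

Because the lane filter is applied before averaging, a Champion such as Zilean is described only by the games in which it was played in a farming lane.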

After averaging the data for all Champions, the first thing to note is that many of the statistics are strongly correlated with one another. It shouldn’t be a surprise that attributes such as “killingSprees” and “kills” are almost perfectly correlated (the former counts how many times a player has been on a killing spree, the latter how many kills they got in total that game).

Graph illustrating the multicollinearity issue that occurs with such a large number of attributes.
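A rough way to surface this kind of multicollinearity is to compute the pairwise correlation matrix and list the pairs above a threshold. The sketch below uses synthetic numbers (not the article's data) in which `killingSprees` is deliberately built to track `kills`; the 0.9 threshold and the `visionScore` column are arbitrary choices for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic stand-in for per-Champion averages (values are made up).
kills = rng.uniform(3, 9, size=150)
stats = pd.DataFrame({
    "kills": kills,
    # Constructed to follow kills closely, mimicking the observation above.
    "killingSprees": 0.25 * kills + rng.normal(0, 0.05, size=150),
    "visionScore": rng.uniform(10, 30, size=150),
})

corr = stats.corr()  # Pearson correlation between every pair of columns
print(corr.round(2))

# Collect attribute pairs whose absolute correlation exceeds the threshold.
threshold = 0.9
high_pairs = [
    (a, b)
    for i, a in enumerate(corr.columns)
    for b in corr.columns[i + 1:]
    if abs(corr.loc[a, b]) > threshold
]
print(high_pairs)
```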

A common approach to dealing with this level of multicollinearity is either exclusion (keep kills, drop killingSprees) or aggregation (kills * killingSprees). However, there is a better solution known as Principal Component Analysis (PCA), which is able to extract the core relationships between these attributes without manual intervention or the removal of potential key drivers.

PCA is a fairly complex subject that requires an understanding of eigenvectors and eigenvalues, and there are plenty of great articles on it, so I won’t labour the subject here. Instead, I will say that what PCA is trying to do is capture as much of the variance in the data as possible whilst minimising the number of variables used.

The percentage of variance each component explains of the original data, summing to 100%.

After fitting PCA to the dataset, we find that well over 30% of the variance in the data is captured by a single component, just over 16% by the second component, 11% or so by the third, and so on.
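For readers who want to reproduce this kind of breakdown, here is a minimal sketch using scikit-learn's `PCA` and its `explained_variance_ratio_` attribute. The data is synthetic, so the percentages printed will not match the figures quoted above, and standardising the columns first is my assumption rather than something the article states.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)

# Synthetic stand-in: 150 "Champions" x 8 statistics (not real data).
latent = rng.normal(size=(150, 3))
mixing = rng.normal(size=(3, 8))
X = latent @ mixing + 0.3 * rng.normal(size=(150, 8))

# Standardising is an assumption: without it, large-valued statistics
# such as gold would dominate the variance.
X_scaled = StandardScaler().fit_transform(X)

pca = PCA()  # no n_components given, so every component is kept
pca.fit(X_scaled)

# Share of the total variance explained by each component, largest first.
for i, ratio in enumerate(pca.explained_variance_ratio_, start=1):
    print(f"Component {i}: {ratio:.1%} of variance")
```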

But what are these components? To help understand what they are made of and where they have come from, take a look at the graph below illustrating which variables are part of the first component. It’s clear that goldEarned is the largest contributor to this component, alongside objective damage, the largest multi-kill achieved, the number of killing sprees, damage dealt and total kills. It’s safe to say that this component is capturing the variables related to stomping your lane. If we add the fact that “physical” damage is specified, you can almost see the Fiora/Riven/Trynd one-tricks appearing in front of your eyes.

Graph illustrating which of the original variables are most highly correlated with the first component.
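One way to inspect what a component is made of is to look at its weights on the original variables, which scikit-learn exposes as the rows of `components_`. The sketch below again uses synthetic data in which three statistics share a single driver; the column names are borrowed from the article purely as labels, and the fourth column (`visionScore`) is invented.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)
names = ["goldEarned", "kills", "killingSprees", "visionScore"]

# Synthetic data: the first three statistics share one underlying driver.
driver = rng.normal(size=150)
X = np.column_stack([
    driver + 0.2 * rng.normal(size=150),
    driver + 0.2 * rng.normal(size=150),
    driver + 0.2 * rng.normal(size=150),
    rng.normal(size=150),  # unrelated statistic
])

pca = PCA(n_components=2).fit(StandardScaler().fit_transform(X))

# Each row of components_ holds one component's weights on the original columns.
loadings = pd.Series(pca.components_[0], index=names)
print(loadings.sort_values(key=abs, ascending=False))
```

Note that the sign of a principal component is arbitrary, so the weights are best compared by absolute size; a component can come out "flipped" without changing what it captures.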

The second component comprises two main attributes: towers taken and damage self-mitigated (blocked, parried, immune, reduced, etc.). However, you may be wondering how this all relates to content based recommendation models! Well, what we now have are two components that contain over 50% of the variance between the Champions. These can be considered proxies for descriptions, where instead of “sports game” we have “Champion who kills everyone” and “produced by EA” becomes “high turret damage”! We can then plot these descriptive components in a 2D space and start to see how it all comes together (warning, big old graph coming at you for visibility):

2D representation of the first two components, which can be used as the base for a recommendation engine. Champions are coloured depending on their main role, but the data is not necessarily gathered from players in that position.

Note: Although “Support” champions are shown here in yellow, the data is actually derived from farming lanes only. I.e. the Zilean data you see above is from when the Champion is played in either Top, Mid or as the APC.

Those of you paying attention will note that component 1 is inverted, so high damage/kills scores low on the X-axis. Component 2 is not inverted, so a high number on the Y-axis indicates lots of turret taking and damage mitigation. To make sure it has worked as expected, take a look at the Champions in the top left (i.e. those that do lots of physical damage, take towers and mitigate damage): Fiora and Tryndamere (Trynd’s ult counts as damage mitigation). How about the bottom centre, where we see Katarina and Karthus, who score relatively high on damage and kills but aren’t smashing turrets or mitigating damage? Sounds right to me.

The next step is simple: the recommendation is the Champion with the shortest Euclidean (straight-line) distance from the Champion the player currently plays. You play a lot of Taric? Try Maokai. Akali? How about Fizz? Unkillable Dr. Mundo? You’ll love our boy Sion.
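The nearest-neighbour lookup can be sketched in a few lines. The coordinates below are invented to loosely echo the positions described in the article (they are not the real component scores), and `recommend` is a hypothetical helper name, not something taken from the author's code.

```python
import numpy as np
import pandas as pd

# Hypothetical 2D component scores; the numbers are invented for illustration.
coords = pd.DataFrame(
    {"pc1": [-3.0, -2.8, 2.5, 2.4, 0.1], "pc2": [2.0, 2.2, -0.5, -0.4, -2.0]},
    index=["Fiora", "Tryndamere", "Taric", "Maokai", "Katarina"],
)

def recommend(champion, coords, k=1):
    """Return the k Champions closest to `champion` by Euclidean distance."""
    diffs = coords - coords.loc[champion]          # offset from the query Champion
    distances = np.sqrt((diffs ** 2).sum(axis=1))  # straight-line distance per row
    distances = distances.drop(champion)           # never recommend the input itself
    return distances.nsmallest(k).index.tolist()

print(recommend("Taric", coords))
print(recommend("Fiora", coords, k=2))
```

Raising `k` returns a short ranked list instead of a single suggestion, which may be more useful in practice than one name.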

If we wanted to expand on this, we’d move to higher dimensions. If you go back to the graph showing how much variance is captured in each component, I’d say there’s an argument to build the model based on 3, maybe even 5 dimensions. The rest works the same, but given the visualisation becomes tricky we’ll leave it there for now!
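As a sketch of how the number of dimensions might be chosen, one option is to keep the smallest number of components whose cumulative explained variance passes a target. The 80% target and the synthetic data below are assumptions for illustration only; the distance calculation itself is unchanged in higher dimensions.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(7)

# Synthetic stand-in: 150 "Champions" x 10 statistics (not real data).
X = rng.normal(size=(150, 4)) @ rng.normal(size=(4, 10))
X = X + 0.2 * rng.normal(size=(150, 10))
X_scaled = StandardScaler().fit_transform(X)

# Cumulative share of variance explained by the first 1, 2, 3, ... components.
cumulative = np.cumsum(PCA().fit(X_scaled).explained_variance_ratio_)

target = 0.80  # hypothetical threshold, not a value from the article
n_components = int(np.searchsorted(cumulative, target) + 1)
print(f"{n_components} components explain {cumulative[n_components - 1]:.1%}")

# Project into the chosen number of dimensions.
scores = PCA(n_components=n_components).fit_transform(X_scaled)

# Euclidean distance works the same way regardless of dimensionality.
distances = np.linalg.norm(scores - scores[0], axis=1)
distances[0] = np.inf  # exclude the query row itself
nearest = int(np.argmin(distances))
print(f"Nearest neighbour of row 0 is row {nearest}")
```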

I hope this provides another insight into potential recommendation types that may be worth exploring and the benefits PCA provides. Although I use League of Legends as my domain, these techniques can easily be applied to any other field. I recommend going back up to the large graph, finding your main and seeing whether you’d agree that the Champions surrounding it have a similar play-style. Let me know in the comments below!

Thanks for getting to the bottom of my article! My name is Jack J. and I’m a professional Data Scientist, writer and founder of the League of Legends analytics site JUNG.GG. You can also find me on my blog LeagueOfData, where I post less Data Science-heavy articles. It’s also the best place to get in contact with me.

Translated from: https://towardsdatascience.com/pca-and-content-based-modelling-for-champion-recommendation-league-of-legends-80e909e56672
