
Machine Learning Tasks on Graphs


A graph is an interesting type of data. One might think that we can make predictions and train models on it in the same way as with “normal” data. Surprisingly, machine learning tasks are defined quite differently on graphs, and we can categorize them into 4 types: node classification, link prediction, learning over the whole graph, and community detection. In this article, we look closer at how they are defined and why they differ so much from standard machine learning tasks.


Node Classification

Let’s imagine that we have a network of friendships between animals. An animal can be either a dog, a cat, or a duck, and we can describe it with additional features such as weight, height, and colour. A connection between two specific animals means that they like each other. Our goal, given this friendship network and the features, will be to predict the missing types of the animals. This prediction task is known as node classification.

[Image: the animal friendship network. Icons by Icon8]

Let’s formalize the notation. Mathematically, we can define this animal friendship network as G = (V, E), where V is the set of nodes (animals) and E is the set of edges (friendship connections). Each node Vi also has a corresponding feature vector xi (weight, height, and colour), where the index i indicates that the vector belongs to node Vi. The goal of the node classification task is to predict the label yi of node Vi given the node and its neighbours.
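To make this concrete, here is a minimal sketch of how such a graph with node features and partially known labels could be represented in code (using networkx; the node names, feature values, and edges are made up for illustration):

```python
import networkx as nx

# Build the friendship graph G = (V, E); each node carries a feature vector
# x_i = (weight, height, colour) and, for some nodes, a known label y_i.
G = nx.Graph()
G.add_node("v1", x=(4.2, 0.25, "grey"), y="cat")
G.add_node("v2", x=(9.8, 0.55, "brown"), y="dog")
G.add_node("v3", x=(1.1, 0.30, "white"), y=None)   # label to be predicted
G.add_edges_from([("v1", "v2"), ("v2", "v3")])      # friendship connections

# Node classification: predict y for every node whose label is unknown,
# using its own features and the features/labels of its neighbours.
unlabelled = [v for v, data in G.nodes(data=True) if data["y"] is None]
print(unlabelled)  # ['v3']
```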


Now, how can we use machine learning models to predict these animal types? Most machine learning models rely on the assumption that data points are independent of each other (the i.i.d. assumption). Here this assumption fails, because a node’s label (animal type) may depend on the neighbouring nodes [1]. For example, a node that sits close to a cluster of cats is more likely to be a cat itself. Similar nodes tend to be close together, a property known as homophily.
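As a toy illustration of how homophily can be exploited (this is not a method from the article, just a hypothetical baseline), one could guess an unknown node’s type from the majority label among its labelled neighbours:

```python
import networkx as nx
from collections import Counter

# Made-up graph: node "a" is unlabelled, its neighbours are mostly cats.
G = nx.Graph([("a", "b"), ("a", "c"), ("a", "d")])
labels = {"b": "cat", "c": "cat", "d": "dog"}  # known animal types

def majority_vote(G, node, labels):
    """Predict a node's type as the most common type among its labelled neighbours."""
    neighbour_labels = [labels[n] for n in G.neighbors(node) if n in labels]
    if not neighbour_labels:
        return None
    return Counter(neighbour_labels).most_common(1)[0][0]

print(majority_vote(G, "a", labels))  # 'cat' -- the node sits closer to the cat cluster
```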


Because the data independence assumption no longer holds in this case, we cannot treat this task as plain supervised learning. Researchers often refer to it as semi-supervised learning, because we can use information from neighbouring nodes to predict the label of a given node [1].


Link Prediction

The objective of link prediction is to decide whether there is a connection between two nodes. In our earlier animal friendship graph, it is simply a prediction of whether two animals are friends.


[Image: Icons by Icon8]

Similarly to node classification, we can leverage neighbourhood information to predict the link between two nodes. A popular group of approaches for link prediction is called heuristic methods [2]. They compute certain scores from the graph itself and convert them into the likelihood of a link between two nodes. Heuristic methods can be grouped by the maximum number of neighbourhood hops they require [2]. For example, Common Neighbours is a first-order heuristic, as it requires only the direct neighbourhood of a node to compute the score (and not the neighbours of the neighbours). In the image below, we can see the 1st-order neighbourhoods of nodes V1 and V3.


[Image: nodes V1 and V3 have two common neighbours: V6 and V2. Icons by Icon8]

Nodes V1 and V3 have two friends (neighbouring nodes) in common: V6 and V2. With this simple score, the Common Neighbours algorithm decides whether there is a link between the two nodes or not.
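A minimal sketch of this first-order heuristic in code (the edge list below is assumed to reproduce the toy graph from the figure):

```python
import networkx as nx

# Assumed toy graph: V1 and V3 share neighbours V2 and V6; V5 links to V4 and V6.
G = nx.Graph([("V1", "V2"), ("V1", "V6"),
              ("V3", "V2"), ("V3", "V6"),
              ("V4", "V5"), ("V5", "V6")])

# Common Neighbours score: the more shared friends, the more likely a link.
cn_score = len(list(nx.common_neighbors(G, "V1", "V3")))
print(cn_score)  # 2  (V2 and V6)
```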


Of course, there are more complex approaches, for example Resource Allocation (a 2nd-order heuristic), which uses information from 2nd-order hops [2]. In the case of node V5, RA would use V4 and V6 from its 1st-order neighbourhood and V1 and V3 from its 2nd-order neighbourhood. Other approaches use even higher-order heuristics to generate graph features for link prediction.
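networkx ships a resource_allocation_index helper, so a short sketch of this score on the same assumed toy graph could look like this:

```python
import networkx as nx

# Same assumed toy graph as above.
G = nx.Graph([("V1", "V2"), ("V1", "V6"),
              ("V3", "V2"), ("V3", "V6"),
              ("V4", "V5"), ("V5", "V6")])

# Resource Allocation sums 1/degree(w) over every common neighbour w, so
# low-degree shared friends count more than highly connected hubs.
for u, v, score in nx.resource_allocation_index(G, [("V1", "V3"), ("V1", "V5")]):
    print(u, v, round(score, 3))
# V1 V3 0.833   (1/deg(V2) + 1/deg(V6) = 1/2 + 1/3)
# V1 V5 0.333   (only V6 in common: 1/3)
```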


Similarly to node classification, link prediction is also considered semi-supervised learning, as we use neighbourhood information to predict the link between two nodes.


Learning Over the Whole Graph: Classification, Regression, Clustering

Let’s change the example. Consider that we now have molecular data and our task is to predict whether a given molecule is toxic. The illustration below shows how this classification task can be designed using graph neural networks.


[Image: molecule toxicity classification with graph neural networks [3]. The highlighted red parts show the parts of the molecules that trigger the toxic response; they are not relevant to this article, so please ignore them. If you are interested in learning more about them, have a look at [3].]

We can consider each molecule as a separate graph, where an atom is a node and a link between atoms is an edge. This is an example of a classification task over the whole graph. The difference here is that we are given multiple instances of different graphs and train our model on them [1]. We learn over whole graphs instead of predicting specific components within a single graph, such as a node or an edge.


What is compelling about learning over multiple instances of graphs is that the data points are considered to be i.i.d. This means that these learning tasks on graphs are very similar, if not identical, to the classification, regression, and clustering tasks found in standard machine learning problems. That is great news for us, because we can reuse standard machine learning algorithms such as RandomForest, SVMs, or XGBoost.
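As a rough sketch of this idea (the hand-crafted descriptors and the random toy graphs below are purely illustrative, not a recipe for real molecular data), one could summarise each graph with a fixed-length feature vector and feed it to an off-the-shelf classifier:

```python
import networkx as nx
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def graph_features(G):
    """Summarise a whole graph with a fixed-length feature vector."""
    degrees = [d for _, d in G.degree()]
    return [G.number_of_nodes(), G.number_of_edges(),
            float(np.mean(degrees)), nx.density(G)]

# Toy dataset: each sample is an entire graph with a single label (e.g. toxic or not).
graphs = [nx.erdos_renyi_graph(n=20, p=p, seed=i)
          for i, p in enumerate([0.1, 0.15, 0.4, 0.5])]
labels = [0, 0, 1, 1]   # made-up graph-level labels

X = np.array([graph_features(G) for G in graphs])
clf = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, labels)

new_graph = nx.erdos_renyi_graph(20, 0.45, seed=99)
print(clf.predict([graph_features(new_graph)]))  # likely [1]
```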


Community Detection

Roughly speaking, community detection on graphs can be considered a clustering task over the nodes of a single graph. Let’s look at the example below, which shows a network of scientists who have co-authored at least one paper together.


[Image: a co-authorship network of scientists [5]]

Here, the task of community detection is to identify the clusters of scientists working in different fields. Although it seems intuitive, communities are rather vaguely defined and differ in size and shape across datasets [4]. Communities can also overlap, which makes them even harder to tell apart.


Intuitively, we can expect that nodes inside a community have many edges to neighbours within that community, while a smaller number of nodes with fewer edges connect different communities. These are the theoretical foundations that most community detection algorithms build on. There are many types of community detection algorithms, but the most popular are methods based on spectral clustering, statistical inference, optimization, and dynamics [6].
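As a minimal sketch of the optimization-based flavour, one could run networkx’s greedy modularity maximisation on a synthetic graph with two planted communities (the graph parameters are made up for illustration):

```python
import networkx as nx
from networkx.algorithms.community import greedy_modularity_communities

# Synthetic graph with two planted communities: dense inside, sparse in between.
G = nx.planted_partition_graph(l=2, k=15, p_in=0.6, p_out=0.02, seed=42)

# Greedy modularity maximisation: an optimization-based community detection method.
communities = greedy_modularity_communities(G)
print([sorted(c)[:5] for c in communities])   # roughly the two planted groups of node ids
```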


Summing Up

We’ve seen that there are 4 major types of machine learning tasks on graphs: node classification, link prediction, learning over the whole graph, and community detection. Most of these tasks differ considerably from ordinary supervised/unsupervised learning, because the nodes of a graph are interconnected and the data independence assumption fails. Researchers refer to such tasks as semi-supervised learning.


About Me

I am an MSc Artificial Intelligence student at the University of Amsterdam. In my spare time, you can find me fiddling with data or debugging my deep learning model (I swear it worked!). I also like hiking :)


Here are my other social media profiles, if you want to stay in touch with my latest articles and other useful content:


Translated from: https://towardsdatascience.com/machine-learning-tasks-on-graphs-7bc8f175119a