
GraphSAGE Code Walkthrough (4) - models.py

1. Classes and their inheritance hierarchy

     Model 
     /   \
    /     \
  MLP   GeneralizedModel
          /  \
         /    \
Node2VecModel  SampleAndAggregate

First, consider how the three classes Model, GeneralizedModel, and SampleAndAggregate relate.

The difference between Model and GeneralizedModel is that Model's build() constructs a sequential layer model, which GeneralizedModel removes; self.outputs must instead be assigned in the build() of a GeneralizedModel subclass.

The build() function in class Model(object) is as follows:

def build(self):
    """ Wrapper for _build() """
    with tf.variable_scope(self.name):
        self._build()

    # Build sequential layer model
    self.activations.append(self.inputs)
    for layer in self.layers:
        hidden = layer(self.activations[-1])
        self.activations.append(hidden)
    self.outputs = self.activations[-1]
    # This sequential-layer-model part is removed in GeneralizedModel's build()

    # Store model variables for easy access
    variables = tf.get_collection(
        tf.GraphKeys.GLOBAL_VARIABLES, scope=self.name)
    self.vars = {var.name: var for var in variables}

    # Build metrics
    self._loss()
    self._accuracy()

    self.opt_op = self.optimizer.minimize(self.loss)

What the sequential layer model does: given the input, layer() returns an output, which is then fed as input to the next layer(); the result of the final layer is taken as the output.
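For contrast, GeneralizedModel's build() keeps only the _build() call and the variable bookkeeping; it looks roughly like this (a paraphrased sketch of models.py in the GraphSAGE repo):

def build(self):
    """ Wrapper for _build(); no sequential layer loop, no loss/accuracy here """
    with tf.variable_scope(self.name):
        self._build()   # subclasses must assign self.outputs inside _build()

    # Store model variables for easy access
    variables = tf.get_collection(
        tf.GraphKeys.GLOBAL_VARIABLES, scope=self.name)
    self.vars = {var.name: var for var in variables}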

2. class SampleAndAggregate(GeneralizedModel)

1. __init__():

(1) Where self.features comes from:

param: features   tf.get_variable()-> identity features
     |                   |
self.features     self.embeds   --> At least one is not None
      \                 /       --> Concat if both are not None 
       \               /
        \             /
         self.features

(2) self.dims:

self.dims is a list; each entry records the dimensionality of one network layer.

self.dims[0] equals the number of columns of self.features, i.e. (0 if features is None else features.shape[1]) + identity_dim. (Note: features in that expression is the constructor argument, not self.features.)

The subsequent entries are each layer's output_dim, i.e. the number of hidden units; see the numeric sketch below.
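A small numeric sketch of how self.dims is filled (the dimensions here are made up for illustration): a 50-column features matrix, identity_dim = 0, and two layers with output_dim = 128.

feature_dim = 50                 # features.shape[1] (hypothetical)
identity_dim = 0
layer_output_dims = [128, 128]   # layer_infos[i].output_dim (hypothetical)

dims = [feature_dim + identity_dim] + layer_output_dims
print(dims)                      # [50, 128, 128]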

(3) The __init__() code

def __init__(self, placeholders, features, adj, degrees,
             layer_infos, concat=True, aggregator_type="mean",
             model_size="small", identity_dim=0,
             **kwargs):
    '''
    Args:
        - placeholders: Stanford TensorFlow placeholder object.
        - features: Numpy array with node features.
                    NOTE: Pass a None object to train in featureless mode (identity features for nodes)!
        - adj: Numpy array with adjacency lists (padded with random re-samples)
        - degrees: Numpy array with node degrees.
        - layer_infos: List of SAGEInfo namedtuples that describe the parameters of all
               the recursive layers. See SAGEInfo definition above.
        - concat: whether to concatenate during recursive iterations
        - aggregator_type: how to aggregate neighbor information
        - model_size: one of "small" and "big"
        - identity_dim: Set to positive int to use identity features (slow and cannot generalize, but better accuracy)
    '''
    super(SampleAndAggregate, self).__init__(**kwargs)
    if aggregator_type == "mean":
        self.aggregator_cls = MeanAggregator
    elif aggregator_type == "seq":
        self.aggregator_cls = SeqAggregator
    elif aggregator_type == "maxpool":
        self.aggregator_cls = MaxPoolingAggregator
    elif aggregator_type == "meanpool":
        self.aggregator_cls = MeanPoolingAggregator
    elif aggregator_type == "gcn":
        self.aggregator_cls = GCNAggregator
    else:
        raise Exception("Unknown aggregator: ", self.aggregator_cls)

    # get info from placeholders...
    self.inputs1 = placeholders["batch1"]
    self.inputs2 = placeholders["batch2"]
    self.model_size = model_size
    self.adj_info = adj
    if identity_dim > 0:
        self.embeds = tf.get_variable(
            "node_embeddings", [adj.get_shape().as_list()[0], identity_dim])
        # self.embeds: trainable identity-feature embeddings for the nodes
        # number of features = identity_dim
        # number of nodes = adj.get_shape().as_list()[0]
    else:
        self.embeds = None
    if features is None:
        if identity_dim == 0:
            raise Exception(
                "Must have a positive value for identity feature dimension if no input features given.")
        self.features = self.embeds
    else:
        self.features = tf.Variable(tf.constant(
            features, dtype=tf.float32), trainable=False)
        if self.embeds is not None:
            self.features = tf.concat([self.embeds, self.features], axis=1)
    self.degrees = degrees
    self.concat = concat

    self.dims = [
        (0 if features is None else features.shape[1]) + identity_dim]
    self.dims.extend(
        [layer_infos[i].output_dim for i in range(len(layer_infos))])
    self.batch_size = placeholders["batch_size"]
    self.placeholders = placeholders
    self.layer_infos = layer_infos

    self.optimizer = tf.train.AdamOptimizer(
        learning_rate=FLAGS.learning_rate)

    self.build()
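For reference, this is roughly how unsupervised_train.py builds the arguments and constructs the model (an abridged sketch; the flag names follow the repo's FLAGS):

sampler = UniformNeighborSampler(adj_info)
layer_infos = [SAGEInfo("node", sampler, FLAGS.samples_1, FLAGS.dim_1),
               SAGEInfo("node", sampler, FLAGS.samples_2, FLAGS.dim_2)]

model = SampleAndAggregate(placeholders,
                           features,
                           adj_info,
                           minibatch.deg,
                           layer_infos=layer_infos,
                           model_size=FLAGS.model_size,
                           identity_dim=FLAGS.identity_dim,
                           logging=True)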

2. sample(inputs, layer_infos, batch_size=None)

For the description of the sampling algorithm, see Appendix A, Algorithm 2 of the paper.

Code:

def sample(self, inputs, layer_infos, batch_size=None):
    """ Sample neighbors to be the supportive fields for multi-layer convolutions.

    Args:
        inputs: batch inputs
        batch_size: the number of inputs (different for batch inputs and negative samples).
    """

    if batch_size is None:
        batch_size = self.batch_size
    samples = [inputs]
    # size of convolution support at each layer per node
    support_size = 1
    support_sizes = [support_size]

    for k in range(len(layer_infos)):
        t = len(layer_infos) - k - 1
        support_size *= layer_infos[t].num_samples
        sampler = layer_infos[t].neigh_sampler

        node = sampler((samples[k], layer_infos[t].num_samples))
        samples.append(tf.reshape(node, [support_size * batch_size, ]))
        support_sizes.append(support_size)

    return samples, support_sizes

sampler = layer_infos[t].neigh_sampler

By the time sample() is called, layer_infos has been filled in. In unsupervised_train.py, neigh_sampler is set to UniformNeighborSampler, which is defined in neigh_samplers.py as class UniformNeighborSampler(Layer).

Its job: given the input samples[k] (the nodes obtained in the previous sampling step; samples[k] is produced by sampling the neighbors of the nodes in samples[k - 1]), select the indices of num_samples neighbor nodes for each node (the N(u) of the paper). The return value is adj_lists, i.e. the adjacency list matrix truncated to num_samples columns.
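For reference, UniformNeighborSampler in neigh_samplers.py looks roughly like this (a paraphrased sketch of the repo code): it shuffles each node's neighbor row and keeps the first num_samples columns.

class UniformNeighborSampler(Layer):
    """ Uniformly samples neighbors; assumes the adjacency lists are padded. """
    def __init__(self, adj_info, **kwargs):
        super(UniformNeighborSampler, self).__init__(**kwargs)
        self.adj_info = adj_info

    def _call(self, inputs):
        ids, num_samples = inputs
        # look up the adjacency-list rows of the requested nodes
        adj_lists = tf.nn.embedding_lookup(self.adj_info, ids)
        # shuffle each row's neighbors, then truncate to num_samples columns
        adj_lists = tf.transpose(tf.random_shuffle(tf.transpose(adj_lists)))
        adj_lists = tf.slice(adj_lists, [0, 0], [-1, num_samples])
        return adj_lists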

Note the distinction between support_size and num_samples:

num_samples is the number of neighbors sampled for each node u at the current depth;

support_size is the number of nodes whose information influences the embedding of the current node u. u is influenced by its num_samples direct neighbors at the current layer, and each of those was in turn influenced by num_samples neighbors at the previous depth, and so on. support_size is therefore the running product of num_samples over all depths so far, and for batch_size input nodes the total number of support nodes is support_size * batch_size.

Each support_size is appended to the support_sizes array.

sample() finally returns the samples array (the nodes sampled at each depth) and the support_sizes array (the number of support nodes per node at each depth).
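A worked example of the loop above (the num_samples values are made up): with two layers whose num_samples are 25 and 10, sampling starts from the deepest layer.

layer_num_samples = [25, 10]   # layer_infos[t].num_samples (hypothetical)
batch_size = 512               # hypothetical

support_size = 1
support_sizes = [support_size]
for k in range(len(layer_num_samples)):
    t = len(layer_num_samples) - k - 1   # iterate from the last layer backwards
    support_size *= layer_num_samples[t]
    support_sizes.append(support_size)

print(support_sizes)           # [1, 10, 250]
# samples[2] then holds support_sizes[2] * batch_size = 250 * 512 node ids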

3. def _build(self): (the negative-sampling snippet)

self.neg_samples, _, _ = (tf.nn.fixed_unigram_candidate_sampler(
    true_classes=labels,   # labels: the batch2 node ids, reshaped to [batch_size, 1]
    num_true=1,
    num_sampled=FLAGS.neg_sample_size,
    unique=False,
    range_max=len(self.degrees),
    distortion=0.75,
    unigrams=self.degrees.tolist()))

(1) tf.nn.fixed_unigram_candidate_sampler:

Samples according to a user-supplied probability distribution.
If the classes are uniformly distributed, use uniform_candidate_sampler;
if the classes are words, which are known to follow a Zipfian distribution, use log_uniform_candidate_sampler;
if the class distribution is known from statistics or some other source, use nn.fixed_unigram_candidate_sampler;
if the class distribution is simply unknown, use tf.nn.learned_unigram_candidate_sampler.

(2) Parameters:
a. num_sampled and unique:

The elements of sampled_candidates are drawn without replacement (if unique = True) or with replacement (if unique = False) from the base distribution.

That is, unique = True amounts to sampling without replacement, and unique = False to sampling with replacement.

b. distortion:

distortion follows the word2vec unigram frequency ("energy") table formulation: f^(3/4) / Σ f^(3/4). In word2vec the energy f is counted by word frequency; in GraphSAGE it is counted by node degree, so each entry of unigrams records one node's degree (see the numeric sketch after this list).

c. unigrams:

The degree of each node.
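A minimal numpy sketch of what distortion = 0.75 does to the degree distribution (the degrees are made up):

import numpy as np

degrees = np.array([1, 2, 4, 8], dtype=np.float64)  # hypothetical node degrees

raw = degrees / degrees.sum()                       # distortion = 1.0
flat = degrees ** 0.75 / (degrees ** 0.75).sum()    # distortion = 0.75

print(raw)    # [0.067 0.133 0.267 0.533]
print(flat)   # [0.097 0.164 0.276 0.463] -- high-degree nodes are toned down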

(3) Returns:
a. sampled_candidates: A tensor of type int64 and shape [num_sampled]. The sampled classes.
b. true_expected_count: A tensor of type float. Same shape as true_classes. The expected counts under the sampling distribution of each of true_classes.
c. sampled_expected_count: A tensor of type float. Same shape as sampled_candidates. The expected counts under the sampling distribution of each of sampled_candidates.
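A self-contained usage sketch mirroring the call in _build() (TF 1.x, as used by the GraphSAGE repo; the degrees and labels here are made up):

import numpy as np
import tensorflow as tf

degrees = np.array([3, 1, 4, 2, 5], dtype=np.int64)  # hypothetical node degrees
labels = tf.constant([[2], [0]], dtype=tf.int64)     # shape [batch_size, num_true]

neg_samples, true_expected, sampled_expected = tf.nn.fixed_unigram_candidate_sampler(
    true_classes=labels,
    num_true=1,
    num_sampled=8,
    unique=False,
    range_max=len(degrees),
    distortion=0.75,
    unigrams=degrees.tolist())

with tf.Session() as sess:
    print(sess.run(neg_samples))  # 8 node ids; high-degree nodes sampled more often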