GraphSAGE Code Walkthrough (Part 4) - models.py
1. The classes and their inheritance relationships
```
            Model
           /     \
          /       \
        MLP    GeneralizedModel
                /          \
               /            \
      Node2VecModel    SampleAndAggregate
```
First, consider how the three classes Model, GeneralizedModel, and SampleAndAggregate relate to each other.
The difference between Model and GeneralizedModel is that Model's build() constructs a sequential layer model, whereas that part is removed in GeneralizedModel; self.outputs must instead be assigned in the build() of GeneralizedModel's subclasses.
The build() function in class Model(object) is as follows:
```python
def build(self):
    """ Wrapper for _build() """
    with tf.variable_scope(self.name):
        self._build()

    # Build sequential layer model
    self.activations.append(self.inputs)
    for layer in self.layers:
        hidden = layer(self.activations[-1])
        self.activations.append(hidden)
    self.outputs = self.activations[-1]
    # This sequential-layer-model part is removed in GeneralizedModel's build()

    # Store model variables for easy access
    variables = tf.get_collection(
        tf.GraphKeys.GLOBAL_VARIABLES, scope=self.name)
    self.vars = {var.name: var for var in variables}

    # Build metrics
    self._loss()
    self._accuracy()

    self.opt_op = self.optimizer.minimize(self.loss)
```
What the sequential layers do: given the input, layer() produces an output, which is then fed as input to the next layer(); the result of the last layer is taken as the outputs.
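As a minimal, standalone illustration of this pattern (toy callables in plain Python, not GraphSAGE's Layer objects):

```python
# Each "layer" consumes the previous activation and produces the next one.
layers = [lambda x: x * 2, lambda x: x + 1, lambda x: x ** 2]

activations = [3]                # activations[0] is the input
for layer in layers:
    activations.append(layer(activations[-1]))
outputs = activations[-1]

print(outputs)                   # ((3 * 2) + 1) ** 2 = 49
```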
2. class SampleAndAggregate(GeneralizedModel)
1. __init__():
(1) Where self.features comes from:
```
param: features           tf.get_variable() -> identity features
      |                             |
 self.features                 self.embeds      --> at least one is not None
       \                           /
        \                         /             --> concat if both are not None
         \                       /
               self.features
```
(2) self.dims:
self.dims is a list in which each entry records the dimensionality of one layer of the network.
self.dims[0] equals the number of columns of self.features: (0 if features is None else features.shape[1]) + identity_dim. (Note: features inside the parentheses is the parameter passed in, not self.features.)
Each subsequent entry is the corresponding layer's output_dim, i.e. its number of hidden units.
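For concreteness, a small sketch with assumed numbers (50 feature columns, identity_dim = 16, and two layers of 128 hidden units each; all values hypothetical):

```python
# self.dims[0] = feature columns + identity_dim; later entries = output_dim per layer.
feature_cols = 50
identity_dim = 16
layer_output_dims = [128, 128]   # layer_infos[i].output_dim

dims = [feature_cols + identity_dim]
dims.extend(layer_output_dims)
print(dims)                      # [66, 128, 128]
```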
(3) Code of the __init__() function
```python
def __init__(self, placeholders, features, adj, degrees,
             layer_infos, concat=True, aggregator_type="mean",
             model_size="small", identity_dim=0,
             **kwargs):
    '''
    Args:
        - placeholders: Stanford TensorFlow placeholder object.
        - features: Numpy array with node features.
                    NOTE: Pass a None object to train in featureless mode (identity features for nodes)!
        - adj: Numpy array with adjacency lists (padded with random re-samples)
        - degrees: Numpy array with node degrees.
        - layer_infos: List of SAGEInfo namedtuples that describe the parameters of all
            the recursive layers. See SAGEInfo definition above.
        - concat: whether to concatenate during recursive iterations
        - aggregator_type: how to aggregate neighbor information
        - model_size: one of "small" and "big"
        - identity_dim: Set to positive int to use identity features (slow and cannot generalize, but better accuracy)
    '''
    super(SampleAndAggregate, self).__init__(**kwargs)
    if aggregator_type == "mean":
        self.aggregator_cls = MeanAggregator
    elif aggregator_type == "seq":
        self.aggregator_cls = SeqAggregator
    elif aggregator_type == "maxpool":
        self.aggregator_cls = MaxPoolingAggregator
    elif aggregator_type == "meanpool":
        self.aggregator_cls = MeanPoolingAggregator
    elif aggregator_type == "gcn":
        self.aggregator_cls = GCNAggregator
    else:
        raise Exception("Unknown aggregator: ", self.aggregator_cls)

    # get info from placeholders...
    self.inputs1 = placeholders["batch1"]
    self.inputs2 = placeholders["batch2"]
    self.model_size = model_size
    self.adj_info = adj
    if identity_dim > 0:
        self.embeds = tf.get_variable(
            "node_embeddings", [adj.get_shape().as_list()[0], identity_dim])
        # self.embeds: identity feature embeddings for the nodes
        # embedding dimension = identity_dim
        # number of rows = number of nodes = adj.get_shape().as_list()[0]
    else:
        self.embeds = None
    if features is None:
        if identity_dim == 0:
            raise Exception(
                "Must have a positive value for identity feature dimension if no input features given.")
        self.features = self.embeds
    else:
        self.features = tf.Variable(tf.constant(
            features, dtype=tf.float32), trainable=False)
        if not self.embeds is None:
            self.features = tf.concat([self.embeds, self.features], axis=1)
    self.degrees = degrees
    self.concat = concat

    self.dims = [
        (0 if features is None else features.shape[1]) + identity_dim]
    self.dims.extend(
        [layer_infos[i].output_dim for i in range(len(layer_infos))])
    self.batch_size = placeholders["batch_size"]
    self.placeholders = placeholders
    self.layer_infos = layer_infos

    self.optimizer = tf.train.AdamOptimizer(
        learning_rate=FLAGS.learning_rate)

    self.build()
```
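For context, layer_infos is a list of SAGEInfo namedtuples. A self-contained sketch of its shape (field names as in models.py; the sample counts 25/10 and output dims of 128 mirror the defaults in unsupervised_train.py, and the sampler here is a stand-in, not a real UniformNeighborSampler):

```python
from collections import namedtuple

# SAGEInfo as defined in models.py.
SAGEInfo = namedtuple("SAGEInfo",
                      ["layer_name", "neigh_sampler", "num_samples", "output_dim"])

dummy_sampler = object()  # stand-in; the real code passes UniformNeighborSampler(adj_info)

layer_infos = [SAGEInfo("node", dummy_sampler, 25, 128),
               SAGEInfo("node", dummy_sampler, 10, 128)]

print([li.num_samples for li in layer_infos])  # [25, 10]
print([li.output_dim for li in layer_infos])   # [128, 128]
```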
2. sample(inputs, layer_infos, batch_size=None)
For a description of the sampling algorithm, see Appendix A, Algorithm 2 of the paper.
Code:
```python
def sample(self, inputs, layer_infos, batch_size=None):
    """ Sample neighbors to be the supportive fields for multi-layer convolutions.

    Args:
        inputs: batch inputs
        batch_size: the number of inputs (different for batch inputs and negative samples).
    """

    if batch_size is None:
        batch_size = self.batch_size
    samples = [inputs]
    # size of convolution support at each layer per node
    support_size = 1
    support_sizes = [support_size]

    for k in range(len(layer_infos)):
        t = len(layer_infos) - k - 1
        support_size *= layer_infos[t].num_samples
        sampler = layer_infos[t].neigh_sampler

        node = sampler((samples[k], layer_infos[t].num_samples))
        samples.append(tf.reshape(node, [support_size * batch_size, ]))
        support_sizes.append(support_size)

    return samples, support_sizes
```
sampler = layer_infos[t].neigh_sampler
When the function is called, layer_infos is supplied by the caller; in unsupervised_train.py, neigh_sampler is set to UniformNeighborSampler, which is defined in neigh_samplers.py as class UniformNeighborSampler(Layer).
Its purpose: for the input samples[k] (the nodes obtained by the previous sampling step — in the figure above these are samples[0] in the yellow region, samples[1] in the orange region, and samples[2] in the pink region, where samples[k] is obtained by sampling the neighbors of each node in samples[k-1]), it selects the ids of num_samples neighbor nodes (corresponding to N(u) in the figure). (The return value is adj_lists, i.e. the adjacency matrix truncated to num_samples columns.)
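The core of UniformNeighborSampler is small. A sketch of the idea, paraphrased from its _call() in neigh_samplers.py (adj_info is the padded [num_nodes, max_degree] adjacency tensor):

```python
import tensorflow as tf  # TensorFlow 1.x

def uniform_neighbor_sample(adj_info, ids, num_samples):
    adj_lists = tf.nn.embedding_lookup(adj_info, ids)  # one row of neighbors per id
    # Shuffle each row's columns: transpose, shuffle along rows, transpose back.
    adj_lists = tf.transpose(tf.random_shuffle(tf.transpose(adj_lists)))
    # Keep only the first num_samples columns as the sampled neighbors.
    return tf.slice(adj_lists, [0, 0], [-1, num_samples])
```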
Note the distinction here between support_size and num_samples:

num_samples is the number of neighbor nodes sampled for each node u at the current depth;

support_size is the number of nodes whose information influences the embedding of the current node u: u is influenced by its num_samples direct neighbors at the current layer, each of which is in turn influenced by its own num_samples neighbors at the earlier depth, and so on. Hence support_size is the running product of num_samples over all depths so far, and for batch_size input nodes the total number of support nodes is support_size * batch_size.
Each support_size is then appended to the support_sizes array.

sample() ultimately returns the samples array, holding the nodes sampled at each depth, and the support_sizes array, holding the number of supporting nodes per node at each depth.
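A concrete numeric walkthrough of that loop (assumed values only: two layers with num_samples of 25 and 10, mirroring the repo defaults, and a batch of 512 nodes):

```python
# How support_sizes and the samples tensors grow in sample(),
# traversing layer_infos backwards as the real code does.
num_samples = [25, 10]           # layer_infos[0].num_samples, layer_infos[1].num_samples
batch_size = 512

support_size = 1
support_sizes = [support_size]
tensor_sizes = [batch_size]      # number of node ids in samples[k]
for k in range(len(num_samples)):
    t = len(num_samples) - k - 1     # last layer first, as in sample()
    support_size *= num_samples[t]
    support_sizes.append(support_size)
    tensor_sizes.append(support_size * batch_size)

print(support_sizes)  # [1, 10, 250]
print(tensor_sizes)   # [512, 5120, 128000]
```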
3. def _build(self):
```python
self.neg_samples, _, _ = (tf.nn.fixed_unigram_candidate_sampler(
    true_classes=labels,
    num_true=1,
    num_sampled=FLAGS.neg_sample_size,
    unique=False,
    range_max=len(self.degrees),
    distortion=0.75,
    unigrams=self.degrees.tolist()))
```
(1) tf.nn.fixed_unigram_candidate_sampler:
Sampling follows a user-provided probability distribution. If the classes follow a uniform distribution, use uniform_candidate_sampler; if the classes are words, which are known to be Zipfian-distributed, use log_uniform_candidate_sampler; if the class distribution is known from statistics or another source, use tf.nn.fixed_unigram_candidate_sampler; if the class distribution is genuinely unknown, use tf.nn.learned_unigram_candidate_sampler.

(2) Parameters:

a. num_sampled, unique: the elements of sampled_candidates are drawn without replacement (if unique = True) or with replacement (if unique = False) from the base distribution; unique = True can be read as sampling without replacement, unique = False as sampling with replacement.

b. distortion: the distortion used in word2vec's frequency energy table formulation, f^(3/4) / total(f^(3/4)). In word2vec the energy is counted by word frequency; in GraphSAGE it is counted by node degree, so each entry of unigrams records one node's degree.

c. unigrams: the degree of each node.

(3) Returns:

a. sampled_candidates: a tensor of type int64 and shape [num_sampled]; the sampled classes.

b. true_expected_count: a tensor of type float, same shape as true_classes; the expected counts under the sampling distribution of each of true_classes.

c. sampled_expected_count: a tensor of type float, same shape as sampled_candidates; the expected counts under the sampling distribution of each of sampled_candidates.
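A minimal, self-contained usage sketch (TensorFlow 1.x, with a made-up 5-node graph; in GraphSAGE, unigrams comes from self.degrees.tolist() and num_sampled from FLAGS.neg_sample_size):

```python
import numpy as np
import tensorflow as tf  # TensorFlow 1.x

degrees = np.array([1, 4, 2, 8, 3], dtype=np.int64)  # toy node degrees

# true_classes: one positive node id per row, shape [batch_size, num_true].
labels = tf.constant([[0], [3]], dtype=tf.int64)

neg_samples, true_exp, sampled_exp = tf.nn.fixed_unigram_candidate_sampler(
    true_classes=labels,
    num_true=1,
    num_sampled=10,
    unique=False,                # sample with replacement
    range_max=len(degrees),      # ids are drawn from [0, range_max)
    distortion=0.75,             # degree^0.75, word2vec-style flattening
    unigrams=degrees.tolist())   # per-node sampling weight = node degree

with tf.Session() as sess:
    print(sess.run(neg_samples))  # e.g. [3 1 3 4 0 3 1 2 3 3]; high-degree nodes dominate
```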