Attention Is All You Need
The attention mechanism takes the whole sentence as input and extracts useful information from it.
Every output is computed with respect to the entire sentence: its value is a weighted sum over the word vectors of the input sentence.
“This is what attention does, it extracts information from the whole sequence, a weighted sum of all the past encoder states”
https://towardsdatascience.com/attention-is-all-you-need-discovering-the-transformer-paper-73e5ff5e0634
https://jalammar.github.io/visualizing-neural-machine-translation-mechanics-of-seq2seq-models-with-attention/
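A minimal NumPy sketch of that idea, assuming simple dot-product scoring between a decoder state and the encoder states (function and variable names here are illustrative, not from the paper):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def attend(decoder_state, encoder_states):
    # decoder_state: (d,), encoder_states: (t, d) -> context vector (d,)
    scores = encoder_states @ decoder_state    # one score per encoder state, shape (t,)
    weights = softmax(scores)                  # attention weights, sum to 1
    return weights @ encoder_states            # weighted sum of all encoder states

encoder_states = np.random.randn(5, 8)   # 5 source positions, hidden size 8
decoder_state = np.random.randn(8)
context = attend(decoder_state, encoder_states)   # shape (8,)
```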
self-attention:
Self-attention is a sequence-to-sequence operation: a sequence of vectors goes in, and a sequence of vectors comes out. Let's call the input vectors x1, x2, …, xt and the corresponding output vectors y1, y2, …, yt. The vectors all have dimension k. To produce output vector yi, the self-attention operation simply takes a weighted average over all the input vectors.
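A small sketch of this basic, parameter-free self-attention, assuming the weights are the row-wise softmax of the dot products x_i · x_j:

```python
import numpy as np

def basic_self_attention(X):
    # X: (t, k) input sequence -> (t, k) output sequence
    scores = X @ X.T                                         # raw weights w'_ij = x_i . x_j
    scores = scores - scores.max(axis=1, keepdims=True)      # subtract row max for stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=1, keepdims=True)   # row-wise softmax
    return weights @ X                                       # y_i = sum_j w_ij x_j

X = np.random.randn(4, 6)          # t = 4 vectors of dimension k = 6
Y = basic_self_attention(X)        # same shape as X
```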
Q, K, V:
Every input vector is used in three different ways in the self-attention mechanism: as the Query, the Key and the Value. In each role it is compared to the other vectors: to establish the weights for its own output yi (Query), to establish the weights for the j-th output yj (Key), and to be summed into each output vector once the weights have been established (Value).
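A hedged sketch of how the three roles might look in code, assuming learned projection matrices W_q, W_k, W_v and scaled dot-product scoring (all names and dimensions here are illustrative):

```python
import numpy as np

def softmax_rows(S):
    e = np.exp(S - S.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def qkv_self_attention(X, W_q, W_k, W_v):
    # X: (t, k) -> (t, k)
    Q, K, V = X @ W_q, X @ W_k, X @ W_v              # project each input into its three roles
    scores = Q @ K.T / np.sqrt(K.shape[1])           # query-key comparisons set the weights
    weights = softmax_rows(scores)                    # one weight per (output i, input j) pair
    return weights @ V                                # each output is a weighted sum of values

k = 6
X = np.random.randn(4, k)                             # t = 4 input vectors of dimension k
W_q, W_k, W_v = (np.random.randn(k, k) for _ in range(3))   # illustrative random projections
Y = qkv_self_attention(X, W_q, W_k, W_v)
```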