"RETHINKING POSITIONAL ENCODING IN LANGUAGE PRE-TRAINING": Reproducing the TUPE Paper

阿新 • Published: 2021-10-28

Reproducing the TUPE paper

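The paper proposes decoupling positional information from word-embedding information, giving each its own weight matrices whose parameters are learned separately. The motivation comes from expanding the original attention score into four terms: the two middle terms (word-to-position and position-to-word) contribute little to recognition, while the first term (word-to-word) and the fourth term (position-to-position) share the same weight matrices. Because positional information and word embeddings play different roles, they should be projected with different weight matrices.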
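
For reference, the four-term expansion can be written out as below. The post does not show the formula, so the symbols are assumptions following the usual notation: $w_i$ is the word embedding at position $i$, $p_i$ its positional embedding, $W^Q$ and $W^K$ the shared query and key projections, and $d$ the per-head dimension.

```latex
\[
\begin{aligned}
\alpha_{ij}
&= \frac{\big((w_i + p_i) W^Q\big)\big((w_j + p_j) W^K\big)^\top}{\sqrt{d}} \\
&= \frac{(w_i W^Q)(w_j W^K)^\top}{\sqrt{d}}
 + \frac{(w_i W^Q)(p_j W^K)^\top}{\sqrt{d}}
 + \frac{(p_i W^Q)(w_j W^K)^\top}{\sqrt{d}}
 + \frac{(p_i W^Q)(p_j W^K)^\top}{\sqrt{d}}
\end{aligned}
\]
```

The four terms are, in order, word-to-word, word-to-position, position-to-word and position-to-position; the first and last both go through the same $W^Q$ and $W^K$.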

Reproducing the paper therefore requires new parameters: a separate set of weight matrices for the positional encodings.
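
In the paper's notation, as far as it can be reconstructed here, the untied attention score looks roughly like this. $x_i$ is the content input at position $i$ (the word embedding at the first layer, with no position added), $p_i$ is the positional embedding, $U^Q$ and $U^K$ are the new position-only projections, and $d$ is the per-head dimension; the symbol names are assumptions, since the post does not show the formula.

```latex
\[
\alpha_{ij}
= \frac{1}{\sqrt{2d}} \,(x_i W^Q)(x_j W^K)^\top
+ \frac{1}{\sqrt{2d}} \,(p_i U^Q)(p_j U^K)^\top
\]
```

The scaling changes from $1/\sqrt{d}$ to $1/\sqrt{2d}$ because two dot-product terms are now summed. In the code that follows, `w_pes1` and `w_pes2` play the role of $U^Q$ and $U^K$.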

Note that the product of the positional encodings with their weight matrices should be computed before entering the attention layers. The paper also only modifies the encoder and leaves the decoder unchanged, so the two now call the multi-head attention module differently: the encoder has to pass in the positional-encoding term, whereas the decoder, as in the standard Transformer, does not pass positional information separately.
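
As a rough illustration of computing the positional term ahead of the attention layers, here is a minimal standalone sketch. It is not the post's code: `PositionalBias` and `sinusoidal_encoding` are hypothetical names, the sinusoidal encoding is assumed, and the $1/\sqrt{2 d_k}$ scaling is applied here on the assumption that it is not applied elsewhere.

```python
import math

import torch
import torch.nn as nn


def sinusoidal_encoding(length: int, d_model: int) -> torch.Tensor:
    """Standard sinusoidal positional encoding, shape (length, d_model)."""
    position = torch.arange(length, dtype=torch.float32).unsqueeze(1)
    div_term = torch.exp(
        torch.arange(0, d_model, 2, dtype=torch.float32)
        * (-math.log(10000.0) / d_model))
    pe = torch.zeros(length, d_model)
    pe[:, 0::2] = torch.sin(position * div_term)
    pe[:, 1::2] = torch.cos(position * div_term)
    return pe


class PositionalBias(nn.Module):
    """Hypothetical helper: per-head position-to-position scores."""

    def __init__(self, d_model: int, n_head: int, d_k: int):
        super().__init__()
        self.n_head, self.d_k = n_head, d_k
        self.u_q = nn.Linear(d_model, n_head * d_k)  # positional query projection
        self.u_k = nn.Linear(d_model, n_head * d_k)  # positional key projection

    def forward(self, pe: torch.Tensor) -> torch.Tensor:
        # pe: (T, d_model) -> q, k: (n_head, T, d_k)
        length = pe.size(0)
        q = self.u_q(pe).view(length, self.n_head, self.d_k).transpose(0, 1)
        k = self.u_k(pe).view(length, self.n_head, self.d_k).transpose(0, 1)
        # (n_head, T, T), scaled by 1/sqrt(2 * d_k)
        return torch.matmul(q, k.transpose(1, 2)) / math.sqrt(2 * self.d_k)


torch.manual_seed(0)
bias_module = PositionalBias(d_model=16, n_head=2, d_k=4)
bias = bias_module(sinusoidal_encoding(length=5, d_model=16))
print(bias.shape)  # torch.Size([2, 5, 5])
```

Because this term depends only on positions and not on the input features, it can be computed once per sequence length and reused by every layer and every item in the batch.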

The code is as follows:

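import torch
import torch.nn as nn  # needed for nn.Module, nn.Linear, nn.LayerNorm, etc. below
import numpy as np
from .attention import MultiHeadAttention   # multi-head attention module
from .module import PositionalEncoding, PositionwiseFeedForward  # positional encoding and feed-forward network
from .utils import get_non_pad_mask, get_attn_pad_mask  # helpers that build the non-padding mask and the attention padding mask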


class Encoder(nn.Module):
    """Encoder of Transformer including self-attention and feed forward.
    """

    def __init__(self, d_input=320, n_layers=6, n_head=8, d_k=64, d_v=64,
                 d_model=512, d_inner=2048, dropout=0.1, pe_maxlen=5000):
        super(Encoder, self).__init__()
        # parameters
        self.d_input = d_input       # input feature dimension
        self.n_layers = n_layers     # number of encoder layers
        self.n_head = n_head         # number of attention heads
        self.d_k = d_k               # per-head key dimension
        self.d_v = d_v               # per-head value dimension
        self.d_model = d_model       # model dimension
        self.d_inner = d_inner       # hidden size of the feed-forward network
        self.dropout_rate = dropout  # dropout rate
        self.pe_maxlen = pe_maxlen   # maximum length of the positional encoding

        # use linear transformation with layer norm to replace input embedding
        self.linear_in = nn.Linear(d_input, d_model)  # projects d_input (default 320) to d_model (default 512)
        self.layer_norm_in = nn.LayerNorm(d_model)    # layer normalization
        self.positional_encoding = PositionalEncoding(d_model, max_len=pe_maxlen)  # positional encoding
        self.dropout = nn.Dropout(dropout)

        # separate weight matrices for the positional encodings
        # (d_v is used as the per-head size here; it equals d_k with the defaults)
        self.w_pes1 = nn.Linear(d_model, n_head * d_v)
        self.w_pes2 = nn.Linear(d_model, n_head * d_v)
        nn.init.normal_(self.w_pes2.weight, mean=0, std=np.sqrt(2.0 / (d_model + d_v)))  # initialize the weights
        nn.init.normal_(self.w_pes1.weight, mean=0, std=np.sqrt(2.0 / (d_model + d_v)))

        self.layer_stack = nn.ModuleList([
            EncoderLayer(d_model, d_inner, n_head, d_k, d_v, dropout=dropout)
            for _ in range(n_layers)])  # stack of n_layers encoder layers
        # nn.ModuleList stores the sub-modules and registers each one's parameters with the network.

    def forward(self, padded_input, input_lengths, return_attns=False):
        """
        Args:
            padded_input: N x T x D
            input_lengths: N
        Returns:
            enc_output: N x T x H
        """
        enc_slf_attn_list = []

        d_k, d_v, n_head = self.d_k, self.d_v, self.n_head
        sz_b, len_v, _ = padded_input.size()

        # Prepare masks
        non_pad_mask = get_non_pad_mask(padded_input, input_lengths=input_lengths)  # mask of the non-padded frames
        length = padded_input.size(1)  # padded sequence length T
        slf_attn_mask = get_attn_pad_mask(padded_input, input_lengths, length)  # attention mask over the padded positions

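        # Forward
        # Input processing before the encoder layers.
        # The baseline version added the positional encoding to the input:
        # enc_output = self.dropout(
        #     self.layer_norm_in(self.linear_in(padded_input)) +
        #     self.positional_encoding(padded_input))
        # Here the input is only projected (d_input -> d_model), normalized and
        # passed through dropout; the positional encoding is kept separate.
        enc_output = self.dropout(self.layer_norm_in(self.linear_in(padded_input)))
        # The views below need pe to be N x T x d_model; if PositionalEncoding
        # returns 1 x T x d_model, it has to be expanded to the batch size first.
        pe = self.positional_encoding(padded_input)

        # project the positional encoding twice and split into heads
        pe1 = self.w_pes1(pe).view(sz_b, len_v, n_head, d_v)
        pe2 = self.w_pes2(pe).view(sz_b, len_v, n_head, d_v)

        # reshape to (n_head * N) x T x d_v
        pe1 = pe1.permute(2, 0, 1, 3).contiguous().view(-1, len_v, d_v)
        pe2 = pe2.permute(2, 0, 1, 3).contiguous().view(-1, len_v, d_v)

        # position-to-position scores, (n_head * N) x T x T, shared by all layers
        pe = torch.bmm(pe1, pe2.transpose(1, 2))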
        
        for enc_layer in self.layer_stack:  # run the encoder layers
            enc_output, enc_slf_attn = enc_layer(
                enc_output, pe,
                non_pad_mask=non_pad_mask,
                slf_attn_mask=slf_attn_mask)  # layer output and its attention weights
            if return_attns:  # attention maps are not collected by default
                enc_slf_attn_list += [enc_slf_attn]

        if return_attns:  # False by default
            return enc_output, enc_slf_attn_list
        return enc_output,  # output of the last encoder layer, as a 1-tuple


class EncoderLayer(nn.Module):
    """Compose with two sub-layers.
        1. A multi-head self-attention mechanism
        2. A simple, position-wise fully connected feed-forward network.
    """

    def __init__(self, d_model, d_inner, n_head, d_k, d_v, dropout=0.1):
        super(EncoderLayer, self).__init__()
        self.slf_attn = MultiHeadAttention(
            n_head, d_model, d_k, d_v, dropout=dropout)  # multi-head attention
        self.pos_ffn = PositionwiseFeedForward(
            d_model, d_inner, dropout=dropout)           # feed-forward network

    def forward(self, enc_input, pe, non_pad_mask=None, slf_attn_mask=None):
        enc_output, enc_slf_attn = self.slf_attn(
            enc_input, enc_input, enc_input, pe, mask=slf_attn_mask)  # multi-head attention output
        enc_output *= non_pad_mask  # zero out the padded positions

        enc_output = self.pos_ffn(enc_output)  # feed-forward output
        enc_output *= non_pad_mask

        return enc_output, enc_slf_attn  # output of one encoder layer
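
The encoder passes `pe` into `MultiHeadAttention`, whose modified body is not shown in this post. A minimal sketch of how the attention step might combine the content scores with the precomputed positional scores is given below. The function name `attention_with_positional_bias`, the mask convention (True means "do not attend"), and the assumption that `pos_bias` arrives already scaled are all hypothetical, not the post's implementation.

```python
import math

import torch
import torch.nn.functional as F


def attention_with_positional_bias(q, k, v, pos_bias, mask=None):
    """Scaled dot-product attention with an additive positional term.

    q, k: (B, T, d_k); v: (B, T, d_v); pos_bias: (B, T, T), assumed pre-scaled.
    B may be batch size or n_head * batch size.
    mask: optional bool tensor (B, T, T), True where attention is disallowed.
    """
    d_k = q.size(-1)
    # content (word-to-word) term, scaled by 1/sqrt(2 * d_k)
    scores = torch.bmm(q, k.transpose(1, 2)) / math.sqrt(2 * d_k)
    # add the position-to-position term computed outside the layer
    scores = scores + pos_bias
    if mask is not None:
        scores = scores.masked_fill(mask, float("-inf"))
    attn = F.softmax(scores, dim=-1)
    return torch.bmm(attn, v), attn


torch.manual_seed(0)
B, T, d_k, d_v = 3, 5, 4, 6
q, k, v = torch.randn(B, T, d_k), torch.randn(B, T, d_k), torch.randn(B, T, d_v)
pos_bias = torch.randn(B, T, T)
# mask out the last two key positions, as a padding mask would
mask = torch.zeros(B, T, T, dtype=torch.bool)
mask[:, :, -2:] = True
out, attn = attention_with_positional_bias(q, k, v, pos_bias, mask)
print(out.shape, attn.shape)
```

Adding the positional scores before the softmax keeps the rest of the layer unchanged, which is why only the call signature of the attention module differs between encoder and decoder.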