
Encoded Linguistic Knowledge in NLP Models

This article is authored by Keyur Faldu and Dr. Amit Sheth


Modern NLP models (BERT, GPT, etc.) are typically trained end to end. Carefully crafted feature engineering has largely disappeared, and the complex architectures of these models enable them to learn end-to-end tasks (e.g. sentiment classification, question answering) without the features being specified explicitly [2]. Linguistic features (like part-of-speech, co-reference, etc.) played a key role in classical NLP. Hence, it is important to understand how modern NLP models arrive at decisions by “probing” into what they learn. Do these models learn linguistic features from unlabelled data automatically? How can we interpret the capabilities of modern NLP models? Let’s probe.

Linguistics: The Background

Linguistic knowledge is an essential aspect of natural language processing. We can think of it along the following dimensions:

  • Syntax: analyzes the structure of sentences and the way words are connected.

  • Morphology: deals with the inner structure of individual words and how new words are formed from morphs of these base words.

  • Phonology: the study of the system of sounds that make up speech and constitute fundamental components of language.

  • Semantics: deals with the meaning of individual words and entire texts.

In statistical methods and classical machine learning, solving any problem related to natural language processing involved deriving the linguistic knowledge described above. Thus, the research community gave attention to numerous tasks related to linguistic knowledge. A few examples are shown below:

Figure 2: Example of linguistic knowledge in a sentence. (Image from the other article)

  • Part-of-speech: The syntactic category of a word, i.e., noun, verb, adjective, pronoun, etc.

  • Constituency Trees (or phrase structure grammar): Phrase structure rules treat sentence structure as constituency-based, and a parse tree arranges the constituents in a tree structure according to the constituency relation.

  • Dependency Trees (or dependency grammars): Dependency grammar rules treat sentence structure as dependency-based, and the dependency parse tree arranges words in a tree structure according to the dependency relation.

  • Coreference: The relationship between two words or phrases that share a common referent.

  • Lemmatization: Deriving the base form (lemma) of a word by removing prefixes or suffixes using morphological analysis.

These are a few examples of important tasks related to linguistic knowledge. Part-of-speech mainly deals with syntactic knowledge, dependency trees and coreference are important for understanding semantics, and lemmatization is an example of morphology.

Numerous other tasks further analyze the linguistic properties of a sentence, like semantic roles, semantic proto-roles, relation classification (lexical and semantic), subject noun, main auxiliary verb, subject-verb agreement, etc.

Modern NLP Models

Modern NLP models are either LSTM based or transformer based. ELMo and ULMFiT are examples of language models based on the LSTM architecture. In contrast, BERT [1] and GPT are examples of language models based on the transformer architecture. For the rest of this study, let’s take “BERT” as the reference example.

  • The BERT model is pre-trained on massive unlabeled data with two objectives: masked word prediction and next sentence prediction.

  • The pre-trained BERT model is fine-tuned by extending it with task-specific layers for tasks like ‘sentiment analysis,’ ‘text classification,’ or ‘question answering’ with limited labeled data.

Representations produced by the pre-trained BERT model encode relevant information, which enables task-specific fine-tuning with very limited labeled data. The question is,

What Linguistic Knowledge is Encoded in BERT?

To probe is to investigate. Photo by Markus Winkler on Unsplash

As a result, a flurry of research has sought to understand what kind of linguistic information is captured in neural networks. The most common theme across the different approaches can be grouped as “probes” (also called probing classifiers, diagnostic classifiers, or auxiliary prediction tasks), which test how well the internal mechanisms of neural networks can classify (or perform on) auxiliary linguistic tasks (or probe tasks).

Figure 3. Illustration of Probes on the BERT model. It shows how input tokens are contextualized in successive layers using attention mechanisms. Two types of Probes are shown: (1) representation-based and (2) attention-based. Note that the diagram is for broader illustration, so special tokens like CLS and SEP are not shown.

How do “Probes” work?

  • Probing is an approach to investigate how a model (and its components) trained on a primary objective performs on different auxiliary tasks. Typically, the model is extended with a classifier layer that maps its representations to the output space of the auxiliary task. Probing facilitates a better understanding of encoded representations.

  • Let’s say we want to investigate whether encoded representations from the BERT model capture linguistic information, such as “is this verb an auxiliary verb?” or “is this phrase a subject noun?”. Auxiliary verbs are helping verbs, and subject nouns are noun phrases that act as a subject. These tasks can be framed as “auxiliary tasks” for probes.

  • For example, in the sentence “Kids are playing cricket all day,” “are” is an auxiliary verb, “playing” is the main verb, “Kids” is the subject noun, and “cricket” is an object noun.

  • Probing classifiers can be attached to the model at different layers. That gives insight into which layers of the BERT model capture how much of the information sought by the auxiliary task.

  • If a probe classifier does not do well on an auxiliary task for some linguistic information, that suggests the information is not encoded in the model’s internal representations, possibly because it is not needed to solve the model’s primary objectives.
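The probing setup described above can be sketched in a few lines. This is a minimal illustration, not the setup of any cited paper: the "layer" features are synthetic stand-ins for frozen BERT representations, and the function names (`train_linear_probe`, `probe_accuracy`) are hypothetical. One synthetic layer encodes the binary auxiliary label along a direction, the other is pure noise, and the probe is scored on held-out tokens.

```python
import numpy as np

def train_linear_probe(features, labels, lr=0.5, epochs=300):
    """Fit a logistic-regression probe on frozen features.

    features: (n_tokens, hidden_dim) array taken from one layer of the model.
    labels:   (n_tokens,) array of 0/1 auxiliary-task labels.
    Only the probe's own weights are learned; the features are never modified.
    """
    n, d = features.shape
    weights = np.zeros(d)
    bias = 0.0
    for _ in range(epochs):
        logits = features @ weights + bias
        probs = 1.0 / (1.0 + np.exp(-logits))
        error = probs - labels              # gradient of the log loss w.r.t. logits
        weights -= lr * features.T @ error / n
        bias -= lr * error.mean()
    return weights, bias

def probe_accuracy(features, labels, weights, bias):
    predictions = (features @ weights + bias) > 0
    return float((predictions == labels).mean())

# Synthetic stand-ins for token representations (assumption: no real model is loaded).
rng = np.random.default_rng(0)
hidden_dim, n_tokens, n_train = 16, 400, 300
labels = rng.integers(0, 2, size=n_tokens)   # e.g. 1 = "token is an auxiliary verb"
direction = rng.normal(size=hidden_dim)
layer_a = rng.normal(size=(n_tokens, hidden_dim)) + np.outer(2 * labels - 1, direction)
layer_b = rng.normal(size=(n_tokens, hidden_dim))   # carries no label information

def held_out_accuracy(layer):
    weights, bias = train_linear_probe(layer[:n_train], labels[:n_train])
    return probe_accuracy(layer[n_train:], labels[n_train:], weights, bias)

acc_a = held_out_accuracy(layer_a)
acc_b = held_out_accuracy(layer_b)
print(f"layer A probe accuracy: {acc_a:.2f}, layer B probe accuracy: {acc_b:.2f}")
```

A large gap between the two accuracies is the kind of evidence a probe provides: the information is linearly recoverable from one layer's representations but not from the other's.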

How are “Probes” different from Fine-Tuning or Multi-Task Learning?

Table 1. Probes vs Fine-Tuning vs Multi-Task Learning

  • “Probes” are unrelated to fine-tuning for downstream tasks, both in goal and in approach.

  • Table 1 shows the comparative landscape.

  • “Probes” aim to discover encoded linguistic knowledge, whereas fine-tuning and multi-task learning train the model on one or more primary tasks.

Figure 4. Multi-task learning vs Probes

  • As illustrated in figure 4, “Probes” can access model internals but cannot update model parameters. Fine-tuning and multi-task learning, on the other hand, do not access model internals but can update model parameters.

  • “Probes” should be shallow in terms of complexity (i.e. a single-layer classifier on top of the model), whereas fine-tuning and multi-task learning can stack deeper layers depending on the complexity of the downstream tasks [7][8].
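The difference in which parameters get updated can be made concrete with a toy model. The sketch below is illustrative only: the "encoder" is a single linear map standing in for a pre-trained network, and `train` and its `update_encoder` flag are hypothetical names. With `update_encoder=False` the routine behaves like a probe (only the shallow head learns); with `update_encoder=True` it behaves like fine-tuning.

```python
import numpy as np

def train(inputs, labels, encoder, head, update_encoder, lr=0.1, steps=50):
    """Gradient descent on log loss for a toy model: sigmoid((inputs @ encoder) @ head).

    update_encoder=False mimics a probe (encoder frozen, only the head learns);
    update_encoder=True mimics fine-tuning (encoder and head both learn).
    """
    encoder, head = encoder.copy(), head.copy()
    n = len(labels)
    for _ in range(steps):
        hidden = inputs @ encoder                     # internal representation
        probs = 1.0 / (1.0 + np.exp(-(hidden @ head)))
        error = (probs - labels) / n
        grad_head = hidden.T @ error
        grad_encoder = inputs.T @ np.outer(error, head)
        head -= lr * grad_head
        if update_encoder:
            encoder -= lr * grad_encoder
    return encoder, head

rng = np.random.default_rng(0)
inputs = rng.normal(size=(64, 5))
labels = (inputs[:, 0] > 0).astype(float)
pretrained_encoder = rng.normal(size=(5, 3))   # stand-in for pre-trained weights
initial_head = rng.normal(size=3) * 0.1

probe_encoder, probe_head = train(
    inputs, labels, pretrained_encoder, initial_head, update_encoder=False)
tuned_encoder, tuned_head = train(
    inputs, labels, pretrained_encoder, initial_head, update_encoder=True)

print("probe changed encoder:      ", not np.array_equal(probe_encoder, pretrained_encoder))
print("fine-tuning changed encoder:", not np.array_equal(tuned_encoder, pretrained_encoder))
```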

What are Different Types of “Probes”?

These probing classifiers can be categorized by the neural network mechanism they use to probe for linguistic knowledge. The two main categories are:

  • Internal Representations: A small probe classifier is built on top of internal representations from different layers to analyze what linguistic information is encoded at each layer.

  • Attention weights: Probe classifiers are built on top of attention weights to discover whether the patterns in the attention weights reflect an underlying linguistic phenomenon.

(A) Internal Representations based “Probes”:

Quite a few techniques probe how much linguistic knowledge is encoded in the internal representations at different layers of models like BERT. Let’s take a look at a couple of examples.

(A.1) Edge Probing: A framework introduced by Tenney et al. [4][5] aims to probe linguistic knowledge encoded in contextualized representations of a model.

  • For auxiliary tasks like Part-of-Speech, Constituents, Dependencies, Entities, Semantic Role Labelling, Semantic Proto-Roles, and Coreference resolution, it compared the performance of contextualized representations from models like BERT, GPT, ELMo, and CoVe.

  • Edge probing decomposes structured-prediction tasks into a common format, where a probing classifier receives a text span (or two spans) from the sentence and must predict a label, such as a constituent or relation type, from the per-token embeddings of the tokens within those target spans.

  • The macro average of performance over all the auxiliary tasks for the BERT-Large model was 87.3, whereas the baseline probe using non-contextualized representations achieved 75.2. Contextualization therefore accounts for a gain of about 12 points, roughly 16% relative to the baseline.
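The common input format used by edge probing can be sketched as follows. This is a simplified illustration under stated assumptions: spans are mean-pooled here to keep the example short, which is a swapped-in simplification and may differ from the pooling used in the papers; the embeddings are random stand-ins for frozen model outputs; and `span_representation` and `edge_probe_input` are hypothetical names.

```python
import numpy as np

def span_representation(token_embeddings, span):
    """Mean-pool the per-token embeddings inside a half-open span [start, end)."""
    start, end = span
    return token_embeddings[start:end].mean(axis=0)

def edge_probe_input(token_embeddings, span1, span2=None):
    """Build the probe's input vector from one span, or from two spans concatenated."""
    parts = [span_representation(token_embeddings, span1)]
    if span2 is not None:
        parts.append(span_representation(token_embeddings, span2))
    return np.concatenate(parts)

tokens = ["Kids", "are", "playing", "cricket", "all", "day"]
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(len(tokens), 8))   # stand-in for frozen per-token embeddings

# One-span task, e.g. labelling the constituent "all day".
constituent_input = edge_probe_input(embeddings, (4, 6))
# Two-span task, e.g. labelling the relation between "playing" and "cricket".
relation_input = edge_probe_input(embeddings, (2, 3), (3, 4))

print(constituent_input.shape, relation_input.shape)
```

The probe classifier only ever sees these pooled vectors, so any label it predicts correctly has to be recoverable from the embeddings of the tokens inside the target spans.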

(A.2) BERT Rediscovers the Classical NLP Pipeline: Tenney et al. [4] further analyzed where in the model linguistic knowledge comes from.

  • Center of Gravity: The probe classifier uses scalar mixing (weighted pooling) of the internal representations from different layers. The center of gravity is the average layer index under these mixing weights. Intuitively, a higher center of gravity for a task means that the information needed for that task is captured by higher layers.

  • Expected layer: Probe classifiers are trained with scalar mixing over progressively more layers. The contribution (or differential score) of layer i is the difference between the “performance of the probe trained with layers 0 to i” and the “performance of the probe trained with layers 0 to i-1”. The expected layer is the expectation of the layer index under these differential scores.

Figure 5: Probe performance, and layers’ contribution to auxiliary tasks (Image source: Tenney et al. [4])

  • In figure 5, row labels are the auxiliary tasks used for probing linguistic knowledge. F1 scores of the probe classifiers for each task are given in the first two columns, where l=0 indicates auxiliary task performance on non-contextual representations, and l=24 indicates auxiliary task performance when mixing contextual representations from all 24 layers of the BERT model. Expected layers are shown in purple (and the center of gravity in dark blue).

  • The expected layer is where the maximum additional linguistic knowledge comes from. It can be seen that linguistic knowledge about syntactic tasks is acquired in the initial layers, while knowledge for semantic tasks is acquired in later layers.
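The two summary statistics described above, center of gravity and expected layer, reduce to short weighted averages. The sketch below follows the textual definitions given in this article; the mixing weights and cumulative scores are made-up numbers for illustration, and the function names are hypothetical.

```python
import numpy as np

def center_of_gravity(mix_weights):
    """Average layer index under the (normalized) scalar-mixing weights."""
    weights = np.asarray(mix_weights, dtype=float)
    weights = weights / weights.sum()
    return float(np.sum(np.arange(len(weights)) * weights))

def expected_layer(cumulative_scores):
    """Expectation of the layer index under the differential scores.

    cumulative_scores[i] is the probe score when layers 0..i are mixed, so the
    differential score of layer i (for i >= 1) is score[i] - score[i - 1].
    """
    scores = np.asarray(cumulative_scores, dtype=float)
    deltas = np.diff(scores)                 # deltas[i - 1] belongs to layer i
    layers = np.arange(1, len(scores))
    return float(np.sum(layers * deltas) / np.sum(deltas))

# Made-up values for a 4-layer model (layers 0..3), for illustration only.
cog = center_of_gravity([0.1, 0.2, 0.4, 0.3])          # 0*0.1 + 1*0.2 + 2*0.4 + 3*0.3
exp_layer = expected_layer([70.0, 80.0, 85.0, 86.0])   # (1*10 + 2*5 + 3*1) / 16
print(round(cog, 4), exp_layer)
```

In this toy case most of the score gain arrives at layer 1, so the expected layer (about 1.44) sits lower than the center of gravity (about 1.9), which mirrors how the two statistics can diverge for a given task.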

(B) Attention weights based “Probes”:

In “What Does BERT Look At? An Analysis of BERT’s Attention,” Clark et al. [3] probe attention weights for linguistic knowledge in BERT. It is intriguing to see how specific attention heads express linguistic phenomena, and how combinations of attention heads predict linguistic structure such as dependency grammar with performance comparable to the state of the art.

(B.1) Specific Attention Heads


  • As can be seen in figure 6, specific attention heads in BERT express specific linguistic phenomena, where a token attends to other tokens depending on the linguistic relation expressed by the attention head.

Figure 6: Linguistic phenomena expressed by specific attention heads in BERT. (Image source: Clark et al. [3])

  • Visualizations of six different attention heads are shown above. The BERT base model has 12 layers, and each layer has 12 attention heads. The top-left plot in figure 6 represents the 10th attention head in the 8th layer, where the pattern of direct objects attending to their verbs is evident. Similarly, in the 11th attention head of the 8th layer, noun modifiers (determiners, etc.) attend to their nouns. The attention heads in the other plots express linguistic knowledge in the same way.

Table 1: Dependency relation classification accuracy by specific attention heads. Clark et al. [3]

  • It is striking how well attention heads perform as readily available probe classifiers.

The table above shows, for each dependency relation, how well a specific attention head performs at predicting the dependent token. For relations like determiner (det), direct object (dobj), possessive modifier (poss), and passive auxiliary (auxpass), the gain over the baseline (predicting the token at the best fixed offset) was large.
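Treating a single attention head as a ready-made classifier, and comparing it with the fixed-offset baseline, can be sketched as follows. This is a toy illustration with a hand-built attention matrix and hand-written gold heads for one sentence, not data from the paper. It assumes one direction of prediction (each dependent token attends to its syntactic head), and the function names are hypothetical.

```python
import numpy as np

def head_accuracy(attention, gold_heads):
    """attention[i, j] is the weight token i puts on token j.

    Each token's predicted syntactic head is its most-attended token.
    gold_heads[i] is the index of token i's head, or -1 if it has none (root).
    """
    predicted = attention.argmax(axis=1)
    scored = gold_heads >= 0
    return float((predicted[scored] == gold_heads[scored]).mean())

def best_fixed_offset_accuracy(gold_heads, max_offset=3):
    """Baseline: always predict the token at the single best fixed offset."""
    positions = np.arange(len(gold_heads))
    scored = gold_heads >= 0
    best = 0.0
    for offset in range(-max_offset, max_offset + 1):
        if offset == 0:
            continue
        hits = (positions + offset)[scored] == gold_heads[scored]
        best = max(best, float(hits.mean()))
    return best

tokens = ["Kids", "are", "playing", "cricket", "all", "day"]
# Hand-written gold heads: everything depends on "playing" except "all" -> "day".
gold_heads = np.array([2, 2, -1, 2, 5, 2])

# Hand-built attention for one hypothetical head: each row puts 0.75 on one token.
most_attended = np.array([2, 2, 0, 2, 5, 4])
attention = np.full((6, 6), 0.05)
attention[np.arange(6), most_attended] = 0.75

acc_head = head_accuracy(attention, gold_heads)
acc_baseline = best_fixed_offset_accuracy(gold_heads)
print(acc_head, acc_baseline)
```

In this toy sentence the attention head gets 4 of the 5 scored tokens right, while the best fixed offset (+1) gets only 2 of 5.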

(B.2) Attention Head Combinations


Table 2: Performance of different baselines and probe techniques. UAS is the unlabelled attachment score for dependency head token prediction. Clark et al. [3]

  • Probe classifiers trained directly on linear combinations of attention weights, and on attention weights combined with non-contextual embeddings like GloVe, gave performance comparable to relatively complex models that rely on internal contextual representations for dependency parsing.

  • Experiments on coreference resolution suggested similar potential. We can therefore conclude that attention mechanisms in BERT also encode and express linguistic phenomena.
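A probe built on a linear combination of attention weights can be sketched as a forward pass. This is an illustrative reading of the idea, not the paper's implementation: each head contributes its attention in both directions with its own scalar weight, and a softmax over candidate tokens turns the combined scores into a distribution over possible syntactic heads. The attention tensor and the weights `w` and `u` are random stand-ins (in a real probe they would be learned), and `head_distribution` is a hypothetical name.

```python
import numpy as np

def head_distribution(attention, w, u):
    """Combine attention heads linearly into a distribution over syntactic heads.

    attention: (num_heads, seq_len, seq_len); attention[k, i, j] is the weight
               token i puts on token j in attention head k.
    w, u:      (num_heads,) weights for the two attention directions.
    Returns probs[i, j], the probability that token j is the head of token i.
    """
    forward = np.tensordot(w, attention, axes=1)                       # sum_k w_k * a_k[i, j]
    backward = np.tensordot(u, attention.transpose(0, 2, 1), axes=1)   # sum_k u_k * a_k[j, i]
    scores = forward + backward
    np.fill_diagonal(scores, -np.inf)            # a token cannot be its own head
    scores -= scores.max(axis=1, keepdims=True)  # numerical stability
    probs = np.exp(scores)
    return probs / probs.sum(axis=1, keepdims=True)

rng = np.random.default_rng(0)
num_heads, seq_len = 4, 6
raw = rng.normal(size=(num_heads, seq_len, seq_len))
attention = np.exp(raw) / np.exp(raw).sum(axis=2, keepdims=True)   # rows sum to 1
w = rng.normal(size=num_heads)
u = rng.normal(size=num_heads)

probs = head_distribution(attention, w, u)
print(probs.argmax(axis=1))   # predicted head index for each token
```

The probe has only two scalars per attention head, so whatever parsing accuracy it reaches has to come from the attention patterns themselves rather than from the probe's capacity.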

Probing the “Probes”

Now that we have been introduced to representation-based probes and attention-weight-based probes, which discover encoded linguistic knowledge using auxiliary tasks, it is interesting to ask deeper questions:

  • Are bigger models better at encoding linguistic knowledge?

  • How can we check a model’s ability to generalize when encoding linguistic knowledge?

  • Can we decode linguistic knowledge instead of relying on shallow probe classifier labels?

  • What are the limitations of probes, and how should conclusions be drawn?

  • Can we infuse linguistic knowledge?

  • Does encoded linguistic knowledge capture meaning?

  • Is encoded linguistic knowledge good enough for natural language understanding?

Let’s elaborate further on the above questions in the next article, “Analysing Encoded Linguistic Capabilities of NLP Models & Beyond” (coming soon).

  1. Devlin et al. “BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding”, NAACL 2019

  2. Belinkov et al. “Analysis Methods in Neural Language Processing: A Survey”, ACL 2019

  3. Kevin Clark, Urvashi Khandelwal, Omer Levy, Christopher D. Manning, “What Does BERT Look At? An Analysis of BERT’s Attention”, 2019

  4. Ian Tenney, Dipanjan Das, Ellie Pavlick, “BERT Rediscovers the Classical NLP Pipeline”, 2019

  5. Tenney et al. “What Do You Learn from Context? Probing for Sentence Structure in Contextualized Word Representations”, ICLR 2019

  6. Adi et al. “Fine-Grained Analysis of Sentence Embeddings Using Auxiliary Prediction Tasks”, ICLR 2017

  7. Stickland et al. “BERT and PALs: Projected Attention Layers for Efficient Adaptation in Multi-Task Learning”, ICML 2019

  8. Zhou et al. “LIMIT-BERT: Linguistic Informed Multi-Task BERT”, 2019

Translated from: https://towardsdatascience.com/encoded-linguistic-knowledge-in-nlp-models-b9558ba90943