
XLM: Cross-lingual Language Model

Models like BERT (Devlin et al.) or GPT (Radford et al.) have achieved the state of the art in language understanding. However, these models are pre-trained on only one language. Recently, efforts have been made to move beyond monolingual representations and build universal cross-lingual models capable of encoding any sentence into a shared embedding space.


In this article, we will discuss the paper Cross-lingual Language Model Pretraining, proposed by Facebook AI. The authors propose two approaches to cross-lingual language modeling:


  1. Unsupervised, relies on monolingual data
  2. Supervised, relies on parallel data

Cross-lingual Language Model (XLM)

In this section, we will discuss the approaches proposed for training the XLM.


Shared Sub-Word Vocabulary

The model uses the same shared vocabulary for all languages. This helps establish a common embedding space for tokens from every language. Consequently, languages that share the same script (alphabet) or have many similar words map better into this common embedding space.


For tokenizing the corpora, Byte-Pair Encoding (BPE) is used.

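As a quick illustration, here is a minimal sketch of the shared vocabulary in action, assuming the Hugging Face transformers package and one of its XLM checkpoints (the checkpoint name below is an assumption, not something specified in the article): an English and a French sentence are split with the same BPE vocabulary, so tokens from both languages index into one shared embedding matrix.

```python
# A minimal sketch of the shared BPE vocabulary, using Hugging Face's
# XLM tokenizer (checkpoint name assumed; requires `transformers` and
# its sacremoses dependency).
from transformers import XLMTokenizer

tokenizer = XLMTokenizer.from_pretrained("xlm-mlm-xnli15-1024")

en = "The cat sits on the mat."
fr = "Le chat est assis sur le tapis."

# Both sentences are split into BPE sub-words from the SAME vocabulary,
# so their ids point into one shared embedding table.
print(tokenizer.tokenize(en))
print(tokenizer.tokenize(fr))
print(tokenizer.convert_tokens_to_ids(tokenizer.tokenize(fr)))
```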

Causal Language Modeling (CLM)

This is the regular language modeling objective, where we maximize the probability of a token x_t appearing at the t-th position of a sequence, given all the tokens x_<t that precede it in that sequence, i.e.


[Figure: the Causal Language Modeling objective, from the XLNet paper]
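
For reference, the objective in the figure can be written out as the standard causal factorization (a reconstruction from the description above, not the paper's exact notation):

```latex
% Causal Language Modeling: maximize the likelihood of each token given its prefix
\max_{\theta} \; \sum_{t=1}^{T} \log P_{\theta}\!\left(x_t \mid x_{<t}\right)
```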

OpenAI’s GPT and GPT-2 are trained on this objective. You can refer to my articles on GPT and GPT-2 if you’re interested in the details of this objective.


Masked Language Modeling (MLM)

[Figure: the Masked Language Modeling training scheme, from the XLM paper]

This is a denoising autoencoding objective, also known as the Cloze task. Here, we maximize the probability of a masked token x_t appearing at the t-th position, given the corrupted sequence x_hat (the sequence with some tokens masked out), i.e.


[Figure: the Masked Language Modeling objective, from the XLNet paper]
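
Written out in the same style (again a reconstruction rather than the paper's exact notation), with M the set of masked positions and x_hat the corrupted sequence:

```latex
% Masked Language Modeling: predict each masked token from the corrupted sequence
\max_{\theta} \; \sum_{t \in M} \log P_{\theta}\!\left(x_t \mid \hat{x}\right)
```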

BERT and RoBERTa are trained on this objective. You can refer to my articles on BERT and RoBERTa if you’re interested in the details of this objective.


Note that the only difference between BERT's and XLM's approach is that BERT uses pairs of sentences, whereas XLM uses streams of an arbitrary number of sentences and truncates each stream once its length reaches 256 tokens.

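Here is a minimal sketch of that input construction; the 15% masking rate and the 80/10/10 replacement split follow BERT's recipe and are assumptions for illustration, as are the placeholder ids:

```python
# A minimal sketch of MLM input construction on a token stream
# (the masking percentages follow BERT's 80/10/10 recipe; all ids and
# hyper-parameters here are illustrative assumptions).
import random

MAX_LEN = 256          # stream truncation length mentioned above
MASK_ID = 0            # placeholder id for the [MASK] token (assumed)
VOCAB_SIZE = 95_000    # illustrative shared-vocabulary size

def build_mlm_example(stream_token_ids):
    """Truncate a stream of sentences to MAX_LEN and mask roughly 15% of tokens."""
    tokens = stream_token_ids[:MAX_LEN]
    inputs, labels = list(tokens), [-100] * len(tokens)  # -100 marks positions not predicted
    for i in range(len(tokens)):
        if random.random() < 0.15:
            labels[i] = tokens[i]            # predict the original token at this position
            r = random.random()
            if r < 0.8:
                inputs[i] = MASK_ID                       # 80%: replace with [MASK]
            elif r < 0.9:
                inputs[i] = random.randrange(VOCAB_SIZE)  # 10%: replace with a random token
            # remaining 10%: keep the original token unchanged
    return inputs, labels

# Example: a fake stream of token ids standing in for several sentences.
inputs, labels = build_mlm_example(list(range(300)))
print(len(inputs), sum(l != -100 for l in labels), "positions masked")
```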

Translation Language Modeling (TLM)

[Figure: Translation Language Modeling (TLM), from the XLM paper]

The CLM and MLM tasks work well on monolingual corpora; however, they do not take advantage of the available parallel translation data. Hence, the authors propose a Translation Language Modeling objective, wherein we take a pair of parallel sentences from the translation data, concatenate them into one sequence, and randomly mask tokens in both the source and the target sentence. For example, in the figure above, words are masked in both the English and the French sentence. All the words in the sequence contribute to the prediction of a given masked word, hence establishing a cross-lingual mapping among the tokens.

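A rough sketch of how a TLM training example could be assembled (the separator handling, language ids, and masking routine are simplified assumptions, not the paper's exact layout): the English and French sentences are concatenated into one sequence, each position is tagged with its language, and tokens on both sides are masked.

```python
# A rough sketch of TLM input construction (separator handling, language
# ids, and the masking routine are simplified assumptions): concatenate a
# translation pair, tag every position with its language, and mask tokens
# on BOTH sides so the model must attend across languages to fill the blanks.
import random

MASK_ID = 0              # placeholder id for the [MASK] token (assumed)
EN_LANG, FR_LANG = 0, 1  # illustrative language ids

def build_tlm_example(en_ids, fr_ids, mask_prob=0.15):
    tokens = list(en_ids) + list(fr_ids)                       # one concatenated sequence
    langs = [EN_LANG] * len(en_ids) + [FR_LANG] * len(fr_ids)  # language tag per position
    inputs, labels = list(tokens), [-100] * len(tokens)        # -100 = position not predicted
    for i in range(len(tokens)):
        if random.random() < mask_prob:
            labels[i] = tokens[i]   # target: recover the original token
            inputs[i] = MASK_ID     # simple masking; no 80/10/10 split in this sketch
    return inputs, labels, langs

# Fake ids standing in for an English sentence and its French translation.
inputs, labels, langs = build_tlm_example([11, 12, 13, 14], [21, 22, 23, 24, 25])
print(inputs, langs)
```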

XLM

"In this work, we consider cross-lingual language model pretraining with either CLM, MLM, or MLM used in combination with TLM." (XLM paper)

XLM Pre-training

In this section, we’ll discuss how XLM Pre-training is leveraged for downstream tasks like:


  1. Zero-shot cross-lingual classification
  2. Supervised and unsupervised neural machine translation
  3. Language models for low-resource languages
  4. Unsupervised cross-lingual word embeddings

Zero-shot Cross-lingual Classification

Just like any other Transformer-based monolingual model, XLM is fine-tuned on the XNLI dataset to obtain cross-lingual classification.


A classification layer is added on top of XLM, and it is trained on the English NLI training set. The model is then evaluated on the 15 XNLI languages.


Since the model hasn’t been tuned to classify sentences from any of these languages, it is a zero-shot learning example.

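Here is a hedged sketch of that setup with Hugging Face transformers; the checkpoint name and the three-way label set are assumptions, and the English-NLI fine-tuning loop itself is only indicated by a comment:

```python
# A sketch of zero-shot cross-lingual classification with XLM, assuming
# the `transformers` package and the `xlm-mlm-tlm-xnli15-1024` checkpoint.
# The classification head starts randomly initialized and would be trained
# on English NLI data before the zero-shot evaluation step.
import torch
from transformers import XLMTokenizer, XLMForSequenceClassification

name = "xlm-mlm-tlm-xnli15-1024"
tokenizer = XLMTokenizer.from_pretrained(name)
model = XLMForSequenceClassification.from_pretrained(name, num_labels=3)

# ... fine-tune `model` here on English (premise, hypothesis, label) NLI examples ...

# Zero-shot evaluation: classify a French pair the head never saw during training.
premise = "Le chat dort sur le canapé."
hypothesis = "Un animal est en train de dormir."
enc = tokenizer(premise, hypothesis, return_tensors="pt")
with torch.no_grad():
    logits = model(**enc).logits
print(logits.softmax(dim=-1))  # e.g. entailment / neutral / contradiction probabilities
```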

Unsupervised NMT

For this task, the authors propose pre-training a complete encoder-decoder architecture with a cross-lingual language modeling objective. The model is evaluated on several translation benchmarks including WMT’14 English-French, WMT’16 English-German, and WMT’16 English-Romanian.


Supervised NMT

Here, the encoder and decoder are initialized with pre-trained XLM weights and then fine-tuned on the supervised translation dataset. This essentially achieves multilingual machine translation.

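A rough sketch of that warm start in plain PyTorch (the layer sizes are illustrative and the modules are stand-ins, not the authors' code): both the encoder and the decoder are seeded from the same pretrained stack, and whatever does not match, such as the decoder's cross-attention, stays randomly initialized.

```python
# A rough sketch (placeholders, not the authors' code) of warm-starting a
# translation model: encoder and decoder are both seeded with the same
# pretrained cross-lingual weights, then fine-tuned on parallel data.
import copy
import torch.nn as nn

d_model, n_heads, n_layers = 1024, 8, 6  # illustrative sizes

# Stand-in for the pretrained XLM Transformer stack.
pretrained_encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model, n_heads), n_layers
)

# NMT encoder: a direct copy of the pretrained stack.
nmt_encoder = copy.deepcopy(pretrained_encoder)

# NMT decoder: has extra cross-attention, so only the parameters whose
# names and shapes match are copied; the rest stay randomly initialized.
nmt_decoder = nn.TransformerDecoder(
    nn.TransformerDecoderLayer(d_model, n_heads), n_layers
)
missing, unexpected = nmt_decoder.load_state_dict(
    pretrained_encoder.state_dict(), strict=False
)
print(len(missing), "decoder tensors left randomly initialized")
# The encoder-decoder pair is then fine-tuned end-to-end on parallel sentences.
```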

For more on multi-lingual NMT, refer to this blog.


Low-resource Language Modeling

Here’s where “languages with the same script or similar words provide better mapping” comes into the picture. For example, there are 100k sentences written in Nepali on Wikipedia and about 6 times more in Hindi. Moreover, these languages have 80% of tokens in common.


Hence, a cross-lingual language model is clearly beneficial for a Nepali language model, since it is effectively trained on considerably more data from a closely related language.


Unsupervised Cross-lingual Word Embeddings

Finally, since we have a shared vocabulary, the lookup table (or embedding matrix) of the XLM model gives us the cross-lingual word embeddings.

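A small sketch of that lookup, again assuming the transformers package and an XLM checkpoint name (the word pair is purely illustrative): pull out the input embedding matrix and compare two translation-equivalent words by cosine similarity.

```python
# A small sketch of using XLM's lookup table as cross-lingual word
# embeddings (checkpoint name assumed; the word pair is illustrative).
import torch
from transformers import XLMTokenizer, XLMModel

name = "xlm-mlm-xnli15-1024"
tokenizer = XLMTokenizer.from_pretrained(name)
model = XLMModel.from_pretrained(name)

embeddings = model.get_input_embeddings().weight  # shape: (vocab_size, hidden_size)

def word_vector(word):
    # Average the embeddings of the word's BPE pieces.
    ids = tokenizer.convert_tokens_to_ids(tokenizer.tokenize(word))
    return embeddings[ids].mean(dim=0)

sim = torch.cosine_similarity(word_vector("cat"), word_vector("chat"), dim=0)
print(f"cosine(cat, chat) = {sim.item():.3f}")
```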

Conclusion

In this article, we discussed how a cross-lingual language model is beneficial not only for obtaining better results on generic downstream tasks, but also for improving model quality for low-resource languages: by training alongside similar high-resource languages, the model gets exposure to more relevant data.


Here is a link to the original XLM GitHub repository.


Here is a link to huggingface’s XLM architecture implementation and pre-trained weights.


Translated from: https://towardsdatascience.com/xlm-cross-lingual-language-model-33c1fd1adf82