Deep Learning: Evaluation Metrics for Language Models

阿新 • Published: 2019-01-14

Evaluating a language model mainly means measuring closeness, i.e., how closely the generated language approximates real language.

Classification accuracy

Classification accuracy provides additional information about the power of a model, even if the model is not designed explicitly for text classification. [Jaech, et al. "Low-Rank RNN Adaptation for Context-Aware Language Modeling." arXiv 2017]

Perplexity

Perplexity is the geometric mean of the inverse probability of each predicted word.

Test set perplexity is the most widely accepted method for evaluating language models, both for recognition/translation applications and for generation. Its advantage is that it is easy to measure and is widely used as a criterion for model fit; its limitation is that it does not correspond directly to most of the tasks that language models are actually used for.

For fair comparison, when computing the perplexity with the 5-gram LM, exclude all test words marked as 〈unk〉 (i.e., with low counts or OOVs) from consideration.
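The definition above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function name `perplexity`, its parameters, and the `"<unk>"` marker string are assumptions made for this example, and the per-word probabilities are assumed to come from whatever model is being evaluated.

```python
import math

def perplexity(word_probs, tokens=None, skip_token="<unk>"):
    """Geometric mean of the inverse per-word probabilities.

    word_probs: probability the model assigned to each test word.
    tokens: optional list of the test words; when given, words equal to
            skip_token are excluded from the computation (hypothetical
            convention for this sketch).
    """
    if tokens is not None:
        word_probs = [p for p, t in zip(word_probs, tokens) if t != skip_token]
    if not word_probs:
        raise ValueError("no words left to score")
    # Average negative log-probability, then exponentiate.
    avg_neg_log = -sum(math.log(p) for p in word_probs) / len(word_probs)
    return math.exp(avg_neg_log)

# A model that gives every word probability 0.25 has perplexity 4.
uniform = perplexity([0.25, 0.25, 0.25, 0.25])

# The <unk> word is dropped before averaging.
filtered = perplexity([0.5, 0.001, 0.5], tokens=["a", "<unk>", "b"])
```

Working in log space avoids the numerical underflow that multiplying many small probabilities directly would cause.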

METEOR

[S. Banerjee and A. Lavie, “Meteor: An automatic metric for MT evaluation with improved correlation with human judgments,” in Proc. ACL Workshop Intrinsic Extrinsic Eval. Measures Mach. Transl. Summarization]

ROUGE

ROUGE is a recall-oriented measure widely used in the summarization literature. It measures the n-gram recall between the candidate text and the reference text(s).

ROUGE-N divides count_match by the total number of n-grams in the reference text(s), where count_match denotes the number of n-grams co-occurring in the candidate and the reference.

The commonly reported variants are ROUGE-1, ROUGE-2, and ROUGE-W (based on weighted longest common subsequence).

[C.-Y. Lin, “Rouge: A package for automatic evaluation of summaries,” in Proc. ACL Workshop Text Summarization Branches Out, 2004]
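As a rough illustration of n-gram recall, the sketch below computes a ROUGE-N style score on pre-tokenized text. The helper names `ngrams` and `rouge_n` are made up for this example, whitespace tokenization is assumed, and the official ROUGE package applies additional options (stemming, stopword handling, multi-reference aggregation) that are not reproduced here.

```python
from collections import Counter

def ngrams(tokens, n):
    """Count every contiguous n-gram in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def rouge_n(candidate, references, n=1):
    """Matched reference n-grams divided by total reference n-grams."""
    cand_counts = ngrams(candidate, n)
    match = 0
    total = 0
    for ref in references:
        ref_counts = ngrams(ref, n)
        # An n-gram matches at most as often as it appears in the candidate.
        match += sum(min(c, cand_counts[g]) for g, c in ref_counts.items())
        total += sum(ref_counts.values())
    return match / total if total else 0.0

candidate = "the cat sat on the mat".split()
reference = "the cat is on the mat".split()
rouge_1 = rouge_n(candidate, [reference], n=1)  # 5 of 6 reference unigrams matched
rouge_2 = rouge_n(candidate, [reference], n=2)  # 3 of 5 reference bigrams matched
```

Because the denominator counts reference n-grams only, a longer candidate can never lower this score, which is the recall bias that precision-oriented metrics try to correct.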

BLEU

BLEU is a form of precision of word n-grams between generated and reference sentences. Purely measuring recall will inappropriately reward long outputs; BLEU is designed to address this issue by emphasizing precision.

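The n-gram precision score is the number of candidate n-grams that also occur in the reference (each n-gram's count clipped to its count in the reference), divided by the total number of n-grams in the candidate.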

BLEU then combines the average logarithm of the precision scores with a brevity penalty, which penalizes candidates that are shorter than the reference.

Most previous work reports BLEU-1, i.e., it only computes precision at the unigram level, whereas BLEU-n is a geometric average of precision over 1- to n-grams.

[K. Papineni, S. Roukos, T. Ward, and W. J. Zhu, “BLEU: A method for automatic evaluation of machine translation,” ACL2002]
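The pieces described above (clipped n-gram precision, geometric averaging, and the brevity penalty) can be put together in a short sketch. This is a simplified sentence-level version with a single reference and no smoothing, written for illustration only; the function names are assumptions, and standard toolkits compute BLEU at the corpus level with multiple references.

```python
import math
from collections import Counter

def ngram_counts(tokens, n):
    """Count every contiguous n-gram in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def modified_precision(candidate, reference, n):
    """Clipped n-gram precision of the candidate against one reference."""
    cand = ngram_counts(candidate, n)
    ref = ngram_counts(reference, n)
    # Clip each candidate n-gram count to its count in the reference.
    clipped = sum(min(c, ref[g]) for g, c in cand.items())
    total = sum(cand.values())
    return clipped / total if total else 0.0

def bleu(candidate, reference, max_n=4):
    """Geometric mean of 1..max_n precisions times the brevity penalty."""
    precisions = [modified_precision(candidate, reference, n)
                  for n in range(1, max_n + 1)]
    if min(precisions) == 0:
        return 0.0  # no smoothing in this sketch
    log_avg = sum(math.log(p) for p in precisions) / max_n
    c, r = len(candidate), len(reference)
    # Candidates shorter than (or equal in length to) the reference
    # get exp(1 - r/c), which is 1 when the lengths are equal.
    bp = 1.0 if c > r else math.exp(1 - r / c)
    return bp * math.exp(log_avg)

reference = "the cat is on the mat".split()
bleu_1 = bleu("the cat sat on the mat".split(), reference, max_n=1)
bleu_2 = bleu("the cat sat on the mat".split(), reference, max_n=2)
short = bleu("the cat".split(), reference, max_n=1)
```

In this example `short` has perfect unigram precision, yet the brevity penalty lowers its score because the candidate has two tokens against a six-token reference.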

某小皮

Coherence Evaluation

Neither BLEU nor ROUGE attempts to evaluate true coherence. There is no generally accepted and readily available coherence evaluation metric. Existing work instead relies on simple approximations such as the number of overlapping tokens or topic distribution similarity (e.g., (Yan et al., 2011b; Yan et al., 2011a; Celikyilmaz and Hakkani-Tür, 2011)). [Li, Jiwei, et al. "A hierarchical neural autoencoder for paragraphs and documents." ACL 2015]

CIDEr

CIDEr measures consistency between n-gram occurrences in generated and reference sentences, where this consistency is weighted by n-gram saliency and rarity.

For a discussion of the drawbacks of the different evaluation methods, see also [Vedantam, R., Lawrence Zitnick, C., & Parikh, D. Cider: Consensus-based image description evaluation. CVPR 2015]

Human Evaluation

Ask human raters to give a subjective score.

Ratings are typically collected using Amazon Mechanical Turk,

for example, in image captioning, following the guidelines proposed in [M. Hodosh, P. Young, and J. Hockenmaier, “Framing image description as a ranking task: Data, models and evaluation metrics,” J. Artif. Intell. Res., 2013]

or [Jaech, et al. "Low-Rank RNN Adaptation for Context-Aware Language Modeling." arXiv 2017]

References: [Vinyals, O., Toshev, A., Bengio, S., & Erhan, D. Show and tell: Lessons learned from the 2015 mscoco image captioning challenge. TPAMI 2017]

[Hoang, et al "Incorporating Side Information into Recurrent Neural Network Language Models." NAACL2016]
