The evaluation of language models mainly measures closeness, that is, how closely the generated language approximates real language.
Classification accuracy
Classification accuracy provides additional information about the power of a model, even if the model is not designed explicitly for text classification. [Jaech, et al. "Low-Rank RNN Adaptation for Context-Aware Language Modeling." arXiv 2017]
Perplexity
Perplexity is the geometric mean of the inverse probability of each predicted word.
For fair comparison, when computing the perplexity with the 5-gram LM, exclude all test words marked as 〈unk〉 (i.e., with low counts or OOVs) from consideration.
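A minimal Python sketch of this definition, assuming the model's probability for each test word is already available as a list. The function name `perplexity`, its parameters, and the ASCII marker string `"<unk>"` are illustrative assumptions and do not come from any cited work.

```python
import math


def perplexity(word_probs, words=None, unk_token="<unk>"):
    """Geometric mean of inverse word probabilities.

    word_probs: probability the model assigned to each test word.
    words: optional list of the test words themselves; when given,
           positions whose word equals unk_token are left out.
    """
    if words is not None:
        word_probs = [p for p, w in zip(word_probs, words) if w != unk_token]
    if not word_probs:
        raise ValueError("no words left to score")
    # exp of the average negative log probability equals the
    # geometric mean of 1/p over the scored words.
    avg_neg_log = -sum(math.log(p) for p in word_probs) / len(word_probs)
    return math.exp(avg_neg_log)
```

For example, `perplexity([0.5, 0.001, 0.125], ["the", "<unk>", "cat"])` skips the middle word and evaluates to approximately 4, the same value as `perplexity([0.5, 0.125])`.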
ROUGE
ROUGE is a recall-oriented measure widely used in the summarization literature. It measures the n-gram recall between the candidate text and the reference text(s).
Here, count match denotes the number of n-grams co-occurring in the input and output; the recall is this matched count divided by the total number of n-grams in the reference text(s).
The variants commonly reported are ROUGE-1, ROUGE-2, and ROUGE-W (based on weighted longest common subsequence).
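To make the recall computation concrete, here is a short Python sketch of ROUGE-N over pre-tokenized text. The helper names `ngrams` and `rouge_n` are hypothetical. Matched counts are clipped to the candidate's counts, which is one common reading of "co-occurring". The sketch does not cover ROUGE-W, stemming, or stopword handling.

```python
from collections import Counter


def ngrams(tokens, n):
    """Count the n-grams (as tuples) in a list of tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def rouge_n(candidate, references, n=1):
    """N-gram recall of a candidate against one or more references."""
    cand_counts = ngrams(candidate, n)
    matched = 0
    total = 0
    for ref in references:
        ref_counts = ngrams(ref, n)
        # An n-gram matches at most as often as it occurs in the candidate.
        matched += sum(min(c, cand_counts[g]) for g, c in ref_counts.items())
        total += sum(ref_counts.values())
    return matched / total if total else 0.0
```

With the candidate `"the cat sat on the mat".split()` and the single reference `"the cat is on the mat".split()`, five of the six reference unigrams and three of the five reference bigrams are matched, so ROUGE-1 is 5/6 and ROUGE-2 is 3/5.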
BLEU
BLEU is a form of precision of word n-grams between generated and reference sentences. Purely measuring recall will inappropriately reward long outputs. BLEU is designed to address this issue by emphasizing precision.
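The n-gram precision score is the number of n-grams in the output that also occur in the reference, divided by the total number of n-grams in the output.
Because precision alone would favor very short outputs, BLEU then combines the average logarithm of the precision scores with a length penalty (the brevity penalty, applied to outputs shorter than the reference).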
Most previous work reports BLEU-1, i.e., it computes precision only at the unigram level, whereas BLEU-n is a geometric average of precision over 1- to n-grams.
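The pieces described above can be sketched in Python as follows. This is an illustrative sentence-level version, not a reference implementation: the function names are hypothetical, the clipping of counts and the brevity penalty follow the standard BLEU definition (Papineni et al., 2002), and no smoothing is applied, so any zero precision gives a score of zero.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count the n-grams (as tuples) in a list of tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def modified_precision(candidate, references, n):
    """Clipped n-gram precision of the candidate against the references."""
    cand_counts = ngrams(candidate, n)
    if not cand_counts:
        return 0.0
    # For each n-gram, the largest count seen in any single reference.
    max_ref = Counter()
    for ref in references:
        for gram, count in ngrams(ref, n).items():
            max_ref[gram] = max(max_ref[gram], count)
    clipped = sum(min(c, max_ref[g]) for g, c in cand_counts.items())
    return clipped / sum(cand_counts.values())


def bleu(candidate, references, max_n=4):
    """BLEU-max_n: brevity penalty times geometric mean of precisions."""
    precisions = [modified_precision(candidate, references, n)
                  for n in range(1, max_n + 1)]
    if min(precisions) == 0.0:
        return 0.0  # no smoothing in this sketch
    log_avg = sum(math.log(p) for p in precisions) / max_n
    c = len(candidate)
    # Reference length closest to the candidate length (shortest on ties).
    r = min((abs(len(ref) - c), len(ref)) for ref in references)[1]
    brevity_penalty = 1.0 if c > r else math.exp(1 - r / c)
    return brevity_penalty * math.exp(log_avg)
```

Calling `bleu(candidate, references, max_n=1)` gives BLEU-1, the unigram-only variant mentioned above. Larger `max_n` values average the log precisions over 1- to n-grams.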
Coherence Evaluation
Neither BLEU nor ROUGE attempts to evaluate true coherence. There is no generally accepted and readily available coherence evaluation metric. Existing approaches use simple approximations such as the number of overlapping tokens or topic distribution similarity (e.g., Yan et al., 2011b; Yan et al., 2011a; Celikyilmaz and Hakkani-Tür, 2011). [Li, Jiwei, et al. "A hierarchical neural autoencoder for paragraphs and documents." ACL 2015]
CIDEr
CIDEr measures the consistency between n-gram occurrences in generated and reference sentences, where this consistency is weighted by n-gram saliency and rarity.
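One way to read "saliency and rarity" is as TF-IDF weighting: term frequency captures how salient an n-gram is within a sentence, and inverse document frequency captures how rare it is across the corpus. The Python sketch below is a simplified illustration under that assumption. All function and parameter names are hypothetical, `corpus_refs` is an assumed list holding the reference sentences of every item in the evaluation set, and stemming and the refinements of later CIDEr variants are left out.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count the n-grams (as tuples) in a list of tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def tfidf_vector(tokens, n, doc_freq, num_items):
    """Map each n-gram to term frequency times inverse document frequency."""
    counts = ngrams(tokens, n)
    total = sum(counts.values())
    return {g: (c / total) * math.log(num_items / max(doc_freq[g], 1))
            for g, c in counts.items()}


def cosine(u, v):
    """Cosine similarity of two sparse vectors stored as dicts."""
    dot = sum(w * v.get(g, 0.0) for g, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def cider(candidate, references, corpus_refs, max_n=4):
    """Average over n of the mean TF-IDF cosine similarity to the references."""
    num_items = len(corpus_refs)
    per_n = []
    for n in range(1, max_n + 1):
        # Document frequency: number of items whose references contain the n-gram.
        doc_freq = Counter()
        for refs in corpus_refs:
            seen = set()
            for ref in refs:
                seen.update(ngrams(ref, n))
            doc_freq.update(seen)
        cand_vec = tfidf_vector(candidate, n, doc_freq, num_items)
        sims = [cosine(cand_vec, tfidf_vector(ref, n, doc_freq, num_items))
                for ref in references]
        per_n.append(sum(sims) / len(sims))
    return sum(per_n) / max_n
```

An n-gram that appears in the references of every item gets an IDF of zero and so contributes nothing, which is how rarity enters the score.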
Human evaluation
Ask raters to give a subjective score, for example by using Amazon Mechanical Turk.
