natural language processing blog: machine translation
Happy new year, all... Apologies for being remiss about posting recently. (Can I say "apologies about my remission"?) This post is more of a review of what's out there than an attempt at new ideas; the citations are not meant to get anywhere close to full coverage.

There are a handful of ways of viewing the syntactic MT problem, essentially delineated by which side(s) of the translation the tree appears on, and whether the tree is induced entirely from parallel data or comes from a treebank-trained parser. There's additionally the constituent- versus dependency-tree issue.
- String-to-tree (constituent). Here, we have a string on the input and a tree on the output. In Chinese-to-English translation, this would mean that at translation time, we're essentially trying to parse Chinese into an English tree by means of local reordering. This was essentially what Yamada and Knight were doing back in 2001 and 2002. In order to train such a model, we need (a) parallel data and (b) a parser on the target language. Translation is typically done by some variant of CKY (a minimal CKY sketch appears after this list).
- Tree-to-string (constituent). Here, we map an input tree to an output string. In C2E, this means that at translation time, we first parse the Chinese, then reorder and flatten it out into an English string (a toy sketch of this reorder-and-flatten pipeline appears after this list). This is essentially what the JHU workshop on adding syntactic features to phrase-based translation was trying to do (though in a somewhat weak way), and also what Liang Huang has been doing lately. In order to train, we need (a) parallel data and (b) a parser on the source side. Translation can be done any way you want, though once parsing is done, it's usually a lot easier than running CKY.
- Tree-to-tree (constituent). Here, we map an input tree to an output tree. In C2E, this means that at translation time, we first parse the Chinese, then reorder the tree and translate subtrees/leaves into English. Koehn and Collins have worked on this problem. In order to train, we need (a) parallel data, (b) a parser on the source side and (c) a parser on the target side. Translation is similar to Tree-to-string.
- String-to-tree-to-string (constituent). I'm not sure if there's agreed-upon terminology for this task, but the idea here is to translate without having a parser on either side. At translation time, we essentially parse and translate simultaneously (rather like string-to-tree). But at training time, we don't have access to source trees: we have to induce them from just (a) parallel data. This is typified by Wu's inversion transduction grammars and more recently by David Chiang's Hiero system (plus others); see the rule sketch after this list.
- Any of the above but on dependency trees. To my knowledge, the only groups that are seriously working on this are Microsoft (see, e.g., Chris Quirk) and our friends at Charles University.
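Since a couple of the bullets above lean on CKY, here is a minimal monolingual CKY recognizer, just to show the chart structure that the synchronous (translation) variants build on. The grammar and lexicon are toy examples I made up; real string-to-tree decoders run a considerably fancier synchronous version of this same loop.

```python
# Minimal monolingual CKY recognizer over a toy CNF grammar; shown only to
# illustrate the chart that synchronous/translation variants of CKY extend.
# The grammar and lexicon are invented for this example.
from itertools import product

# Binary rules A -> B C and lexical rules A -> word, in Chomsky normal form.
BINARY = {("NP", "VP"): "S", ("V", "NP"): "VP", ("Det", "N"): "NP"}
LEXICAL = {"the": "Det", "dog": "N", "cat": "N", "saw": "V"}

def cky(words):
    n = len(words)
    # chart[i][j] = set of nonterminals spanning words[i:j]
    chart = [[set() for _ in range(n + 1)] for _ in range(n + 1)]
    for i, w in enumerate(words):
        chart[i][i + 1].add(LEXICAL[w])
    for span in range(2, n + 1):
        for i in range(n - span + 1):
            j = i + span
            for k in range(i + 1, j):
                for b, c in product(chart[i][k], chart[k][j]):
                    if (b, c) in BINARY:
                        chart[i][j].add(BINARY[(b, c)])
    return "S" in chart[0][n]

print(cky("the dog saw the cat".split()))  # -> True
```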
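And here is a toy sketch of the tree-to-string pipeline: take a (pre-built) source parse, reorder children by rule, translate the leaves, and flatten. The reordering rule and lexicon are invented for illustration and aren't meant to reflect any particular system's rule format.

```python
# Toy tree-to-string translation sketch (illustrative only; the rule and
# lexicon are made up). Given a source-side parse, we (1) reorder children
# according to per-nonterminal rules, then (2) translate the leaves and
# flatten to a target string.

# A parse tree as (label, children) tuples; leaves are plain strings.
tree = ("S",
        [("NP", ["ta"]),                       # "he"
         ("VP", [("PP", [("P", ["zai"]),       # "at"
                         ("NP", ["jia"])]),    # "home"
                 ("V", ["chifan"])])])         # "eat"

# Hypothetical reordering rule: a Chinese PP-V verb phrase becomes V-PP
# in English ("at home eats" -> "eats at home").
REORDER = {("VP", ("PP", "V")): (1, 0)}

# Hypothetical word-translation lexicon.
LEXICON = {"ta": "he", "zai": "at", "jia": "home", "chifan": "eats"}

def translate(node):
    if isinstance(node, str):                  # leaf: translate the word
        return [LEXICON.get(node, node)]
    label, children = node
    child_labels = tuple(c[0] if isinstance(c, tuple) else "LEAF"
                         for c in children)
    order = REORDER.get((label, child_labels), range(len(children)))
    out = []
    for i in order:                            # visit children in new order
        out.extend(translate(children[i]))
    return out

print(" ".join(translate(tree)))               # -> "he eats at home"
```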
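Finally, a sketch of applying a single Hiero-style synchronous rule with gaps. The rule itself follows David Chiang's well-known Chinese-English example ("yu X1 you X2" / "have X2 with X1"); the matching code is my own simplification (gaps are restricted to single source tokens), not the actual Hiero decoder.

```python
# One Hiero-style synchronous rule with co-indexed gaps:
#   X -> < yu X1 you X2 , have X2 with X1 >
# The gap fillers get translated recursively in a real system; here we just
# look them up in a hypothetical subspan-translation table.
SRC = ["yu", "X1", "you", "X2"]
TGT = ["have", "X2", "with", "X1"]

# Hypothetical sub-translations for the material filling the gaps.
SUBSPANS = {"Beihan": "North Korea", "bangjiao": "diplomatic relations"}

def match_and_translate(source_tokens):
    """Match SRC against the input, translate each gap, fill into TGT."""
    fillers, i = {}, 0
    for pat in SRC:
        if pat.startswith("X"):                # a gap: grab one source token
            fillers[pat] = SUBSPANS[source_tokens[i]]
        else:                                  # a terminal: must match exactly
            assert source_tokens[i] == pat, "rule does not match"
        i += 1
    return " ".join(fillers.get(tok, tok) for tok in TGT)

print(match_and_translate(["yu", "Beihan", "you", "bangjiao"]))
# -> "have diplomatic relations with North Korea"
```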
In table form, here's roughly how the four approaches stack up (good/med/bad on each dimension):

| | S2T | T2S | T2T | S2T2S |
|---|---|---|---|---|
| resource requirements | med | med | bad | good |
| ease of training | good | good | good | bad |
| ease of translation | bad | good | good | bad |
| n-gram LM integration | med | good | med | good |
| syntactic LM integration | good | bad | good | bad |
| assumptions | good | good | bad | good |
| extensibility | med | good | good | bad |