
natural language processing blog: machine translation

Happy new year, all... Apologies for being remiss about posting recently. (Can I say "apologies about my remission"?) This post is more of a review of what's out there than an attempt to try out new ideas. Citations are not meant to get anywhere close to full coverage.

There are a handful of ways of viewing the syntactic MT problem, essentially delineated by which side(s) of the translation the tree appears on, and whether the tree is induced entirely from data or comes from a treebank. There's additionally the constituent- versus dependency-tree issue.

  1. String-to-tree (constituent). Here, we have a string on the input and a tree on the output. In Chinese-to-English (C2E) translation, this would mean that at translation time, we're essentially trying to parse Chinese into an English tree by means of local reordering. This was essentially what Yamada and Knight were doing back in 2001 and 2002. In order to train such a model, we need (a) parallel data and (b) a parser for the target language. Translation is typically done by some variant of CKY.
  2. Tree-to-string (constituent). Here, we map an input tree to an output string. In C2E, this means that at translation time, we first parse the Chinese, then reorder and flatten it out into an English string (a rough sketch of this pipeline appears after this list). This is essentially what the JHU workshop on adding syntactic features to phrase-based translation was trying to do (though in a somewhat weak way), and also what Liang Huang has been doing lately. In order to train, we need (a) parallel data and (b) a parser on the source side. Translation can be done any way you want; once parsing is done, it's usually a lot easier than running CKY.
  3. Tree-to-tree (constituent). Here, we map an input tree to an output tree. In C2E, this means that at translation time, we first parse the Chinese, then reorder the tree and translate subtrees/leaves into English. Koehn and Collins have worked on this problem. In order to train, we need (a) parallel data, (b) a parser on the source side and (c) a parser on the target side. Translation is similar to Tree-to-string.
  4. String-to-tree-to-string (constituent). I'm not sure if there's agreed-upon terminology for this task, but the idea here is to translate without having a parser on either side. At translation time, we essentially parse and translate simultaneously (rather like string-to-tree). But at training time, we don't have access to source trees: we have to induce them from (a) parallel data alone. This is typified by Wu's inversion transduction grammars (ITG) and more recently by David Chiang's Hiero system (plus others); the second sketch after this list shows the ITG idea in miniature.
  5. Any of the above, but on dependency trees. To my knowledge, the only groups seriously working on this are Microsoft (see, e.g., Chris Quirk) and our friends at Charles University.
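
To make the tree-to-string story in item 2 concrete, here is a minimal sketch, not anyone's actual system: the source parse is a nested tuple, one hand-written rule reorders the children of one node, and leaves are translated with a toy lexicon. The labels, the reordering rule, and the dictionary are all invented for illustration; real systems learn such rules (with scores) from word-aligned parallel data.

```python
# Toy tree-to-string translation: take a source parse, locally reorder,
# translate the leaves, and flatten into a target string.
# Trees are (label, child, child, ...) tuples; leaves are plain strings.
# REORDER and LEXICON are invented for illustration only.

REORDER = {"VP": [1, 0]}  # pretend the source is verb-final: swap VP's children
LEXICON = {"ta": "he", "gou": "the dog", "kanjian": "saw"}

def translate(node):
    """Recursively translate a source tree into a list of target words."""
    if isinstance(node, str):                         # leaf: lexical translation
        return [LEXICON.get(node, node)]
    label, children = node[0], list(node[1:])
    order = REORDER.get(label, range(len(children)))  # default: keep source order
    words = []
    for i in order:
        words.extend(translate(children[i]))
    return words

# A pretend source parse: (S (NP ta) (VP (NP gou) (V kanjian)))
tree = ("S", ("NP", "ta"), ("VP", ("NP", "gou"), ("V", "kanjian")))
print(" ".join(translate(tree)))   # -> "he saw the dog" after the VP swap
```

A real tree-to-string decoder, of course, searches over many competing rule applications and lexical choices, scored together with a language model, rather than applying one deterministic rule per node.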
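And for item 4 (and the CKY decoding mentioned in item 1), here is an equally toy ITG-style decoder: every source span is translated either by a lexical entry or by splitting it in two and gluing the halves back together straight or inverted, filling a CKY chart bottom-up. Again, the lexicon and sentence are made up, and there are no probabilities and no language model; a real system (Hiero included) scores derivations with translation features and an ngram LM, and Hiero additionally allows lexicalized rules with gaps.

```python
# Toy ITG-style CKY decoder: each span is either translated lexically or
# split into two sub-spans whose translations are combined straight
# ([X1 X2]) or inverted (<X1 X2>).  No scores, no LM -- it just enumerates
# a few candidate translations per span.  The lexicon is invented.

LEXICON = {"ta": ["he"], "gou": ["the dog"], "kanjian": ["saw"]}

def itg_decode(words, max_per_span=5):
    n = len(words)
    chart = {}                                    # (i, j) -> candidate strings
    for i in range(n):                            # width-1 spans: lexical rules
        chart[(i, i + 1)] = list(LEXICON.get(words[i], [words[i]]))
    for width in range(2, n + 1):                 # wider spans, bottom-up
        for i in range(n - width + 1):
            j = i + width
            outputs = []
            for k in range(i + 1, j):             # binary split point
                for left in chart[(i, k)]:
                    for right in chart[(k, j)]:
                        outputs.append(left + " " + right)   # straight
                        outputs.append(right + " " + left)   # inverted
            chart[(i, j)] = outputs[:max_per_span]
    return chart[(0, n)]

# Pretend verb-final source; the inverted combination lets "gou kanjian"
# come out as "saw the dog".
print(itg_decode("ta gou kanjian".split()))
```

The point of the chart is the same as in string-to-tree decoding: translation and (induced) parsing happen in a single CKY pass over source spans.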
Below is my attempt to categorize the pros and cons of these different approaches along a wide variety of metrics. The metrics are:

  * resource usage: how much labeled data do we need?
  * training ease (computational): how hard is the model to train?
  * testing ease (computational): how hard is it to translate with?
  * ngram LM: is it easy to include a target ngram language model?
  * syntactic LM: is it easy to include a target syntactic language model?
  * minimal assumptions: does the model make strong assumptions about, for instance, the degree of matchy-ness between source and target?
  * extensibility: is it easy to add additional features to the input side, such as NE info, without breaking things?
|               | S2T  | T2S  | T2T  | S2T2S |
|---------------|------|------|------|-------|
| resource      | med  | med  | bad  | good  |
| train-ease    | good | good | good | bad   |
| test-ease     | bad  | good | good | bad   |
| ngram-lm      | med  | good | med  | good  |
| syntax-lm     | good | bad  | good | bad   |
| assumptions   | good | good | bad  | good  |
| extensibility | med  | good | good | bad   |

I didn't include dependencies in this list because the answer essentially depends on what the underlying model is, and then remains the same as in the constituent case. The only caveat is that syntactic language modeling with dependency trees is not as well proven as with constituency trees.

As far as syntactic language modeling goes, it is pretty much easy if you are producing trees on the target side, and not otherwise. Tree-to-tree gets the unique distinction of being bad with respect to assumptions, essentially because you have to do tree-matching, which is hard to do well. You could argue that certain forms of S2T2S should be here too (Wellington, Turian and Melamed had a paper at ACL 2006 on the assumptions that ITG makes).

In terms of resource usage, T2T is obviously the worst and S2T2S is obviously the best. Whether S2T or T2S is better depends on the language pair.

There's of course the issue of which of these works best. For this, I don't really know (I'm sure people involved in MTeval and GALE can chime in here, though of course these competitions are always "unfair" in the sense that they also measure engineering ability).

At this point, I'd like to relate this back to the recent discussion of Translation out of English. One thing to notice is that when translating into English, S2T has the advantage that all we need is an English parser (there are a few of those lying around if you search hard enough), whereas T2S requires a foreign-language parser. However, when translating from English, T2S starts looking a lot more attractive because we have good English parsers. Actually, we have a lot of good annotations for English (NE coref, for instance; discourse parsing to some degree; etc.). From this perspective, the relative ease of extensibility of T2S models begins to look pretty attractive. That is to say, if I were actually to work on translation out of English myself, I would probably seriously consider T2S models. Of course, if you're translating from X to Y, where neither X nor Y is English, then you're kind of hosed and might prefer S2T2S.