Browse DORAS
Browse Theses
Latest Additions
Creative Commons License
Except where otherwise noted, content on this site is licensed for use under a:

Incorporating translation quality-oriented features into log-linear models of machine translation

Penkale, Sergio (2011) Incorporating translation quality-oriented features into log-linear models of machine translation. PhD thesis, Dublin City University.

Full text available as:

PDF - Requires a PDF viewer such as GSview, Xpdf or Adobe Acrobat Reader


The current state-of-the-art approach to Machine Translation (MT) has limitations which could be alleviated by the use of syntax-based models. Although the benefits of syntax use in MT are becoming clear with the ongoing improvements in string-to-tree and tree-to-string systems, tree-to-tree systems such as Data Oriented Translation (DOT) have, until recently, suffered from lack of training resources, and as a consequence are currently immature, lacking key features compared to Phrase-Based Statistical MT (PB-SMT) systems. In this thesis we propose avenues to bridge the gap between our syntax-based DOT model and state-of-the-art PB-SMT systems. Noting that both types of systems score translations using probabilities not necessarily related to the quality of the translations they produce, we introduce a training mechanism which takes translation quality into account by averaging the edit distance between a translation unit and translation units used in oracle translations. This training mechanism could in principle be adapted to a very broad class of MT systems. In particular, we show how when translating Spanish sentences into English, it leads to improvements in the translation quality of both PB-SMT and DOT. In addition, we show how our method leads to a PB-SMT system which uses significantly less resources and translates significantly faster than the original, while maintaining the improvements in translation quality. We then address the issue of the limited feature set in DOT by defining a new DOT model which is able to exploit features of the complete source sentence. We introduce a feature into this new model which conditions each target word to the source-context it is associated with, and we also make the first attempt at incorporating a language model (LM) to a DOT system. We investigate different estimation methods for our lexical feature (namely Maximum Entropy and improved Kneser-Ney), reporting on their empirical performance. After describing methods which enable us to improve the efficiency of our system, and which allows us to scale to larger training data sizes, we evaluate the performance of our new model on English-to-Spanish translation, obtaining significant translation quality improvements compared to the original DOT system.

Item Type:Thesis (PhD)
Date of Award:November 2011
Supervisor(s):Way, Andy
Uncontrolled Keywords:Data Oriented Translation; DOT; PhraseBased Statistical Translation; PB-SMT
Subjects:Computer Science > Computational linguistics
Computer Science > Machine translating
Computer Science > Machine learning
DCU Faculties and Centres:Research Initiatives and Centres > Centre for Next Generation Localisation (CNGL)
DCU Faculties and Schools > Faculty of Engineering and Computing > School of Computing
Use License:This item is licensed under a Creative Commons Attribution-NonCommercial-No Derivative Works 3.0 License. View License
Funders:Science Foundation Ireland
ID Code:16464
Deposited On:02 Dec 2011 12:02 by Andrew Way. Last Modified 02 Dec 2011 12:02

Download statistics

Archive Staff Only: edit this record