Penkale, Sergio (2011) Incorporating translation quality-oriented features into log-linear models of machine translation. PhD thesis, Dublin City University.
Abstract
The current state-of-the-art approach to Machine Translation (MT) has limitations which could be alleviated by the use of syntax-based models. Although the benefits of syntax use in MT are becoming clear with the ongoing improvements in string-to-tree and tree-to-string systems, tree-to-tree systems such as Data Oriented Translation (DOT) have, until recently, suffered from a lack of training resources, and as a consequence are currently immature, lacking key features compared to Phrase-Based Statistical MT (PB-SMT) systems. In this thesis we propose avenues to bridge the gap between our syntax-based DOT model and state-of-the-art PB-SMT systems. Noting that both types of systems score translations using probabilities not necessarily related to the quality of the translations they produce, we introduce a training mechanism which takes translation quality into account by averaging the edit distance between a translation unit and the translation units used in oracle translations. This training mechanism could in principle be adapted to a very broad class of MT systems. In particular, we show that, when translating Spanish sentences into English, it leads to improvements in the translation quality of both PB-SMT and DOT. In addition, we show that our method leads to a PB-SMT system which uses significantly fewer resources and translates significantly faster than the original, while maintaining the improvements in translation quality. We then address the issue of the limited feature set in DOT by defining a new DOT model which is able to exploit features of the complete source sentence. We introduce a feature into this new model which conditions each target word on the source context it is associated with, and we also make the first attempt at incorporating a language model (LM) into a DOT system. We investigate different estimation methods for our lexical feature (namely Maximum Entropy and improved Kneser-Ney), reporting on their empirical performance. After describing methods which enable us to improve the efficiency of our system, and which allow us to scale to larger training data sizes, we evaluate the performance of our new model on English-to-Spanish translation, obtaining significant translation quality improvements compared to the original DOT system.
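The abstract describes the quality-oriented training signal only in outline: the edit distance between a translation unit and the units used in oracle translations is averaged. The following is a minimal sketch of that idea, not the thesis's actual feature. It assumes token-level Levenshtein distance with unit costs and a plain arithmetic mean; the function names `edit_distance` and `mean_distance_to_oracle` and the example phrases are hypothetical.

```python
from typing import List, Sequence


def edit_distance(a: Sequence[str], b: Sequence[str]) -> int:
    """Token-level Levenshtein distance (assumed: unit cost per edit)."""
    # prev[j] holds the distance between the first i-1 tokens of a
    # and the first j tokens of b.
    prev = list(range(len(b) + 1))
    for i, tok_a in enumerate(a, start=1):
        curr = [i]
        for j, tok_b in enumerate(b, start=1):
            cost = 0 if tok_a == tok_b else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def mean_distance_to_oracle(unit: Sequence[str],
                            oracle_units: List[Sequence[str]]) -> float:
    """Average edit distance from one unit to a set of oracle units.

    Hypothetical helper: how oracle units are selected and how the
    average is turned into a model feature is not stated in the abstract.
    """
    if not oracle_units:
        raise ValueError("at least one oracle unit is required")
    total = sum(edit_distance(unit, o) for o in oracle_units)
    return total / len(oracle_units)


if __name__ == "__main__":
    candidate = "the red house".split()
    oracles = ["the red house".split(),
               "the house".split(),
               "a red home".split()]
    print(mean_distance_to_oracle(candidate, oracles))
```

Under these assumptions a lower average means the unit is closer to what the oracle translations used, so a score derived from it could serve as one feature in a log-linear model alongside the usual probabilities.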
Metadata
Item Type: | Thesis (PhD) |
---|---|
Date of Award: | November 2011 |
Refereed: | No |
Supervisor(s): | Way, Andy |
Uncontrolled Keywords: | Data Oriented Translation; DOT; Phrase-Based Statistical Translation; PB-SMT |
Subjects: | Computer Science > Computational linguistics; Computer Science > Machine translating; Computer Science > Machine learning |
DCU Faculties and Centres: | Research Institutes and Centres > Centre for Next Generation Localisation (CNGL); DCU Faculties and Schools > Faculty of Engineering and Computing > School of Computing |
Use License: | This item is licensed under a Creative Commons Attribution-NonCommercial-No Derivative Works 3.0 License. |
Funders: | Science Foundation Ireland |
ID Code: | 16464 |
Deposited On: | 02 Dec 2011 12:02 by Andrew Way. Last Modified 19 Jul 2018 14:54 |