Haque, Rejwanul ORCID: 0000-0003-1680-0099 (2011) Integrating source-language context into log-linear models of statistical machine translation. PhD thesis, Dublin City University.
Abstract
The translation features typically used in state-of-the-art statistical machine translation (SMT) model dependencies between the source and target phrases, but not among the phrases in the source language themselves. A swathe of research has demonstrated that integrating source context modelling directly into log-linear phrase-based SMT (PB-SMT) and hierarchical PB-SMT (HPB-SMT) can positively influence the weighting and selection of target phrases, and thus improve translation quality. In this thesis we present novel approaches to incorporating source-language contextual modelling into state-of-the-art SMT models in order to enhance the quality of lexical selection. We investigate the effectiveness of a range of contextual features, including lexical features of neighbouring words, part-of-speech tags, supertags, sentence-similarity features, dependency information, and semantic roles. We explore a series of language pairs featuring typologically different languages, and examine the scalability of our research to larger amounts of training data.
While our results are mixed across feature selections, language pairs, and learning curves, we observe that including contextual features of the source sentence in general produces improvements. The most significant improvements involve the integration of long-distance contextual features, such as dependency relations in combination with part-of-speech tags in Dutch-to-English subtitle translation, the combination of dependency parse and semantic role information in English-to-Dutch parliamentary debate translation, supertag features in English-to-Chinese translation, or a combination of supertag and lexical features in English-to-Dutch subtitle translation. Furthermore, we investigate the applicability of our lexical contextual model in another closely related NLP problem, namely machine transliteration.
Metadata
Item Type: | Thesis (PhD) |
---|---|
Date of Award: | November 2011 |
Refereed: | No |
Supervisor(s): | Way, Andy |
Uncontrolled Keywords: | source-language; context |
Subjects: | Computer Science > Machine translating; Computer Science > Computational linguistics; Computer Science > Machine learning |
DCU Faculties and Centres: | Research Institutes and Centres > Centre for Next Generation Localisation (CNGL); DCU Faculties and Schools > Faculty of Engineering and Computing > School of Computing |
Use License: | This item is licensed under a Creative Commons Attribution-NonCommercial-No Derivative Works 3.0 License. |
Funders: | Science Foundation Ireland |
ID Code: | 16458 |
Deposited On: | 02 Dec 2011 11:32 by Andrew Way. Last Modified 13 Aug 2020 16:03 |
Documents
Full text available as: PDF (1MB)