Srivastava, Ankit Kumar (2014) Phrase extraction and rescoring in statistical machine translation. PhD thesis, Dublin City University.
Abstract
The lack of linguistically motivated translation units or phrase pairs in Phrase-based Statistical Machine Translation (PB-SMT) systems is a well-known source of error. One approach to minimise such errors is to supplement the standard PB-SMT models with phrase pairs extracted from parallel treebanks (linguistically annotated and aligned corpora). In this thesis, we extend the treebank-based phrase extraction framework with percolated dependencies – a hitherto unutilised knowledge source – and evaluate its usability through more than a dozen syntax-aware phrase extraction models.
However, the improvement in system performance is neither consistent nor conclusive despite the proven advantages of linguistically motivated phrase pairs. This leads us to hypothesise that the PB-SMT pipeline is flawed as it often fails to access perfectly good phrase pairs while searching for the highest scoring translation (decoding). A model error occurs when the highest-probability translation (actual output of a PB-SMT system) according to a statistical machine translation model is not the most accurate translation it can produce. In the second part of this thesis, we identify and attempt to trace these model errors across state-of-the-art PB-SMT decoders by locating the position of oracle translations (the translation most similar to a reference translation or expected output of a PB-SMT system) in the n-best lists generated by a PB-SMT decoder. We analyse the impact of individual decoding features on the quality of translation output and introduce two rescoring algorithms to minimise the low ranking of oracles in the n-best lists. Finally, we extend our oracle-based rescoring approach to a reranking framework by rescoring the n-best lists with additional reranking features. We observe limited but optimistic success and conclude by speculating on how our oracle-based rescoring of n-best lists can help the PB-SMT system (supplemented with multiple treebank-based phrase extractions) get optimal performance out of linguistically motivated phrase pairs.
Metadata
Item Type: | Thesis (PhD) |
---|---|
Date of Award: | November 2014 |
Refereed: | No |
Supervisor(s): | Way, Andy |
Uncontrolled Keywords: | Phrase-based Statistical Machine Translation (PB-SMT) systems; Treebank-based phrase extraction framework |
Subjects: | Computer Science > Machine translating; Computer Science > Computational linguistics; Computer Science > Machine learning |
DCU Faculties and Centres: | Research Institutes and Centres > Centre for Next Generation Localisation (CNGL); DCU Faculties and Schools > Faculty of Engineering and Computing > School of Computing |
Use License: | This item is licensed under a Creative Commons Attribution-NonCommercial-No Derivative Works 3.0 License. |
Funders: | Science Foundation Ireland |
ID Code: | 19971 |
Deposited On: | 04 Dec 2014 11:30 by Andrew Way. Last Modified 19 Jul 2018 15:03 |
Documents
Full text available as: PDF (2MB)