Both Statistical Machine Translation (SMT) and Neural Machine Translation (NMT) are data-driven approaches to Machine Translation (MT). A prerequisite for both is a large volume of training data from which to estimate good statistical models. However, even when large training corpora are available for MT, finding training data similar to a specific domain of interest remains difficult. MT models trained on limited in-domain data cannot achieve sufficient coverage of the linguistic phenomena in that domain, which makes this a very challenging task. Because word meanings, genres and topics differ between domains, adding data from other domains can increase the dissimilarity between the training and test data, and so reduce translation quality. This challenge is known in the literature as the 'domain adaptation' problem. In this thesis, we investigate domain adaptation in two different scenarios, namely a domain-awareness scenario and a domain-unawareness scenario.
In the domain-awareness scenario, the domain information is given explicitly in the training data. We are interested in developing domain-adaptation techniques which transfer knowledge gained from other domains to a desired domain. In the approach proposed here, probabilistic domain-likeness features for words are estimated from their contexts rather than from the words themselves. We then apply those features to the combined translation models in an MT system. We empirically show that translation quality can be significantly improved compared with previous related work.
We then turn our attention to recently proposed neural network training. We describe a domain-adaptation approach which can exploit large pre-trained word vector models. We evaluate our approach on both language modelling and machine translation tasks to demonstrate its efficiency, effectiveness and flexibility in the domain-awareness scenario.

In the domain-unawareness scenario, the domain information is not given explicitly in the training data. The training data is heterogeneous, e.g. originating from tens or even hundreds of different sources without well-defined domain labels. We overcome this challenge by deriving topic information from the training corpora using well-established topic modelling algorithms. In this scenario, we pay particular attention to the most recent NMT framework, concerning ourselves with making better lexical choices and improving overall translation quality. Experimentally, we show that our model makes better lexical choices, improves overall translation quality and reduces the number of unknown words.