Hybrid data-driven models of machine translation
Groves, Declan (2007) Hybrid data-driven models of machine translation. PhD thesis, Dublin City University.
Full text available as:
Corpus-based approaches to Machine Translation (MT) dominate the MT research field today, with Example-Based MT (EBMT) and Statistical MT (SMT) representing two different frameworks within the data-driven paradigm. EBMT has always made use of both phrasal and lexical correspondences to produce high-quality translations. Early SMT models, on the other hand, were based on word-level correpsondences, but with the advent of more sophisticated phrase-based approaches, the line between EBMT and SMT has become increasingly blurred.
In this thesis we carry out a number of translation experiments comparing the performance of the state-of-the-art marker-based EBMT system of Gough and Way (2004a, 2004b), Way and Gough (2005) and Gough (2005) against a phrase-based SMT (PBSMT) system built using the state-of-the-art PHARAOphHra se-based decoder (Koehn, 2004a) and employing standard phrasal extraction in euristics (Koehn et al., 2003). In additin e describe experiments investigating the possibility of combining elements of EBMT and SMT in order to create a hybrid data-driven model of MT capable of outperforming either approach from which it is derived.
Making use of training and testlng data taken from a French-Enghsh translation memory of Sun Microsystems computer documentation, we find that while better results are seen when the PBSMT system is seeded with GIZA++ word- and phrasebased data compared to EBMT marker-based sub-sentential alignments, in general improvements are obtained when combinations of this 'hybrid' data are used to construct the translation and probability models. While for the most part the baseline marker-based EBMT system outperforms any flavour of the PBSbIT systems constructed in these experiments, combining the data sets automatically induced by both GIZA++ and the EBMT system leads to a hybrid system which improves on the EBMT system per se for French-English.
On a different data set, taken from the Europarl corpus (Koehn, 2005), we perform a number of experiments maklng use of incremental training data sizes of 78K, 156K and 322K sentence pairs. On this data set, we show that similar gains are to be had from constructing a hybrid 'statistical EBMT' system capable of outperforming the baseline EBMT system. This time around, although all 'hybrid' variants of the EBMT system fall short of the quality achieved by the baseline PBSMT system, merging elements of the marker-based and SMT data, as in the Sun Mzcrosystems experiments, to create a hybrid 'example-based SMT' system, outperforms the baseline SMT and EBMT systems from which it is derlved. Furthermore, we provide further evidence in favour of hybrid data-dr~ven approaches by adding an SMT target language model to all EBMT system variants and demonstrate that this too has a positive effect on translation quality.
Following on from these findings we present a new hybrid data-driven MT architecture, together with a novel marker-based decoder which improves upon the performance of the marker-based EBMT system of Gough and Way (2004a, 2004b), Way and Gough (2005) and Gough (2005), and compares favourably with the stateof-the-art PHARAOH SMHT decoder (Koehn, 2004a).
Archive Staff Only: edit this record