Lattice score based data cleaning for phrase-based statistical machine translation
Jiang, Jie, Way, Andy (ORCID: 0000-0001-5736-5930) and Carson-Berndsen, Julie
(2010)
Lattice score based data cleaning for phrase-based statistical machine translation.
In: EAMT 2010 - 14th Annual Conference of the European Association for Machine Translation, 27-28 May 2010, Saint-Raphaël, France.
Statistical machine translation relies heavily
on parallel corpora to train its models
for translation tasks. While more and
more bilingual corpora are readily available,
the quality of the sentence pairs
should be taken into consideration. This
paper presents a novel lattice score-based
data cleaning method that selects suitable sentence
pairs from those extracted from a
bilingual corpus by sentence alignment
methods. The proposed method proceeds
in four steps: first, an initial phrase-based
model is trained on the full sentence-aligned
corpus; second, for each sentence
pair in the corpus, word alignments
are used to create anchor pairs and source-side
lattices; third, based on the translation
model, target-side phrase networks
are expanded on the lattices and Viterbi
search is used to find approximate decoding
results; finally, BLEU score thresholds
are used to filter out low-scoring
sentence pairs from the training data.
Our experiments on the FBIS corpus
showed an improvement in BLEU score
from 23.78 to 24.02 on Chinese-English translation.
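
The final filtering step of the method can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes that each sentence pair already has an approximate decoding result (here a plain string), that this result is scored against the target side of the pair, and that an add-one smoothed sentence-level BLEU is an acceptable stand-in for the score used in the paper. The names sentence_bleu and filter_pairs and the threshold value are hypothetical.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count the n-grams of order n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def sentence_bleu(hypothesis, reference, max_n=4):
    """Sentence-level BLEU in [0, 1] with add-one smoothing (an assumption)."""
    if not hypothesis or not reference:
        return 0.0
    log_precision = 0.0
    for n in range(1, max_n + 1):
        hyp_counts = ngrams(hypothesis, n)
        ref_counts = ngrams(reference, n)
        # Clipped matches: each hypothesis n-gram counts at most as often
        # as it occurs in the reference.
        matches = sum(min(c, ref_counts[g]) for g, c in hyp_counts.items())
        total = sum(hyp_counts.values())
        # Add-one smoothing keeps the geometric mean defined when an
        # n-gram order has no matches.
        log_precision += math.log((matches + 1) / (total + 1)) / max_n
    # Brevity penalty for hypotheses shorter than the reference.
    brevity = min(1.0, math.exp(1 - len(reference) / len(hypothesis)))
    return brevity * math.exp(log_precision)


def filter_pairs(pairs, hypotheses, threshold):
    """Keep (source, target) pairs whose hypothesis scores at least threshold."""
    kept = []
    for (source, target), hypothesis in zip(pairs, hypotheses):
        score = sentence_bleu(hypothesis.split(), target.split())
        if score >= threshold:
            kept.append((source, target))
    return kept
```

Calling filter_pairs(pairs, hypotheses, 0.5) returns only the pairs whose approximate decoding result is close enough to the target sentence. A suitable threshold would have to be tuned on held-out data.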