Jiang, Jie, Way, Andy ORCID: 0000-0001-5736-5930 and Carson-Berndsen, Julie (2010) Lattice score based data cleaning for phrase-based statistical machine translation. In: EAMT 2010 - 14th Annual Conference of the European Association for Machine Translation, 27-28 May 2010, Saint-Raphaël, France.
Abstract
Statistical machine translation relies heavily on parallel corpora to train its models for translation tasks. While more and more bilingual corpora are readily available, the quality of the sentence pairs should be taken into consideration. This paper presents a novel lattice score-based data cleaning method to select proper sentence pairs from the ones extracted from a bilingual corpus by sentence alignment methods. The proposed method is carried out as follows: first, an initial phrase-based model is trained on the full sentence-aligned corpus; second, for each sentence pair in the corpus, word alignments are used to create anchor pairs and source-side lattices; third, based on the translation model, target-side phrase networks are expanded on the lattices and Viterbi search is used to find approximate decoding results; finally, BLEU score thresholds are used to filter out the low-scoring sentence pairs, which cleans the data. Our experiments on the FBIS corpus showed an improvement in BLEU score from 23.78 to 24.02 for Chinese-English.
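The final filtering step described in the abstract can be illustrated with a minimal sketch. It assumes the approximate decoding result for each sentence pair is already available as a string and scores it against the target side of the pair. The abstract does not state the BLEU variant or the threshold values, so the add-one smoothed sentence-level BLEU, the function names `sentence_bleu` and `filter_pairs`, and the threshold in the test are all hypothetical choices for illustration, not the authors' implementation.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return a Counter of all n-grams of order n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def sentence_bleu(hypothesis, reference, max_n=4):
    """Add-one smoothed sentence-level BLEU in [0, 1] (an assumed variant)."""
    if not hypothesis or not reference:
        return 0.0
    log_precision = 0.0
    for n in range(1, max_n + 1):
        hyp_counts = ngrams(hypothesis, n)
        ref_counts = ngrams(reference, n)
        # Clipped matches: each hypothesis n-gram counts at most as often
        # as it occurs in the reference.
        matches = sum(min(c, ref_counts[g]) for g, c in hyp_counts.items())
        total = sum(hyp_counts.values())
        # Add-one smoothing avoids log(0); orders longer than the
        # hypothesis contribute log(1) = 0.
        log_precision += math.log((matches + 1) / (total + 1))
    # Brevity penalty for hypotheses shorter than the reference.
    if len(hypothesis) >= len(reference):
        bp = 1.0
    else:
        bp = math.exp(1 - len(reference) / len(hypothesis))
    return bp * math.exp(log_precision / max_n)


def filter_pairs(pairs, approx_translations, threshold):
    """Keep (source, target) pairs whose approximate translation scores
    at least `threshold` against the target side (hypothetical helper)."""
    kept = []
    for (src, tgt), hyp in zip(pairs, approx_translations):
        if sentence_bleu(hyp.split(), tgt.split()) >= threshold:
            kept.append((src, tgt))
    return kept
```

In this sketch a pair is kept only when the model's approximate output resembles the target sentence, on the assumption that a poorly aligned pair yields a low score.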
Metadata
Item Type: | Conference or Workshop Item (Paper) |
---|---|
Event Type: | Conference |
Refereed: | Yes |
Subjects: | Computer Science > Machine translating |
DCU Faculties and Centres: | Research Institutes and Centres > Centre for Next Generation Localisation (CNGL) |
Published in: | Proceedings of the 14th Annual Conference of the EAMT. European Association for Machine Translation. |
Publisher: | European Association for Machine Translation |
Official URL: | http://www.mt-archive.info/EAMT-2010-TOC.htm |
Funders: | Science Foundation Ireland |
ID Code: | 15789 |
Deposited On: | 09 Nov 2010 16:59 by Shane Harper. Last Modified 09 Nov 2018 15:33 |
Documents
Full text available as: PDF (191kB)