Data cleaning for word alignment

Okita, Tsuyoshi

Okita, Tsuyoshi (2009) Data cleaning for word alignment. In: ACL-IJCNLP 2009 - Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP, 2-7 August 2009, Singapore.

Abstract

Parallel corpora are made by human beings. However, as an MT system is an aggregation of state-of-the-art NLP technologies without any human intervention, it is unavoidable that quite a few sentence pairs are beyond its analysis and will therefore not contribute to the system. Furthermore, they may in turn work against our objectives and make the overall performance worse. Possible unfavorable items are n:m mapping objects, such as paraphrases, non-literal translations, and multiword expressions. This paper presents a pre-processing method which detects such unfavorable items before supplying them to the word aligner, under the assumption that their frequency is low, such as below 5 percent. We show an improvement in BLEU score from 28.0 to 31.4 in English-Spanish and from 16.9 to 22.1 in German-English.

Metadata

Item Type:	Conference or Workshop Item (Paper)
Event Type:	Conference
Refereed:	Yes
Subjects:	Computer Science > Machine translating
DCU Faculties and Centres:	Research Institutes and Centres > Centre for Next Generation Localisation (CNGL); Research Institutes and Centres > National Centre for Language Technology (NCLT); DCU Faculties and Schools > Faculty of Engineering and Computing > School of Computing
Published in:	Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP. Association for Computational Linguistics.
Publisher:	Association for Computational Linguistics
Official URL:	http://www.aclweb.org/anthology/P/P09/
Copyright Information:	© 2009 ACL and AFNLP
Use License:	This item is licensed under a Creative Commons Attribution-NonCommercial-Share Alike 3.0 License.
Funders:	Science Foundation Ireland, SFI 07/CE/I1142
ID Code:	15178
Deposited On:	15 Feb 2010 15:50 by DORAS Administrator. Last Modified 19 Jul 2018 14:49

Documents

Full text available as:

PDF - Requires a PDF viewer such as GSview, Xpdf or Adobe Acrobat Reader
599kB
