Benchmarking SMT performance for Farsi using the TEP++ Corpus

Passban, Peyman; Way, Andy; Liu, Qun

Passban, Peyman, Way, Andy ORCID: 0000-0001-5736-5930 and Liu, Qun ORCID: 0000-0002-7000-1792 (2015) Benchmarking SMT performance for Farsi using the TEP++ Corpus. In: 18th Annual Conference of the European Association for Machine Translation, 11-13 May 2015, Antalya, Turkey.

Abstract

Statistical machine translation (SMT) suffers from various problems which are exacerbated where training data is in short supply. In this paper we address the data sparsity problem in the Farsi (Persian) language and introduce a new parallel corpus, TEP++. Compared to previous results, the new dataset is more efficient for Farsi SMT engines and yields better output. In our experiments using TEP++ as bilingual training data and BLEU as a metric, we achieved improvements of +11.17 (60%) and +7.76 (63.92%) in the Farsi–English and English–Farsi directions, respectively. Furthermore, we describe an engine (SF2FF) to translate between formal and informal Farsi, which in terms of syntax and terminology can be seen as different languages. The SF2FF engine also works as an intelligent normalizer for Farsi texts. To demonstrate its use, SF2FF was used to clean the IWSLT–2013 dataset to produce normalized data, which gave improvements in translation quality over FBK’s Farsi engine when used as training data.
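The abstract quotes each BLEU gain both as an absolute difference and as a percentage. As a rough reading aid, the sketch below back-computes the baseline scores those pairs would imply, assuming the percentages are relative improvements over a baseline (the abstract does not say so explicitly). The function name `implied_baseline` is hypothetical, and the derived values are approximations, not figures reported in the paper.

```python
def implied_baseline(absolute_gain: float, relative_gain_pct: float) -> float:
    """Return the baseline score implied by an absolute gain and a relative gain.

    Assumes relative_gain_pct = 100 * absolute_gain / baseline, which is an
    interpretation of the abstract, not something it states.
    """
    if relative_gain_pct <= 0:
        raise ValueError("relative gain must be positive")
    return absolute_gain / (relative_gain_pct / 100.0)


# Figures quoted in the abstract: (absolute BLEU gain, relative gain in percent).
reported = {
    "Farsi-English": (11.17, 60.0),
    "English-Farsi": (7.76, 63.92),
}

for direction, (gain, pct) in reported.items():
    baseline = implied_baseline(gain, pct)
    # The new score is simply the implied baseline plus the reported gain.
    print(
        f"{direction}: implied baseline ~{baseline:.2f} BLEU, "
        f"implied new score ~{baseline + gain:.2f} BLEU"
    )
```

Because the first percentage appears to be rounded to a whole number, the implied baseline for Farsi–English is only accurate to a few tenths of a BLEU point.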

Metadata

Item Type:	Conference or Workshop Item (Paper)
Event Type:	Conference
Refereed:	Yes
Uncontrolled Keywords:	BLEU; SF2FF engine; FBK
Subjects:	Computer Science > Machine translating
DCU Faculties and Centres:	DCU Faculties and Schools > Faculty of Engineering and Computing > School of Computing; Research Institutes and Centres > ADAPT
Published in:	Proceedings of the 18th Annual Conference of the European Association for Machine Translation. Association for Computational Linguistics.
Publisher:	Association for Computational Linguistics
Official URL:	https://www.aclweb.org/anthology/W15-4911
Copyright Information:	© 2015 The Authors
Use License:	This item is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 3.0 License.
Funders:	Science Foundation Ireland through the CNGL Programme (Grant 12/CE/I2267) in the ADAPT Centre (www.adaptcentre.ie) at Dublin City University.
ID Code:	23218
Deposited On:	01 May 2019 15:32. Last Modified: 01 May 2019 15:32

Documents

Full text available as: Benchmarking SMT Performance for Farsi Using the TEP++ Corpus.pdf (PDF, 193kB)


DORAS | DCU Research Repository
