Word alignment and smoothing methods in statistical machine translation: Noise, prior knowledge and overfitting

Okita, Tsuyoshi


Okita, Tsuyoshi (2012) Word alignment and smoothing methods in statistical machine translation: Noise, prior knowledge and overfitting. PhD thesis, Dublin City University.


Abstract

This thesis discusses how to incorporate linguistic knowledge into an SMT system. Although one important category of linguistic knowledge is that obtained by a constituent / dependency parser, a POS / super tagger, and a morphological analyser, linguistic knowledge here covers a broader range: Multi-Word Expressions, Out-Of-Vocabulary words, paraphrases, lexical semantics (or non-literal translations), named-entities, coreferences, and transliterations. The first discussion concerns word alignment, for which we propose an MWE-sensitive word aligner. The second discussion concerns smoothing methods for a language model and a translation model, for which we propose a hierarchical Pitman-Yor process-based smoothing method. The common ground for these discussions is the examination of three exceptional cases from real-world data: the presence of noise, the availability of prior knowledge, and the problem of overfitting. A notable characteristic of this design is the careful use of (Bayesian) priors so that the models can capture both frequent and linguistically important phenomena. This offers one example of how to address a problem of statistical models, which often aim to learn from frequent examples only and overlook less frequent but linguistically important phenomena.

Metadata

Item Type:	Thesis (PhD)
Date of Award:	March 2012
Refereed:	No
Supervisor(s):	Way, Andy
Uncontrolled Keywords:	statistical machine translation; SMT; Multi-Word Expressions; linguistic knowledge
Subjects:	Computer Science > Computational linguistics; Computer Science > Machine translating
DCU Faculties and Centres:	Research Institutes and Centres > Centre for Next Generation Localisation (CNGL); DCU Faculties and Schools > Faculty of Engineering and Computing > School of Computing
Use License:	This item is licensed under a Creative Commons Attribution-NonCommercial-No Derivative Works 3.0 License.
Funders:	Science Foundation Ireland
ID Code:	16759
Deposited On:	28 Mar 2012 13:20 by Declan Groves. Last Modified 19 Jul 2018 14:55

Documents

Full text available as:


PDF (Word Alignment and Smoothing Methods in Statistical Machine Translation: Noise, Prior Knowledge and Overfitting) - Requires a PDF viewer such as GSview, Xpdf or Adobe Acrobat Reader
1MB


DORAS | DCU Research Repository
