Word alignment and smoothing methods in statistical machine translation: Noise, prior knowledge and overfitting

Okita, Tsuyoshi (2012) Word alignment and smoothing methods in statistical machine translation: Noise, prior knowledge and overfitting. PhD thesis, Dublin City University.


This thesis discusses how to incorporate linguistic knowledge into a statistical machine translation (SMT) system. One important category of linguistic knowledge is that obtained from a constituent or dependency parser, a POS tagger or supertagger, and a morphological analyser, but linguistic knowledge here covers a broader range: Multi-Word Expressions (MWEs), Out-Of-Vocabulary words, paraphrases, lexical semantics (or non-literal translations), named entities, coreferences, and transliterations. The first discussion concerns word alignment, for which we propose an MWE-sensitive word aligner. The second concerns smoothing methods for a language model and a translation model, for which we propose a hierarchical Pitman-Yor process-based smoothing method. The common ground for these discussions is the examination of three exceptional cases from real-world data: the presence of noise, the availability of prior knowledge, and the problem of overfitting. A notable characteristic of this design is the careful use of (Bayesian) priors so that it can capture both frequent and linguistically important phenomena. This can be seen as one example of how to address a weakness of statistical models, which often aim to learn from frequent examples only and overlook less frequent but linguistically important phenomena.

Item Type: Thesis (PhD)
Date of Award: March 2012
Supervisor(s): Way, Andy
Uncontrolled Keywords: statistical machine translation; SMT; Multi-Word Expressions; linguistic knowledge
Subjects: Computer Science > Computational linguistics; Computer Science > Machine translating
DCU Faculties and Centres: Research Initiatives and Centres > Centre for Next Generation Localisation (CNGL); DCU Faculties and Schools > Faculty of Engineering and Computing > School of Computing
Use License: This item is licensed under a Creative Commons Attribution-NonCommercial-No Derivative Works 3.0 License.
Funders: Science Foundation Ireland
ID Code: 16759
Deposited On: 28 Mar 2012 14:20 by Declan Groves. Last Modified 28 Mar 2012 14:20
