DORAS | DCU Research Repository

Word alignment and smoothing methods in statistical machine translation: Noise, prior knowledge and overfitting

Okita, Tsuyoshi (2012) Word alignment and smoothing methods in statistical machine translation: Noise, prior knowledge and overfitting. PhD thesis, Dublin City University.

This thesis discusses how to incorporate linguistic knowledge into an SMT system. One important category of linguistic knowledge is that obtained from a constituent / dependency parser, a POS / super tagger, or a morphological analyser, but linguistic knowledge here covers a wider range than this: Multi-Word Expressions (MWEs), Out-Of-Vocabulary words, paraphrases, lexical semantics (or non-literal translations), named entities, coreferences, and transliterations. The first discussion concerns word alignment, where we propose an MWE-sensitive word aligner. The second discussion concerns smoothing methods for a language model and a translation model, where we propose a hierarchical Pitman-Yor process-based smoothing method. The common ground for these discussions is the examination of three exceptional cases from real-world data: the presence of noise, the availability of prior knowledge, and the problem of overfitting. A notable characteristic of this design is the careful use of (Bayesian) priors so that the model can capture both frequent and linguistically important phenomena. This can be considered one example of how to address a problem of statistical models, which often aim to learn from frequent examples only and overlook less frequent but linguistically important phenomena.
Item Type: Thesis (PhD)
Date of Award: March 2012
Supervisor(s): Way, Andy
Uncontrolled Keywords: statistical machine translation; SMT; Multi-Word Expressions; linguistic knowledge
Subjects: Computer Science > Computational linguistics
Computer Science > Machine translating
DCU Faculties and Centres: Research Institutes and Centres > Centre for Next Generation Localisation (CNGL)
DCU Faculties and Schools > Faculty of Engineering and Computing > School of Computing
Use License: This item is licensed under a Creative Commons Attribution-NonCommercial-No Derivative Works 3.0 License.
Funders: Science Foundation Ireland
ID Code: 16759
Deposited On: 28 Mar 2012 13:20 by Declan Groves. Last Modified 19 Jul 2018 14:55

