Chrupała, Grzegorz (2006) Simple data-driven context-sensitive lemmatization. In: SEPLN 2006, 13-15 September 2006, Zaragoza, Spain.
Abstract
Lemmatization for languages with rich inflectional morphology is one of the basic, indispensable steps in a language processing pipeline. In this paper we present a simple data-driven context-sensitive approach to lemmatizing word forms in running text. We treat lemmatization as a classification task for Machine Learning, and automatically induce class labels. We achieve this by computing a Shortest Edit Script (SES) between reversed input and output strings. An SES describes the transformations that have to be applied to the input string (word form) in order to convert it to the output string (lemma). Our approach shows competitive performance on a range of typologically different languages.
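The class-label induction described in the abstract can be illustrated with a minimal sketch. It is not the paper's implementation: it assumes an SES made only of insertions and deletions (derived from a longest-common-subsequence table), with positions indexed into the reversed word form. The function names `shortest_edit_script`, `apply_script`, `lemma_class` and `lemmatize` are hypothetical, chosen for this illustration.

```python
def shortest_edit_script(src, dst):
    """Return a minimal insert/delete script turning src into dst.

    Assumption: the script is a tuple of (op, position, char) triples,
    where position indexes into src. The paper's exact encoding may differ.
    """
    n, m = len(src), len(dst)
    # lcs[i][j] = length of the longest common subsequence of src[i:] and dst[j:]
    lcs = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n - 1, -1, -1):
        for j in range(m - 1, -1, -1):
            if src[i] == dst[j]:
                lcs[i][j] = lcs[i + 1][j + 1] + 1
            else:
                lcs[i][j] = max(lcs[i + 1][j], lcs[i][j + 1])
    script = []
    i = j = 0
    while i < n or j < m:
        if i < n and j < m and src[i] == dst[j]:
            i += 1  # characters match: keep, no instruction needed
            j += 1
        elif j < m and (i == n or lcs[i][j + 1] >= lcs[i + 1][j]):
            script.append(("I", i, dst[j]))  # insert dst[j] before src[i]
            j += 1
        else:
            script.append(("D", i, src[i]))  # delete src[i]
            i += 1
    return tuple(script)


def apply_script(script, src):
    """Apply an insert/delete script to src, using positions only."""
    inserts, deletes = {}, set()
    for op, pos, ch in script:
        if op == "I":
            inserts.setdefault(pos, []).append(ch)
        else:
            deletes.add(pos)
    out = []
    for i in range(len(src) + 1):
        out.extend(inserts.get(i, []))  # insertions come before position i
        if i < len(src) and i not in deletes:
            out.append(src[i])
    return "".join(out)


def lemma_class(form, lemma):
    """Class label: the SES between the reversed form and the reversed lemma."""
    return shortest_edit_script(form[::-1], lemma[::-1])


def lemmatize(form, label):
    """Apply a class label to a (possibly unseen) word form."""
    return apply_script(label, form[::-1])[::-1]


label = lemma_class("walked", "walk")
print(label)                       # (('D', 0, 'd'), ('D', 1, 'e'))
print(lemmatize("jumped", label))  # jump
```

Reversing the strings makes positions count from the end of the word, so suffix changes on words of different lengths ("walked", "asked") map to the same label, which is what lets a classifier generalize to unseen forms. A full system would additionally predict the label from context features; that part is not shown here.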
Metadata
Item Type: | Conference or Workshop Item (Paper) |
---|---|
Event Type: | Conference |
Refereed: | Yes |
Uncontrolled Keywords: | lemmatization |
Subjects: | Computer Science > Machine learning |
DCU Faculties and Centres: | Research Institutes and Centres > National Centre for Language Technology (NCLT) |
Official URL: | http://www.unizar.es/departamentos/filologia_ingle... |
Use License: | This item is licensed under a Creative Commons Attribution-NonCommercial-Share Alike 3.0 License. |
Funders: | Science Foundation Ireland, SFI 04/IN/I527 |
ID Code: | 15272 |
Deposited On: | 10 Mar 2010 14:30 by DORAS Administrator. Last Modified 19 Jul 2018 14:50 |
Documents
Full text available as: PDF (157kB)