DORAS | DCU Research Repository

Treebanks gone bad: parser evaluation and retraining using a treebank of ungrammatical sentences

Foster, Jennifer (ORCID: 0000-0002-7789-4853) (2007) Treebanks gone bad: parser evaluation and retraining using a treebank of ungrammatical sentences. International Journal of Document Analysis and Recognition (IJDAR), 10 (3-4). pp. 129-145. ISSN 1433-2833

Abstract
This article describes how a treebank of ungrammatical sentences can be created from a treebank of well-formed sentences. The treebank creation procedure involves the automatic introduction of frequently occurring grammatical errors into the sentences in an existing treebank, and the minimal transformation of the original analyses in the treebank so that they describe the newly created ill-formed sentences. Such a treebank can be used to test how well a parser is able to ignore grammatical errors in texts (as people do), and can be used to induce a grammar capable of analysing such sentences. This article demonstrates these two applications using the Penn Treebank. In a robustness evaluation experiment, two state-of-the-art statistical parsers are evaluated on an ungrammatical version of Section 23 of the Wall Street Journal (WSJ) portion of the Penn Treebank. This experiment shows that the performance of both parsers degrades with grammatical noise. A breakdown by error type is provided for both parsers. A second experiment retrains both parsers using an ungrammatical version of WSJ Sections 2–21. This experiment indicates that an ungrammatical treebank is a useful resource in improving parser robustness to grammatical errors, but that the correct combination of grammatical and ungrammatical training data has yet to be determined.
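The treebank creation procedure described in the abstract can be illustrated with a minimal sketch. The abstract does not name the error types or the transformation rules, so everything here is an assumption made for illustration: the error is a single missing word, the "minimal transformation" is taken to mean deleting that word's leaf and pruning any constituent left empty, and the function names (`parse_tree`, `delete_word`, `to_string`, `leaves`) are hypothetical rather than taken from the article.

```python
import re

def parse_tree(s):
    """Parse a Penn-style bracketed string into nested lists.
    A node is [label, child, ...]; a leaf word is a plain string."""
    tokens = re.findall(r"\(|\)|[^\s()]+", s)
    pos = 0

    def read():
        nonlocal pos
        assert tokens[pos] == "("
        pos += 1
        node = [tokens[pos]]          # constituent or POS label
        pos += 1
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                node.append(read())   # nested constituent
            else:
                node.append(tokens[pos])  # word
                pos += 1
        pos += 1                      # consume ")"
        return node

    return read()

def leaves(tree):
    """Return the words of the tree in left-to-right order."""
    out = []
    for child in tree[1:]:
        if isinstance(child, str):
            out.append(child)
        else:
            out.extend(leaves(child))
    return out

def delete_word(tree, index):
    """Assumed 'missing word' error: return a copy of the tree without the
    word at position `index`. Nodes left without children are pruned; all
    other structure is kept unchanged. Returns None if nothing remains."""
    counter = [0]

    def walk(node):
        kept = [node[0]]
        for child in node[1:]:
            if isinstance(child, str):
                if counter[0] != index:
                    kept.append(child)
                counter[0] += 1
            else:
                sub = walk(child)
                if sub is not None:
                    kept.append(sub)
        return kept if len(kept) > 1 else None

    return walk(tree)

def to_string(tree):
    """Serialise the nested-list tree back to bracketed notation."""
    parts = [c if isinstance(c, str) else to_string(c) for c in tree[1:]]
    return "(" + tree[0] + " " + " ".join(parts) + ")"

gold = parse_tree(
    "(S (NP (DT the) (NN parser)) (VP (VBZ ignores) (NP (DT the) (NN error))))"
)
noisy = delete_word(gold, 3)          # drop the second "the"
print(" ".join(leaves(noisy)))
print(to_string(noisy))
```

In this sketch the ill-formed sentence keeps every bracket of the original analysis except the one that dominated only the deleted word, which is one plausible reading of "minimal transformation". A fuller version would need a rule per error type (for example substitutions or insertions), which the abstract does not detail.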
Metadata
Item Type: Article (Published)
Refereed: Yes
Additional Information: The original publication is available at www.springerlink.com
Uncontrolled Keywords: treebanks; parser evaluation; robust parsing; ungrammatical language
Subjects: Computer Science > Machine translating
DCU Faculties and Centres: Research Institutes and Centres > National Centre for Language Technology (NCLT); DCU Faculties and Schools > Faculty of Engineering and Computing > School of Computing
Publisher: Springer Berlin / Heidelberg
Official URL: http://dx.doi.org/10.1007/s10032-007-0059-8
Copyright Information: © Springer-Verlag 2007
Use License: This item is licensed under a Creative Commons Attribution-NonCommercial-Share Alike 3.0 License.
Funders: Irish Research Council for Science Engineering and Technology, IRCSET P04232
ID Code: 15207
Deposited On: 17 Feb 2010 15:40 by DORAS Administrator. Last Modified: 10 Oct 2018 15:16
Documents

Full text available as:

Foster_ijdar_07.pdf (PDF, 234kB)