Detecting grammatical errors with treebank-induced, probabilistic parsers

Wagner, Joachim

Wagner, Joachim ORCID: 0000-0002-8290-3849 (2012) Detecting grammatical errors with treebank-induced, probabilistic parsers. PhD thesis, Dublin City University.

Abstract

Today's grammar checkers often use hand-crafted rule systems that define acceptable language. The development of such rule systems is labour-intensive and has to be repeated for each language. At the same time, grammars automatically induced from syntactically annotated corpora (treebanks) are successfully employed in other applications, for example text understanding and machine translation. At first glance, treebank-induced grammars seem to be unsuitable for grammar checking as they massively over-generate and fail to reject ungrammatical input due to their high robustness. We present three new methods for judging the grammaticality of a sentence with probabilistic, treebank-induced grammars, demonstrating that such grammars can be successfully applied to automatically judge the grammaticality of an input string. Our best-performing method exploits the differences between parse results for grammars trained on grammatical and ungrammatical treebanks. The second approach builds an estimator of the probability of the most likely parse using grammatical training data that has previously been parsed and annotated with parse probabilities. If the estimated probability of an input sentence (whose grammaticality is to be judged by the system) is higher by a certain amount than the actual parse probability, the sentence is flagged as ungrammatical. The third approach extracts discriminative parse tree fragments in the form of CFG rules from parsed grammatical and ungrammatical corpora and trains a binary classifier to distinguish grammatical from ungrammatical sentences. The three approaches are evaluated on a large test set of grammatical and ungrammatical sentences. The ungrammatical test set is generated automatically by inserting common grammatical errors into the British National Corpus. The results are compared to two traditional approaches: one that uses a hand-crafted, discriminative grammar, the XLE ParGram English LFG, and one based on part-of-speech n-grams. In addition, the baseline methods and the new methods are combined in a machine learning-based framework, yielding further improvements.
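
The decision rule of the second approach can be illustrated with a minimal sketch. It is not the thesis implementation: it assumes, purely for illustration, that the estimator is a least-squares line predicting the log probability of the best parse from sentence length alone. The function names, the `margin` parameter and the toy numbers are all hypothetical.

```python
from statistics import mean


def fit_length_regressor(lengths, logprobs):
    """Fit logprob ~ slope * length + intercept by ordinary least squares.

    Assumption: sentence length is the only feature. The training pairs
    would come from grammatical sentences that have already been parsed
    and annotated with the log probability of their most likely parse.
    """
    mx, my = mean(lengths), mean(logprobs)
    sxx = sum((x - mx) ** 2 for x in lengths)
    sxy = sum((x - mx) * (y - my) for x, y in zip(lengths, logprobs))
    slope = sxy / sxx
    return slope, my - slope * mx


def is_flagged(length, actual_logprob, model, margin):
    """Flag a sentence when the estimate exceeds the actual parse
    log probability by more than `margin` (a tunable threshold)."""
    slope, intercept = model
    estimated = slope * length + intercept
    return estimated - actual_logprob > margin


# Toy training data (invented): lengths and best-parse log probabilities.
model = fit_length_regressor([5, 10, 15, 20], [-50, -100, -150, -200])

# A 10-word sentence whose parse is much less probable than expected.
print(is_flagged(10, -130, model, margin=20))
# A 10-word sentence whose parse probability is close to the estimate.
print(is_flagged(10, -105, model, margin=20))
```

Varying `margin` trades missed errors against false alarms, which is one way a threshold-based detector of this kind can be traced along an ROC curve.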

Metadata

Item Type:	Thesis (PhD)
Date of Award:	March 2012
Refereed:	No
Supervisor(s):	Foster, Jennifer and van Genabith, Josef
Uncontrolled Keywords:	grammar checker; error detection; natural language processing; probabilistic grammar; precision grammar; decision tree learning; ROC curve; voting classifier; n-gram language models; learner corpus; error corpora
Subjects:	Computer Science > Computational linguistics; Computer Science > Machine learning; Computer Science > Artificial intelligence; Humanities > Language; Humanities > Linguistics
DCU Faculties and Centres:	DCU Faculties and Schools > Faculty of Engineering and Computing > School of Computing
Use License:	This item is licensed under a Creative Commons Attribution-NonCommercial-No Derivative Works 3.0 License.
Funders:	Irish Research Council for Science Engineering and Technology
ID Code:	16776
Deposited On:	29 Mar 2012 09:22 by Jennifer Foster. Last Modified 24 Jan 2019 16:30

Documents

Full text available as:

PDF (PhD Thesis of Joachim Wagner) - Requires a PDF viewer such as GSview, Xpdf or Adobe Acrobat Reader
5MB

DORAS | DCU Research Repository
