
DORAS | DCU Research Repository


Detecting grammatical errors with treebank-induced, probabilistic parsers

Wagner, Joachim (ORCID: 0000-0002-8290-3849) (2012) Detecting grammatical errors with treebank-induced, probabilistic parsers. PhD thesis, Dublin City University.

Abstract
Today's grammar checkers often use hand-crafted rule systems that define acceptable language. The development of such rule systems is labour-intensive and has to be repeated for each language. At the same time, grammars automatically induced from syntactically annotated corpora (treebanks) are successfully employed in other applications, for example text understanding and machine translation. At first glance, treebank-induced grammars seem to be unsuitable for grammar checking as they massively over-generate and fail to reject ungrammatical input due to their high robustness. We present three new methods for judging the grammaticality of a sentence with probabilistic, treebank-induced grammars, demonstrating that such grammars can be successfully applied to automatically judge the grammaticality of an input string. Our best-performing method exploits the differences between parse results for grammars trained on grammatical and ungrammatical treebanks. The second approach builds an estimator of the probability of the most likely parse using grammatical training data that has previously been parsed and annotated with parse probabilities. If the estimated probability of an input sentence (whose grammaticality is to be judged by the system) is higher by a certain amount than the actual parse probability, the sentence is flagged as ungrammatical. The third approach extracts discriminative parse tree fragments in the form of CFG rules from parsed grammatical and ungrammatical corpora and trains a binary classifier to distinguish grammatical from ungrammatical sentences. The three approaches are evaluated on a large test set of grammatical and ungrammatical sentences. The ungrammatical test set is generated automatically by inserting common grammatical errors into the British National Corpus. The results are compared to two traditional approaches, one that uses a hand-crafted, discriminative grammar, the XLE ParGram English LFG, and one based on part-of-speech n-grams. 
In addition, the baseline methods and the new methods are combined in a machine learning-based framework, yielding further improvements.
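
The decision rule of the second approach can be illustrated with a minimal sketch. It is not the thesis implementation: the `ParsedSentence` type, the function name, the `margin` parameter, the use of log-probabilities, and all numeric values are assumptions made for illustration. In practice the actual value would come from a probabilistic parser and the estimate from a model trained on parsed grammatical data.

```python
from dataclasses import dataclass


@dataclass
class ParsedSentence:
    """Hypothetical container; field names are illustrative assumptions."""
    text: str
    actual_logprob: float     # log-probability of the most likely parse (from a parser)
    estimated_logprob: float  # log-probability predicted by an estimator trained on grammatical data


def is_flagged_ungrammatical(sentence: ParsedSentence, margin: float) -> bool:
    """Flag the sentence if the estimate exceeds the actual parse
    log-probability by more than `margin` (an assumed, tunable threshold)."""
    return sentence.estimated_logprob - sentence.actual_logprob > margin


# Made-up values, for illustration only.
close_to_estimate = ParsedSentence("The cat sat on the mat.", -40.0, -38.5)
far_below_estimate = ParsedSentence("The cat sat on mat the.", -62.0, -41.0)

flags = [
    is_flagged_ungrammatical(s, margin=5.0)
    for s in (close_to_estimate, far_below_estimate)
]
```

Varying `margin` trades off flagging more ungrammatical sentences against raising more false alarms on grammatical ones.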
Metadata
Item Type: Thesis (PhD)
Date of Award: March 2012
Refereed: No
Supervisor(s): Foster, Jennifer and van Genabith, Josef
Uncontrolled Keywords: grammar checker; error detection; natural language processing; probabilistic grammar; precision grammar; decision tree learning; ROC curve; voting classifier; n-gram language models; learner corpus; error corpora
Subjects: Computer Science > Computational linguistics
Computer Science > Machine learning
Computer Science > Artificial intelligence
Humanities > Language
Humanities > Linguistics
DCU Faculties and Centres: DCU Faculties and Schools > Faculty of Engineering and Computing > School of Computing
Use License: This item is licensed under a Creative Commons Attribution-NonCommercial-No Derivative Works 3.0 License.
Funders: Irish Research Council for Science Engineering and Technology
ID Code: 16776
Deposited On: 29 Mar 2012 09:22 by Jennifer Foster. Last Modified 24 Jan 2019 16:30
Documents

Full text available as:

PDF (PhD Thesis of Joachim Wagner), 5MB