Judge, John (2006) Adapting and developing linguistic resources for question answering. PhD thesis, Dublin City University.
Abstract
As information retrieval becomes more focussed, so too must the techniques involved in the retrieval process. More precise responses to queries require more precise linguistic analysis of both the queries and the factual documents from which the information is being retrieved.
In this thesis, I present research into using existing linguistic tools to analyse questions. These tools, as supplied, often underperform on question analysis. I present my work on adapting these tools, and creating new resources for use in developing new tools tailored to question analysis.
My work has shown that in order to adapt the treebank- and f-structure annotation algorithm-based wide-coverage LFG parsing resources of Cahill et al. (2004) to analyse questions from the ATIS corpus, only the c-structure parser needs to be retrained; the annotation algorithm remains unchanged. The retrained c-structure parser needs only a small amount of appropriate training data added to its training corpus to gain a significant improvement in both c-structure parsing and f-structure annotation.
Given the improvements made with a relatively small amount of question data, I developed QuestionBank, a question treebank, to determine what further gains can be made using a larger amount of question data. My question treebank is a corpus of 4000 parse-annotated questions. The questions were taken from a number of sources, and the question treebank was “bootstrapped” from raw data in an incremental cycle of parsing, hand correction and retraining, using existing probabilistic parsing resources.
Experiments with QuestionBank show that it is an effective resource for training parsers to analyse questions, with an improvement of over 10% on the baseline parsing results. In further experiments I show that a parser retrained with QuestionBank can also parse newspaper text (Penn-II Treebank Section 23) with state-of-the-art accuracy.
Long-distance dependencies (LDDs) are a vital part of question analysis in determining semantic roles and question focus. I have designed and implemented a novel method to recover WH-traces and coindexed antecedents in c-structure trees from parser output. The method uses the f-structure LDD resolution method of Cahill et al. (2004) to resolve the dependencies and then “reverse engineers” the corresponding syntactic components in the c-structure tree.
Metadata
Item Type: | Thesis (PhD) |
---|---|
Date of Award: | 2006 |
Refereed: | No |
Supervisor(s): | van Genabith, Josef and Cahill, Aoife |
Uncontrolled Keywords: | Question analysis |
Subjects: | Computer Science > Computational linguistics; Computer Science > Computer software; Computer Science > Information retrieval |
DCU Faculties and Centres: | DCU Faculties and Schools > Faculty of Engineering and Computing > School of Computing |
Use License: | This item is licensed under a Creative Commons Attribution-NonCommercial-No Derivative Works 3.0 License. |
ID Code: | 17937 |
Deposited On: | 24 Apr 2013 13:02 by Celine Campbell. Last Modified: 30 Jul 2019 10:23 |
Documents
Full text available as: PDF (3MB)