DORAS | DCU Research Repository

Human evaluation and statistical analyses on machine reading comprehension, question generation and open-domain dialogue

Ji, Tianbo (ORCID: 0000-0003-0143-6220) (2022) Human evaluation and statistical analyses on machine reading comprehension, question generation and open-domain dialogue. PhD thesis, Dublin City University.

Abstract
Evaluation is a critical element in the development of many natural-language-based systems. In this thesis, we present critical analyses of standard evaluation methodologies applied in the following Natural Language Processing (NLP) domains: machine reading comprehension (MRC), question generation (QG), and open-domain dialogue. Systems for tasks such as MRC are usually evaluated by comparing system-generated outputs against hand-crafted references using automatic evaluation metrics, which are largely borrowed from well-established NLP tasks such as machine translation and text summarization. The evaluation of QG and open-domain dialogue, by contrast, is a known open problem, as these tasks have no corresponding references against which similarity can be computed, and human evaluation is therefore indispensable for assessing system performance. Human evaluation, however, is not always valid: (i) it can be costly and difficult to deploy when experts are involved, and (ii) human assessors can lack reliability in a crowd-sourcing environment. To overcome the challenges of both automatic metrics and human evaluation, we first design dedicated crowd-sourced human evaluation methods for each of the three target tasks. We then show that these methods are reproducible, highly reliable, easy to deploy, and cost-effective. Additionally, using the data collected in our experiments, we measure the accuracy of existing automatic metrics and analyse the potential limitations and disadvantages of applying them directly. Finally, taking into account the specific characteristics of each task, we provide detailed statistical analyses of the collected data to uncover underlying trends and to suggest directions for improving systems in different respects.
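
As a concrete illustration of the reference-based automatic evaluation described above, the short Python sketch below computes a token-overlap F1 score of the kind commonly used for extractive MRC answers (SQuAD-style scoring). It is illustrative only and not taken from the thesis; the function name and the simplified whitespace tokenisation are assumptions.

from collections import Counter

def token_f1(prediction: str, reference: str) -> float:
    """Token-overlap F1 between a system output and a single hand-crafted reference."""
    # Simplified tokenisation: real metrics also normalise punctuation and articles.
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    if not pred_tokens or not ref_tokens:
        return float(pred_tokens == ref_tokens)
    # Bag-of-words overlap between prediction and reference.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)

# Example: score a hypothetical system answer against a reference answer.
print(token_f1("the Eiffel Tower in Paris", "Eiffel Tower"))  # ≈ 0.57

Metrics of this kind capture only surface overlap with a reference, which is one reason they transfer poorly to QG and open-domain dialogue, where no single reference exists.
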
Metadata
Item Type: Thesis (PhD)
Date of Award: November 2022
Refereed: No
Supervisor(s): Jones, Gareth; Graham, Yvette; Liu, Qun
Uncontrolled Keywords: natural language processing evaluation; human evaluation; machine reading comprehension evaluation; question generation evaluation; open-domain dialogue evaluation
Subjects: Computer Science > Computational linguistics
Computer Science > Machine learning
DCU Faculties and Centres: DCU Faculties and Schools > Faculty of Engineering and Computing > School of Computing
Research Institutes and Centres > ADAPT
Funders: Science Foundation Ireland
ID Code: 27703
Deposited On: 10 Nov 2022 14:19 by Gareth Jones. Last Modified: 10 Nov 2022 14:19
Documents

Full text available as:

PDF (TianboJi_PhD_final.pdf) - Requires a PDF viewer such as GSview, Xpdf or Adobe Acrobat Reader
Creative Commons: Attribution-Noncommercial-No Derivative Works 4.0
4MB
Downloads

Downloads per month over past year
