DORAS | DCU Research Repository

Human evaluation and statistical analyses on machine reading comprehension, question generation and open-domain dialogue

Ji, Tianbo (ORCID: 0000-0003-0143-6220) (2022) Human evaluation and statistical analyses on machine reading comprehension, question generation and open-domain dialogue. PhD thesis, Dublin City University.

Abstract
Evaluation is a critical element in the development of many natural-language-based systems. In this thesis, we present critical analyses of standard evaluation methodologies applied in the following Natural Language Processing (NLP) domains: machine reading comprehension (MRC), question generation (QG), and open-domain dialogue. Systems for tasks such as MRC are usually evaluated by comparing system-generated outputs against hand-crafted references using automatic evaluation metrics, which are largely borrowed from well-established NLP tasks such as machine translation and text summarization. The evaluation of QG and open-domain dialogue, by contrast, is a known open problem, as these tasks have no corresponding references against which similarity can be computed, and human evaluation is therefore indispensable for assessing system performance. Human evaluation, however, is not always valid: (i) it can be costly and difficult to deploy when experts are involved, and (ii) human assessors can lack reliability in a crowd-sourcing environment. To overcome the challenges of both automatic metrics and human evaluation, we first design dedicated crowd-sourced human evaluation methods for each of the three target tasks. We then show that these methods are reproducible, highly reliable, easy to deploy, and cost-effective. Additionally, using the data collected in our experiments, we measure the accuracy of existing automatic metrics and analyse the potential limitations and disadvantages of applying them directly. Finally, taking into account the specific characteristics of each task, we provide detailed statistical analyses of the collected data to uncover underlying trends and to suggest directions for improving systems in different respects.
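
As a concrete illustration of the reference-based automatic evaluation described above, the short Python sketch below computes a token-overlap F1 score of the kind commonly used for extractive MRC answers (SQuAD-style scoring). It is illustrative only and not taken from the thesis; the function name and the simplified whitespace tokenisation are assumptions.

from collections import Counter

def token_f1(prediction: str, reference: str) -> float:
    """Token-overlap F1 between a system output and a single hand-crafted reference."""
    # Simplified tokenisation: real metrics also normalise punctuation and articles.
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    if not pred_tokens or not ref_tokens:
        return float(pred_tokens == ref_tokens)
    # Bag-of-words overlap between prediction and reference.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)

# Example: score a hypothetical system answer against a reference answer.
print(token_f1("the Eiffel Tower in Paris", "Eiffel Tower"))  # ≈ 0.57

Metrics of this kind capture only surface overlap with a reference, which is one reason they transfer poorly to QG and open-domain dialogue, where no single reference exists.
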
Metadata
Item Type: Thesis (PhD)
Date of Award: November 2022
Refereed: No
Supervisor(s): Jones, Gareth; Graham, Yvette; Liu, Qun
Uncontrolled Keywords: natural language processing evaluation; human evaluation; machine reading comprehension evaluation; question generation evaluation; open-domain dialogue evaluation
Subjects: Computer Science > Computational linguistics
Computer Science > Machine learning
DCU Faculties and Centres: DCU Faculties and Schools > Faculty of Engineering and Computing > School of Computing
Research Institutes and Centres > ADAPT
Funders: Science Foundation Ireland
ID Code: 27703
Deposited On: 10 Nov 2022 14:19 by Gareth Jones. Last Modified: 10 Nov 2022 14:19
Documents

Full text available as:

PDF (TianboJi_PhD_final.pdf) - Requires a PDF viewer such as GSview, Xpdf or Adobe Acrobat Reader
Creative Commons: Attribution-Noncommercial-No Derivative Works 4.0
4MB
Downloads

Downloads per month over past year
