
Twenty years of confusion in human evaluation: NLG needs evaluation sheets and standardised definitions

Howcroft, David ORCID: 0000-0002-0810-9065, Belz, Anya ORCID: 0000-0002-0552-8096, Gkatzia, Dimitra, Clinciu, Miruna, Hasan, Sadid, Mahamood, Saad ORCID: 0000-0003-2332-8749, Mille, Simon ORCID: 0000-0002-8852-2764, van Miltenburg, Emiel ORCID: 0000-0002-7143-8961, Santhanam, Sashank ORCID: 0000-0002-9412-3495 and Rieser, Verena ORCID: 0000-0001-6117-4395 (2020) Twenty years of confusion in human evaluation: NLG needs evaluation sheets and standardised definitions. In: 13th International Natural Language Generation Conference 2020 (INLG'20), 15-18 Dec 2020, Dublin, Ireland.

Abstract
Human assessment remains the most trusted form of evaluation in NLG, but highly diverse approaches and a proliferation of different quality criteria used by researchers make it difficult to compare results and draw conclusions across papers, with adverse implications for meta-evaluation and reproducibility. In this paper, we present (i) our dataset of 165 NLG papers with human evaluations, (ii) the annotation scheme we developed to label the papers for different aspects of evaluations, (iii) quantitative analyses of the annotations, and (iv) a set of recommendations for improving standards in evaluation reporting. We use the annotations as a basis for examining information included in evaluation reports, and levels of consistency in approaches, experimental design and terminology, focusing in particular on the 200+ different terms that have been used for evaluated aspects of quality. We conclude that due to a pervasive lack of clarity in reports and extreme diversity in approaches, human evaluation in NLG presents as extremely confused in 2020, and that the field is in urgent need of standard methods and terminology.
Metadata
Item Type: Conference or Workshop Item (Paper)
Event Type: Conference
Refereed: Yes
Subjects: Computer Science > Computational linguistics
DCU Faculties and Centres: DCU Faculties and Schools > Faculty of Engineering and Computing > School of Computing; Research Institutes and Centres > ADAPT
Published in: Davis, Brian, Graham, Yvette, Kelleher, John D. and Sripada, Yaji (eds.) Proceedings of the 13th International Conference on Natural Language Generation. Association for Computational Linguistics (ACL).
Publisher: Association for Computational Linguistics (ACL)
Official URL: https://aclanthology.org/2020.inlg-1.23
Copyright Information: © 2020 Association for Computational Linguistics
ID Code: 28631
Deposited On: 06 Jul 2023 15:46 by Anya Belz. Last Modified: 06 Jul 2023 16:07
Documents

Full text available as:

PDF (2020.inlg-1.23.pdf, 1MB) - Creative Commons: Attribution 4.0