In this paper, we present the results of two reproduction studies for the human evaluation originally reported by Dušek and Kasner (2020), in which the authors comparatively evaluated outputs produced by a semantic error detection system for data-to-text generation against reference outputs. In the first study, the original evaluators repeat the evaluation, in a test of the repeatability of the original evaluation. In the second study, two new evaluators carry out the evaluation task, in a test of the reproducibility of the original evaluation under otherwise identical conditions. We describe our approach to reproduction, and present and analyse the results, finding different degrees of reproducibility depending on result type, data and labelling task. Our resources are openly available.
In: Shaikh, Samira, Ferreira, Thiago and Stent, Amanda (eds.), Proceedings of the 15th International Conference on Natural Language Generation: Generation Challenges. Association for Computational Linguistics (ACL).
Faculty of Engineering and Computing, Dublin City University; EPSRC Grant No. EP/V05645X/1 for the ReproHum project; ERC Grant No. 101039303 NG-NLG; Czech Ministry of Education project No. LM2018101 LINDAT/CLARIAH-CZ; Charles University projects GAUK 140320 and SVV 260575.