
DORAS | DCU Research Repository


Learning to evaluate video captioning systems through human assessment

Lebron Casas, Luis (2024) Learning to evaluate video captioning systems through human assessment. PhD thesis, Dublin City University.

Abstract
Multimodal content analysis has attracted the attention of numerous researchers in the computer vision community. One of the most representative tasks in this sub-field is video captioning. There are numerous difficulties involved in creating these descriptions, ranging from effectively conveying the essence of the scene to generating text that is grammatically accurate and flows smoothly. In this thesis, we examine an issue inherent to these techniques: how can we effectively evaluate video captioning? Our research has identified several shortcomings in existing metrics, such as skew towards particular sentence lengths and the practice of measuring textual similarity against only a small set of human reference captions. The widely used TRECVid video-to-text task is the basis for our investigations. Shortcomings are identified by comparing the correlation of various metrics against direct human assessment. To improve the quality of the evaluation, we propose fine-tuning a large language model to maximise this correlation. Our results show that the resulting metric draws on other qualities when evaluating system output and obtains good performance. We then extend this idea using contrastive learning to learn an embedding space in which a more human-like notion of similarity can be used in the evaluation, and we show that vision-and-language models can measure the similarity between visual and textual features. We also identified discrepancies in how human judgment scores were distributed, and carried out experiments on reducing this bias by collecting multiple scores per caption and by filtering outliers. These findings motivated us to create a new dataset for video captioning evaluation, which divides the qualitative scores into five sub-aspects and includes post-edited captions. The final dataset provides a more comprehensive and robust reference against which to compare metrics for video captioning evaluation.
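The evaluation methodology described above rests on correlating automatic metric scores with human direct-assessment (DA) scores. As a minimal sketch of that comparison, the snippet below computes a Pearson correlation between two hypothetical score lists; all names and numbers here are illustrative assumptions, not data from the thesis.

```python
# Sketch: correlating an automatic captioning metric's scores with human
# direct-assessment (DA) scores. All scores below are hypothetical.

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length score lists."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

# Hypothetical per-system scores: an automatic metric vs. averaged human DA.
metric_scores = [0.41, 0.55, 0.32, 0.60, 0.48]
human_da = [62.0, 71.5, 55.0, 78.0, 66.0]

print(f"Pearson r = {pearson(metric_scores, human_da):.3f}")
```

A metric whose scores track human DA closely yields a correlation near 1; comparing this value across metrics is what exposes the shortcomings the abstract mentions.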
Metadata
Item Type: Thesis (PhD)
Date of Award: August 2024
Refereed: No
Supervisor(s): O'Connor, Noel, McGuinness, Kevin and Graham, Yvette
Subjects: Computer Science > Information retrieval
Computer Science > Machine learning
Computer Science > Digital video
DCU Faculties and Centres: DCU Faculties and Schools > Faculty of Engineering and Computing > School of Electronic Engineering
Research Institutes and Centres > INSIGHT Centre for Data Analytics
Use License: This item is licensed under a Creative Commons Attribution-NonCommercial-No Derivative Works 4.0 License.
Funders: Irish Research Council Enterprise Partnership Scheme
ID Code: 30081
Deposited On: 18 Nov 2024 14:58 by Noel Edward O'Connor. Last Modified: 18 Nov 2024 14:58
Documents

Full text available as:

PDF - Requires a PDF viewer such as GSview, Xpdf or Adobe Acrobat Reader
Creative Commons: Attribution-Noncommercial-No Derivative Works 4.0
76MB
