Lebron Casas, Luis (2024) Learning to evaluate video captioning systems through human assessment. PhD thesis, Dublin City University.
Abstract
Multimodal content analysis has attracted considerable attention in the computer vision community, and video captioning is one of its most representative tasks. Generating good captions poses numerous difficulties, ranging from conveying the essence of the scene to producing text that is grammatically accurate and flows smoothly. In this thesis, we examine an issue inherent to these techniques: how can we effectively evaluate video captioning? Our research identifies several shortcomings in existing metrics, such as a bias related to sentence length and the practice of measuring textual similarity against only a small set of human reference captions. We base our investigations on the widely used TRECVid video-to-text task, identifying shortcomings by measuring how well each metric correlates with human direct assessment. To improve the quality of evaluation, we propose fine-tuning a large language model to maximise this correlation. Our results show that the resulting metric draws on qualities beyond surface textual similarity and achieves strong performance. We then extend this idea with contrastive learning, training an embedding space in which a more human-like notion of similarity can be used for evaluation, and we show that vision-and-language models can measure the similarity between visual and textual features. We also identify discrepancies in how human judgment scores were distributed.
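The correlation analysis mentioned above can be illustrated with a minimal sketch: given an automatic metric's per-system scores and the corresponding mean human direct assessment (DA) scores, measure how well the two agree. The score arrays below are hypothetical placeholders for illustration only, not data from the thesis.

```python
# Minimal sketch of validating a captioning metric against human judgment.
# The numbers below are invented placeholders, not results from the thesis.
import numpy as np
from scipy.stats import pearsonr, spearmanr

# Hypothetical per-system scores: one automatic metric (e.g. an n-gram
# overlap score) and the mean human DA score for each captioning system.
metric_scores = np.array([0.31, 0.42, 0.28, 0.55, 0.47])
da_scores = np.array([62.0, 70.5, 58.3, 81.2, 74.9])

# Pearson measures linear agreement; Spearman measures rank agreement,
# which is often what matters when ranking captioning systems.
pearson_r, _ = pearsonr(metric_scores, da_scores)
spearman_rho, _ = spearmanr(metric_scores, da_scores)

print(f"Pearson r:    {pearson_r:.3f}")
print(f"Spearman rho: {spearman_rho:.3f}")
```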
We carried out experiments on reducing this bias by collecting multiple scores per caption and by filtering outliers. These findings motivated us to create a new dataset for video captioning evaluation, which divides the qualitative scores into five sub-aspects and also includes post-edited captions. The final dataset provides a more comprehensive and robust reference against which to compare metrics for video captioning evaluation.
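The contrastive-learning idea can likewise be sketched: train an embedding space in which two captions of the same video land close together, so that similarity in that space tracks human judgment better than n-gram overlap. The InfoNCE loss below is a standard formulation used as a plausible stand-in; the encoder, batch construction, and hyperparameters are assumptions, not the thesis's actual setup.

```python
# Sketch of a contrastive objective for caption similarity (assumed setup,
# not the thesis's architecture): rows i of `anchor` and `positive` are
# embeddings of two captions describing the same video.
import torch
import torch.nn.functional as F

def info_nce_loss(anchor: torch.Tensor, positive: torch.Tensor,
                  temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE over a batch of paired caption embeddings."""
    a = F.normalize(anchor, dim=-1)
    p = F.normalize(positive, dim=-1)
    logits = a @ p.t() / temperature       # (B, B) cosine-similarity logits
    targets = torch.arange(a.size(0))      # matching pairs lie on the diagonal
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.t(), targets)) / 2

# Toy usage: random tensors stand in for encoder outputs to keep the
# sketch self-contained; in practice both would come from a text or
# vision-and-language encoder over paired captions.
batch, dim = 8, 256
anchor = torch.randn(batch, dim, requires_grad=True)
positive = anchor.detach() + 0.1 * torch.randn(batch, dim)
loss = info_nce_loss(anchor, positive)
loss.backward()
print(f"InfoNCE loss: {loss.item():.3f}")
```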
Metadata
| Item Type | Thesis (PhD) |
| --- | --- |
| Date of Award | August 2024 |
| Refereed | No |
| Supervisor(s) | O'Connor, Noel; McGuinness, Kevin; Graham, Yvette |
| Subjects | Computer Science > Information retrieval; Computer Science > Machine learning; Computer Science > Digital video |
| DCU Faculties and Centres | DCU Faculties and Schools > Faculty of Engineering and Computing > School of Electronic Engineering; Research Institutes and Centres > INSIGHT Centre for Data Analytics |
| Use License | Creative Commons Attribution-NonCommercial-No Derivative Works 4.0 |
| Funders | Irish Research Council Enterprise Partnership Scheme |
| ID Code | 30081 |
| Deposited On | 18 Nov 2024 14:58 by Noel Edward O'Connor. Last modified 18 Nov 2024 14:58 |
Documents
Full text available as: PDF, 76MB (Creative Commons: Attribution-NonCommercial-No Derivative Works 4.0)