To support content-based search on visual media, including images and video, such media are typically accessed via manually or automatically assigned concepts or tags, or sometimes via image-to-image similarity, depending on the use case. While great progress has been made in recent years in automatic concept detection using machine learning, there remains a mismatch between the semantics of the concepts we can automatically detect and the semantics of the words used in, for example, a user's query. In this paper we report on a large collection of wearable camera images gathered as part of the Kids'Cam project, which has been both manually annotated from a vocabulary of 83 concepts and automatically annotated from a vocabulary of 1,000 concepts. This collection allows us to explore how language, in the form of two distinct concept vocabularies or spaces, one manually assigned and thus forming a ground truth, is used to represent images, in our case taken with wearable cameras. It also allows us to discuss, in general terms, concept mismatches in visual media, which derive from language mismatches. We report the data processing completed on this collection and some of our initial experimentation in mapping between the two concept vocabularies.
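The abstract does not specify how the mapping between the two vocabularies was performed. As a rough illustration only, one common way to relate a large automatic concept vocabulary to a small manual one is to embed each concept label with pre-trained word vectors and link each automatic concept to its nearest manual concept by cosine similarity. The sketch below is a minimal, hypothetical example using the gensim library; the stand-in concept lists, the choice of GloVe model, and the similarity threshold are all illustrative assumptions, not details from the paper.

# Hypothetical sketch: map each concept in a large automatic vocabulary to its
# nearest concept in a small manual vocabulary via word-embedding similarity.
# Concept lists, model, and threshold are illustrative only; the paper's
# actual mapping method is not described in this abstract.
import gensim.downloader as api

# Pre-trained 50-d GloVe word vectors (downloaded on first use).
vectors = api.load("glove-wiki-gigaword-50")

manual_vocab = ["dog", "car", "tree", "food", "person"]      # stand-in for the 83 manual concepts
auto_vocab = ["puppy", "vehicle", "sandwich", "pedestrian"]  # stand-in for the 1,000 automatic concepts

def nearest_manual_concept(auto_concept, threshold=0.5):
    """Return the most similar manual concept, or None if no match clears the threshold."""
    if auto_concept not in vectors:
        return None
    scored = [(vectors.similarity(auto_concept, m), m)
              for m in manual_vocab if m in vectors]
    best_score, best_concept = max(scored)
    return best_concept if best_score >= threshold else None

for concept in auto_vocab:
    print(concept, "->", nearest_manual_concept(concept))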
In: iV&L-MM '16: Proceedings of the 2016 ACM Workshop on Vision and Language Integration Meets Multimedia Fusion.
Association for Computing Machinery. ISBN 978-1-4503-4519-4
This item is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 3.0 License.
Funders:
Science Foundation Ireland, National Natural Science Foundation of China, Beijing Key Laboratory of Networked Multimedia, Health Research Council of New Zealand Programme Grant
ID Code:
21434
Deposited On:
18 Oct 2016 09:23 by Alan Smeaton. Last Modified: 07 Apr 2021 13:02