Human speech processing is often multimodal, combining audio and
visual information. Eyes and Ears Together proposes two
benchmark multimodal speech processing tasks: (1) multimodal automatic speech recognition (ASR) and (2) multimodal co-reference
resolution on spoken multimedia. These tasks are motivated by
our desire to address the difficulties of ASR for multimedia spoken
content. We review prior work on the integration of multimodal
signals into speech processing for multimedia data, introduce a
multimedia dataset for the proposed tasks, and outline both tasks.
In: Larson, M., Arora, P., Demarty, C.-H., Riegler, M. (eds.): Working Notes Proceedings of the MediaEval 2018 Workshop. CEUR-WS, Vol. 2283.