LSTM language model adaptation with images and titles for multimedia automatic speech recognition
Moriya, Yasufumi and Jones, Gareth J.F. (ORCID: 0000-0003-2923-8365)
(2019)
LSTM language model adaptation with images and titles for multimedia automatic speech recognition.
In: IEEE SLT 2018 - Workshop on Spoken Language Technology, 18-21 Dec 2018, Athens, Greece.
ISBN 978-1-5386-4334-1
Transcription of multimedia data sources is often a challenging automatic speech recognition (ASR) task. The incorporation of visual features as additional contextual information to improve ASR for this data has recently drawn attention from researchers. Our investigation extends existing ASR methods by using images and video titles to adapt a recurrent neural network (RNN) language model with a long short-term memory (LSTM) network. Our language model is tested on transcription of an existing corpus of instruction videos and on a new corpus consisting of lecture videos. A consistent reduction in perplexity of 5-10 points is observed on both datasets. When the non-adapted model is combined with the image-adaptation and video-title-adaptation models for n-best ASR hypothesis re-ranking, the word error rate (WER) is additionally reduced by around 0.5% on both datasets. Analysis of the output word probabilities of the model shows that both image adaptation and video title adaptation give the model greater confidence in its choice of contextually correct, informative words.
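To illustrate the re-ranking step described above, the following is a minimal sketch of combining a non-adapted LM with an adapted LM via linear interpolation to re-score n-best ASR hypotheses. The function name, score fields, and the weights `lam` and `lm_weight` are hypothetical illustrations, not the paper's actual implementation or tuned values.

```python
import math

def rerank_nbest(hypotheses, lam=0.5, lm_weight=10.0):
    """Pick the best hypothesis from an ASR n-best list.

    Each hypothesis is a dict with (hypothetical) keys:
      'text'       - hypothesis word string
      'am_score'   - acoustic model log-score
      'lm_base'    - log-probability from the non-adapted LSTM LM
      'lm_adapted' - log-probability from the image/title-adapted LSTM LM
    """
    def combined(h):
        # Linearly interpolate the two LMs in probability space,
        # then combine with the acoustic score log-linearly.
        lm = math.log(
            lam * math.exp(h["lm_base"])
            + (1.0 - lam) * math.exp(h["lm_adapted"])
        )
        return h["am_score"] + lm_weight * lm

    return max(hypotheses, key=combined)

# Toy usage: the adapted LM strongly prefers the contextually
# correct hypothesis, so it wins despite a worse acoustic score.
nbest = [
    {"text": "neural network",  "am_score": -12.0, "lm_base": -4.0, "lm_adapted": -2.0},
    {"text": "nearal networks", "am_score": -11.5, "lm_base": -6.0, "lm_adapted": -6.5},
]
best = rerank_nbest(nbest)
```

A design note: interpolating in probability space (rather than averaging log-probabilities) lets the adapted model dominate on words it is confident about while falling back to the general model elsewhere, which matches the paper's observation that adaptation boosts confidence on contextually correct words.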