Temporal bilinear encoding network of audio-visual features at low sampling rates

Hu, Feiyan; Mohedano, Eva; O'Connor, Noel E.; McGuinness, Kevin

Hu, Feiyan ORCID: 0000-0001-7451-6438, Mohedano, Eva, O'Connor, Noel E. ORCID: 0000-0002-4033-9135 and McGuinness, Kevin ORCID: 0000-0003-1336-6477 (2021) Temporal bilinear encoding network of audio-visual features at low sampling rates. In: 16th International Conference on Computer Vision Theory and Applications - VISAPP 2021, 8-10 Feb 2021, Vienna, Austria (Online). ISBN 978-989-758-488-6

Abstract

Current deep-learning-based video classification architectures are typically trained end-to-end on large volumes of data and require extensive computational resources. This paper aims to exploit audio-visual information for classifying video sampled at 1 frame per second. We propose Temporal Bilinear Encoding Networks (TBEN), which encode long-range temporal information in both audio and visual features using bilinear pooling, and we demonstrate that bilinear pooling outperforms average pooling along the temporal dimension for videos with a low sampling rate. We also embed the label hierarchy in TBEN to further improve the robustness of the classifier. Experiments on the FGA240 fine-grained classification dataset show that TBEN achieves a new state of the art (hit@1=47.95%). We also explore combining TBEN with multiple decoupled modalities such as visual semantic and motion features: experiments on UCF101 sampled at 1 FPS achieve close to state-of-the-art accuracy (hit@1=91.03%) while requiring significantly fewer computational resources than competing approaches for both training and prediction.
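
The contrast the abstract draws between average pooling and bilinear pooling over the temporal dimension can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names, the full outer-product form (the keywords suggest the paper uses Compact Bilinear Pooling), and the signed square-root and L2 normalisation steps are assumptions chosen for illustration only.

```python
import numpy as np


def temporal_average_pool(x):
    """Average per-frame features over time: (T, D) -> (D,)."""
    return x.mean(axis=0)


def temporal_bilinear_pool(x, y=None, eps=1e-12):
    """Average the per-frame outer products over time.

    x: (T, Dx) features of one modality (e.g. visual).
    y: (T, Dy) features of a second modality (e.g. audio);
       defaults to x, giving second-order statistics of one modality.
    Returns a flattened, normalised (Dx * Dy,) descriptor.
    """
    if y is None:
        y = x
    if x.shape[0] != y.shape[0]:
        raise ValueError("x and y must have the same number of frames")
    # x.T @ y sums the outer products x_t y_t^T over all T frames.
    b = (x.T @ y) / x.shape[0]
    b = b.reshape(-1)
    # Assumed post-processing: signed square root, then L2 normalisation.
    b = np.sign(b) * np.sqrt(np.abs(b))
    return b / (np.linalg.norm(b) + eps)


# Two hypothetical 2-frame clips with identical temporal means.
clip_a = np.array([[1.0, 0.0], [0.0, 1.0]])
clip_b = np.array([[0.5, 0.5], [0.5, 0.5]])

same_mean = np.allclose(temporal_average_pool(clip_a),
                        temporal_average_pool(clip_b))
same_bilinear = np.allclose(temporal_bilinear_pool(clip_a),
                            temporal_bilinear_pool(clip_b))

# Cross-modal case: 10 frames of 4-D "visual" and 3-D "audio" features.
rng = np.random.default_rng(0)
av = temporal_bilinear_pool(rng.normal(size=(10, 4)),
                            rng.normal(size=(10, 3)))
```

In this toy case the two clips are indistinguishable after average pooling but differ after bilinear pooling, because the averaged outer product retains which feature dimensions co-occur within a frame. The descriptor grows as Dx * Dy, which is the usual motivation for compact approximations.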

Metadata

Item Type:	Conference or Workshop Item (Paper)
Event Type:	Conference
Refereed:	Yes
Uncontrolled Keywords:	Video classification; bilinear pooling; Action classification; Deep learning; Audio-visual; Compact Bilinear Pooling
Subjects:	Computer Science > Artificial intelligence; Computer Science > Image processing; Computer Science > Machine learning; Computer Science > Digital video; Computer Science > Video compression
DCU Faculties and Centres:	DCU Faculties and Schools > Faculty of Engineering and Computing > School of Computing; DCU Faculties and Schools > Faculty of Engineering and Computing > School of Electronic Engineering; Research Institutes and Centres > INSIGHT Centre for Data Analytics
Published in:	Proceedings of the 16th International Joint Conference on Computer Vision, Imaging and Computer Graphics Theory and Applications: VISAPP, 5. SciTePress. ISBN 978-989-758-488-6
Publisher:	SciTePress
Official URL:	http://dx.doi.org/10.5220/0010337306370644
Copyright Information:	© 2021 The Authors (CC BY-NC-ND 4.0)
Funders:	Science Foundation Ireland (SFI) under grant number SFI/15/SIRG/3283 and SFI/12/RC/2289_P2.
ID Code:	26253
Deposited On:	13 Sep 2021 10:16 by Feiyan Hu. Last Modified 13 Sep 2021 10:16

Documents

Full text available as: PDF (217kB)

Available Versions of this Item

Temporal bilinear encoding network of audio-visual features at low sampling rates. (deposited 09 Feb 2021 14:05)
- Temporal bilinear encoding network of audio-visual features at low sampling rates. (deposited 13 Sep 2021 10:16) [Currently Displayed]
