DORAS
DCU Online Research Access Service
Temporal bilinear encoding network of audio-visual features at low sampling rates

Hu, Feiyan ORCID: 0000-0001-7451-6438, Mohedano, Eva, O'Connor, Noel E. ORCID: 0000-0002-4033-9135 and McGuinness, Kevin ORCID: 0000-0003-1336-6477 (2021) Temporal bilinear encoding network of audio-visual features at low sampling rates. In: 16th International Conference on Computer Vision Theory and Applications - VISAPP 2021, 8-10 Feb 2021, Vienna, Austria (Online). ISBN 978-989-758-488-6

Warning: There is a more recent version of this item available.

Full text available as:

PDF - Requires a PDF viewer such as GSview, Xpdf or Adobe Acrobat Reader
217kB

Abstract

Current deep-learning-based video classification architectures are typically trained end-to-end on large volumes of data and require extensive computational resources. This paper aims to exploit audio-visual information for video classification at a sampling rate of 1 frame per second. We propose Temporal Bilinear Encoding Networks (TBEN), which encode long-range temporal information from both audio and visual features using bilinear pooling, and demonstrate that bilinear pooling outperforms average pooling along the temporal dimension for videos with a low sampling rate. We also embed the label hierarchy in TBEN to further improve the robustness of the classifier. Experiments with TBEN on the FGA240 fine-grained classification dataset achieve a new state of the art (hit@1 = 47.95%). We also explore combining TBEN with multiple decoupled modalities such as visual semantic and motion features: experiments on UCF101 sampled at 1 FPS achieve close to state-of-the-art accuracy (hit@1 = 91.03%) while requiring significantly fewer computational resources than competing approaches for both training and prediction.
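
To illustrate why pooling second-order statistics over time can retain information that averaging discards, here is a minimal NumPy sketch. It is not the authors' TBEN implementation: it assumes plain (uncompacted) bilinear pooling followed by a signed square root and L2 normalisation, which is one common formulation, and the function names and toy clips are hypothetical.

```python
import numpy as np


def average_pool(features):
    """Average pooling over time: (T, D) frame features -> (D,) descriptor."""
    return features.mean(axis=0)


def bilinear_pool(features, eps=1e-12):
    """Bilinear pooling over time: (T, D) frame features -> (D*D,) descriptor.

    Averages the outer product x_t x_t^T over all T frames, then applies a
    signed square root and L2 normalisation (an assumed, commonly used
    post-processing step).
    """
    num_frames = features.shape[0]
    pooled = features.T @ features / num_frames  # (D, D) second-order statistics
    vec = pooled.reshape(-1)
    vec = np.sign(vec) * np.sqrt(np.abs(vec))
    return vec / (np.linalg.norm(vec) + eps)


# Two hypothetical 2-frame clips with identical temporal means
# but different frame-to-frame structure.
clip_a = np.array([[1.0, 1.0], [1.0, 1.0]])
clip_b = np.array([[2.0, 0.0], [0.0, 2.0]])

avg_a, avg_b = average_pool(clip_a), average_pool(clip_b)
bil_a, bil_b = bilinear_pool(clip_a), bilinear_pool(clip_b)

print("average pooling identical:", np.allclose(avg_a, avg_b))
print("bilinear pooling identical:", np.allclose(bil_a, bil_b))
```

In this toy case the two clips are indistinguishable after average pooling, while their bilinear descriptors differ because the outer products capture how feature dimensions co-occur within each frame. The descriptor grows as D², which is the cost that compact bilinear pooling (listed in the keywords) is designed to reduce.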

Item Type: Conference or Workshop Item (Paper)
Event Type: Conference
Refereed: Yes
Uncontrolled Keywords: Video classification; bilinear pooling; Action classification; Deep learning; Audio-visual; Compact Bilinear Pooling
Subjects: Computer Science > Artificial intelligence
Computer Science > Image processing
Computer Science > Machine learning
Computer Science > Digital video
Computer Science > Video compression
DCU Faculties and Centres: DCU Faculties and Schools > Faculty of Engineering and Computing > School of Computing
DCU Faculties and Schools > Faculty of Engineering and Computing > School of Electronic Engineering
Research Initiatives and Centres > INSIGHT Centre for Data Analytics
Published in: Proceedings of the 16th International Joint Conference on Computer Vision, Imaging and Computer Graphics Theory and Applications: VISAPP. 5. SciTePress. ISBN 978-989-758-488-6
Publisher: SciTePress
Official URL: http://dx.doi.org/10.5220/0010337306370644
Copyright Information: © 2021 The Authors (CC BY-NC-ND 4.0)
Funders: Science Foundation Ireland (SFI) under grant numbers SFI/15/SIRG/3283 and SFI/12/RC/2289_P2.
ID Code: 25289
Deposited On: 09 Feb 2021 14:05 by Feiyan Hu. Last Modified 13 Sep 2021 10:15

Available Versions of this Item

  • Temporal bilinear encoding network of audio-visual features at low sampling rates. (deposited 09 Feb 2021 14:05) [Currently Displayed]
    • Temporal bilinear encoding network of audio-visual features at low sampling rates. (deposited 13 Sep 2021 10:16)
