Video saliency prediction has recently attracted the attention of the research community, as it is an upstream task for several practical applications. However, current solutions are particularly computationally demanding, especially due to the wide usage of spatio-temporal 3D convolutions. We observe that, while different model architectures achieve similar performance on benchmarks, visual variations between predicted saliency maps are still significant. Inspired by this observation, we propose a lightweight model that employs multiple simple heterogeneous decoders and adopts several practical approaches to improve accuracy while keeping computational costs low, such as hierarchical multi-map knowledge distillation, multi-output saliency prediction, unlabeled auxiliary datasets and channel reduction with teacher assistant supervision. Our approach achieves saliency prediction accuracy on par with or better than state-of-the-art methods on the DHF1K, UCF-Sports and Hollywood2 benchmarks, while significantly improving the efficiency of the model.
Science Foundation Ireland (SFI) under grant number SFI/12/RC/2289 P2, Regione Sicilia, Italy, RehaStart project (grant identifier: PO FESR 2014/2020, Azione 1.1.5, N. 08ME6201000222, CUP G79J18000610007), University of Catania, Piano della Ricerca di Ateneo, 2020/2022, Linea 2D, MIUR, Italy, Azione 1.2 “Mobilità dei Ricercatori” (grant identifier: Asse I, PON R&I 2014-2020, id. AIM 1889410, CUP: E64I18002520007)
ID Code: 27962
Deposited On: 09 Jan 2023 14:10 by Feiyan Hu. Last Modified 14 Feb 2023 14:54