Krishna, Tarun (2024) Towards an efficient synergistic paradigm for self-supervised visual representation learning. PhD thesis, Dublin City University.
This thesis investigates the latest developments in self-supervised representation learning, which enables learning from a large unlabelled data corpus. The overarching objective of this work is to comprehensively assess, devise and harness self-supervised models with efficiency and effectiveness at the forefront. As an initial step in this direction, this research begins by evaluating the efficacy of contrastive models for instance-based image retrieval, demonstrating their capability to encode semantic similarity among instances induced through discriminative learning. Through extensive evaluation on Oxford5k/rOxford5k, Paris6k/rParis6k and INSTRE, it is shown that these models perform comparably with, and in some cases outperform, pre-trained supervised baselines, highlighting their potential for building robust image retrieval engines without explicit supervision. Building upon this foundation, this work further delves into 360° image visual attention modeling, a domain largely unexplored in the context of self-supervised representation learning. More importantly, the proposed learning solutions have been validated on realistic benchmarks (Salient 360 [Rai et al., 2017], VR-Eye Tracking, Sitzmann) built with datasets gathered from the Web. Further, contributions are made towards optimizing self-supervised learning strategies, particularly addressing challenges such as redundant channel features and computational complexity. Dynamic channel selection methods originally developed for supervised learning are adapted to self-supervised networks, resulting in significant reductions in computation without compromising performance. Additionally, a novel perspective is introduced on the synergy between self-supervised learning and dynamic computation paradigms. Through simultaneous learning of dense and gated sub-networks, a generic and efficient architecture is proposed, achieving performance comparable to vanilla self-supervised settings but at reduced computational cost. These approaches are rigorously benchmarked on the CIFAR-10/100, STL-10 and ImageNet-100 datasets. Finally, the conclusion of this thesis summarizes the contributions of this work and discusses directions for future research in this area.
Item Type: | Thesis (PhD) |
---|---|
Date of Award: | 5 December 2024 |
Refereed: | No |
Additional Information: | Industry collaboration with Xperi |
Supervisor(s): | O'Connor, Noel and McGuinness, Kevin |
Subjects: | Computer Science > Image processing; Computer Science > Machine learning; Computer Science > Digital video |
DCU Faculties and Centres: | DCU Faculties and Schools > Faculty of Engineering and Computing > School of Electronic Engineering; Research Institutes and Centres > INSIGHT Centre for Data Analytics |
Use License: | This item is licensed under a Creative Commons Attribution-NonCommercial-No Derivative Works 4.0 License. |
Funders: | Research Ireland |
ID Code: | 30567 |
Deposited On: | 10 Mar 2025 15:07 by Noel Edward O'Connor. Last Modified: 10 Mar 2025 15:07 |