Rai, Ayush K. (2025) Understanding Videos by Learning Structured, Robust and Efficient Representations. PhD thesis, Dublin City University.
Abstract
With an enormous volume of unstructured video content constantly being generated online, designing intelligent systems for automatic understanding of visual data could have a direct and beneficial effect on several fields such as real-world surveillance, robotics, healthcare, entertainment, and content retrieval. However, extracting meaningful and relevant information from videos remains a challenging task and an open area of research. Learning powerful representations in the video domain involves multiple facets such as structural feature learning, modeling motion, multi-modal feature learning, and feature disentanglement, with the primary goal of holistic video understanding. Recently, self-supervised learning has gained prominence as an effective paradigm for representation learning in images and videos, eliminating the need for additional label annotations. The objective of this thesis is to thoroughly investigate various video modeling techniques, primarily aimed at learning structured, robust, and efficient video representations within the framework of self-supervised learning. To focus on learning structured video representations, this work first addresses the task of generic event boundary detection by revisiting a self-supervised method and enhancing it by incorporating a differentiable motion estimation module to capture the generic spatial and temporal diversities in the videos. Extensive experiments on the Kinetics-GEBD and TAPOS datasets demonstrate the efficacy of the proposed approach compared to other self-supervised state-of-the-art methods. In order to embed robustness into learned video representations, the thesis then tackles the problem of video anomaly detection from the perspective of recognizing out-of-distribution samples. A novel method is proposed to generate spatio-temporal pseudo-anomalies by inpainting masked image regions with a pre-trained Latent Diffusion Model and perturbing optical flow using mixup to simulate spatio-temporal distortions. Additionally, a unified framework is introduced to detect real-world anomalies under the one-class classification setting by learning three anomaly indicators: reconstruction quality, temporal irregularity, and semantic inconsistency. Rigorous evaluations on the Ped2, Avenue, ShanghaiTech, and UBnormal benchmarks highlight the method’s effectiveness compared to existing state-of-the-art approaches. To learn video representations efficiently, this research proposes a novel and generalizable Trajectory-Aware Adaptive Token Sampler (TATS) module that learns to adaptively sample motion-centric tokens for masked autoencoder (MAE) pretraining by modeling their motion trajectories in videos. Additionally, a unified training recipe is introduced that facilitates the joint optimization of both MAE and TATS from scratch using Proximal Policy Optimization to ensure stable convergence during pre-training even with aggressive masking. Comprehensive evaluation on benchmark datasets (Kinetics-400, Something-Something v2, UCF101, HMDB51) for action recognition demonstrates the effectiveness, generalization, transferability, and efficiency of this work compared to state-of-the-art methods.
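The abstract mentions perturbing optical flow with mixup but does not say how the mixing is performed. As a rough illustration only, the sketch below uses the standard mixup formulation: a convex combination of two inputs with a weight drawn from a Beta(alpha, alpha) distribution. The function name `mixup_flow`, the default `alpha=0.4`, and the representation of a flow field as a flat list of `(dx, dy)` vectors are all assumptions made for this example and are not taken from the thesis.

```python
import random


def mixup_flow(flow_a, flow_b, alpha=0.4, rng=None):
    """Blend two optical-flow fields with standard mixup.

    Each flow field is a list of (dx, dy) displacement vectors, one per pixel
    (a hypothetical, simplified layout chosen for this sketch).
    Returns the blended field and the sampled mixing weight.
    """
    if len(flow_a) != len(flow_b):
        raise ValueError("flow fields must have the same number of vectors")
    rng = rng or random.Random()
    # Standard mixup: the mixing weight is sampled from Beta(alpha, alpha).
    lam = rng.betavariate(alpha, alpha)
    mixed = [
        (lam * ax + (1.0 - lam) * bx, lam * ay + (1.0 - lam) * by)
        for (ax, ay), (bx, by) in zip(flow_a, flow_b)
    ]
    return mixed, lam


# Usage: blend a static field with a uniform rightward motion field.
static = [(0.0, 0.0)] * 4
rightward = [(1.0, 0.0)] * 4
mixed, lam = mixup_flow(static, rightward, rng=random.Random(0))
```

With a small `alpha`, the sampled weight tends to fall near 0 or 1, so most blended fields stay close to one of the two inputs; how the thesis selects the second flow field or the mixing distribution is not stated in the abstract.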
Metadata
| Item Type: | Thesis (PhD) |
|---|---|
| Date of Award: | 19 August 2025 |
| Refereed: | No |
| Supervisor(s): | Smeaton, Alan and O'Connor, Noel |
| Subjects: | Computer Science > Artificial intelligence; Computer Science > Image processing; Computer Science > Machine learning; Computer Science > Digital video |
| DCU Faculties and Centres: | DCU Faculties and Schools > Faculty of Engineering and Computing > School of Computing; DCU Faculties and Schools > Faculty of Engineering and Computing > School of Electronic Engineering; Research Institutes and Centres > INSIGHT Centre for Data Analytics |
| Use License: | This item is licensed under a Creative Commons Attribution-NonCommercial-No Derivative Works 4.0 License. |
| Funders: | Research Ireland |
| ID Code: | 31433 |
| Deposited On: | 21 Nov 2025 14:23 by Noel Edward O'connor. Last Modified 21 Nov 2025 14:23 |
Documents
Full text available as:
PDF (29MB), Creative Commons: Attribution-Noncommercial-No Derivative Works 4.0