Understanding Videos by Learning Structured, Robust and Efficient Representations

Rai, Ayush K.

Rai, Ayush K. (2025) Understanding Videos by Learning Structured, Robust and Efficient Representations. PhD thesis, Dublin City University.

Abstract

With an enormous volume of unstructured video content constantly being generated online, designing intelligent systems for automatic understanding of visual data could have a direct and beneficial effect on several fields, such as real-world surveillance, robotics, healthcare, entertainment, and content retrieval. However, extracting meaningful and relevant information from videos remains a challenging task and an open area of research. Learning powerful representations in the video domain involves multiple facets, such as structural feature learning, motion modeling, multi-modal feature learning, and feature disentanglement, with the primary goal of holistic video understanding. Recently, self-supervised learning has gained prominence as an effective paradigm for representation learning in images and videos, eliminating the need for additional label annotations. The objective of this thesis is to thoroughly investigate various video modeling techniques, primarily aimed at learning structured, robust, and efficient video representations within the framework of self-supervised learning.
To learn structured video representations, this work first addresses the task of generic event boundary detection by revisiting a self-supervised method and enhancing it with a differentiable motion estimation module that captures the generic spatial and temporal diversities in videos. Extensive experiments on the Kinetics-GEBD and TAPOS datasets demonstrate the efficacy of the proposed approach compared to other state-of-the-art self-supervised methods.
To embed robustness into learned video representations, the thesis then tackles the problem of video anomaly detection from the perspective of recognizing out-of-distribution samples. A novel method is proposed to generate spatio-temporal pseudo-anomalies by inpainting masked image regions with a pre-trained Latent Diffusion Model and perturbing optical flow using mixup to simulate spatio-temporal distortions. Additionally, a unified framework is introduced to detect real-world anomalies under the one-class classification setting by learning three anomaly indicators: reconstruction quality, temporal irregularity, and semantic inconsistency. Rigorous evaluations on the Ped2, Avenue, ShanghaiTech, and UBnormal benchmarks highlight the method's effectiveness compared to existing state-of-the-art approaches.
To learn video representations efficiently, this research proposes a novel and generalizable Trajectory-Aware Adaptive Token Sampler (TATS) module that learns to adaptively sample motion-centric tokens for masked autoencoder (MAE) pre-training by modeling their motion trajectories in videos. A unified training recipe is also introduced that facilitates the joint optimization of both MAE and TATS from scratch using Proximal Policy Optimization, ensuring stable convergence during pre-training even with aggressive masking. Comprehensive evaluation on benchmark action recognition datasets (Kinetics-400, Something-Something v2, UCF101, HMDB51) demonstrates the effectiveness, generalization, transferability, and efficiency of the proposed approach compared to state-of-the-art methods.

Metadata

Item Type:	Thesis (PhD)
Date of Award:	19 August 2025
Refereed:	No
Supervisor(s):	Smeaton, Alan and O'Connor, Noel
Subjects:	Computer Science > Artificial intelligence; Computer Science > Image processing; Computer Science > Machine learning; Computer Science > Digital video
DCU Faculties and Centres:	DCU Faculties and Schools > Faculty of Engineering and Computing > School of Computing; DCU Faculties and Schools > Faculty of Engineering and Computing > School of Electronic Engineering; Research Institutes and Centres > INSIGHT Centre for Data Analytics
Use License:	This item is licensed under a Creative Commons Attribution-NonCommercial-No Derivative Works 4.0 License.
Funders:	Research Ireland
ID Code:	31433
Deposited On:	21 Nov 2025 14:23 by Noel Edward O'Connor. Last Modified 21 Nov 2025 14:23

Documents

Full text available as:

PDF: PhD_Thesis_DCU_Ayush-4.pdf (29MB)
Creative Commons: Attribution-Noncommercial-No Derivative Works 4.0

DORAS | DCU Research Repository