
DORAS | DCU Research Repository


Aligning Vision and Language: Harnessing Language Semantics for Efficient Vision Models

Maniparambil, Mayug (ORCID: 0000-0002-9976-1920) (2025) Aligning Vision and Language: Harnessing Language Semantics for Efficient Vision Models. PhD thesis, Dublin City University.

Abstract
This thesis explores methods to enhance the efficiency, flexibility, and generalization of vision models by leveraging semantic information from language. Inspired by human multi-modal perception, the research addresses key limitations in current models: high data and compute demands, limited generalization to new concepts, and suboptimal unimodal features.

The thesis begins by exploring how structured semantic information, such as domain-expert knowledge, can enhance the few-shot learning of visual concepts. A novel algorithm, BaseTransformers, is proposed to integrate semantic information, enabling computer vision models to learn new concepts from minimal labeled data by associating them with semantically similar, well-represented base concepts. Extensive evaluations on benchmark datasets show improvements over few-shot vision models that do not leverage semantic information.

Recognizing the scalability challenges of curated semantics, the thesis then introduces a strategy that uses large language models (LLMs) as a scalable source of semantic knowledge. The proposed VDT-Adapter learns to dynamically select and aggregate LLM-generated semantic information, supporting zero-shot and few-shot domain transfer of CLIP models, as validated on 12 benchmark datasets.

The research further identifies challenges faced by CLIP models regarding the compute and data required for pretraining, as well as limited flexibility and generalization due to suboptimal unimodal features in the joint embedding space. To address these challenges, recent advancements in unimodal vision and language encoders are leveraged, and these models are analyzed through the lens of representational similarity.
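The general idea of enriching class names with LLM-generated descriptions for zero-shot classification can be sketched as below. This is an illustrative toy, not the VDT-Adapter itself: the fixed mixing weight `alpha`, the simple averaging of description embeddings, and the toy vectors are all assumptions of this sketch, whereas the actual adapter learns to select and aggregate descriptions dynamically.

```python
import numpy as np

def l2_normalize(x, axis=-1):
    """Scale vectors to unit length so dot products are cosine similarities."""
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

def classify_with_descriptions(image_emb, name_embs, desc_embs_per_class, alpha=0.5):
    """Zero-shot classification that mixes each class-name embedding with the
    mean of that class's LLM-generated description embeddings, then picks the
    class whose combined embedding is most cosine-similar to the image."""
    combined = []
    for name_emb, desc_embs in zip(name_embs, desc_embs_per_class):
        desc_mean = l2_normalize(np.mean(desc_embs, axis=0))
        combined.append(l2_normalize(alpha * name_emb + (1 - alpha) * desc_mean))
    combined = np.stack(combined)              # (num_classes, dim)
    sims = combined @ l2_normalize(image_emb)  # cosine similarity per class
    return int(np.argmax(sims)), sims

# Toy example: two classes in a 4-d embedding space; the image embedding
# lies close to class 0, so class 0 should be predicted.
pred, sims = classify_with_descriptions(
    np.array([0.95, 0.05, 0.0, 0.0]),
    np.eye(4)[:2],
    [np.array([[1.0, 0.1, 0.0, 0.0], [0.9, 0.0, 0.1, 0.0]]),
     np.array([[0.0, 1.0, 0.1, 0.0], [0.1, 0.9, 0.0, 0.0]])],
)
```

In a real pipeline the toy vectors would be replaced by text-encoder embeddings of the class name and of LLM-generated descriptions of that class.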
Motivated by the hypothesis that vision and language encoders model the same physical reality, the thesis studies their semantic similarity, revealing that their representations often share a high degree of alignment, comparable to that of aligned vision-language encoders. Building on this insight, a lightweight framework is developed that aligns strong, pre-trained unimodal encoders using simple projection transformations. This approach is significantly more compute- and data-efficient than CLIP while outperforming it on zero-shot domain transfer to classification and retrieval tasks. The framework's flexibility and generalization are further demonstrated across diverse tasks, including multilingual retrieval and classification, zero-shot localization, and long-context retrieval. These findings pave the way for flexible, efficient, and generalizable solutions to open-world understanding, contributing to broader applications of multi-modal systems. Finally, the thesis concludes by summarizing its contributions and outlining future research directions.
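The idea of aligning frozen unimodal encoders with simple projections can be sketched as a generic CLIP-style training objective. This is a sketch under my own assumptions (plain linear projections, a symmetric InfoNCE loss, temperature 0.07), not the thesis's exact framework, which may use different projection and loss choices.

```python
import numpy as np

def l2norm(x):
    """Unit-normalize rows so dot products equal cosine similarities."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def clip_style_loss(img_feats, txt_feats, W_img, W_txt, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of paired features, where frozen
    unimodal features are mapped into a shared space by two learned linear
    projections (the only trainable parameters in this sketch)."""
    z_img = l2norm(img_feats @ W_img)
    z_txt = l2norm(txt_feats @ W_txt)
    logits = z_img @ z_txt.T / temperature  # (B, B); diagonal = matched pairs
    labels = np.arange(len(logits))

    def cross_entropy(l):
        l = l - l.max(axis=1, keepdims=True)            # numerical stability
        log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -log_probs[labels, labels].mean()

    # Average image-to-text and text-to-image directions.
    return 0.5 * (cross_entropy(logits) + cross_entropy(logits.T))

# Perfectly paired features should yield a lower loss than mispaired ones.
img, txt, W = np.eye(4), np.eye(4), np.eye(4)
loss_aligned = clip_style_loss(img, txt, W, W)
loss_shuffled = clip_style_loss(img, txt[[1, 0, 3, 2]], W, W)
```

In practice `W_img` and `W_txt` would be optimized by gradient descent while both encoders stay frozen, which is what makes the approach far cheaper than training a joint model like CLIP from scratch.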
Metadata
Item Type: Thesis (PhD)
Date of Award: 7 April 2025
Refereed: No
Supervisor(s): O'Connor, Noel E. and McGuinness, Kevin
Subjects: Computer Science > Artificial intelligence
Computer Science > Image processing
Computer Science > Machine learning
Computer Science > Digital video
DCU Faculties and Centres: DCU Faculties and Schools > Faculty of Engineering and Computing > School of Electronic Engineering
Research Institutes and Centres > INSIGHT Centre for Data Analytics
Use License: This item is licensed under a Creative Commons Attribution-NonCommercial-No Derivative Works 4.0 License.
Funders: Science Foundation Ireland
ID Code: 30928
Deposited On: 21 Nov 2025 15:13 by Noel Edward O'Connor. Last Modified: 21 Nov 2025 15:13
Documents

Full text available as:

PDF (17MB) - Creative Commons: Attribution-Noncommercial-No Derivative Works 4.0