Le, Hoang-Bao (ORCID: 0009-0000-2496-4347), Cuong, Dinh Viet, Nguyen, An Pham Ngoc (ORCID: 0000-0002-0041-9747), Liting, Zhou (ORCID: 0000-0002-7778-8743) and Gurrin, Cathal (ORCID: 0000-0003-4395-7702) (2025)
Quizzard@INOVA Challenge 2025 -- Track A: Plug-and-Play Technique in Interleaved Multi-Image Model.
ICME Workshop.
Abstract
This paper addresses two main objectives. Firstly, we demonstrate the impressive performance of LLaVA-NeXT-Interleave on 22 datasets across three different tasks: Multi-Image Reasoning, Documents and Knowledge-Based Understanding, and Interactive Multi-Modal Communication. Secondly, we add the Dense Channel Integration (DCI) connector to LLaVA-NeXT-Interleave and compare its performance against the standard model. We find that the standard model achieves the highest overall accuracy, excelling in vision-heavy tasks like VISION, NLVR2, and Fashion200K. Meanwhile, the DCI-enhanced version shows particular strength on datasets requiring deeper semantic coherence or structured change understanding, such as MIT-States PropertyCoherence and SlideVQA. Our results highlight the potential of combining powerful foundation models with plug-and-play techniques for Interleave tasks. The code is available at
https://github.com/dinhvietcuong1996/icme25-inova.
Metadata
Item Type: | Article (Published) |
---|---|
Refereed: | Yes |
Uncontrolled Keywords: | Interleave, llava, comprehension, dense connector |
Subjects: | Computer Science > Information retrieval |
DCU Faculties and Centres: | UNSPECIFIED |
Official URL: | https://arxiv.org/abs/2506.11737 |
ID Code: | 31152 |
Deposited On: | 30 Jun 2025 11:40 by Hoang Bao Le. Last Modified 30 Jun 2025 11:40 |
Documents
Full text available as:
PDF (698kB), Creative Commons: Attribution 4.0