Multimodal learning has received considerable attention in recent years.
Associating a description with an image, in any language, is a challenging task, as it
involves identifying the objects within the image and determining the relationships between them. Documents are often multimodal and may therefore
contain text as well as images. Various methodologies have been put forward to
match an image to its corresponding description at the sentence level. In this work,
we propose the first joint image-paragraph ranking model, trained on images and their corresponding paragraphs (i.e., news articles): given an image, the model ranks the news articles that best match it, and vice versa. We achieve
this correspondence with a pairwise ranking function and evaluate model
performance on benchmark datasets using the Image-Sentence Ranking task evaluation metric. The experimental results show that our model achieves
performance comparable to the state of the art.
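To make the pairwise ranking idea concrete, the sketch below shows one common margin-based formulation of a pairwise ranking loss; the abstract does not specify the exact loss used, so the function name, margin value, and toy scores here are illustrative assumptions, not the paper's actual implementation.

```python
def pairwise_ranking_loss(pos_score, neg_scores, margin=0.2):
    """Margin-based pairwise ranking loss (a common choice, assumed here).

    pos_score:  similarity of the matched image-paragraph pair.
    neg_scores: similarities of mismatched pairs for the same image.
    Each mismatched pair is penalized if it scores within `margin`
    of the matched pair, pushing matched pairs above mismatched ones.
    """
    return sum(max(0.0, margin - pos_score + s) for s in neg_scores)

# Toy example: the matched article scores 0.9; one hard negative (0.85)
# falls inside the margin and contributes 0.2 - 0.9 + 0.85 = 0.15.
loss = pairwise_ranking_loss(0.9, [0.3, 0.85, 0.1], margin=0.2)
print(loss)  # → 0.15
```

In practice such a loss is applied symmetrically (image-to-paragraph and paragraph-to-image) so that ranking works in both directions, as the model described above requires.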