Multimodal neural machine translation (MNMT) systems trained on a combination of visual and textual inputs typically produce better translations than systems trained on textual inputs alone. The task of such systems can be decomposed into two sub-tasks: learning visually grounded representations from images and translating the textual counterparts using those representations. In a multi-task learning setup, translations are generated by an attention-based encoder-decoder, while the grounded representations are learned by convolutional neural networks (CNNs) pretrained for image classification.
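As a minimal illustrative sketch (not the thesis's actual model), the fusion of the two sub-tasks can be pictured as a decoder state attending over CNN region features; all names and dimensions below are assumptions chosen for the example:

```python
import numpy as np

def attend_to_image(decoder_state, image_regions):
    """Dot-product attention of one decoder state over CNN region features.

    decoder_state: (d,) query vector from the text decoder (hypothetical)
    image_regions: (r, d) grounded features, e.g. pooled CNN feature-map regions
    Returns an attention-weighted image context vector of shape (d,).
    """
    scores = image_regions @ decoder_state           # (r,) similarity scores
    weights = np.exp(scores - scores.max())          # numerically stable softmax
    weights /= weights.sum()                         # attention distribution over regions
    return weights @ image_regions                   # (d,) image context vector

rng = np.random.default_rng(0)
state = rng.normal(size=4)            # toy 4-dim decoder state
regions = rng.normal(size=(6, 4))     # 6 image regions, 4-dim features each
context = attend_to_image(state, regions)
print(context.shape)  # (4,)
```

The context vector would then be combined with the textual representation when predicting the next target word; real MNMT systems differ in exactly where and how this fusion happens.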
In this thesis, we study computational techniques for translating the meaning of sentences from one language into another, treating the visual modality as a naturally occurring meaning representation that bridges languages. We examine the behaviour of state-of-the-art MNMT systems from a data perspective in order to understand the role of both textual and visual inputs in such systems. We evaluate our models on Multi30k, a large-scale multilingual multimodal dataset publicly available for machine learning research. Our results in the optimal and sparse data settings show that differences in translation performance are proportional to the amount of both visual and linguistic information, whereas in the adversarial condition the effect of the visual modality is small or negligible. The chapters of the thesis follow a progression: applying state-of-the-art MNMT models that incorporate images in optimal data settings, creating synthetic image data in a low-resource scenario, and finally adding adversarial perturbations to the textual input to evaluate the real contribution of images.