We address the problem of visual instance search, which consists of retrieving all
the images within a dataset that contain a particular visual example provided to
the system. The traditional approach to processing image content for this task
relied on extracting local low-level information within images that was “manually
engineered” to be invariant to different image conditions. One of the most popular
approaches uses the Bag of Visual Words (BoW) model on the local features to
aggregate the local information into a single representation. Usually, a final
re-ranking stage is included in the pipeline to refine the search results. Since the
emergence of deep learning as the dominant technique in computer vision in 2012,
much research attention has been focused on deriving image representations from
Convolutional Neural Network (CNN) models for the task of instance search as a
“data driven” approach to designing image representations. However, one of the main
challenges in the instance search task is the lack of annotated datasets for fitting
the parameters of CNN models.
This work explores the capabilities of descriptors derived from CNN models
pre-trained for image classification to address the task of instance retrieval. First, we
conduct an investigation of the traditional bag of visual words encoding on local
CNN features to produce a scalable image retrieval framework that generalizes well
across different retrieval domains. Second, we propose to improve the capacity of the
obtained representations by exploring an unsupervised fine-tuning strategy that allows
us to obtain better-performing representations at the price of reduced generalization
across domains. Finally, we propose using visual attention models to weight
the contribution of the relevant parts of an image to obtain a very powerful image
representation for instance retrieval without requiring the construction of a large
and suitable training dataset for fine-tuning CNN architectures.
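
As a rough illustration of the first and third contributions, the following minimal sketch encodes a set of local features as a bag of visual words and optionally weights each local feature by an attention score. It is not the implementation used in this work: the function names, the hard nearest-centroid assignment, the L2 normalization, and the toy feature, codebook, and saliency values are all assumptions made for the example. In practice the local features would come from a convolutional layer of a pre-trained CNN and the codebook from clustering such features.

```python
import numpy as np

def assign_visual_words(local_features, codebook):
    """Hard-assign each local feature to its nearest codebook centroid."""
    # Squared Euclidean distances, shape (n_locations, n_words).
    diffs = local_features[:, None, :] - codebook[None, :, :]
    return (diffs ** 2).sum(axis=2).argmin(axis=1)

def bow_histogram(assignments, n_words, weights=None):
    """Build an L2-normalized bag-of-words histogram.

    `weights` (hypothetical attention scores, one per local feature)
    scales the contribution of each feature; None means uniform weights.
    """
    hist = np.bincount(assignments, weights=weights, minlength=n_words)
    hist = hist.astype(np.float64)
    norm = np.linalg.norm(hist)
    return hist / norm if norm > 0 else hist

# Toy stand-ins: 4 local features of dimension 2 and a 2-word codebook.
features = np.array([[0.0, 0.0], [0.1, 0.0], [1.0, 1.0], [0.9, 1.0]])
codebook = np.array([[0.0, 0.0], [1.0, 1.0]])
saliency = np.array([0.5, 0.5, 1.0, 2.0])  # assumed attention scores

assignments = assign_visual_words(features, codebook)
plain = bow_histogram(assignments, n_words=len(codebook))
weighted = bow_histogram(assignments, n_words=len(codebook), weights=saliency)
```

With uniform weights both visual words contribute equally in this toy case, whereas the saliency-weighted histogram shifts mass toward the word whose features received higher attention scores. Two images can then be compared through the dot product of their normalized histograms.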