Enhancing CLIP with GPT-4: Harnessing Visual Descriptions as Prompts

Contrastive pretrained large Vision-Language Models (VLMs) like CLIP have revolutionized visual representation learning by providing good performance on downstream datasets. VLMs are 0-shot adapted to a downstream dataset by designing prompts that are relevant to the dataset. Such prompt engineering makes use of domain expertise and a validation dataset. Meanwhile, generative pretrained models like GPT-4 have advanced to the point where they can serve as advanced internet search tools and can be manipulated to provide visual information in any structure. In this work, we show that GPT-4 can be used to generate text that is visually descriptive and how this can be used to adapt CLIP to downstream tasks. We show considerable improvements in 0-shot transfer accuracy on specialized fine-grained datasets like EuroSAT (~7%), DTD (~7%), SUN397 (~4.6%), and CUB (~3.3%) when compared to CLIP's default prompt. We also design a simple few-shot adapter that learns to choose the best possible sentences to construct generalizable classifiers that outperform the recently proposed CoCoOp by ~2% on average and by over 4% on 4 specialized fine-grained datasets. The code, prompts, and auxiliary text dataset are available at github.com/mayug/VDT-Adapter.


Introduction
Contrastive pre-training of large-scale VLMs has demonstrated remarkable image classification performance on open-set classes. Models like CLIP [25] and ALIGN [13] are pretrained on web-scale datasets consisting of image-text pairs (over 400 million and 1.8 billion respectively), resulting in highly generalizable models with competent 0-shot domain adaptation capabilities. While vanilla supervised training is performed on a closed set of concepts or classes, CLIP pretraining uses natural language. This results in a joint text-vision embedding space that is not constrained to a fixed set of classes. In CLIP, the classifier is constructed by plugging the class name into a predetermined prompt template like 'a photo of {class name}'. A straightforward way to adapt CLIP to different domains is prompt engineering, which usually involves modifying the prompt template to include semantic information about the target task. For example, to classify bird images, one could construct the prompt 'a photo of {classname}, a type of bird'. This prompt engineering process, however, is not optimal because it: 1) requires domain expertise in the target domain; 2) has high variance, since small changes to the prompt result in large variations in performance; 3) uses a fixed prompt template for all classes, so only the class name in the prompt provides the classification anchor, which might not contain enough information to distinguish different classes. For example, in Fig 1 we see an image of a Green Heron, whose name suggests that it is a predominantly green-colored bird; if we had never seen either bird, we would assume it is similar to the Green Woodpecker. However, it is in fact a blackish-brown bird with a chestnut-colored neck, visually more similar to a bird like the Black Bittern. For 0-shot transfer to fine-grained datasets like this to work well, CLIP must either have seen and associated images of a Green Heron with the text 'Green Heron' during its large-scale pretraining, or additional information in the form of visually descriptive textual (VDT) information is required. Here we define VDT as a set of sentences that describe the visual features of the class under consideration, including shape, size, color, environment, patterns, composition, etc. While most humans can identify many common bird species from their names alone, they would need access to an ornithological taxonomy of bird descriptions to identify rarer species. Similarly, we argue that CLIP's 0-shot accuracy can be improved by incorporating VDT information into the prompts. As shown in Fig 1, including VDT information like 'black crown' and 'black rump' moves the classification prototype of Green Heron away from that of Green Woodpecker and towards that of Black Bittern in the text-encoder's embedding space.

Figure 1: An example showing three birds: Green Heron, Green Woodpecker, and Black Bittern. Green Heron and Green Woodpecker have close-by classification prototypes by virtue of not having enough details in the prompt template. Only the text-encoder's embedding space is visualized. Here we see that adding visual descriptions to the prompt resolves this issue and moves the classification prototypes in the text-encoder's space such that the classification prototypes for visually similar birds (Green Heron and Black Bittern) lie together.
In this work, we first show that we can use VDT information for each class in the target domain to construct class conditional prompts that achieve performance improvements over CLIP's default prompt. We show this on the CUB dataset [1] by constructing sentences from domain-expert annotations about the bird species in Section 3.2.1, as they are readily available as part of the dataset.
However, we acknowledge that domain expert annotations are costly and time-consuming to obtain, hampering the scalability of our method to other datasets. To address this, we turn to recent advances in generative pretrained Large Language Models (LLMs) like GPT-4 to construct these class conditional prompts in a manner easily scalable to other datasets. These models are a good fit for the task of constructing sophisticated prompts because: 1) they are a condensed form of human knowledge (trained on web-scale text data) [32]; 2) they can be manipulated to produce information in any form or structure, which makes compatibility with CLIP's prompt style relatively simple. Therefore we use GPT-4 to construct visually descriptive textual information about the classes, with special emphasis in the GPT-4 prompts on visual cues like shape, color, structure, and compositionality. We use the generated VDT information to construct prompt ensembles that are passed through CLIP's text encoder and aggregated to generate classifiers that are then used for 0-shot classification. Using GPT-4 circumvents the need for domain knowledge and conveniently provides class conditional prompts. Prompt ensembling the VDT sentences reduces CLIP's performance sensitivity to small changes in the prompt. We show performance improvements over vanilla CLIP with the default prompt on 12 datasets, with an average improvement of 2% and even larger improvements on fine-grained datasets like EuroSAT (∼7%), DTD (∼7%), SUN397 (∼4.6%), and CUB (∼3.3%). The prompts and all the auxiliary class information will be made publicly available to promote research in prompt ensembling and multi-modal adapter design.
Finally, we design a simple adapter that learns to adaptively select and aggregate the best sentences for any given dataset and show that making use of this additional VDT information improves the few-shot domain transfer performance of CLIP as well. We demonstrate the few-shot adaptation performance in the recently proposed Base-to-New setting on a benchmark of 12 datasets and outperform recent methods like CoOp [35] and CoCoOp [34] despite having fewer model parameters, shorter training time, and a simpler model architecture.
In short, our contributions are as follows:

Related Works
While our GPT-generated prompt ensembles are similar to CLIP's prompt ensembles, CLIP's ensembles were constructed and tuned manually and are class-agnostic, whereas ours are generated by GPT models prompted to provide VDT information for each class.

Prompt Learning
CoOp [35] successfully used prompt learning in VLMs but had generalizability limitations due to overfitting on the few-shot dataset [34]. In response, CoCoOp was proposed, enhancing performance with image-conditioned prompt learning using a meta-network, albeit at a higher resource cost. We address generalizability differently, by using class conditional VDT information. Our simpler and more efficient model, CLIP-A-self, outperforms CoCoOp in the Base-to-New few-shot setting.

Few-shot adapters for Vision Language models
CLIP-Adapter [10] (CLIP-A) offers a simpler few-shot transfer method for VLMs, utilizing an MLP trained on top of fixed image/text encoders. Our CLIP-A-self differs from CLIP-A in that we apply a self-attention mechanism on the set of all sentences for any class, learning from the few-shot training set to select and aggregate the best subset of VDT information for the dataset. Although Tip-Adapter [33] showed superior performance on base classes with a cache model, it is inapplicable in the Base-to-New setting due to its reliance on few-shot examples of the test classes, making it irrelevant for our comparison.

Semantic information from Large Language Models
Recent advancements in transformer-based language models, particularly the GPT family [3,22], have demonstrated exceptional abilities in semantic extraction from intricate texts. Their application to vision tasks has emerged as an active area of research. [20] employs the PaLM 540B LLM [5] to generate semantic data for unsupervised class embedding vectors in 0-shot classification, but only tests on three legacy datasets; our research presents results on a modern benchmark of 12 datasets. Recently, [24,19] leveraged GPT-3 for class conditional prompts to enhance CLIP's 0-shot domain transfer on 6 datasets. While [19] focuses on using GPT-3 to construct visual descriptors that aid the interpretability of CLIP's predictions during 0-shot domain transfer, we argue that 0-shot domain transfer performance improves with the inclusion of high-quality VDT information. Hence, we make use of GPT-4 for richer, more diverse, and more accurate VDT information.
While [19] utilizes GPT-3, ensembles in probability space, and highlights VDT's role in 0-shot transfer, our method differs: we use GPT-4 for auxiliary data collection, perform the ensemble in the text-encoder's embedding space, and introduce a few-shot adapter for optimal VDT selection in few-shot transfer. [27] uses GPT-3 for prompt construction in diffusion models to generate images for support sets, while our work only uses GPT-4 to acquire auxiliary text data. To our knowledge, we are the first to prompt GPT-4 for visually descriptive sentences to improve CLIP's 0-shot and few-shot domain transfer.

Review of CLIP and CLIP-Adapter
Through contrastive pretraining on large image-text datasets, CLIP performs image classification on various concepts, aligning related images and texts in a shared embedding space while separating dissimilar ones. After pretraining, CLIP directly performs image classification on the target dataset without any finetuning. First, we review how the CLIP model performs 0-shot classification on an open set.
The CLIP model, comprising a vision and a language model, encodes an image and its corresponding caption into visual and textual embeddings, respectively. During inference, these embeddings are compared using cosine similarity. Given an image I ∈ R^{H×W×C}, where H, W, C denote the height, width, and number of channels of the image, the vision encoder transforms the image into the joint embedding space to get the image features f ∈ R^D, where D represents the dimension of the features.
During inference, a prompt template such as 'A photo of {classname}' is used to generate sentences for K different classes, which are passed through the text-encoder to yield the classifier weight matrix W ∈ R^{D×K}. Prediction probabilities are then calculated by multiplying the image feature f with W and applying a softmax function:

p(y = k | I) = exp(sim(f, w_k)/τ) / Σ_{j=1}^{K} exp(sim(f, w_j)/τ)    (1)

where w_k is the k-th column of W, sim(·,·) denotes cosine similarity, and τ is the temperature. In CLIP [25], 0-shot domain transfer utilizes domain-specific information in the prompt template, such as 'A photo of a {class-name}, a type of bird' for bird images. [25] reports that careful prompt design and prompt ensembling are important to improve 0-shot classification accuracy. Prompt ensembling is achieved by constructing several prompts for each class and then averaging the classification vectors. In our work, we show that prompt ensembles of VDT information improve CLIP's 0-shot domain transfer.
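The 0-shot classification step just described can be sketched in a few lines of NumPy; the embeddings, dimensions, and temperature value below are toy stand-ins, not CLIP's actual weights:

```python
import numpy as np

def zero_shot_probs(image_feat, text_feats, temperature=0.01):
    """Classify one image against K class prompts, CLIP-style.

    image_feat: (D,) image embedding from the vision encoder.
    text_feats: (K, D) text-encoder embeddings of the K class prompts.
    """
    f = image_feat / np.linalg.norm(image_feat)
    W = text_feats / np.linalg.norm(text_feats, axis=1, keepdims=True)
    logits = (W @ f) / temperature      # cosine similarities scaled by 1/tau
    logits -= logits.max()              # subtract max for numerical stability
    exp = np.exp(logits)
    return exp / exp.sum()              # softmax over the K classes

# toy example: 3 classes, 8-dimensional embeddings
rng = np.random.default_rng(0)
probs = zero_shot_probs(rng.normal(size=8), rng.normal(size=(3, 8)))
```

The predicted class is then simply `probs.argmax()`.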
CLIP-A [10] is a learnable MLP adapter applied to the image and/or text encoder features for few-shot transfer to target datasets. During few-shot transfer, given N images per class with labels, denoted as {(x_{i,k}, y_{i,k})}_{i=1,k=1}^{N,K}, the K classifier weights are constructed using the prompt template H and text encoder g as W = g(H(classname({y_{i,k}}))). The image features f and text features W pass through the learnable adapters A_v, A_t to get the adapted features as follows:

f* = α A_v(f) + (1 − α) f
W* = β A_t(W) + (1 − β) W
The hyperparameters α and β blend CLIP's pretrained knowledge with the fine-tuned knowledge to prevent CLIP-Adapter from overfitting. Logits are calculated as per Eqn 1, and the adapters are trained with cross-entropy loss over the entire few-shot training set. In the All setting, few-shot transfer is tested on a hold-out dataset with images from the same K classes used in training. In the Base-to-New setting, proposed by [34], the evaluation occurs on U non-overlapping classes. Our model is evaluated in the more practical Base-to-New setting.
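The residual blending controlled by α and β can be sketched as follows; the toy linear adapter here is an illustrative stand-in for the learned MLP:

```python
import numpy as np

def residual_adapt(features, adapter, ratio):
    """Blend adapted features with the original frozen CLIP features.

    ratio plays the role of alpha (visual) or beta (text):
    ratio=0 recovers frozen CLIP, ratio=1 uses only the adapter output.
    """
    return ratio * adapter(features) + (1.0 - ratio) * features

# toy adapter: a fixed linear map standing in for the learned MLP
D = 4
A = np.eye(D) * 0.5            # halves every feature
f = np.ones(D)
f_star = residual_adapt(f, lambda x: A @ x, ratio=0.2)
```

With ratio=0.2 each entry becomes 0.2 * 0.5 + 0.8 * 1.0 = 0.9, showing how a small residual ratio keeps the adapted features close to CLIP's.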

Language Model Prompt Design
In this section, we show that using VDT information in the prompt template improves CLIP's 0-shot transfer capabilities and describe our approach to generate class-specific prompts using an LLM.

Visual Descriptive Sentences
[25] demonstrates that careful prompt design and prompt ensembling improve the 0-shot classification performance of CLIP. Here we ask the question: what type of information can be appended to the prompt template to improve the 0-shot domain transfer performance? We show that appending visually descriptive information to the prompt template and ensembling improves the 0-shot performance over the default prompt and over prompts containing non-visual information.
Using the CUB dataset with expert annotations, we contrast the 0-shot performance of visual and non-visual prompt ensembles. For the visual prompts, we take class attribute vectors detailing attributes like color, pattern, and shape for 28 bird body parts, leading to 312 scores per bird. We use the most pronounced attribute-value pairs to form 28 visual prompts (denoted Visual-GT) such as 'A photo of Green Heron. Green Heron has a greenish-black head cap.' Conversely, for the non-visual prompts (denoted Non-Visual-GT), we collect information on bird calls, migration, behavior, and habitat, yielding 12 different prompts per class like 'A photo of Green Heron. The green heron's bird call is a loud, harsh skeow.'
We derive classification vectors for Visual-GT and Non-Visual-GT by averaging class-level sentence embeddings within CLIP's joint embedding space, respecting its 77-token limit. Table 1 shows no improvement using Non-Visual-GT prompts over the default, yet a 4% improvement with Visual-GT.
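The per-class averaging just described can be sketched as below; the embedding dimension and the final renormalization of the averaged prototype are our assumptions for the sketch, not details stated above:

```python
import numpy as np

def ensemble_classifier(sentence_embs):
    """Average one class's prompt embeddings into a single classifier vector.

    sentence_embs: (M, D) text-encoder embeddings of the M prompts for one
    class (e.g. the 28 Visual-GT prompts for a bird species).
    """
    # unit-normalize each sentence embedding before averaging
    embs = sentence_embs / np.linalg.norm(sentence_embs, axis=1, keepdims=True)
    w = embs.mean(axis=0)
    return w / np.linalg.norm(w)   # renormalized class prototype (assumed step)

M, D = 28, 512
w = ensemble_classifier(np.random.default_rng(1).normal(size=(M, D)))
```

Stacking the K per-class prototypes column-wise yields the classifier matrix W used for 0-shot prediction.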

Prompting LLMs for visually descriptive information
In the prior section, we highlighted the use of expert VDT information in creating class-specific prompts to enhance CLIP's 0-shot performance. However, acquiring expert annotations is both expensive and time-consuming. To overcome this, we utilize GPT language models, known for their large-scale knowledge and flexibility [32]. Our approach uses GPT-4 to generate visual descriptions for any given dataset, thereby aiding the construction of prompt ensembles for CLIP in a scalable manner. Our prompting strategy takes inspiration from chain-of-thought prompting [29] and is as follows: first, we ask GPT-4 to list all the attributes that may be necessary to discriminate between images of the K classes under consideration; second, we ask GPT-4 to provide the values of all these attributes for all K classes as sentences. An example for the CUB dataset is shown on the left side of Fig 1.
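The two-step strategy can be sketched as plain prompt construction; the wording below paraphrases the strategy and is not the paper's exact prompt text:

```python
def build_vdt_queries(dataset_name, class_names):
    """Construct the two queries of the two-step prompting strategy
    (paraphrased; the exact wording used in the paper may differ)."""
    step1 = (
        f"List the visual attributes (shape, color, structure, composition, "
        f"environment, etc.) needed to distinguish between images of these "
        f"{dataset_name} classes: {', '.join(class_names)}."
    )
    step2 = (
        "For each class, give the value of every attribute above as one "
        "short sentence. Return a Python dictionary mapping class name to "
        "a list of sentences."
    )
    return step1, step2

q1, q2 = build_vdt_queries("CUB bird", ["Green Heron", "Black Bittern"])
```

The first reply fixes the attribute list; the second reply then supplies one sentence per attribute per class, giving a fixed-size, well-structured VDT set.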
The last row in Table 1 shows that the performance of the GPT-4 generated visual sentences is similar to that of sentences generated from the class attribute vectors annotated by domain experts. We follow the same simple strategy for all the datasets in the benchmark suite to generate visually descriptive sentences in a scalable and flexible manner and use them to construct prompt ensembles.

Simple few-shot adapters for visual sentences
We design a simple adapter that can use VDT information to improve the few-shot transfer of CLIP to target datasets. Similar to CLIP-A's text adapter, we append a small set of learnable parameters to the output of the text encoder and train the adapter using cross-entropy loss. Our CLIP-A-self uses a self-attention layer that applies attention over the embeddings of the different sentences for each class and averages the output to get the final classification vector.
Given M GPT-generated sentences t_{m,k} for each of the K classes, we construct M prompts by appending each sentence to the prompt template as H(classname(y_{i,k}), {t_{m,k}}) and pass them through CLIP's text encoder to get W_sent ∈ R^{D×M×K}.
For the self-attention adapter, we apply vanilla self-attention [28] over all the visually descriptive sentences such that, during training, it learns to select and aggregate the most relevant visual sentences for identifying each class. Just as before, we first obtain the classification vectors for all sentences, W_s ∈ R^{K×M×D}, and pass them as the key, query, and value to the self-attention module B_self, then average the output tokens to get the final classification vector W*. Here the attention is applied over the M different visually descriptive sentences.
We finally obtain the new adapter classifier weights W* ∈ R^{D×K}, which have been adapted to focus on the most visually discriminative information among the M visually descriptive sentences for any given dataset. We make use of Eqn 1 to calculate the probabilities and predict the image category by selecting the class with the highest probability.
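A minimal single-head sketch of the CLIP-A-self computation for one class follows; the projection matrices stand in for the learnable parameters of B_self, and the single head, absence of residual connections, and final normalization are simplifying assumptions of this sketch:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention_adapter(W_s, Wq, Wk, Wv):
    """Attend over the M sentence embeddings of one class and average
    the outputs into a single adapted classifier vector.

    W_s: (M, D) sentence embeddings for one class.
    Wq, Wk, Wv: (D, D) projections (stand-ins for B_self's parameters).
    """
    Q, K, V = W_s @ Wq, W_s @ Wk, W_s @ Wv
    attn = softmax(Q @ K.T / np.sqrt(K.shape[1]), axis=-1)  # (M, M) weights
    out = attn @ V                                          # (M, D) tokens
    w_star = out.mean(axis=0)                               # average tokens
    return w_star / np.linalg.norm(w_star)

M, D = 6, 16
rng = np.random.default_rng(0)
Wq, Wk, Wv = (rng.normal(size=(D, D)) * 0.1 for _ in range(3))
w_star = self_attention_adapter(rng.normal(size=(M, D)), Wq, Wk, Wv)
```

Repeating this per class yields the K adapted classifier vectors; the attention weights themselves are what Section 4.5.1 inspects to see which VDT sentences the adapter relies on.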
During few-shot training, only the weights of the adapter network B_self are trained using cross-entropy loss.

Experiments
We assess the significance of visual sentence ensembles in two scenarios: (i) we gauge visual sentence quality by comparing an ensemble of these prompts with CLIP's default prompts across 12 benchmark datasets; (ii) we contrast the performance of adapters using these visual prompts against other few-shot transfer techniques in Base-to-New class generalization within a dataset. Prior to discussing the results, we detail the datasets and experimental setup.
For 0-shot transfer with visual sentences, we test on All classes across these datasets, while for the Base-to-New setting, following [34], we equally sample classes for the base and new sets without overlap. For CUB, we use the 150-base and 50-new class split from the ZSL and few-shot literature [30,18]. Like [34], our CLIP-A-self is evaluated in the 16-shot setting for easier comparison with other methods.

Baselines
We compare the performance of the visual sentence ensemble on 0-shot transfer against the CLIP model [25], whose default prompts for each dataset have been extensively fine-tuned using a test set. We also compare against DCLIP [19], a recent work that uses GPT-3 to generate VDT information for 0-shot transfer. We compare our CLIP-A-self against two prompt learning methods: CoOp [35], which learns static prompts, and CoCoOp [34], which learns a dynamic prompt specifically designed to improve Base-to-New transfer. We also compare our CLIP-A-self against CLIP-A [10] due to the similarity in architecture, and to show that the performance improvements come from making use of the visual sentences and not from just adapting the text features.

Training settings
Our implementation is based on CoOp's and CLIP-A's code. We make all our comparisons on the ViT CLIP backbone, i.e., ViT-B/16. We take the results for CoOp and CoCoOp for all datasets (except CUB) from their respective papers, while we follow practices from those papers, like setting the context length to 4 and initializing the context to "a photo of", to ensure the best results on the CUB dataset. For CLIP-A, we re-run all experiments on the ViT-B/16 backbone as they were not reported in the paper. For all adapter models, including ours, we only tune the residual ratio β hyper-parameter. For CLIP-A, we use the version where the MLP is applied on top of the visual encoder, as it performed the best [10]. We make use of the May version of GPT-4 for obtaining the auxiliary dataset.

Our GPT-4 generated prompt ensemble improves upon the performance of CLIP 0-shot by 2% on average over 12 datasets. The improvement over CLIP-ZS is significant: over 5% for specialized fine-grained datasets like CUB, SUN397, EuroSAT, and DTD, and over 2% for Oxford Flowers and Oxford Pets. This shows that CLIP does not recognize several of the classnames in these datasets, and describing the class in the form of visually descriptive sentences results in better classifiers from the text-encoder and better classification accuracy. It is also worth noting that including the visually descriptive sentences in the prompts helps improve the performance on general datasets like ImageNet (over 4%) and Caltech-101 (over 1%) too. For all other datasets, the transfer performance matches that of CLIP, the exception being the action recognition dataset UCF-101. We inspected the sentences generated for UCF-101 and noticed that several of the sentences generated by GPT involve temporal information instead of visual descriptions, and we believe this could be the reason for the drop in accuracy. However, we note in Section 4.5.1 that the self-attention module of the few-shot adapter learns to emphasize the visual sentences out of the generated sentences, which might explain the improvement in the performance of few-shot adapters in the New setting in Section 4.5. We also compare against recent work [19]: using the GPT-4 model over the GPT-3 model results in much higher improvements for specialized datasets like DTD (∼5%) and EuroSAT (∼6%). We compare the text used by [19] against our GPT-4 generated VDT in the supplementary.

GPT-Adapters improve few-shot transfer performance.
We compare the performance of our CLIP-A-self against CLIP, CoOp, and CoCoOp on the benchmark suite of 12 datasets in the Base-to-New setting in Table 5. Here we see that GPT-Adapters that make use of the VDT information outperform CoCoOp by 3% in the New setting while maintaining performance similar to that of CoOp in the Base setting, on the average accuracy over 12 datasets. This is impressive considering that CoCoOp makes use of a meta-network and a forward pass through the text encoder, making it computationally intensive to train. CoCoOp takes up to 5 hours to train on 16-shot ImageNet for the ViT-B/16 encoder; in comparison, our CLIP-A-self takes only 10 minutes (on an RTX 3090 GPU). The Base-to-New generalization ability of our adapters is even more impressive for fine-grained, specialized datasets, as evidenced by the gains over CoCoOp in the harmonic mean of Base and New accuracy. For example, CLIP-A-self demonstrates gains on datasets like FGVC-Aircraft (7.5%), EuroSAT (7.4%), DTD (5.8%), CUB (4.3%), Flowers102 (4%), Stanford Cars (2.4%), and UCF-101 (2.4%). This demonstrates that our adapters make use of semantic information in the form of visually descriptive sentences and fuse it with CLIP's 0-shot knowledge to build more generalizable classifiers that transfer well to unseen classes within the same dataset. It is also worth noting that even though the same set of VDT did not provide any improvements in 0-shot domain transfer for datasets like FGVC-Aircraft, Stanford Cars, and UCF-101, our self-attention adapter was able to choose the most informative subset of VDT and produce few-shot classifiers that provide substantial few-shot transfer performance gains in comparison to CoCoOp. We show in Section 4.5.1 the sentences picked by the attention mechanism for these datasets to qualitatively verify this.

Attention weights Analysis
We note that even though CLIP-gpt ensembles were outperformed by CLIP's default prompt on the FGVC Aircraft, UCF-101, and Stanford Cars datasets, CLIP-A-self outperforms CLIP-A and CoCoOp [34] on these datasets in the few-shot transfer setting. We believe this is because, during few-shot training, the self-attention mechanism learns to select the most relevant visual sentences out of the set of visually descriptive text, which helps produce generalizable classifiers. In Table 1 in the supplementary, we show the top 3 and bottom 3 attributes picked by attention scores for each of these datasets and show that the sentences with the highest attention scores correspond to visually descriptive attributes in the set, and vice versa for the lowest-scored attributes. For example, for both Stanford Cars and FGVC Aircraft it is interesting to see that the color scheme is one of the least-used attributes, as it is difficult to identify a car or a plane from its color or livery. For UCF-101, information like the force involved, or temporal information like the speed and range of motion of the action, is unlikely to be encoded in the image and hence is not selected by the attention mechanism. Information regarding the subject and the object of the action, like the posture of the person, the description of the object, and the interaction between objects, is visible in the images and hence weighted highly by the attention mechanism.

Ablation over different GPT models
In this section, we examine whether other GPT models, like GPT-3.5 and the open-source model OpenAssistant [15], are as capable as GPT-4 at generating visually descriptive information. We explore this on the CUB dataset as it is fine-grained and specialized. The results are presented in Table 6. We find that performance improves with larger models, which are more capable of memorizing accurate class information with less hallucination [32]. Even though we obtain decent performance with the open-source model OpenAssistant, its outputs were inconsistent and noisy, resulting in a lot of clean-up effort in comparison to GPT-3.5 and GPT-4, whose outputs were in the form of concise sentences following a dictionary format. It is worth noting that our few-shot adapter is capable of picking out the best VDT information even from a noisy set, pushing the Base-to-New generalization performance of OpenAssistant and GPT-3.5 close to that of GPT-4.

Conclusion
In this work, we show that using visually descriptive textual (VDT) information can improve the 0-shot domain transfer performance of CLIP over non-visual information and the default prompts. We demonstrate GPT-4 to be an accurate and flexible source of VDT information by improving the 0-shot domain transfer performance on a suite of 12 datasets.

We visualize the attention weights learned by CLIP-A-self for the datasets Stanford Cars, UCF101, FGVC Aircraft, Oxford Flowers, and CUB in Table 1. We notice that the self-attention mechanism in CLIP-A-self assigns more weight to the visually descriptive sentences that are most relevant for discriminating between the classes of the dataset under consideration. For instance, we see that for discriminating images of bird species (CUB) and flower species (Oxford Flowers), sentences describing the color of the head and wings of birds and the petals of flowers are important, but for identifying different car or aircraft models, sentences describing the color or livery are among the least important. We also see that if the information described by a VDT sentence is not clearly visible in the image, the attention weight assigned to it by CLIP-A-self is low. For instance, the undersides of birds in the CUB dataset or the sepals in the Oxford Flowers dataset are often not visible in the images, hence the corresponding VDT sentences are in the bottom 3 attributes picked by the learned attention weights. It is also worth noting that some of the VDT sentences do not vary much between different classes and hence are not useful for discriminating between the classes of the dataset. For instance, in Oxford Flowers, the color of the leaves and the color of the stem are green for most flowers in the dataset, which may be why a low attention score was learned for these attributes.

Prompts for GPT-4
Throughout our experiments, we use a two-step prompting strategy in which we first ask the LLM to generate a list of attributes that will aid in visually distinguishing between the different classes in a particular dataset. The second prompt asks the LLM to create a description using the attributes provided by the first prompt and specifies the expected output format. We request a Python dictionary as output, with a list of sentences each corresponding to one attribute. The output structure is simple to use downstream, preserves attribute-level detail, and encourages attribute richness.
Example of first prompt for the FGVC Aircraft dataset:

The response of the second prompt constitutes the VDT information we utilise as side-information; for Airbus A340-200, as an example:

"A340-200": ["The Airbus A340-200 is produced by Airbus, a renowned aircraft manufacturer.", "It differentiates itself from other aircraft within the Airbus family through its unique model number: A340-200.", "This aircraft primarily serves a commercial role, typically used for passenger transport.", "The Airbus A340-200 is equipped with four engines.", "These engines are situated under the aircraft's wings.", "The aircraft features a low-wing design, with wings positioned at the bottom of the fuselage.", "It has a traditional tail configuration, common to many large commercial aircraft.", "The A340-200 has a lengthy fuselage, extending to about 59.4 meters.", "The body of the Airbus A340-200 is wide-bodied, facilitating a larger passenger capacity.", "Its wings are swept back, a design aspect that improves fuel efficiency and performance at high speeds.", "The aircraft features a rounded nose shape, contributing to its aerodynamic design.", "The Airbus A340-200 uses a tricycle type landing gear, supporting stability during takeoffs and landings.", "Its cockpit windows are angular and include six panels, giving pilots a comprehensive view of their surroundings.", "Color schemes vary by airline, but the Airbus corporate livery features a predominantly white body with blue and teal accents.", "This model is a single-deck aircraft, focusing on width rather than height for passenger capacity.", "The A340-200 does not have winglets, differing from some newer Airbus models.", "There are no canards present on the Airbus A340-200; instead, it employs a more traditional aircraft design.", "As a jet-powered aircraft, the A340-200 uses high-speed jet engines for propulsion.", "The A340-200 typically accommodates around 260 passengers, though the exact number can vary with the configuration.", "With a range of approximately 7,800 nautical miles, the Airbus A340-200 can cover considerable distances without refueling.", "The aircraft's four-engine configuration and lengthy, wide-bodied design are unique visual identifiers of the A340-200 model.", "Classified as a commercial aircraft, the Airbus A340-200 is primarily used for passenger transportation."]

GPT-4 generally adheres to the Python dictionary output requirement in the User prompt, but tends to return additional explanations, motivations, or clarifications. To encourage the LLM to only return a Python dictionary as requested, we add the following System prompt: "You are ChatGPT, a large language model trained by OpenAI. Return only the python dictionary, with no explanation."
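For illustration, a hypothetical helper for recovering the dictionary from a reply that includes extra text might look like the following; `parse_vdt_response` is our own name for this sketch, not a function from the paper's codebase:

```python
import ast

def parse_vdt_response(raw):
    """Parse an LLM reply into {class_name: [sentences]}.

    Assumes the reply contains a Python-dict literal, as the System
    prompt requests; raises ValueError otherwise.
    """
    start, end = raw.find("{"), raw.rfind("}")
    if start == -1 or end == -1:
        raise ValueError("no dictionary found in response")
    vdt = ast.literal_eval(raw[start:end + 1])     # safe literal parsing
    if not all(isinstance(v, list) for v in vdt.values()):
        raise ValueError("unexpected structure")
    return vdt

# toy reply with the extra preamble GPT-4 sometimes adds
reply = 'Sure! {"A340-200": ["The Airbus A340-200 is produced by Airbus."]}'
vdt = parse_vdt_response(reply)
```

Using `ast.literal_eval` rather than `eval` avoids executing arbitrary code from a model reply.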
Conversely, OpenAssistant's [15] output requires manual cleaning and reformatting to get into Python dictionary format. GPT-3.5 performed slightly worse than GPT-4 in terms of adherence to the prompt, as it did not consistently return only a dictionary. In such cases, we simply called the API again; after repeated incorrectly formatted responses, we manually cleaned those cases.
We primarily utilized GPT-4 via the ChatGPT Plus subscription plan at a cost of $20 per month, since the GPT-4 API was not generally available during most of our experimentation phase. The GPT-4 API cost to create the VDT information for the SUN397 dataset was $14.90, as opposed to $1.94 using the GPT-3.5 API.

Comparing our VDT with GPT3
In Table 2, we compare the VDT generated by GPT-4 using our prompting technique with that of [19], who used GPT-3 to obtain visual descriptors for the different classes of each dataset. We notice that including a prompting step that asks GPT-4 for the visual attributes necessary for classifying images of the classes results in a fixed number of sentences per class in a fixed order, guaranteeing that every class is accompanied by as much visual information as possible. By using GPT-4 we also get much richer and more accurate visual descriptions. For example, for the class industrial, our descriptions provide information about the density of buildings, shadows in the image, road accessibility, and layout, while the description used by [19] is only 'evidence of human activity'. A similar phenomenon can be observed for the DTD dataset. This explains the jump in performance over DCLIP for specialized datasets like DTD and EuroSAT.

Generalizability at lower shots
In Figure 1, we compare the harmonic mean of the Base and New accuracies of CLIP-A-self with that of CLIP-A over number of shots = 1, 5, 10, 16. Our CLIP-A-self demonstrates performance improvements even at lower shots, outperforming CLIP-A on average by over 1.5% in the 1-shot case and over 2.5% in the 5-shot case. Our adapter shows higher improvements over CLIP-A in the higher-shot scenario because of its number of parameters and the inherent difficulty of identifying the VDT sentences that are discriminative for the current classes in the low-shot scenario. For instance, identifying the class from a single image is often difficult because of co-occurring objects, environment, background, etc., which can be resolved if we have more example images from the same class. The largest improvements are for specialized and fine-grained datasets like Stanford Cars, EuroSAT, Oxford Flowers, DTD, and CUB. Oxford Pets and Food-101 results do not improve much because these datasets are relatively easy and already show good performance with default CLIP.

Figure 2 :
Figure 2: CLIP-A-self, our simple self-attention based adapter, learns to select and aggregate the most relevant subset of Visually Descriptive Text (VDT) to generate more generalizable classifiers. First, we prompt GPT-4 to generate VDT, N sentences for each of the K classes, which are then passed through the text encoder to get embeddings for each of the N*K sentences. Self-attention is applied over the N sentences of each class and averaged to get K adapted classifier embeddings.

Table 1 :
Comparing visual and non-visual prompt ensembles for 0-shot domain transfer to the CUB dataset.

Table 2 :
Results of including LLM-generated VDT on 6 datasets for comparison with other works. We see that higher quality VDT from GPT-4 outperforms GPT-3 generated VDT on specialized datasets like DTD, Oxford Pets, and EuroSAT.

Table 4 :
Comparing our CLIP-A-self against other methods on average accuracy over 12 datasets.

Table 5 :
Comparison of GPT-Adapters with CLIP, CoOp, and CoCoOp in the Base-to-New generalization setting. For prompt learning-based methods (CoOp and CoCoOp), the prompts are learned from the base classes (16 shots). The results strongly justify the importance of including extra visual information. H denotes the harmonic mean (to highlight the generalization trade-off [30]).

Table 1 :
The top 3 and bottom 3 attributes selected by the attention mechanism in CLIP-A-self for 3 different datasets. For UCF101, we see that the attention learns to pick visually descriptive sentences, like posture and descriptions of objects, over temporal information like speed of motion and force applied.

Table 2 :
Comparing our VDT with the descriptors from [19] for 2 random classes from the DTD and EuroSAT datasets.