arXiv:2207.07635v1 [cs.CV] 15 Jul 2022
Is a Caption Worth a Thousand Images?
A Controlled Study for Representation Learning
Shibani Santurkar
Stanford
shibani@stanford.edu
Yann Dubois
Stanford
yanndubs@stanford.edu
Rohan Taori
Stanford
rtaori@stanford.edu
Percy Liang
Stanford
pliang@cs.stanford.edu
Tatsunori Hashimoto
Stanford
thashim@stanford.edu
Abstract
The development of CLIP [RKH+21] has sparked a debate on whether language supervi-
sion can result in vision models with more transferable representations than traditional image-
only methods. Our work studies this question through a carefully controlled comparison of
two approaches in terms of their ability to learn representations that generalize to downstream
classification tasks. We find that when the pre-training dataset meets certain criteria—it is suf-
ficiently large and contains descriptive captions with low variability—image-only methods do
not match CLIP’s transfer performance, even when they are trained with more image data.
However, contrary to what one might expect, there are practical settings in which these criteria
are not met, wherein added supervision through captions is actually detrimental. Motivated by
our findings, we devise simple prescriptions to enable CLIP to better leverage the language
information present in existing pre-training datasets.
1 Introduction
Image-based contrastive learning approaches have shown promise in building models that gener-
alize beyond the data distributions they are trained on [WXY+18; HFW+20; CKN+20; CMM+20;
CKS+20; CKS+20; CTM+21; CH21]. By leveraging large-scale (unlabelled) data sources through
self-supervised training, these models learn representations that transfer to diverse downstream
tasks—more so than their supervised counterparts [EGH21].
Recently, Radford et al. [RKH+21] showed that a different approach—contrastive learning with
language supervision—can yield models (known as CLIP) with remarkable transfer capabilities.
This development has garnered significant interest in the vision and natural language processing
communities alike, leading to a debate on the utility of multi-modality in representation learn-
ing [ZWM+22; DCB+21; FIW+22]. Our work focuses on a specific question within this debate:
Does language supervision lead to more transferable representations than using images alone?
It might seem like the answer to this question is obvious. After all, CLIP utilized caption in-
formation unavailable to traditional image-based approaches and showed substantial gains over
[Figure 1 (illustration): the two pre-training pipelines are shown side by side. In image-only supervision (e.g., SimCLR), an image x is paired with an augmented view x+ ∼ T(x); in image-language supervision (e.g., CLIP), x is paired with a caption, with a consistency objective in both cases. The example captions shown for the pictured street scene are: “A row of five 227 Back to Santa Maria on market day. Santa Maria De Palautordera, Cancell del Montseny, Barcelona, Spain.”; “duffel van parked on street beside small buildings and brick building”; “choosing the parking meters on this street should be very difficult”; “the car is parked on the side of the road by the tall buildings”.]
Figure 1: A conceptual view of contrastive image-only and image-language pre-training. The two
methods rely on the same self-supervised objective: aligning the representations of positive pairs
(x, x+) while distinguishing them from negative examples (e.g., other examples in the batch). The
transformation T(·) which is used to obtain x+ ∼ T(x) (augmented image or caption) encodes the
equivalences we would like the model to satisfy.
prior work [RKH+21]. However, CLIP is drastically different from these approaches in many
ways, from training data to fine-grained implementation choices [DCB+21], which makes it dif-
ficult to isolate the contribution of language supervision. Further, recent studies on CLIP’s zero-
shot classification and robustness properties cast doubt on whether adding language supervision
is always beneficial [FIW+22]. Resolving the aforementioned debate thus requires a carefully con-
trolled comparison of the two approaches in which the only difference is the form of supervision.
Our contributions. We devise a methodology to assess the utility of language supervision from
a representation learning standpoint. To do so, we recognize that CLIP and popular image-based
methods share the same underlying primitive of contrastive learning. In particular, CLIP is con-
ceptually strikingly similar to SimCLR [CKN+20]. Perhaps the only irreducible difference between
them is whether supervision is provided to the model via image augmentations or image-caption
matching (see Figure 1)—which is precisely the quantity we want to study. Thus, we can assess
the value of language supervision by systematically comparing appropriately matched versions
of SimCLR and CLIP¹ in terms of their downstream transfer performance. We find that in practice,
the picture is nuanced and depends on three properties of the pre-training dataset:
1. If the scale of the dataset is sufficiently large, CLIP representations transfer better than their
SimCLR counterparts. This gap is not bridged by training SimCLR with more data, suggest-
ing that a caption can be worth more than any number of images. However, in the low-data
regime, language supervision actually hurts model performance both in- and out-of-distribution.
2. The descriptiveness [KGP21] of dataset captions—i.e., the extent to which they report what is
contained in an image—directly determines how well the resulting CLIP models transfer. In
1We use CLIP to mean models trained using Radford et al. [RKH+21]’s approach, and not their pre-trained model.
fact, we find that a single descriptive image-caption pair (e.g., from MS-COCO [LMB+14]) is
worth five less descriptive, uncurated captions (e.g., from YFCC [TSF+16]).
3. The variability of captions within a dataset (e.g. due to stylistic or lexical factors) can ad-
versely affect CLIP’s performance. We propose a modification to standard CLIP training—
performing text data augmentations by sampling from a pool of captions for each image—to
alleviate this effect.
Overall, we find that these three properties have inter-twined effects on CLIP’s performance.
For instance, the scale of the widely-used YFCC dataset can, to some extent, compensate for its
less-descriptive and variable captions. Guided by our findings, we devise simple interventions
on datasets that can lead to more-transferable CLIP models: (i) filtering out low-quality captions
through a text-based classifier, and (ii) applying data augmentation to captions by paraphrasing
them using pre-trained language models [WK21].
2 An apples-to-apples comparison
As stated in Section 1, our goal is to assess the value of language supervision in representation
learning relative to using images alone. While there have been studies of image-only and image-
language pre-training methods in isolation [WXY+18; HFW+20; CKN+20; CMM+20; CKS+20;
CKS+20; CH21; CTM+21; RKH+21] and side-by-side [DCB+21; FIW+22], none of these works
conclusively answer our motivating question due to confounders such as: (i) algorithmic and
architectural variations, and (ii) differing pre-training datasets.
In this section, we outline a series of steps that we take to mitigate these confounders and
compare the two methods on equal footing. Since our focus is on representation learning, we
measure performance in terms of the usefulness of a model’s representations for downstream
tasks, using the evaluation suite of Kornblith et al. [KSL19] to do so. We focus on the fixed-feature
setting where we freeze the weights of a given model and then train a linear probe using task data.
Details of our experimental setup are presented in Appendix A.
2.1 Finding common ground
Our approach for isolating the effects of language supervision is guided by the following insight:
CLIP shares a fundamental commonality with widely-used image-only pre-training methods.
Namely, they rely on the same algorithmic primitive of contrastive learning, which we illustrate
in Figure 1. In both cases, the model is trained with a self-supervised objective: given an image
x, it must distinguish positive examples x+ ∼ T(x) from negative ones (e.g., other examples x̂ in
the batch). The choice of transformation T(·) is at the core of contrastive learning as it controls the
equivalences encoded in model representations [DBU+21; HWG+21]. T(x) corresponds to image
augmentations (e.g., rotations) in image-only methods and natural language captions in CLIP.
Thus, to understand the role of language supervision, we can compare CLIP to its closest
image-only equivalent: SimCLR.² Both CLIP and SimCLR rely on a cross-entropy-based objective,
which for a given pair (x, x+) of positive examples with associated negatives N is
$$\ell = -\log \frac{\exp(\mathrm{sim}(z, z^{+})/\tau)}{\sum_{n \in \mathcal{N} \cup \{z^{+}\}} \exp(\mathrm{sim}(z, z_{n})/\tau)}, \quad \text{where } z = g(\phi(x)) \text{ and } z^{+/n} = g'(\phi'(x^{+/n})), \tag{1}$$
2Other image-based methods [HFW+20; CKS+20; CH21; CTM+21] have optimizations that are not present in CLIP.
where sim(·, ·) is cosine similarity, φ/φ′ are the encoders, and g/g′ are the projection heads.
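For concreteness, a minimal PyTorch sketch of the CLIP-style (two-encoder) instantiation of Eq. (1) is shown below. It illustrates the objective rather than reproducing our exact training code (which follows the open-source implementations referenced in Appendix A.2); the function and variable names are illustrative.

```python
import torch
import torch.nn.functional as F

def contrastive_loss(z_img, z_txt, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of projected embeddings.

    z_img, z_txt: [batch, dim] tensors, i.e., g(phi(x)) and g'(phi'(x+)).
    The matching (image, caption) pair is the positive; every other caption
    (resp. image) in the batch serves as a negative, as in Eq. (1).
    """
    z_img = F.normalize(z_img, dim=-1)               # unit-normalize so that the
    z_txt = F.normalize(z_txt, dim=-1)               # dot product is cosine similarity
    logits = z_img @ z_txt.t() / temperature         # [batch, batch] similarity matrix
    targets = torch.arange(z_img.shape[0], device=z_img.device)
    loss_i2t = F.cross_entropy(logits, targets)      # image -> caption direction
    loss_t2i = F.cross_entropy(logits.t(), targets)  # caption -> image direction
    return 0.5 * (loss_i2t + loss_t2i)
```

For SimCLR, z_txt would instead be the embedding of a second augmented view of the same image and, unlike CLIP, transformed views of the other images in the batch also act as negatives (see the mismatches listed below).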
We now discuss the steps we take to alleviate other inconsistencies (aside from T(x)) between
CLIP and SimCLR:
• Transformation stochasticity: We first note that the two methods differ in how they obtain x+,
not just due to the choice of T(x) but also the generative process itself. In SimCLR, x+ is
a new random draw from T(x) in every batch, while for CLIP, it is a single fixed caption.
Perfectly matching them requires training CLIP by sampling a fresh caption x+ for each
image at each iteration. We will refer to this idealized version of CLIP as CLIPS.
• Image augmentations: Both methods apply data augmentations to the image x at each step
in training. However, the specific augmentations used in CLIP (resize and crop) differ
from those used for SimCLR (resize, crop, flip, jitter, blur, grayscale). We remove this
confounder by using standard SimCLR augmentations unless otherwise specified.
• Architecture: We use the ResNet-50 [HZR+16] architecture as the image encoder for both
methods, and a Transformer [VSP+17] as the text encoder (for captions) in CLIP.
• Datasets: Typically, CLIP and SimCLR are trained on different datasets, as the former requires
matched image-caption pairs, while the latter can leverage any computer vision dataset. To
control for the effect of the data distribution, we pre-train both models on the same datasets:
starting with the relatively controlled MS-COCO (Section 2.2) and then moving to larger-scale datasets (Section 3).
• Hyperparameters: We extensively tune hyperparameters for both methods (Appendix A.3).
Mismatches. Despite our efforts to match CLIP and SimCLR, there are some inconsistencies that
we are unable to account for—partly due to the differences in their modalities. In particular, CLIP:
(i) Processes T(x) using a text transformer rather than SimCLR’s ResNet-50.
(ii) Does not share weights between the encoders processing x and T(x) because they corre-
spond to different modalities, unlike SimCLR.
(iii) Uses a linear projection head g/g′ instead of SimCLR’s MLP, which we allow as Radford
et al. [RKH+21] showed that this choice does not affect CLIP’s performance.
(iv) Only uses other examples in the batch from the same modality as negatives. Thus CLIP has
half the number of negatives compared to SimCLR, which also uses transformed versions of
other examples in the batch (i.e., both x̂ and x̂+) as negatives.
In Sections 2.2 and 3.2, we take a closer look at how our matched versions of the CLIP and
SimCLR methods compare in terms of downstream transfer performance.
2.2 A COCO case study
We begin our study by comparing CLIP and SimCLR models trained on the MS-COCO
dataset [LMB+14] (henceforth referred to as COCO), which contains 120K images with multi-
object labels. Each image has five human-provided captions, collected post-hoc by Chen et al.
[CFL+15] using Amazon Mechanical Turk. Annotators were given detailed instructions on how
to caption an image such as to describe only the important parts of the image, not to use proper
names, and to use at least 8 words.
Model | COCO | Aircraft | Birdsnap | Caltech101 | Caltech256 | Cars | CIFAR10 | CIFAR100 | DTD | Flowers | Food-101 | Pets | SUN397 | µTx
Supervised | 90.6 | 31.6 | 11.8 | 65.8 | 53.7 | 21.7 | 74.8 | 46.7 | 55.9 | 63.4 | 47.1 | 45.9 | 44.5 | 47.2 ± 0.2
SimCLR | 89.0 | 40.6 | 18.5 | 71.5 | 58.6 | 31.5 | 82.1 | 57.3 | 61.7 | 77.4 | 58.7 | 57.3 | 51.9 | 56.0 ± 0.2
CLIP | 88.4 | 41.4 | 17.6 | 73.2 | 60.4 | 35.8 | 83.6 | 60.8 | 65.7 | 80.5 | 60.9 | 57.0 | 50.8 | 57.5 ± 0.1
CLIPS | 89.8 | 46.4 | 20.0 | 78.4 | 65.6 | 41.5 | 84.6 | 62.5 | 66.7 | 83.9 | 65.3 | 61.2 | 54.9 | 61.3 ± 0.2
Table 1: Linear probe accuracy for COCO pre-trained models in-distribution and on the transfer
suite from Kornblith et al. [KSL19]. Here, µTx denotes the average test accuracy of the model over
downstream transfer tasks. We report 95% confidence intervals (CI) via bootstrapping.
We use COCO as our starting point for two reasons. First, we can assess the utility of language
supervision in the ideal setting where the captions are of fairly high quality due to the careful
curation process. Second, we can approximate CLIPS³ by sampling from the available set of five
captions per image.
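A minimal sketch of this approximation is shown below: a dataset wrapper that draws a fresh one-of-five caption every time an example is requested, so that the positive pair changes across iterations as in CLIPS. The record format and the image_transform/tokenize arguments are placeholders for illustration, not the exact interface of our training code.

```python
import random
from torch.utils.data import Dataset

class MultiCaptionDataset(Dataset):
    """Wraps (image, [caption_1, ..., caption_k]) records; returns a randomly
    drawn caption each time an item is fetched, approximating CLIP_S."""

    def __init__(self, records, image_transform, tokenize):
        self.records = records                  # list of (PIL image, list of captions)
        self.image_transform = image_transform  # e.g., the SimCLR image augmentations
        self.tokenize = tokenize                # tokenizer used by the CLIP text encoder

    def __len__(self):
        return len(self.records)

    def __getitem__(self, idx):
        image, captions = self.records[idx]
        caption = random.choice(captions)       # one-of-k caption sampling per iteration
        return self.image_transform(image), self.tokenize(caption)
```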
Captions (often) help on COCO. In Table 1, we compare pre-trained models (SimCLR, CLIP,
CLIPS and a supervised baseline), in terms of the accuracy of a linear probe on: (i) COCO classifi-
cation (in distribution), and (ii) transfer to downstream tasks from Kornblith et al. [KSL19].
On COCO classification, supervised models outperform self-supervised ones, and SimCLR is
more accurate than CLIP trained on human-written captions. This trend, however, flips when
we consider out-of-distribution performance. On most transfer tasks, CLIP performs the best:
yielding, on average, a 10% accuracy gain over the supervised baseline and a 2% gain over SimCLR.
Using a stochastic version of CLIP further boosts its performance. CLIPS matches SimCLR
performance in-distribution and is about 5% better on average in terms of transfer. While this
improvement is remarkable, it is not clear why stochastically sampling different (one-of-five) cap-
tions for a given image helps. For instance, it may be optimization-related or linked to properties
of the captions themselves. We revisit this question in Section 3.3.
Notably, we find that matching the image augmentations (applied to x) is crucial to correctly
assess the merit of added language supervision. In particular, using standard CLIP augmentations
(only resize and crop) during training lowers its average transfer accuracy by 10% (Appendix
Table 4). With these same augmentations, SimCLR’s performance also drops by 50%. This shows
the importance of correctly controlling for potential confounders discussed in Section 2.1.
3 The impact of pre-training data
Our analysis of COCO shows that language supervision can indeed be beneficial over using im-
ages alone. However, the datasets that CLIP is typically trained on differ, both in scale and qual-
ity, from COCO. For instance, COCO captions were collected post-hoc under controlled settings,
which is markedly different from the automatic scraping procedure used to gather data at scale.
Thus, we shift our focus to two more prototypical pre-training datasets:
3We overload notation and use CLIPS to denote: (i) the idealized stochastic version of CLIP, which samples from
infinite captions per image, and (ii) our approximation of it using a finite set of (typically five) image captions.
[Figure 2 (illustration): example captions from the three datasets. COCO: “A man rides a giant wave on his surfboard.”; “A bathroom counter with two sinks under two mirrors.” ConceptualCaptions: “cricket players with the trophy after victory over country at sports facility.”; “thumb is poking into the smartphone screen black and white sketch, simple drawing.” YFCC: “Thailand - Feb 23-24 2007 - Ko Phi Phi Island, Phuket 024 This is where we sat and waited for the boat to Ko Phi Phi. Interesting place to watch the local load up the boats before the tide came in.”; “tower has stood centuries the view Brian saw but Pearl chickened out from. so it's been standing since the 9th century. it's teeny and only gets narrow as you go up.”]
Figure 2: Random samples from the COCO, CC and YFCC datasets (also see Appendix Figure 6).
There are noticeable differences in the diversity of their images and the corresponding captions.
• ConceptualCaptions [SDG+18] (CC) contains 3.3M images harvested from the web, with their
ALT-text HTML attributes as captions. The dataset was filtered (retaining only 0.2%) for text
quality—e.g., well-formed captions that mention at least one object found via the Google
Cloud Vision API. Furthermore, all proper nouns in the captions were hypernymized (e.g.,
”Justin Timberlake” becomes ”pop artist”).
• Yahoo Flickr Creative Commons [TSF+16] (YFCC): This dataset has 99.2M images from Flickr,
along with their posted titles as captions with no filtering or post-processing.
We now assess whether our findings on COCO translate to the CC and YFCC datasets. We
start by comparing the transfer performance of CLIP and SimCLR on COCO-scale subsets of
CC/YFCC—cf. points corresponding to 100K samples in Figure 3 (right). We observe that Sim-
CLR’s performance does not vary much across pre-training datasets. On the other hand, CLIP’s
transfer capabilities are highly sensitive to the dataset. With 100K samples from CC/YFCC, using
CLIP is worse than pre-training only on images—in contrast to what we found on COCO.
Inspecting COCO, CC and YFCC samples (Figure 2) yields a possible explanation for this sen-
sitivity. The datasets differ not just in scale and image diversity, but also in the extent to which their
captions: (i) describe visually salient aspects of the image, and (ii) vary across images (e.g., in style
and wording). For instance, captions in COCO are homogeneous and descriptive, while in YFCC,
they vary and are often complementary to the image. In Sections 3.1-3.3, we investigate the effect these
properties (scale, descriptiveness and variability) of the dataset have on CLIP’s performance.
3.1 Scale matters
A major advantage of contrastive learning methods is that they can leverage the vast amounts
of unlabeled data available on the Internet. Thus, it is natural to ask how different forms of con-
trastive supervision benefit from added pre-training data. Intuitively, we expect image-only meth-
ods to perform worse for smaller datasets as they are increasingly less likely to encounter images
with similar augmentations. We might further expect image-language models to perform more
favorably in this setting since they receive richer supervision.
To test whether this is the case, we compare CLIP and SimCLR models trained on datasets
of varying sizes: in the 10-100K sample regime for COCO, and 100K-2M sample regime for
CC/YFCC.4 Our results in Figure 3 deviate from our earlier expectations. First, beyond a cer-
4Due to computational constraints, we train CLIP/SimCLR for fewer epochs (100 instead of 200) on 2M examples.
Figure 3: Average transfer accuracy of models w.r.t. pre-training dataset size for COCO (left), and
CC and YFCC (right). Language supervision consistently improves performance in the medium to
large data regime over using images alone. However, on small corpora, providing the model with
additional information via captions is actually detrimental. Due to computational constraints, we
train models for fewer epochs (100 instead of 200) on datasets of size 2M.
tain point, SimCLR’s transfer performance improves only marginally with additional data. While
surprising, similar effects have been noted previously [THO21; CYW+22], especially when the
data is uncurated (e.g., YFCC) [THO21].5 Second, in the low-data regime (<50K/200K/500K for
COCO/CC/YFCC), training with language actually hurts the models’ transfer performance.
In fact, we find that (data) scale is essential to take advantage of language supervision. With
sufficient data, CLIP outperforms SimCLR on all three datasets. This gap remains even if we train
SimCLR with extra data, indicating that captions can be worth more than any number of images.
3.2 The importance of descriptive captions
As we saw in Figure 2, captions in typical datasets can vary in terms of how they relate to
the image. Prior work in linguistics and accessibility has drawn a distinction between captions
that are descriptive (meant to replace an image) and complementary (meant to give additional con-
text) [HYH13; KGP21; DMM+22]. This line of work suggests that COCO captions are more de-
scriptive due to the decontextualization of the image and strict instructions provided to the anno-
tators during the caption generation process [KGP21]. In contrast, Flickr captions (e.g., in YFCC)
tend to contain information that is complementary to the image since people typically do not
restate what can already be observed in the photographs they post [Gri75].
In representation learning for object recognition tasks, we ideally want to meaningfully encode
salient objects in the image. Recall that for contrastive models the learned representations are
determined by the transformation T(x) (captions for CLIP). This suggests a hypothesis: captions
that describe the contents of a scene will improve CLIP’s transferability.
To test this hypothesis, we need to quantify the descriptiveness of a caption. Since measuring
it precisely is infeasible, we approximate it with the help of a pre-trained caption-scoring model
(BLIP [LLX+22]). Specifically, we use the score given by BLIP of a caption matching its correspond-
ing image as a surrogate for descriptiveness. Comparing the average caption descriptiveness of
5Table 8 in the appendix shows that even more sophisticated image contrastive learning methods studied in Tian et
al. [THO21]—trained with better data augmentations, with 4x batch size, on 50x the data and for 10x more epochs—are
only marginally better than our CLIP models on the downstream tasks from Kornblith et al. [KSL19].
(a) [Plot of CLIP’s average transfer accuracy against the mean caption descriptiveness of each pre-training dataset; only the caption below is recoverable here.]

(b) Synthetic-caption results (the check marks in the Cons./Comp. columns of the original table did not survive extraction):
Model | COCO | µTx
CLIP | 88.8 | 59.2 ± 0.1
CLIP | 88.4 | 57.7 ± 0.2
CLIPS | 89.1 | 59.3 ± 0.2
CLIP | 88.4 | 56.6 ± 0.2
CLIPS | 89.3 | 58.9 ± 0.2
Figure 4: Relationship between CLIP’s transfer performance and (a) the average descriptiveness
of dataset captions and (b) intra-dataset caption variability. We approximate descriptiveness (a)
using a pre-trained BLIP model [LLX+22] to score the similarity between a caption and the image
it corresponds to. To study the effect of caption variability in (b), we construct synthetic captions
for the COCO dataset using its multi-object image labels. We vary whether these captions are
consistent (use a single term to describe a given object) and complete (describe all image objects).
the three datasets in Figure 4a, we see that COCO > CC > YFCC. This aligns with our earlier
subjective assessment as well as with prior work [HYH13; KGP21].
In Figure 4a, we visualize the relationship between the average descriptiveness of a dataset’s
captions and the transfer performance of the resulting CLIP model. We indeed find that descrip-
tive captions are crucial for CLIP’s performance, and one descriptive image-caption pair from
COCO is worth 2x and 5x samples from CC and YFCC respectively. On YFCC and CC, CLIP thus
requires more data to benefit from language supervision.
Finally, we train CLIP on 100K subsets of CC and YFCC with “more descriptive” captions by
re-captioning the images using BLIP [LLX+22] (examples in Appendix Figure 7). CLIP trained on
CC/YFCC using BLIP captions no longer performs worse than its COCO counterpart (Figure 4a).
This indicates that CLIP’s sensitivity to the pre-training corpus is not just an artifact of differing
image distributions, but due to the presence (or absence) of descriptive captions.
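A rough sketch of the descriptiveness proxy is shown below. For simplicity it scores each image-caption pair with the cosine similarity of a pre-trained open_clip model, standing in for the BLIP image-text matching score that we actually use, and the `pairs` iterable is a placeholder; the point is the per-dataset averaging.

```python
import torch
import torch.nn.functional as F
import open_clip

# Any pre-trained image-text model can serve as the scorer; open_clip is used
# here as a stand-in for the BLIP matching score used in the paper.
model, _, preprocess = open_clip.create_model_and_transforms("ViT-B-32", pretrained="openai")
tokenizer = open_clip.get_tokenizer("ViT-B-32")
model.eval()

@torch.no_grad()
def mean_descriptiveness(pairs):
    """pairs: iterable of (PIL image, caption string) from one dataset."""
    scores = []
    for image, caption in pairs:
        img = preprocess(image).unsqueeze(0)
        txt = tokenizer([caption])
        z_img = F.normalize(model.encode_image(img), dim=-1)
        z_txt = F.normalize(model.encode_text(txt), dim=-1)
        scores.append((z_img @ z_txt.t()).item())  # similarity of caption and image
    return sum(scores) / max(len(scores), 1)       # dataset-level average
```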
3.3 The effect of intra-dataset variations in captions
Next, we examine how the variability of captions within a dataset affects CLIP’s transfer capabil-
ities. After all, there are many ways to caption an image, as shown in Figure 1. The presented
captions vary in terms of how they describe an object (e.g., “duffel van” or “car”), and the parts of
the image they focus on (e.g., discussing the “street” or “brick”). These stylistic, lexical, and focus
variations in captions could make it harder for CLIP to learn meaningful representations.
A simple setting. We investigate this effect on the COCO dataset by creating synthetic captions
using multi-object image labels (examples in Appendix Figure 8). Here, we can construct cap-
tions to precisely control whether they are: (i) consistent: by either using a fixed term or random
synonyms to describe an object across the dataset; and (ii) complete: by either mentioning all or a
random subset of image objects. In this setting, we find:
• A CLIP model trained with complete and consistent synthetic captions outperforms a model
trained on human-written captions (cf. row 1 in Figure 4b to row 3 in Table 1).
(a)
Method | NC | Source | CC (100K) | YFCC (100K)
SimCLR | 0 | - | 55.9 ± 0.2 | 55.5 ± 0.2
CLIP | 1 | Human | 53.1 ± 0.2 | 34.7 ± 0.2
CLIP | 1 | BLIP | 53.7 ± 0.2 | 54.8 ± 0.2
CLIPS | 2 | BLIP | - | 56.9 ± 0.2
CLIPS | 5 | BLIP | 57.8 ± 0.2 | 58.8 ± 0.2
CLIPS | 10 | BLIP | - | 59.1 ± 0.2

(b) [Plot of transfer accuracy when trading off the total number of image-caption pairs against the number of captions per image on YFCC; only the caption below is recoverable here.]
Figure 5: A closer look at CLIPS. (a) Sensitivity of its transfer performance to the number of
captions per image (NC) used during training. For CC and YFCC, we use the BLIP captioning
model to generate multiple diverse captions per image. (b) Performance trade-offs between pre-
training using more image-caption pairs vs. more captions per image on the YFCC dataset.
• Dropping these two conditions, and thereby increasing the variability of captions, causes the
transfer performance of the model to drop (cf. rows 1, 2, and 4 in Figure 4b).
• A stochastic version of CLIP, i.e. CLIPS based on 5 synthetic captions per image, is not as
affected by caption inconsistency and/or incompleteness. The 2% improvement of CLIPS
over CLIP here mirrors the 3.6% gain seen for human-provided captions in Table 1.
• Unlike CLIP, CLIPS transfers 2% better on average when trained on human-provided cap-
tions as opposed to synthetic ones.
These findings suggest that variability in dataset captions does have an adverse effect on the re-
sulting CLIP models. This drop can be mitigated by stochastically sampling from a set of possible
captions per image, rather than using a single fixed caption (in CLIPS). Note that standard im-
age contrastive learning methods already do this since they use random data augmentations (e.g.,
a different crop) to generate T(x) at every epoch. Finally, our results show that human-written
captions contain useful information for representation learning that is not present in object la-
bels alone. However, extracting this signal is not straightforward, and may require incorporating
multiple captions into CLIP training.
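For reference, a minimal sketch of how such synthetic captions can be constructed from multi-object labels is shown below. The caption template and synonym table are hypothetical stand-ins; the captions we actually use are illustrated in Appendix Figure 8.

```python
import random

# Hypothetical synonym table; in practice one entry per COCO object category.
SYNONYMS = {
    "car":    ["car", "automobile", "vehicle"],
    "person": ["person", "man", "woman"],
    "dog":    ["dog", "puppy", "canine"],
}

def synthetic_caption(objects, consistent=True, complete=True):
    """Build a caption from an image's multi-object labels.

    consistent: always use the first (fixed) term per object vs. a random synonym.
    complete:   mention every labeled object vs. a random non-empty subset.
    """
    if not complete:
        objects = random.sample(objects, k=random.randint(1, len(objects)))
    terms = [SYNONYMS.get(o, [o])[0] if consistent
             else random.choice(SYNONYMS.get(o, [o]))
             for o in objects]
    return "A photo containing " + ", ".join(terms) + "."

# e.g., synthetic_caption(["person", "dog", "car"], consistent=False, complete=False)
```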
Datasets in practice. Looking at Figure 2, it seems that COCO < CC < YFCC in terms of caption
variability. This trend may be expected given how their captions were obtained and/or post-
processed: careful post-hoc labeling for COCO, filtering and hypernymizing ALT-text for CC, and
scraping raw Flickr titles for YFCC. Our results in the simple setting above suggest that this
variability of YFCC captions (and to a lesser extent CC) could, along with their lower descriptiveness, be
responsible for the worse transfer performance of the resulting CLIP models (Figure 3 right). It
also explains why scale is essential to benefit from language supervision on CC and YFCC. After
all, CLIP would need to be trained on more captions to even encounter the same word twice.
How many captions are enough? In the simple setting above, we saw that performing “text
data augmentations” in CLIPS can reduce the adverse impacts of caption variability. We now
analyze how this effect scales with the number of available captions per image, focusing on the
CC and YFCC datasets. Since these datasets only contain one caption per image, we use the BLIP
captioning model to generate multiple captions via nucleus sampling [HBD+20]. We observe in
Figure 5a, that CLIPS improves as the number of available (BLIP-generated) captions per image
increases (plateauing around 10). However, scaling up the overall number of image-caption
pairs appears to be far more effective than incorporating more captions per image (at least those
obtained via BLIP) from the perspective of improving transfer performance (see Figure 5b). Note
that the relative costs of these two approaches are context dependent—e.g. in Section 4 we discuss
ways to augment the pool of image captions in a dataset without additional data collection.
It is important to note that there is a strong inter-dependence between the three properties we
discussed in Section 3 in terms of their influence on CLIP’s transfer capabilities. For instance,
scale can, to an extent, compensate for variable/less descriptive captions (as seen in Figure 3).
That being said, our findings suggest that the utility of captions as a form of supervision can be
greatly improved by being more mindful of what (and how) they describe in an image.
4 Making existing captions work
So far, we have identified two properties of captions that impact their usefulness as a mode of
supervision: (i) descriptiveness and (ii) variability. With these in mind, we now put forth simple
interventions that can be made to datasets to improve the performance of CLIP models.
Data pre-processing: Given the importance of caption descriptiveness, we might consider pre-
processing scraped data to select for samples with this property. The CC data collection proce-
dure [SDG+18] partially demonstrates the effectiveness of this approach, as pre-training CLIP on
CC samples leads to better transfer performance than a comparable number of “raw” YFCC ones.
However, due to its reliance on the Google Vision API, this procedure can be quite expensive,
with costs scaling with the size of the scraped data. Recent works have taken a different approach,
using pre-trained image-language matching models (like CLIP and BLIP) to filter data [SVB+21].
Given that we are interested in building such models in the first place, we avoid taking this route.
Instead, we focus on understanding how far we can get by simply discarding low-quality cap-
tions, agnostic to the images. To do so, we take inspiration from the filtering pipelines used to
build large language models [BMR+20]. Here, raw Internet data is cleaned by selecting samples
that are “similar” to known high-quality datasets (e.g., Wikipedia). Taking a similar approach, we
train a linear classifier on bag-of-n-grams sentence embeddings [JGB+17] to distinguish valida-
tion set CC/YFCC captions from COCO ones. This classifier is then used to filter CC/YFCC, only
retaining samples that are predicted as being COCO-like (examples in Appendix Figure 9). For a
given pre-training data budget, we see moderate gains (2%) from using this simple heuristic to
filter the CC and YFCC datasets—see Table 2 (left). To put these gains in context, we also report the
performance of CLIP trained on the same images with “high-quality” BLIP-generated captions.
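A simple sketch of such a filter is shown below, using scikit-learn bag-of-n-grams features and logistic regression as a stand-in for the fastText-style classifier of [JGB+17] that we use; function names and the exact feature settings are illustrative.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

def fit_caption_filter(coco_captions, raw_captions):
    """Linear classifier over word n-gram features: COCO-like (1) vs. raw CC/YFCC (0)."""
    texts = list(coco_captions) + list(raw_captions)
    labels = [1] * len(coco_captions) + [0] * len(raw_captions)
    clf = make_pipeline(
        TfidfVectorizer(ngram_range=(1, 2), min_df=2),  # word uni- and bi-grams
        LogisticRegression(max_iter=1000),
    )
    clf.fit(texts, labels)
    return clf

def filter_dataset(clf, samples):
    """Keep only the (image, caption) pairs whose caption is predicted COCO-like."""
    return [(img, cap) for img, cap in samples if clf.predict([cap])[0] == 1]
```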
Mitigating caption variability: As we saw in Section 3.3, models trained with CLIPS are less
impacted by caption variability. However, typical image-caption datasets (such as CC and YFCC)
only have one caption per image. We thus devise a methodology to augment these captions by
leveraging recent open-source large language models [WK21]. Concretely, we provide GPT-J with
4 (caption, paraphrase) pairs as in-context [BMR+20] examples. We then prompt it to paraphrase
(left)
Dataset | Method | Preproc. | µTx
CC (100K) | SimCLR | - | 55.9 ± 0.3
CC (100K) | CLIP | - | 53.1 ± 0.2
CC (100K) | CLIP | Filter | 54.2 ± 0.2
YFCC (500K) | SimCLR | - | 55.4 ± 0.2
YFCC (500K) | CLIP | - | 58.8 ± 0.2
YFCC (500K) | CLIP | BLIP | 61.8 ± 0.2
YFCC (500K) | CLIP | Filter | 60.4 ± 0.2

(right)
Dataset | Method | Caption | µTx
COCO (120K) | SimCLR | - | 56.0 ± 0.2
COCO (120K) | CLIP | Human | 57.5 ± 0.1
COCO (120K) | CLIPS | Human | 61.3 ± 0.2
COCO (120K) | CLIPS | GPT-J | 58.9 ± 0.3
CC (200K) | SimCLR | - | 55.3 ± 0.2
CC (200K) | CLIP | Human | 57.0 ± 0.3
CC (200K) | CLIPS | GPT-J | 58.8 ± 0.3
Table 2: Improving CLIP’s transfer performance through simple interventions on existing datasets.
(left) Applying a simple bag-of-words classifier to identify data subsets with “high quality” cap-
tions. (right) Using in-context learning with GPT-J to obtain five diverse captions for dataset im-
ages (via paraphrasing) which are then used to train CLIPS. For COCO, we also compare to CLIPS
trained with five human-written captions.
a given target caption. By sampling from GPT-J, we can obtain multiple (in our case five) para-
phrases for every such caption (see Appendix Figure 10 for examples). In Table 2 (right), we
see that feeding these captions into CLIPS results in a considerable performance boost over CLIP
(trained with a single caption/image). For instance, for COCO, CLIPS trained on our generated
captions bridges more than half of the performance gap between CLIP and CLIPS trained with
one and five human-provided captions respectively.
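A sketch of this caption-augmentation step is shown below, using the Hugging Face port of GPT-J; the in-context demonstrations and generation settings here are illustrative placeholders rather than our exact prompt.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "EleutherAI/gpt-j-6B"
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL, torch_dtype=torch.float16).to("cuda")

# Hypothetical (caption, paraphrase) demonstrations; the paper prompts with four such pairs.
DEMOS = [
    ("A man jumping a horse over an obstacle.",
     "A rider and his horse leap over a hurdle."),
    ("A large airliner preparing for departure at an airport.",
     "A big passenger jet getting ready to take off."),
]

def paraphrase(caption, n=5, max_new_tokens=40):
    """Sample n paraphrases of a caption from GPT-J via in-context learning."""
    prompt = "".join(f"Caption: {c}\nParaphrase: {p}\n\n" for c, p in DEMOS)
    prompt += f"Caption: {caption}\nParaphrase:"
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    outputs = model.generate(
        **inputs, do_sample=True, top_p=0.9, temperature=0.8,
        num_return_sequences=n, max_new_tokens=max_new_tokens,
        pad_token_id=tokenizer.eos_token_id,
    )
    generations = tokenizer.batch_decode(
        outputs[:, inputs["input_ids"].shape[1]:], skip_special_tokens=True)
    return [g.split("\n")[0].strip() for g in generations]
```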
5 Related Work
Representation learning. Building models with general representations that transfer to down-
stream tasks has been a long-standing goal in ML [DJV+14; RAS+14; CSV+14; AGM14; YCB+14].
Our work is in line with prior studies aimed at characterizing the effect of design choices made
during training [ARS+15; HAE16; CMB+16; KSL19; ZPK+19; LBL+20], e.g. model architecture,
datasets and loss functions, on learned representations.
The utility of language in vision. There has been a long line of work on leveraging language to
improve vision models [QCD07; SS12; FCS+13; BAM18; GWW19]. However, with the develop-
ment of CLIP and its variants [MKW+21; LLZ+22; YHH+22], this approach has become a serious
contender to traditional image-only ones. Follow-up works have sought to investigate how inte-
gral language is to CLIP’s performance. Ruan et al. [RDM22] suggest theoretically that the robust-
ness of linear probes on CLIP’s representations stems from pretraining with a large and diverse
set of images and domain-agnostic augmentations T(x). More recently, Fang et al. [FIW+22]
examined CLIP’s effective robustness [TDS+20] (on ImageNet-like datasets [DDS+09; RDS+15;
RRS+19; WGX+19; BMA+19; HZB+21; HBM+21]) in the zero-shot setting. They find that CLIP’s
robustness is comparable to that of a supervised classifier trained on the same pool of YFCC im-
ages, and therefore conclude that data distribution is more important than language supervision.
Our work is complementary to this study, as we examine the role of language in a different setting,
i.e., self-supervised representation learning. We show that the impact of language supervision in
this setting is complex and depends on the quality and quantity of image-caption data.
Most similar to our work is the recent study by Devillers et al. [DCB+21], which argues that
language supervision does not result in improved downstream transfer, few-shot learning and ad-
versarial robustness. While they also consider CLIP and image-only models (e.g. BiT [KBZ+20]),
they do not attempt to directly control confounding effects. In particular, the two sets of models
are trained on different datasets with different objectives (e.g., contrastive for CLIP, supervised for
BiT). Our work performs a substantially more controlled study on the effect of language supervi-
sion, allowing us to make more direct claims than Devillers et al. [DCB+21].
Supervision in self-supervised learning. Prior works in contrastive learning have studied how
properties of the transformation T(x) affect the transferability of learned representations. They
show that for a given image x, a good view (x+) is one that retains label information while re-
moving other nuisances [TSP+20; TWS+21; DBU+21; FDF+20; MMW+21; WZM+22]. From this
perspective, our work can be viewed as studying whether a caption provides a better view for a
given image compared to standard image augmentations.
6 Discussion
Our work takes a step towards resolving the debate as to whether multi-modality, and language
in particular, can improve visual representation learning. A comparison of CLIP with matched
image-only SimCLR models reveals that neither form of supervision (using images alone or cou-
pled with language) is strictly better than the other. Indeed, there are practical regimes where
CLIP’s performance cannot be matched using SimCLR with any amount of image data and others
where language supervision is harmful. This is a direct consequence of CLIP’s sensitivity to its
pre-training data, especially its scale and the descriptiveness and variability of its captions. Through
our analysis, we also discovered algorithmic improvements (CLIPS) and dataset modifications
(filtering and augmenting captions) to better take advantage of language supervision.
Limitations. Our exploration allows us to quantify the utility of language supervision (over us-
ing images alone) in a specific setting: transfer learning via probing on certain object recognition
tasks [KSL19]. We view expanding the scope of our analysis as a direction for future work. Fur-
ther, despite the significant steps we took to control the differences between CLIP and SimCLR,
there are still some inconsistencies that have not been accounted for (discussed in Section 2). Nev-
ertheless, the differences between our and previous results [e.g, DCB+21] suggest that we success-
fully pinned down some crucial confounders (architecture, augmentations, stochasticity, datasets,
hyperparameters). Finally, while we show that CLIP’s representations are influenced by what the
captions they are trained on describe, we sidestep whether or not this is always desirable. After all,
recent studies [BPK21] show that vision-linguistic datasets have various biases and stereotypes,
which we might not want our models to learn.
Acknowledgements
We are grateful to Niladri Chatterji, Elisa Kreiss, Nimit Sohoni and Dimitris Tsipras for helpful
discussions. SS is supported by Open Philanthropy, YD by a Knights-Hennessy Scholarship, and
RT by the NSF GRFP under Grant No. DGE 1656518. We also thank Stanford HAI for a Google
Cloud credits grant.
References
[AGM14]
P. Agrawal, R. Girshick, and J. Malik. “Analyzing the performance of multilayer
neural networks for object recognition”. In: European Conference on Computer Vision
(ECCV). 2014.
[ARS+15]
H. Azizpour, A. S. Razavian, J. Sullivan, A. Maki, and S. Carlsson. “Factors of trans-
ferability for a generic convnet representation”. In: IEEE Transactions on Pattern Anal-
ysis and Machine Intelligence (2015).
[BAM18]
T. Baltrušaitis, C. Ahuja, and L. Morency. “Multimodal machine learning: A sur-
vey and taxonomy”. In: IEEE Transactions on Pattern Analysis and Machine Intelligence
(2018).
[BMA+19]
A. Barbu, D. Mayo, J. Alverio, W. Luo, C. Wang, D. Gutfreund, J. Tenenbaum, and
B. Katz. “ObjectNet: A large-scale bias-controlled dataset for pushing the limits of
object recognition models”. In: Advances in Neural Information Processing Systems
(NeurIPS). 2019.
[BMR+20]
T. B. Brown, B. Mann, N. Ryder, M. Subbiah, J. Kaplan, P. Dhariwal, A. Neelakan-
tan, P. Shyam, G. Sastry, A. Askell, S. Agarwal, A. Herbert-Voss, G. Krueger, T.
Henighan, R. Child, A. Ramesh, D. M. Ziegler, J. Wu, C. Winter, C. Hesse, M. Chen,
E. Sigler, M. Litwin, S. Gray, B. Chess, J. Clark, C. Berner, S. McCandlish, A. Rad-
ford, I. Sutskever, and D. Amodei. “Language Models are Few-Shot Learners”. In:
arXiv preprint arXiv:2005.14165 (2020).
[BPK21]
A. Birhane, V. U. Prabhu, and E. Kahembwe. “Multimodal datasets: misogyny,
pornography, and malignant stereotypes”. In: arXiv preprint arXiv:2110.01963 (2021).
[CFL+15]
X. Chen, H. Fang, T. Lin, R. Vedantam, S. Gupta, P. Dollár, and C. L. Zitnick.
“Microsoft coco captions: Data collection and evaluation server”. In: arXiv preprint
arXiv:1504.00325 (2015).
[CH21]
X. Chen and K. He. “Exploring simple siamese representation learning”. In: Confer-
ence on Computer Vision and Pattern Recognition (CVPR). 2021.
[CKN+20]
T. Chen, S. Kornblith, M. Norouzi, and G. Hinton. “A simple framework for con-
trastive learning of visual representations”. In: International Conference on Machine
Learning (ICML). 2020.
[CKS+20]
T. Chen, S. Kornblith, K. Swersky, M. Norouzi, and G. Hinton. “Big self-supervised
models are strong semi-supervised learners”. In: Advances in Neural Information Pro-
cessing Systems (NeurIPS). 2020.
[CMB+16]
B. Chu, V. Madhavan, O. Beijbom, J. Hoffman, and T. Darrell. “Best practices for
fine-tuning visual classifiers to new domains”. In: European Conference on Computer
Vision (ECCV). 2016.
[CMM+20] M. Caron, I. Misra, J. Mairal, P. Goyal, P. Bojanowski, and A. Joulin. “Unsupervised
learning of visual features by contrasting cluster assignments”. In: Advances in Neu-
ral Information Processing Systems (NeurIPS). 2020.
[CSV+14]
K. Chatfield, K. Simonyan, A. Vedaldi, and A. Zisserman. “Return of the devil in
the details: Delving deep into convolutional nets”. In: arXiv preprint arXiv:1405.3531
(2014).
[CTM+21]
M. Caron, H. Touvron, I. Misra, H. Jégou, J. Mairal, P. Bojanowski, and A. Joulin.
“Emerging properties in self-supervised vision transformers”. In: Conference on Com-
puter Vision and Pattern Recognition (CVPR). 2021.
[CYW+22] E. Cole, X. Yang, K. Wilber, O. Mac Aodha, and S. Belongie.
“When does contrastive visual representation learning work?” In: Conference on
Computer Vision and Pattern Recognition (CVPR). 2022.
[DBU+21]
Y. Dubois, B. Bloem-Reddy, K. Ullrich, and C. J. Maddison. “Lossy compression for
lossless prediction”. In: Advances in Neural Information Processing Systems (NeurIPS)
(2021).
[DCB+21]
B. Devillers, B. Choksi, R. Bielawski, and R. VanRullen. “Does language help gener-
alization in vision models?” In: Computational Natural Language Learning. 2021.
[DDS+09]
J. Deng, W. Dong, R. Socher, L. Li, K. Li, and L. Fei-Fei. “ImageNet: A large-scale hi-
erarchical image database”. In: Conference on Computer Vision and Pattern Recognition
(CVPR). 2009.
[DJV+14]
J. Donahue, Y. Jia, O. Vinyals, J. Hoffman, N. Zhang, E. Tzeng, and T. Darrell. “De-
CAF: A deep convolutional activation feature for generic visual recognition”. In:
International Conference on Machine Learning (ICML). 2014.
[DMM+22] P. Dognin, I. Melnyk, Y. Mroueh, I. Padhi, M. Rigotti, J. Ross, Y. Schiff, R. A. Young,
and B. Belgodere. “Image Captioning as an Assistive Technology: Lessons Learned
from VizWiz 2020 Challenge”. In: Journal of Artificial Intelligence Research (2022).
[EGH21]
L. Ericsson, H. Gouk, and T. M. Hospedales. “How well do self-supervised models
transfer?” In: Conference on Computer Vision and Pattern Recognition (CVPR). 2021.
[FCS+13]
A. Frome, G. S. Corrado, J. Shlens, S. Bengio, J. Dean, M. Ranzato, and T. Mikolov.
“Devise: A deep visual-semantic embedding model”. In: Advances in Neural Informa-
tion Processing Systems (NeurIPS). 2013.
[FDF+20]
M. Federici, A. Dutta, P. Forré, N. Kushman, and Z. Akata. “Learning robust rep-
resentations via multi-view information bottleneck”. In: International Conference on
Learning Representations (ICLR) (2020).
[FIW+22]
A. Fang, G. Ilharco, M. Wortsman, Y. Wan, V. Shankar, A. Dave, and L. Schmidt.
“Data Determines Distributional Robustness in Contrastive Language Image Pre-
training (CLIP)”. In: arXiv preprint arXiv:2205.01397 (2022).
[FP19]
W. Falcon and the PyTorch Lightning team. PyTorch Lightning. 2019. URL: https://github.com/Lightning-AI/lightning.
[Gri75]
H. P. Grice. “Logic and Conversation”. In: Syntax and Semantics (1975).
[GWW19]
W. Guo, J. Wang, and S. Wang. “Deep multimodal representation learning: A sur-
vey”. In: IEEE Access (2019).
[HAE16]
M. Huh, P. Agrawal, and A. A. Efros. “What makes ImageNet good for transfer
learning?” In: arXiv preprint arXiv:1608.08614 (2016).
[HBD+20]
A. Holtzman, J. Buys, L. Du, M. Forbes, and Y. Choi. “The Curious Case of Neural
Text Degeneration”. In: arXiv preprint arXiv:1904.09751 (2020).
[HBM+21] D. Hendrycks, S. Basart, N. Mu, S. Kadavath, F. Wang, E. Dorundo, R. Desai, T.
Zhu, S. Parajuli, M. Guo, et al. “The many faces of robustness: A critical analysis
of out-of-distribution generalization”. In: Conference on Computer Vision and Pattern
Recognition (CVPR). 2021.
[HFW+20] K. He, H. Fan, Y. Wu, S. Xie, and R. Girshick. “Momentum contrast for unsuper-
vised visual representation learning”. In: Conference on Computer Vision and Pattern
Recognition (CVPR). 2020.
[HWG+21] J. Z. HaoChen, C. Wei, A. Gaidon, and T. Ma. “Provable guarantees for self-
supervised deep learning with spectral contrastive loss”. In: Advances in Neural In-
formation Processing Systems (NeurIPS) (2021).
[HYH13]
M. Hodosh, P. Young, and J. Hockenmaier. “Framing image description as a rank-
ing task: Data, models and evaluation metrics”. In: Journal of Artificial Intelligence
Research (JAIR) (2013).
[HZB+21]
D. Hendrycks, K. Zhao, S. Basart, J. Steinhardt, and D. Song. “Natural adversarial
examples”. In: Conference on Computer Vision and Pattern Recognition (CVPR). 2021.
[HZR+16]
K. He, X. Zhang, S. Ren, and J. Sun. “Deep Residual Learning for Image Recogni-
tion”. In: Computer Vision and Pattern Recognition (CVPR). 2016.
[IWW+21]
G. Ilharco, M. Wortsman, R. Wightman, C. Gordon, N. Carlini, R. Taori, A. Dave, V.
Shankar, H. Namkoong, J. Miller, H. Hajishirzi, A. Farhadi, and L. Schmidt. Open-
CLIP. 2021. URL: https://doi.org/10.5281/zenodo.5143773.
[JGB+17]
A. Joulin, E. Grave, P. Bojanowski, and T. Mikolov. “Bag of Tricks for Efficient Text
Classification”. In: European Association for Computational Linguistics (EACL). 2017.
[KBZ+20]
A. Kolesnikov, L. Beyer, X. Zhai, J. Puigcerver, J. Yung, S. Gelly, and N. Houlsby.
“Big Transfer (BiT): General Visual Representation Learning”. In: European Confer-
ence on Computer Vision (ECCV). 2020.
[KGP21]
E. Kreiss, N. D. Goodman, and C. Potts. “Concadia: Tackling Image Accessibility
with Descriptive Texts and Context”. In: arXiv preprint arXiv:2104.08376 (2021).
[KSL19]
S. Kornblith, J. Shlens, and Q. V. Le. “Do better imagenet models transfer better?”
In: Conference on Computer Vision and Pattern Recognition (CVPR). 2019.
[LBL+20]
F. Locatello, S. Bauer, M. Lucic, G. Rätsch, S. Gelly, B. Schölkopf, and O. Bachem.
“A sober look at the unsupervised learning of disentangled representations and
their evaluation”. In: Journal of Machine Learning Research (JMLR) (2020).
[LLX+22]
J. Li, D. Li, C. Xiong, and S. Hoi. “Blip: Bootstrapping language-image pre-training
for unified vision-language understanding and generation”. In: arXiv preprint
arXiv:2201.12086 (2022).
[LLZ+22]
Y. Li, F. Liang, L. Zhao, Y. Cui, W. Ouyang, J. Shao, F. Yu, and J. Yan. “Supervi-
sion exists everywhere: A data efficient contrastive language-image pre-training
paradigm”. In: International Conference on Learning Representations (ICLR). 2022.
[LMB+14]
T. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L.
Zitnick. “Microsoft coco: Common objects in context”. In: European Conference on
Computer Vision (ECCV). 2014.
[MKW+21] N. Mu, A. Kirillov, D. Wagner, and S. Xie. “SLIP: Self-supervision meets Language-
Image Pre-training”. In: arXiv preprint arXiv:2112.12750 (2021).
[MMW+21] J. Mitrovic, B. McWilliams, J. Walker, L. Buesing, and C. Blundell. “Representation
Learning via Invariant Causal Mechanisms”. In: International Conference on Learning
Representations (ICLR). 2021.
[QCD07]
A. Quattoni, M. Collins, and T. Darrell. “Learning visual representations using
images with captions”. In: Conference on Computer Vision and Pattern Recognition
(CVPR). 2007.
[RAS+14]
A. S. Razavian, H. Azizpour, J. Sullivan, and S. Carlsson. “CNN Features off-the-
shelf: an Astounding Baseline for Recognition.” In: arXiv preprint arXiv:1403.6382
(2014).
[RDM22]
Y. Ruan, Y. Dubois, and C. J. Maddison. “Optimal Representations for Covariate
Shift”. In: International Conference on Learning Representations (ICLR). 2022.
[RDS+15]
O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy,
A. Khosla, M. Bernstein, A. C. Berg, and L. Fei-Fei. “ImageNet Large Scale Visual
Recognition Challenge”. In: International Journal of Computer Vision (IJCV) (2015).
[RKH+21]
A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A.
Askell, P. Mishkin, J. Clark, G. Krueger, and I. Sutskever. “Learning transferable
visual models from natural language supervision”. In: International Conference on
Machine Learning (ICML). 2021.
[RRS+19]
B. Recht, R. Roelofs, L. Schmidt, and V. Shankar. “Do ImageNet Classifiers General-
ize to ImageNet?” In: International Conference on Machine Learning (ICML). 2019.
[SDG+18]
P. Sharma, N. Ding, S. Goodman, and R. Soricut. “Conceptual captions: A cleaned,
hypernymed, image alt-text dataset for automatic image captioning”. In: Association
for Computational Linguistics (ACL). 2018.
[SS12]
N. Srivastava and R. R. Salakhutdinov. “Multimodal learning with deep boltzmann
machines”. In: Advances in Neural Information Processing Systems (NeurIPS) (2012).
[SVB+21]
C. Schuhmann, R. Vencu, R. Beaumont, R. Kaczmarczyk, C. Mullis, A. Katta, T.
Coombes, J. Jitsev, and A. Komatsuzaki. “LAION-400M: Open dataset of clip-
filtered 400 million image-text pairs”. In: arXiv preprint arXiv:2111.02114 (2021).
[TDS+20]
R. Taori, A. Dave, V. Shankar, N. Carlini, B. Recht, and L. Schmidt. “Measuring ro-
bustness to natural distribution shifts in image classification”. In: Advances in Neural
Information Processing Systems (NeurIPS) (2020).
[THO21]
Y. Tian, O. J. Henaff, and A. van den Oord. “Divide and contrast: Self-supervised
learning from uncurated data”. In: Conference on Computer Vision and Pattern Recog-
nition (CVPR). 2021.
[TSF+16]
B. Thomee, D. A. Shamma, G. Friedland, B. Elizalde, K. Ni, D. Poland, D. Borth, and
L. Li. “YFCC100M: The new data in multimedia research”. In: Communications of the
Association for Computing Machinery (ACM). 2016.
[TSP+20]
Y. Tian, C. Sun, B. Poole, D. Krishnan, C. Schmid, and P. Isola. “What makes for good
views for contrastive learning?” In: Advances in Neural Information Processing Systems
(NeurIPS). 2020.
[TWS+21]
Y. H. Tsai, Y. Wu, R. R. Salakhutdinov, and L. Morency. “Self-supervised Learning
from a Multi-view Perspective”. In: International Conference on Learning Representa-
tions (ICLR). 2021.
[VSP+17]
A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser,
and I. Polosukhin. “Attention Is All You Need”. In: arXiv preprint arXiv:1706.03762
(2017).
[WGX+19] H. Wang, S. Ge, E. P. Xing, and Z. C. Lipton. “Learning robust global representations
by penalizing local predictive power”. In: Advances in Neural Information Processing
Systems (NeurIPS). 2019.
[WK21]
B. Wang and A. Komatsuzaki. GPT-J-6B: A 6 Billion Parameter Autoregressive Language
Model. https://github.com/kingoflolz/mesh-transformer-jax. 2021.
[WXY+18]
Z. Wu, Y. Xiong, S. X. Yu, and D. Lin. “Unsupervised feature learning via non-
parametric instance discrimination”. In: Conference on Computer Vision and Pattern
Recognition (CVPR). 2018.
[WZM+22] M. Wu, C. Zhuang, M. Mosse, D. L. K. Yamins, and N. D. Goodman. “On Mutual In-
formation in Contrastive Learning for Visual Representations”. In: AAAI Conference
on Artificial Intelligence. 2022.
[YCB+14]
J. Yosinski, J. Clune, Y. Bengio, and H. Lipson. “How transferable are features
in deep neural networks?” In: Advances in Neural Information Processing Systems
(NeurIPS). 2014.
[YHH+22]
L. Yao, R. Huang, L. Hou, G. Lu, M. Niu, H. Xu, X. Liang, Z. Li, X. Jiang, and C.
Xu. “FILIP: Fine-grained Interactive Language-Image Pre-Training”. In: International
Conference on Learning Representations (ICLR). 2022.
[ZPK+19]
X. Zhai, J. Puigcerver, A. Kolesnikov, P. Ruyssen, C. Riquelme, M. Lucic, J. Djo-
longa, A. S. Pinto, M. Neumann, A. Dosovitskiy, et al. “A large-scale study of rep-
resentation learning with the visual task adaptation benchmark”. In: arXiv preprint
arXiv:1910.04867 (2019).
[ZWM+22] X. Zhai, X. Wang, B. Mustafa, A. Steiner, D. Keysers, A. Kolesnikov, and L. Beyer.
“LiT: Zero-Shot Transfer with Locked-image Text Tuning”. In: Conference on Com-
puter Vision and Pattern Recognition (CVPR). 2022.
A Setup and Experimental details
A.1 Datasets
In Appendix Figure 6, we present (random) samples from the MS-COCO [LMB+14], Conceptual
Captions [SDG+18] and YFCC datasets [TSF+16]. We use the 2017 version of COCO, which con-
tains five human-written captions along with multi-object image labels for each image.
- “A table topped with plates and glasses with eating utensils..”
- “a fork is laying on a small white plate”
- “dirty dishes on a table, and a bottle of something.”
- “a table top with some dishes on top of it”,
- “A table full of dirty dishes is pictured in this image.”
- “An All Nippon Airways 777 sitting at a gate on the tarmac.”
- “a large air plane on a run way”
- “A jumbo jet being serviced at an airport.”
- “A large blue and white jetliner sitting on top of a tarmac.”,
- “A large airliner preparing for departure at an airport.”
- “A man jumping a horse over an obstacle.”
- “A person jumping a horse over an object.”
- “An equestrian competitor and his horse jumping over a stile”
- “A horse and jockey jump over bush hurdles .”,
- “A rider and horse jump over a wooden brush obstacle.”
- “A couple of zebra standing on top of a dirt field.”
- “Some zebras walking around in a field looking around”
- “Some very cute zebras in a big dusty field.”
- “A small zebra standing next to a bigger zebra.”,
- “The baby Zebras stripes are much closer together than an adults.”
Figure 6: Dataset examples: MS-COCO [LMB+14]
Licenses. These datasets were obtained by scraping images from online hosting services (e.g.
Flickr). Thus, the ownership of the images lies with the respective individuals that uploaded them.
Nevertheless, as per their terms of agreement, the images can be used for research purposes.
“paratroopers load onto a helicopter.”
“Close up hands of woman typing text message on smart
phone in a cafe.”
“woman in a bathrobe is smiling to camera in the forest”
“Girls in old time dresses selling flowers are pictured tak-
ing a rest of a bench.”
“A shrimp has pairs of legs.”
Figure 6: Dataset examples: Conceptual Captions [SDG+18]
“Kenneth Phan #7 A Day in the Life of DC is a photo
project meant to capture a flavor of the region through
the eyes of the participants. Participants submitted
twelve photos taken on May 30, 2009. Photos by Ken-
neth Phan”
“Pombas New York - USA 27 de Setembro 2013”
“Um, the girls that live at the house I lived in 13 years
ago are huge @foursquare fans! #amazing @ 600 euclid
4sq.commPKNZv (posted via FlickSquare)”
“squares Created for dA Users Gallery Challenge #43 –
Winter Stock 1 Model with thanks to Reine-Haru”
“Stripes and Squares Love the contrast of the light on the
keyboard, stripes and squares”
Figure 6: Dataset examples: YFCC [TSF+16]
Like most large-scale datasets, COCO, CC and YFCC have not been extensively vetted, and
may contain identifying information or offensive content. Characterizing the pervasiveness of
these issues is an important and active area of research. That being said, since we do not redistribute
the data, our work is unlikely to significantly further the risks from these datasets.
A.2 Models
We rely on existing open source implementations of CLIP [IWW+21] and SimCLR [FP19] for
all our experiments, with a ResNet-50 image encoder (feature dimension=2048), and a linear
and MLP projection head respectively. We use the transformer architecture from Radford et al.
[RKH+21] for encoding captions in CLIP. Unless otherwise specified, we use five captions per
image to train CLIPS. For downstream transfer, we train a linear probe using task data.
A.3 Hyperparameters
We ran an extensive hyperparameter grid for CLIP and SimCLR on MS-COCO and used the same
configuration in the rest of our experiments. These defaults are stated in Appendix Table 3.
Model | BatchSize | Epochs | Warmup | lr | wd
Supervised | 1024 | 200 | 10 | 10⁻³ | 10⁻⁶
SimCLR | 1024 | 200 | 10 | 10⁻² | 10⁻⁶
CLIP | 1024 | 200 | 10 | 10⁻³ | 0.1
Table 3: Default hyperparameters for model training.
We use the Adam optimizer with a cosine lr schedule for all the models. All other hyperpa-
rameters are defaults from standard implementations of SimCLR⁶ and CLIP.⁷
Exceptions. We train CLIP/SimCLR on CC/YFCC-2M for 100 epochs due to computational re-
strictions. For corpora smaller than 100K (Figure 3), we scale up the number of epochs to keep the
number of iterations roughly comparable.
Data augmentations. The PyTorch (torchvision) transforms for the default SimCLR and CLIP data
augmentations from prior work are as follows:
TSimCLR
= {RandomResizedCrop(size = 224),
RandomHorizontalFlip(p = 0.5)
RandomApply(ColorJitter(0.8, 0.8, 0.8, 0.2),p = 0.8)
RandomGrayscale(p = 0.2),
GaussianBlur(kernel size = 23,p = 0.5)}
TCLIP
= {RandomResizedCrop(size = 224,
scale = (0.9, 1.0),
interpolation = BICUBIC)}
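For reference, a runnable version of these two pipelines might look like the following sketch, which uses standard torchvision transforms and adds a tensor conversion at the end; the exact training code may differ in details.

from torchvision import transforms
from torchvision.transforms import InterpolationMode

# SimCLR-style augmentations (cf. T_SimCLR above).
simclr_transform = transforms.Compose([
    transforms.RandomResizedCrop(224),
    transforms.RandomHorizontalFlip(p=0.5),
    transforms.RandomApply([transforms.ColorJitter(0.8, 0.8, 0.8, 0.2)], p=0.8),
    transforms.RandomGrayscale(p=0.2),
    transforms.RandomApply([transforms.GaussianBlur(kernel_size=23)], p=0.5),
    transforms.ToTensor(),  # tensor conversion, not part of the set listed above
])

# CLIP-style augmentations (cf. T_CLIP above).
clip_transform = transforms.Compose([
    transforms.RandomResizedCrop(224, scale=(0.9, 1.0),
                                 interpolation=InterpolationMode.BICUBIC),
    transforms.ToTensor(),
])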
6 https://pytorch-lightning-bolts.readthedocs.io/en/latest/self_supervised_models.html
7 https://github.com/mlfoundations/open_clip
Note that for our experiments, unless otherwise specified, we use the standard SimCLR augmentation
set for the Supervised/SimCLR/CLIP models.
Linear probe. We train the probe using the cross-entropy loss on CLIP/SimCLR features of
dimensionality 2048. In cases where the downstream task data is imbalanced, we re-weight the loss to
account for it. We also evaluate class-balanced accuracy at test time. For each downstream task,
we train the probe for 250 epochs using an SGD optimizer. We use a batch size of 256, weight
decay of 10^-6, and momentum 0.9. We perform a grid search for the best learning rate (using the
validation set), considering values between 3×10^-2 and 10. We also consider 3 random seeds.
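A minimal sketch of the probe training loop is given below; the function name, the default learning rate, and the data handling are illustrative, and the class re-weighting and learning-rate grid search are simplified.

import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

def train_linear_probe(features, labels, num_classes, lr=0.1, epochs=250,
                       batch_size=256, class_weights=None):
    # Fit a linear classifier on frozen 2048-d features with SGD + momentum.
    probe = nn.Linear(features.shape[1], num_classes)
    opt = torch.optim.SGD(probe.parameters(), lr=lr, momentum=0.9, weight_decay=1e-6)
    loss_fn = nn.CrossEntropyLoss(weight=class_weights)  # re-weight if imbalanced
    loader = DataLoader(TensorDataset(features, labels), batch_size=batch_size, shuffle=True)
    for _ in range(epochs):
        for x, y in loader:
            opt.zero_grad()
            loss_fn(probe(x), y).backward()
            opt.step()
    return probe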
COCO supervised. The COCO dataset contains multi-object labels for each image. We thus train
the supervised classifier and linear probe in this setting to predict whether each of the 80 object
categories is present in an image. We then evaluate the accuracy of the model by aggregating
(in a class-balanced manner) the correctness of each of these binary predictions.
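Concretely, the multi-label objective corresponds to one binary prediction per category, as in the following sketch (the evaluation-time aggregation described above is omitted):

import torch.nn as nn

probe = nn.Linear(2048, 80)                   # one logit per COCO object category
criterion = nn.BCEWithLogitsLoss()            # targets: float tensor of shape (batch, 80)
# loss = criterion(probe(features), targets)  # sigmoid cross-entropy per category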
Confidence intervals. We report 95% confidence intervals obtained via bootstrapping over the
test set, as well as the three random seeds used for the linear probe. Due to space constraints, we
do not always report them in the main paper, but include a detailed table for all our experiments
with confidence intervals in the Appendix.
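A bootstrap interval of this form can be computed as in the following generic sketch (the function name and number of resamples are illustrative):

import numpy as np

def bootstrap_ci(correct, n_boot=1000, alpha=0.05, seed=0):
    # 95% CI for accuracy via bootstrap resampling of per-example correctness.
    rng = np.random.default_rng(seed)
    correct = np.asarray(correct, dtype=float)
    stats = [rng.choice(correct, size=len(correct), replace=True).mean()
             for _ in range(n_boot)]
    lo, hi = np.quantile(stats, [alpha / 2, 1 - alpha / 2])
    return correct.mean(), lo, hi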
A.4 Compute
We train each of our models on 4 NVIDIA A100 GPUs. Training both CLIP and SimCLR models
takes on the order of 8-10 hours for a pre-training corpus of size 100K.
A.5 BLIP recaptioning
To generate BLIP captions for images from the CC and YFCC datasets, we use the BLIP captioning
model [LLX+22]. In particular, we use the provided⁸ ViT-Base model with nucleus sampling (top-p of 0.9,
repetition penalty of 1.1, and text length range [5, 40]), varying the random seed to generate
multiple captions per image. (Random) image-BLIP caption pairs are shown in Appendix Figure 7.
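This sampling setup can be sketched as follows using the Hugging Face port of BLIP rather than the original repository; the checkpoint name and preprocessing details here are illustrative.

import torch
from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration

processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
model = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-base")

def blip_captions(image_path, num_captions=5):
    # Sample several nucleus-sampled captions for one image, varying the random seed.
    inputs = processor(images=Image.open(image_path).convert("RGB"), return_tensors="pt")
    captions = []
    for seed in range(num_captions):
        torch.manual_seed(seed)
        out = model.generate(**inputs, do_sample=True, top_p=0.9,
                             repetition_penalty=1.1, min_length=5, max_length=40)
        captions.append(processor.decode(out[0], skip_special_tokens=True))
    return captions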
Dataset: CC
- “portrait of a young boy sitting in the leaves in a park - stock image.”
- “toddler boy in a coat sitting on leaves with arms up to the air, smiling
and laughing - stock photo.”
- “An image of a little boy sitting on the leaves in a park - stock image.”
Dataset: CC
- “The men are walking on a dirt ground with equipment in the background.”
- “military soldiers in uniforms carrying weapons and soldiers on their
back in a desert”
- “Officials and soldiers stand in the desert, looking at a vehicle with
missiles.”
Dataset: YFCC
- “Signs of various silhouettes of people dancing, standing, and laying
on the street.”
- “lot of bronze colored women holding their arms up with their hands
together in front of a metal wall art”
- “Sculptures on a wall of various silhouettes and dance positions.”
Dataset: YFCC
- “I love the white swan in the foreground with the water behind him.”
- “an image group of white birds in a green area near some water”
- “the swan is standing on the green grass near the water”
Figure 7: Random images from CC and YFCC alongside BLIP captions.
8 https://github.com/salesforce/BLIP
A.6 Synthetic COCO captions
We present examples of synthetic captions for the MS-COCO dataset, created using the available
multi-object image labels, in Appendix Figure 8. A synthetic caption is complete (incomplete) if it
describes all (only a random subset) of the objects in the image. It is consistent (inconsistent) if it
describes a given object using a single fixed term throughout the dataset (a term drawn from a set of
manually curated synonyms) and uses a fixed sentence template (one of a set of templates). In every case,
we randomly order the objects that we describe.
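A simplified sketch of such a caption generator is shown below; the synonym lists, templates, and pluralization rule are illustrative placeholders rather than the exact ones used to build our captions.

import random

NUMBER_WORDS = {1: "a", 2: "two", 3: "three", 4: "four", 5: "five", 6: "six", 7: "seven"}
SYNONYMS = {"person": ["person", "human"], "cup": ["cup", "mug"]}   # illustrative synonym sets
TEMPLATES = ["A photo of {}.", "I see {}.", "There is {}."]

def synthetic_caption(object_counts, complete=True, consistent=True):
    # object_counts maps COCO labels to counts, e.g. {"person": 2, "cup": 6}.
    objects = list(object_counts.items())
    random.shuffle(objects)                          # objects are always randomly ordered
    if not complete:                                 # keep only a random subset of objects
        objects = objects[:random.randint(1, len(objects))]
    parts = []
    for name, count in objects:
        term = name if consistent else random.choice(SYNONYMS.get(name, [name]))
        term = term + "s" if count > 1 else term     # naive pluralization
        parts.append(f"{NUMBER_WORDS.get(count, str(count))} {term}")
    template = TEMPLATES[0] if consistent else random.choice(TEMPLATES)
    return template.format(", ".join(parts))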
Complete and Consistent:
- “A photo of four bowls, a oven, seven cups, a refrigerator, two
persons, a spoon, two cakes”
Incomplete and consistent:
- “A photo of a person, six cups, three bowls, two cakes, a oven.”
Incomplete and Inconsistent:
- “A photo of a kitchen, two women, two shot glasses.”
- “I see a oven, a kitchen, two mugs, a kitchen, a man.”
Complete and Consistent:
- “A photo of a person, a tennis racket, a sports ball, a car.”
Incomplete and consistent:
- “A photo of a car, a person, a tennis racket.”
Incomplete and Inconsistent:
- “a sports ball, a motorcar together.”
- “There is a woman.”
Figure 8: Random image samples from MS-COCO alongside our synthetic captions.
A.7 Filtering captions
In Section 4, we introduce a methodology to filter poor-quality captions from a given source
dataset. Using the fastText library,⁹ we train a linear classifier on bag-of-n-grams sentence embeddings
(n=2) to distinguish a subset of source captions from those in the COCO validation set. We
then use the classifier to filter the source dataset, selecting only the captions that are (mis)classified
as being COCO-like. In Appendix Figure 9, we present a (random) subset of filtered examples
from the YFCC dataset. Compared to random YFCC samples (cf. Appendix Figure 6), the ones
in Appendix Figure 9 have much shorter captions, often without attributes such as dates, URLs,
and hashtags. That being said, it is important to recognize that any simple heuristic for filtering is
ultimately limited by the captions present in the source dataset.
9 https://github.com/facebookresearch/fastText
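This filtering step can be sketched with the fastText Python bindings as follows; the training-file name, label strings, and probability threshold are illustrative.

import fasttext

# train.txt mixes lines of the form "__label__coco <caption>" (COCO validation captions)
# and "__label__source <caption>" (captions from the source dataset).
model = fasttext.train_supervised(input="train.txt", wordNgrams=2)  # linear classifier on bag-of-bigrams

def keep_caption(caption, threshold=0.5):
    # Keep a source caption only if the classifier (mis)labels it as COCO-like.
    labels, probs = model.predict(caption.replace("\n", " "))
    return labels[0] == "__label__coco" and probs[0] >= threshold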
“Orc/Troll There’s a face only a mother could love.”
“Pedal Board 9 Back Camera.”
“Kittens Morrissey and Marr relax on the bed.”
“Park in Santa Clara Complete with Che graffiti.”
Figure 9: Filtered image-caption pairs from the YFCC dataset (see Section 4).
A.8 Augmenting captions with GPT-J
We also propose a methodology to augment the captions contained in existing datasets by using a
pre-trained large language model (in our case GPT-J-6B, referred to as GPT-J) to paraphrase them
(Section 4). To this end, we rely on in-context learning, wherein we provide GPT-J with some
(four) paired caption-paraphrase examples (built from the five human-provided COCO captions) as
context. We then ask GPT-J to paraphrase a given target caption. For instance, a query to the
model might look like:
Paraphrase the sentence below
Input: A little boy standing next to a dog in a field.
Output: A dog parked filled with people and a bunch of different dogs.
Paraphrase the sentence below
Input: Some people are on the sandy beach flying kites.
Output: a sunny day at the beach with colorful kites in the sky
Paraphrase the sentence below
Input: A living room filled with furniture and a table.
Output: A living room with a nice couch and a coffee table.
Paraphrase the sentence below
Input: A couple of people on a surfboard in the ocean.
Output: A dog is lying on the surfboard as it surfs on a wave.
Paraphrase the sentence below
Input: a handful of snowmen
make this with a glove .
Output:
We use temperature sampling to generate multiple diverse captions for a given image-caption
pair from the dataset. Examples for the CC dataset are shown in Appendix Figure 10.
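This paraphrasing setup can be sketched with the transformers library as follows; the prompt-assembly helper, sampling temperature, and token budget are illustrative choices.

from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("EleutherAI/gpt-j-6B")
model = AutoModelForCausalLM.from_pretrained("EleutherAI/gpt-j-6B")

def paraphrase(target_caption, context, num_samples=3):
    # `context` holds the four caption/paraphrase pairs formatted as in the query above.
    prompt = f"{context}Paraphrase the sentence below\nInput: {target_caption}\nOutput:"
    inputs = tokenizer(prompt, return_tensors="pt")
    outputs = model.generate(**inputs, do_sample=True, temperature=0.9,
                             max_new_tokens=40, num_return_sequences=num_samples,
                             pad_token_id=tokenizer.eos_token_id)
    new_tokens = outputs[:, inputs["input_ids"].shape[1]:]
    return [tokenizer.decode(t, skip_special_tokens=True).split("\n")[0].strip()
            for t in new_tokens]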
CC caption: “young beautiful businesswoman holding the documents and crossing
the road in financial district .”
GPT-J paraphrases:
- “A women in business attire crossing the road in financial district.”
- “A young beautiful businesswoman is crossing the road.”
- “A young woman crossing the road in financial district.”
CC caption: “Island with waterfalls and rainbows floating above clouds in a
dreamy sky – stock photo #”
GPT-J paraphrases:
- “An island with waterfalls and rainbows floating above clouds in a dreamy
sky.”
- “A rainbow in the sky with a cloud and a rainbow in the water”
- “A rainbow floating in a dreamy sky and a waterfall”
CC caption: “Young friends eating pizza in the park”
GPT-J paraphrases:
- “The young friends are eating pizza in the park”
- “A group of young people eating pizza in the park”
- “Some kids are eating pizza in the park”
CC caption: “Actors attend the spring fashion show during fashion week.”
GPT-J paraphrases:
- “A fashion show that’s attended by actors.”
- “People attending a spring fashion show during fashion week.”
- “The spring fashion show was held during fashion week.”
Figure 10: Caption paraphrases generated using in-context learning with GPT-J.
B Additional experiments
In Appendix Tables 4-11, we report per-task performance for all our experiments. In Appendix
Table 4, we also illustrate the performance of SimCLR/CLIP models trained using the simpler
data augmentations typically used for CLIP training (cf. Appendix A.3). One can see that both
models perform worse with this modification—with the performance of CLIP dropping by 10%
and that of SimCLR by 50%.
For COCO, we also consider a variant of SimCLR, which we refer to as SimCLR+lab, that factors
label information into the transformation T(x). Specifically, for a given image x, x+ is a data-augmented
version of another COCO image that has at least one object in common with x. We
see that factoring in label information does improve SimCLR’s performance considerably, putting it
between vanilla CLIP and CLIPS. Note, however, that for typical pre-training datasets such as CC and
YFCC, we do not have access to such “expert” object labels. Instead, we can take advantage of
captions to improve the equivalences learned by the model.
Model       SUP         SimCLR*     SimCLR      SimCLR+lab  CLIP*       CLIP        CLIPS
COCO        90.5 ± 1.5  60.4 ± 2.4  88.9 ± 1.6  89.3 ± 1.5  84.9 ± 1.9  88.4 ± 1.7  89.8 ± 1.6
Aircraft    31.6 ± 0.9  2.3 ± 0.3   40.6 ± 1.0  47.0 ± 1.0  30.3 ± 1.0  41.4 ± 1.0  46.4 ± 1.0
Birdsnap    11.8 ± 0.4  0.7 ± 0.1   18.5 ± 0.5  20.8 ± 0.5  14.0 ± 0.4  17.6 ± 0.5  20.0 ± 0.5
Cal101      65.8 ± 0.7  3.8 ± 0.3   71.5 ± 0.7  80.4 ± 0.6  53.6 ± 0.8  73.2 ± 0.7  78.4 ± 0.6
Cal256      53.7 ± 0.5  3.1 ± 0.2   58.6 ± 0.4  65.7 ± 0.4  41.5 ± 0.5  60.4 ± 0.5  65.6 ± 0.5
Cars        21.7 ± 0.5  1.2 ± 0.1   31.4 ± 0.6  39.3 ± 0.7  23.4 ± 0.5  35.8 ± 0.6  41.5 ± 0.6
CIFAR-10    74.8 ± 0.5  23.2 ± 0.5  82.1 ± 0.4  81.5 ± 0.5  74.0 ± 0.5  83.6 ± 0.4  84.6 ± 0.4
CIFAR-100   46.7 ± 0.6  6.0 ± 0.3   57.3 ± 0.6  56.8 ± 0.6  50.4 ± 0.6  60.8 ± 0.6  62.5 ± 0.6
DTD         55.9 ± 1.4  6.2 ± 0.6   61.7 ± 1.3  60.3 ± 1.3  48.2 ± 1.4  65.7 ± 1.3  66.7 ± 1.3
Flowers     63.5 ± 0.7  4.6 ± 0.3   77.4 ± 0.6  81.4 ± 0.6  68.2 ± 0.7  80.5 ± 0.6  84.0 ± 0.6
Food        47.1 ± 0.4  4.0 ± 0.1   58.7 ± 0.3  56.4 ± 0.4  51.8 ± 0.4  60.9 ± 0.4  65.3 ± 0.4
Pets        45.9 ± 1.0  6.3 ± 0.5   57.3 ± 0.9  63.0 ± 0.9  44.6 ± 0.9  57.0 ± 0.9  61.2 ± 0.9
SUN         44.5 ± 0.4  1.3 ± 0.1   51.9 ± 0.4  52.2 ± 0.4  37.6 ± 0.4  50.8 ± 0.4  54.9 ± 0.4
µTx         47.2 ± 0.2  5.2 ± 0.1   56.0 ± 0.2  58.7 ± 0.2  44.8 ± 0.2  57.5 ± 0.1  61.3 ± 0.2

Table 4: Extended comparison of transfer performance of supervised, SimCLR, and CLIP pre-trained
models. Here, SimCLR* and CLIP* denote models trained with the default CLIP data
augmentation transforms instead of the SimCLR ones (cf. Appendix A.3). SimCLR+lab refers to
SimCLR models trained by picking x+ to be a different image with the same label as x.
Model       CLIP        CLIP        CLIPS       CLIP        CLIPS
Complete
Consistent
COCO        88.8 ± 1.7  88.4 ± 1.7  89.3 ± 1.6  88.3 ± 1.7  89.2 ± 1.5
Aircraft    46.6 ± 1.0  44.5 ± 1.0  46.6 ± 1.0  45.6 ± 1.0  45.8 ± 1.0
Birdsnap    18.9 ± 0.5  17.2 ± 0.5  18.6 ± 0.5  18.5 ± 0.5  19.1 ± 0.5
Cal101      77.3 ± 0.6  75.3 ± 0.7  76.8 ± 0.6  76.1 ± 0.7  76.0 ± 0.6
Cal256      63.3 ± 0.5  59.9 ± 0.5  63.0 ± 0.4  61.4 ± 0.5  63.6 ± 0.5
Cars        42.4 ± 0.6  41.6 ± 0.6  42.7 ± 0.7  41.2 ± 0.6  42.8 ± 0.6
CIFAR-10    83.3 ± 0.4  82.4 ± 0.4  82.9 ± 0.4  83.7 ± 0.4  83.2 ± 0.4
CIFAR-100   60.5 ± 0.6  59.0 ± 0.6  58.9 ± 0.5  59.9 ± 0.5  60.1 ± 0.6
DTD         64.3 ± 1.3  63.7 ± 1.3  66.1 ± 1.3  63.4 ± 1.2  65.2 ± 1.2
Flowers     82.1 ± 0.5  78.3 ± 0.6  79.5 ± 0.6  79.3 ± 0.6  80.6 ± 0.6
Food        61.4 ± 0.3  57.6 ± 0.4  60.9 ± 0.4  59.0 ± 0.3  61.9 ± 0.4
Pets        60.0 ± 0.9  57.1 ± 1.0  58.8 ± 0.9  59.8 ± 0.9  60.6 ± 1.0
SUN         52.1 ± 0.4  49.6 ± 0.4  53.1 ± 0.4  50.6 ± 0.4  52.7 ± 0.4
µTx         59.2 ± 0.1  56.6 ± 0.2  58.9 ± 0.2  57.7 ± 0.2  59.3 ± 0.2
Table 5: The impact of intra-dataset variations in captions on CLIP’s transfer performance. Here,
we use synthetic captions for pre-training, constructed using COCO multi-object image labels.
We vary whether these captions are consistent (i.e., do they use a single term to describe a given
object?) and complete (i.e., do they describe all image objects?). We also consider a variant of
CLIP, CLIPS, which uses multiple captions per image.
Model         SimCLR                                          CLIP
Dataset size  100K        200K        500K        2M          100K        200K        500K        2M
Aircraft      40.5 ± 1.0  40.3 ± 1.0  39.3 ± 0.9  37.9 ± 1.0  35.5 ± 1.0  39.9 ± 1.0  41.6 ± 1.0  45.1 ± 1.0
Birdsnap      20.2 ± 0.5  20.7 ± 0.5  20.6 ± 0.5  20.6 ± 0.5  15.1 ± 0.5  17.5 ± 0.5  19.8 ± 0.5  24.0 ± 0.6
Cal101        70.7 ± 0.7  70.3 ± 0.7  70.3 ± 0.7  69.0 ± 0.7  67.7 ± 0.8  73.5 ± 0.7  79.0 ± 0.6  84.8 ± 0.6
Cal256        57.7 ± 0.5  57.3 ± 0.5  57.3 ± 0.5  56.7 ± 0.5  54.4 ± 0.4  60.0 ± 0.5  65.9 ± 0.4  73.9 ± 0.4
Cars          33.3 ± 0.6  31.2 ± 0.6  29.6 ± 0.6  27.5 ± 0.6  29.8 ± 0.6  33.8 ± 0.6  37.7 ± 0.6  42.6 ± 0.7
CIFAR-10      81.0 ± 0.5  80.4 ± 0.5  79.3 ± 0.5  79.8 ± 0.5  82.5 ± 0.4  83.9 ± 0.4  85.6 ± 0.4  86.8 ± 0.4
CIFAR-100     58.1 ± 0.6  57.4 ± 0.6  56.4 ± 0.5  56.4 ± 0.6  59.7 ± 0.6  63.2 ± 0.5  64.8 ± 0.5  67.8 ± 0.6
DTD           62.8 ± 1.3  63.9 ± 1.2  64.5 ± 1.2  64.3 ± 1.3  63.7 ± 1.3  67.6 ± 1.3  70.3 ± 1.3  74.7 ± 1.2
Flowers       80.8 ± 0.6  80.3 ± 0.6  80.2 ± 0.6  79.4 ± 0.6  76.5 ± 0.6  80.8 ± 0.6  85.0 ± 0.5  88.8 ± 0.5
Food          57.6 ± 0.3  58.3 ± 0.4  57.0 ± 0.4  56.7 ± 0.4  56.6 ± 0.4  59.4 ± 0.4  62.7 ± 0.3  68.1 ± 0.3
Pets          58.2 ± 0.9  57.9 ± 1.0  56.8 ± 0.9  55.9 ± 0.9  49.7 ± 1.0  53.5 ± 0.9  60.2 ± 0.9  65.2 ± 0.9
SUN           49.4 ± 0.4  49.8 ± 0.4  49.7 ± 0.4  49.6 ± 0.4  45.9 ± 0.4  50.9 ± 0.4  55.3 ± 0.4  61.8 ± 0.4
µTx           55.9 ± 0.2  55.3 ± 0.2  55.1 ± 0.2  54.5 ± 0.2  53.1 ± 0.2  57.0 ± 0.2  60.7 ± 0.2  65.3 ± 0.2
Table 6: Transfer performance of SimCLR and CLIP models after pre-training on CC subsets.
Model         SimCLR                                          CLIP
Dataset size  100K        200K        500K        2M          100K        200K        500K        2M
Aircraft      39.5 ± 0.9  39.3 ± 1.0  38.0 ± 0.9  36.3 ± 0.9  17.0 ± 0.7  21.2 ± 0.8  41.5 ± 0.9  43.0 ± 0.9
Birdsnap      19.2 ± 0.5  18.9 ± 0.5  19.7 ± 0.5  19.0 ± 0.5  8.3 ± 0.4   10.4 ± 0.4  19.8 ± 0.5  26.2 ± 0.6
Cal101        71.0 ± 0.7  71.1 ± 0.7  70.3 ± 0.7  68.4 ± 0.7  42.7 ± 0.7  51.4 ± 0.7  75.2 ± 0.7  82.1 ± 0.6
Cal256        56.9 ± 0.5  58.5 ± 0.5  58.6 ± 0.5  57.7 ± 0.5  32.9 ± 0.4  38.2 ± 0.5  62.4 ± 0.5  70.5 ± 0.4
Cars          33.1 ± 0.6  29.8 ± 0.6  28.1 ± 0.6  26.8 ± 0.6  11.8 ± 0.4  15.5 ± 0.5  36.1 ± 0.7  37.4 ± 0.6
CIFAR-10      80.4 ± 0.5  80.6 ± 0.4  80.2 ± 0.5  79.7 ± 0.5  71.1 ± 0.5  72.9 ± 0.5  83.5 ± 0.4  86.0 ± 0.4
CIFAR-100     56.8 ± 0.5  58.2 ± 0.5  56.8 ± 0.5  57.2 ± 0.6  46.6 ± 0.6  47.7 ± 0.6  62.3 ± 0.6  66.2 ± 0.5
DTD           64.8 ± 1.2  67.0 ± 1.2  67.3 ± 1.2  67.0 ± 1.3  41.9 ± 1.3  49.9 ± 1.3  69.1 ± 1.2  74.3 ± 1.1
Flowers       80.9 ± 0.6  81.2 ± 0.6  80.5 ± 0.6  80.5 ± 0.6  47.6 ± 0.7  54.9 ± 0.8  83.4 ± 0.5  89.4 ± 0.4
Food          57.4 ± 0.4  57.9 ± 0.4  56.9 ± 0.4  57.4 ± 0.4  36.6 ± 0.4  43.0 ± 0.3  61.7 ± 0.3  67.4 ± 0.4
Pets          54.8 ± 1.0  55.4 ± 1.0  55.6 ± 0.9  55.6 ± 0.9  30.4 ± 0.9  34.0 ± 0.9  55.7 ± 1.0  61.9 ± 0.9
SUN           51.4 ± 0.4  52.9 ± 0.4  53.1 ± 0.4  53.2 ± 0.4  29.2 ± 0.4  34.7 ± 0.4  54.6 ± 0.4  62.8 ± 0.4
µTx           55.5 ± 0.2  55.9 ± 0.2  55.4 ± 0.2  54.9 ± 0.2  34.7 ± 0.2  39.5 ± 0.2  58.8 ± 0.2  63.9 ± 0.2
Table 7: Transfer performance of SimCLR and CLIP models after pre-training on YFCC subsets.
Method            Size  Epochs  Aircraft  Birdsnap  Caltech101  Cars  CIFAR-10  CIFAR-100  DTD   Flowers  Food-101  Pets  SUN397
BYOL ([THO21])    100M  1000    47.5      31.3      84.0        44.3  85.0      63.9       75.2  93.4     67.9      71.1  63.4
MoCLR ([THO21])   100M  1000    45.6      29.4      85.6        41.1  87.8      69.9       75.8  92.9     67.7      67.7  63.4
CLIP              2M    100     43.0      26.2      82.1        37.4  86.0      66.2       74.3  89.4     67.4      61.9  62.8
Table 8: Comparison of our results to [THO21].
Model         CLIP                                            CLIPS
Dataset       CC                      YFCC                    CC          YFCC
Dataset size  100K        500K        100K        500K        100K        100K
Aircraft      35.5 ± 1.0  41.6 ± 1.0  35.4 ± 0.9  42.8 ± 0.9  40.1 ± 1.0  41.3 ± 0.9
Birdsnap      15.1 ± 0.5  19.8 ± 0.5  15.9 ± 0.5  20.7 ± 0.6  17.8 ± 0.5  19.2 ± 0.5
Cal101        67.7 ± 0.8  79.1 ± 0.6  67.9 ± 0.7  79.7 ± 0.6  75.1 ± 0.6  75.8 ± 0.6
Cal256        54.4 ± 0.5  65.9 ± 0.5  55.8 ± 0.5  67.6 ± 0.4  61.8 ± 0.5  62.6 ± 0.5
Cars          29.8 ± 0.6  37.7 ± 0.6  29.6 ± 0.6  37.8 ± 0.6  37.3 ± 0.6  38.1 ± 0.6
CIFAR-10      82.5 ± 0.5  85.6 ± 0.4  82.9 ± 0.4  85.6 ± 0.4  83.6 ± 0.4  82.7 ± 0.4
CIFAR-100     59.7 ± 0.6  64.8 ± 0.5  60.9 ± 0.6  65.2 ± 0.5  62.2 ± 0.6  60.9 ± 0.6
DTD           63.7 ± 1.2  70.2 ± 1.2  64.1 ± 1.3  71.0 ± 1.3  67.7 ± 1.3  68.7 ± 1.2
Flowers       76.5 ± 0.6  85.0 ± 0.5  77.7 ± 0.6  86.2 ± 0.5  81.4 ± 0.6  83.7 ± 0.6
Food          56.6 ± 0.4  62.7 ± 0.3  57.3 ± 0.4  64.0 ± 0.3  59.6 ± 0.4  61.3 ± 0.3
Pets          49.7 ± 1.0  60.2 ± 0.9  49.6 ± 0.9  61.6 ± 0.9  54.7 ± 0.9  56.6 ± 0.9
SUN           45.9 ± 0.4  55.3 ± 0.4  47.7 ± 0.4  57.1 ± 0.4  52.5 ± 0.4  54.9 ± 0.4
µTx           53.7 ± 0.2  60.7 ± 0.2  54.8 ± 0.2  61.8 ± 0.2  57.8 ± 0.2  58.8 ± 0.2
Table 9: Effect of using BLIP captions for CC/YFCC images in CLIP training.
Dataset       CC          YFCC
Dataset size  100K        500K
Aircraft      37.0 ± 1.0  41.0 ± 1.0
Birdsnap      15.5 ± 0.5  21.1 ± 0.5
Cal101        71.1 ± 0.7  78.2 ± 0.7
Cal256        55.9 ± 0.5  64.9 ± 0.4
Cars          30.9 ± 0.6  35.2 ± 0.6
CIFAR-10      82.9 ± 0.5  85.1 ± 0.4
CIFAR-100     59.3 ± 0.6  63.4 ± 0.6
DTD           63.8 ± 1.3  71.8 ± 1.2
Flowers       76.3 ± 0.7  84.3 ± 0.5
Food          57.4 ± 0.4  64.1 ± 0.3
Pets          52.7 ± 0.9  59.3 ± 0.9
SUN           47.4 ± 0.4  56.4 ± 0.4
µTx           54.2 ± 0.2  60.4 ± 0.2
Table 10: Effect of caption filtering on CLIP’s transfer performance.
Dataset       CC          COCO
Dataset size  200K        120K
Aircraft      41.9 ± 0.9  44.7 ± 1.0
Birdsnap      18.8 ± 0.5  18.6 ± 0.5
Cal101        77.4 ± 0.7  75.9 ± 0.6
Cal256        63.5 ± 0.4  62.8 ± 0.4
Cars          38.2 ± 0.6  40.8 ± 0.6
CIFAR-10      84.0 ± 0.4  84.1 ± 0.4
CIFAR-100     62.5 ± 0.6  61.3 ± 0.6
DTD           68.1 ± 1.2  65.3 ± 1.3
Flowers       82.4 ± 0.6  81.9 ± 0.6
Food          60.4 ± 0.4  62.0 ± 0.4
Pets          56.0 ± 1.0  59.6 ± 1.0
SUN           53.1 ± 0.4  51.9 ± 0.4
µTx           58.8 ± 0.3  58.9 ± 0.3
Table 11: Training CLIPS models using additional captions generated via GPT-J paraphrasing.