arXiv:2207.07635v1 [cs.CV] 15 Jul 2022
Is a Caption Worth a Thousand Images?
A Controlled Study for Representation Learning
Shibani Santurkar
Stanford
shibani@stanford.edu
Yann Dubois
Stanford
yanndubs@stanford.edu
Rohan Taori
Stanford
rtaori@stanford.edu
Percy Liang
Stanford
pliang@cs.stanford.edu
Tatsunori Hashimoto
Stanford
thashim@stanford.edu
Abstract
The development of CLIP [RKH+21] has sparked a debate on whether language supervi-
sion can result in vision models with more transferable representations than traditional image-
only methods. Our work studies this question through a carefully controlled comparison of
two approaches in terms of their ability to learn representations that generalize to downstream
classification tasks. We find that when the pre-training dataset meets certain criteria—it is suf-
ficiently large and contains descriptive captions with low variability—image-only methods do
not match CLIP’s transfer performance, even when they are trained with more image data.
However, contrary to what one might expect, there are practical settings in which these criteria
are not met, wherein added supervision through captions is actually detrimental. Motivated by
our findings, we devise simple prescriptions to enable CLIP to better leverage the language
information present in existing pre-training datasets.
1 Introduction
Image-based contrastive learning approaches have shown promise in building models that gener-
alize beyond the data distributions they are trained on [WXY+18; HFW+20; CKN+20; CMM+20;
CKS+20; CKS+20; CTM+21; CH21]. By leveraging large-scale (unlabelled) data sources through
self-supervised training, these models learn representations that transfer to diverse downstream
tasks—more so than their supervised counterparts [EGH21].
Recently, Radford et al. [RKH+21] showed that a different approach—contrastive learning with
language supervision—can yield models (known as CLIP) with remarkable transfer capabilities.
This development has garnered significant interest in the vision and natural language processing
communities alike, leading to a debate on the utility of multi-modality in representation learn-
ing [ZWM+22; DCB+21; FIW+22]. Our work focuses on a specific question within this debate:
Does language supervision lead to more transferable representations than using images alone?
It might seem like the answer to this question is obvious. After all, CLIP utilized caption in-
formation unavailable to traditional image-based approaches and showed substantial gains over
[Figure 1 (illustration): the two pre-training pipelines are shown side by side. In image-only supervision (e.g., SimCLR), an image x is paired with an augmented view x+ ∼ T(x); in image-language supervision (e.g., CLIP), x is paired with a caption, with a consistency objective in both cases. The example captions shown for the pictured street scene are: “A row of five 227 Back to Santa Maria on market day. Santa Maria De Palautordera, Cancell del Montseny, Barcelona, Spain.”; “duffel van parked on street beside small buildings and brick building”; “choosing the parking meters on this street should be very difficult”; “the car is parked on the side of the road by the tall buildings”.]
Figure 1: A conceptual view of contrastive image-only and image-language pre-training. The two
methods rely on the same self-supervised objective: aligning the representations of positive pairs
(x, x+) while distinguishing them from negative examples (e.g., other examples in the batch). The
transformation T(·) which is used to obtain x+ ∼ T(x) (augmented image or caption) encodes the
equivalences we would like the model to satisfy.
prior work [RKH+21]. However, CLIP is drastically different from these approaches in many
ways, from training data to fine-grained implementation choices [DCB+21], which makes it dif-
ficult to isolate the contribution of language supervision. Further, recent studies on CLIP’s zero-
shot classification and robustness properties cast doubt on whether adding language supervision
is always beneficial [FIW+22]. Resolving the aforementioned debate thus requires a carefully con-
trolled comparison of the two approaches in which the only difference is the form of supervision.
Our contributions. We devise a methodology to assess the utility of language supervision from
a representation learning standpoint. To do so, we recognize that CLIP and popular image-based
methods share the same underlying primitive of contrastive learning. In particular, CLIP is con-
ceptually strikingly similar to SimCLR [CKN+20]. Perhaps the only irreducible difference between
them is whether supervision is provided to the model via image augmentations or image-caption
matching (see Figure 1)—which is precisely the quantity we want to study. Thus, we can assess
the value of language supervision by systematically comparing appropriately matched versions
of SimCLR and CLIP¹ in terms of their downstream transfer performance. We find that in practice,
the picture is nuanced and depends on three properties of the pre-training dataset:
1. If the scale of the dataset is sufficiently large, CLIP representations transfer better than their
SimCLR counterparts. This gap is not bridged by training SimCLR with more data, suggest-
ing that a caption can be worth more than any number of images. However, in the low-data
regime, language supervision actually hurts model performance both in- and out-of-distribution.
2. The descriptiveness [KGP21] of dataset captions—i.e., the extent to which they report what is
contained in an image—directly determines how well the resulting CLIP models transfer. In
1We use CLIP to mean models trained using Radford et al. [RKH+21]’s approach, and not their pre-trained model.
fact, we find that a single descriptive image-caption pair (e.g., from MS-COCO [LMB+14]) is
worth five less descriptive, uncurated captions (e.g., from YFCC [TSF+16]).
3. The variability of captions within a dataset (e.g. due to stylistic or lexical factors) can ad-
versely affect CLIP’s performance. We propose a modification to standard CLIP training—
performing text data augmentations by sampling from a pool of captions for each image—to
alleviate this effect.
Overall, we find that these three properties have inter-twined effects on CLIP’s performance.
For instance, the scale of the widely-used YFCC dataset can, to some extent, compensate for its
less-descriptive and variable captions. Guided by our findings, we devise simple interventions
on datasets that can lead to more-transferable CLIP models: (i) filtering out low-quality captions
through a text-based classifier, and (ii) applying data augmentation to captions by paraphrasing
them using pre-trained language models [WK21].
2 An apples-to-apples comparison
As stated in Section 1, our goal is to assess the value of language supervision in representation
learning relative to using images alone. While there have been studies of image-only and image-
language pre-training methods in isolation [WXY+18; HFW+20; CKN+20; CMM+20; CKS+20;
CKS+20; CH21; CTM+21; RKH+21] and side-by-side [DCB+21; FIW+22], none of these works
conclusively answer our motivating question due to confounders such as: (i) algorithmic and
architectural variations, and (ii) differing pre-training datasets.
In this section, we outline a series of steps that we take to mitigate these confounders and
compare the two methods on equal footing. Since our focus is on representation learning, we
measure performance in terms of the usefulness of a model’s representations for downstream
tasks, using the evaluation suite of Kornblith et al. [KSL19] to do so. We focus on the fixed-feature
setting where we freeze the weights of a given model and then train a linear probe using task data.
Details of our experimental setup are presented in Appendix A.
2.1 Finding common ground
Our approach for isolating the effects of language supervision is guided by the following insight:
CLIP shares a fundamental commonality with widely-used image-only pre-training methods.
Namely, they rely on the same algorithmic primitive of contrastive learning, which we illustrate
in Figure 1. In both cases, the model is trained with a self-supervised objective: given an image
x, it must distinguish positive examples x+ ∼ T(x) from negative ones (e.g., other examples x̂ in
the batch). The choice of transformation T(·) is at the core of contrastive learning as it controls the
equivalences encoded in model representations [DBU+21; HWG+21]. T(x) corresponds to image
augmentations (e.g., rotations) in image-only methods and natural language captions in CLIP.
Thus, to understand the role of language supervision, we can compare CLIP to its closest
image-only equivalent: SimCLR.² Both CLIP and SimCLR rely on a cross-entropy-based objective,
which for a given pair (x, x+) of positive examples with associated negatives N is
$$\ell = -\log \frac{\exp(\mathrm{sim}(z, z^{+})/\tau)}{\sum_{n \in \mathcal{N} \cup \{z^{+}\}} \exp(\mathrm{sim}(z, z_{n})/\tau)}, \quad \text{where } z = g(\phi(x)) \text{ and } z^{+/n} = g'(\phi'(x^{+/n})), \tag{1}$$
2Other image-based methods [HFW+20; CKS+20; CH21; CTM+21] have optimizations that are not present in CLIP.
where sim(·, ·) is cosine similarity, φ/φ′ are the encoders, and g/g′ are the projection heads.
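For concreteness, a minimal PyTorch sketch of the CLIP-style (two-encoder) instantiation of Eq. (1) is shown below. It illustrates the objective rather than reproducing our exact training code (which follows the open-source implementations referenced in Appendix A.2); the function and variable names are illustrative.

```python
import torch
import torch.nn.functional as F

def contrastive_loss(z_img, z_txt, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of projected embeddings.

    z_img, z_txt: [batch, dim] tensors, i.e., g(phi(x)) and g'(phi'(x+)).
    The matching (image, caption) pair is the positive; every other caption
    (resp. image) in the batch serves as a negative, as in Eq. (1).
    """
    z_img = F.normalize(z_img, dim=-1)               # unit-normalize so that the
    z_txt = F.normalize(z_txt, dim=-1)               # dot product is cosine similarity
    logits = z_img @ z_txt.t() / temperature         # [batch, batch] similarity matrix
    targets = torch.arange(z_img.shape[0], device=z_img.device)
    loss_i2t = F.cross_entropy(logits, targets)      # image -> caption direction
    loss_t2i = F.cross_entropy(logits.t(), targets)  # caption -> image direction
    return 0.5 * (loss_i2t + loss_t2i)
```

For SimCLR, z_txt would instead be the embedding of a second augmented view of the same image and, unlike CLIP, transformed views of the other images in the batch also act as negatives (see the mismatches listed below).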
We now discuss the steps we take to alleviate other inconsistencies (aside from T(x)) between
CLIP and SimCLR:
• Transformation stochasticity: We first note that the two methods differ in how they obtain x+,
not just due to the choice of T(x) but also the generative process itself. In SimCLR, x+ is
a new random draw from T(x) in every batch, while for CLIP, it is a single fixed caption.
Perfectly matching them requires training CLIP by sampling a fresh caption x+ for each
image at each iteration. We will refer to this idealized version of CLIP as CLIPS.
• Image augmentations: Both methods apply data augmentations to the image x at each step
in training. However, the specific augmentations used in CLIP (resize and crop) differ
from those used for SimCLR (resize, crop, flip, jitter, blur, grayscale). We remove this
confounder by using standard SimCLR augmentations unless otherwise specified.
• Architecture: We use the ResNet-50 [HZR+16] architecture as the image encoder for both
methods, and a Transformer [VSP+17] as the text encoder (for captions) in CLIP.
• Datasets: Typically, CLIP and SimCLR are trained on different datasets, as the former requires
matched image-caption pairs, while the latter can leverage any computer vision dataset. To
control for the effect of the data distribution, we pre-train both models on the same datasets:
starting with the relatively controlled MS-COCO (Section 2.2) and then moving to larger-scale datasets (Section 3).
• Hyperparameters: We extensively tune hyperparameters for both methods (Appendix A.3).
Mismatches. Despite our efforts to match CLIP and SimCLR, there are some inconsistencies that
we are unable to account for—partly due to the differences in their modalities. In particular, CLIP:
(i) Processes T(x) using a text transformer rather than SimCLR’s ResNet-50.
(ii) Does not share weights between the encoders processing x and T(x) because they corre-
spond to different modalities, unlike SimCLR.
(iii) Uses a linear projection head g/g′ instead of SimCLR’s MLP, which we allow as Radford
et al. [RKH+21] showed that this choice does not affect CLIP’s performance.
(iv) Only uses other examples in the batch from the same modality as negatives. Thus CLIP has
half the number of negatives compared to SimCLR, which also uses transformed versions of
other examples in the batch (i.e., both x̂ and x̂+) as negatives.
In Sections 2.2 and 3.2, we take a closer look at how our matched versions of the CLIP and
SimCLR methods compare in terms of downstream transfer performance.
2.2 A COCO case study
We begin our study by comparing CLIP and SimCLR models trained on the MS-COCO
dataset [LMB+14] (henceforth referred to as COCO), which contains 120K images with multi-
object labels. Each image has five human-provided captions, collected post-hoc by Chen et al.
[CFL+15] using Amazon Mechanical Turk. Annotators were given detailed instructions on how
to caption an image such as to describe only the important parts of the image, not to use proper
names, and to use at least 8 words.
Model | COCO | Aircraft | Birdsnap | Caltech101 | Caltech256 | Cars | CIFAR10 | CIFAR100 | DTD | Flowers | Food-101 | Pets | SUN397 | µTx
Supervised | 90.6 | 31.6 | 11.8 | 65.8 | 53.7 | 21.7 | 74.8 | 46.7 | 55.9 | 63.4 | 47.1 | 45.9 | 44.5 | 47.2 ± 0.2
SimCLR | 89.0 | 40.6 | 18.5 | 71.5 | 58.6 | 31.5 | 82.1 | 57.3 | 61.7 | 77.4 | 58.7 | 57.3 | 51.9 | 56.0 ± 0.2
CLIP | 88.4 | 41.4 | 17.6 | 73.2 | 60.4 | 35.8 | 83.6 | 60.8 | 65.7 | 80.5 | 60.9 | 57.0 | 50.8 | 57.5 ± 0.1
CLIPS | 89.8 | 46.4 | 20.0 | 78.4 | 65.6 | 41.5 | 84.6 | 62.5 | 66.7 | 83.9 | 65.3 | 61.2 | 54.9 | 61.3 ± 0.2
Table 1: Linear probe accuracy for COCO pre-trained models in-distribution and on the transfer
suite from Kornblith et al. [KSL19]. Here, µTx denotes the average test accuracy of the model over
downstream transfer tasks. We report 95% confidence intervals (CI) via bootstrapping.
We use COCO as our starting point for two reasons. First, we can assess the utility of language
supervision in the ideal setting where the captions are of fairly high quality due to the careful
curation process. Second, we can approximate CLIPS³ by sampling from the available set of five
captions per image.
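A minimal sketch of this approximation is shown below: a dataset wrapper that draws a fresh one-of-five caption every time an example is requested, so that the positive pair changes across iterations as in CLIPS. The record format and the image_transform/tokenize arguments are placeholders for illustration, not the exact interface of our training code.

```python
import random
from torch.utils.data import Dataset

class MultiCaptionDataset(Dataset):
    """Wraps (image, [caption_1, ..., caption_k]) records; returns a randomly
    drawn caption each time an item is fetched, approximating CLIP_S."""

    def __init__(self, records, image_transform, tokenize):
        self.records = records                  # list of (PIL image, list of captions)
        self.image_transform = image_transform  # e.g., the SimCLR image augmentations
        self.tokenize = tokenize                # tokenizer used by the CLIP text encoder

    def __len__(self):
        return len(self.records)

    def __getitem__(self, idx):
        image, captions = self.records[idx]
        caption = random.choice(captions)       # one-of-k caption sampling per iteration
        return self.image_transform(image), self.tokenize(caption)
```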
Captions (often) help on COCO. In Table 1, we compare pre-trained models (SimCLR, CLIP,
CLIPS and a supervised baseline), in terms of the accuracy of a linear probe on: (i) COCO classifi-
cation (in distribution), and (ii) transfer to downstream tasks from Kornblith et al. [KSL19].
On COCO classification, supervised models outperform self-supervised ones, and SimCLR is
more accurate than CLIP trained on human-written captions. This trend, however, flips when
we consider out-of-distribution performance. On most transfer tasks, CLIP performs the best:
yielding, on average, a 10% accuracy gain over the supervised baseline and a 2% gain over SimCLR.
Using a stochastic version of CLIP further boosts its performance. CLIPS matches SimCLR
performance in-distribution and is about 5% better on average in terms of transfer. While this
improvement is remarkable, it is not clear why stochastically sampling different (one-of-five) cap-
tions for a given image helps. For instance, it may be optimization-related or linked to properties
of the captions themselves. We revisit this question in Section 3.3.
Notably, we find that matching the image augmentations (applied to x) is crucial to correctly
assess the merit of added language supervision. In particular, using standard CLIP augmentations
(only resize and crop) during training lowers its average transfer accuracy by 10% (Appendix
Table 4). With these same augmentations, SimCLR’s performance also drops by 50%. This shows
the importance of correctly controlling for potential confounders discussed in Section 2.1.
3 The impact of pre-training data
Our analysis of COCO shows that language supervision can indeed be beneficial over using im-
ages alone. However, the datasets that CLIP is typically trained on differ, both in scale and qual-
ity, from COCO. For instance, COCO captions were collected post-hoc under controlled settings,
which is markedly different from the automatic scraping procedure used to gather data at scale.
Thus, we shift our focus to two more prototypical pre-training datasets:
3We overload notation and use CLIPS to denote: (i) the idealized stochastic version of CLIP, which samples from
infinite captions per image, and (ii) our approximation of it using a finite set of (typically five) image captions.
[Figure 2 (illustration): example captions from the three datasets. COCO: “A man rides a giant wave on his surfboard.”; “A bathroom counter with two sinks under two mirrors.” ConceptualCaptions: “cricket players with the trophy after victory over country at sports facility.”; “thumb is poking into the smartphone screen black and white sketch, simple drawing.” YFCC: “Thailand - Feb 23-24 2007 - Ko Phi Phi Island, Phuket 024 This is where we sat and waited for the boat to Ko Phi Phi. Interesting place to watch the local load up the boats before the tide came in.”; “tower has stood centuries the view Brian saw but Pearl chickened out from. so it's been standing since the 9th century. it's teeny and only gets narrow as you go up.”]
Figure 2: Random samples from the COCO, CC and YFCC datasets (also see Appendix Figure 6).
There are noticeable differences in the diversity of their images and the corresponding captions.
• ConceptualCaptions [SDG+18] (CC) contains 3.3M images harvested from the web, with their
ALT-text HTML attributes as captions. The dataset was filtered (retaining only 0.2%) for text
quality—e.g., well-formed captions that mention at least one object found via the Google
Cloud Vision API. Furthermore, all proper nouns in the captions were hypernymized (e.g.,
”Justin Timberlake” becomes ”pop artist”).
• Yahoo Flickr Creative Commons [TSF+16] (YFCC): This dataset has 99.2M images from Flickr,
along with their posted titles as captions with no filtering or post-processing.
We now assess whether our findings on COCO translate to the CC and YFCC datasets. We
start by comparing the transfer performance of CLIP and SimCLR on COCO-scale subsets of
CC/YFCC—cf. points corresponding to 100K samples in Figure 3 (right). We observe that Sim-
CLR’s performance does not vary much across pre-training datasets. On the other hand, CLIP’s
transfer capabilities are highly sensitive to the dataset. With 100K samples from CC/YFCC, using
CLIP is worse than pre-training only on images—in contrast to what we found on COCO.
Inspecting COCO, CC and YFCC samples (Figure 2) yields a possible explanation for this sen-
sitivity. The datasets differ not just in scale and image diversity, but also in the extent to which their
captions: (i) describe visually salient aspects of the image, and (ii) vary across images (e.g., in style
and wording). For instance, captions in COCO are homogeneous and descriptive, while in YFCC,
they vary and are often complementary to the image. In Sections 3.1-3.3, we investigate the effect these
properties (scale, descriptiveness and variability) of the dataset have on CLIP’s performance.
3.1 Scale matters
A major advantage of contrastive learning methods is that they can leverage the vast amounts
of unlabeled data available on the Internet. Thus, it is natural to ask how different forms of con-
trastive supervision benefit from added pre-training data. Intuitively, we expect image-only meth-
ods to perform worse for smaller datasets as they are increasingly less likely to encounter images
with similar augmentations. We might further expect image-language models to perform more
favorably in this setting since they receive richer supervision.
To test whether this is the case, we compare CLIP and SimCLR models trained on datasets
of varying sizes: in the 10-100K sample regime for COCO, and 100K-2M sample regime for
CC/YFCC.4 Our results in Figure 3 deviate from our earlier expectations. First, beyond a cer-
4Due to computational constraints, we train CLIP/SimCLR for fewer epochs (100 instead of 200) on 2M examples.
Figure 3: Average transfer accuracy of models w.r.t. pre-training dataset size for COCO (left), and
CC and YFCC (right). Language supervision consistently improves performance in the medium to
large data regime over using images alone. However, on small corpora, providing the model with
additional information via captions is actually detrimental. Due to computational constraints, we
train models for fewer epochs (100 instead of 200) on datasets of size 2M.
tain point, SimCLR’s transfer performance improves only marginally with additional data. While
surprising, similar effects have been noted previously [THO21; CYW+22], especially when the
data is uncurated (e.g., YFCC) [THO21].5 Second, in the low-data regime (<50K/200K/500K for
COCO/CC/YFCC), training with language actually hurts the models’ transfer performance.
In fact, we find that (data) scale is essential to take advantage of language supervision. With
sufficient data, CLIP outperforms SimCLR on all three datasets. This gap remains even if we train
SimCLR with extra data, indicating that captions can be worth more than any number of images.
3.2 The importance of descriptive captions
As we saw in Figure 2, captions in typical datasets can vary in terms of how they relate to
the image. Prior work in linguistics and accessibility has drawn a distinction between captions
that are descriptive (meant to replace an image) and complementary (meant to give additional con-
text) [HYH13; KGP21; DMM+22]. This line of work suggests that COCO captions are more de-
scriptive due to the decontextualization of the image and strict instructions provided to the anno-
tators during the caption generation process [KGP21]. In contrast, Flickr captions (e.g., in YFCC)
tend to contain information that is complementary to the image since people typically do not
restate what can already be observed in the photographs they post [Gri75].
In representation learning for object recognition tasks, we ideally want to meaningfully encode
salient objects in the image. Recall that for contrastive models the learned representations are
determined by the transformation T(x) (captions for CLIP). This suggests a hypothesis: captions
that describe the contents of a scene will improve CLIP’s transferability.
To test this hypothesis, we need to quantify the descriptiveness of a caption. Since measuring
it precisely is infeasible, we approximate it with the help of a pre-trained caption-scoring model
(BLIP [LLX+22]). Specifically, we use the score given by BLIP of a caption matching its correspond-
ing image as a surrogate for descriptiveness. Comparing the average caption descriptiveness of
5Table 8 in the appendix shows that even more sophisticated image contrastive learning methods studied in Tian et
al. [THO21]—trained with better data augmentations, with 4x batch size, on 50x the data and for 10x more epochs—are
only marginally better than our CLIP models on the downstream tasks from Kornblith et al. [KSL19].
(a) [Plot of CLIP’s average transfer accuracy against the mean caption descriptiveness of each pre-training dataset; only the caption below is recoverable here.]

(b) Synthetic-caption results (the check marks in the Cons./Comp. columns of the original table did not survive extraction):
Model | COCO | µTx
CLIP | 88.8 | 59.2 ± 0.1
CLIP | 88.4 | 57.7 ± 0.2
CLIPS | 89.1 | 59.3 ± 0.2
CLIP | 88.4 | 56.6 ± 0.2
CLIPS | 89.3 | 58.9 ± 0.2
Figure 4: Relationship between CLIP’s transfer performance and (a) the average descriptiveness
of dataset captions and (b) intra-dataset caption variability. We approximate descriptiveness (a)
using a pre-trained BLIP model [LLX+22] to score the similarity between a caption and the image
it corresponds to. To study the effect of caption variability in (b), we construct synthetic captions
for the COCO dataset using its multi-object image labels. We vary whether these captions are
consistent (use a single term to describe a given object) and complete (describe all image objects).
the three datasets in Figure 4a, we see that COCO > CC > YFCC. This aligns with our earlier
subjective assessment as well as with prior work [HYH13; KGP21].
In Figure 4a, we visualize the relationship between the average descriptiveness of a dataset’s
captions and the transfer performance of the resulting CLIP model. We indeed find that descrip-
tive captions are crucial for CLIP’s performance, and one descriptive image-caption pair from
COCO is worth 2x and 5x samples from CC and YFCC respectively. On YFCC and CC, CLIP thus
requires more data to benefit from language supervision.
Finally, we train CLIP on 100K subsets of CC and YFCC with “more descriptive” captions by
re-captioning the images using BLIP [LLX+22] (examples in Appendix Figure 7). CLIP trained on
CC/YFCC using BLIP captions no longer performs worse than its COCO counterpart (Figure 4a).
This indicates that CLIP’s sensitivity to the pre-training corpus is not just an artifact of differing
image distributions, but due to the presence (or absence) of descriptive captions.
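A rough sketch of the descriptiveness proxy is shown below. For simplicity it scores each image-caption pair with the cosine similarity of a pre-trained open_clip model, standing in for the BLIP image-text matching score that we actually use, and the `pairs` iterable is a placeholder; the point is the per-dataset averaging.

```python
import torch
import torch.nn.functional as F
import open_clip

# Any pre-trained image-text model can serve as the scorer; open_clip is used
# here as a stand-in for the BLIP matching score used in the paper.
model, _, preprocess = open_clip.create_model_and_transforms("ViT-B-32", pretrained="openai")
tokenizer = open_clip.get_tokenizer("ViT-B-32")
model.eval()

@torch.no_grad()
def mean_descriptiveness(pairs):
    """pairs: iterable of (PIL image, caption string) from one dataset."""
    scores = []
    for image, caption in pairs:
        img = preprocess(image).unsqueeze(0)
        txt = tokenizer([caption])
        z_img = F.normalize(model.encode_image(img), dim=-1)
        z_txt = F.normalize(model.encode_text(txt), dim=-1)
        scores.append((z_img @ z_txt.t()).item())  # similarity of caption and image
    return sum(scores) / max(len(scores), 1)       # dataset-level average
```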
3.3 The effect of intra-dataset variations in captions
Next, we examine how the variability of captions within a dataset affects CLIP’s transfer capabil-
ities. After all, there are many ways to caption an image, as shown in Figure 1. The presented
captions vary in terms of how they describe an object (e.g., “duffel van” or “car”), and the parts of
the image they focus on (e.g., discussing the “street” or “brick”). These stylistic, lexical, and focus
variations in captions could make it harder for CLIP to learn meaningful representations.
A simple setting. We investigate this effect on the COCO dataset by creating synthetic captions
using multi-object image labels (examples in Appendix Figure 8). Here, we can construct cap-
tions to precisely control whether they are: (i) consistent: by either using a fixed term or random
synonyms to describe an object across the dataset; and (ii) complete: by either mentioning all or a
random subset of image objects. In this setting, we find:
• A CLIP model trained with complete and consistent synthetic captions outperforms a model
trained on human-written captions (cf. row 1 in Figure 4b to row 3 in Table 1).
(a)
Method | NC | Source | CC (100K) | YFCC (100K)
SimCLR | 0 | - | 55.9 ± 0.2 | 55.5 ± 0.2
CLIP | 1 | Human | 53.1 ± 0.2 | 34.7 ± 0.2
CLIP | 1 | BLIP | 53.7 ± 0.2 | 54.8 ± 0.2
CLIPS | 2 | BLIP | - | 56.9 ± 0.2
CLIPS | 5 | BLIP | 57.8 ± 0.2 | 58.8 ± 0.2
CLIPS | 10 | BLIP | - | 59.1 ± 0.2

(b) [Plot of transfer accuracy when trading off the total number of image-caption pairs against the number of captions per image on YFCC; only the caption below is recoverable here.]
Figure 5: A closer look at CLIPS. (a) Sensitivity of its transfer performance to the number of
captions per image (NC) used during training. For CC and YFCC, we use the BLIP captioning
model to generate multiple diverse captions per image. (b) Performance trade-offs between pre-
training using more image-caption pairs vs. more captions per image on the YFCC dataset.
• Dropping these two conditions, and thereby increasing the variability of captions, causes the
transfer performance of the model to drop (cf. rows 1, 2, and 4 in Figure 4b).
• A stochastic version of CLIP, i.e. CLIPS based on 5 synthetic captions per image, is not as
affected by caption inconsistency and/or incompleteness. The 2% improvement of CLIPS
over CLIP here mirrors the 3.6% gain seen for human-provided captions in Table 1.
• Unlike CLIP, CLIPS transfers 2% better on average when trained on human-provided cap-
tions as opposed to synthetic ones.
These findings suggest that variability in dataset captions does have an adverse effect on the re-
sulting CLIP models. This drop can be mitigated by stochastically sampling from a set of possible
captions per image, rather than using a single fixed caption (in CLIPS). Note that standard im-
age contrastive learning methods already do this since they use random data augmentations (e.g.,
a different crop) to generate T(x) at every epoch. Finally, our results show that human-written
captions contain useful information for representation learning that is not present in object la-
bels alone. However, extracting this signal is not straightforward, and may require incorporating
multiple captions into CLIP training.
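For reference, a minimal sketch of how such synthetic captions can be constructed from multi-object labels is shown below. The caption template and synonym table are hypothetical stand-ins; the captions we actually use are illustrated in Appendix Figure 8.

```python
import random

# Hypothetical synonym table; in practice one entry per COCO object category.
SYNONYMS = {
    "car":    ["car", "automobile", "vehicle"],
    "person": ["person", "man", "woman"],
    "dog":    ["dog", "puppy", "canine"],
}

def synthetic_caption(objects, consistent=True, complete=True):
    """Build a caption from an image's multi-object labels.

    consistent: always use the first (fixed) term per object vs. a random synonym.
    complete:   mention every labeled object vs. a random non-empty subset.
    """
    if not complete:
        objects = random.sample(objects, k=random.randint(1, len(objects)))
    terms = [SYNONYMS.get(o, [o])[0] if consistent
             else random.choice(SYNONYMS.get(o, [o]))
             for o in objects]
    return "A photo containing " + ", ".join(terms) + "."

# e.g., synthetic_caption(["person", "dog", "car"], consistent=False, complete=False)
```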
Datasets in practice. Looking at Figure 2, it seems that COCO < CC < YFCC in terms of caption
variability. This trend may be expected given how their captions were obtained and/or post-
processed: careful post-hoc labeling for COCO, filtering and hypernymizing ALT-text for CC, and
scraping raw Flickr titles for YFCC. Our results in the simple setting above suggest that this
variability of YFCC captions (and to a lesser extent CC) could, along with their lower descriptiveness, be
responsible for the worse transfer performance of the resulting CLIP models (Figure 3 right). It
also explains why scale is essential to benefit from language supervision on CC and YFCC. After
all, CLIP would need to be trained on more captions to even encounter the same word twice.
How many captions are enough? In the simple setting above, we saw that performing “text
data augmentations” in CLIPS can reduce the adverse impacts of caption variability. We now
analyze how this effect scales with the number of available captions per image, focusing on the
CC and YFCC datasets. Since these datasets only contain one caption per image, we use the BLIP
captioning model to generate multiple captions via nucleus sampling [HBD+20]. We observe in
Figure 5a, that CLIPS improves as the number of available (BLIP-generated) captions per image
increases (plateauing around 10). However, scaling up the overall number of image-caption
pairs appears to be far more effective than incorporating more captions per image (at least those
obtained via BLIP) from the perspective of improving transfer performance (see Figure 5b). Note
that the relative costs of these two approaches are context dependent—e.g. in Section 4 we discuss
ways to augment the pool of image captions in a dataset without additional data collection.
It is important to note that there is a strong inter-dependence between the three properties we
discussed in Section 3 in terms of their influence on CLIP’s transfer capabilities. For instance,
scale can, to an extent, compensate for variable/less descriptive captions (as seen in Figure 3).
That being said, our findings suggest that the utility of captions as a form of supervision can be
greatly improved by being more mindful of what (and how) they describe in an image.
4 Making existing captions work
So far, we have identified two properties of captions that impact their usefulness as a mode of
supervision: (i) descriptiveness and (ii) variability. With these in mind, we now put forth simple
interventions that can be made to datasets to improve the performance of CLIP models.
Data pre-processing: Given the importance of caption descriptiveness, we might consider pre-
processing scraped data to select for samples with this property. The CC data collection proce-
dure [SDG+18] partially demonstrates the effectiveness of this approach, as pre-training CLIP on
CC samples leads to better transfer performance than a comparable number of “raw” YFCC ones.
However, due to its reliance on the Google Vision API, this procedure can be quite expensive,
with costs scaling with the size of the scraped data. Recent works have taken a different approach,
using pre-trained image-language matching models (like CLIP and BLIP) to filter data [SVB+21].
Given that we are interested in building such models in the first place, we avoid taking this route.
Instead, we focus on understanding how far we can get by simply discarding low-quality cap-
tions, agnostic to the images. To do so, we take inspiration from the filtering pipelines used to
build large language models [BMR+20]. Here, raw Internet data is cleaned by selecting samples
that are “similar” to known high-quality datasets (e.g., Wikipedia). Taking a similar approach, we
train a linear classifier on bag-of-n-grams sentence embeddings [JGB+17] to distinguish valida-
tion set CC/YFCC captions from COCO ones. This classifier is then used to filter CC/YFCC, only
retaining samples that are predicted as being COCO-like (examples in Appendix Figure 9). For a
given pre-training data budget, we see moderate gains (2%) from using this simple heuristic to
filter the CC and YFCC datasets—see Table 2 (left). To put these gains in context, we also report the
performance of CLIP trained on the same images with “high-quality” BLIP-generated captions.
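A simple sketch of such a filter is shown below, using scikit-learn bag-of-n-grams features and logistic regression as a stand-in for the fastText-style classifier of [JGB+17] that we use; function names and the exact feature settings are illustrative.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

def fit_caption_filter(coco_captions, raw_captions):
    """Linear classifier over word n-gram features: COCO-like (1) vs. raw CC/YFCC (0)."""
    texts = list(coco_captions) + list(raw_captions)
    labels = [1] * len(coco_captions) + [0] * len(raw_captions)
    clf = make_pipeline(
        TfidfVectorizer(ngram_range=(1, 2), min_df=2),  # word uni- and bi-grams
        LogisticRegression(max_iter=1000),
    )
    clf.fit(texts, labels)
    return clf

def filter_dataset(clf, samples):
    """Keep only the (image, caption) pairs whose caption is predicted COCO-like."""
    return [(img, cap) for img, cap in samples if clf.predict([cap])[0] == 1]
```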
Mitigating caption variability: As we saw in Section 3.3, models trained with CLIPS are less
impacted by caption variability. However, typical image-caption datasets (such as CC and YFCC)
only have one caption per image. We thus devise a methodology to augment these captions by
leveraging recent open-source large language models [WK21]. Concretely, we provide GPT-J with
4 (caption, paraphrase) pairs as in-context [BMR+20] examples. We then prompt it to paraphrase
(left)
Dataset | Method | Preproc. | µTx
CC (100K) | SimCLR | - | 55.9 ± 0.3
CC (100K) | CLIP | - | 53.1 ± 0.2
CC (100K) | CLIP | Filter | 54.2 ± 0.2
YFCC (500K) | SimCLR | - | 55.4 ± 0.2
YFCC (500K) | CLIP | - | 58.8 ± 0.2
YFCC (500K) | CLIP | BLIP | 61.8 ± 0.2
YFCC (500K) | CLIP | Filter | 60.4 ± 0.2

(right)
Dataset | Method | Caption | µTx
COCO (120K) | SimCLR | - | 56.0 ± 0.2
COCO (120K) | CLIP | Human | 57.5 ± 0.1
COCO (120K) | CLIPS | Human | 61.3 ± 0.2
COCO (120K) | CLIPS | GPT-J | 58.9 ± 0.3
CC (200K) | SimCLR | - | 55.3 ± 0.2
CC (200K) | CLIP | Human | 57.0 ± 0.3
CC (200K) | CLIPS | GPT-J | 58.8 ± 0.3
Table 2: Improving CLIP’s transfer performance through simple interventions on existing datasets.
(left) Applying a simple bag-of-words classifier to identify data subsets with “high quality” cap-
tions. (right) Using in-context learning with GPT-J to obtain five diverse captions for dataset im-
ages (via paraphrasing) which are then used to train CLIPS. For COCO, we also compare to CLIPS
trained with five human-written captions.
a given target caption. By sampling from GPT-J, we can obtain multiple (in our case five) para-
phrases for every such caption (see Appendix Figure 10 for examples). In Table 2 (right), we
see that feeding these captions into CLIPS results in a considerable performance boost over CLIP
(trained with a single caption/image). For instance, for COCO, CLIPS trained on our generated
captions bridges more than half of the performance gap between CLIP and CLIPS trained with
one and five human-provided captions respectively.
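A sketch of this caption-augmentation step is shown below, using the Hugging Face port of GPT-J; the in-context demonstrations and generation settings here are illustrative placeholders rather than our exact prompt.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "EleutherAI/gpt-j-6B"
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL, torch_dtype=torch.float16).to("cuda")

# Hypothetical (caption, paraphrase) demonstrations; the paper prompts with four such pairs.
DEMOS = [
    ("A man jumping a horse over an obstacle.",
     "A rider and his horse leap over a hurdle."),
    ("A large airliner preparing for departure at an airport.",
     "A big passenger jet getting ready to take off."),
]

def paraphrase(caption, n=5, max_new_tokens=40):
    """Sample n paraphrases of a caption from GPT-J via in-context learning."""
    prompt = "".join(f"Caption: {c}\nParaphrase: {p}\n\n" for c, p in DEMOS)
    prompt += f"Caption: {caption}\nParaphrase:"
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    outputs = model.generate(
        **inputs, do_sample=True, top_p=0.9, temperature=0.8,
        num_return_sequences=n, max_new_tokens=max_new_tokens,
        pad_token_id=tokenizer.eos_token_id,
    )
    generations = tokenizer.batch_decode(
        outputs[:, inputs["input_ids"].shape[1]:], skip_special_tokens=True)
    return [g.split("\n")[0].strip() for g in generations]
```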
5 Related Work
Representation learning. Building models with general representations that transfer to down-
stream tasks has been a long-standing goal in ML [DJV+14; RAS+14; CSV+14; AGM14; YCB+14].
Our work is in line with prior studies aimed at characterizing the effect of design choices made
during training [ARS+15; HAE16; CMB+16; KSL19; ZPK+19; LBL+20], e.g. model architecture,
datasets and loss functions, on learned representations.
The utility of language in vision. There has been a long line of work on leveraging language to
improve vision models [QCD07; SS12; FCS+13; BAM18; GWW19]. However, with the develop-
ment of CLIP and its variants [MKW+21; LLZ+22; YHH+22], this approach has become a serious
contender to traditional image-only ones. Follow-up works have sought to investigate how inte-
gral language is to CLIP’s performance. Ruan et al. [RDM22] suggest theoretically that the robust-
ness of linear probes on CLIP’s representations stems from pretraining with a large and diverse
set of images and domain-agnostic augmentations T(x). More recently, Fang et al. [FIW+22]
examined CLIP’s effective robustness [TDS+20] (on ImageNet-like datasets [DDS+09; RDS+15;
RRS+19; WGX+19; BMA+19; HZB+21; HBM+21]) in the zero-shot setting. They find that CLIP’s
robustness is comparable to that of a supervised classifier trained on the same pool of YFCC im-
ages, and therefore conclude that data distribution is more important than language supervision.
Our work is complementary to this study, as we examine the role of language in a different setting,
i.e., self-supervised representation learning. We show that the impact of language supervision in
this setting is complex and depends on the quality and quantity of image-caption data.
Most similar to our work is the recent study by Devillers et al. [DCB+21], which argues that
language supervision does not result in improved downstream transfer, few-shot learning and ad-
versarial robustness. While they also consider CLIP and image-only models (e.g. BiT [KBZ+20]),
they do not attempt to directly control confounding effects. In particular, the two sets of models
are trained on different datasets with different objectives (e.g., contrastive for CLIP, supervised for
BiT). Our work performs a substantially more controlled study on the effect of language supervi-
sion, allowing us to make more direct claims than Devillers et al. [DCB+21].
Supervision in self-supervised learning. Prior works in contrastive learning have studied how
properties of the transformation T(x) affect the transferability of learned representations. They
show that for a given image x, a good view (x+) is one that retains label information while re-
moving other nuisances [TSP+20; TWS+21; DBU+21; FDF+20; MMW+21; WZM+22]. From this
perspective, our work can be viewed as studying whether a caption provides a better view for a
given image compared to standard image augmentations.
6 Discussion
Our work takes a step towards resolving the debate as to whether multi-modality, and language
in particular, can improve visual representation learning. A comparison of CLIP with matched
image-only SimCLR models reveals that neither form of supervision (using images alone or cou-
pled with language) is strictly better than the other. Indeed, there are practical regimes where
CLIP’s performance cannot be matched using SimCLR with any amount of image data and others
where language supervision is harmful. This is a direct consequence of CLIP’s sensitivity to its
pre-training data, especially its scale and the descriptiveness and variability of its captions. Through
our analysis, we also discovered algorithmic improvements (CLIPS) and dataset modifications
(filtering and augmenting captions) to better take advantage of language supervision.
Limitations. Our exploration allows us to quantify the utility of language supervision (over us-
ing images alone) in a specific setting: transfer learning via probing on certain object recognition
tasks [KSL19]. We view expanding the scope of our analysis as a direction for future work. Fur-
ther, despite the significant steps we took to control the differences between CLIP and SimCLR,
there are still some inconsistencies that have not been accounted for (discussed in Section 2). Nev-
ertheless, the differences between our and previous results [e.g, DCB+21] suggest that we success-
fully pinned down some crucial confounders (architecture, augmentations, stochasticity, datasets,
hyperparameters). Finally, while we show that CLIP’s representations are influenced by what the
captions they are trained on describe, we sidestep whether or not this is always desirable. After all,
recent studies [BPK21] show that vision-linguistic datasets have various biases and stereotypes,
which we might not want our models to learn.
Acknowledgements
We are grateful to Niladri Chatterji, Elisa Kreiss, Nimit Sohoni and Dimitris Tsipras for helpful
discussions. SS is supported by Open Philanthropy, YD by a Knights-Hennessy Scholarship, and
RT by the NSF GRFP under Grant No. DGE 1656518. We also thank Stanford HAI for a Google
Cloud credits grant.
References
[AGM14]
P. Agrawal, R. Girshick, and J. Malik. “Analyzing the performance of multilayer
neural networks for object recognition”. In: European Conference on Computer Vision
(ECCV). 2014.
[ARS+15]
H. Azizpour, A. S. Razavian, J. Sullivan, A. Maki, and S. Carlsson. “Factors of trans-
ferability for a generic convnet representation”. In: IEEE Transactions on Pattern Anal-
ysis and Machine Intelligence (2015).
[BAM18]
T. Baltrušaitis, C. Ahuja, and L. Morency. “Multimodal machine learning: A sur-
vey and taxonomy”. In: IEEE Transactions on Pattern Analysis and Machine Intelligence
(2018).
[BMA+19]
A. Barbu, D. Mayo, J. Alverio, W. Luo, C. Wang, D. Gutfreund, J. Tenenbaum, and
B. Katz. “ObjectNet: A large-scale bias-controlled dataset for pushing the limits of
object recognition models”. In: Advances in Neural Information Processing Systems
(NeurIPS). 2019.
[BMR+20]
T. B. Brown, B. Mann, N. Ryder, M. Subbiah, J. Kaplan, P. Dhariwal, A. Neelakan-
tan, P. Shyam, G. Sastry, A. Askell, S. Agarwal, A. Herbert-Voss, G. Krueger, T.
Henighan, R. Child, A. Ramesh, D. M. Ziegler, J. Wu, C. Winter, C. Hesse, M. Chen,
E. Sigler, M. Litwin, S. Gray, B. Chess, J. Clark, C. Berner, S. McCandlish, A. Rad-
ford, I. Sutskever, and D. Amodei. “Language Models are Few-Shot Learners”. In:
arXiv preprint arXiv:2005.14165 (2020).
[BPK21]
A. Birhane, V. U. Prabhu, and E. Kahembwe. “Multimodal datasets: misogyny,
pornography, and malignant stereotypes”. In: arXiv preprint arXiv:2110.01963 (2021).
[CFL+15]
X. Chen, H. Fang, T. Lin, R. Vedantam, S. Gupta, P. Dollár, and C. L. Zitnick.
“Microsoft coco captions: Data collection and evaluation server”. In: arXiv preprint
arXiv:1504.00325 (2015).
[CH21]
X. Chen and K. He. “Exploring simple siamese representation learning”. In: Confer-
ence on Computer Vision and Pattern Recognition (CVPR). 2021.
[CKN+20]
T. Chen, S. Kornblith, M. Norouzi, and G. Hinton. “A simple framework for con-
trastive learning of visual representations”. In: International Conference on Machine
Learning (ICML). 2020.
[CKS+20]
T. Chen, S. Kornblith, K. Swersky, M. Norouzi, and G. Hinton. “Big self-supervised
models are strong semi-supervised learners”. In: Advances in Neural Information Pro-
cessing Systems (NeurIPS). 2020.
[CMB+16]
B. Chu, V. Madhavan, O. Beijbom, J. Hoffman, and T. Darrell. “Best practices for
fine-tuning visual classifiers to new domains”. In: European Conference on Computer
Vision (ECCV). 2016.
[CMM+20] M. Caron, I. Misra, J. Mairal, P. Goyal, P. Bojanowski, and A. Joulin. “Unsupervised
learning of visual features by contrasting cluster assignments”. In: Advances in Neu-
ral Information Processing Systems (NeurIPS). 2020.
[CSV+14]
K. Chatfield, K. Simonyan, A. Vedaldi, and A. Zisserman. “Return of the devil in
the details: Delving deep into convolutional nets”. In: arXiv preprint arXiv:1405.3531
(2014).
[CTM+21]
M. Caron, H. Touvron, I. Misra, H. Jégou, J. Mairal, P. Bojanowski, and A. Joulin.
“Emerging properties in self-supervised vision transformers”. In: Conference on Com-
puter Vision and Pattern Recognition (CVPR). 2021.
[CYW+22] E. Cole, X. Yang, K. Wilber, O. Mac Aodha, and S. Belongie.
“When does contrastive visual representation learning work?” In: Conference on
Computer Vision and Pattern Recognition (CVPR). 2022.
[DBU+21]
Y. Dubois, B. Bloem-Reddy, K. Ullrich, and C. J. Maddison. “Lossy compression for
lossless prediction”. In: Advances in Neural Information Processing Systems (NeurIPS)
(2021).
[DCB+21]
B. Devillers, B. Choksi, R. Bielawski, and R. VanRullen. “Does language help gener-
alization in vision models?” In: Computational Natural Language Learning. 2021.
[DDS+09]
J. Deng, W. Dong, R. Socher, L. Li, K. Li, and L. Fei-Fei. “ImageNet: A large-scale hi-
erarchical image database”. In: Conference on Computer Vision and Pattern Recognition
(CVPR). 2009.
[DJV+14]
J. Donahue, Y. Jia, O. Vinyals, J. Hoffman, N. Zhang, E. Tzeng, and T. Darrell. “De-
CAF: A deep convolutional activation feature for generic visual recognition”. In:
International Conference on Machine Learning (ICML). 2014.
[DMM+22] P. Dognin, I. Melnyk, Y. Mroueh, I. Padhi, M. Rigotti, J. Ross, Y. Schiff, R. A. Young,
and B. Belgodere. “Image Captioning as an Assistive Technology: Lessons Learned
from VizWiz 2020 Challenge”. In: Journal of Artificial Intelligence Research (2022).
[EGH21]
L. Ericsson, H. Gouk, and T. M. Hospedales. “How well do self-supervised models
transfer?” In: Conference on Computer Vision and Pattern Recognition (CVPR). 2021.
[FCS+13]
A. Frome, G. S. Corrado, J. Shlens, S. Bengio, J. Dean, M. Ranzato, and T. Mikolov.
“Devise: A deep visual-semantic embedding model”. In: Advances in Neural Informa-
tion Processing Systems (NeurIPS). 2013.
[FDF+20]
M. Federici, A. Dutta, P. Forré, N. Kushman, and Z. Akata. “Learning robust rep-
resentations via multi-view information bottleneck”. In: International Conference on
Learning Representations (ICLR) (2020).
[FIW+22]
A. Fang, G. Ilharco, M. Wortsman, Y. Wan, V. Shankar, A. Dave, and L. Schmidt.
“Data Determines Distributional Robustness in Contrastive Language Image Pre-
training (CLIP)”. In: arXiv preprint arXiv:2205.01397 (2022).
[FP19]
W. Falcon and the PyTorch Lightning team. PyTorch Lightning. 2019. URL: https://github.com/Lightning-AI/lightning.
[Gri75]
H. P. Grice. “Logic and Conversation”. In: Syntax and Semantics (1975).
[GWW19]
W. Guo, J. Wang, and S. Wang. “Deep multimodal representation learning: A sur-
vey”. In: IEEE Access (2019).
[HAE16]
M. Huh, P. Agrawal, and A. A. Efros. “What makes ImageNet good for transfer
learning?” In: arXiv preprint arXiv:1608.08614 (2016).
[HBD+20]
A. Holtzman, J. Buys, L. Du, M. Forbes, and Y. Choi. “The Curious Case of Neural
Text Degeneration”. In: arXiv preprint arXiv:1904.09751 (2020).
[HBM+21] D. Hendrycks, S. Basart, N. Mu, S. Kadavath, F. Wang, E. Dorundo, R. Desai, T.
Zhu, S. Parajuli, M. Guo, et al. “The many faces of robustness: A critical analysis
of out-of-distribution generalization”. In: Conference on Computer Vision and Pattern
Recognition (CVPR). 2021.
[HFW+20] K. He, H. Fan, Y. Wu, S. Xie, and R. Girshick. “Momentum contrast for unsuper-
vised visual representation learning”. In: Conference on Computer Vision and Pattern
Recognition (CVPR). 2020.
[HWG+21] J. Z. HaoChen, C. Wei, A. Gaidon, and T. Ma. “Provable guarantees for self-
supervised deep learning with spectral contrastive loss”. In: Advances in Neural In-
formation Processing Systems (NeurIPS) (2021).
[HYH13]
M. Hodosh, P. Young, and J. Hockenmaier. “Framing image description as a rank-
ing task: Data, models and evaluation metrics”. In: Journal of Artificial Intelligence
Research (JAIR) (2013).
[HZB+21]
D. Hendrycks, K. Zhao, S. Basart, J. Steinhardt, and D. Song. “Natural adversarial
examples”. In: Conference on Computer Vision and Pattern Recognition (CVPR). 2021.
[HZR+16]
K. He, X. Zhang, S. Ren, and J. Sun. “Deep Residual Learning for Image Recogni-
tion”. In: Computer Vision and Pattern Recognition (CVPR). 2016.
[IWW+21]
G. Ilharco, M. Wortsman, R. Wightman, C. Gordon, N. Carlini, R. Taori, A. Dave, V.
Shankar, H. Namkoong, J. Miller, H. Hajishirzi, A. Farhadi, and L. Schmidt. Open-
CLIP. 2021. URL: https://doi.org/10.5281/zenodo.5143773.
[JGB+17]
A. Joulin, E. Grave, P. Bojanowski, and T. Mikolov. “Bag of Tricks for Efficient Text
Classification”. In: European Association for Computational Linguistics (EACL). 2017.
[KBZ+20]
A. Kolesnikov, L. Beyer, X. Zhai, J. Puigcerver, J. Yung, S. Gelly, and N. Houlsby.
“Big Transfer (BiT): General Visual Representation Learning”. In: European Confer-
ence on Computer Vision (ECCV). 2020.
[KGP21]
E. Kreiss, N. D. Goodman, and C. Potts. “Concadia: Tackling Image Accessibility
with Descriptive Texts and Context”. In: arXiv preprint arXiv:2104.08376 (2021).
[KSL19]
S. Kornblith, J. Shlens, and Q. V. Le. “Do better imagenet models transfer better?”
In: Conference on Computer Vision and Pattern Recognition (CVPR). 2019.
[LBL+20]
F. Locatello, S. Bauer, M. Lucic, G. Rätsch, S. Gelly, B. Schölkopf, and O. Bachem.
“A sober look at the unsupervised learning of disentangled representations and
their evaluation”. In: Journal of Machine Learning Research (JMLR) (2020).
[LLX+22]
J. Li, D. Li, C. Xiong, and S. Hoi. “Blip: Bootstrapping language-image pre-training
for unified vision-language understanding and generation”. In: arXiv preprint
arXiv:2201.12086 (2022).
[LLZ+22]
Y. Li, F. Liang, L. Zhao, Y. Cui, W. Ouyang, J. Shao, F. Yu, and J. Yan. “Supervi-
sion exists everywhere: A data efficient contrastive language-image pre-training
paradigm”. In: International Conference on Learning Representations (ICLR). 2022.
[LMB+14]
T. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L.
Zitnick. “Microsoft coco: Common objects in context”. In: European Conference on
Computer Vision (ECCV). 2014.
[MKW+21] N. Mu, A. Kirillov, D. Wagner, and S. Xie. “SLIP: Self-supervision meets Language-
Image Pre-training”. In: arXiv preprint arXiv:2112.12750 (2021).
[MMW+21] J. Mitrovic, B. McWilliams, J. Walker, L. Buesing, and C. Blundell. “Representation
Learning via Invariant Causal Mechanisms”. In: International Conference on Learning
Representations (ICLR). 2021.
[QCD07]
A. Quattoni, M. Collins, and T. Darrell. “Learning visual representations using
images with captions”. In: Conference on Computer Vision and Pattern Recognition
(CVPR). 2007.
[RAS+14]
A. S. Razavian, H. Azizpour, J. Sullivan, and S. Carlsson. “CNN Features off-the-
shelf: an Astounding Baseline for Recognition.” In: arXiv preprint arXiv:1403.6382
(2014).
[RDM22]
Y. Ruan, Y. Dubois, and C. J. Maddison. “Optimal Representations for Covariate
Shift”. In: International Conference on Learning Representations (ICLR). 2022.
[RDS+15]
O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy,
A. Khosla, M. Bernstein, A. C. Berg, and L. Fei-Fei. “ImageNet Large Scale Visual
Recognition Challenge”. In: International Journal of Computer Vision (IJCV) (2015).
[RKH+21]
A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A.
Askell, P. Mishkin, J. Clark, G. Krueger, and I. Sutskever. “Learning transferable
visual models from natural language supervision”. In: International Conference on
Machine Learning (ICML). 2021.
[RRS+19]
B. Recht, R. Roelofs, L. Schmidt, and V. Shankar. “Do ImageNet Classifiers General-
ize to ImageNet?” In: International Conference on Machine Learning (ICML). 2019.
[SDG+18]
P. Sharma, N. Ding, S. Goodman, and R. Soricut. “Conceptual captions: A cleaned,
hypernymed, image alt-text dataset for automatic image captioning”. In: Association
for Computational Linguistics (ACL). 2018.
[SS12]
N. Srivastava and R. R. Salakhutdinov. “Multimodal learning with deep boltzmann
machines”. In: Advances in Neural Information Processing Systems (NeurIPS) (2012).
[SVB+21]
C. Schuhmann, R. Vencu, R. Beaumont, R. Kaczmarczyk, C. Mullis, A. Katta, T.
Coombes, J. Jitsev, and A. Komatsuzaki. “LAION-400M: Open dataset of clip-
filtered 400 million image-text pairs”. In: arXiv preprint arXiv:2111.02114 (2021).
[TDS+20]
R. Taori, A. Dave, V. Shankar, N. Carlini, B. Recht, and L. Schmidt. “Measuring ro-
bustness to natural distribution shifts in image classification”. In: Advances in Neural
Information Processing Systems (NeurIPS) (2020).
[THO21]
Y. Tian, O. J. Henaff, and A. van den Oord. “Divide and contrast: Self-supervised
learning from uncurated data”. In: Conference on Computer Vision and Pattern Recog-
nition (CVPR). 2021.
[TSF+16]
B. Thomee, D. A. Shamma, G. Friedland, B. Elizalde, K. Ni, D. Poland, D. Borth, and
L. Li. “YFCC100M: The new data in multimedia research”. In: Communications of the
Association for Computing Machinery (ACM). 2016.
[TSP+20]
Y. Tian, C. Sun, B. Poole, D. Krishnan, C. Schmid, and P. Isola. “What makes for good
views for contrastive learning?” In: Advances in Neural Information Processing Systems
(NeurIPS). 2020.
[TWS+21]
Y. H. Tsai, Y. Wu, R. R. Salakhutdinov, and L. Morency. “Self-supervised Learning
from a Multi-view Perspective”. In: International Conference on Learning Representa-
tions (ICLR). 2021.
[VSP+17]
A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser,
and I. Polosukhin. “Attention Is All You Need”. In: arXiv preprint arXiv:1706.03762
(2017).
[WGX+19] H. Wang, S. Ge, E. P. Xing, and Z. C. Lipton. “Learning robust global representations
by penalizing local predictive power”. In: Advances in Neural Information Processing
Systems (NeurIPS). 2019.
[WK21]
B. Wang and A. Komatsuzaki. GPT-J-6B: A 6 Billion Parameter Autoregressive Language
Model. https://github.com/kingoflolz/mesh-transformer-jax. 2021.
[WXY+18]
Z. Wu, Y. Xiong, S. X. Yu, and D. Lin. “Unsupervised feature learning via non-
parametric instance discrimination”. In: Conference on Computer Vision and Pattern
Recognition (CVPR). 2018.
[WZM+22] M. Wu, C. Zhuang, M. Mosse, D. L. K. Yamins, and N. D. Goodman. “On Mutual In-
formation in Contrastive Learning for Visual Representations”. In: AAAI Conference
on Artificial Intelligence. 2022.
[YCB+14]
J. Yosinski, J. Clune, Y. Bengio, and H. Lipson. “How transferable are features
in deep neural networks?” In: Advances in Neural Information Processing Systems
(NeurIPS). 2014.
[YHH+22]
L. Yao, R. Huang, L. Hou, G. Lu, M. Niu, H. Xu, X. Liang, Z. Li, X. Jiang, and C.
Xu. “FILIP: Fine-grained Interactive Language-Image Pre-Training”. In: International
Conference on Learning Representations (ICLR). 2022.
[ZPK+19]
X. Zhai, J. Puigcerver, A. Kolesnikov, P. Ruyssen, C. Riquelme, M. Lucic, J. Djo-
longa, A. S. Pinto, M. Neumann, A. Dosovitskiy, et al. “A large-scale study of rep-
resentation learning with the visual task adaptation benchmark”. In: arXiv preprint
arXiv:1910.04867 (2019).
[ZWM+22] X. Zhai, X. Wang, B. Mustafa, A. Steiner, D. Keysers, A. Kolesnikov, and L. Beyer.
“LiT: Zero-Shot Transfer with Locked-image Text Tuning”. In: Conference on Com-
puter Vision and Pattern Recognition (CVPR). 2022.
A Setup and Experimental details
A.1 Datasets
In Appendix Figure 6, we present (random) samples from the MS-COCO [LMB+14], Conceptual
Captions [SDG+18] and YFCC datasets [TSF+16]. We use the 2017 version of COCO, which con-
tains five human-written captions along with multi-object image labels for each image.
- “A table topped with plates and glasses with eating utensils..”
- “a fork is laying on a small white plate”
- “dirty dishes on a table, and a bottle of something.”
- “a table top with some dishes on top of it”,
- “A table full of dirty dishes is pictured in this image.”
- “An All Nippon Airways 777 sitting at a gate on the tarmac.”
- “a large air plane on a run way”
- “A jumbo jet being serviced at an airport.”
- “A large blue and white jetliner sitting on top of a tarmac.”,
- “A large airliner preparing for departure at an airport.”
- “A man jumping a horse over an obstacle.”
- “A person jumping a horse over an object.”
- “An equestrian competitor and his horse jumping over a stile”
- “A horse and jockey jump over bush hurdles .”,
- “A rider and horse jump over a wooden brush obstacle.”
- “A couple of zebra standing on top of a dirt field.”
- “Some zebras walking around in a field looking around”
- “Some very cute zebras in a big dusty field.”
- “A small zebra standing next to a bigger zebra.”,
- “The baby Zebras stripes are much closer together than an adults.”
Figure 6: Dataset examples: MS-COCO [LMB+14]
Licenses. These datasets were obtained by scraping images from online hosting services (e.g.
Flickr). Thus, the ownership of the images lies with the respective individuals that uploaded them.
Nevertheless, as per their terms of agreement, the images can be used for research purposes.
“paratroopers load onto a helicopter.”
“Close up hands of woman typing text message on smart
phone in a cafe.”
“woman in a bathrobe is smiling to camera in the forest”
“Girls in old time dresses selling flowers are pictured tak-
ing a rest of a bench.”
“A shrimp has pairs of legs.”
Figure 6: Dataset examples: Conceptual Captions [SDG+18]
“Kenneth Phan #7 A Day in the Life of DC is a photo
project meant to capture a flavor of the region through
the eyes of the participants. Participants submitted
twelve photos taken on May 30, 2009. Photos by Ken-
neth Phan”
“Pombas New York - USA 27 de Setembro 2013”
“Um, the girls that live at the house I lived in 13 years
ago are huge @foursquare fans! #amazing @ 600 euclid
4sq.commPKNZv (posted via FlickSquare)”
“squares Created for dA Users Gallery Challenge #43 –
Winter Stock 1 Model with thanks to Reine-Haru”
“Stripes and Squares Love the contrast of the light on the
keyboard, stripes and squares”
Figure 6: Dataset examples: YFCC [TSF+16]
Like most large-scale datasets, COCO, CC and YFCC have not been extensively vetted, and
may contain identifying information or offensive content. Characterizing the pervasiveness of
these issues is an important and active area of research. That being said, since we do not redistribute
the data, our work is unlikely to significantly further the risks from these datasets.
A.2 Models
We rely on existing open source implementations of CLIP [IWW+21] and SimCLR [FP19] for
all our experiments, with a ResNet-50 image encoder (feature dimension=2048), and a linear
and MLP projection head respectively. We use the transformer architecture from Radford et al.
[RKH+21] for encoding captions in CLIP. Unless otherwise specified, we use five captions per
image to train CLIPS. For downstream transfer, we train a linear probe using task data.
A.3 Hyperparameters
We ran an extensive hyperparameter grid for CLIP and SimCLR on MS-COCO and used the same
configuration in the rest of our experiments. These defaults are stated in Appendix Table 3.
Model | BatchSize | Epochs | Warmup | lr | wd
Supervised | 1024 | 200 | 10 | 10⁻³ | 10⁻⁶
SimCLR | 1024 | 200 | 10 | 10⁻² | 10⁻⁶
CLIP | 1024 | 200 | 10 | 10⁻³ | 0.1
Table 3: Default hyperparameters for model training.
We use the Adam optimizer with a cosine lr schedule for all the models. All other hyperpa-
rameters are defaults from standard implementations of SimCLR⁶ and CLIP.⁷
Exceptions. We train CLIP/SimCLR on CC/YFCC-2M for 100 epochs due to computational re-
strictions. For corpora smaller than 100K (Figure 3), we scale up the number of epochs to keep the
number of iterations roughly comparable.
Data augmentations. The PyTorch (torchvision) transforms for the default SimCLR and CLIP data
augmentations from prior work are as follows:
TSimCLR
= {RandomResizedCrop(size = 224),
RandomHorizontalFlip(p = 0.5)
RandomApply(ColorJitter(0.8, 0.8, 0.8, 0.2),p = 0.8)
RandomGrayscale(p = 0.2),
GaussianBlur(kernel size = 23,p = 0.5)}
TCLIP
= {RandomResizedCrop(size = 224,
scale = (0.9, 1.0),
interpolation = BICUBIC)}
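For reference, a runnable version of these two pipelines might look like the following sketch, which uses standard torchvision transforms and adds a tensor conversion at the end; the exact training code may differ in details.

from torchvision import transforms
from torchvision.transforms import InterpolationMode

# SimCLR-style augmentations (cf. T_SimCLR above).
simclr_transform = transforms.Compose([
    transforms.RandomResizedCrop(224),
    transforms.RandomHorizontalFlip(p=0.5),
    transforms.RandomApply([transforms.ColorJitter(0.8, 0.8, 0.8, 0.2)], p=0.8),
    transforms.RandomGrayscale(p=0.2),
    transforms.RandomApply([transforms.GaussianBlur(kernel_size=23)], p=0.5),
    transforms.ToTensor(),  # tensor conversion, not part of the set listed above
])

# CLIP-style augmentations (cf. T_CLIP above).
clip_transform = transforms.Compose([
    transforms.RandomResizedCrop(224, scale=(0.9, 1.0),
                                 interpolation=InterpolationMode.BICUBIC),
    transforms.ToTensor(),
])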
6 https://pytorch-lightning-bolts.readthedocs.io/en/latest/self_supervised_models.html
7 https://github.com/mlfoundations/open_clip
Note that for our experiments, unless otherwise specified, we use the standard SimCLR augmentation
set for the Supervised/SimCLR/CLIP models.
Linear probe. We train the probe using the cross-entropy loss on CLIP/SimCLR features of
dimensionality 2048. In cases where the downstream task data is imbalanced, we re-weight the loss to
account for it. We also evaluate class-balanced accuracy at test time. For each downstream task,
we train the probe for 250 epochs using an SGD optimizer. We use a batch size of 256, weight
decay of 10^-6, and momentum 0.9. We perform a grid search for the best learning rate (using the
validation set), considering values between 3×10^-2 and 10. We also consider 3 random seeds.
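A minimal sketch of the probe training loop is given below; the function name, the default learning rate, and the data handling are illustrative, and the class re-weighting and learning-rate grid search are simplified.

import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

def train_linear_probe(features, labels, num_classes, lr=0.1, epochs=250,
                       batch_size=256, class_weights=None):
    # Fit a linear classifier on frozen 2048-d features with SGD + momentum.
    probe = nn.Linear(features.shape[1], num_classes)
    opt = torch.optim.SGD(probe.parameters(), lr=lr, momentum=0.9, weight_decay=1e-6)
    loss_fn = nn.CrossEntropyLoss(weight=class_weights)  # re-weight if imbalanced
    loader = DataLoader(TensorDataset(features, labels), batch_size=batch_size, shuffle=True)
    for _ in range(epochs):
        for x, y in loader:
            opt.zero_grad()
            loss_fn(probe(x), y).backward()
            opt.step()
    return probe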
COCO supervised. The COCO dataset contains multi-object labels for each image. We thus train
the supervised classifier and linear probe in this setting to predict whether each of the 80 object
categories is present in an image. We then evaluate the accuracy of the model by aggregating
(in a class-balanced manner) the correctness of each of these binary predictions.
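Concretely, the multi-label objective corresponds to one binary prediction per category, as in the following sketch (the evaluation-time aggregation described above is omitted):

import torch.nn as nn

probe = nn.Linear(2048, 80)                   # one logit per COCO object category
criterion = nn.BCEWithLogitsLoss()            # targets: float tensor of shape (batch, 80)
# loss = criterion(probe(features), targets)  # sigmoid cross-entropy per category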
Confidence intervals. We report 95% confidence intervals obtained via bootstrapping over the
test set, as well as the three random seeds used for the linear probe. Due to space constraints, we
do not always report them in the main paper, but include a detailed table for all our experiments
with confidence intervals in the Appendix.
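A bootstrap interval of this form can be computed as in the following generic sketch (the function name and number of resamples are illustrative):

import numpy as np

def bootstrap_ci(correct, n_boot=1000, alpha=0.05, seed=0):
    # 95% CI for accuracy via bootstrap resampling of per-example correctness.
    rng = np.random.default_rng(seed)
    correct = np.asarray(correct, dtype=float)
    stats = [rng.choice(correct, size=len(correct), replace=True).mean()
             for _ in range(n_boot)]
    lo, hi = np.quantile(stats, [alpha / 2, 1 - alpha / 2])
    return correct.mean(), lo, hi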
A.4 Compute
We train each of our models on 4 NVIDIA A100 GPUs. Training both CLIP and SimCLR models
takes on the order of 8-10 hours for a pre-training corpus of size 100K.
A.5 BLIP recaptioning
To generate BLIP captions for images from the CC and YFCC datasets, we use the BLIP captioning
model [LLX+22]. In particular, we use the provided⁸ ViT-Base model with nucleus sampling (top-p of 0.9,
repetition penalty of 1.1, and text length range [5, 40]), varying the random seed to generate
multiple captions per image. (Random) image-BLIP caption pairs are shown in Appendix Figure 7.
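This sampling setup can be sketched as follows using the Hugging Face port of BLIP rather than the original repository; the checkpoint name and preprocessing details here are illustrative.

import torch
from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration

processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
model = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-base")

def blip_captions(image_path, num_captions=5):
    # Sample several nucleus-sampled captions for one image, varying the random seed.
    inputs = processor(images=Image.open(image_path).convert("RGB"), return_tensors="pt")
    captions = []
    for seed in range(num_captions):
        torch.manual_seed(seed)
        out = model.generate(**inputs, do_sample=True, top_p=0.9,
                             repetition_penalty=1.1, min_length=5, max_length=40)
        captions.append(processor.decode(out[0], skip_special_tokens=True))
    return captions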
Dataset: CC
- “portrait of a young boy sitting in the leaves in a park - stock image.”
- “toddler boy in a coat sitting on leaves with arms up to the air, smiling
and laughing - stock photo.”
- “An image of a little boy sitting on the leaves in a park - stock image.”
Dataset: CC
- “The men are walking on a dirt ground with equipment in the background.”
- “military soldiers in uniforms carrying weapons and soldiers on their
back in a desert”
- “Officials and soldiers stand in the desert, looking at a vehicle with
missiles.”
Dataset: YFCC
- “Signs of various silhouettes of people dancing, standing, and laying
on the street.”
- “lot of bronze colored women holding their arms up with their hands
together in front of a metal wall art”
- “Sculptures on a wall of various silhouettes and dance positions.”
Dataset: YFCC
- “I love the white swan in the foreground with the water behind him.”
- “an image group of white birds in a green area near some water”
- “the swan is standing on the green grass near the water”
Figure 7: Random images from CC and YFCC alongside BLIP captions.
8 https://github.com/salesforce/BLIP
A.6 Synthetic COCO captions
We present examples of synthetic captions for the MS-COCO dataset, created using the available
multi-object image labels, in Appendix Figure 8. A synthetic caption is complete (incomplete) if it
describes all (only a random subset) of the objects in the image. It is consistent (inconsistent) if it
describes a given object using a single fixed term throughout the dataset (a term drawn from a set of
manually curated synonyms) and uses a fixed sentence template (one of a set of templates). In every case,
we randomly order the objects that we describe.
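A simplified sketch of such a caption generator is shown below; the synonym lists, templates, and pluralization rule are illustrative placeholders rather than the exact ones used to build our captions.

import random

NUMBER_WORDS = {1: "a", 2: "two", 3: "three", 4: "four", 5: "five", 6: "six", 7: "seven"}
SYNONYMS = {"person": ["person", "human"], "cup": ["cup", "mug"]}   # illustrative synonym sets
TEMPLATES = ["A photo of {}.", "I see {}.", "There is {}."]

def synthetic_caption(object_counts, complete=True, consistent=True):
    # object_counts maps COCO labels to counts, e.g. {"person": 2, "cup": 6}.
    objects = list(object_counts.items())
    random.shuffle(objects)                          # objects are always randomly ordered
    if not complete:                                 # keep only a random subset of objects
        objects = objects[:random.randint(1, len(objects))]
    parts = []
    for name, count in objects:
        term = name if consistent else random.choice(SYNONYMS.get(name, [name]))
        term = term + "s" if count > 1 else term     # naive pluralization
        parts.append(f"{NUMBER_WORDS.get(count, str(count))} {term}")
    template = TEMPLATES[0] if consistent else random.choice(TEMPLATES)
    return template.format(", ".join(parts))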
Complete and Consistent:
- “A photo of four bowls, a oven, seven cups, a refrigerator, two
persons, a spoon, two cakes”
Incomplete and consistent:
- “A photo of a person, six cups, three bowls, two cakes, a oven.”
Incomplete and Inconsistent:
- “A photo of a kitchen, two women, two shot glasses.”
- “I see a oven, a kitchen, two mugs, a kitchen, a man.”
Complete and Consistent:
- “A photo of a person, a tennis racket, a sports ball, a car.”
Incomplete and consistent:
- “A photo of a car, a person, a tennis racket.”
Incomplete and Inconsistent:
- “a sports ball, a motorcar together.”
- “There is a woman.”
Figure 8: Random image samples from MS-COCO alongside our synthetic captions.
A.7 Filtering captions
In Section 4, we introduce a methodology to filter poor-quality captions from a given source
dataset. Using the fastText library,⁹ we train a linear classifier on bag-of-n-grams sentence embeddings
(n=2) to distinguish a subset of source captions from those in the COCO validation set. We
then use the classifier to filter the source dataset, selecting only the captions that are (mis)classified
as being COCO-like. In Appendix Figure 9, we present a (random) subset of filtered examples
from the YFCC dataset. Compared to random YFCC samples (cf. Appendix Figure 6), the ones
in Appendix Figure 9 have much shorter captions, often without attributes such as dates, URLs,
and hashtags. That being said, it is important to recognize that any simple heuristic for filtering is
ultimately limited by the captions present in the source dataset.
9 https://github.com/facebookresearch/fastText
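This filtering step can be sketched with the fastText Python bindings as follows; the training-file name, label strings, and probability threshold are illustrative.

import fasttext

# train.txt mixes lines of the form "__label__coco <caption>" (COCO validation captions)
# and "__label__source <caption>" (captions from the source dataset).
model = fasttext.train_supervised(input="train.txt", wordNgrams=2)  # linear classifier on bag-of-bigrams

def keep_caption(caption, threshold=0.5):
    # Keep a source caption only if the classifier (mis)labels it as COCO-like.
    labels, probs = model.predict(caption.replace("\n", " "))
    return labels[0] == "__label__coco" and probs[0] >= threshold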
“Orc/Troll There’s a face only a mother could love.”
“Pedal Board 9 Back Camera.”
“Kittens Morrissey and Marr relax on the bed.”
“Park in Santa Clara Complete with Che graffiti.”
Figure 9: Filtered image-caption pairs from the YFCC dataset (see Section 4).
A.8 Augmenting captions with GPT-J
We also propose a methodology to augment the captions contained in existing datasets by using a
pre-trained large language model (in our case GPT-J-6B, referred to as GPT-J) to paraphrase them
(Section 4). To this end, we rely on in-context learning, wherein we provide GPT-J with some
(four) paired caption-paraphrase examples (built from the five human-provided COCO captions) as
context. We then ask GPT-J to paraphrase a given target caption. For instance, a query to the
model might look like:
Paraphrase the sentence below
Input: A little boy standing next to a dog in a field.
Output: A dog parked filled with people and a bunch of different dogs.
Paraphrase the sentence below
Input: Some people are on the sandy beach flying kites.
Output: a sunny day at the beach with colorful kites in the sky
Paraphrase the sentence below
Input: A living room filled with furniture and a table.
Output: A living room with a nice couch and a coffee table.
Paraphrase the sentence below
Input: A couple of people on a surfboard in the ocean.
Output: A dog is lying on the surfboard as it surfs on a wave.
Paraphrase the sentence below
Input: a handful of snowmen
make this with a glove .
Output:
We use temperature sampling to generate multiple diverse captions for a given image-caption
pair from the dataset. Examples for the CC dataset are shown in Appendix Figure 10.
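This paraphrasing setup can be sketched with the transformers library as follows; the prompt-assembly helper, sampling temperature, and token budget are illustrative choices.

from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("EleutherAI/gpt-j-6B")
model = AutoModelForCausalLM.from_pretrained("EleutherAI/gpt-j-6B")

def paraphrase(target_caption, context, num_samples=3):
    # `context` holds the four caption/paraphrase pairs formatted as in the query above.
    prompt = f"{context}Paraphrase the sentence below\nInput: {target_caption}\nOutput:"
    inputs = tokenizer(prompt, return_tensors="pt")
    outputs = model.generate(**inputs, do_sample=True, temperature=0.9,
                             max_new_tokens=40, num_return_sequences=num_samples,
                             pad_token_id=tokenizer.eos_token_id)
    new_tokens = outputs[:, inputs["input_ids"].shape[1]:]
    return [tokenizer.decode(t, skip_special_tokens=True).split("\n")[0].strip()
            for t in new_tokens]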
CC caption: “young beautiful businesswoman holding the documents and crossing
the road in financial district .”
GPT-J paraphrases:
- “A women in business attire crossing the road in financial district.”
- “A young beautiful businesswoman is crossing the road.”
- “A young woman crossing the road in financial district.”
CC caption: “Island with waterfalls and rainbows floating above clouds in a
dreamy sky – stock photo #”
GPT-J paraphrases:
- “An island with waterfalls and rainbows floating above clouds in a dreamy
sky.”
- “A rainbow in the sky with a cloud and a rainbow in the water”
- “A rainbow floating in a dreamy sky and a waterfall”
CC caption: “Young friends eating pizza in the park”
GPT-J paraphrases:
- “The young friends are eating pizza in the park”
- “A group of young people eating pizza in the park”
- “Some kids are eating pizza in the park”
CC caption: “Actors attend the spring fashion show during fashion week.”
GPT-J paraphrases:
- “A fashion show that’s attended by actors.”
- “People attending a spring fashion show during fashion week.”
- “The spring fashion show was held during fashion week.”
Figure 10: Caption paraphrases generated using in-context learning with GPT-J.
B Additional experiments
In Appendix Tables 4-11, we report per-task performance for all our experiments. In Appendix
Table 4, we also illustrate the performance of SimCLR/CLIP models trained using the simpler
data augmentations typically used for CLIP training (cf. Appendix A.3). One can see that both
models perform worse with this modification—with the performance of CLIP dropping by 10%
and that of SimCLR by 50%.
For COCO, we also consider a variant of SimCLR, which we refer to as SimCLR+lab, that factors
label information into the transformation T(x). Specifically, for a given image x, x+ is a data-augmented
version of another COCO image that has at least one object in common with x. We
see that factoring in label information does improve SimCLR’s performance considerably, putting it
between vanilla CLIP and CLIPS. Note, however, that for typical pre-training datasets such as CC and
YFCC, we do not have access to such “expert” object labels. Instead, we can take advantage of
captions to improve the equivalences learned by the model.
Model       SUP         SimCLR*     SimCLR      SimCLR+lab  CLIP*       CLIP        CLIPS
COCO        90.5 ± 1.5  60.4 ± 2.4  88.9 ± 1.6  89.3 ± 1.5  84.9 ± 1.9  88.4 ± 1.7  89.8 ± 1.6
Aircraft    31.6 ± 0.9  2.3 ± 0.3   40.6 ± 1.0  47.0 ± 1.0  30.3 ± 1.0  41.4 ± 1.0  46.4 ± 1.0
Birdsnap    11.8 ± 0.4  0.7 ± 0.1   18.5 ± 0.5  20.8 ± 0.5  14.0 ± 0.4  17.6 ± 0.5  20.0 ± 0.5
Cal101      65.8 ± 0.7  3.8 ± 0.3   71.5 ± 0.7  80.4 ± 0.6  53.6 ± 0.8  73.2 ± 0.7  78.4 ± 0.6
Cal256      53.7 ± 0.5  3.1 ± 0.2   58.6 ± 0.4  65.7 ± 0.4  41.5 ± 0.5  60.4 ± 0.5  65.6 ± 0.5
Cars        21.7 ± 0.5  1.2 ± 0.1   31.4 ± 0.6  39.3 ± 0.7  23.4 ± 0.5  35.8 ± 0.6  41.5 ± 0.6
CIFAR-10    74.8 ± 0.5  23.2 ± 0.5  82.1 ± 0.4  81.5 ± 0.5  74.0 ± 0.5  83.6 ± 0.4  84.6 ± 0.4
CIFAR-100   46.7 ± 0.6  6.0 ± 0.3   57.3 ± 0.6  56.8 ± 0.6  50.4 ± 0.6  60.8 ± 0.6  62.5 ± 0.6
DTD         55.9 ± 1.4  6.2 ± 0.6   61.7 ± 1.3  60.3 ± 1.3  48.2 ± 1.4  65.7 ± 1.3  66.7 ± 1.3
Flowers     63.5 ± 0.7  4.6 ± 0.3   77.4 ± 0.6  81.4 ± 0.6  68.2 ± 0.7  80.5 ± 0.6  84.0 ± 0.6
Food        47.1 ± 0.4  4.0 ± 0.1   58.7 ± 0.3  56.4 ± 0.4  51.8 ± 0.4  60.9 ± 0.4  65.3 ± 0.4
Pets        45.9 ± 1.0  6.3 ± 0.5   57.3 ± 0.9  63.0 ± 0.9  44.6 ± 0.9  57.0 ± 0.9  61.2 ± 0.9
SUN         44.5 ± 0.4  1.3 ± 0.1   51.9 ± 0.4  52.2 ± 0.4  37.6 ± 0.4  50.8 ± 0.4  54.9 ± 0.4
µTx         47.2 ± 0.2  5.2 ± 0.1   56.0 ± 0.2  58.7 ± 0.2  44.8 ± 0.2  57.5 ± 0.1  61.3 ± 0.2

Table 4: Extended comparison of transfer performance of supervised, SimCLR, and CLIP pre-trained
models. Here, SimCLR* and CLIP* denote models trained with the default CLIP data
augmentation transforms instead of the SimCLR ones (cf. Appendix A.3). SimCLR+lab refers to
SimCLR models trained by picking x+ to be a different image with the same label as x.
Model       CLIP        CLIP        CLIPS       CLIP        CLIPS
Complete
Consistent
COCO        88.8 ± 1.7  88.4 ± 1.7  89.3 ± 1.6  88.3 ± 1.7  89.2 ± 1.5
Aircraft    46.6 ± 1.0  44.5 ± 1.0  46.6 ± 1.0  45.6 ± 1.0  45.8 ± 1.0
Birdsnap    18.9 ± 0.5  17.2 ± 0.5  18.6 ± 0.5  18.5 ± 0.5  19.1 ± 0.5
Cal101      77.3 ± 0.6  75.3 ± 0.7  76.8 ± 0.6  76.1 ± 0.7  76.0 ± 0.6
Cal256      63.3 ± 0.5  59.9 ± 0.5  63.0 ± 0.4  61.4 ± 0.5  63.6 ± 0.5
Cars        42.4 ± 0.6  41.6 ± 0.6  42.7 ± 0.7  41.2 ± 0.6  42.8 ± 0.6
CIFAR-10    83.3 ± 0.4  82.4 ± 0.4  82.9 ± 0.4  83.7 ± 0.4  83.2 ± 0.4
CIFAR-100   60.5 ± 0.6  59.0 ± 0.6  58.9 ± 0.5  59.9 ± 0.5  60.1 ± 0.6
DTD         64.3 ± 1.3  63.7 ± 1.3  66.1 ± 1.3  63.4 ± 1.2  65.2 ± 1.2
Flowers     82.1 ± 0.5  78.3 ± 0.6  79.5 ± 0.6  79.3 ± 0.6  80.6 ± 0.6
Food        61.4 ± 0.3  57.6 ± 0.4  60.9 ± 0.4  59.0 ± 0.3  61.9 ± 0.4
Pets        60.0 ± 0.9  57.1 ± 1.0  58.8 ± 0.9  59.8 ± 0.9  60.6 ± 1.0
SUN         52.1 ± 0.4  49.6 ± 0.4  53.1 ± 0.4  50.6 ± 0.4  52.7 ± 0.4
µTx         59.2 ± 0.1  56.6 ± 0.2  58.9 ± 0.2  57.7 ± 0.2  59.3 ± 0.2
Table 5: The impact of intra-dataset variations in captions on CLIP’s transfer performance. Here,
we use synthetic captions for pre-training, constructed using COCO multi-object image labels.
We vary whether these captions are consistent (i.e., do they use a single term to describe a given
object?) and complete (i.e., do they describe all image objects?). We also consider a variant of
CLIP, CLIPS, which uses multiple captions per image.
Model         SimCLR                                          CLIP
Dataset size  100K        200K        500K        2M          100K        200K        500K        2M
Aircraft      40.5 ± 1.0  40.3 ± 1.0  39.3 ± 0.9  37.9 ± 1.0  35.5 ± 1.0  39.9 ± 1.0  41.6 ± 1.0  45.1 ± 1.0
Birdsnap      20.2 ± 0.5  20.7 ± 0.5  20.6 ± 0.5  20.6 ± 0.5  15.1 ± 0.5  17.5 ± 0.5  19.8 ± 0.5  24.0 ± 0.6
Cal101        70.7 ± 0.7  70.3 ± 0.7  70.3 ± 0.7  69.0 ± 0.7  67.7 ± 0.8  73.5 ± 0.7  79.0 ± 0.6  84.8 ± 0.6
Cal256        57.7 ± 0.5  57.3 ± 0.5  57.3 ± 0.5  56.7 ± 0.5  54.4 ± 0.4  60.0 ± 0.5  65.9 ± 0.4  73.9 ± 0.4
Cars          33.3 ± 0.6  31.2 ± 0.6  29.6 ± 0.6  27.5 ± 0.6  29.8 ± 0.6  33.8 ± 0.6  37.7 ± 0.6  42.6 ± 0.7
CIFAR-10      81.0 ± 0.5  80.4 ± 0.5  79.3 ± 0.5  79.8 ± 0.5  82.5 ± 0.4  83.9 ± 0.4  85.6 ± 0.4  86.8 ± 0.4
CIFAR-100     58.1 ± 0.6  57.4 ± 0.6  56.4 ± 0.5  56.4 ± 0.6  59.7 ± 0.6  63.2 ± 0.5  64.8 ± 0.5  67.8 ± 0.6
DTD           62.8 ± 1.3  63.9 ± 1.2  64.5 ± 1.2  64.3 ± 1.3  63.7 ± 1.3  67.6 ± 1.3  70.3 ± 1.3  74.7 ± 1.2
Flowers       80.8 ± 0.6  80.3 ± 0.6  80.2 ± 0.6  79.4 ± 0.6  76.5 ± 0.6  80.8 ± 0.6  85.0 ± 0.5  88.8 ± 0.5
Food          57.6 ± 0.3  58.3 ± 0.4  57.0 ± 0.4  56.7 ± 0.4  56.6 ± 0.4  59.4 ± 0.4  62.7 ± 0.3  68.1 ± 0.3
Pets          58.2 ± 0.9  57.9 ± 1.0  56.8 ± 0.9  55.9 ± 0.9  49.7 ± 1.0  53.5 ± 0.9  60.2 ± 0.9  65.2 ± 0.9
SUN           49.4 ± 0.4  49.8 ± 0.4  49.7 ± 0.4  49.6 ± 0.4  45.9 ± 0.4  50.9 ± 0.4  55.3 ± 0.4  61.8 ± 0.4
µTx           55.9 ± 0.2  55.3 ± 0.2  55.1 ± 0.2  54.5 ± 0.2  53.1 ± 0.2  57.0 ± 0.2  60.7 ± 0.2  65.3 ± 0.2
Table 6: Transfer performance of SimCLR and CLIP models after pre-training on CC subsets.
Model         SimCLR                                          CLIP
Dataset size  100K        200K        500K        2M          100K        200K        500K        2M
Aircraft      39.5 ± 0.9  39.3 ± 1.0  38.0 ± 0.9  36.3 ± 0.9  17.0 ± 0.7  21.2 ± 0.8  41.5 ± 0.9  43.0 ± 0.9
Birdsnap      19.2 ± 0.5  18.9 ± 0.5  19.7 ± 0.5  19.0 ± 0.5  8.3 ± 0.4   10.4 ± 0.4  19.8 ± 0.5  26.2 ± 0.6
Cal101        71.0 ± 0.7  71.1 ± 0.7  70.3 ± 0.7  68.4 ± 0.7  42.7 ± 0.7  51.4 ± 0.7  75.2 ± 0.7  82.1 ± 0.6
Cal256        56.9 ± 0.5  58.5 ± 0.5  58.6 ± 0.5  57.7 ± 0.5  32.9 ± 0.4  38.2 ± 0.5  62.4 ± 0.5  70.5 ± 0.4
Cars          33.1 ± 0.6  29.8 ± 0.6  28.1 ± 0.6  26.8 ± 0.6  11.8 ± 0.4  15.5 ± 0.5  36.1 ± 0.7  37.4 ± 0.6
CIFAR-10      80.4 ± 0.5  80.6 ± 0.4  80.2 ± 0.5  79.7 ± 0.5  71.1 ± 0.5  72.9 ± 0.5  83.5 ± 0.4  86.0 ± 0.4
CIFAR-100     56.8 ± 0.5  58.2 ± 0.5  56.8 ± 0.5  57.2 ± 0.6  46.6 ± 0.6  47.7 ± 0.6  62.3 ± 0.6  66.2 ± 0.5
DTD           64.8 ± 1.2  67.0 ± 1.2  67.3 ± 1.2  67.0 ± 1.3  41.9 ± 1.3  49.9 ± 1.3  69.1 ± 1.2  74.3 ± 1.1
Flowers       80.9 ± 0.6  81.2 ± 0.6  80.5 ± 0.6  80.5 ± 0.6  47.6 ± 0.7  54.9 ± 0.8  83.4 ± 0.5  89.4 ± 0.4
Food          57.4 ± 0.4  57.9 ± 0.4  56.9 ± 0.4  57.4 ± 0.4  36.6 ± 0.4  43.0 ± 0.3  61.7 ± 0.3  67.4 ± 0.4
Pets          54.8 ± 1.0  55.4 ± 1.0  55.6 ± 0.9  55.6 ± 0.9  30.4 ± 0.9  34.0 ± 0.9  55.7 ± 1.0  61.9 ± 0.9
SUN           51.4 ± 0.4  52.9 ± 0.4  53.1 ± 0.4  53.2 ± 0.4  29.2 ± 0.4  34.7 ± 0.4  54.6 ± 0.4  62.8 ± 0.4
µTx           55.5 ± 0.2  55.9 ± 0.2  55.4 ± 0.2  54.9 ± 0.2  34.7 ± 0.2  39.5 ± 0.2  58.8 ± 0.2  63.9 ± 0.2
Table 7: Transfer performance of SimCLR and CLIP models after pre-training on YFCC subsets.
Method            Size  Epochs  Aircraft  Birdsnap  Caltech101  Cars  CIFAR-10  CIFAR-100  DTD   Flowers  Food-101  Pets  SUN397
BYOL ([THO21])    100M  1000    47.5      31.3      84.0        44.3  85.0      63.9       75.2  93.4     67.9      71.1  63.4
MoCLR ([THO21])   100M  1000    45.6      29.4      85.6        41.1  87.8      69.9       75.8  92.9     67.7      67.7  63.4
CLIP              2M    100     43.0      26.2      82.1        37.4  86.0      66.2       74.3  89.4     67.4      61.9  62.8
Table 8: Comparison of our results to [THO21].
Model         CLIP                                            CLIPS
Dataset       CC                      YFCC                    CC          YFCC
Dataset size  100K        500K        100K        500K        100K        100K
Aircraft      35.5 ± 1.0  41.6 ± 1.0  35.4 ± 0.9  42.8 ± 0.9  40.1 ± 1.0  41.3 ± 0.9
Birdsnap      15.1 ± 0.5  19.8 ± 0.5  15.9 ± 0.5  20.7 ± 0.6  17.8 ± 0.5  19.2 ± 0.5
Cal101        67.7 ± 0.8  79.1 ± 0.6  67.9 ± 0.7  79.7 ± 0.6  75.1 ± 0.6  75.8 ± 0.6
Cal256        54.4 ± 0.5  65.9 ± 0.5  55.8 ± 0.5  67.6 ± 0.4  61.8 ± 0.5  62.6 ± 0.5
Cars          29.8 ± 0.6  37.7 ± 0.6  29.6 ± 0.6  37.8 ± 0.6  37.3 ± 0.6  38.1 ± 0.6
CIFAR-10      82.5 ± 0.5  85.6 ± 0.4  82.9 ± 0.4  85.6 ± 0.4  83.6 ± 0.4  82.7 ± 0.4
CIFAR-100     59.7 ± 0.6  64.8 ± 0.5  60.9 ± 0.6  65.2 ± 0.5  62.2 ± 0.6  60.9 ± 0.6
DTD           63.7 ± 1.2  70.2 ± 1.2  64.1 ± 1.3  71.0 ± 1.3  67.7 ± 1.3  68.7 ± 1.2
Flowers       76.5 ± 0.6  85.0 ± 0.5  77.7 ± 0.6  86.2 ± 0.5  81.4 ± 0.6  83.7 ± 0.6
Food          56.6 ± 0.4  62.7 ± 0.3  57.3 ± 0.4  64.0 ± 0.3  59.6 ± 0.4  61.3 ± 0.3
Pets          49.7 ± 1.0  60.2 ± 0.9  49.6 ± 0.9  61.6 ± 0.9  54.7 ± 0.9  56.6 ± 0.9
SUN           45.9 ± 0.4  55.3 ± 0.4  47.7 ± 0.4  57.1 ± 0.4  52.5 ± 0.4  54.9 ± 0.4
µTx           53.7 ± 0.2  60.7 ± 0.2  54.8 ± 0.2  61.8 ± 0.2  57.8 ± 0.2  58.8 ± 0.2
Table 9: Effect of using BLIP captions for CC/YFCC images in CLIP training.
Dataset       CC          YFCC
Dataset size  100K        500K
Aircraft      37.0 ± 1.0  41.0 ± 1.0
Birdsnap      15.5 ± 0.5  21.1 ± 0.5
Cal101        71.1 ± 0.7  78.2 ± 0.7
Cal256        55.9 ± 0.5  64.9 ± 0.4
Cars          30.9 ± 0.6  35.2 ± 0.6
CIFAR-10      82.9 ± 0.5  85.1 ± 0.4
CIFAR-100     59.3 ± 0.6  63.4 ± 0.6
DTD           63.8 ± 1.3  71.8 ± 1.2
Flowers       76.3 ± 0.7  84.3 ± 0.5
Food          57.4 ± 0.4  64.1 ± 0.3
Pets          52.7 ± 0.9  59.3 ± 0.9
SUN           47.4 ± 0.4  56.4 ± 0.4
µTx           54.2 ± 0.2  60.4 ± 0.2
Table 10: Effect of caption filtering on CLIP’s transfer performance.
Dataset       CC          COCO
Dataset size  200K        120K
Aircraft      41.9 ± 0.9  44.7 ± 1.0
Birdsnap      18.8 ± 0.5  18.6 ± 0.5
Cal101        77.4 ± 0.7  75.9 ± 0.6
Cal256        63.5 ± 0.4  62.8 ± 0.4
Cars          38.2 ± 0.6  40.8 ± 0.6
CIFAR-10      84.0 ± 0.4  84.1 ± 0.4
CIFAR-100     62.5 ± 0.6  61.3 ± 0.6
DTD           68.1 ± 1.2  65.3 ± 1.3
Flowers       82.4 ± 0.6  81.9 ± 0.6
Food          60.4 ± 0.4  62.0 ± 0.4
Pets          56.0 ± 1.0  59.6 ± 1.0
SUN           53.1 ± 0.4  51.9 ± 0.4
µTx           58.8 ± 0.3  58.9 ± 0.3
Table 11: Training CLIPS models using additional captions generated via GPT-J paraphrasing.