CLIP: Contrastive Language-Image Pre-training (matches images to text)
- Trained on a large number of <image, text> pairs.
- The underlying principle of CLIP is to map images and text into a shared embedding space.
- Once in this space, you can measure the similarity between an image and a text by comparing their embeddings (e.g., with cosine similarity). Closer = more related.
- Notice that even though it's trained to find image-text similarity, it can do a lot more than that!
- In practical terms, an ML Engineer could use CLIP for a variety of tasks such as image captioning, text-to-image synthesis, object detection, and more.
- e.g.: pass your image through CLIP's image encoder to generate an image embedding.
- For each potential text match, you have a precomputed text embedding.
- Notice this is a ranking task: the corresponding matching text should score higher than non-corresponding texts.
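The ranking step above can be sketched as follows. This is a minimal NumPy sketch, not real CLIP inference: the random vectors stand in for embeddings that a real pipeline would get from CLIP's image and text encoders, and the caption list is hypothetical.

```python
import numpy as np

def rank_texts(image_emb, text_embs):
    """Score precomputed text embeddings against one image embedding
    using cosine similarity; higher score = more related."""
    image_emb = image_emb / np.linalg.norm(image_emb)
    text_embs = text_embs / np.linalg.norm(text_embs, axis=1, keepdims=True)
    return text_embs @ image_emb  # one cosine similarity per text

rng = np.random.default_rng(0)
image_emb = rng.normal(size=512)        # placeholder for CLIP image encoder output
text_embs = rng.normal(size=(3, 512))   # placeholder precomputed text embeddings
captions = ["a photo of a dog", "a photo of a cat", "a diagram"]

scores = rank_texts(image_emb, text_embs)
best_caption = captions[int(np.argmax(scores))]
```

In practice the text embeddings are computed once and cached, so classifying a new image only costs one image-encoder pass plus a matrix-vector product.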
Look for Contrastive loss in related notes.
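As a quick reference alongside those notes, here is a sketch of the symmetric contrastive (InfoNCE-style) loss CLIP trains with: matching image/text pairs sit on the diagonal of a similarity matrix, and cross-entropy is applied in both directions. The temperature value is an assumption, and the NumPy implementation is illustrative, not CLIP's actual training code.

```python
import numpy as np

def clip_contrastive_loss(image_embs, text_embs, temperature=0.07):
    # Normalize both sets of embeddings to unit length.
    image_embs = image_embs / np.linalg.norm(image_embs, axis=1, keepdims=True)
    text_embs = text_embs / np.linalg.norm(text_embs, axis=1, keepdims=True)
    # Pairwise cosine-similarity logits, scaled by temperature.
    logits = image_embs @ text_embs.T / temperature
    n = logits.shape[0]

    def cross_entropy(l):
        # Softmax cross-entropy where the correct class is the diagonal entry.
        l = l - l.max(axis=1, keepdims=True)  # numerical stability
        log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -log_probs[np.arange(n), np.arange(n)].mean()

    # Image-to-text and text-to-image directions, averaged.
    return (cross_entropy(logits) + cross_entropy(logits.T)) / 2

rng = np.random.default_rng(1)
loss = clip_contrastive_loss(rng.normal(size=(8, 64)), rng.normal(size=(8, 64)))
```

Minimizing this loss pulls matching pairs together and pushes non-matching pairs apart, which is what makes the ranking trick above work at inference time.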