CLIP: Contrastive Language-Image Pre-training (matches images to text)
- Trained on a large number of <image, text> pairs.
- The underlying principle of CLIP is to map images and text into a shared embedding space.
- Once in this space, you can measure the similarity between an image and a text by comparing their embeddings (e.g., with cosine similarity). Closer = more related.
- Notice that even though it's trained to find image-text similarity, it can do a lot more than that!
- In practical terms, an ML Engineer could use CLIP for a variety of tasks such as image captioning, text-to-image synthesis, object detection, and more.
- e.g.: pass your image through CLIP's image encoder to generate an image embedding.
- For each potential text match, you have a precomputed text embedding.
- Notice this is a ranking task: the corresponding matching text should score higher than non-corresponding texts.
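The ranking step above can be sketched as follows. This is a minimal NumPy sketch, not real CLIP inference: the random vectors stand in for embeddings that a real pipeline would get from CLIP's image and text encoders, and the caption list is hypothetical.

```python
import numpy as np

def rank_texts(image_emb, text_embs):
    """Score precomputed text embeddings against one image embedding
    using cosine similarity; higher score = more related."""
    image_emb = image_emb / np.linalg.norm(image_emb)
    text_embs = text_embs / np.linalg.norm(text_embs, axis=1, keepdims=True)
    return text_embs @ image_emb  # one cosine similarity per text

rng = np.random.default_rng(0)
image_emb = rng.normal(size=512)        # placeholder for CLIP image encoder output
text_embs = rng.normal(size=(3, 512))   # placeholder precomputed text embeddings
captions = ["a photo of a dog", "a photo of a cat", "a diagram"]

scores = rank_texts(image_emb, text_embs)
best_caption = captions[int(np.argmax(scores))]
```

In practice the text embeddings are computed once and cached, so classifying a new image only costs one image-encoder pass plus a matrix-vector product.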
Look for Contrastive loss in related notes.
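As a quick reference alongside those notes, here is a sketch of the symmetric contrastive (InfoNCE-style) loss CLIP trains with: matching image/text pairs sit on the diagonal of a similarity matrix, and cross-entropy is applied in both directions. The temperature value is an assumption, and the NumPy implementation is illustrative, not CLIP's actual training code.

```python
import numpy as np

def clip_contrastive_loss(image_embs, text_embs, temperature=0.07):
    # Normalize both sets of embeddings to unit length.
    image_embs = image_embs / np.linalg.norm(image_embs, axis=1, keepdims=True)
    text_embs = text_embs / np.linalg.norm(text_embs, axis=1, keepdims=True)
    # Pairwise cosine-similarity logits, scaled by temperature.
    logits = image_embs @ text_embs.T / temperature
    n = logits.shape[0]

    def cross_entropy(l):
        # Softmax cross-entropy where the correct class is the diagonal entry.
        l = l - l.max(axis=1, keepdims=True)  # numerical stability
        log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -log_probs[np.arange(n), np.arange(n)].mean()

    # Image-to-text and text-to-image directions, averaged.
    return (cross_entropy(logits) + cross_entropy(logits.T)) / 2

rng = np.random.default_rng(1)
loss = clip_contrastive_loss(rng.normal(size=(8, 64)), rng.normal(size=(8, 64)))
```

Minimizing this loss pulls matching pairs together and pushes non-matching pairs apart, which is what makes the ranking trick above work at inference time.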