Post by Gabriele Berton

Training Vision Language Models

Matryoshka Representation Learning (2022, Google Research)

A common problem with embedding models is retrieval cost. Embeddings are costly to store and search, especially for huge databases (e.g. 100B documents).

Normally you train an embedding model with a fixed output dimension (e.g. 512-D). Then you realize retrieval (kNN) is too expensive, so you train a new 128-D model and re-embed the whole database.

Unless... you trained your model with Matryoshka Representation Learning (MRL).

MRL computes the training loss on the first 128 values of the vector. And on the full vector too. And on the first 32 values too. This ensures that although the model produces a single 512-D vector, its 128-D and 32-D prefixes can be used as embeddings too.

So if you computed 512-D vectors for the whole database and want to make retrieval cheaper, all you need to do is keep the first 128 (or 32) values of each vector. No re-computation or re-training needed.

Results on the full 512-D vector degrade by very little (~0.2%).

Most embedding models nowadays use MRL, like Gemini Embeddings 2 and EmbeddingGemma.
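To make the idea concrete, here is a minimal NumPy sketch of the two pieces described above: a loss summed over nested prefixes, and slicing stored vectors at retrieval time. Everything in it is an assumption for illustration. The helper names (`truncate`, `info_nce`, `matryoshka_loss`) are hypothetical, the contrastive loss is just a stand-in for whatever loss the model is actually trained with, and re-normalizing after slicing assumes cosine-similarity retrieval. Real training would run this inside an autodiff framework; NumPy only shows the computation.

```python
import numpy as np

NESTED_DIMS = (32, 128, 512)  # prefix sizes mentioned in the post


def l2_normalize(x):
    """Scale each row to unit length so dot products are cosine similarities."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)


def truncate(embeddings, dim):
    """Keep the first `dim` values of each vector, then re-normalize."""
    return l2_normalize(embeddings[..., :dim])


def info_nce(queries, docs, temperature=0.05):
    """Stand-in contrastive loss (assumption): query i should match doc i."""
    logits = queries @ docs.T / temperature
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))


def matryoshka_loss(queries, docs, dims=NESTED_DIMS):
    """Sum the same loss over every nested prefix of the full vector."""
    return sum(info_nce(truncate(queries, d), truncate(docs, d)) for d in dims)


# Toy data: 8 documents and 8 slightly noisy queries, all 512-D.
rng = np.random.default_rng(0)
docs = rng.normal(size=(8, 512))
queries = docs + 0.1 * rng.normal(size=(8, 512))

loss = matryoshka_loss(queries, docs)

# Cheaper retrieval: slice the stored 512-D vectors down to 128-D.
db_128 = truncate(docs, 128)
q_128 = truncate(queries, 128)
nearest = np.argmax(q_128 @ db_128.T, axis=1)  # index of the top-1 document
```

In this sketch the 128-D index stores 4x fewer values per document than the 512-D one, and it is built by slicing alone, with no second forward pass through the model.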