I was wondering if there is an option to reduce embedding size in ml-commons. Documents in most cases contain many sentences and pages, which leads to a huge index size if 700- or even 300-dimensional embeddings are used.
An example was shared in the community earlier showing how to deploy/serve an SBERT model with its default size on OpenSearch to serve text embeddings. I was wondering if ml-commons has (or will have) options to reduce the default size using techniques such as PCA, t-SNE (t-SNE-Java), LDA, autoencoders, etc.
The index size should not scale with the size of the documents, since only one vector is stored per document. Recall that we apply a pooling layer at the end, which results in a single vector. At fp32 precision, a 768-dimensional vector takes about (768 * 4 bytes =) 3 KB per document.
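To make the numbers concrete, here is a quick back-of-the-envelope calculation (plain Python; the corpus size is just an example):

```python
# Rough storage for one fp32 sentence embedding per document.
dim = 768              # default SBERT/BERT-base embedding dimension
bytes_per_float = 4    # fp32
bytes_per_doc = dim * bytes_per_float
print(bytes_per_doc)   # 3072 bytes, i.e. about 3 KB per document

# Scaling to a corpus: 10 million documents is roughly 30 GB of raw vectors.
num_docs = 10_000_000
print(bytes_per_doc * num_docs / 1e9, "GB")  # ~30.7 GB
```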
Nevertheless, PCA is a great technique and we can definitely add it to our near-future goals. Thank you for the suggestion. That said, I imagine that fine-tuning a smaller model with fewer embedding dimensions might work better than using a large model followed by PCA, though I am not sure.
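For anyone curious what this would look like offline today, here is a minimal sketch using scikit-learn's PCA on top of sentence-transformers. The model name, corpus, and target dimension are all illustrative choices, and this is not something ml-commons currently ships:

```python
from sklearn.decomposition import PCA
from sentence_transformers import SentenceTransformer

# Illustrative model (768-dim embeddings) and stand-in corpus.
model = SentenceTransformer("all-mpnet-base-v2")
corpus = [f"example document number {i}" for i in range(200)]

embeddings = model.encode(corpus)        # shape: (200, 768)

# Fit PCA once on a representative sample of document embeddings,
# then reuse the SAME projection for both documents and queries
# so they live in the same reduced space.
pca = PCA(n_components=128)
reduced_docs = pca.fit_transform(embeddings)          # shape: (200, 128)

query_vec = model.encode(["some query"])              # shape: (1, 768)
reduced_query = pca.transform(query_vec)              # shape: (1, 128)
```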
True, for very long documents one can benefit from storing multiple vectors per document.
I'd still recommend using a fine-tuned small model, or adding a fine-tuned linear layer (i.e., a simple encoding layer) on top of a large model that projects to smaller dimensions.
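Something along these lines (a PyTorch sketch with illustrative dimensions; the layer would still need to be fine-tuned, e.g. with a contrastive loss on (query, passage) pairs, before it is useful):

```python
import torch
import torch.nn as nn

class ProjectionHead(nn.Module):
    """A simple linear encoding layer that maps large embeddings
    (e.g. 768-dim) down to a smaller space (e.g. 128-dim)."""

    def __init__(self, in_dim: int = 768, out_dim: int = 128):
        super().__init__()
        self.proj = nn.Linear(in_dim, out_dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        z = self.proj(x)
        # L2-normalize so cosine similarity is well-behaved downstream.
        return nn.functional.normalize(z, dim=-1)

# Usage: apply the head to embeddings from the (frozen) large model.
head = ProjectionHead()
large_embeddings = torch.randn(4, 768)      # stand-in for SBERT outputs
small_embeddings = head(large_embeddings)   # shape: (4, 128)
```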
If we use something like PCA to reduce the dimensions, it is not clear whether relevant (query, passage) pairs will stay close together in the lower-dimensional space after projection. t-SNE is better at preserving that local structure, but standard t-SNE is non-parametric, so it cannot embed new points (such as incoming queries) at runtime without additional effort.
Interesting work. Thanks for sharing. I think that even if the results are not as good as the default model, serving a reduced model is still a good feature to have in ml-commons. It can be very useful in a books scenario, where you are searching within the full text of a book and not only against abstracts.
@asfoorial, have you tried enabling PQ (product quantization) on faiss indexes? See k-NN Index - OpenSearch documentation. PQ can reduce memory requirements by trading off some accuracy. We are also actively developing quantization support for 8-bit vector encodings.
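For reference, here is a rough sketch of that flow with opensearch-py, using the k-NN plugin's model Train API. The connection details, index/field/model names, and all PQ parameters (nlist, m, code_size) are illustrative and would need tuning:

```python
from opensearchpy import OpenSearch

# Connection details are illustrative.
client = OpenSearch(hosts=[{"host": "localhost", "port": 9200}])

# 1) Train a faiss IVF+PQ model from sample vectors already indexed
#    in a training index.
client.transport.perform_request(
    "POST",
    "/_plugins/_knn/models/my-pq-model/_train",
    body={
        "training_index": "train-vectors",
        "training_field": "embedding",
        "dimension": 768,
        "method": {
            "name": "ivf",
            "engine": "faiss",
            "space_type": "l2",
            "parameters": {
                "nlist": 128,  # number of IVF partitions
                "encoder": {
                    "name": "pq",
                    "parameters": {
                        "m": 8,          # 8 sub-vectors of 96 dims each (768 / 8)
                        "code_size": 8,  # 8 bits per sub-vector code
                    },
                },
            },
        },
    },
)

# 2) Once the model finishes training, reference it from the target index
#    instead of specifying dimension/method inline.
client.indices.create(
    index="documents",
    body={
        "settings": {"index": {"knn": True}},
        "mappings": {
            "properties": {
                "embedding": {"type": "knn_vector", "model_id": "my-pq-model"}
            }
        },
    },
)
```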