Reducing embedding size in ml-commons

Hi all,

I was wondering if there is an option to reduce the embedding size in ml-commons. Documents in most cases contain many sentences and pages, which leads to a huge index size if 700- or even 300-dimensional embeddings are used.

An example was shared in the community earlier showing how to deploy/serve an SBERT model with its default embedding size on OpenSearch to serve text embeddings. I was wondering if ml-commons has (or will have) options to reduce the default size using techniques such as PCA, t-SNE (t-SNE-Java), LDA, autoencoders, etc.

Hello!

The index size should not scale with the length of the documents, since only one vector is stored per document. Recall that we apply a pooling layer at the end, which results in a single vector. At fp32 precision, a 768-dimensional vector takes about 768 * 4 bytes ≈ 3 KB per document.
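
To make the arithmetic concrete, here is a minimal sketch (the sentence-transformers library and the 768-dimensional all-mpnet-base-v2 model are my own assumptions for illustration, not necessarily what ml-commons serves) showing that a document of any length is pooled into one fixed-size vector, and what that vector costs in bytes:

```python
# Minimal sketch: one pooled vector per document, regardless of document length.
# Assumes the sentence-transformers package and all-mpnet-base-v2 (768 dims);
# ml-commons may serve a different model.
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("sentence-transformers/all-mpnet-base-v2")

short_doc = "A single sentence."
long_doc = " ".join(["Another sentence in a much longer document."] * 200)

# encode() runs the transformer plus a pooling layer, so each input comes back
# as a single fixed-size vector regardless of its length (inputs beyond the
# model's max sequence length are truncated).
vectors = model.encode([short_doc, long_doc])
print(vectors.shape)          # (2, 768)

dim = vectors.shape[1]
bytes_per_doc = dim * 4       # fp32 = 4 bytes per dimension
print(f"{bytes_per_doc} bytes ~= {bytes_per_doc / 1024:.1f} KiB per document")
```

Note that truncation of very long inputs is part of why chunking long documents (discussed below) can still matter.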

Nevertheless, PCA is a great technique and we can definitely add it to our near-term goals. Thank you for the suggestion. That said, I imagine that fine-tuning a smaller model with fewer embedding dimensions might work better than using a large model followed by PCA, but I am not sure.
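
As a rough sketch of what such a PCA step could look like today, outside of ml-commons (scikit-learn is my choice for illustration, not something ml-commons ships): fit PCA offline on a sample of corpus embeddings, then apply the same fitted transform to both documents and queries.

```python
# Sketch: reduce 768-d embeddings to 128-d with PCA (scikit-learn assumed).
# The same fitted transform must be applied to documents and queries alike.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Stand-in for real SBERT vectors of a corpus sample.
corpus_embeddings = rng.standard_normal((10_000, 768)).astype("float32")

pca = PCA(n_components=128)
pca.fit(corpus_embeddings)                      # fit once, offline

reduced_docs = pca.transform(corpus_embeddings)  # index these 128-d vectors
query_embedding = rng.standard_normal((1, 768)).astype("float32")
reduced_query = pca.transform(query_embedding)   # project queries the same way

print(reduced_docs.shape, reduced_query.shape)   # (10000, 128) (1, 128)
print(f"explained variance kept: {pca.explained_variance_ratio_.sum():.2f}")
```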


One pooled vector is not ideal for a large multi-page document. Suppose you are indexing books: then you will need at least one vector per page, though I would go down to the paragraph level.
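
To make that concrete, a minimal chunking sketch (the blank-line splitting heuristic and the sentence-transformers model are my assumptions): index one vector per paragraph and keep a reference back to the source book.

```python
# Sketch: one embedding per paragraph instead of one per book.
# Splitting on blank lines is a simple heuristic; real documents may need a
# smarter chunker (pages, sections, sliding windows, ...).
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("sentence-transformers/all-mpnet-base-v2")

book_text = """First paragraph of the book...

Second paragraph of the book...

Third paragraph of the book..."""

paragraphs = [p.strip() for p in book_text.split("\n\n") if p.strip()]
vectors = model.encode(paragraphs)

# Each chunk becomes its own searchable document, linked back to the source.
docs = [
    {"book_id": "my-book", "chunk_id": i, "text": p, "embedding": v.tolist()}
    for i, (p, v) in enumerate(zip(paragraphs, vectors))
]
print(len(docs), "chunks,", len(docs[0]["embedding"]), "dimensions each")
```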

As for fine-tuning, its cost is outside the operational scope since it is done offline.

True, for very long documents one can benefit from multiple vectors.

I’d still recommend using a fine-tuned small model, or a fine-tuned linear layer (i.e., a simple encoding layer) on top of a large model that projects to a smaller dimension.
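
To illustrate what I mean by a fine-tuned linear layer, here is a PyTorch sketch under my own assumptions (the dimensions, the in-batch contrastive loss, and the random stand-in embeddings are all illustrative, not a prescribed recipe):

```python
# Sketch: a small trainable projection head on top of a frozen 768-d encoder,
# trained so that (query, passage) pairs stay close in the 128-d space.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ProjectionHead(nn.Module):
    def __init__(self, in_dim: int = 768, out_dim: int = 128):
        super().__init__()
        self.linear = nn.Linear(in_dim, out_dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Unit-length 128-d vectors, ready for cosine-similarity search.
        return F.normalize(self.linear(x), dim=-1)

head = ProjectionHead()
optimizer = torch.optim.Adam(head.parameters(), lr=1e-3)

# Stand-ins for frozen encoder outputs of 32 paired queries and passages.
query_emb = torch.randn(32, 768)
passage_emb = torch.randn(32, 768)

# In-batch contrastive loss: the i-th query should match the i-th passage.
q = head(query_emb)
p = head(passage_emb)
logits = q @ p.T / 0.05                      # temperature-scaled similarities
loss = F.cross_entropy(logits, torch.arange(32))
loss.backward()
optimizer.step()
print(loss.item())
```

In practice the large base encoder would stay frozen (or be lightly fine-tuned) and only this head would be trained on in-domain pairs.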

If we use something like PCA to reduce the dimensions, it is not clear whether relevant (query, passage) pairs will stay close together in the lower-dimensional space after projection. t-SNE is better for such cases (since it preserves local structure), but t-SNE cannot be applied to new vectors at runtime without additional effort.

I actually tried SBERT with PCA down to 128 dimensions and the results were pretty close to the 768-dimensional model. 64 wasn’t that good, though. I used GOT as the dataset.
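
For anyone who wants to run a rough comparison like this themselves, here is a sketch of how one might measure the overlap between the full-dimensional ranking and the PCA-reduced ranking (the random embeddings are stand-ins and this is not my exact setup; swap in real SBERT vectors and queries):

```python
# Sketch: compare top-k retrieval from 768-d vectors vs. 128-d PCA vectors.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import normalize

rng = np.random.default_rng(0)
doc_emb = normalize(rng.standard_normal((5_000, 768)))
query_emb = normalize(rng.standard_normal((100, 768)))

pca = PCA(n_components=128).fit(doc_emb)
doc_red = normalize(pca.transform(doc_emb))
query_red = normalize(pca.transform(query_emb))

def top_k(queries, docs, k):
    # Cosine similarity (rows are unit norm), keep the k best docs per query.
    scores = queries @ docs.T
    return np.argsort(-scores, axis=1)[:, :k]

k = 10
full = top_k(query_emb, doc_emb, k)
reduced = top_k(query_red, doc_red, k)

overlap = np.mean([len(set(a) & set(b)) / k for a, b in zip(full, reduced)])
print(f"average top-{k} overlap: {overlap:.2f}")
```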

Thanks for sharing. I have tried dimensionality reduction (since contextualized representations often reside in a narrow cone; see “How Contextual are Contextualized Word Representations? Comparing the Geometry of BERT, ELMo, and GPT-2 Embeddings” in the ACL Anthology) and found the results not so bad, but I have not tested it extensively and so can’t vouch for their quality!


Interesting work, thanks for sharing. I think that even if the results are not as good as the default model’s, it is still worth having a feature in ml-commons to serve a reduced model. It can be very useful in the books scenario, when you are searching within the book and not only against abstracts.

@asfoorial, have you tried enabling PQ on faiss indexes (see “k-NN Index” in the OpenSearch documentation)? PQ can help reduce memory requirements by trading off accuracy. We are also actively developing quantization capabilities for 8-bit vector encodings.
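
For reference, a sketch of the faiss IVF+PQ flow as I read it from the k-NN docs (the endpoint paths follow the documentation; the index names, model id, and parameter values below are illustrative, and the training index must already hold sample vectors):

```python
# Sketch of the faiss IVF+PQ setup: 1) train a model on sample vectors,
# 2) create an index whose knn_vector field references that model.
import requests

HOST = "http://localhost:9200"  # assumption: local cluster, security disabled

# 1) Train an IVF+PQ model from vectors in a small training index.
train_body = {
    "training_index": "train-vectors",      # illustrative index of sample vectors
    "training_field": "embedding",
    "dimension": 768,
    "description": "IVF + PQ with 8 sub-vectors",
    "method": {
        "name": "ivf",
        "engine": "faiss",
        "space_type": "l2",
        "parameters": {
            "nlist": 128,
            "encoder": {"name": "pq", "parameters": {"m": 8, "code_size": 8}},
        },
    },
}
print(requests.post(f"{HOST}/_plugins/_knn/models/pq-model/_train",
                    json=train_body).json())

# 2) Create the search index; the vector field just points at the trained model.
index_body = {
    "settings": {"index": {"knn": True}},
    "mappings": {
        "properties": {"embedding": {"type": "knn_vector", "model_id": "pq-model"}}
    },
}
print(requests.put(f"{HOST}/my-pq-index", json=index_body).json())
```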

@dylan thanks. PQ is great, but it only reduces memory usage, not disk usage. Looking forward to seeing and testing the future quantization developments.
