FP16 Quantization Not Reducing Memory Usage

Versions (relevant - OpenSearch/Dashboard/Server OS/Browser): OpenSearch 2.17.0

Describe the issue:

In my application, latency is critical, so I can’t use on_disk mode. At the same time, in_memory mode doesn’t allow the compression parameter.

Since scalar quantization (SQ) is available, I wanted to try FP16 quantization. My vectors are 3072-dimensional floats. I first tried using the Reindex API to apply FP16 quantization to existing data, but it didn't seem to have any effect: the primary memory usage stayed the same.

I then attempted re-ingesting the data with the encoder set to FP16, but again, I don't see any difference in memory consumption compared to not using the encoder. The vectors also look the same (precision-wise).
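For context, the FP16 attempt used a mapping along these lines. This is a sketch assuming the faiss engine with the scalar-quantization ("sq") encoder; the field name and HNSW/space-type values are illustrative:

```python
# Index mapping sketch: knn_vector field with the faiss "sq" encoder set to fp16.
# Dimension matches the 3072-dim float vectors described above.
fp16_index = {
    "settings": {"index": {"knn": True}},
    "mappings": {
        "properties": {
            "vector_field": {
                "type": "knn_vector",
                "dimension": 3072,
                "method": {
                    "name": "hnsw",
                    "engine": "faiss",
                    "space_type": "l2",
                    "parameters": {
                        "encoder": {"name": "sq", "parameters": {"type": "fp16"}}
                    },
                },
            }
        }
    },
}
```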

Question:
Why doesn’t FP16 quantization appear to reduce memory usage in my case? Am I missing a step when reindexing or re-ingesting the data?

Relevant Logs or Screenshots:

@Lah Which memory metrics did you use for comparison?


I am looking at primary.store.size and store.size (total, including replicas) as my storage metrics.
Without quantization it was 4.7 MB, and with quantization it was almost the same?

@Lah I’ve done some testing. I had very similar results from 2.17 through 2.19.3 using the same data set as a source.
My test workflow was: create two indices with FP16 and FP32 encodings, ingest the same data set (500 documents) into both, and finally reindex the FP32 index into the FP16 index.
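The final reindex step can be sketched as a standard Reindex API call (index names match the test indices above; assumes the FP16 destination index already exists with its own mapping):

```python
import json

# Request body for POST /_reindex: copy documents from the FP32 index
# into the pre-created FP16 index, re-encoding vectors on ingest.
reindex_body = {
    "source": {"index": "test_fp32"},
    "dest": {"index": "test_reindex_fp32_fp16"},
}

print(json.dumps(reindex_body, indent=2))
```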

import numpy as np
import json

N_DOCS = 500
DIM = 3072
OUTPUT_FILE = "dataset.jsonl"

print(f"Generating {N_DOCS} docs with {DIM}-dim vectors...")

with open(OUTPUT_FILE, "w") as f:
    for i in range(N_DOCS):
        vec = np.random.rand(DIM).astype(np.float32).tolist()
        doc = {
            "document_id": f"doc_{i}",
            "text_field": f"Sample text {i}",
            "download_url": f"http://example.com/{i}",
            "vector_field": vec
        }
        f.write(json.dumps(doc) + "\n")

print(f"Dataset saved to {OUTPUT_FILE}")

The storage values were very close in versions 2.17.0 to 2.19.3:

yellow open test_fp16              Iery9KkkSMecCPao9_Pytw 1 2 500 0 61.6mb 30.8mb
yellow open test_fp32              7Oo3r6FEQwOgpVIMfUpISw 1 2 500 0 67.4mb 33.7mb
yellow open test_reindex_fp32_fp16 lBbH82-1TY6ui9PsptECPQ 1 2 500 0 61.6mb 30.8mb
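As a sanity check, the raw vector payload sizes line up with the observed ~2.9 MB primary-store gap between the FP32 and FP16 indices: only the binary copy of the vectors used by the engine halves, while the JSON copy kept in _source stays the same size. A quick back-of-the-envelope calculation:

```python
N_DOCS, DIM = 500, 3072

# Raw binary size of all vectors at each precision.
fp32_bytes = N_DOCS * DIM * 4  # 4 bytes per float32 component
fp16_bytes = N_DOCS * DIM * 2  # 2 bytes per float16 component

print(fp32_bytes / 2**20)                 # 5.859375 MiB of fp32 vectors
print((fp32_bytes - fp16_bytes) / 2**20)  # 2.9296875 MiB saved by halving them
```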

I’ve noticed a significant change from version 3.0.0 to 3.2.0:

yellow open test_fp16              Gl6UJekJSR-jHAbXm3glXw 1 2 500 0 17.8mb  8.9mb
yellow open test_fp32              jvS7c7wjTlSgtpBtV3bsTA 1 2 500 0 23.6mb 11.8mb
yellow open test_reindex_fp32_fp16 aLK3s0NXTnShfY5aWZwW6Q 1 2 500 0 16.1mb  8.9mb

Based on that, I assume there was a change that improved both FP16 and FP32 indices.

Hi @Lah and @pablo

Your observation about the index sizes for FP32 and FP16 is correct. The reason you are not seeing a reduction in disk size is that the vectors are also stored in the _source field, which is the major contributor to disk space. For actual memory usage between FP16 and FP32, I would recommend using the k-NN Stats API of the k-NN plugin (see "k-NN API" in the OpenSearch documentation). It will tell you exactly how much memory your graph is taking, and for FP16 you will see a drop.
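A minimal sketch of reading that number: the k-NN Stats API (GET /_plugins/_knn/stats) returns per-node statistics including graph_memory_usage. The helper below sums it across nodes; the response shape shown is illustrative, and to my understanding the value is reported in kilobytes:

```python
def total_graph_memory_kb(stats: dict) -> int:
    """Sum graph_memory_usage across all nodes in a k-NN stats response."""
    return sum(
        node.get("graph_memory_usage", 0)
        for node in stats.get("nodes", {}).values()
    )

# Illustrative response shape (node IDs and values are made up):
sample = {
    "nodes": {
        "node-a": {"graph_memory_usage": 6000},
        "node-b": {"graph_memory_usage": 3000},
    }
}

print(total_graph_memory_kb(sample))  # 9000
```

Comparing this total for the FP32 and FP16 indices shows the graph-memory drop that store.size hides.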

Now, for 3.x versions, we enabled a default feature called derived source, which removes the vector field from _source without impacting anything else. Here is the blog on that: https://opensearch.org/blog/do-more-with-less-save-up-to-3x-on-storage-with-derived-vector-source/

You can enable the derived source feature in 2.19 too. Please let me know if you have more questions.
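As an illustration only: to my understanding, in 2.19 the experimental feature is toggled per index via a setting along these lines. The exact setting name is an assumption here, so please verify it against the 2.19 documentation before relying on it:

```python
# ASSUMPTION: illustrative name for the experimental 2.19 derived-source index
# setting; confirm the exact flag in the OpenSearch 2.19 docs before using.
index_settings = {
    "settings": {
        "index": {
            "knn": True,
            "knn.derived_source.enabled": True,
        }
    }
}
```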


Hi @Navneet Thanks for your response.
Right now I am on version 2.17. I am prioritizing reducing shard sizes, which will enable faster search and save costs.

From a practical standpoint, would it be better to upgrade to 2.19 and enable derived source there, or go directly to 3.x where the feature is fully supported? I see that 2.19 introduces derived source in experimental mode, while in 3.x it’s enabled by default and production-ready.

In the 3.x versions, is the Reindex API fully supported when derived source is enabled? (i.e., will vectors be reconstructed transparently during reindexing?) This is an important requirement for me.

Hi @Lah

In the 3.x versions, is the Reindex API fully supported when derived source is enabled? (i.e., will vectors be reconstructed transparently during reindexing?) This is an important requirement for me.

Yes, reindex is fully supported in 3.x.

From a practical standpoint, would it be better to upgrade to 2.19 and enable derived source there, or go directly to 3.x where the feature is fully supported?

If possible, go directly to the latest 3.x version.

Hi @Navneet,

It is not currently possible to upgrade AWS OpenSearch Service domains directly to the 3.x version, as AWS OpenSearch Service does not yet support OpenSearch 3.x for managed clusters as of August 2025.

So the best I can do is upgrade to version 2.19 and enable derived source. In 2.19, derived source is an experimental feature. Will I see improvements? Will it also support the Reindex API?

Hello @Navneet

Is it possible to exclude the original FP32 vector data and replace it with FP16 during indexing? Or alternatively, is there any data_type in OpenSearch that supports indexing FP16 vectors without using the faiss encoding?