Shard allocation failure due to negative free space

Versions (relevant - OpenSearch/Dashboard/Server OS/Browser):
OpenSearch 2.15.0

Describe the issue:
We have an OpenSearch cluster deployed on an AWS Kubernetes cluster, with the OpenSearch data stored on Persistent Volumes backed by AWS Elastic File System (EFS).
I noticed the following output from a _cluster/allocation/explain API call.
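For reference, the output below came from a call along these lines (the host, credentials, and the explicit index/shard/primary body are just an illustration; calling the API with no body explains the first unassigned shard it finds):

curl -s -k -u admin:<password> -H 'Content-Type: application/json' \
  -X GET "https://localhost:9200/_cluster/allocation/explain?pretty" \
  -d '{"index": "security-auditlog-2024.10.01", "shard": 0, "primary": false}'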

{
  "index": "security-auditlog-2024.10.01",
  "shard": 0,
  "primary": false,
  "current_state": "unassigned",
  "unassigned_info": {
    "reason": "NODE_LEFT",
    "at": "2024-10-04T04:03:10.177Z",
    "details": "node_left [Ahacm3NMQui2-2bKSrv6nw]",
    "last_allocation_status": "no_attempt"
  },
  "can_allocate": "throttled",
  "allocate_explanation": "allocation temporarily throttled",
  "node_allocation_decisions": [
    {
      "node_id": "Ahacm3NMQui2-2bKSrv6nw",
      "node_name": "opensearch-data-0",
      "transport_address": "172.17.21.169:9300",
      "node_attributes": {
        "shard_indexing_pressure_enabled": "true"
      },
      "node_decision": "throttled",
      "store": {
        "matching_size_in_bytes": 1425343
      },
      "deciders": [
        {
          "decider": "throttling",
          "decision": "THROTTLE",
          "explanation": "reached the limit of incoming shard recoveries [2], cluster setting [cluster.routing.allocation.node_concurrent_incoming_recoveries=2] (can also be set via [cluster.routing.allocation.node_concurrent_recoveries])"
        }
      ]
    },
    {
      "node_id": "WS2pQ1DoT-eNVFwelyWn3g",
      "node_name": "opensearch-data-2",
      "transport_address": "172.17.21.43:9300",
      "node_attributes": {
        "shard_indexing_pressure_enabled": "true"
      },
      "node_decision": "no",
      "deciders": [
        {
          "decider": "disk_threshold",
          "decision": "NO",
          "explanation": "the node has fewer free bytes remaining than the total size of all incoming shards: free space [-11136401408B], relocating shards [0B]"
        }
      ]
    },
    {
      "node_id": "ugn6My5SST6Bf6vOJfAvqQ",
      "node_name": "opensearch-data-1",
      "transport_address": "172.17.22.108:9300",
      "node_attributes": {
        "shard_indexing_pressure_enabled": "true"
      },
      "node_decision": "no",
      "store": {
        "matching_size_in_bytes": 1425848
      },
      "deciders": [
        {
          "decider": "same_shard",
          "decision": "NO",
          "explanation": "a copy of this shard is already allocated to this node [[security-auditlog-2024.10.01][0], node[ugn6My5SST6Bf6vOJfAvqQ], [P], s[STARTED], a[id=4SLCrYkZRS-5sLmKEscL1A]]"
        },
        {
          "decider": "throttling",
          "decision": "THROTTLE",
          "explanation": "reached the limit of incoming shard recoveries [2], cluster setting [cluster.routing.allocation.node_concurrent_incoming_recoveries=2] (can also be set via [cluster.routing.allocation.node_concurrent_recoveries])"
        }
      ]
    }
  ]
}

The strange part of this output is the disk_threshold explanation: "the node has fewer free bytes remaining than the total size of all incoming shards: free space [-11136401408B], relocating shards [0B]".

What could cause the free space to be detected as negative?

Configuration:
3 master nodes
3 data nodes

Relevant Logs or Screenshots:

Hi @Nilushan,

The free space [-11136401408B] means that your node is out of disk space by around 11 GB. The negative value indicates the system is over-committed and cannot store any more data. Relocating shards [0B] means there are no shards currently being relocated to this node (because of the same space issue). You'll need more storage.
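To double-check what the cluster actually sees on each data node, you can compare with the allocation cat API, something like this (host and credentials are placeholders):

curl -s -k -u admin:<password> "https://localhost:9200/_cat/allocation?v&h=node,shards,disk.indices,disk.used,disk.avail,disk.total,disk.percent"

The disk.avail column should line up with the free-space figure the disk_threshold decider is working from.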

You have a few options here:

  • Increase Disk Space
  • Free Up Disk Space on the Node (delete some indices, if that's an option; see the example after this list)
  • Add More Nodes to the Cluster
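For example, if older daily audit-log indices are safe to drop, deleting one frees space immediately (the index name below is only an illustration; you can list candidates first with GET _cat/indices/security-auditlog-*?v&s=store.size:desc):

curl -s -k -u admin:<password> -X DELETE "https://localhost:9200/security-auditlog-2024.09.30"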

You can also try to optimise your indices' compression; see here: Index codecs - OpenSearch Documentation
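index.codec is a static setting, so it has to be set at index creation time (or on a closed index); a minimal sketch for a new index, assuming the same placeholder host and credentials as above:

curl -s -k -u admin:<password> -X PUT "https://localhost:9200/my-new-index" \
  -H 'Content-Type: application/json' \
  -d '{"settings": {"index": {"codec": "best_compression"}}}'

best_compression trades a bit of indexing/search speed for a smaller on-disk footprint.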

Best,
mj

Thanks for the reply, @Mantas.

This was a fresh OpenSearch cluster, and there were only a few megabytes of logs in it when this issue occurred. AFAIR, the disk wasn't full at that point.

That being said, we noticed an issue where the EFS we were using had a throughput bottleneck. We had been using EFS bursting mode and the throughput was not enough; switching to provisioned throughput fixed some of the issues we faced. I'm wondering if the two are related.

It could be some kind of reporting issue.
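In case it helps, the raw filesystem numbers a node reports (which is what the disk threshold decider ultimately relies on) can be checked with something like this, host and credentials again being placeholders:

curl -s -k -u admin:<password> "https://localhost:9200/_nodes/stats/fs?filter_path=nodes.*.name,nodes.*.fs.total&pretty"

If free_in_bytes or available_in_bytes looks wrong there, the problem would be in what the EFS mount reports to the node rather than in OpenSearch's allocation logic.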