Increase health-check threshold

Versions (relevant - OpenSearch/Dashboard/Server OS/Browser):
OpenSearch 2.8.0, OS: RHEL8

name            version node.role
pescold01-spc   2.8.0   dr
pescold02-spc   2.8.0   dr
pescold03-spc   2.8.0   dr
peshot01-spc    2.8.0   dir
peshot02-spc    2.8.0   dir
peshot03-spc    2.8.0   dir
peshot04-spc    2.8.0   dir
peshot05-spc    2.8.0   dir
peshot06-spc    2.8.0   dir
pesmaster01-spc 2.8.0   mr
pesmaster02-spc 2.8.0   mr
pesmaster03-spc 2.8.0   mr
peswarm01-spc   2.8.0   dr
peswarm02-spc   2.8.0   dr
peswarm03-spc   2.8.0   dr

Describe the issue:
I need to fix an issue on my cold nodes. They use old HDD disks, and when a big index is deleted on these HDDs the health check fails and the cold nodes then get disconnected from the cluster.

health check of [/usr/share/opensearch/data/nodes/0] took [5202ms] which is above the warn threshold of [5s]

It happens only on the COLD nodes, because their disk utilization goes to 100% when ISM deletes a 50 GB index.

My question is: can I increase this threshold to 10 or more seconds?

Configuration:
ISM policy

{
    "policy_id": "HOT-WARM-COLD - 180d",
    "description": "BIG data - 7 day rollover, 50 GB\nHOT 1-10 day, warm 10-30day, cold 30-180 day",
    "last_updated_time": 1692691334023,
    "schema_version": 13,
    "error_notification": null,
    "default_state": "hot",
    "states": [
        {
            "name": "hot",
            "actions": [
                {
                    "retry": {
                        "count": 3,
                        "backoff": "exponential",
                        "delay": "1m"
                    },
                    "index_priority": {
                        "priority": 50
                    }
                },
                {
                    "retry": {
                        "count": 3,
                        "backoff": "exponential",
                        "delay": "1m"
                    },
                    "rollover": {
                        "min_index_age": "7d",
                        "min_primary_shard_size": "50gb"
                    }
                }
            ],
            "transitions": [
                {
                    "state_name": "warm",
                    "conditions": {
                        "min_index_age": "10d"
                    }
                }
            ]
        },
        {
            "name": "warm",
            "actions": [
                {
                    "retry": {
                        "count": 3,
                        "backoff": "exponential",
                        "delay": "1m"
                    },
                    "index_priority": {
                        "priority": 25
                    }
                },
                {
                    "retry": {
                        "count": 3,
                        "backoff": "exponential",
                        "delay": "1m"
                    },
                    "allocation": {
                        "require": {
                            "temp": "warm"
                        },
                        "include": {},
                        "exclude": {},
                        "wait_for": false
                    }
                }
            ],
            "transitions": [
                {
                    "state_name": "cold",
                    "conditions": {
                        "min_index_age": "30d"
                    }
                }
            ]
        },
        {
            "name": "cold",
            "actions": [
                {
                    "retry": {
                        "count": 3,
                        "backoff": "exponential",
                        "delay": "1m"
                    },
                    "index_priority": {
                        "priority": 10
                    }
                },
                {
                    "retry": {
                        "count": 3,
                        "backoff": "exponential",
                        "delay": "1m"
                    },
                    "allocation": {
                        "require": {
                            "temp": "cold"
                        },
                        "include": {},
                        "exclude": {},
                        "wait_for": false
                    }
                }
            ],
            "transitions": [
                {
                    "state_name": "delete",
                    "conditions": {
                        "min_index_age": "180d"
                    }
                }
            ]
        },
        {
            "name": "delete",
            "actions": [
                {
                    "retry": {
                        "count": 3,
                        "backoff": "exponential",
                        "delay": "1m"
                    },
                    "delete": {}
                }
            ],
            "transitions": []
        }
    ],
    "ism_template": [
        {
            "index_patterns": [
               SECRET
            ],
            "priority": 10,
            "last_updated_time": 1689940479010
        }
    ]
}

Relevant Logs or Screenshots:

pescold02-elastic[7214]: [2023-12-30T16:10:02,030][WARN ][o.o.m.f.FsHealthService  ] [pescold02-spc] health check of [/usr/share/opensearch/data/nodes/0] took [5202ms] which is above the warn threshold of [5s]
pescold02-elastic[7214]: [2023-12-30T16:10:12,959][INFO ][o.o.c.c.Coordinator      ] [pescold02-spc] cluster-manager node [{pesmaster03-spc}{9jX0k5J6Q5muN5DPn4vu1Q}{LzacVARJRgWVhs0mytyRtA}{10.xx.xx.xx}{10.xx.xx.xx:9300}{mr}{shard_indexing_pressure_enabled=true}] failed, restarting discovery
pescold02-elastic[7214]: org.opensearch.OpenSearchException: node [{pesmaster03-spc}{9jX0k5J6Q5muN5DPn4vu1Q}{LzacVARJRgWVhs0mytyRtA}{10.xx.xx.xx}{10.xx.xx.xx:9300}{mr}{shard_indexing_pressure_enabled=true}] failed [3] consecutive checks
pescold01-elastic[6944]: [2023-12-30T16:08:47,486][WARN ][o.o.m.f.FsHealthService  ] [pescold01-spc] health check of [/usr/share/opensearch/data/nodes/0] took [11405ms] which is above the warn threshold of [5s]
pescold01-elastic[6944]: [2023-12-30T16:09:53,189][WARN ][o.o.m.f.FsHealthService  ] [pescold01-spc] health check of [/usr/share/opensearch/data/nodes/0] took [5803ms] which is above the warn threshold of [5s]
Dec 30 17:10:12 pescold01-spc pescold01-elastic[6944]: [2023-12-30T16:10:12,859][INFO ][o.o.c.c.Coordinator      ] [pescold01-spc] cluster-manager node [{pesmaster03-spc}{9jX0k5J6Q5muN5DPn4vu1Q}{LzacVARJRgWVhs0mytyRtA}{10.xx.xx.xx}{10.xx.xx.xx:9300}{mr}{shard_indexing_pressure_enabled=true}] failed, restarting discovery
pescold01-elastic[6944]: org.opensearch.OpenSearchException: node [{pesmaster03-spc}{9jX0k5J6Q5muN5DPn4vu1Q}{LzacVARJRgWVhs0mytyRtA}{10.xx.xx.xx}{10.xx.xx.xx:9300}{mr}{shard_indexing_pressure_enabled=true}] failed [3] consecutive checks

Any suggestions?

It’s a warning that the path is slow; by itself it shouldn’t prevent anything from happening, so I think the node disconnects for other reasons. My biggest suspect is that the management thread pool is exhausted. Monitoring OpenSearch should shed some light on this.
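To answer the direct question: the warn threshold itself is configurable. It comes from the node setting monitor.fs.health.slow_path_logging_threshold (5s by default), and if memory serves there is a separate monitor.fs.health.healthy_timeout_threshold that decides when the check actually counts as failed rather than just slow, which is why the warning alone shouldn’t kick a node out. Assuming the logging threshold is still dynamic in 2.8.0 (worth double-checking against the docs for your version), something like this should raise it to 10s:

# raise the FS health warn threshold from the default 5s
PUT _cluster/settings
{
  "persistent": {
    "monitor.fs.health.slow_path_logging_threshold": "10s"
  }
}

If the cluster rejects it as non-dynamic, set the same value in opensearch.yml on the cold nodes and restart them. Either way, keep in mind this only quiets the log line; it doesn’t make the disks any faster.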

I suspect the management thread pool because I’ve seen this issue a number of times with HDD-backed nodes. The problem is that OpenSearch needs to check its own health, which means each node accesses its local shards every once in a while. If a node holds a lot of shards, this can become an IO bottleneck.
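A couple of _cat calls should show whether that is what’s happening here, i.e. whether the management pool on the cold nodes is queueing or rejecting work during the delete, and how many shards and how much disk each node carries (the column selection below is just one reasonable choice):

# management thread pool activity per node
GET _cat/thread_pool/management?v&h=node_name,active,queue,rejected,completed

# shard count and disk usage per node, fullest disks first
GET _cat/allocation?v&s=disk.percent:desc

If rejections climb on the pescold nodes while the delete runs, that’s a strong hint.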

My suggestion is to resolve the immediate problem (e.g. remove some old data manually) and then migrate the cold nodes to SSD-backed storage. I would make sure the SSDs are locally attached, not network-attached (e.g. EBS), because network latency tends to become the bottleneck. It might sound like I’m telling you to throw money at the problem, but local SSD storage is usually cheaper overall: even if the nodes themselves are more expensive, you can put more data on each of them. Of course it depends on your hardware/instance options, but that’s usually the case.
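For the immediate cleanup, something along these lines helps pick candidates: list indices oldest-first with their primary store size, then delete the ones you can afford to lose (the index name in the DELETE is only a placeholder; match it to your own pattern):

# oldest indices first, with primary store size
GET _cat/indices?v&h=index,creation.date,creation.date.string,pri.store.size&s=creation.date

# placeholder index name - replace with a real one
DELETE my-old-index-2023.06.01

Deleting several smaller old indices one at a time, rather than one huge 50 GB index in a single shot, may also spread the IO hit on the HDDs.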