Increase health-check threshold

Versions (relevant - OpenSearch/Dashboard/Server OS/Browser):
OpenSearch 2.8.0, OS: RHEL8

name            version node.role
pescold01-spc   2.8.0   dr
pescold02-spc   2.8.0   dr
pescold03-spc   2.8.0   dr
peshot01-spc    2.8.0   dir
peshot02-spc    2.8.0   dir
peshot03-spc    2.8.0   dir
peshot04-spc    2.8.0   dir
peshot05-spc    2.8.0   dir
peshot06-spc    2.8.0   dir
pesmaster01-spc 2.8.0   mr
pesmaster02-spc 2.8.0   mr
pesmaster03-spc 2.8.0   mr
peswarm01-spc   2.8.0   dr
peswarm02-spc   2.8.0   dr
peswarm03-spc   2.8.0   dr

Describe the issue:
I need to fix an issue on my cold nodes. They use old HDD disks, and when a big index is deleted on these HDDs the health check fails and the cold nodes then get disconnected from the cluster.

health check of [/usr/share/opensearch/data/nodes/0] took [5202ms] which is above the warn threshold of [5s]

It happens only on the COLD nodes, because their disk utilization goes to 100% when ISM deletes a 50 GB index.

My question is: can I increase this threshold to 10 or more seconds?

Configuration:
ISM policy

{
    "policy_id": "HOT-WARM-COLD - 180d",
    "description": "BIG data - 7 day rollover, 50 GB\nHOT 1-10 day, warm 10-30day, cold 30-180 day",
    "last_updated_time": 1692691334023,
    "schema_version": 13,
    "error_notification": null,
    "default_state": "hot",
    "states": [
        {
            "name": "hot",
            "actions": [
                {
                    "retry": {
                        "count": 3,
                        "backoff": "exponential",
                        "delay": "1m"
                    },
                    "index_priority": {
                        "priority": 50
                    }
                },
                {
                    "retry": {
                        "count": 3,
                        "backoff": "exponential",
                        "delay": "1m"
                    },
                    "rollover": {
                        "min_index_age": "7d",
                        "min_primary_shard_size": "50gb"
                    }
                }
            ],
            "transitions": [
                {
                    "state_name": "warm",
                    "conditions": {
                        "min_index_age": "10d"
                    }
                }
            ]
        },
        {
            "name": "warm",
            "actions": [
                {
                    "retry": {
                        "count": 3,
                        "backoff": "exponential",
                        "delay": "1m"
                    },
                    "index_priority": {
                        "priority": 25
                    }
                },
                {
                    "retry": {
                        "count": 3,
                        "backoff": "exponential",
                        "delay": "1m"
                    },
                    "allocation": {
                        "require": {
                            "temp": "warm"
                        },
                        "include": {},
                        "exclude": {},
                        "wait_for": false
                    }
                }
            ],
            "transitions": [
                {
                    "state_name": "cold",
                    "conditions": {
                        "min_index_age": "30d"
                    }
                }
            ]
        },
        {
            "name": "cold",
            "actions": [
                {
                    "retry": {
                        "count": 3,
                        "backoff": "exponential",
                        "delay": "1m"
                    },
                    "index_priority": {
                        "priority": 10
                    }
                },
                {
                    "retry": {
                        "count": 3,
                        "backoff": "exponential",
                        "delay": "1m"
                    },
                    "allocation": {
                        "require": {
                            "temp": "cold"
                        },
                        "include": {},
                        "exclude": {},
                        "wait_for": false
                    }
                }
            ],
            "transitions": [
                {
                    "state_name": "delete",
                    "conditions": {
                        "min_index_age": "180d"
                    }
                }
            ]
        },
        {
            "name": "delete",
            "actions": [
                {
                    "retry": {
                        "count": 3,
                        "backoff": "exponential",
                        "delay": "1m"
                    },
                    "delete": {}
                }
            ],
            "transitions": []
        }
    ],
    "ism_template": [
        {
            "index_patterns": [
               SECRET
            ],
            "priority": 10,
            "last_updated_time": 1689940479010
        }
    ]
}

Relevant Logs or Screenshots:

pescold02-elastic[7214]: [2023-12-30T16:10:02,030][WARN ][o.o.m.f.FsHealthService  ] [pescold02-spc] health check of [/usr/share/opensearch/data/nodes/0] took [5202ms] which is above the warn threshold of [5s]
pescold02-elastic[7214]: [2023-12-30T16:10:12,959][INFO ][o.o.c.c.Coordinator      ] [pescold02-spc] cluster-manager node [{pesmaster03-spc}{9jX0k5J6Q5muN5DPn4vu1Q}{LzacVARJRgWVhs0mytyRtA}{10.xx.xx.xx}{10.xx.xx.xx:9300}{mr}{shard_indexing_pressure_enabled=true}] failed, restarting discovery
pescold02-elastic[7214]: org.opensearch.OpenSearchException: node [{pesmaster03-spc}{9jX0k5J6Q5muN5DPn4vu1Q}{LzacVARJRgWVhs0mytyRtA}{10.xx.xx.xx}{10.xx.xx.xx:9300}{mr}{shard_indexing_pressure_enabled=true}] failed [3] consecutive checks
pescold01-elastic[6944]: [2023-12-30T16:08:47,486][WARN ][o.o.m.f.FsHealthService  ] [pescold01-spc] health check of [/usr/share/opensearch/data/nodes/0] took [11405ms] which is above the warn threshold of [5s]
pescold01-elastic[6944]: [2023-12-30T16:09:53,189][WARN ][o.o.m.f.FsHealthService  ] [pescold01-spc] health check of [/usr/share/opensearch/data/nodes/0] took [5803ms] which is above the warn threshold of [5s]
Dec 30 17:10:12 pescold01-spc pescold01-elastic[6944]: [2023-12-30T16:10:12,859][INFO ][o.o.c.c.Coordinator      ] [pescold01-spc] cluster-manager node [{pesmaster03-spc}{9jX0k5J6Q5muN5DPn4vu1Q}{LzacVARJRgWVhs0mytyRtA}{10.xx.xx.xx}{10.xx.xx.xx:9300}{mr}{shard_indexing_pressure_enabled=true}] failed, restarting discovery
pescold01-elastic[6944]: org.opensearch.OpenSearchException: node [{pesmaster03-spc}{9jX0k5J6Q5muN5DPn4vu1Q}{LzacVARJRgWVhs0mytyRtA}{10.xx.xx.xx}{10.xx.xx.xx:9300}{mr}{shard_indexing_pressure_enabled=true}] failed [3] consecutive checks

Any suggestions?

It’s a warning that the path is slow; by itself it shouldn’t prevent anything from happening, so I think the node disconnects for other reasons. My biggest suspect is that the management thread pool is exhausted. Monitoring OpenSearch should shed some light on this.
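To answer the direct question: the warn threshold itself is configurable. It comes from the node setting monitor.fs.health.slow_path_logging_threshold (5s by default), and if memory serves there is a separate monitor.fs.health.healthy_timeout_threshold that decides when the check actually counts as failed rather than just slow, which is why the warning alone shouldn’t kick a node out. Assuming the logging threshold is still dynamic in 2.8.0 (worth double-checking against the docs for your version), something like this should raise it to 10s:

# raise the FS health warn threshold from the default 5s
PUT _cluster/settings
{
  "persistent": {
    "monitor.fs.health.slow_path_logging_threshold": "10s"
  }
}

If the cluster rejects it as non-dynamic, set the same value in opensearch.yml on the cold nodes and restart them. Either way, keep in mind this only quiets the log line; it doesn’t make the disks any faster.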

I suspect the management thread pool because I’ve seen this issue a number of times with HDD-backed nodes. The problem is that OpenSearch needs to check its own health, which means each node accesses its local shards every once in a while. If a node holds a lot of shards, this can become an IO bottleneck.
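A couple of _cat calls should show whether that is what’s happening here, i.e. whether the management pool on the cold nodes is queueing or rejecting work during the delete, and how many shards and how much disk each node carries (the column selection below is just one reasonable choice):

# management thread pool activity per node
GET _cat/thread_pool/management?v&h=node_name,active,queue,rejected,completed

# shard count and disk usage per node, fullest disks first
GET _cat/allocation?v&s=disk.percent:desc

If rejections climb on the pescold nodes while the delete runs, that’s a strong hint.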

My suggestion is to resolve the immediate problem (e.g. remove some old data manually) and then migrate the cold nodes to SSD-backed storage. I would make sure the SSDs are locally attached, not network-attached (e.g. EBS), because network latency tends to become the bottleneck. It might sound like I’m telling you to throw money at the problem, but local SSD storage is usually cheaper overall: even if the nodes themselves are more expensive, you can put more data on each of them. Of course it depends on your hardware/instance options, but that’s usually the case.
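For the immediate cleanup, something along these lines helps pick candidates: list indices oldest-first with their primary store size, then delete the ones you can afford to lose (the index name in the DELETE is only a placeholder; match it to your own pattern):

# oldest indices first, with primary store size
GET _cat/indices?v&h=index,creation.date,creation.date.string,pri.store.size&s=creation.date

# placeholder index name - replace with a real one
DELETE my-old-index-2023.06.01

Deleting several smaller old indices one at a time, rather than one huge 50 GB index in a single shot, may also spread the IO hit on the HDDs.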