I encountered a problem bringing an ES 7.7.0 cluster back from a full cluster restart. The cluster has 3 master, 3 coordinating, 2 hot, 2 warm, and 2 cold nodes. The Open Distro security plugin reports version 1.8.0.0.
All nodes appear to have rejoined the cluster successfully:
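The table below is the default output of "_cat/nodes?v"; the host, flags, and credentials in the curl call are just illustrative for how I query our cluster:

# host and auth are placeholders for our setup
curl -sk -u admin 'https://localhost:9200/_cat/nodes?v'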
ip heap.percent ram.percent cpu load_1m load_5m load_15m node.role master name
10.2.3.197 8 99 1 0.12 0.22 0.22 dilt - eshost1551.example.com
10.2.3.202 67 95 27 1.24 0.95 0.98 ilm * eshost1570.example.com
10.2.3.200 3 84 1 0.09 0.08 0.12 dilt - eshost1550.example.com
10.2.3.201 13 97 2 0.18 0.15 0.14 ilm - eshost1569.example.com
10.2.3.205 9 97 2 0.27 0.15 0.12 l - eshost1573.example.com
10.2.3.196 1 63 1 0.13 0.15 0.27 dilt - eshost1554.example.com
10.2.3.206 8 97 1 0.04 0.04 0.06 l - eshost1574.example.com
10.2.3.199 3 96 2 0.20 0.14 0.14 dilt - eshost1549.example.com
10.2.3.204 8 75 2 0.02 0.08 0.12 l - eshost1572.example.com
10.2.3.203 9 94 2 0.11 0.11 0.13 ilm - eshost1571.example.com
10.2.3.195 63 94 3 0.20 0.18 0.24 dilt - eshost1553.example.com
10.2.3.198 2 95 2 0.20 0.23 0.22 dilt - eshost1552.example.com
But ~2000 of the ~2400 shards are unassigned; only one hot node (eshost1553) has shards allocated to it. Using the "/_cluster/reroute?retry_failed=true" endpoint forces another allocation attempt, but every shard just returns to UNASSIGNED. The "/_cluster/allocation/explain" endpoint says the node failed to create the index with a NullPointerException. For reference, these are roughly the calls I am making (hosts, flags, and credentials are illustrative):
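# retry allocation of shards whose previous allocation attempts failed
curl -sk -u admin -XPOST 'https://localhost:9200/_cluster/reroute?retry_failed=true'

# explain allocation for one of the failing primaries
curl -sk -u admin 'https://localhost:9200/_cluster/allocation/explain?pretty' \
  -H 'Content-Type: application/json' \
  -d '{"index": "security-auditlog-2022.01.30", "shard": 0, "primary": true}'

The explain response: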
{
  "index": "security-auditlog-2022.01.30",
  "shard": 0,
  "primary": true,
  "current_state": "unassigned",
  "unassigned_info": {
    "reason": "MANUAL_ALLOCATION",
    "at": "2022-07-01T17:09:15.230Z",
    "details": "failed shard on node [ZZFUy43-So6NBdyhM_dPBw]: failed to create index, failure NullPointerException[null]",
    "last_allocation_status": "awaiting_info"
  },
  "can_allocate": "yes",
  "allocate_explanation": "can allocate the shard",
  "target_node": {
    "id": "jhzbGKS8TQePfepXOb_0kw",
    "name": "eshost1549.example.com",
    "transport_address": "10.2.3.199:9300",
    "attributes": {
      "ml.machine_memory": "67368648704",
      "ml.max_open_jobs": "20",
      "xpack.installed": "true",
      "box_type": "cold",
      "transform.node": "true"
    }
  },
  "allocation_id": "pi2twkWMSly5j9K6-GaGDQ",
  "node_allocation_decisions": [
    {
      "node_id": "ZZFUy43-So6NBdyhM_dPBw",
      "node_name": "eshost1550.example.com",
      "transport_address": "10.2.3.200:9300",
      "node_attributes": {
        "ml.machine_memory": "67368660992",
        "ml.max_open_jobs": "20",
        "xpack.installed": "true",
        "box_type": "cold",
        "transform.node": "true"
      },
      "node_decision": "yes",
      "store": {
        "in_sync": true,
        "allocation_id": "JVRoSHUcTRiuXg_YvcKilw"
      }
    },
    {
      "node_id": "jhzbGKS8TQePfepXOb_0kw",
      "node_name": "eshost1549.example.com",
      "transport_address": "10.2.3.199:9300",
      "node_attributes": {
        "ml.machine_memory": "67368648704",
        "ml.max_open_jobs": "20",
        "xpack.installed": "true",
        "box_type": "cold",
        "transform.node": "true"
      },
      "node_decision": "yes",
      "store": {
        "in_sync": true,
        "allocation_id": "pi2twkWMSly5j9K6-GaGDQ"
      }
    },
    {
      "node_id": "Jw9QJ3S5Tvyd36-iT8wprw",
      "node_name": "eshost1552.example.com",
      "transport_address": "10.2.3.198:9300",
      "node_attributes": {
        "ml.machine_memory": "67368648704",
        "ml.max_open_jobs": "20",
        "xpack.installed": "true",
        "box_type": "warm",
        "transform.node": "true"
      },
      "node_decision": "no",
      "store": {
        "found": false
      }
    },
    {
      "node_id": "t0jOfIYASESl8rQ66RPUyg",
      "node_name": "eshost1551.example.com",
      "transport_address": "10.2.3.197:9300",
      "node_attributes": {
        "ml.machine_memory": "67368660992",
        "ml.max_open_jobs": "20",
        "xpack.installed": "true",
        "box_type": "warm",
        "transform.node": "true"
      },
      "node_decision": "no",
      "store": {
        "found": false
      }
    },
    {
      "node_id": "z5llgKCRQ16M1rG0QhiHew",
      "node_name": "eshost1554.example.com",
      "transport_address": "10.2.3.196:9300",
      "node_attributes": {
        "ml.machine_memory": "67368419328",
        "ml.max_open_jobs": "20",
        "xpack.installed": "true",
        "box_type": "hot",
        "transform.node": "true"
      },
      "node_decision": "no",
      "store": {
        "found": false
      }
    },
    {
      "node_id": "zN0fjtZhTHiNkxtW5f3zqg",
      "node_name": "eshost1553.example.com",
      "transport_address": "10.2.3.195:9300",
      "node_attributes": {
        "ml.machine_memory": "67368419328",
        "ml.max_open_jobs": "20",
        "xpack.installed": "true",
        "box_type": "hot",
        "transform.node": "true"
      },
      "node_decision": "no",
      "store": {
        "found": false
      }
    }
  ]
}
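To capture a stack trace for the failure I raised the log level on the components involved. This is a sketch of the settings call; the logger names are expanded from the abbreviated class names in the log output (o.e.i.* and c.a.o.s.*):

# DEBUG on the index-creation path and the security plugin (we also ran TRACE later)
curl -sk -u admin -XPUT 'https://localhost:9200/_cluster/settings' \
  -H 'Content-Type: application/json' \
  -d '{"transient": {"logger.org.elasticsearch.indices": "DEBUG", "logger.com.amazon.opendistroforelasticsearch.security": "DEBUG"}}'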
Debug logging shows basically the same thing; the stack trace places the NullPointerException in com.amazon.opendistroforelasticsearch.security.OpenDistroSecurityPlugin.onIndexModule (OpenDistroSecurityPlugin.java:519):
[2022-07-01T11:38:39,888][DEBUG][o.e.i.IndicesService ] [eshost.example.com] creating Index [[security-auditlog-2020.11.26/Ne-p6ajnSyCv-5xM-UcOMA]], shards [1]/[1] - reason [CREATE_INDEX]
[2022-07-01T11:38:39,893][DEBUG][o.e.g.PersistedClusterStateService] [eshost.example.com] writing cluster state took [0ms]; wrote global metadata [false] and metadata for [16] indices and skipped [760] unchanged indices
[2022-07-01T11:38:39,894][DEBUG][c.a.o.s.OpenDistroSecurityPlugin] [eshost.example.com] Handle complianceConfig=null/dlsFlsAvailable: true/auditLog=class com.amazon.opendistroforelasticsearch.security.auditlog.NullAuditLog for onIndexModule() of index security-auditlog-2020.11.26
[2022-07-01T11:38:39,894][WARN ][o.e.i.c.IndicesClusterStateService] [eshost.example.com] [security-auditlog-2020.11.26][0] marking and sending shard failed due to [failed to create index]
java.lang.NullPointerException: null
at com.amazon.opendistroforelasticsearch.security.OpenDistroSecurityPlugin.onIndexModule(OpenDistroSecurityPlugin.java:519) ~[?:?]
at org.elasticsearch.plugins.PluginsService.onIndexModule(PluginsService.java:267) ~[elasticsearch-7.7.0.jar:7.7.0]
at org.elasticsearch.indices.IndicesService.createIndexService(IndicesService.java:607) ~[elasticsearch-7.7.0.jar:7.7.0]
at org.elasticsearch.indices.IndicesService.createIndex(IndicesService.java:549) ~[elasticsearch-7.7.0.jar:7.7.0]
at org.elasticsearch.indices.IndicesService.createIndex(IndicesService.java:176) ~[elasticsearch-7.7.0.jar:7.7.0]
at org.elasticsearch.indices.cluster.IndicesClusterStateService.createIndices(IndicesClusterStateService.java:484) ~[elasticsearch-7.7.0.jar:7.7.0]
at org.elasticsearch.indices.cluster.IndicesClusterStateService.applyClusterState(IndicesClusterStateService.java:246) ~[elasticsearch-7.7.0.jar:7.7.0]
at org.elasticsearch.cluster.service.ClusterApplierService.lambda$callClusterStateAppliers$5(ClusterApplierService.java:517) ~[elasticsearch-7.7.0.jar:7.7.0]
at java.lang.Iterable.forEach(Iterable.java:75) [?:1.8.0_251]
at org.elasticsearch.cluster.service.ClusterApplierService.callClusterStateAppliers(ClusterApplierService.java:514) [elasticsearch-7.7.0.jar:7.7.0]
at org.elasticsearch.cluster.service.ClusterApplierService.applyChanges(ClusterApplierService.java:485) [elasticsearch-7.7.0.jar:7.7.0]
at org.elasticsearch.cluster.service.ClusterApplierService.runTask(ClusterApplierService.java:432) [elasticsearch-7.7.0.jar:7.7.0]
at org.elasticsearch.cluster.service.ClusterApplierService.access$100(ClusterApplierService.java:73) [elasticsearch-7.7.0.jar:7.7.0]
at org.elasticsearch.cluster.service.ClusterApplierService$UpdateTask.run(ClusterApplierService.java:176) [elasticsearch-7.7.0.jar:7.7.0]
at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:633) [elasticsearch-7.7.0.jar:7.7.0]
at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.runAndClean(PrioritizedEsThreadPoolExecutor.java:252) [elasticsearch-7.7.0.jar:7.7.0]
at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.run(PrioritizedEsThreadPoolExecutor.java:215) [elasticsearch-7.7.0.jar:7.7.0]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_251]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_251]
at java.lang.Thread.run(Thread.java:748) [?:1.8.0_251]
[2022-07-01T11:38:39,895][DEBUG][o.e.c.a.s.ShardStateAction] [eshost.example.com] sending [internal:cluster/shard/failure] to [YpOcM6LtTIKVV3swKpZAWA] for shard entry [shard id [[security-auditlog-2020.11.26][0]], allocation id [n1zocy8ZTFqgFckuyVd9lg], primary term [0], message [failed to create index], failure [NullPointerException[null]], markAsStale [true]]
Trace logging shows:
-- index [[security-auditlog-2020.11.26/Ne-p6ajnSyCv-5xM-UcOMA]]
----shard_id [security-auditlog-2020.11.26][0]
--------[security-auditlog-2020.11.26][0], node[null], [P], recovery_source[existing store recovery; bootstrap_history_uuid=false], s[UNASSIGNED], unassigned_info[[reason=MANUAL_ALLOCATION], at[2022-07-01T17:05:16.424Z], delayed=false, details[failed shard on node [z5llgKCRQ16M1rG0QhiHew]: failed to create index, failure NullPointerException[null]], allocation_status[deciders_throttled]]
--------[security-auditlog-2020.11.26][0], node[null], [R], recovery_source[peer recovery], s[UNASSIGNED], unassigned_info[[reason=CLUSTER_RECOVERED], at[2022-07-01T04:17:17.657Z], delayed=false, allocation_status[no_attempt]]
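For completeness, the unassigned-shard counts above come from a shard listing along these lines (the exact column list is just habit; flags and credentials are illustrative):

# count shards stuck in UNASSIGNED
curl -sk -u admin 'https://localhost:9200/_cat/shards?v&h=index,shard,prirep,state,unassigned.reason' | grep -c UNASSIGNED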
What can we do to understand what is causing the NullPointerException? The DEBUG line just before the failure shows complianceConfig=null going into onIndexModule(), which looks suspicious, but I don't know whether a null complianceConfig is expected at that point.