Unassigned Shards with error ‘Failed to create index’ - NullPointerException

I encountered a problem bringing a ES 7.7.0 cluster back from a full cluster restart. The cluster has 3 master, 3 coordinating, 2 hot, 2 warm, and 2 cold nodes. The plugin version says it is 1.8.0.0

All nodes appear to have rejoined the cluster successfully:

ip            heap.percent ram.percent cpu load_1m load_5m load_15m node.role master name
10.2.3.197            8          99   1    0.12    0.22     0.22 dilt      -      eshost1551.example.com
10.2.3.202           67          95  27    1.24    0.95     0.98 ilm       *      eshost1570.example.com
10.2.3.200            3          84   1    0.09    0.08     0.12 dilt      -      eshost1550.example.com
10.2.3.201           13          97   2    0.18    0.15     0.14 ilm       -      eshost1569.example.com
10.2.3.205            9          97   2    0.27    0.15     0.12 l         -      eshost1573.example.com
10.2.3.196            1          63   1    0.13    0.15     0.27 dilt      -      eshost1554.example.com
10.2.3.206            8          97   1    0.04    0.04     0.06 l         -      eshost1574.example.com
10.2.3.199            3          96   2    0.20    0.14     0.14 dilt      -      eshost1549.example.com
10.2.3.204            8          75   2    0.02    0.08     0.12 l         -      eshost1572.example.com
10.2.3.203            9          94   2    0.11    0.11     0.13 ilm       -      eshost1571.example.com
10.2.3.195           63          94   3    0.20    0.18     0.24 dilt      -      eshost1553.example.com
10.2.3.198            2          95   2    0.20    0.23     0.22 dilt      -      eshost1552.example.com

But ~2000 of ~2400 shards are unassigned – one hot node (eshost1553) has shards allocated to it. Using the “/_cluster/reroute?retry_failed=true” endpoint forces another allocation attempt, but everything just returns to UNASSIGNED status. The “/_cluster/allocation/explain” endpoint says it failed to create the index with a NullPointerException:

{
  "index": "security-auditlog-2022.01.30",
  "shard": 0,
  "primary": true,
  "current_state": "unassigned",
  "unassigned_info": {
    "reason": "MANUAL_ALLOCATION",
    "at": "2022-07-01T17:09:15.230Z",
    "details": "failed shard on node [ZZFUy43-So6NBdyhM_dPBw]: failed to create index, failure NullPointerException[null]",
    "last_allocation_status": "awaiting_info"
  },
  "can_allocate": "yes",
  "allocate_explanation": "can allocate the shard",
  "target_node": {
    "id": "jhzbGKS8TQePfepXOb_0kw",
    "name": "eshost1549.example.com",
    "transport_address": "10.2.3.199:9300",
    "attributes": {
      "ml.machine_memory": "67368648704",
      "ml.max_open_jobs": "20",
      "xpack.installed": "true",
      "box_type": "cold",
      "transform.node": "true"
    }
  },
  "allocation_id": "pi2twkWMSly5j9K6-GaGDQ",
  "node_allocation_decisions": [
    {
      "node_id": "ZZFUy43-So6NBdyhM_dPBw",
      "node_name": "eshost1550.example.com",
      "transport_address": "10.2.3.200:9300",
      "node_attributes": {
        "ml.machine_memory": "67368660992",
        "ml.max_open_jobs": "20",
        "xpack.installed": "true",
        "box_type": "cold",
        "transform.node": "true"
      },
      "node_decision": "yes",
      "store": {
        "in_sync": true,
        "allocation_id": "JVRoSHUcTRiuXg_YvcKilw"
      }
    },
    {
      "node_id": "jhzbGKS8TQePfepXOb_0kw",
      "node_name": "eshost1549.example.com",
      "transport_address": "10.2.3.199:9300",
      "node_attributes": {
        "ml.machine_memory": "67368648704",
        "ml.max_open_jobs": "20",
        "xpack.installed": "true",
        "box_type": "cold",
        "transform.node": "true"
      },
      "node_decision": "yes",
      "store": {
        "in_sync": true,
        "allocation_id": "pi2twkWMSly5j9K6-GaGDQ"
      }
    },
    {
      "node_id": "Jw9QJ3S5Tvyd36-iT8wprw",
      "node_name": "eshost1552.example.com",
      "transport_address": "10.2.3.198:9300",
      "node_attributes": {
        "ml.machine_memory": "67368648704",
        "ml.max_open_jobs": "20",
        "xpack.installed": "true",
        "box_type": "warm",
        "transform.node": "true"
      },
      "node_decision": "no",
      "store": {
        "found": false
      }
    },
    {
      "node_id": "t0jOfIYASESl8rQ66RPUyg",
      "node_name": "eshost1551.example.com",
      "transport_address": "10.2.3.197:9300",
      "node_attributes": {
        "ml.machine_memory": "67368660992",
        "ml.max_open_jobs": "20",
        "xpack.installed": "true",
        "box_type": "warm",
        "transform.node": "true"
      },
      "node_decision": "no",
      "store": {
        "found": false
      }
    },
    {
      "node_id": "z5llgKCRQ16M1rG0QhiHew",
      "node_name": "eshost1554.example.com",
      "transport_address": "10.2.3.196:9300",
      "node_attributes": {
        "ml.machine_memory": "67368419328",
        "ml.max_open_jobs": "20",
        "xpack.installed": "true",
        "box_type": "hot",
        "transform.node": "true"
      },
      "node_decision": "no",
      "store": {
        "found": false
      }
    },
    {
      "node_id": "zN0fjtZhTHiNkxtW5f3zqg",
      "node_name": "eshost1553.example.com",
      "transport_address": "10.2.3.195:9300",
      "node_attributes": {
        "ml.machine_memory": "67368419328",
        "ml.max_open_jobs": "20",
        "xpack.installed": "true",
        "box_type": "hot",
        "transform.node": "true"
      },
      "node_decision": "no",
      "store": {
        "found": false
      }
    }
  ]
}

Debug logging shows basically the same thing – the trace shows the NullPointerException coming from com.amazon.opendistroforelasticsearch.security.OpenDistroSecurityPlugin.onIndexModule(OpenDistroSecurityPlugin.java:519):

[2022-07-01T11:38:39,888][DEBUG][o.e.i.IndicesService     ] [eshost.example.com] creating Index [[security-auditlog-2020.11.26/Ne-p6ajnSyCv-5xM-UcOMA]], shards [1]/[1] - reason [CREATE_INDEX]
[2022-07-01T11:38:39,893][DEBUG][o.e.g.PersistedClusterStateService] [eshost.example.com] writing cluster state took [0ms]; wrote global metadata [false] and metadata for [16] indices and skipped [760] unchanged indices
[2022-07-01T11:38:39,894][DEBUG][c.a.o.s.OpenDistroSecurityPlugin] [eshost.example.com] Handle complianceConfig=null/dlsFlsAvailable: true/auditLog=class com.amazon.opendistroforelasticsearch.security.auditlog.NullAuditLog for onIndexModule() of index security-auditlog-2020.11.26
[2022-07-01T11:38:39,894][WARN ][o.e.i.c.IndicesClusterStateService] [eshost.example.com] [security-auditlog-2020.11.26][0] marking and sending shard failed due to [failed to create index]
java.lang.NullPointerException: null
        at com.amazon.opendistroforelasticsearch.security.OpenDistroSecurityPlugin.onIndexModule(OpenDistroSecurityPlugin.java:519) ~[?:?]
        at org.elasticsearch.plugins.PluginsService.onIndexModule(PluginsService.java:267) ~[elasticsearch-7.7.0.jar:7.7.0]
        at org.elasticsearch.indices.IndicesService.createIndexService(IndicesService.java:607) ~[elasticsearch-7.7.0.jar:7.7.0]
        at org.elasticsearch.indices.IndicesService.createIndex(IndicesService.java:549) ~[elasticsearch-7.7.0.jar:7.7.0]
        at org.elasticsearch.indices.IndicesService.createIndex(IndicesService.java:176) ~[elasticsearch-7.7.0.jar:7.7.0]
        at org.elasticsearch.indices.cluster.IndicesClusterStateService.createIndices(IndicesClusterStateService.java:484) ~[elasticsearch-7.7.0.jar:7.7.0]
        at org.elasticsearch.indices.cluster.IndicesClusterStateService.applyClusterState(IndicesClusterStateService.java:246) ~[elasticsearch-7.7.0.jar:7.7.0]
        at org.elasticsearch.cluster.service.ClusterApplierService.lambda$callClusterStateAppliers$5(ClusterApplierService.java:517) ~[elasticsearch-7.7.0.jar:7.7.0]
        at java.lang.Iterable.forEach(Iterable.java:75) [?:1.8.0_251]
        at org.elasticsearch.cluster.service.ClusterApplierService.callClusterStateAppliers(ClusterApplierService.java:514) [elasticsearch-7.7.0.jar:7.7.0]
        at org.elasticsearch.cluster.service.ClusterApplierService.applyChanges(ClusterApplierService.java:485) [elasticsearch-7.7.0.jar:7.7.0]
        at org.elasticsearch.cluster.service.ClusterApplierService.runTask(ClusterApplierService.java:432) [elasticsearch-7.7.0.jar:7.7.0]
        at org.elasticsearch.cluster.service.ClusterApplierService.access$100(ClusterApplierService.java:73) [elasticsearch-7.7.0.jar:7.7.0]
        at org.elasticsearch.cluster.service.ClusterApplierService$UpdateTask.run(ClusterApplierService.java:176) [elasticsearch-7.7.0.jar:7.7.0]
        at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:633) [elasticsearch-7.7.0.jar:7.7.0]
        at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.runAndClean(PrioritizedEsThreadPoolExecutor.java:252) [elasticsearch-7.7.0.jar:7.7.0]
        at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.run(PrioritizedEsThreadPoolExecutor.java:215) [elasticsearch-7.7.0.jar:7.7.0]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_251]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_251]
        at java.lang.Thread.run(Thread.java:748) [?:1.8.0_251]
[2022-07-01T11:38:39,895][DEBUG][o.e.c.a.s.ShardStateAction] [eshost.example.com] sending [internal:cluster/shard/failure] to [YpOcM6LtTIKVV3swKpZAWA] for shard entry [shard id [[security-auditlog-2020.11.26][0]], allocation id [n1zocy8ZTFqgFckuyVd9lg], primary term [0], message [failed to create index], failure [NullPointerException[null]], markAsStale [true]]

Trace logging shows:

-- index [[security-auditlog-2020.11.26/Ne-p6ajnSyCv-5xM-UcOMA]]
----shard_id [security-auditlog-2020.11.26][0]
--------[security-auditlog-2020.11.26][0], node[null], [P], recovery_source[existing store recovery; bootstrap_history_uuid=false], s[UNASSIGNED], unassigned_info[[reason=MANUAL_ALLOCATION], at[2022-07-01T17:05:16.424Z], delayed=false, details[failed shard on node [z5llgKCRQ16M1rG0QhiHew]: failed to create index, failure NullPointerException[null]], allocation_status[deciders_throttled]]
--------[security-auditlog-2020.11.26][0], node[null], [R], recovery_source[peer recovery], s[UNASSIGNED], unassigned_info[[reason=CLUSTER_RECOVERED], at[2022-07-01T04:17:17.657Z], delayed=false, allocation_status[no_attempt]]

What can we do to understand what is causing the NullPointerException?

In case anyone else experiences the same problem – we figured out the problem: there was a typo in the “opendistro_security.audit.config.disabled_rest_categories” configuration that was added to the elasticsearch.yml file.