OpenDistro plugin 1.6 has no timeout in BackendRegistry (the authcz() method)

Versions (relevant - OpenSearch/Dashboard/Server OS/Browser):

OpenDistro 1.6

Describe the issue:
We are using OpenDistro 1.6 with Elasticsearch 7.6 and LDAP for authentication. During a network issue we see an Elasticsearch node become unresponsive, and it stays unresponsive even after the network issue is resolved.
The Elasticsearch service only recovers when the node is restarted.
On analysing thread dumps we see that the threads are all waiting on the same object: an asynchronous task that never completes.

We went through the code referenced by the stacktrace, starting from OpenDistro’s BackendRegistry (the authcz() method), which calls LocalManualCache.get() from Google Guava’s LocalCache (version 25.1, used by OpenDistro 1.6). This effectively says “get this entry from the cache and use the actual authentication backend as a fallback”.
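The get-with-loader pattern can be sketched with plain JDK types (the class and method names below are our own, hypothetical stand-ins; Guava’s LocalCache implements this internally with far more machinery): the first caller for a key installs a future and runs the loader, and every later caller for the same key just waits on that same future.

```java
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

public class PerKeyLoadingCache {
    // One future per key: the first caller installs it and runs the loader;
    // later callers for the same key wait on the same future.
    private final Map<String, CompletableFuture<String>> cache = new ConcurrentHashMap<>();

    public String get(String key, Function<String, String> loader) throws Exception {
        CompletableFuture<String> future = cache.computeIfAbsent(
                key, k -> CompletableFuture.supplyAsync(() -> loader.apply(k)));
        // Like Guava's getUninterruptibly(): no timeout, so if the loader
        // never completes, every caller for this key blocks here forever.
        return future.get();
    }

    public static void main(String[] args) throws Exception {
        PerKeyLoadingCache cache = new PerKeyLoadingCache();
        System.out.println(cache.get("user1", k -> "authenticated:" + k));
        // Second loader is never invoked: the call is served by the cached future.
        System.out.println(cache.get("user1", k -> "never-called"));
    }
}
```

This is the behaviour the rest of the stacktrace walks through: when the loader (here, the LDAP call) hangs, so does every thread that asked for the same key.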

Further up the trace we get to this part of Segment.get():

// at this point e is either null or expired;
return lockedGetOrLoad(key, hash, loader);

So we know that the entry wasn’t in cache (i.e. we can’t authenticate the user using the local cache). The lockedGetOrLoad() method effectively says “if there’s another [asynchronous] call to fetch this entry [from the authentication backend], wait for it. Otherwise, initiate a call”. Once again, the stacktrace tells us the correct branch, because we get here:

// The entry already exists. Wait for loading.
return waitForLoadingValue(e, key, valueReference);

We notice the getUninterruptibly() call further up in the stacktrace - it effectively waits indefinitely for that asynchronous call (the SettableFuture that all the threads are waiting on). No timeout.
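The difference can be reproduced with a plain CompletableFuture (a stand-in for the SettableFuture in the trace, not the actual Guava type): a future that is never completed parks an untimed get() forever, while a timed get() at least turns the hang into a diagnosable failure.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class StuckWaitDemo {
    public static void main(String[] args) throws Exception {
        // Stand-in for the future all the dumped threads are parked on:
        // nothing ever completes it.
        CompletableFuture<String> stuckLoad = new CompletableFuture<>();

        // stuckLoad.get() with no timeout would park this thread indefinitely,
        // exactly like getUninterruptibly() in the stack trace.
        // A timed wait surfaces the problem instead:
        try {
            stuckLoad.get(500, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            System.out.println("load timed out instead of hanging");
        }
    }
}
```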

But there must be another issue: why does that asynchronous call get stuck indefinitely? Shouldn’t it throw an error?

We checked the code for the most recent version of Google Guava’s LocalCache as well as the most recent version of the OpenSearch (not OpenDistro) Security plugin’s BackendRegistry, and the path we’re interested in is essentially the same: BackendRegistry calls LocalManualCache.get(), and that get(), in turn, still waits indefinitely if there’s no existing entry and another fetch is in progress.
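As a defensive pattern on our side (an assumption, not something the plugin offers as configuration as far as we can tell), the backend call itself could be bounded, so a hung LDAP connection fails the load and releases the waiting threads instead of parking them forever:

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.function.Supplier;

public class BoundedLoader {
    // Daemon threads so a stuck backend call cannot keep the JVM alive.
    private static final ExecutorService POOL = Executors.newCachedThreadPool(r -> {
        Thread t = new Thread(r);
        t.setDaemon(true);
        return t;
    });

    // Runs the backend call on a worker thread and gives up after the
    // deadline, so a hung LDAP call surfaces as an exception that the
    // cache can propagate instead of an indefinite wait.
    public static <T> T loadWithTimeout(Supplier<T> backendCall, long millis) throws Exception {
        Future<T> future = POOL.submit(backendCall::get);
        try {
            return future.get(millis, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            future.cancel(true); // interrupt the stuck backend call
            throw new IllegalStateException("authentication backend timed out", e);
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println(loadWithTimeout(() -> "user-attributes", 1000));
    }
}
```

With a bound like this, a failed load is thrown to the callers waiting on the cache entry rather than leaving them parked until the node is restarted.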

Is there any reason why this timeout is not set?
Is there any other configuration we are missing to avoid this issue?

Thread dump details:
at jdk.internal.misc.Unsafe.park(java.base@11.0.12/Native Method)
at java.util.concurrent.locks.LockSupport.park(java.base@11.0.12/

Relevant Logs or Screenshots:

Hi @gchakkalakkal1

How many nodes do you have? How often does this happen? Does the Elasticsearch node always become unresponsive after the network issue?

This is a 21-node cluster with 18 data nodes and 3 master nodes. It does not happen every time there is a network glitch, and not on all nodes. We have faced it 2 or 3 times in the last 3 months.
But that could be because the user was already in the cache, so the LDAP interaction would have been avoided.