Versions (relevant - OpenSearch/Dashboard/Server OS/Browser):
OpenDistro 1.6
Describe the issue:
We are using OpenDistro 1.6 with Elasticsearch 7.6, with LDAP for authentication. During a network issue the Elasticsearch node becomes unresponsive, and it stays unresponsive even after the network issue is resolved.
The Elasticsearch service only recovers when the node is restarted.
On analysing the thread dumps, we see that the threads are all waiting on the same com.google.common.util.concurrent.SettableFuture object, which represents an asynchronous task that has not yet completed.
We went through the code referenced by the stacktrace, starting from OpenDistro’s BackendRegistry (the authcz() method), which calls LocalManualCache.get() from Google Guava’s LocalCache (version 25.1, used by OpenDistro 1.6). This effectively says “get this entry from cache and use the actual authentication backend as a fallback”.
Further up the trace we get to this part of Segment.get():
// at this point e is either null or expired;
return lockedGetOrLoad(key, hash, loader);
So we know that the entry wasn’t in cache (i.e. we can’t authenticate the user using the local cache). The lockedGetOrLoad() method effectively says “if there’s another [asynchronous] call to fetch this entry [from the authentication backend], wait for it. Otherwise, initiate a call”. Once again, the stacktrace tells us the correct branch, because we get here:
// The entry already exists. Wait for loading.
return waitForLoadingValue(e, key, valueReference);
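To illustrate the load-or-wait behaviour we are describing, here is a minimal sketch using only java.util.concurrent. It is not Guava's code: the class SingleFlightCache and its get() method are hypothetical names we made up, and the real LocalCache uses segment locks and its own value references. It only shows the shape of the problem: the first caller runs the loader, and every later caller for the same key waits on that caller's future with no timeout.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.function.Function;

/** Hypothetical stand-in for the load-or-wait pattern; NOT Guava's implementation. */
final class SingleFlightCache<K, V> {
    private final ConcurrentMap<K, CompletableFuture<V>> entries = new ConcurrentHashMap<>();

    /** Returns the value for key; if a load is already registered, waits for it with no timeout. */
    V get(K key, Function<K, V> loader) {
        CompletableFuture<V> mine = new CompletableFuture<>();
        CompletableFuture<V> existing = entries.putIfAbsent(key, mine);
        if (existing != null) {
            // "The entry already exists. Wait for loading."
            // join() is untimed: if the loading caller never finishes, we never return.
            return existing.join();
        }
        try {
            // We are the first caller, so we call the backend (e.g. LDAP) ourselves.
            V value = loader.apply(key);
            mine.complete(value);
            return value;
        } catch (RuntimeException e) {
            // A failed load wakes up the waiters with the error and clears the entry.
            entries.remove(key, mine);
            mine.completeExceptionally(e);
            throw e;
        }
    }
}
```

In this sketch, a loader that throws releases the waiters, but a loader that simply hangs (neither returns nor throws) leaves every later caller for that key parked, which matches what we see in the thread dump.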
We notice the getUninterruptibly() call further up in the stacktrace: it waits indefinitely for that asynchronous call (the SettableFuture that all the threads are waiting on). There is no timeout.
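To make the difference concrete, here is a small sketch of a bounded wait on a future that never completes. It uses a plain CompletableFuture rather than Guava's SettableFuture, and TimedWaitDemo and timesOut() are hypothetical names for illustration only. An untimed get() on the same future would park the thread forever, which is what the stacktrace shows.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

/** Hypothetical demo of a timed wait; not part of Guava or OpenDistro. */
final class TimedWaitDemo {
    /** Waits at most timeoutMillis for the future; returns true if the wait timed out. */
    static boolean timesOut(CompletableFuture<String> pending, long timeoutMillis) {
        try {
            pending.get(timeoutMillis, TimeUnit.MILLISECONDS);
            return false; // the future completed in time
        } catch (TimeoutException e) {
            return true; // the caller gets control back instead of parking forever
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt flag
            throw new IllegalStateException(e);
        } catch (ExecutionException e) {
            throw new IllegalStateException(e.getCause());
        }
    }

    public static void main(String[] args) {
        // Simulates a backend call that never returns and never fails.
        CompletableFuture<String> stuck = new CompletableFuture<>();
        System.out.println(timesOut(stuck, 100)); // prints true

        CompletableFuture<String> done = CompletableFuture.completedFuture("user");
        System.out.println(timesOut(done, 100)); // prints false
    }
}
```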
But there must be another issue: why does that asynchronous call get stuck indefinitely? Shouldn’t it throw an error?
We checked the code for the most recent version of Google Guava’s LocalCache as well as the most recent version of OpenSearch (not OpenDistro, OpenSearch) Security’s BackendRegistry, and the path we’re interested in is essentially the same. BackendRegistry calls LocalManualCache.get(), and that get(), in turn, still waits indefinitely if there’s no usable cached value and another fetch for the same key is already in progress.
Is there any reason why this timeout is not set?
Is there any other configuration we are missing that would avoid this issue?
Configuration:
Thread dump details:
at jdk.internal.misc.Unsafe.park(java.base@11.0.12/Native Method)
at java.util.concurrent.locks.LockSupport.park(java.base@11.0.12/LockSupport.java:194)
at com.google.common.util.concurrent.AbstractFuture.get(AbstractFuture.java:497)
at com.google.common.util.concurrent.AbstractFuture$TrustedFuture.get(AbstractFuture.java:83)
at com.google.common.util.concurrent.Uninterruptibles.getUninterruptibly(Uninterruptibles.java:196)
at com.google.common.cache.LocalCache$LoadingValueReference.waitForValue(LocalCache.java:3580)
at com.google.common.cache.LocalCache$Segment.waitForLoadingValue(LocalCache.java:2174)
at com.google.common.cache.LocalCache$Segment.lockedGetOrLoad(LocalCache.java:2161)
at com.google.common.cache.LocalCache$Segment.get(LocalCache.java:2044)
at com.google.common.cache.LocalCache.get(LocalCache.java:3951)
at com.google.common.cache.LocalCache$LocalManualCache.get(LocalCache.java:4870)
at com.amazon.opendistroforelasticsearch.security.auth.BackendRegistry.authcz(BackendRegistry.java:655)
at com.amazon.opendistroforelasticsearch.security.auth.BackendRegistry.authenticate(BackendRegistry.java:461)
at com.amazon.opendistroforelasticsearch.security.filter.OpenDistroSecurityRestFilter.checkAndAuthenticateRequest(OpenDistroSecurityRestFilter.java:146)
at com.amazon.opendistroforelasticsearch.security.filter.OpenDistroSecurityRestFilter.access$000(OpenDistroSecurityRestFilter.java:63)
at com.amazon.opendistroforelasticsearch.security.filter.OpenDistroSecurityRestFilter$1.handleRequest(OpenDistroSecurityRestFilter.java:93)
at org.elasticsearch.rest.RestController.dispatchRequest(RestController.java:227)
Relevant Logs or Screenshots: