Possible issue?

Hi, we have an indexing-heavy cluster that has been experiencing timeouts a few times a week for the last few weeks, ever since we upgraded to 1.3.0. During the periods when indexing stops, I see this in the logs:

org.elasticsearch.transport.RemoteTransportException: [master][1.1.1.1:9300][cluster:admin/ism/update/managedindexmetadata]
Caused by: org.elasticsearch.cluster.metadata.ProcessClusterEventTimeoutException: failed to process cluster event (opendistro-ism) within 30s
    at org.elasticsearch.cluster.service.MasterService$Batcher.lambda$onTimeout$0(MasterService.java:134) ~[elasticsearch-7.3.2.jar:7.3.2]
    at java.util.ArrayList.forEach(ArrayList.java:1540) ~[?:?]
    at org.elasticsearch.cluster.service.MasterService$Batcher.lambda$onTimeout$1(MasterService.java:133) ~[elasticsearch-7.3.2.jar:7.3.2]
    at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:688) ~[elasticsearch-7.3.2.jar:7.3.2]
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) ~[?:?]
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) ~[?:?]
    at java.lang.Thread.run(Thread.java:835) ~[?:?]

Not sure if this is important, but the state of the managed ISM indices was “running”.
As a test, I removed all policies and managed indices from ISM to see whether things improve.
Has anyone else seen issues like this?

Hi @jasonrojas,

How many managed indices do you have in ISM?
When you removed the policies/managed indices from ISM, did the timeouts stop or are they still happening?

Thanks,
Drew

I think I had manually applied the policy to about 30 indices.

The cluster had no issues last night so only time will tell if this was the issue.

@jasonrojas

30 managed indices wouldn’t be enough for ISM alone to back up the master’s cluster state queue.
If you could also answer these questions, it might help us pinpoint and replicate the issue:

  • How much data do you have in the cluster?
  • How many indices/shards do you have?
  • What type of instances are you using?
  • What ingestion throughput do you average?

Also, are the timeout exceptions only occurring for ISM cluster events, or for other cluster events too?
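
If it helps with that last question, here is a rough sketch (not something from ISM itself) that groups the output of Elasticsearch's GET _cluster/pending_tasks API by task source, so you can see whether only opendistro-ism tasks are queuing up or others as well. The function names and the http://localhost:9200 address are assumptions for illustration; adjust the URL and add authentication/TLS to match your cluster.

```python
import json
import urllib.request
from collections import defaultdict


def summarize_pending_tasks(tasks):
    """Group pending cluster tasks by their source.

    Returns a list of (source, count, longest_wait_ms) tuples,
    sorted so the source with the longest wait comes first.
    """
    grouped = defaultdict(lambda: [0, 0])  # source -> [count, longest wait in ms]
    for task in tasks:
        entry = grouped[task.get("source", "unknown")]
        entry[0] += 1
        entry[1] = max(entry[1], task.get("time_in_queue_millis", 0))
    rows = [(source, count, wait) for source, (count, wait) in grouped.items()]
    return sorted(rows, key=lambda row: row[2], reverse=True)


def fetch_pending_tasks(base_url="http://localhost:9200"):
    """Fetch the master's pending task queue.

    base_url is an assumed address with no authentication.
    """
    url = base_url + "/_cluster/pending_tasks"
    with urllib.request.urlopen(url, timeout=10) as resp:
        return json.load(resp).get("tasks", [])
```

Running summarize_pending_tasks(fetch_pending_tasks()) while indexing is stalled should show which sources are waiting on the master and for how long.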

Thanks,
Drew