ISM policy fails on StateMetaData

OpenSearch 2.11.1

We have an ISM policy for indices matching `network-*-rollover-*` that rolls them over after they are 1 hour old and have a primary shard size of at least 1 GB. The indices are kept for 1 day before being deleted.
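For reference, the policy is roughly equivalent to the following (a simplified sketch rather than a verbatim copy; the state names and template priority here are illustrative):

```
PUT _plugins/_ism/policies/network
{
  "policy": {
    "description": "Rollover at 1h / 1gb primary shard, delete after 1 day",
    "default_state": "rollover",
    "states": [
      {
        "name": "rollover",
        "actions": [
          { "rollover": { "min_index_age": "1h", "min_primary_shard_size": "1gb" } }
        ],
        "transitions": [
          { "state_name": "delete", "conditions": { "min_index_age": "1d" } }
        ]
      },
      {
        "name": "delete",
        "actions": [ { "delete": {} } ]
      }
    ],
    "ism_template": [
      { "index_patterns": ["network-*-rollover-*"], "priority": 100 }
    ]
  }
}
```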

This succeeds for two of the three matching indices, but fails on the third index with the error message:
Failed to find state=StateMetaData(name=rollover, startTime=1716488483625) in policy=network

My guess is that this is somehow related to the startTime on the index, but I can't imagine how or why this is failing.
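For anyone looking into this, the managed index metadata (current state, action, step status, and the error message above) can be inspected with the ISM explain API; the index name below is the FCODEX one from the logs further down, used as an example:

```
GET _plugins/_ism/explain/network-FCODEX-2.3-rollover-000001
```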

I do see this SSL error in the logs, recorded around that startTime (1716488483625 = Thursday, May 23, 2024, 6:21:23.625 PM):

[2024-05-23T18:16:52,824][ERROR][o.o.s.s.h.n.SecuritySSLNettyHttpServerTransport] [flow-app] Exception during establishing a SSL connection: java.net.SocketException: Connection reset
java.net.SocketException: Connection reset
        at sun.nio.ch.SocketChannelImpl.throwConnectionReset(SocketChannelImpl.java:394) ~[?:?]
        at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:426) ~[?:?]
        at org.opensearch.transport.CopyBytesSocketChannel.readFromSocketChannel(CopyBytesSocketChannel.java:156) ~[transport-netty4-client-2.11.1.jar:2.11.1]
        at org.opensearch.transport.CopyBytesSocketChannel.doReadBytes(CopyBytesSocketChannel.java:141) ~[transport-netty4-client-2.11.1.jar:2.11.1]
        at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:151) [netty-transport-4.1.100.Final.jar:4.1.100.Final]
        at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:788) [netty-transport-4.1.100.Final.jar:4.1.100.Final]
        at io.netty.channel.nio.NioEventLoop.processSelectedKeysPlain(NioEventLoop.java:689) [netty-transport-4.1.100.Final.jar:4.1.100.Final]
        at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:652) [netty-transport-4.1.100.Final.jar:4.1.100.Final]
        at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:562) [netty-transport-4.1.100.Final.jar:4.1.100.Final]
        at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:997) [netty-common-4.1.100.Final.jar:4.1.100.Final]
        at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74) [netty-common-4.1.100.Final.jar:4.1.100.Final]
        at java.lang.Thread.run(Thread.java:833) [?:?]
[2024-05-23T18:17:27,955][INFO ][o.o.p.PluginsService     ] [flow-app] PluginService:onIndexModule index:[network-FCODEX-2.3-rollover-000001/hJ8hRwvATWWxfSJBjbzemA]
[2024-05-23T18:17:28,081][INFO ][o.o.c.m.MetadataMappingService] [flow-app] [network-FCODEX-2.3-rollover-000001/hJ8hRwvATWWxfSJBjbzemA] update_mapping [_doc]
[2024-05-23T18:17:28,213][INFO ][o.o.a.u.d.DestinationMigrationCoordinator] [flow-app] Detected cluster change event for destination migration
[2024-05-23T18:19:12,901][INFO ][o.o.j.s.JobSweeper       ] [flow-app] Running full sweep
[2024-05-23T18:20:41,765][INFO ][o.o.j.s.JobScheduler     ] [flow-app] Will delay 136366 miliseconds for next execution of job network-PCODEX-2.3-rollover-000001
[2024-05-23T18:20:42,165][INFO ][o.o.i.i.ManagedIndexRunner] [flow-app] Executing attempt_rollover for network-PCODEX-2.3-rollover-000001
[2024-05-23T18:20:42,170][INFO ][o.o.i.i.ManagedIndexRunner] [flow-app] Finished executing attempt_rollover for network-PCODEX-2.3-rollover-000001
[2024-05-23T18:21:06,649][INFO ][o.o.j.s.JobScheduler     ] [flow-app] Will delay 78082 miliseconds for next execution of job network-TLMTRY_FCODEX-2.3-rollover-000001
[2024-05-23T18:21:07,197][INFO ][o.o.i.i.ManagedIndexRunner] [flow-app] Executing attempt_rollover for network-TLMTRY_FCODEX-2.3-rollover-000001
[2024-05-23T18:21:07,200][INFO ][o.o.i.i.ManagedIndexRunner] [flow-app] Finished executing attempt_rollover for network-TLMTRY_FCODEX-2.3-rollover-000001
[2024-05-23T18:21:23,586][INFO ][o.o.j.s.JobScheduler     ] [flow-app] Will delay 1970 miliseconds for next execution of job network-FCODEX-2.3-rollover-000001
[2024-05-23T18:21:23,617][INFO ][o.o.j.s.JobScheduler     ] [flow-app] Descheduling jobId: hJ8hRwvATWWxfSJBjbzemA

Any help interpreting the error message, or restarting the policy so that it picks up the current indices, would be appreciated.
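For the "restarting the policy" part, the two options I'm aware of are the retry API and detaching/re-attaching the policy, along these lines (index name taken from the logs above as an example; I'm not sure which of the two is appropriate here):

```
# Retry the failed step on the managed index
POST _plugins/_ism/retry/network-FCODEX-2.3-rollover-000001

# Or detach and re-attach the policy so ISM starts over from the initial state
POST _plugins/_ism/remove/network-FCODEX-2.3-rollover-000001
POST _plugins/_ism/add/network-FCODEX-2.3-rollover-000001
{
  "policy_id": "network"
}
```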

@dxturner Do you get any other connectivity errors? Is the reported error only visible during the rollover process?
How many nodes do you have in your cluster?
Is your cluster status always Green?