Started rolling restart of cluster using SAML auth, after first node restarted, auth is broken

I have a 3 node cluster, running in Amazon ECS using container images that are based on the official images, but with our required file changes added. I started a rolling restart of this cluster, with no update to the container image, and after restarting the first node, SAML authentication is now broken. I have confirmed that if I remove this node from the cluster, auth works again.

No changes to SAML config, opensearch.yml, or the container image have been made.

File permissions on all files created by the container match the working containers.

SAML Config with sensitive portions redacted:

      saml_auth_domain:
        description: "Authenticate via Okta - For Human Users"
        http_enabled: true
        transport_enabled: false
        order: 2
        http_authenticator:
          type: saml
          challenge: true
          config:
            idp:
              metadata_file: /usr/share/opensearch/plugins/opensearch-security/securityconfig/okta_metadata.xml
              entity_id: http://www.okta.com/REDACTED
            sp:
              entity_id: https://osd.example.com
            kibana_url: https://osd.example.com/
            exchange_key: 'REDACTED'
        authentication_backend:
          type: noop

I have also confirmed that the exchange_key and okta_metadata.xml are identical between working and non-working host.

Errors:

[2022-08-04T12:25:48,754][WARN ][stderr                   ] [dev-02.example.local] Aug 04, 2022 12:25:48 PM org.apache.cxf.rs.security.jose.jws.JwsCompactConsumer verifySignatureWith
[2022-08-04T12:25:48,754][WARN ][stderr                   ] [dev-02.example.local] WARNING: Invalid Signature
[2022-08-04T12:25:48,755][WARN ][stderr                   ] [dev-02.example.local] Aug 04, 2022 12:25:48 PM org.apache.cxf.rs.security.jose.jws.JwsCompactConsumer verifySignatureWith
[2022-08-04T12:25:48,755][WARN ][stderr                   ] [dev-02.example.local] WARNING: Invalid Signature
[2022-08-04T12:25:48,755][INFO ][c.a.d.a.h.j.AbstractHTTPJwtAuthenticator] [dev-02.example.local] Extracting JWT token from REDACTED_TOKEN_LIKE_STRING failed
com.amazon.dlic.auth.http.jwt.keybyoidc.BadCredentialsException: Invalid JWT signature
	at com.amazon.dlic.auth.http.jwt.keybyoidc.JwtVerifier.getVerifiedJwtToken(JwtVerifier.java:75) ~[opensearch-security-2.0.1.0.jar:2.0.1.0]
	at com.amazon.dlic.auth.http.jwt.AbstractHTTPJwtAuthenticator.extractCredentials0(AbstractHTTPJwtAuthenticator.java:109) [opensearch-security-2.0.1.0.jar:2.0.1.0]
	at com.amazon.dlic.auth.http.jwt.AbstractHTTPJwtAuthenticator$1.run(AbstractHTTPJwtAuthenticator.java:91) [opensearch-security-2.0.1.0.jar:2.0.1.0]
	at com.amazon.dlic.auth.http.jwt.AbstractHTTPJwtAuthenticator$1.run(AbstractHTTPJwtAuthenticator.java:88) [opensearch-security-2.0.1.0.jar:2.0.1.0]
	at java.security.AccessController.doPrivileged(AccessController.java:318) [?:?]
	at com.amazon.dlic.auth.http.jwt.AbstractHTTPJwtAuthenticator.extractCredentials(AbstractHTTPJwtAuthenticator.java:88) [opensearch-security-2.0.1.0.jar:2.0.1.0]
	at com.amazon.dlic.auth.http.saml.HTTPSamlAuthenticator.extractCredentials(HTTPSamlAuthenticator.java:163) [opensearch-security-2.0.1.0.jar:2.0.1.0]
	at org.opensearch.security.auth.BackendRegistry.authenticate(BackendRegistry.java:248) [opensearch-security-2.0.1.0.jar:2.0.1.0]
	at org.opensearch.security.filter.SecurityRestFilter.checkAndAuthenticateRequest(SecurityRestFilter.java:192) [opensearch-security-2.0.1.0.jar:2.0.1.0]
	at org.opensearch.security.filter.SecurityRestFilter$1.handleRequest(SecurityRestFilter.java:125) [opensearch-security-2.0.1.0.jar:2.0.1.0]
	at org.opensearch.rest.RestController.dispatchRequest(RestController.java:311) [opensearch-2.0.1.jar:2.0.1]
	at org.opensearch.rest.RestController.tryAllHandlers(RestController.java:397) [opensearch-2.0.1.jar:2.0.1]
	at org.opensearch.rest.RestController.dispatchRequest(RestController.java:240) [opensearch-2.0.1.jar:2.0.1]
	at org.opensearch.security.ssl.http.netty.ValidatingDispatcher.dispatchRequest(ValidatingDispatcher.java:63) [opensearch-security-2.0.1.0.jar:2.0.1.0]
	at org.opensearch.http.AbstractHttpServerTransport.dispatchRequest(AbstractHttpServerTransport.java:366) [opensearch-2.0.1.jar:2.0.1]
	at org.opensearch.http.AbstractHttpServerTransport.handleIncomingRequest(AbstractHttpServerTransport.java:445) [opensearch-2.0.1.jar:2.0.1]
	at org.opensearch.http.AbstractHttpServerTransport.incomingRequest(AbstractHttpServerTransport.java:356) [opensearch-2.0.1.jar:2.0.1]
	at org.opensearch.http.netty4.Netty4HttpRequestHandler.channelRead0(Netty4HttpRequestHandler.java:55) [transport-netty4-client-2.0.1.jar:2.0.1]
	at org.opensearch.http.netty4.Netty4HttpRequestHandler.channelRead0(Netty4HttpRequestHandler.java:41) [transport-netty4-client-2.0.1.jar:2.0.1]
	at io.netty.channel.SimpleChannelInboundHandler.channelRead(SimpleChannelInboundHandler.java:99) [netty-transport-4.1.73.Final.jar:4.1.73.Final]
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379) [netty-transport-4.1.73.Final.jar:4.1.73.Final]
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365) [netty-transport-4.1.73.Final.jar:4.1.73.Final]
	at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:357) [netty-transport-4.1.73.Final.jar:4.1.73.Final]
	at org.opensearch.http.netty4.Netty4HttpPipeliningHandler.channelRead(Netty4HttpPipeliningHandler.java:71) [transport-netty4-client-2.0.1.jar:2.0.1]
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379) [netty-transport-4.1.73.Final.jar:4.1.73.Final]
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365) [netty-transport-4.1.73.Final.jar:4.1.73.Final]
	at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:357) [netty-transport-4.1.73.Final.jar:4.1.73.Final]
	at io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:103) [netty-codec-4.1.73.Final.jar:4.1.73.Final]
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379) [netty-transport-4.1.73.Final.jar:4.1.73.Final]
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365) [netty-transport-4.1.73.Final.jar:4.1.73.Final]
	at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:357) [netty-transport-4.1.73.Final.jar:4.1.73.Final]
	at io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:103) [netty-codec-4.1.73.Final.jar:4.1.73.Final]
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379) [netty-transport-4.1.73.Final.jar:4.1.73.Final]
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365) [netty-transport-4.1.73.Final.jar:4.1.73.Final]
	at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:357) [netty-transport-4.1.73.Final.jar:4.1.73.Final]
	at io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:103) [netty-codec-4.1.73.Final.jar:4.1.73.Final]
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379) [netty-transport-4.1.73.Final.jar:4.1.73.Final]
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365) [netty-transport-4.1.73.Final.jar:4.1.73.Final]
	at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:357) [netty-transport-4.1.73.Final.jar:4.1.73.Final]
	at io.netty.handler.codec.ByteToMessageDecoder.fireChannelRead(ByteToMessageDecoder.java:327) [netty-codec-4.1.73.Final.jar:4.1.73.Final]
	at io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:299) [netty-codec-4.1.73.Final.jar:4.1.73.Final]
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379) [netty-transport-4.1.73.Final.jar:4.1.73.Final]
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365) [netty-transport-4.1.73.Final.jar:4.1.73.Final]
	at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:357) [netty-transport-4.1.73.Final.jar:4.1.73.Final]
	at io.netty.handler.timeout.IdleStateHandler.channelRead(IdleStateHandler.java:286) [netty-handler-4.1.73.Final.jar:4.1.73.Final]
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379) [netty-transport-4.1.73.Final.jar:4.1.73.Final]
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365) [netty-transport-4.1.73.Final.jar:4.1.73.Final]
	at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:357) [netty-transport-4.1.73.Final.jar:4.1.73.Final]
	at io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:103) [netty-codec-4.1.73.Final.jar:4.1.73.Final]
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379) [netty-transport-4.1.73.Final.jar:4.1.73.Final]
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365) [netty-transport-4.1.73.Final.jar:4.1.73.Final]
	at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:357) [netty-transport-4.1.73.Final.jar:4.1.73.Final]
	at io.netty.handler.ssl.SslHandler.unwrap(SslHandler.java:1371) [netty-handler-4.1.73.Final.jar:4.1.73.Final]
	at io.netty.handler.ssl.SslHandler.decodeJdkCompatible(SslHandler.java:1234) [netty-handler-4.1.73.Final.jar:4.1.73.Final]
	at io.netty.handler.ssl.SslHandler.decode(SslHandler.java:1283) [netty-handler-4.1.73.Final.jar:4.1.73.Final]
	at io.netty.handler.codec.ByteToMessageDecoder.decodeRemovalReentryProtection(ByteToMessageDecoder.java:510) [netty-codec-4.1.73.Final.jar:4.1.73.Final]
	at io.netty.handler.codec.ByteToMessageDecoder.callDecode(ByteToMessageDecoder.java:449) [netty-codec-4.1.73.Final.jar:4.1.73.Final]
	at io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:279) [netty-codec-4.1.73.Final.jar:4.1.73.Final]
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379) [netty-transport-4.1.73.Final.jar:4.1.73.Final]
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365) [netty-transport-4.1.73.Final.jar:4.1.73.Final]
	at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:357) [netty-transport-4.1.73.Final.jar:4.1.73.Final]
	at io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1410) [netty-transport-4.1.73.Final.jar:4.1.73.Final]
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379) [netty-transport-4.1.73.Final.jar:4.1.73.Final]
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365) [netty-transport-4.1.73.Final.jar:4.1.73.Final]
	at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:919) [netty-transport-4.1.73.Final.jar:4.1.73.Final]
	at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:166) [netty-transport-4.1.73.Final.jar:4.1.73.Final]
	at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:722) [netty-transport-4.1.73.Final.jar:4.1.73.Final]
	at io.netty.channel.nio.NioEventLoop.processSelectedKeysPlain(NioEventLoop.java:623) [netty-transport-4.1.73.Final.jar:4.1.73.Final]
	at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:586) [netty-transport-4.1.73.Final.jar:4.1.73.Final]
	at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:496) [netty-transport-4.1.73.Final.jar:4.1.73.Final]
	at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:986) [netty-common-4.1.73.Final.jar:4.1.73.Final]
	at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74) [netty-common-4.1.73.Final.jar:4.1.73.Final]
	at java.lang.Thread.run(Thread.java:833) [?:?]

Any advice would be greatly appreciated.

what version of opensearch if you upgraded to latest 2.1.0 version ? theres few saml bugs in that latest version

This cluster is on 2.0.1. Sorry, forgot to include that in my original message.

@tfmm /usr/share/opensearch/plugins/opensearch-security/securityconfig/ is no longer valid for version 2.x. The correct path is /usr/share/opensearch/config/opensearch-security/

This path is consistent with the other, working containers, and this cluster was created on version 2.0.1. I can attempt to update the path, but I do not see this being the issue.

Also, the likely reason this worked for me is that when I have run securityadmin.sh, I provided the -cd flag with the actual directory where the files are stored.

I moved the files to the suggested location, and re-ran securityadmin.sh, there is no change in behavior.

@tfmm Just to confirm, did you run an upgrade from 2.0.1 to 2.1.0?

No, no upgrade or any other change to the cluster was performed, only stopping the container, removing the original container host, creating a new container host and starting the container.

@tfmm Thanks for the clarification. Does your cluster get green after node restart?
Do you see all of the nodes? Have you made any changes to the opensearch.yml file?

@pablo Yes, cluster goes green with all nodes present. No changes to opensearch.yml. I did test copying the opensearch.yml from one of the working nodes to this one as a test, but no change in behavior when doing that.

@tfmm Have you noticed any errors during the startup of the node and security plugin initialization?

No, logs all look normal, and seem to show standard container and plugin startup. In fact, there are no error-level messages in the container log at all, even when auth fails, just the warning level messages that I posted above.