"Connection reset by peer" error messages

I’m seeing “connection reset by peer” error messages appearing in my Elasticsearch log frequently. In a newly restaged deployment on Kubernetes, it’s occurring every 30 seconds. My assumption has been that the connection in question was from Fluent Bit as it was attempting to send new log messages to Elasticsearch. But I don’t see any messages (errors or otherwise) on the Fluent Bit side, so now I’m questioning the assumption.

I’ve included an example of the error message and associated stack trace below. Can someone verify that this is most likely related to the Fluent Bit connection? All of the errors are coming from ES client nodes. I had 2 client nodes and upped it to 3 but it hasn’t reduced the number of errors. Fluent Bit is the only thing feeding documents to ES.

I’ve deployed with the sample demo security enabled including the demo certs. Could that (or TLS, in general) be a factor here?

[2020-04-03T03:06:41,898][ERROR][c.a.o.s.s.h.n.OpenDistroSecuritySSLNettyHttpServerTransport] [v4m-es-client-5648c4cb49-kf84r] Exception during establishing a SSL connection: java.io.IOException: Connection reset by peer
java.io.IOException: Connection reset by peer
        at sun.nio.ch.FileDispatcherImpl.read0(Native Method) ~[?:?]
        at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39) ~[?:?]
        at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:276) ~[?:?]
        at sun.nio.ch.IOUtil.read(IOUtil.java:233) ~[?:?]
        at sun.nio.ch.IOUtil.read(IOUtil.java:223) ~[?:?]
        at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:358) ~[?:?]
        at org.elasticsearch.transport.CopyBytesSocketChannel.readFromSocketChannel(CopyBytesSocketChannel.java:137) ~[transport-netty4-client-7.4.2.jar:7.4.2]
        at org.elasticsearch.transport.CopyBytesSocketChannel.doReadBytes(CopyBytesSocketChannel.java:122) ~[transport-netty4-client-7.4.2.jar:7.4.2]
        at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:148) [netty-transport-4.1.38.Final.jar:4.1.38.Final]
        at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:697) [netty-transport-4.1.38.Final.jar:4.1.38.Final]
        at io.netty.channel.nio.NioEventLoop.processSelectedKeysPlain(NioEventLoop.java:597) [netty-transport-4.1.38.Final.jar:4.1.38.Final]
        at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:551) [netty-transport-4.1.38.Final.jar:4.1.38.Final]
        at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:511) [netty-transport-4.1.38.Final.jar:4.1.38.Final]
        at io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:918) [netty-common-4.1.38.Final.jar:4.1.38.Final]
        at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74) [netty-common-4.1.38.Final.jar:4.1.38.Final]
        at java.lang.Thread.run(Thread.java:835) [?:?]

In another deployment where I’m seeing the same “connection reset by peer” error messages in the ES log, I do see Fluent Bit messages that seem to line up with the ES errors (as shown in the following screenshot):

The ES_ERROR_PLOT shows instances of the “Connection reset by peer” message from the ES log, the FB_FAILED_PLOT shows instances of “failed to flush chunk” messages in the Fluent Bit log, and the FB_SUCCESS_PLOT shows instances of “succeeded at retry” messages in the Fluent Bit log (which are generated when a chunk was successfully sent to ES after previously failing).

So, things seem “related” even if I can’t say one caused the other. As mentioned in the original post, I have seen instances of either error message without the corresponding one on the other side.

I did my original prototyping work with the Elastic distribution of ES and didn’t see these errors (although I never went looking for them). Since the stack trace includes references to ODFE components, could these communication issues be signs of an issue with the ODFE security plugin? Hmmm, I suppose I also didn’t have TLS enabled then either. Any thoughts on what may be causing the connection issues?

Any solution on this?
We have a similar issue with Opendistro for ES version 0.10.0

Sorry, @atorelli this problem stopped for me, but I do not remember why. I don’t believe I changed any connection configuration settings. But I may have improved the parsing I was doing in Fluent Bit to make sure the data going to Elasticsearch was formatted more consistently.

This exception usually means that you have written to a connection that had already been closed by the peer; in other words, an application protocol error. “Connection reset” simply means that a TCP RST was received. A TCP RST packet is the remote side telling you that it does not recognize the connection on which the previous TCP packet was sent: maybe the connection has been closed, maybe the port is not open, or something similar. A reset packet is simply one with no payload and with the RST bit set in the TCP header flags.

The following are possible causes for the error:

  • Most commonly, it is caused by writing to a connection that the other end has already closed normally; in other words, an application protocol error.

  • It can also be caused by closing a socket when there is unread data in the socket receive buffer.

  • The TCP (Transmission Control Protocol) socket is closed because the socket received a close command from a remote machine.

  • Sometimes this can also be due to heavy load causing the server to queue the message, so that the request times out at the client end before the server can read it. So you can also check server health and logs for excessive load causing this error.
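
If it helps to see what a reset looks like from the receiving side, here is a minimal Python sketch that reproduces it on the loopback interface. It is only an illustration and has nothing to do with Elasticsearch or Fluent Bit internals. The function name `provoke_reset` is made up for this example, and the sketch assumes a Linux or macOS host, where the `SO_LINGER` option value is a pair of C ints. Setting a zero linger time is used here simply as a convenient way to make `close()` send a RST instead of a normal FIN.

```python
import socket
import struct


def provoke_reset() -> str:
    """Return "reset" if the client observes a TCP RST, else "no reset".

    Hypothetical helper for illustration only.
    """
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))  # port 0: let the OS pick a free port
    listener.listen(1)

    client = socket.create_connection(listener.getsockname())
    peer, _ = listener.accept()
    try:
        # Assumption: Linux/macOS layout of struct linger (two C ints).
        # l_onoff=1, l_linger=0 makes close() abort the connection,
        # which sends a RST rather than a graceful FIN.
        peer.setsockopt(socket.SOL_SOCKET, socket.SO_LINGER,
                        struct.pack("ii", 1, 0))
        peer.close()

        client.settimeout(5)  # avoid hanging forever if no RST arrives
        try:
            client.recv(1024)
        except ConnectionResetError:
            # This is the Python equivalent of
            # "java.io.IOException: Connection reset by peer".
            return "reset"
        return "no reset"
    finally:
        client.close()
        listener.close()


if __name__ == "__main__":
    print(provoke_reset())
```

The side that receives the RST (the client here, the ES client node in the log above) is the one that reports the error, which is why the other side may log nothing at all.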