OpenSearch restarts on OpenShift on-premises

Hi,
I’m running OpenSearch with 3 master and 4 data nodes on an on-premises Red Hat OpenShift cluster with Red Hat CoreOS, and my containers receive signals that cause a huge number of restarts. I can’t see anything unusual, e.g. out-of-memory events, in the kernel or system logs, and my nodes have enough resources.
The container logs only show some info-level messages, and then at some random point, roughly every 5 to 10 minutes, they show “Killing opensearch process 10”.
Troubleshooting so far showed that the container receives a SIGCHLD signal. To verify this, I modified the opensearch-docker-entrypoint.sh file: at line 105 I split the trap into four traps, copied the terminateProcess function for each signal type, and added an echo so that the signal type shows up in my container logs. This way I can say for sure that the container is terminated by a SIGCHLD signal, but I can’t tell which process sends this signal.
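The change can be sketched roughly as follows. This is a minimal illustration, not the actual diff: the names on_signal, terminate_children, install_traps and CHILD_PIDS are made up for the example, and the list of four signals (TERM, INT, EXIT, CHLD) is an assumption about what the original trap covers. It uses a single shared handler that takes the signal name as an argument instead of four copies of the function.

```shell
#!/usr/bin/env bash
# Sketch only: all names here are illustrative, not the upstream script's.

# Hypothetical list of child PIDs the entrypoint would have started.
CHILD_PIDS=()

# Stand-in for the entrypoint's terminate function: stop every child
# that is still alive and wait for it to exit.
terminate_children() {
    local pid
    for pid in "${CHILD_PIDS[@]}"; do
        if kill -0 "$pid" 2>/dev/null; then
            echo "Killing process $pid"
            kill -TERM "$pid"
            wait "$pid"
        fi
    done
}

# Log which signal fired, then run the shared shutdown logic.
on_signal() {
    echo "Caught signal: $1"
    terminate_children
}

# One trap per signal instead of a single combined trap.
# The signal list is an assumption.
install_traps() {
    local sig
    for sig in TERM INT EXIT CHLD; do
        trap "on_signal $sig" "$sig"
    done
}
```

Calling install_traps before the child processes are started should then make every shutdown print the name of the signal that triggered it.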
I also commented out the startup of the Performance Analyzer and OpenSearch processes. Even then, the problem still occurs.
As far as I can see from the Dockerfile, the OpenSearch image is based on amazonlinux:2. If we run the entrypoint script on a CentOS-based image, the problem does not occur.
The problem started about 10 days ago with OpenSearch 1.3.2. In the meantime I upgraded to version 2.0.0, but the problem still exists.

Do you have any tips or suggestions to resolve this problem?

Kind regards

Dominik

One of my colleagues ran strace against the entrypoint process. Maybe this helps to debug the problem:

[root@xxxxx-xxxx-xxxx /]# strace -fp 3232461
strace: Process 3232461 attached
wait4(-1, [{WIFSIGNALED(s) && WTERMSIG(s) == SIGTERM}], WSTOPPED|WCONTINUED, NULL) = 739
rt_sigprocmask(SIG_BLOCK, NULL, [CHLD], 8) = 0
rt_sigprocmask(SIG_BLOCK, NULL, [CHLD], 8) = 0
openat(AT_FDCWD, "/dev/null", O_WRONLY|O_CREAT|O_TRUNC, 0666) = 3
fcntl(1, F_GETFD)                       = 0
fcntl(1, F_DUPFD, 10)                   = 10
fcntl(1, F_GETFD)                       = 0
fcntl(10, F_SETFD, FD_CLOEXEC)          = 0
dup2(3, 1)                              = 1
close(3)                                = 0
fcntl(2, F_DUPFD, 10)                   = 11
fcntl(2, F_GETFD)                       = 0
fcntl(11, F_SETFD, FD_CLOEXEC)          = 0
dup2(1, 2)                              = 2
kill(10, 0)                             = 0
dup2(11, 2)                             = 2
fcntl(11, F_GETFD)                      = 0x1 (flags FD_CLOEXEC)
close(11)                               = 0
dup2(10, 1)                             = 1
fcntl(10, F_GETFD)                      = 0x1 (flags FD_CLOEXEC)
close(10)                               = 0
write(1, "Killing opensearch process 10\n", 30) = 30
kill(10, SIGTERM)                       = 0
rt_sigprocmask(SIG_BLOCK, NULL, [CHLD], 8) = 0
rt_sigprocmask(SIG_BLOCK, [CHLD], [CHLD], 8) = 0
rt_sigprocmask(SIG_SETMASK, [CHLD], NULL, 8) = 0
rt_sigprocmask(SIG_BLOCK, [CHLD], [CHLD], 8) = 0
wait4(-1, [{WIFEXITED(s) && WEXITSTATUS(s) == 143}], WSTOPPED|WCONTINUED, NULL) = 10
rt_sigprocmask(SIG_BLOCK, [CHLD TSTP TTIN TTOU], [CHLD], 8) = 0
ioctl(2, TIOCSPGRP, [1])                = -1 ENOTTY (Inappropriate ioctl for device)
rt_sigprocmask(SIG_SETMASK, [CHLD], NULL, 8) = 0
rt_sigprocmask(SIG_SETMASK, [CHLD], NULL, 8) = 0
rt_sigprocmask(SIG_BLOCK, [CHLD], [CHLD], 8) = 0
rt_sigprocmask(SIG_SETMASK, [CHLD], NULL, 8) = 0
openat(AT_FDCWD, "/dev/null", O_WRONLY|O_CREAT|O_TRUNC, 0666) = 3
fcntl(1, F_GETFD)                       = 0
fcntl(1, F_DUPFD, 10)                   = 10
fcntl(1, F_GETFD)                       = 0
fcntl(10, F_SETFD, FD_CLOEXEC)          = 0
dup2(3, 1)                              = 1
close(3)                                = 0
fcntl(2, F_DUPFD, 10)                   = 11
fcntl(2, F_GETFD)                       = 0
fcntl(11, F_SETFD, FD_CLOEXEC)          = 0
dup2(1, 2)                              = 2
kill(11, 0)                             = 0
dup2(11, 2)                             = 2
fcntl(11, F_GETFD)                      = 0x1 (flags FD_CLOEXEC)
close(11)                               = 0
dup2(10, 1)                             = 1
fcntl(10, F_GETFD)                      = 0x1 (flags FD_CLOEXEC)
close(10)                               = 0
write(1, "Killing performance analyzer pro"..., 40) = 40
kill(11, SIGTERM)                       = 0
rt_sigprocmask(SIG_BLOCK, NULL, [CHLD], 8) = 0
rt_sigprocmask(SIG_BLOCK, [CHLD], [CHLD], 8) = 0
rt_sigprocmask(SIG_SETMASK, [CHLD], NULL, 8) = 0
rt_sigprocmask(SIG_BLOCK, [CHLD], [CHLD], 8) = 0
wait4(-1, [{WIFEXITED(s) && WEXITSTATUS(s) == 143}], WSTOPPED|WCONTINUED, NULL) = 11
rt_sigprocmask(SIG_BLOCK, [CHLD TSTP TTIN TTOU], [CHLD], 8) = 0
ioctl(2, TIOCSPGRP, [1])                = -1 ENOTTY (Inappropriate ioctl for device)
rt_sigprocmask(SIG_SETMASK, [CHLD], NULL, 8) = 0
rt_sigprocmask(SIG_SETMASK, [CHLD], NULL, 8) = 0
rt_sigprocmask(SIG_BLOCK, [CHLD], [CHLD], 8) = 0
rt_sigprocmask(SIG_SETMASK, [CHLD], NULL, 8) = 0
rt_sigprocmask(SIG_BLOCK, [CHLD TSTP TTIN TTOU], [CHLD], 8) = 0
ioctl(2, TIOCSPGRP, [1])                = -1 ENOTTY (Inappropriate ioctl for device)
rt_sigprocmask(SIG_SETMASK, [CHLD], NULL, 8) = 0
rt_sigprocmask(SIG_BLOCK, [CHLD], [CHLD], 8) = 0
rt_sigprocmask(SIG_SETMASK, [CHLD], NULL, 8) = 0
rt_sigprocmask(SIG_BLOCK, [CHLD], [CHLD], 8) = 0
rt_sigprocmask(SIG_SETMASK, [CHLD], NULL, 8) = 0
rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
--- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_KILLED, si_pid=739, si_uid=1000, si_status=SIGTERM, si_utime=0, si_stime=0} ---
wait4(-1, 0x7ffef904e4d0, WNOHANG|WSTOPPED|WCONTINUED, NULL) = -1 ECHILD (No child processes)
rt_sigreturn({mask=[]})                 = 0
rt_sigprocmask(SIG_BLOCK, [CHLD], [], 8) = 0
rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
write(1, "OpenSearch exited with code 143\n", 32) = 32
rt_sigprocmask(SIG_BLOCK, NULL, [], 8)  = 0
rt_sigprocmask(SIG_BLOCK, [CHLD], [], 8) = 0
rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
write(1, "Performance analyzer exited with"..., 42) = 42
rt_sigprocmask(SIG_BLOCK, NULL, [], 8)  = 0
read(255, "", 5396)                     = 0
rt_sigprocmask(SIG_BLOCK, NULL, [], 8)  = 0
rt_sigprocmask(SIG_BLOCK, NULL, [], 8)  = 0
rt_sigprocmask(SIG_BLOCK, NULL, [], 8)  = 0
rt_sigprocmask(SIG_BLOCK, NULL, [], 8)  = 0
openat(AT_FDCWD, "/dev/null", O_WRONLY|O_CREAT|O_TRUNC, 0666) = 3
fcntl(1, F_GETFD)                       = 0
fcntl(1, F_DUPFD, 10)                   = 10
fcntl(1, F_GETFD)                       = 0
fcntl(10, F_SETFD, FD_CLOEXEC)          = 0
dup2(3, 1)                              = 1
close(3)                                = 0
fcntl(2, F_DUPFD, 10)                   = 11
fcntl(2, F_GETFD)                       = 0
fcntl(11, F_SETFD, FD_CLOEXEC)          = 0
dup2(1, 2)                              = 2
kill(10, 0)                             = -1 ESRCH (No such process)
fstat(2, {st_mode=S_IFCHR|0666, st_rdev=makedev(0x1, 0x3), ...}) = 0
ioctl(2, TCGETS, 0x7ffef904ef40)        = -1 ENOTTY (Inappropriate ioctl for device)
write(2, "./opensearch-docker-entrypoint.s"..., 73) = 73
dup2(11, 2)                             = 2
fcntl(11, F_GETFD)                      = 0x1 (flags FD_CLOEXEC)
close(11)                               = 0
dup2(10, 1)                             = 1
fcntl(10, F_GETFD)                      = 0x1 (flags FD_CLOEXEC)
close(10)                               = 0
openat(AT_FDCWD, "/dev/null", O_WRONLY|O_CREAT|O_TRUNC, 0666) = 3
fcntl(1, F_GETFD)                       = 0
fcntl(1, F_DUPFD, 10)                   = 10
fcntl(1, F_GETFD)                       = 0
fcntl(10, F_SETFD, FD_CLOEXEC)          = 0
dup2(3, 1)                              = 1
close(3)                                = 0
fcntl(2, F_DUPFD, 10)                   = 11
fcntl(2, F_GETFD)                       = 0
fcntl(11, F_SETFD, FD_CLOEXEC)          = 0
dup2(1, 2)                              = 2
kill(11, 0)                             = -1 ESRCH (No such process)
write(2, "./opensearch-docker-entrypoint.s"..., 73) = 73
dup2(11, 2)                             = 2
fcntl(11, F_GETFD)                      = 0x1 (flags FD_CLOEXEC)
close(11)                               = 0
dup2(10, 1)                             = 1
fcntl(10, F_GETFD)                      = 0x1 (flags FD_CLOEXEC)
close(10)                               = 0
exit_group(0)                           = ?
+++ exited with 0 +++
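The SIGCHLD line in the trace carries two useful fields: si_pid is the child whose state change raised the signal, and si_status is the signal or exit code it ended with. A small filter like the one below could pull these out of a longer trace. It is only a sketch: the function name sigchld_children is made up, and the pattern assumes the line format shown above.

```shell
#!/usr/bin/env bash
# Sketch: print "<si_pid> <si_status>" for every SIGCHLD line read from
# stdin. Assumes strace's "--- SIGCHLD {...} ---" line format as seen in
# the trace above; the function name is illustrative.
sigchld_children() {
    sed -n 's/^.*--- SIGCHLD {.*si_pid=\([0-9][0-9]*\),.*si_status=\([A-Za-z0-9]*\),.*$/\1 \2/p'
}
```

A possible use would be to pipe strace's output (which goes to stderr) through it, e.g. `strace -fp 3232461 2>&1 | sigchld_children`, and then look up the reported PID while it is still alive.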

Hello Dominik,

Thank you for bringing this up. We are also running OpenSearch in our Kubernetes cluster and see many container restarts per day with the message “Killing opensearch process 10” out of nowhere.

I followed your advice to split up the trap in the entrypoint script and can confirm that it is triggered by a SIGCHLD signal. I went a step further and removed the OpenSearch and Performance Analyzer processes from the script, and even then it received a SIGCHLD without any running child process.

There is also this thread on your forum, which seems to describe the same issue.

Something weird is going on when running this container image, and it would be great if you could fix this problem. In our case, removing CHLD from the trap in the entrypoint script works for now.
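As a rough illustration of that workaround, here is a minimal sketch under some assumptions: the remaining signal list (TERM, INT, EXIT) is a guess at what is left once CHLD is removed, and the names run_supervised, stop_child and CHILD_PID are made up for the example. Instead of reacting to SIGCHLD, the script waits on the PID of the process it started, so an unrelated SIGCHLD no longer triggers a shutdown.

```shell
#!/usr/bin/env bash
# Sketch of the workaround: trap TERM, INT and EXIT but not CHLD, and
# notice the child exiting by waiting on its PID. Names are illustrative.

CHILD_PID=""

# Stop the child if it is still running.
stop_child() {
    if [ -n "$CHILD_PID" ] && kill -0 "$CHILD_PID" 2>/dev/null; then
        echo "Killing process $CHILD_PID"
        kill -TERM "$CHILD_PID"
        wait "$CHILD_PID"
    fi
}

# Start the given command in the background and wait for it.
run_supervised() {
    trap stop_child TERM INT EXIT   # CHLD deliberately left out
    "$@" &
    CHILD_PID=$!
    local status=0
    # If a trapped signal interrupts wait, status is greater than 128.
    wait "$CHILD_PID" || status=$?
    echo "Child exited with code $status"
    return "$status"
}
```

For example, `run_supervised sleep 30` would block until sleep finishes or until the script receives TERM or INT, in which case stop_child terminates the child.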