Hello OpenSearch Team!
Our data node Docker containers seem to be crashing randomly after the kernel gets an invalid number. It looks like a bug because the invalid number is 2^64-1, which suggests we are hitting some kind of limit somewhere.
This has happened 3 times since 2021-12-17 at seemingly random times.
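For reference, here is a quick Python check of the 2^64-1 observation. It only confirms the arithmetic: the logged value is the largest unsigned 64-bit integer, which is also what -1 looks like when reinterpreted as an unsigned 64-bit value. It does not tell us which component produced the number; the variable names are just for illustration.

```python
# Value reported in /var/log/messages, copied verbatim from the log line.
reported = 18446744073709551615

# Largest value an unsigned 64-bit integer can hold.
u64_max = 2**64 - 1

# -1 reinterpreted as an unsigned 64-bit integer (two's complement).
minus_one_as_u64 = -1 & 0xFFFFFFFFFFFFFFFF

print(reported == u64_max)           # True
print(reported == minus_one_as_u64)  # True
```

So the value could be an "unlimited" sentinel or a -1 read as unsigned, but that is only a guess at this point.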
Here is some information on our setup:
Our ELK stack is running on 3 servers, each with 196G of RAM and 2 Intel Xeon(R) Gold 6226R CPUs.
On each server we have the following docker setup:
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
956073c41d09 opensearchproject/opensearch-dashboards:1.1.0 "./opensearch-dashbo…" 3 days ago Up 3 days 0.0.0.0:5601->5601/tcp ODE-KIBANA-003
1ecdbd3f560f opensearchproject/opensearch:1.1.0 "./opensearch-docker…" 3 days ago Up 3 days 9300/tcp, 9600/tcp, 0.0.0.0:9200->9200/tcp, 9650/tcp ODE-INGEST-003
dae019941665 opensearchproject/opensearch:1.1.0 "./opensearch-docker…" 3 days ago Up 3 days 9200/tcp, 9300/tcp, 9600/tcp, 9650/tcp ODE-DATA-005
7dbc8b5e5f54 opensearchproject/opensearch:1.1.0 "./opensearch-docker…" 3 days ago Up 3 days 9200/tcp, 9300/tcp, 9600/tcp, 9650/tcp ODE-DATA-006
38c695f8e8ad opensearchproject/opensearch:1.1.0 "./opensearch-docker…" 3 days ago Up 7 hours 9200/tcp, 9300/tcp, 9600/tcp, 9650/tcp ODE-DATA-016
c39786c80596 opensearchproject/opensearch:1.1.0 "./opensearch-docker…" 3 days ago Up 3 days 9300/tcp, 9650/tcp, 0.0.0.0:9600->9600/tcp, 0.0.0.0:9201->9200/tcp ODE-MASTER-003
We have 3 data nodes on each server. The containers for the data nodes run with 40G of RAM for the hot SSD volume (ODE-DATA-005) and 50G of RAM each for the 2 warm HDD volumes (ODE-DATA-006 and ODE-DATA-016).
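To put those numbers side by side, here is a rough sketch of the memory budget per server. It assumes the 40G/50G figures are the per-container memory allocations and ignores the master, ingest and dashboards containers, whose sizes are not listed here.

```python
# Host RAM per server, in GB (from the hardware description above).
host_ram_gb = 196

# Memory given to each data node container, in GB (assumed to be container allocations).
data_node_ram_gb = {
    "ODE-DATA-005": 40,  # hot SSD volume
    "ODE-DATA-006": 50,  # warm HDD volume
    "ODE-DATA-016": 50,  # warm HDD volume
}

total_data_gb = sum(data_node_ram_gb.values())
remaining_gb = host_ram_gb - total_data_gb

print(total_data_gb)  # 140
print(remaining_gb)   # 56
```

That leaves 56G for the other three containers, the OS and the page cache, under the assumptions above.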
Here is what we can see in /var/log/messages:
grep "Jan 24 10:23:10 <HOSTNAME> 6: " -A 50 /var/log/messages
Jan 24 10:23:10 <HOSTNAME> 6: invalid number '18446744073709551615'
Jan 24 10:23:10 <HOSTNAME> docker-compose: #033[32mODE-DATA-016 |#033[0m Killing opensearch process 34
Jan 24 10:23:10 <HOSTNAME> docker-compose: #033[32mODE-DATA-016 |#033[0m [2022-01-24T15:23:10,907][INFO ][o.o.n.Node ] [ODE-DATA-016] stopping ...
Jan 24 10:23:10 <HOSTNAME> docker-compose: #033[32mODE-DATA-016 |#033[0m [2022-01-24T15:23:10,907][INFO ][o.o.s.a.r.AuditMessageRouter] [ODE-DATA-016] Closing AuditMessageRouter
Jan 24 10:23:10 <HOSTNAME> docker-compose: #033[32mODE-DATA-016 |#033[0m [2022-01-24T15:23:10,910][INFO ][o.o.s.a.s.SinkProvider] [ODE-DATA-016] Closing InternalOpenSearchSink
Jan 24 10:23:10 <HOSTNAME> docker-compose: #033[32mODE-DATA-016 |#033[0m [2022-01-24T15:23:10,910][INFO ][o.o.s.a.s.SinkProvider] [ODE-DATA-016] Closing DebugSink
Jan 24 10:23:10 <HOSTNAME> docker-compose: #033[32mODE-DATA-016 |#033[0m [2022-01-24T15:23:10,915][INFO ][o.o.c.c.Coordinator ] [ODE-DATA-016] master node [{ODE-MASTER-001}{CP03cx4hQLSbmwQykxxG9A}{nj_PyoEiSQ2yEExdoZ4SQg}{ODE-MASTER-001}{10.0.1.5:9300}{m}{host=<MASTER>}] failed, restarting discovery
Jan 24 10:23:10 <HOSTNAME> docker-compose: #033[32mODE-DATA-016 |#033[0m org.opensearch.transport.NodeDisconnectedException: [ODE-MASTER-001][10.0.1.5:9300][disconnected] disconnected
Jan 24 10:23:12 <HOSTNAME> docker-compose: #033[32mODE-DATA-016 |#033[0m [2022-01-24T15:23:12,217][INFO ][o.o.n.Node ] [ODE-DATA-016] stopped
Jan 24 10:23:12 <HOSTNAME> docker-compose: #033[32mODE-DATA-016 |#033[0m [2022-01-24T15:23:12,217][INFO ][o.o.n.Node ] [ODE-DATA-016] closing ...
Jan 24 10:23:12 <HOSTNAME> docker-compose: #033[32mODE-DATA-016 |#033[0m [2022-01-24T15:23:12,223][INFO ][o.o.s.a.i.AuditLogImpl] [ODE-DATA-016] Closing AuditLogImpl
Jan 24 10:23:12 <HOSTNAME> docker-compose: #033[32mODE-DATA-016 |#033[0m [2022-01-24T15:23:12,229][INFO ][o.o.n.Node ] [ODE-DATA-016] closed
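The `#033[32m` and `#033[0m` fragments in the lines above are presumably the ANSI color codes from docker-compose with the escape byte written out by syslog as `#033`. A small sketch for stripping them so the excerpt is easier to read or grep; the helper name `clean` and the regular expression are my own and only cover the color sequences seen in this excerpt.

```python
import re

# Matches an escaped ANSI color sequence such as "#033[32m" or "#033[0m".
ANSI_ESCAPED = re.compile(r"#033\[[0-9;]*m")


def clean(line: str) -> str:
    """Return the log line with escaped ANSI color sequences removed."""
    return ANSI_ESCAPED.sub("", line)


sample = (
    "Jan 24 10:23:10 <HOSTNAME> docker-compose: "
    "#033[32mODE-DATA-016 |#033[0m Killing opensearch process 34"
)
print(clean(sample))
# Jan 24 10:23:10 <HOSTNAME> docker-compose: ODE-DATA-016 | Killing opensearch process 34
```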
I do not see any info on this error in the interweebs. So I am wondering if we are hitting a new bug here?
Any info would be appreciated!