Hi, my company uses ELK for log observability. I am testing OpenSearch as a potential replacement for Elasticsearch, but I have run into an OpenSearch performance issue and a 502 Bad Gateway error from OpenSearch Dashboards.
I am sending the same syslog data (same volume and data source) to an Elasticsearch (v7.13.4) cluster and an OpenSearch (v1.2.4) cluster. Both clusters run in a Kubernetes environment. Elasticsearch is deployed using the Elasticsearch operator, and OpenSearch is deployed using a Helm chart (v1.5.4). The syslog index in both clusters has 2 primary shards and 2 replica shards. Pods have the same CPU and memory limits in Kubernetes and the same JVM memory settings. I use Traefik (v2.5.4) on both clusters to manage the traffic.
With Elasticsearch, Kibana can show data from the last 12 months, which is 100+ billion log entries. It takes Kibana more than 2 minutes to load the data in the Discover tab.
However, OpenSearch Dashboards returns a Bad Gateway error when a query runs for more than 120 seconds. If I use a shorter time window, e.g. 3 days, it works fine. I use the elastic-exporter to monitor OpenSearch. I don’t see the OpenSearch pods hitting any CPU or memory limit. I doubled the CPU and memory limits in Kubernetes, but it didn’t help.
Search Error
Bad Gateway
_construct@http://hostname.nip.io/1/bundles/core/core.entry.js:6:4859
Wrapper@http://hostname.nip.io/1/bundles/core/core.entry.js:6:4249
_createSuperInternal@http://hostname.nip.io/1/bundles/core/core.entry.js:6:3388
HttpFetchError@http://hostname.nip.io/1/bundles/core/core.entry.js:6:6016
_callee3$@http://hostname.nip.io/1/bundles/core/core.entry.js:6:59862
tryCatch@http://hostname.nip.io/1/bundles/plugin/queryWorkbenchDashboards/queryWorkbenchDashboards.plugin.js:1:33827
invoke@http://hostname.nip.io/1/bundles/plugin/queryWorkbenchDashboards/queryWorkbenchDashboards.plugin.js:1:37786
defineIteratorMethods/</<@http://hostname.nip.io/1/bundles/plugin/queryWorkbenchDashboards/queryWorkbenchDashboards.plugin.js:1:34966
fetch_asyncGeneratorStep@http://hostname.nip.io/1/bundles/core/core.entry.js:6:52965
_next@http://hostname.nip.io/1/bundles/core/core.entry.js:6:53305
Below are the logs from the OpenSearch Dashboards pod.
{"type":"log","@timestamp":"2022-03-21T03:15:12Z","tags":["error","opensearch","data"],"pid":1,"message":"[RequestAbortedError]: Request aborted"}
{"type":"response","@timestamp":"2022-03-21T03:13:12Z","tags":[],"pid":1,"method":"post","statusCode":200,"req":{"url":"/internal/search/opensearch","method":"post","headers":{"host":"hostname.nip.io","user-agent":"Mozilla/5.0 (X11; Fedora; Linux x86_64; rv:98.0) Gecko/20100101 Firefox/98.0","content-length":"851","accept":"*/*","accept-encoding":"gzip, deflate","accept-language":"en-US,en;q=0.5","content-type":"application/json","origin":"http://hostname.nip.io","osd-version":"1.2.0","referer":"http://hostname.nip.io/app/discover","x-forwarded-for":"10.42.1.2","x-forwarded-host":"hostname.nip.io","x-forwarded-port":"80","x-forwarded-proto":"http","x-forwarded-server":"traefik-5dd85d7db6-v47j2","x-real-ip":"10.42.1.2"},"remoteAddress":"10.42.2.4","userAgent":"Mozilla/5.0 (X11; Fedora; Linux x86_64; rv:98.0) Gecko/20100101 Firefox/98.0","referer":"http://hostname.nip.io/app/discover"},"res":{"statusCode":200,"responseTime":120006,"contentLength":9},"message":"POST /internal/search/opensearch 200 120006ms - 9.0B"}
And the logs from the Traefik pod:
[21/Mar/2022:03:13:12 +0000] "POST /internal/search/opensearch HTTP/1.1" 502 11 "-" "-" 2271447 "opensearch-dashboard-ad6f2cbe4eb4a634982b@kubernetescrd" "http://10.42.4.192:5601" 120003ms
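To line up the two log entries above, here is a small Python sketch that pulls the status and duration out of each. The log strings are trimmed copies of the lines above (only the fields used here are kept), and the helper names are my own, not part of any tool.

```python
import json
import re

# Trimmed copies of the two log entries above (only the fields used here).
dashboards_log = (
    '{"type":"response","@timestamp":"2022-03-21T03:13:12Z",'
    '"res":{"statusCode":200,"responseTime":120006,"contentLength":9},'
    '"message":"POST /internal/search/opensearch 200 120006ms - 9.0B"}'
)
traefik_log = (
    '[21/Mar/2022:03:13:12 +0000] "POST /internal/search/opensearch HTTP/1.1" '
    '502 11 "-" "-" 2271447 '
    '"opensearch-dashboard-ad6f2cbe4eb4a634982b@kubernetescrd" '
    '"http://10.42.4.192:5601" 120003ms'
)


def dashboards_status_and_ms(line):
    """Return (statusCode, responseTime in ms) from a Dashboards response log."""
    res = json.loads(line)["res"]
    return res["statusCode"], res["responseTime"]


def traefik_status_and_ms(line):
    """Return (status, duration in ms) from a Traefik access log line."""
    # The status follows the quoted request; the duration is the trailing "<n>ms".
    status = re.search(r'HTTP/\d\.\d" (\d{3}) ', line)
    duration = re.search(r"(\d+)ms$", line)
    return int(status.group(1)), int(duration.group(1))


osd_status, osd_ms = dashboards_status_and_ms(dashboards_log)
traefik_status, traefik_ms = traefik_status_and_ms(traefik_log)
print(osd_status, osd_ms, traefik_status, traefik_ms)
```

Both entries end within a few milliseconds of 120000 ms, and the Dashboards pod logs a 200 with a 9-byte body while Traefik reports a 502 for the same request. To me that looks like a fixed 120-second timeout somewhere in the chain rather than a resource limit, but I have not confirmed which component owns it.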
I would like to know whether the Bad Gateway error is caused by OpenSearch, OpenSearch Dashboards, or Traefik.
And how can I improve the performance of OpenSearch?
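One way I could narrow down the first question is to time a search over the same time range directly against OpenSearch, bypassing both Dashboards and Traefik. Below is a minimal Python sketch of that idea. The base URL (for example a `kubectl port-forward` to the OpenSearch service), the `syslog-*` index pattern, and the `@timestamp` field name are all assumptions about my setup, and authentication and TLS handling are left out. It is also a count-only query, not the exact request that Discover sends.

```python
import json
import time
import urllib.request


def build_search_body(gte="now-12M", time_field="@timestamp"):
    """Count-only search over a relative time range (no documents returned).

    time_field is an assumption; adjust it to the index's timestamp field.
    """
    return {
        "size": 0,
        "track_total_hits": True,
        "query": {"range": {time_field: {"gte": gte, "lte": "now"}}},
    }


def time_search(base_url, index, body, timeout_s=600):
    """POST body to <base_url>/<index>/_search and return (status, seconds).

    The client timeout is deliberately much longer than 120 s so that the
    client itself does not cut the request off first.
    """
    req = urllib.request.Request(
        f"{base_url}/{index}/_search",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    start = time.monotonic()
    with urllib.request.urlopen(req, timeout=timeout_s) as resp:
        resp.read()
        return resp.status, time.monotonic() - start
```

For example, after port-forwarding, `time_search("http://localhost:9200", "syslog-*", build_search_body())` would time the 12-month range, and `build_search_body("now-3d")` the 3-day range. If the long query completes here after more than 120 seconds, the cutoff is more likely in Dashboards or Traefik than in OpenSearch itself.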
Thank you