Issue running opensearch-benchmark http_logs with an OpenSearch container

Versions (relevant - OpenSearch/Dashboard/Server OS/Browser):

Is there a standard setup for running opensearch-benchmark http_logs against a running OpenSearch Docker container?

Describe the issue:

The benchmark hangs, and according to the log the coordinating node is getting a Python 3.9 timeout error.

Configuration:

Using a client/server topology on the same network; verified that both the port and the IP address are reachable. This configuration works when running nyc_taxis.

docker pull opensearchproject/opensearch:latest

docker run -d -p 9200:9200 -p 9600:9600 -e "discovery.type=single-node" -e "OPENSEARCH_INITIAL_ADMIN_PASSWORD=" opensearchproject/opensearch:latest

opensearch-benchmark execute-test --pipeline=benchmark-only --workload=http_logs --target-host=https://<ip_address>:9200 --client-options=basic_auth_user:admin,basic_auth_password:<auth_passwd>,verify_certs:false --kill-running-processes
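As a quick sanity check from the load-generation host (IP address and password are placeholders matching the command above), a direct health request should answer immediately:

curl --insecure -u admin:<auth_passwd> https://<ip_address>:9200/_cluster/health?pretty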

Relevant Logs or Screenshots:

Hangs on "Running check-cluster-health [ 0% done]"

From the Log:

2025-01-08 18:12:57,596 ActorAddr-(T|:37549)/PID:7258 osbenchmark.worker_coordinator.worker_coordinator INFO iteration-count-based schedule will determine when the schedule for [check-cluster-health] terminates.
2025-01-08 18:13:08,476 -not-actor-/PID:7258 opensearch WARNING GET https://10.16.105.153:9200/_cluster/health?wait_for_status=green&wait_for_no_relocating_shards=true [status:N/A request:10.801s]
Traceback (most recent call last):
File "/home/user/open/lib64/python3.9/site-packages/aiohttp/client_reqrep.py", line 1059, in start
message, payload = await protocol.read() # type: ignore[union-attr]
File "/home/user/open/lib64/python3.9/site-packages/aiohttp/streams.py", line 671, in read
await self._waiter
asyncio.exceptions.CancelledError

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
File "/home/user/open/lib64/python3.9/site-packages/opensearchpy/_async/http_aiohttp.py", line 294, in perform_request
async with self.session.request(
File "/home/user/open/lib64/python3.9/site-packages/aiohttp/client.py", line 1425, in __aenter__
self._resp: _RetType = await self._coro
File "/home/user/open/lib64/python3.9/site-packages/aiohttp/client.py", line 730, in _request
await resp.start(conn)
File "/home/user/open/lib64/python3.9/site-packages/aiohttp/client_reqrep.py", line 1074, in start
self._continue = None
File "/home/user/open/lib64/python3.9/site-packages/aiohttp/helpers.py", line 671, in __exit__
raise asyncio.TimeoutError from exc_val
asyncio.exceptions.TimeoutError

Localhost check:
Output of a curl issued on localhost:

{
"name" : "2d6f3ab15f2d",
"cluster_name" : "docker-cluster",
"cluster_uuid" : "AOLKcancRJiWWO0xZuxIOQ",
"version" : {
"distribution" : "opensearch",
"number" : "2.18.0",
"build_type" : "tar",
"build_hash" : "99a9a81da366173b0c2b963b26ea92e15ef34547",
"build_date" : "2024-10-31T19:08:04.231254959Z",
"build_snapshot" : false,
"lucene_version" : "9.12.0",
"minimum_wire_compatibility_version" : "7.10.0",
"minimum_index_compatibility_version" : "7.0.0"
},

The benchmark also runs fine when workload=nyc_taxis. Any help or guidance is greatly appreciated.

We are seeing a similar phenomenon with nyc_taxis on amazonlinux2023 and ubuntu20.04.6.
This only started last night; before that we were running with no issues.

Configuration

sudo docker run -dq --name opensearch -p 9200:9200 -p 9600:9600 -v /mnt/opensearch/data:/usr/share/opensearch/data -e "discovery.type=single-node" -e "OPENSEARCH_INITIAL_ADMIN_PASSWORD=******" opensearchproject/opensearch:2.14.0

sudo docker run -dq --name opensearch-benchmark --network=host -v /mnt/benchmark/benchmark:/opensearch-benchmark/.benchmark opensearchproject/opensearch-benchmark:1.6.0 execute-test --target-hosts https://localhost:9200 --pipeline benchmark-only --workload nyc_taxis --client-options 'basic_auth_user:admin,basic_auth_password:*******,verify_certs:false,timeout:30' --distribution-version 2.14.0;

Logs
2025-01-09 22:46:48,905 -not-actor-/PID:63 opensearch WARNING GET localhost:9200/_cluster/health?wait_for_status=green&wait_for_no_relocating_shards=true [status:408 request:30.012s]
2025-01-09 22:46:48,905 ActorAddr-(T|:41643)/PID:63 osbenchmark.worker_coordinator.runner INFO [cluster-health] has timed out. Retrying in [0.50] seconds.
2025-01-09 22:47:19,409 -not-actor-/PID:63 opensearch WARNING GET localhost:9200/_cluster/health?wait_for_status=green&wait_for_no_relocating_shards=true [status:408 request:30.002s]
2025-01-09 22:47:19,409 ActorAddr-(T|:41643)/PID:63 osbenchmark.worker_coordinator.runner INFO [cluster-health] has timed out. Retrying in [0.50] seconds.
2025-01-09 22:47:49,913 -not-actor-/PID:63 opensearch WARNING GET https:localhost:9200/_cluster/health?wait_for_status=green&wait_for_no_relocating_shards=true [status:408 request:30.002s]
2025-01-09 22:47:49,913 ActorAddr-(T|:41643)/PID:63 osbenchmark.worker_coordinator.runner INFO [cluster-health] has timed out. Retrying in [0.50] seconds.

2025-01-09 22:51:23,439 -not-actor-/PID:63 opensearch WARNING GET localhost:9200/_cluster/health?wait_for_status=green&wait_for_no_relocating_shards=true [status:N/A request:30.002s]
Traceback (most recent call last):
File "/usr/local/lib/python3.11/site-packages/opensearchpy/_async/http_aiohttp.py", line 296, in perform_request
async with self.session.request(
File "/usr/local/lib/python3.11/site-packages/aiohttp/client.py", line 1197, in __aenter__
self._resp = await self._coro
^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/site-packages/aiohttp/client.py", line 608, in _request
await resp.start(conn)
File "/usr/local/lib/python3.11/site-packages/aiohttp/client_reqrep.py", line 971, in start
with self._timer:
File "/usr/local/lib/python3.11/site-packages/aiohttp/helpers.py", line 735, in __exit__
raise asyncio.TimeoutError from None
TimeoutError

(I had to edit the log a little because, as a new user, I can only include two links…)

@dancoh did you find a resolution?

Hi @titoreinhart, kind of - it is a workaround: we are running the cluster with the "plugins.security.disabled=true" flag and running the benchmark without SSL (HTTPS) connectivity: --target-hosts http://localhost:9200
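For reference, a sketch of that workaround based on the commands earlier in this thread (container names, volumes, and the timeout value are just carried over; adjust to your environment):

sudo docker run -dq --name opensearch -p 9200:9200 -p 9600:9600 -e "discovery.type=single-node" -e "plugins.security.disabled=true" opensearchproject/opensearch:2.14.0

sudo docker run -dq --name opensearch-benchmark --network=host opensearchproject/opensearch-benchmark:1.6.0 execute-test --target-hosts http://localhost:9200 --pipeline benchmark-only --workload nyc_taxis --client-options 'timeout:30' --distribution-version 2.14.0

With the security plugin disabled there is no TLS and no basic auth, so the basic_auth_* and verify_certs client options are no longer needed.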

OK, same here - thanks for the confirmation 🙂

To work around the network-related timeout, increase the timeout setting in the benchmark's client options, or verify that the cluster health is stable.
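For example, a longer client-side timeout can be passed through --client-options (the 120-second value below is only an illustration; the default appears to be around 10 seconds, judging by the 10.8 s timeout in the first log above):

opensearch-benchmark execute-test --pipeline=benchmark-only --workload=http_logs --target-host=https://<ip_address>:9200 --client-options=basic_auth_user:admin,basic_auth_password:<auth_passwd>,verify_certs:false,timeout:120 --kill-running-processes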


@dancoh @titoreinhart I've got the same issue when running a single-node cluster. However, it worked when I had a 2-node cluster.
I think the benchmark script is waiting for a green health check. On a single node you get yellow, as not all indices are in a green state.
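You can reproduce the exact call the benchmark is stuck on (same URL as in the logs above; the password is a placeholder):

curl --insecure -u admin:<auth_passwd> "https://localhost:9200/_cluster/health?wait_for_status=green&wait_for_no_relocating_shards=true"

On a single-node cluster with a yellow index this request waits until the server-side timeout and comes back as HTTP 408 with "timed_out" : true, which matches the 408 responses in the logs.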

Single node:

curl --insecure -u admin:Eliatra123 https://localhost:9200/_cat/indices
green  open logs-221998                  N8WwN_HlT2mihrW414BggQ 5 0  0 0     1kb     1kb
green  open .opensearch-observability    9gBnn-PdRkOmw6h3RTMJKg 1 0  0 0    208b    208b
green  open .plugins-ml-config           6V1gv32cTWuUD9VodO0SqQ 1 0  1 0     4kb     4kb
green  open logs-211998                  wi-GdCF6SIW4RmZnPpsZIw 5 0  0 0     1kb     1kb
green  open reindexed-logs               5uxflNR3Sv69FStX9U_t9Q 5 0  0 0     1kb     1kb
green  open logs-231998                  21trI9JRQL2ONPVqEEgJuw 5 0  0 0     1kb     1kb
yellow open security-auditlog-2025.01.30 46VjxnjWTaKh15e2gU-7iA 1 1 16 0 115.2kb 115.2kb
green  open logs-241998                  OF_Td9r-Sj612BZBB8HVMQ 5 0  0 0     1kb     1kb
green  open logs-181998                  mZERbzmBR8ecvQdZLrevLA 5 0  0 0     1kb     1kb
green  open logs-201998                  tv311_bmSiSwT2-mDYBT1w 5 0  0 0     1kb     1kb
green  open .opendistro_security         bL4HF9TpQP6GZa19Lb2CLg 1 0 10 0  82.4kb  82.4kb
green  open logs-191998                  xe6l9TJnRTSzTSV2DWjwQQ 5 0  0 0     1kb     1kb

curl --insecure -u admin:Eliatra123 https://localhost:9200/_cluster/health?pretty
{
  "cluster_name" : "docker-cluster",
  "status" : "yellow",
  "timed_out" : false,
  "number_of_nodes" : 1,
  "number_of_data_nodes" : 1,
  "discovered_master" : true,
  "discovered_cluster_manager" : true,
  "active_primary_shards" : 45,
  "active_shards" : 45,
  "relocating_shards" : 0,
  "initializing_shards" : 0,
  "unassigned_shards" : 1,
  "delayed_unassigned_shards" : 0,
  "number_of_pending_tasks" : 0,
  "number_of_in_flight_fetch" : 0,
  "task_max_waiting_in_queue_millis" : 0,
  "active_shards_percent_as_number" : 97.82608695652173
}

Two-node cluster:

curl --insecure -u admin:Eliatra123 https://docker5.pablo.local:9200/_cat/indices
green open logs-221998                  6jYYo1CQT_W_SvspIuuoug 5 0       0 0     1kb     1kb
green open reindexed-logs               uxL_DD5YS4CWWxgEe6rNpw 5 0       0 0     1kb     1kb
green open security-auditlog-2025.01.30 QRyE9uSVSeupTNJH_4VFAQ 1 1      16 0 186.6kb    86kb
green open logs-241998                  XBAqTLUcR0ye8BctH4iLsQ 5 0       0 0     1kb     1kb
green open logs-181998                  Q7duo8W8RgqBLG_OX9GkCQ 5 0 2666530 0 218.2mb 218.2mb
green open logs-201998                  G5tI7JLlQjKTyDgu3UDvGg 5 0 5240248 0 388.3mb 388.3mb
green open .opendistro_security         4e5w9l9xQBqsxl6LXhGglQ 1 1      10 0 118.8kb  59.4kb
green open .kibana_1                    akvKKk-TTS2NR8iPNTJa0A 1 1       0 0    416b    208b
green open .plugins-ml-config           KwxJbco0ScKEIbqoMeNn3w 1 1       1 0   8.1kb     4kb
green open .opensearch-observability    CS3rl5VxSBOAFt9dsUq8Hw 1 1       0 0    416b    208b
green open logs-211998                  VyYJBoBUS4S4Nptt6JQFfA 5 0       0 0     1kb     1kb
green open logs-231998                  7WMjc6XiRUuzETZKqjdPww 5 0       0 0     1kb     1kb
green open logs-191998                  9CpQbp1aQIOr2F3iYeQLAw 5 0 9554199 0 700.4mb 700.4mb

curl --insecure -u admin:Eliatra123 https://docker5.pablo.local:9200/_cluster/health?pretty
{
  "cluster_name" : "opensearch-cluster",
  "status" : "green",
  "timed_out" : false,
  "number_of_nodes" : 2,
  "number_of_data_nodes" : 2,
  "discovered_master" : true,
  "discovered_cluster_manager" : true,
  "active_primary_shards" : 46,
  "active_shards" : 52,
  "relocating_shards" : 0,
  "initializing_shards" : 0,
  "unassigned_shards" : 0,
  "delayed_unassigned_shards" : 0,
  "number_of_pending_tasks" : 0,
  "number_of_in_flight_fetch" : 0,
  "task_max_waiting_in_queue_millis" : 0,
  "active_shards_percent_as_number" : 100.0
}

When you look at the shards in the single-node cluster, you'll notice that the security plugin's audit log index is configured with 1 primary and 1 replica shard, and the replica is in an UNASSIGNED state.
Disabling the security plugin resolved your issue because the security audit index was never created.

I would avoid disabling the security plugin, as results may vary.

curl --insecure -u admin:Eliatra123 https://localhost:9200/_cat/shards
logs-221998                      0 p STARTED     0    208b 172.17.0.2 c3d46fadcddb
logs-221998                      1 p STARTED     0    208b 172.17.0.2 c3d46fadcddb
logs-221998                      2 p STARTED     0    208b 172.17.0.2 c3d46fadcddb
logs-221998                      3 p STARTED     0    208b 172.17.0.2 c3d46fadcddb
logs-221998                      4 p STARTED     0    208b 172.17.0.2 c3d46fadcddb
reindexed-logs                   0 p STARTED     0    208b 172.17.0.2 c3d46fadcddb
reindexed-logs                   1 p STARTED     0    208b 172.17.0.2 c3d46fadcddb
reindexed-logs                   2 p STARTED     0    208b 172.17.0.2 c3d46fadcddb
reindexed-logs                   3 p STARTED     0    208b 172.17.0.2 c3d46fadcddb
reindexed-logs                   4 p STARTED     0    208b 172.17.0.2 c3d46fadcddb
security-auditlog-2025.01.30     0 p STARTED    16 115.2kb 172.17.0.2 c3d46fadcddb
security-auditlog-2025.01.30     0 r UNASSIGNED

Try using the docker-compose example from the OpenSearch documentation.
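A minimal two-node sketch roughly along the lines of the documentation's sample docker-compose.yml (node names, network name, and the password placeholder are assumptions, and the documented example contains more settings):

version: '3'
services:
  opensearch-node1:
    image: opensearchproject/opensearch:2.18.0
    environment:
      - cluster.name=opensearch-cluster
      - node.name=opensearch-node1
      - discovery.seed_hosts=opensearch-node1,opensearch-node2
      - cluster.initial_cluster_manager_nodes=opensearch-node1,opensearch-node2
      - bootstrap.memory_lock=true
      - OPENSEARCH_JAVA_OPTS=-Xms512m -Xmx512m
      - OPENSEARCH_INITIAL_ADMIN_PASSWORD=<admin_password>
    ulimits:
      memlock:
        soft: -1
        hard: -1
    ports:
      - 9200:9200
      - 9600:9600
    networks:
      - opensearch-net
  opensearch-node2:
    image: opensearchproject/opensearch:2.18.0
    environment:
      - cluster.name=opensearch-cluster
      - node.name=opensearch-node2
      - discovery.seed_hosts=opensearch-node1,opensearch-node2
      - cluster.initial_cluster_manager_nodes=opensearch-node1,opensearch-node2
      - bootstrap.memory_lock=true
      - OPENSEARCH_JAVA_OPTS=-Xms512m -Xmx512m
      - OPENSEARCH_INITIAL_ADMIN_PASSWORD=<admin_password>
    ulimits:
      memlock:
        soft: -1
        hard: -1
    networks:
      - opensearch-net
networks:
  opensearch-net: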


Thanks @pablo, what you say makes sense on one hand, but on the other hand my setup worked fine for more than a year with a single-node cluster and then suddenly stopped working. Hence, I assume the issue is an expired certificate somewhere, and not the missing node.

@dancoh If your script stops at the same point as @titoreinhart's, then I don't think it is related to your certificates.

Before the benchmark script runs the cluster health check, it deletes and creates indices on the target cluster.

[WARNING] flush_total_time is 33 ms indicating that the cluster is not in a defined clean state. Recorded index time metrics may be misleading.
Running delete-index                                                           [100% done]
Running create-index                                                           [100% done]
Running check-cluster-health                                                   [  0% done]
green  open logs-181998                  4qdKaidqQfiV8KslZj0ehg 5 0  0 0     1kb     1kb
green  open logs-191998                  Whcdog-JQkmlyhKyFAxFUw 5 0  0 0     1kb     1kb
green  open logs-201998                  p_6RC1NsRVSUx6fPimdm_w 5 0  0 0     1kb     1kb
green  open logs-211998                  VF18n5VRQg27lYqjKjwdmg 5 0  0 0     1kb     1kb
green  open logs-221998                  iuqNFWUuQ0mTUHkrjwOJSw 5 0  0 0     1kb     1kb
green  open logs-231998                  ondT7otcTEahVc_MgmPstg 5 0  0 0     1kb     1kb
green  open logs-241998                  8lzhCIIbSYScymPhPEL0PQ 5 0  0 0     1kb     1kb

An expired certificate wouldn't have allowed the benchmark script to run these actions before the cluster health check.

What is your cluster status? If it's yellow, you can always change the number of replica shards to a number that fits your cluster.
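For example, dropping the replica count on the yellow audit-log index (index name taken from the _cat/shards output above; the password is a placeholder) should bring a single-node cluster back to green:

curl --insecure -u admin:<auth_passwd> -X PUT "https://localhost:9200/security-auditlog-2025.01.30/_settings" -H 'Content-Type: application/json' -d '{"index": {"number_of_replicas": 0}}'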