Versions (relevant - OpenSearch/Dashboard/Server OS/Browser):
OpenSearch version: 2.5.0
OpenSearch-Benchmark version: 1.1.0
OpenSearch-Benchmarkd version: 1.1.0
Describe the issue:
When I run opensearch-benchmark on the coordinator host without specifying --load-worker-coordinator-hosts, the benchmark runs without issues:
opensearch-benchmark execute-test --workload-path=local/ --results-file=results.csv --results-format=csv --target-hosts=https://vpc-{aws_opensearch_node_url} --pipeline=benchmark-only
However, when I run opensearch-benchmark after setting up the opensearch-benchmarkd daemon on the coordinator and the load-worker-coordinator hosts, the process hangs:
opensearch-benchmark execute-test --workload-path=local/ --results-file=results.csv --results-format=csv --load-worker-coordinator-hosts={ip1},{ip2} --target-hosts=https://vpc-{aws_opensearch_node_url} --pipeline=benchmark-only
If I stop the running process immediately, it reports a fatal error. If I stop it after a couple of minutes, it outputs SUCCESS, but the benchmark did not actually run: there is no results.csv file, and I don’t see any requests reaching the OpenSearch cluster.
How can I fix this? I need to generate load from multiple hosts rather than a single host.
Note: I have tried both my custom workload and the workloads provided by the opensearch-benchmark team. Both work when I run the benchmark directly on the coordinator host alone, but hang when I introduce load-worker-coordinator hosts.
Configuration:
AWS node configuration:
Data nodes: 3 x t3.small.search
Master nodes: 3 x t3.small.search
Workload size: < 200 MB
AWS EC2 hosts:
1 coordinator host: t2.micro, Ubuntu 22.04 or Amazon Linux 2 (I tried both OSes)
2 load-worker-coordinator hosts: t2.micro, Ubuntu 22.04 or Amazon Linux 2 (I tried both OSes)
To set up the opensearch-benchmarkd daemon on all hosts, I followed the esrally documentation, using these steps:
# Coordinator Host
opensearch-benchmarkd start --node-ip {coordinator-ip} --coordinator-ip {coordinator-ip}
# Output: [INFO] Successfully started actor system on node [{coordinator-ip}] with coordinator node IP [{coordinator-ip}].
# Load Worker Host 1
opensearch-benchmarkd start --node-ip {load-host-1-ip} --coordinator-ip {coordinator-ip}
# Output: [INFO] Successfully started actor system on node [{load-host-1-ip}] with coordinator node IP [{coordinator-ip}].
# Load Worker Host 2
opensearch-benchmarkd start --node-ip {load-host-2-ip} --coordinator-ip {coordinator-ip}
# Output: [INFO] Successfully started actor system on node [{load-host-2-ip}] with coordinator node IP [{coordinator-ip}].
After setting them up, I checked the status on each host, and opensearch-benchmarkd reports Running on all of them.
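Since the daemons report that they started, one thing that may be worth ruling out is plain network reachability between the hosts. The logs show the actor system convention address on port 1900, so a minimal sketch like the following, run from each load worker host against the coordinator IP, would show whether a TCP connection to that port can be opened at all. This is not part of opensearch-benchmark: the helper name `can_connect` is hypothetical, the address in the usage section is a placeholder, and a successful connection only rules out a firewall or security group block, not other causes.

```python
import socket


def can_connect(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        # create_connection resolves the host and attempts a TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


if __name__ == "__main__":
    # Placeholder address: replace with the real coordinator IP.
    coordinator_ip = "127.0.0.1"
    # 1900 is the port shown in 'Convention Address.IPv4' in the logs.
    print(can_connect(coordinator_ip, 1900))
```

If this prints False on a load worker host, the daemon on that host probably cannot reach the coordinator either, although I have not confirmed that this is the cause of the hang.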
Relevant Logs or Screenshots:
Here are the logs from .benchmark/logs/benchmark.log for the coordinator and load worker hosts, covering the period from when I set up benchmarkd until I stopped the hanging process and stopped the daemons on all hosts.
From coordinator host:
2023-11-06 00:58:14,241 -not-actor-/PID:14542 osbenchmark.actor INFO Starting actor system with system base [multiprocTCPBase] and capabilities [{'coordinator': True, 'ip': '{coordinator-ip}', 'Convention Address.IPv4': '{coordinator-ip}:1900'}].
2023-11-06 00:58:14,263 -not-actor-/PID:14544 root INFO ++++ Actor System gen (3, 10) started, admin @ ActorAddr-(T|:1900)
2023-11-06 01:02:00,728 -not-actor-/PID:14637 osbenchmark.benchmark INFO OS [uname_result(system='Linux', node='ip-{ip}', release='6.2.0-1012-aws', version='#12~22.04.1-Ubuntu SMP Thu Sep 7 14:01:24 UTC 2023', machine='x86_64')]
2023-11-06 01:02:00,728 -not-actor-/PID:14637 osbenchmark.benchmark INFO Python [namespace(name='cpython', cache_tag='cpython-310', version=sys.version_info(major=3, minor=10, micro=12, releaselevel='final', serial=0), hexversion=50990320, _multiarch='x86_64-linux-gnu')]
2023-11-06 01:02:00,728 -not-actor-/PID:14637 osbenchmark.benchmark INFO Benchmark version [1.1.0]
2023-11-06 01:02:00,728 -not-actor-/PID:14637 osbenchmark.utils.net INFO Connecting directly to the Internet (no proxy support).
2023-11-06 01:02:00,836 -not-actor-/PID:14637 osbenchmark.benchmark INFO Detected a working Internet connection.
2023-11-06 01:02:00,851 -not-actor-/PID:14637 osbenchmark.benchmark INFO Actor system already running locally? [True]
2023-11-06 01:02:00,852 -not-actor-/PID:14637 osbenchmark.actor INFO Joining already running actor system with system base [multiprocTCPBase].
2023-11-06 01:02:00,866 -not-actor-/PID:14637 osbenchmark.test_execution_orchestrator INFO Test Execution id [7ed10c25-43e5-49ba-8050-a9df6a2c97be]
2023-11-06 01:02:00,867 -not-actor-/PID:14637 osbenchmark.test_execution_orchestrator INFO User specified pipeline [benchmark-only].
2023-11-06 01:02:00,867 -not-actor-/PID:14637 osbenchmark.test_execution_orchestrator INFO Using configured hosts [{'host': '{host_address}', 'port': 443, 'use_ssl': True}]
2023-11-06 01:02:00,867 -not-actor-/PID:14637 osbenchmark.actor INFO Joining already running actor system with system base [multiprocTCPBase].
2023-11-06 01:02:01,99 ActorAddr-(T|:1900)/PID:14544 osbenchmark.worker_coordinator.scheduler DEBUG Registering object [<class 'osbenchmark.worker_coordinator.scheduler.DeterministicScheduler'>] for [deterministic].
2023-11-06 01:02:01,100 ActorAddr-(T|:1900)/PID:14544 osbenchmark.worker_coordinator.scheduler DEBUG Registering object [<class 'osbenchmark.worker_coordinator.scheduler.PoissonScheduler'>] for [poisson].
2023-11-06 01:02:01,107 ActorAddr-(T|:1900)/PID:14544 osbenchmark.actor INFO Capabilities [{'coordinator': True, 'ip': '{coordinator-ip}', 'Convention Address.IPv4': '{coordinator-ip}:1900', 'Thespian ActorSystem Name': 'multiprocTCPBase', 'Thespian ActorSystem Version': 2, 'Thespian Watch Supported': True, 'Python Version': (3, 10, 12, 'final', 0), 'Thespian Generation': (3, 10), 'Thespian Version': '1699232294256'}] match requirements [{'coordinator': True}].
2023-11-06 01:02:01,124 ActorAddr-(T|:42099)/PID:14652 osbenchmark.client INFO Creating OpenSearch client connected to [{'host': '{host_address}', 'port': 443, 'use_ssl': True}] with options [{'timeout': 60}]
2023-11-06 01:02:01,125 ActorAddr-(T|:42099)/PID:14652 osbenchmark.client INFO SSL support: off
2023-11-06 01:02:01,125 ActorAddr-(T|:42099)/PID:14652 osbenchmark.client INFO HTTP basic authentication: off
2023-11-06 01:02:01,125 ActorAddr-(T|:42099)/PID:14652 osbenchmark.client INFO HTTP compression: off
2023-11-06 01:02:01,989 ActorAddr-(T|:42099)/PID:14652 opensearch INFO GET https://{host_address}:443/_cluster/health?wait_for_nodes=%3E%3D1 [status:200 request:0.798s]
2023-11-06 01:02:01,990 ActorAddr-(T|:42099)/PID:14652 opensearch DEBUG > None
2023-11-06 01:02:01,990 ActorAddr-(T|:42099)/PID:14652 opensearch DEBUG < {"cluster_name":"{aws_account_no}:{cluster_name}","status":"green","timed_out":false,"number_of_nodes":6,"number_of_data_nodes":3,"discovered_master":true,"discovered_cluster_manager":true,"active_primary_shards":1,"active_shards":3,"relocating_shards":0,"initializing_shards":0,"unassigned_shards":0,"delayed_unassigned_shards":0,"number_of_pending_tasks":0,"number_of_in_flight_fetch":0,"task_max_waiting_in_queue_millis":0,"active_shards_percent_as_number":100.0}
2023-11-06 01:02:01,990 ActorAddr-(T|:42099)/PID:14652 osbenchmark.client INFO REST API is available for >= [1] nodes after [0] attempts.
2023-11-06 01:02:03,132 ActorAddr-(T|:42099)/PID:14652 opensearch INFO GET https://{host_address}:443/ [status:200 request:1.142s]
2023-11-06 01:02:03,133 ActorAddr-(T|:42099)/PID:14652 opensearch DEBUG > None
2023-11-06 01:02:03,195 ActorAddr-(T|:43231)/PID:14653 osbenchmark.actor INFO Received signal from test execution orchestrator to start engine.
2023-11-06 01:02:03,196 ActorAddr-(T|:43231)/PID:14653 osbenchmark.actor INFO Cluster will not be provisioned by Benchmark.
2023-11-06 01:02:03,133 ActorAddr-(T|:42099)/PID:14652 opensearch DEBUG < {
"name" : "{name}",
"cluster_name" : "{aws_account_no}:{cluster_name}",
"cluster_uuid" : "{cluster_uuid}",
"version" : {
"distribution" : "opensearch",
"number" : "2.5.0",
"build_type" : "tar",
"build_hash" : "unknown",
"build_date" : "2023-08-16T11:17:35.692449Z",
"build_snapshot" : false,
"lucene_version" : "9.4.2",
"minimum_wire_compatibility_version" : "7.10.0",
"minimum_index_compatibility_version" : "7.0.0"
},
"tagline" : "The OpenSearch Project: https://opensearch.org/"
}
2023-11-06 01:02:03,133 ActorAddr-(T|:42099)/PID:14652 osbenchmark.test_execution_orchestrator INFO Automatically derived distribution version [2.5.0]
2023-11-06 01:02:03,135 ActorAddr-(T|:42099)/PID:14652 osbenchmark.workload.loader INFO Reading workload specification file [/home/ubuntu/local/workload.json].
2023-11-06 01:02:03,148 ActorAddr-(T|:42099)/PID:14652 osbenchmark.workload.loader INFO Final rendered workload for '/home/ubuntu/local/workload.json' has been written to '/tmp/tmp0rov5x3z.json'.
2023-11-06 01:02:03,155 ActorAddr-(T|:42099)/PID:14652 osbenchmark.workload.loader INFO Loading template [definition for index collector_bundle_state_monitoring in index/bundle-state-index.json].
2023-11-06 01:02:03,156 ActorAddr-(T|:42099)/PID:14652 osbenchmark.workload.loader INFO Loading template [definition for index monitoring_index in index/monitoring_index.json].
2023-11-06 01:02:03,157 ActorAddr-(T|:42099)/PID:14652 osbenchmark.workload.loader INFO Loading template [definition for index environment_index in index/environment-index.json].
2023-11-06 01:02:03,159 ActorAddr-(T|:42099)/PID:14652 osbenchmark.workload.loader INFO Loading template [definition for index template lab_metrics_template in /home/ubuntu/local/template/lab_metrics-template.json].
2023-11-06 01:02:03,160 ActorAddr-(T|:42099)/PID:14652 osbenchmark.workload.loader INFO Loading template [definition for index template equipment_metrics in /home/ubuntu/local/template/equipment-metrics-template.json].
2023-11-06 01:02:03,161 ActorAddr-(T|:42099)/PID:14652 osbenchmark.workload.loader INFO Loading template [definition for index template test_metrics in /home/ubuntu/local/template/test-metrics-template.json].
2023-11-06 01:02:03,167 ActorAddr-(T|:42099)/PID:14652 osbenchmark.metrics INFO Creating in-memory metrics store
2023-11-06 01:02:03,167 ActorAddr-(T|:42099)/PID:14652 osbenchmark.metrics INFO Opening metrics store for test execution timestamp=[20231106T010200Z], workload=[local],test_procedure=[setup-procedure], provision_config_instance=[['external']]
2023-11-06 01:02:03,167 ActorAddr-(T|:42099)/PID:14652 osbenchmark.metrics INFO Creating file test_execution store
2023-11-06 01:02:03,167 ActorAddr-(T|:42099)/PID:14652 osbenchmark.actor INFO Asking builder to start the engine.
2023-11-06 01:02:03,167 ActorAddr-(T|:42099)/PID:14652 osbenchmark.actor INFO Capabilities [{'coordinator': True, 'ip': '{coordinator-ip}', 'Convention Address.IPv4': '{coordinator-ip}:1900', 'Thespian ActorSystem Name': 'multiprocTCPBase', 'Thespian ActorSystem Version': 2, 'Thespian Watch Supported': True, 'Python Version': (3, 10, 12, 'final', 0), 'Thespian Generation': (3, 10), 'Thespian Version': '1699232294256'}] match requirements [{'coordinator': True}].
2023-11-06 01:02:03,201 ActorAddr-(T|:42099)/PID:14652 osbenchmark.actor INFO Builder has started engine successfully.
2023-11-06 01:02:03,201 ActorAddr-(T|:42099)/PID:14652 osbenchmark.actor INFO Capabilities [{'coordinator': True, 'ip': '{coordinator-ip}', 'Convention Address.IPv4': '{coordinator-ip}:1900', 'Thespian ActorSystem Name': 'multiprocTCPBase', 'Thespian ActorSystem Version': 2, 'Thespian Watch Supported': True, 'Python Version': (3, 10, 12, 'final', 0), 'Thespian Generation': (3, 10), 'Thespian Version': '1699232294256'}] match requirements [{'coordinator': True}].
2023-11-06 01:02:03,204 ActorAddr-(T|:42099)/PID:14652 osbenchmark.actor INFO Telling worker_coordinator to prepare for benchmarking.
2023-11-06 01:02:03,225 ActorAddr-(T|:35125)/PID:14654 osbenchmark.metrics INFO Creating in-memory metrics store
2023-11-06 01:02:04,936 ActorAddr-(T|:1900)/PID:14544 osbenchmark.actor INFO Checking capabilities [{'coordinator': True, 'ip': '{coordinator-ip}', 'Convention Address.IPv4': '{coordinator-ip}:1900', 'Thespian ActorSystem Name': 'multiprocTCPBase', 'Thespian ActorSystem Version': 2, 'Thespian Watch Supported': True, 'Python Version': (3, 10, 12, 'final', 0), 'Thespian Generation': (3, 10), 'Thespian Version': '1699232294256'}] against requirements [{'ip': '{load-host-1-ip}'}] failed.
2023-11-06 01:02:03,227 ActorAddr-(T|:35125)/PID:14654 osbenchmark.metrics INFO Opening metrics store for test execution timestamp=[20231106T010200Z], workload=[local],test_procedure=[setup-procedure], provision_config_instance=[['external']]
2023-11-06 01:02:04,940 ActorAddr-(T|:1900)/PID:14544 osbenchmark.actor INFO Checking capabilities [{'coordinator': True, 'ip': '{coordinator-ip}', 'Convention Address.IPv4': '{coordinator-ip}:1900', 'Thespian ActorSystem Name': 'multiprocTCPBase', 'Thespian ActorSystem Version': 2, 'Thespian Watch Supported': True, 'Python Version': (3, 10, 12, 'final', 0), 'Thespian Generation': (3, 10), 'Thespian Version': '1699232294256'}] against requirements [{'ip': '{load-host-2-ip}'}] failed.
2023-11-06 01:02:03,227 ActorAddr-(T|:35125)/PID:14654 osbenchmark.client INFO Creating OpenSearch client connected to [{'host': '{host_address}', 'port': 443, 'use_ssl': True}] with options [{'timeout': 60, 'retry-on-timeout': True}]
2023-11-06 01:02:03,227 ActorAddr-(T|:35125)/PID:14654 osbenchmark.client INFO SSL support: off
2023-11-06 01:02:03,227 ActorAddr-(T|:35125)/PID:14654 osbenchmark.client INFO HTTP basic authentication: off
2023-11-06 01:02:03,227 ActorAddr-(T|:35125)/PID:14654 osbenchmark.client INFO HTTP compression: off
2023-11-06 01:02:03,229 ActorAddr-(T|:35125)/PID:14654 osbenchmark.worker_coordinator.worker_coordinator INFO Checking if REST API is available.
2023-11-06 01:02:04,925 ActorAddr-(T|:35125)/PID:14654 opensearch INFO GET https://{host_address}:443/_cluster/health?wait_for_nodes=%3E%3D1 [status:200 request:1.696s]
2023-11-06 01:02:04,925 ActorAddr-(T|:35125)/PID:14654 opensearch DEBUG > None
2023-11-06 01:02:04,925 ActorAddr-(T|:35125)/PID:14654 opensearch DEBUG < {"cluster_name":"{aws_account_no}:{cluster_name}","status":"green","timed_out":false,"number_of_nodes":6,"number_of_data_nodes":3,"discovered_master":true,"discovered_cluster_manager":true,"active_primary_shards":1,"active_shards":3,"relocating_shards":0,"initializing_shards":0,"unassigned_shards":0,"delayed_unassigned_shards":0,"number_of_pending_tasks":0,"number_of_in_flight_fetch":0,"task_max_waiting_in_queue_millis":0,"active_shards_percent_as_number":100.0}
2023-11-06 01:02:04,926 ActorAddr-(T|:35125)/PID:14654 osbenchmark.client INFO REST API is available for >= [1] nodes after [0] attempts.
2023-11-06 01:02:04,926 ActorAddr-(T|:35125)/PID:14654 osbenchmark.worker_coordinator.worker_coordinator INFO REST API is available.
2023-11-06 01:02:04,932 ActorAddr-(T|:35125)/PID:14654 opensearch INFO GET https://{host_address}:443/ [status:200 request:0.006s]
2023-11-06 01:02:04,933 ActorAddr-(T|:35125)/PID:14654 opensearch DEBUG > None
2023-11-06 01:02:04,933 ActorAddr-(T|:35125)/PID:14654 opensearch DEBUG < {
"name" : "{name}",
"cluster_name" : "{aws_account_no}:{cluster_name}",
"cluster_uuid" : "{cluster_uuid}",
"version" : {
"distribution" : "opensearch",
"number" : "2.5.0",
"build_type" : "tar",
"build_hash" : "unknown",
"build_date" : "2023-08-16T11:17:35.692449Z",
"build_snapshot" : false,
"lucene_version" : "9.4.2",
"minimum_wire_compatibility_version" : "7.10.0",
"minimum_index_compatibility_version" : "7.0.0"
},
"tagline" : "The OpenSearch Project: https://opensearch.org/"
}
2023-11-06 01:02:04,933 ActorAddr-(T|:35125)/PID:14654 osbenchmark.actor INFO Starting prepare workload process on hosts [['{load-host-1-ip}', '{load-host-2-ip}']]
2023-11-06 01:02:04,934 ActorAddr-(T|:35125)/PID:14654 osbenchmark.actor INFO Checking capabilities [{'coordinator': True, 'ip': '{coordinator-ip}', 'Convention Address.IPv4': '{coordinator-ip}:1900', 'Thespian ActorSystem Name': 'multiprocTCPBase', 'Thespian ActorSystem Version': 2, 'Thespian Watch Supported': True, 'Python Version': (3, 10, 12, 'final', 0), 'Thespian Generation': (3, 10), 'Thespian Version': '1699232294256'}] against requirements [{'ip': '{load-host-1-ip}'}] failed.
2023-11-06 01:02:04,935 ActorAddr-(T|:35125)/PID:14654 osbenchmark.actor INFO Checking capabilities [{'coordinator': True, 'ip': '{coordinator-ip}', 'Convention Address.IPv4': '{coordinator-ip}:1900', 'Thespian ActorSystem Name': 'multiprocTCPBase', 'Thespian ActorSystem Version': 2, 'Thespian Watch Supported': True, 'Python Version': (3, 10, 12, 'final', 0), 'Thespian Generation': (3, 10), 'Thespian Version': '1699232294256'}] against requirements [{'ip': '{load-host-2-ip}'}] failed.
2023-11-06 01:07:04,943 ActorAddr-(T|:1900)/PID:14544 osbenchmark.actor INFO Checking capabilities [{'coordinator': True, 'ip': '{coordinator-ip}', 'Convention Address.IPv4': '{coordinator-ip}:1900', 'Thespian ActorSystem Name': 'multiprocTCPBase', 'Thespian ActorSystem Version': 2, 'Thespian Watch Supported': True, 'Python Version': (3, 10, 12, 'final', 0), 'Thespian Generation': (3, 10), 'Thespian Version': '1699232294256'}] against requirements [{'ip': '{load-host-1-ip}'}] failed.
2023-11-06 01:07:04,945 ActorAddr-(T|:1900)/PID:14544 osbenchmark.actor INFO Checking capabilities [{'coordinator': True, 'ip': '{coordinator-ip}', 'Convention Address.IPv4': '{coordinator-ip}:1900', 'Thespian ActorSystem Name': 'multiprocTCPBase', 'Thespian ActorSystem Version': 2, 'Thespian Watch Supported': True, 'Python Version': (3, 10, 12, 'final', 0), 'Thespian Generation': (3, 10), 'Thespian Version': '1699232294256'}] against requirements [{'ip': '{load-host-2-ip}'}] failed.
2023-11-06 01:08:23,78 -not-actor-/PID:14637 osbenchmark.test_execution_orchestrator INFO User has cancelled the benchmark (detected by test execution orchestrator).
2023-11-06 01:08:23,80 -not-actor-/PID:14637 osbenchmark.test_execution_orchestrator INFO Telling benchmark actor to exit.
2023-11-06 01:08:23,81 ActorAddr-(T|:42099)/PID:14652 osbenchmark.actor INFO BenchmarkActor received unknown message [ActorExitRequest] (ignoring).
2023-11-06 01:08:23,83 ActorAddr-(T|:43231)/PID:14653 osbenchmark.actor INFO BuilderActor#receiveMessage unrecognized(msg = [<class 'thespian.actors.ActorExitRequest'>] sender = [ActorAddr-(T|:42099)])
2023-11-06 01:08:23,87 ActorAddr-(T|:35125)/PID:14654 osbenchmark.actor INFO Main worker_coordinator received ActorExitRequest and will terminate all load generators.
2023-11-06 01:08:23,94 ActorAddr-(T|:42099)/PID:14652 osbenchmark.actor INFO BenchmarkActor received unknown message [ChildActorExited:ActorAddr-(T|:43231)] (ignoring).
2023-11-06 01:08:23,95 ActorAddr-(T|:42099)/PID:14652 osbenchmark.actor INFO BenchmarkActor received unknown message [ChildActorExited:ActorAddr-(T|:35125)] (ignoring).
2023-11-06 01:11:31,811 -not-actor-/PID:14683 osbenchmark.actor INFO Joining already running actor system with system base [multiprocTCPBase].
2023-11-06 01:11:31,832 -not-actor-/PID:14545 root INFO ActorSystem Logging Shutdown
2023-11-06 01:11:31,835 -not-actor-/PID:14544 root INFO ---- Actor System shutdown
From load worker host 1:
2023-11-06 00:59:48,642 -not-actor-/PID:3820 osbenchmark.actor INFO Starting actor system with system base [multiprocTCPBase] and capabilities [{'coordinator': False, 'ip': '{load-host-1-ip}', 'Convention Address.IPv4': '{coordinator-ip}:1900'}].
2023-11-06 00:59:48,663 -not-actor-/PID:3822 root INFO ++++ Actor System gen (3, 10) started, admin @ ActorAddr-(T|:1900)
2023-11-06 01:12:27,341 -not-actor-/PID:3895 osbenchmark.actor INFO Joining already running actor system with system base [multiprocTCPBase].
2023-11-06 01:12:27,361 -not-actor-/PID:3823 root INFO ActorSystem Logging Shutdown
2023-11-06 01:12:27,363 -not-actor-/PID:3822 root INFO ---- Actor System shutdown
From load worker host 2:
2023-11-06 01:00:49,540 -not-actor-/PID:4269 osbenchmark.actor INFO Starting actor system with system base [multiprocTCPBase] and capabilities [{'coordinator': False, 'ip': '{load-host-2-ip}', 'Convention Address.IPv4': '{coordinator-ip}:1900'}].
2023-11-06 01:00:49,562 -not-actor-/PID:4271 root INFO ++++ Actor System gen (3, 10) started, admin @ ActorAddr-(T|:1900)
2023-11-06 01:12:38,784 -not-actor-/PID:4346 osbenchmark.actor INFO Joining already running actor system with system base [multiprocTCPBase].
2023-11-06 01:12:38,804 -not-actor-/PID:4272 root INFO ActorSystem Logging Shutdown
2023-11-06 01:12:38,806 -not-actor-/PID:4271 root INFO ---- Actor System shutdown
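In the coordinator log, the only capability checks that fail are the ones requiring the two load worker IPs, and the load worker logs show no activity between daemon start and shutdown. To make that pattern easier to spot in a longer log, here is a small sketch that counts failed capability checks per required IP. The regular expression is written against the line format shown in the coordinator log; the function name `failed_requirement_ips` is hypothetical and not part of opensearch-benchmark.

```python
import re
from collections import Counter

# Matches the required IP in lines such as:
#   ... Checking capabilities [...] against requirements [{'ip': '10.0.0.2'}] failed.
FAILED_CHECK = re.compile(
    r"Checking capabilities .* against requirements \[\{'ip': '([^']+)'\}\] failed\."
)


def failed_requirement_ips(log_lines):
    """Count failed capability checks per required IP."""
    counts = Counter()
    for line in log_lines:
        match = FAILED_CHECK.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


if __name__ == "__main__":
    # Shortened sample line in the same format as the coordinator log.
    sample = [
        "2023-11-06 01:02:04,936 ActorAddr-(T|:1900)/PID:14544 osbenchmark.actor INFO "
        "Checking capabilities [{'coordinator': True, 'ip': '{coordinator-ip}'}] "
        "against requirements [{'ip': '{load-host-1-ip}'}] failed.",
    ]
    print(failed_requirement_ips(sample))
```

To run it against the real file, pass an open file handle for benchmark.log instead of the sample list. If every load worker IP appears in the result, the coordinator's actor system is likely only seeing its own capabilities, which would be consistent with the workers never having joined it.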