Extremely high latencies with TLS

Hi,
I have an 8-node OpenSearch cluster (3 master + 5 data nodes). The cluster has inter-node TLS communication enabled on port 50141 and accepts external requests on port 50140, also over TLS (HTTPS).

For the same query on the same cluster, I consistently see roughly 50x-100x higher latency with TLS (external + inter-node) than without it (no external TLS, no inter-node TLS).

For example, the index search query below takes about 10-35 seconds with TLS enabled vs. 150-300 ms without:

With TLS:
curl -vv -u admin:xxxx -X GET "https://xxx.xxx.com:50140/pm_*/_search?size=0&pretty" -k

Without TLS:
curl -vv -u admin:xxxx -X GET "http://xxx.xxx.com:50140/pm_*/_search?size=0&pretty"
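
To narrow down where the extra time is spent, here is a rough sketch of how the request can be broken into phases with curl's timing variables (same placeholder host and credentials as above); if time_appconnect dominates, the TLS handshake itself is slow, while a large time_starttransfer points at server-side processing:

```
# Split the request into phases; time_appconnect is when the TLS handshake completed.
curl -sk -u admin:xxxx -o /dev/null \
  -w 'dns=%{time_namelookup}s tcp=%{time_connect}s tls=%{time_appconnect}s ttfb=%{time_starttransfer}s total=%{time_total}s\n' \
  "https://xxx.xxx.com:50140/pm_*/_search?size=0"
```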

All kinds of requests show such extremely high latencies, so this does not appear to be related to the query or data type.

The cluster has about 680 indices with 7 primary shards per index on average, and replicas set to 1.
(I understand the shard density per node is quite high; however, the latency comparison here is on the same cluster.)
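
For context, the resulting shard totals can be confirmed from cluster health (rough sketch, placeholder host; with ~680 indices, ~7 primaries each and 1 replica I would expect on the order of 9,500 active shards):

```
# active_primary_shards and active_shards report the totals across the cluster.
curl -sk -u admin:xxxx "https://xxx.xxx.com:50140/_cluster/health?pretty"
```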

Is such an extreme degradation expected?

security plugin config snippet:
```yaml
## Security Plugin Configurations ##

# enable advanced features
plugins.security.advanced_modules_enabled: true

# inter-node TLS configs
plugins.security.ssl.transport.keystore_filepath: opensearch-keystore.jks
plugins.security.ssl.transport.truststore_filepath: opensearch-truststore.jks
plugins.security.ssl.transport.enforce_hostname_verification: true
plugins.security.ssl.transport.resolve_hostname: true
plugins.security.ssl.transport.enabled_protocols:
  - "TLSv1.3"
  - "TLSv1.2"

plugins.security.nodes_dn: ["CN=dataxyz*", "CN=masterxyz*"]

# support dynamic management of the whitelisted nodes_dn
plugins.security.nodes_dn_dynamic_config_enabled: true

plugins.security.config_index_name: .security
plugins.security.allow_default_init_securityindex: true

# secure client communication configs
plugins.security.ssl.http.enabled: true
plugins.security.ssl.http.keystore_filepath: opensearch-keystore.jks
plugins.security.ssl.http.truststore_filepath: opensearch-truststore.jks
plugins.security.ssl.http.clientauth_mode: OPTIONAL
plugins.security.ssl.http.enabled_protocols:
  - "TLSv1.3"
  - "TLSv1.2"

# enable role based access to the REST management API
plugins.security.restapi.roles_enabled: ["all_access"]

# password policy
plugins.security.restapi.password_validation_regex: '(?=.*[A-Z])(?=.*[^a-zA-Z\d])(?=.*[0-9])(?=.*[a-z]).{8,}'
plugins.security.restapi.password_validation_error_message: "Password must be minimum 8 characters long and must contain at least one uppercase letter, one lowercase letter, one digit, and one special character."

# enable audit logs
plugins.security.audit.type: log4j
plugins.security.audit.config.log4j.logger_name: audit_log
plugins.security.audit.config.log4j.level: INFO
```
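
In case it helps, the raw handshake can also be timed outside of OpenSearch against both ports, to check whether the handshake itself or the server-side processing is slow (sketch only; hostname is a placeholder):

```
# Time a bare TLS handshake and print the negotiated protocol/cipher
# against the HTTP (50140) and transport (50141) ports.
time openssl s_client -connect xxx.xxx.com:50140 < /dev/null
time openssl s_client -connect xxx.xxx.com:50141 < /dev/null
```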

Hi @Tysonheart

Could you please share:

  • whether you are running the cluster in containers
  • what JDK version you are using
  • if possible, the nodes hot threads output [1]

I suspect you are running into [2], which causes SSL/TLS to underperform.

[1] Nodes hot threads - OpenSearch documentation
[2] Avoiding JVM Delays Caused by Random Number Generation
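
If [2] turns out to be the cause, a commonly used workaround is to point the JVM at the non-blocking /dev/urandom source; a sketch, assuming the stock jvm.options layout of the distribution:

```
# config/jvm.options
# Use the non-blocking urandom device for SecureRandom seeding
# (the extra "/./" is needed for the property to take effect on some JDKs).
-Djava.security.egd=file:/dev/./urandom
```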

Hi @reta

  • The cluster is deployed on physical servers; however, the same behavior is observed on another cluster running on virtual machines (OCI cloud) as well.
  • JDK: we use the Azul Zing JDK.
    $ java -version
    java version "11.0.14.1.101" 2022-04-16 LTS
    Java Runtime Environment Zing22.02.100.0+2 (build 11.0.14.1.101+3-LTS)
    Zing 64-Bit Tiered VM Zing22.02.100.0+2 (build 11.0.14.1.101-zing_22.02.100.0-b2-product-linux-X86_64, mixed mode)

We ruled out GC issues in both cases (TLS and non-TLS).

Operating system is Oracle Enterprise Linux 7.9.

  • Hot threads: I will get back on this after switching the cluster back to TLS.

We use the Prometheus exporter plugin, and surprisingly the Prometheus metrics did not show such high latencies. Queues were not overloaded, and the index/search latency metrics did not reflect the values we observed with the curl command line. That has really confused us.
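
To cross-check the exporter numbers against OpenSearch's own counters, I am planning to compare the curl timings with the per-node search stats, roughly like this (placeholder host):

```
# Cumulative server-side search stats per node; query_time_in_millis / query_total
# gives an average server-side latency to compare with the client-side curl numbers.
curl -sk -u admin:xxxx "https://xxx.xxx.com:50140/_nodes/stats/indices/search?pretty"
```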

Also,
$ cat /opt/zing/zing-jdk11/conf/security/java.security | grep securerandom.source=
securerandom.source=file:/dev/random

$ time head -n 1 /dev/random

real 0m0.001s
user 0m0.000s
sys 0m0.001s
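
A related check, assuming the standard procfs location, is the kernel's available-entropy estimate; values persistently near zero would mean reads from /dev/random can block:

```
# Estimate of the kernel entropy pool size; consistently very low values
# would indicate that /dev/random reads may block.
cat /proc/sys/kernel/random/entropy_avail
```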

Thanks @Tysonheart, very puzzling; I hope the hot threads will give us a hint.

Hi @reta
Below (link) is the hot threads output, captured 10 times sequentially with a 2-second sleep between executions.

I couldn't find anything in particular except for plenty of calls to SecurityManager checkPermission. Is that something that could introduce such latencies? We don't have any custom security policies configured.

command:
for i in {1..10}; do curl -k -u admin:xxxx -XGET 'https://datanode110.xyz.com:50140/_nodes/hot_threads?pretty' >> hotthreads-2secondinterval.log; sleep 2; done

Cluster info:
10 OpenSearch nodes running OpenSearch 1.2 on Azul Zing JDK 11.

Hi @Tysonheart ,

Thanks a lot for the details. It looks very weird, and indeed there are no pointers to TLS/SSL issues along the network stack. The calls to SecurityManager::checkPermission would happen even if TLS/SSL is disabled, unless the security plugin is not installed in that setup; or is it?

@reta The security plugin is bundled and installed here.

@Tysonheart I am really puzzled by this issue. Could you please try non-search endpoints (say, /_cat/nodes) and check whether a similar slowdown is observed:

curl -vv -u admin:xxxx -X GET "https://xxx.xxx.com:50140/_cat/nodes" -k
curl -vv -u admin:xxxx -X GET "http://xxx.xxx.com:50140/_cat/nodes"

Besides that, could you please try with the security cache disabled:

plugins.security.cache.ttl_minutes: 0

Thank you.

Update here:
I upgraded to OpenSearch 2.2.1 and that fixed the high latencies. I am still not sure what exactly caused the issue, but it looks like we are past the problem. My guess is that a bug fix in the security plugin resolved it, but I can't put my finger on which one.

The cache TTL setting did not make much difference; if anything, it increased the latencies slightly.
Appreciate your assistance, @reta.

The _cat/nodes API also showed degradation, but not as severe as the other endpoints: about 2x the latency.