Extremely high latencies with TLS

Hi,
I have an 8-node OpenSearch cluster (3 master + 5 data nodes). The cluster has inter-node TLS communication enabled on port 50141 and accepts external requests on port 50140, also over TLS (HTTPS).

For the same query on the same cluster, I consistently see roughly 50x-100x higher latency with TLS (external + inter-node) than without it (no external TLS, no inter-node TLS).

For example, the index search query below takes about 10-35 seconds with TLS enabled vs. 150-300 ms without:

With TLS:
curl -vv -u admin:xxxx -X GET "https://xxx.xxx.com:50140/pm_*/_search?size=0&pretty" -k

Without TLS:
curl -vv -u admin:xxxx -X GET "http://xxx.xxx.com:50140/pm_*/_search?size=0&pretty"
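
To narrow down where the extra time is spent, here is a rough sketch of how the request can be broken into phases with curl's timing variables (same placeholder host and credentials as above); if time_appconnect dominates, the TLS handshake itself is slow, while a large time_starttransfer points at server-side processing:

```
# Split the request into phases; time_appconnect is when the TLS handshake completed.
curl -sk -u admin:xxxx -o /dev/null \
  -w 'dns=%{time_namelookup}s tcp=%{time_connect}s tls=%{time_appconnect}s ttfb=%{time_starttransfer}s total=%{time_total}s\n' \
  "https://xxx.xxx.com:50140/pm_*/_search?size=0"
```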

All kinds of requests show such extremely high latencies, so this does not appear to be related to the query or data type.

The cluster has about 680 indices with 7 primary shards per index on average, and replicas set to 1.
(I understand the shard density per node is quite high; however, the latency comparison here is on the same cluster.)
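
For context, the resulting shard totals can be confirmed from cluster health (rough sketch, placeholder host; with ~680 indices, ~7 primaries each and 1 replica I would expect on the order of 9,500 active shards):

```
# active_primary_shards and active_shards report the totals across the cluster.
curl -sk -u admin:xxxx "https://xxx.xxx.com:50140/_cluster/health?pretty"
```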

Is such an extreme degradation expected?

security plugin config snippet:
```yaml
## Security Plugin Configurations ##

# enable advanced features
plugins.security.advanced_modules_enabled: true

# inter-node TLS configs
plugins.security.ssl.transport.keystore_filepath: opensearch-keystore.jks
plugins.security.ssl.transport.truststore_filepath: opensearch-truststore.jks
plugins.security.ssl.transport.enforce_hostname_verification: true
plugins.security.ssl.transport.resolve_hostname: true
plugins.security.ssl.transport.enabled_protocols:
  - "TLSv1.3"
  - "TLSv1.2"

plugins.security.nodes_dn: ["CN=dataxyz*", "CN=masterxyz*"]

# support dynamic management of the whitelisted nodes_dn
plugins.security.nodes_dn_dynamic_config_enabled: true

plugins.security.config_index_name: .security
plugins.security.allow_default_init_securityindex: true

# secure client communication configs
plugins.security.ssl.http.enabled: true
plugins.security.ssl.http.keystore_filepath: opensearch-keystore.jks
plugins.security.ssl.http.truststore_filepath: opensearch-truststore.jks
plugins.security.ssl.http.clientauth_mode: OPTIONAL
plugins.security.ssl.http.enabled_protocols:
  - "TLSv1.3"
  - "TLSv1.2"

# enable role based access to the REST management API
plugins.security.restapi.roles_enabled: ["all_access"]

# password policy
plugins.security.restapi.password_validation_regex: '(?=.*[A-Z])(?=.*[^a-zA-Z\d])(?=.*[0-9])(?=.*[a-z]).{8,}'
plugins.security.restapi.password_validation_error_message: "Password must be minimum 8 characters long and must contain at least one uppercase letter, one lowercase letter, one digit, and one special character."

# enable audit logs
plugins.security.audit.type: log4j
plugins.security.audit.config.log4j.logger_name: audit_log
plugins.security.audit.config.log4j.level: INFO
```
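
In case it helps, the raw handshake can also be timed outside of OpenSearch against both ports, to check whether the handshake itself or the server-side processing is slow (sketch only; hostname is a placeholder):

```
# Time a bare TLS handshake and print the negotiated protocol/cipher
# against the HTTP (50140) and transport (50141) ports.
time openssl s_client -connect xxx.xxx.com:50140 < /dev/null
time openssl s_client -connect xxx.xxx.com:50141 < /dev/null
```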

Hi @Tysonheart

Could you please share:

  • whether you are running the cluster in containers
  • what JDK version you are using
  • if possible, the nodes hot threads output [1]

I suspect you are running into [2], which causes SSL/TLS to underperform.

[1] Nodes hot threads - OpenSearch documentation
[2] Avoiding JVM Delays Caused by Random Number Generation
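
If [2] turns out to be the cause, a commonly used workaround is to point the JVM at the non-blocking /dev/urandom source; a sketch, assuming the stock jvm.options layout of the distribution:

```
# config/jvm.options
# Use the non-blocking urandom device for SecureRandom seeding
# (the extra "/./" is needed for the property to take effect on some JDKs).
-Djava.security.egd=file:/dev/./urandom
```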

Hi @reta

  • The cluster is deployed on physical servers; however, the same behavior is observed on another cluster running on virtual machines (OCI cloud) as well.
  • JDK: we use the Azul Zing JDK.
    $ java -version
    java version "11.0.14.1.101" 2022-04-16 LTS
    Java Runtime Environment Zing22.02.100.0+2 (build 11.0.14.1.101+3-LTS)
    Zing 64-Bit Tiered VM Zing22.02.100.0+2 (build 11.0.14.1.101-zing_22.02.100.0-b2-product-linux-X86_64, mixed mode)

We ruled out GC issues in both cases (TLS and non-TLS).

Operating system is Oracle Enterprise Linux 7.9.

  • Hot threads: I will get back on this after switching the cluster back to TLS.

We use the Prometheus exporter plugin, and surprisingly the Prometheus metrics did not show such high latencies. Queues were not overloaded, and the index/search latency metrics did not reflect the values we observed with the curl command line. That has really confused us.
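
To cross-check the exporter numbers against OpenSearch's own counters, I am planning to compare the curl timings with the per-node search stats, roughly like this (placeholder host):

```
# Cumulative server-side search stats per node; query_time_in_millis / query_total
# gives an average server-side latency to compare with the client-side curl numbers.
curl -sk -u admin:xxxx "https://xxx.xxx.com:50140/_nodes/stats/indices/search?pretty"
```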

Also,
$ cat /opt/zing/zing-jdk11/conf/security/java.security | grep securerandom.source=
securerandom.source=file:/dev/random

$ time head -n 1 /dev/random

real 0m0.001s
user 0m0.000s
sys 0m0.001s
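
A related check, assuming the standard procfs location, is the kernel's available-entropy estimate; values persistently near zero would mean reads from /dev/random can block:

```
# Estimate of the kernel entropy pool size; consistently very low values
# would indicate that /dev/random reads may block.
cat /proc/sys/kernel/random/entropy_avail
```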

Thanks @Tysonheart, very puzzling; I hope the hot threads will give us a hint.

Hi @reta
Below (link) is the hot threads output, captured 10 times sequentially with a 2-second sleep between executions.

I couldn't find anything in particular except for plenty of calls to SecurityManager checkPermission. Is that something that could introduce such latencies? We don't have any custom security policies configured.

command:
for i in {1..10}; do curl -k -u admin:xxxx -XGET 'https://datanode110.xyz.com:50140/_nodes/hot_threads?pretty' >> hotthreads-2secondinterval.log; sleep 2; done

Cluster info:
10 OpenSearch nodes running OpenSearch 1.2 on Azul Zing JDK 11.

Hi @Tysonheart ,

Thanks a lot for the details. It looks very weird, and indeed there are no pointers to TLS/SSL issues along the network stack. The calls to SecurityManager::checkPermission would happen even if TLS/SSL is disabled, unless the security plugin is not installed in that setup; or is it?

@reta The security plugin is bundled and installed here.

@Tysonheart I am really puzzled by this issue. Could you please try non-search endpoints (say, /_cat/nodes) and check whether a similar slowdown is observed:

curl -vv -u admin:xxxx -X GET "https://xxx.xxx.com:50140/_cat/nodes" -k
curl -vv -u admin:xxxx -X GET "http://xxx.xxx.com:50140/_cat/nodes"

Besides that, could you please try with the security cache disabled:

plugins.security.cache.ttl_minutes: 0

Thank you.

Update here:
I upgraded to OpenSearch 2.2.1 and that fixed the high latencies. I am still not sure what exactly caused the issue, but it looks like we are past the problem. My guess is that a bug fix in the security plugin resolved it, but I can't put my finger on which one.

The cache TTL setting did not make much difference; if anything, it increased the latencies slightly.
Appreciate your assistance, @reta.

The _cat/nodes API also showed degradation, but not as severe as the other endpoints: about 2x the latency.