I’m posting this question here because I suppose it also concerns OpenSearch.
I noticed some strange behavior when multiple clients (for example, a service scaled up by the autoscaler in Kubernetes) connect to an Elasticsearch/Open Distro cluster. The connections were accepted slowly and were eventually rejected because the TCP backlog was full. After some investigation I noticed that connections using client certificates are established more slowly than those without a client certificate.
A small test script visualizes the difference. It tries to establish 5000 connections and sends /_cluster/health every 15 seconds; on a timeout it retries after 5 seconds (‘ok’ means the /_cluster/health request was successful, ‘connections’ is the number of established connections).
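The script itself is not reproduced here. As a rough idea of what such a test can look like, here is a minimal sketch using only the Python standard library. It is not the original script: the function names, the one-thread-per-connection design, and the command-line arguments are assumptions made for illustration, and it reads each response with a single `recv`, which is a simplification.

```python
import socket
import ssl
import sys
import threading

HEALTH_PATH = "/_cluster/health"


def make_context(ca_file=None, client_cert=None, client_key=None):
    """Build a TLS context; the client certificate is optional."""
    ctx = ssl.create_default_context(cafile=ca_file)
    if client_cert:
        ctx.load_cert_chain(certfile=client_cert, keyfile=client_key)
    return ctx


def health_request(host):
    """Raw HTTP/1.1 request that keeps the connection open."""
    return (f"GET {HEALTH_PATH} HTTP/1.1\r\n"
            f"Host: {host}\r\nConnection: keep-alive\r\n\r\n").encode("ascii")


def is_ok(response):
    """True if the response starts with a 200 status line."""
    return response.startswith(b"HTTP/1.1 200")


class Stats:
    """Thread-safe counters for successful requests and open connections."""

    def __init__(self):
        self._lock = threading.Lock()
        self.ok = 0
        self.connections = 0

    def add(self, ok=0, connections=0):
        with self._lock:
            self.ok += ok
            self.connections += connections

    def snapshot(self):
        with self._lock:
            return self.ok, self.connections


def worker(host, port, ctx, stats, stop, interval=15, timeout=15, retry_delay=5):
    """Hold one connection open, poll health, reconnect after an error."""
    while not stop.is_set():
        try:
            with socket.create_connection((host, port), timeout=timeout) as raw:
                with ctx.wrap_socket(raw, server_hostname=host) as tls:
                    stats.add(connections=1)
                    try:
                        while not stop.is_set():
                            tls.sendall(health_request(host))
                            # Simplification: assume one recv holds the reply.
                            if is_ok(tls.recv(65536)):
                                stats.add(ok=1)
                            stop.wait(interval)
                    finally:
                        stats.add(connections=-1)
        except OSError:  # covers timeouts, refused connections, TLS errors
            stop.wait(retry_delay)


if __name__ == "__main__":
    # Hypothetical usage: script.py HOST PORT CA_FILE [CLIENT_CERT CLIENT_KEY]
    host, port, ca_file = sys.argv[1], int(sys.argv[2]), sys.argv[3]
    ctx = make_context(ca_file, *sys.argv[4:6])
    stats, stop = Stats(), threading.Event()
    for _ in range(5000):
        threading.Thread(target=worker, args=(host, port, ctx, stats, stop),
                         daemon=True).start()
    try:
        while not stop.wait(5):
            print("ok=%d connections=%d" % stats.snapshot())
    except KeyboardInterrupt:
        stop.set()
```

Running it once with and once without the client certificate arguments gives the two cases being compared. Here ‘ok’ is a running total of successful requests, which may differ from how the original script counted it.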
According to that config, you’re using LDAP with an SSL certificate. As far as I understand, you were testing LDAP with and without a secured connection (SSL cert). Without a secured connection (plain LDAP, port 389) you have no performance issues (no timeouts). With the SSL cert enabled (LDAPS, port 636) you get timeouts on some requests.
I’m testing SSL-encrypted connections to Elasticsearch, with and without a client cert. I assume the LDAP server should never be queried, because the client cert names and the anonymous user are in the skip_users list.
skip_users:
- <redacted:various internal_users>
- <redacted:*.domain which matches the client certs>
- "opendistro_security_anonymous"
skip_users only applies to authorization. The plugin will still try to authenticate client certs with LDAP and basic authentication. Could you try to change the authentication order as per the below:
I’ll test the new order. However, in my tests I’ve only compared client-cert vs. without-client-cert. In none of these tests was a basic authentication header sent. I would expect that in this case the authc LDAP part is ignored.
This was a hint in the right direction. Removing the authz->ldap section made the client certificate requests fast. This is still confusing, as no significant number of LDAP requests is visible with tcpdump.
Do you have any idea how to limit the LDAP role lookup to LDAP users?
@sezuan2 if you change the authentication order, with LDAP being last, the lookup should only be done if LDAP is used, meaning basic_auth and client_cert failed.
Yes, but it didn’t help. I also removed the LDAP section from authentication, but that didn’t help either. For some unknown reason, it seems to do an LDAP role lookup for client certificate users but not for basic-authenticated users.
@sezuan2 after looking into this further, I can see that the call to LDAP is performed by design even for cert users. However, I am not able to reproduce the delay that you are experiencing. You should be able to skip users using a wildcard (like you have with *.domain); this needs to match the full CN. Can you try using "*" as a starting point to see if this skips the LDAP section altogether, and work backwards from there?
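To illustrate why the pattern has to cover the whole name, here is a small sketch using Python’s `fnmatch` as a stand-in for the plugin’s wildcard matching. This is only an approximation: the plugin’s own matcher may behave differently, the helper `is_skipped` is made up for this example, and the user names are hypothetical.

```python
from fnmatch import fnmatchcase


def is_skipped(user_name, skip_patterns):
    """True if the whole user name matches any of the skip patterns."""
    return any(fnmatchcase(user_name, pattern) for pattern in skip_patterns)


patterns = ["*.domain", "opendistro_security_anonymous"]

# Matches: the user name itself ends in ".domain".
print(is_skipped("client1.domain", patterns))

# No match: if the user name is a full DN, "*.domain" does not cover the
# trailing ",OU=ops,O=example" part.
print(is_skipped("CN=client1.domain,OU=ops,O=example", patterns))

# A catch-all wildcard matches every user name.
print(is_skipped("CN=client1.domain,OU=ops,O=example", ["*"]))
```

If the catch-all pattern makes the slowdown disappear, the pattern can then be narrowed step by step until it matches exactly the name the plugin derives from the certificate.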
I don’t think it’s caused by the LDAP lookup itself. When you check the flamegraphs above, you see a suspiciously large amount of time spent in getEntry and lockedGetOrLoad. It’s strange that this doesn’t happen with basic auth. I also removed all skip_users and retested with basic auth, but the request time was still good.
If it’s really caused by lock issues, a high number of threads is probably required to reproduce it. I’m testing this on a cluster whose nodes have 52 cores/104 threads.