OpenSearch LDAPS troubles (yet again)

Versions (relevant - OpenSearch/Dashboard/Server OS/Browser):
Docker environment with OpenSearch 2.11.1, and (yet again) having trouble with LDAPS.

In June '22 I set up a cluster with OpenSearch 1.3.2 and had trouble with LDAPS. Now I’m setting up a (new) cluster with OpenSearch 2.11.1 and yet again I’m having trouble… Seems it’s not a winning combo for me :smiley:
For reference this was my topic then: Opensearch ldaps
Doubt it’s related but still.
That cluster is still running perfectly fine, with AD activated and working.
The new cluster is giving me headaches :frowning:

The error message:

`[2024-01-15T10:29:13,441][WARN ][c.a.d.a.l.b.LDAPAuthorizationBackend] [opensearch-node3] Unable to connect to ldapserver xxxxxxx:636 due to [org.ldaptive.provider.ConnectionException@788749905::resultCode=PROTOCOL_ERROR, matchedDn=null, responseControls=null, referralURLs=null, messageId=-1, message=javax.naming.CommunicationException: xxxxxx:636 [Root exception is java.net.UnknownHostException: xxxxxxxxx], providerException=javax.naming.CommunicationException: xxxxxxxxxx:636 [Root exception is java.net.UnknownHostException: xxxxxxxx]]. Try next.
[2024-01-15T10:29:13,445][WARN ][o.o.s.a.BackendRegistry  ] [opensearch-node3] Authentication finally failed for xxxxxxx from 10.0.7.55:50128`

My docker compose (it’s in portainer):

version: '3.5'
services:
  opensearch-node1:
    image: opensearchproject/opensearch:2.11.1
    container_name: opensearch-node1
    environment:
      - cluster.name=opensearch-cluster
      - node.name=opensearch-node1
      - discovery.seed_hosts=opensearch-node1,opensearch-node2,opensearch-node3
      - cluster.initial_master_nodes=opensearch-node1
      - bootstrap.memory_lock=false # memory locking is off; set to true (with the memlock ulimits below) to prevent swapping
      - "OPENSEARCH_JAVA_OPTS=-Xms512m -Xmx512m"
    ulimits:
      memlock:
        soft: -1
        hard: -1
      nofile:
        soft: 65536 # maximum number of open files for the OpenSearch user, set to at least 65536 on modern systems
        hard: 65536
    configs:
      - source: ldap_auth_elk
        target: /usr/share/opensearch/config/opensearch-security/config.yml
      - source: opensearch_roles
        target: /usr/share/opensearch/config/opensearch-security/roles_mapping.yml
      - source: cert.pem
        target: /usr/share/opensearch/config/cert.pem
      - source: opensearch_internal_users
        target: /usr/share/opensearch/config/opensearch-security/internal_users.yml
    volumes:
      - /mnt/docker_shared_storage/opensearch/esdata1:/usr/share/opensearch/data
    ports:
      - target: 9200
        published: 9200
        protocol: tcp
        mode: host
    networks:
      - proxy

  opensearch-node2:
    image: opensearchproject/opensearch:2.11.1
    container_name: opensearch-node2
    environment:
      - cluster.name=opensearch-cluster
      - node.name=opensearch-node2
      - discovery.seed_hosts=opensearch-node1,opensearch-node2,opensearch-node3
      - cluster.initial_master_nodes=opensearch-node1
      - bootstrap.memory_lock=false # memory locking is off; set to true (with the memlock ulimits below) to prevent swapping
      - "OPENSEARCH_JAVA_OPTS=-Xms512m -Xmx512m"
    ulimits:
      memlock:
        soft: -1
        hard: -1
      nofile:
        soft: 65536 # maximum number of open files for the OpenSearch user, set to at least 65536 on modern systems
        hard: 65536
    configs:
      - source: ldap_auth_elk
        target: /usr/share/opensearch/config/opensearch-security/config.yml
      - source: opensearch_roles
        target: /usr/share/opensearch/config/opensearch-security/roles_mapping.yml
      - source: cert.pem
        target: /usr/share/opensearch/config/cert.pem
      - source: opensearch_internal_users
        target: /usr/share/opensearch/config/opensearch-security/internal_users.yml
    volumes:
      - /mnt/docker_shared_storage/opensearch/esdata2:/usr/share/opensearch/data
    ports:
      - target: 9200
        published: 9200
        protocol: tcp
        mode: host
    networks:
      - proxy

  opensearch-node3:
    image: opensearchproject/opensearch:2.11.1
    container_name: opensearch-node3
    environment:
      - cluster.name=opensearch-cluster
      - node.name=opensearch-node3
      - discovery.seed_hosts=opensearch-node1,opensearch-node2,opensearch-node3
      - cluster.initial_master_nodes=opensearch-node1
      - bootstrap.memory_lock=false # memory locking is off; set to true (with the memlock ulimits below) to prevent swapping
      - "OPENSEARCH_JAVA_OPTS=-Xms512m -Xmx512m"
    ulimits:
      memlock:
        soft: -1
        hard: -1
      nofile:
        soft: 65536 # maximum number of open files for the OpenSearch user, set to at least 65536 on modern systems
        hard: 65536
    configs:
      - source: ldap_auth_elk
        target: /usr/share/opensearch/config/opensearch-security/config.yml
      - source: opensearch_roles
        target: /usr/share/opensearch/config/opensearch-security/roles_mapping.yml
      - source: cert.pem
        target: /usr/share/opensearch/config/cert.pem
      - source: opensearch_internal_users
        target: /usr/share/opensearch/config/opensearch-security/internal_users.yml
    volumes:
      - /mnt/docker_shared_storage/opensearch/esdata3:/usr/share/opensearch/data
    ports:
      - target: 9200
        published: 9200
        protocol: tcp
        mode: host
    networks:
      - proxy

  opensearch-dashboards:
    image: opensearchproject/opensearch-dashboards:2.11.1
    container_name: opensearch-dashboards
    environment:
      - OPENSEARCH_HOSTS=["https://opensearch-node1:9200","https://opensearch-node2:9200","https://opensearch-node3:9200"]
      - SERVER_BASEPATH=/dash
      - SERVER_REWRITEBASEPATH=true
    deploy: 
      labels:
        - "traefik.enable=true"
        - "traefik.http.services.dash-svc.loadbalancer.server.port=5601"
        - "traefik.domain=xxxxxxxxxx"
        - "traefik.http.routers.dash-rtr.service=dash-svc"
        - "traefik.http.routers.dash-rtr.entrypoints=http"
        - "traefik.docker.network=proxy"
        - "traefik.http.routers.dash-rtr.rule= Host(`xxxxxxxxxxx`) && PathPrefix(`/dash`)" 
        - "traefik.http.services.dash-svc.loadbalancer.passhostheader=true"
        - "traefik.http.routers.dash-rtr.middlewares=cors"
        - "traefik.http.middlewares.cors.headers.accesscontrolallowmethods=GET,OPTIONS,POST"
        - "traefik.http.middlewares.cors.headers.accesscontrolmaxage=100"
        - "traefik.http.middlewares.cors.headers.addvaryheader=true"
        - "traefik.http.middlewares.cors.headers.accessControlAllowOriginList=*"
        - "traefik.http.services.dash-svc.loadbalancer.sticky.cookie=true"
        - "traefik.http.services.dash-svc.loadbalancer.sticky.cookie.samesite=none"
    ports:
      - 5601:5601
    configs:
      - source: opensearch_kibana
        target: /usr/share/opensearch-dashboards/config/opensearch_dashboards.yml    
    networks:
      - proxy

  logstash:
    image: opensearchproject/logstash-oss-with-opensearch-output-plugin:8.9.0
    container_name: logstash
    configs:
     - source: opensearch_logstash
       target: /config-dir/logstash_http_json.conf
    command: logstash -f /config-dir/logstash_http_json.conf
    #volumes:
    #- /mnt/docker/opensearch/logstash:/config-dir
    environment:
      - OPENSEARCH_HOSTS='["https://opensearch-node1:9200","https://opensearch-node2:9200","https://opensearch-node3:9200"]'
    ports:
      - 5043:5043
      - 5044:5044
    networks:
      - proxy

configs:
  ldap_auth_elk:
    external: true
  opensearch_kibana:
    external: true
  opensearch_roles:
    external: true
  opensearch_logstash:
    external: true
  cert.pem: 
    external: true
  opensearch_internal_users:
    external: true
networks:
  proxy:
    external:
      name: proxy

my ldap_auth_elk (config.yml):

---
_meta:
  type: "config"
  config_version: 2

config:
  dynamic:
    http:
      anonymous_auth_enabled: false
    authc:
      internal_auth:
        order: 0
        description: "HTTP basic authentication using the internal user database"
        http_enabled: true
        transport_enabled: true
        http_authenticator:
          type: basic
          challenge: true
        authentication_backend:
          type: internal
      ldap_auth:
        order: 1
        description: "Authenticate using LDAP"
        http_enabled: true
        transport_enabled: true
        http_authenticator:
          type: basic
          challenge: true
        authentication_backend:
          type: ldap
          config:
            enable_ssl: true
            pemtrustedcas_filepath: "cert.pem"
            enable_start_tls: false
            enable_ssl_client_auth: false
            verify_hostnames: true
            hosts:
            - xxxxxxx:636
            bind_dn: 'CN=xx,OU=xx,OU=xxx,DC=xxx,DC=xxx'
            password: 'xxxxxx'
            userbase: 'OU=xxx,OU=xxxx,OU=xxxx,OU=xxxx,DC=xxx,DC=xxxx'
            usersearch: '(mail={0})'
            username_attribute: null
    authz:
      ldap_roles:
        description: "Authorize using LDAP"
        http_enabled: true
        transport_enabled: true
        authorization_backend:
          type: ldap
          config:
            enable_ssl: true
            pemtrustedcas_filepath: "cert.pem"
            enable_start_tls: false
            enable_ssl_client_auth: false
            verify_hostnames: true
            hosts:
            - xxxxxx:636
            bind_dn: 'CN=xxxx,OU=xxxx,OU=xxxx,DC=xxx,DC=xxxx'
            password: 'xxxxxx'
            skip_users:
              - "admin"
              - "kibanaserver"
            rolebase: 'OU=xxx,OU=xxx,OU=xxxx,DC=xxx,DC=xxxx'
            rolesearch: "(member={0})"
            userroleattribute: null
            userrolename: disabled
            rolename: cn
            resolve_nested_roles: true

I’m presuming it’s something stupid (yet again) but I don’t see it.

When I log in with my local admin account, I can see the config is being loaded, but I get the error above whenever I try to log in with my domain account…
The cert.pem is the same cert file I’m using in the other (working) cluster.
I can ping and resolve my LDAP server from my Linux (Red Hat) host.

Any suggestions are welcome, let me know if you need additional config files.

@Scarecrow The error refers to the LDAP hostname.

"Root exception is java.net.UnknownHostException".

Can you resolve the LDAP server’s FQDN from all OpenSearch nodes?
How do you resolve DNS names on the host? Is it a DNS server or the /etc/hosts file? If DNS, try adding the LDAP server to /etc/hosts.

Also, I’ve compared your configs and found that you’ve enabled hostname verification in authc.

verify_hostnames: true

The Linux host uses the nameserver from /etc/resolv.conf to resolve the hostname. The /etc/hosts (on the Linux host) is “empty”, as in only localhost is in there.

I’m not sure how I can resolve the LDAP server inside the Docker container?

@Scarecrow Have you tried adding the LDAP FQDN to /etc/hosts?

The default name resolution order in Linux (set by the hosts line in /etc/nsswitch.conf, typically "files dns") should be:

  1. /etc/hosts
  2. /etc/resolv.conf

I had issues with my /etc/resolv.conf a few times in the docker container so I decided to use /etc/hosts.
Just check if this is the case in your scenario.

Also, have you tried running ping or nslookup inside the Docker container? curl should also do the job.
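A minimal sketch of such an in-container check, assuming the image ships `bash`, `getent` and `timeout` (not verified for the OpenSearch image). The function name `check_endpoint` is made up for illustration. It separates the two failure modes: a name that does not resolve (what Java reports as UnknownHostException) and a TCP connection that cannot be opened. Keep in mind that a container has its own /etc/hosts, so an entry added on the Docker host is not automatically visible inside it.

```shell
# check_endpoint HOST PORT
# Prints "unresolved", "unreachable" or "ok" for the given endpoint.
check_endpoint() {
  # Step 1: name resolution through NSS (covers /etc/hosts and DNS).
  # "ahosts" uses getaddrinfo, so a literal IP address also passes this step.
  if ! getent ahosts "$1" >/dev/null 2>&1; then
    echo "unresolved"
    return 1
  fi
  # Step 2: plain TCP connect via bash's /dev/tcp redirection, 5 second limit.
  # This only proves the port is reachable; it does not test TLS or LDAP.
  if ! timeout 5 bash -c "exec 3<>/dev/tcp/$1/$2" 2>/dev/null; then
    echo "unreachable"
    return 2
  fi
  echo "ok"
}
```

To run it inside a node, one option is to save the function to a file together with a final line such as `check_endpoint ldap.example.com 636` (placeholder hostname) and feed it to the container with `docker exec -i opensearch-node3 bash < check.sh`.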

Updated the /etc/hosts on the Linux host with the IP address and the hostname.
Restarted the stack, tried logging in: same error.

Tried to do curl from inside the Docker container:
curl hostname => could not resolve host
curl ip address => failed to connect to xx.xx.xx.xx port 636 after 1005ms: couldn’t connect to server

Not sure if the curl command is the right one (I meant without variables)?

Edit: can I try to update the config.yml to use the IP address instead of the FQDN? It’s a load balancer, so the address shouldn’t change much, if at all.

I adapted the config.yml to use the IP address instead of the FQDN:

[2024-01-15T15:19:48,960][WARN ][c.a.d.a.l.b.LDAPAuthorizationBackend] [opensearch-node1] Unable to connect to ldapserver xx.xx.xx.xx:636 due to [org.ldaptive.provider.ConnectionException@602115874::resultCode=PROTOCOL_ERROR, matchedDn=null, responseControls=null, referralURLs=null, messageId=-1, message=javax.naming.CommunicationException: xx.xx.xx.xx:636 [Root exception is java.net.NoRouteToHostException: No route to host], providerException=javax.naming.CommunicationException: xx.xx.xx.xx:636 [Root exception is java.net.NoRouteToHostException: No route to host]]. Try next.

So it seems something different is going on that I don’t understand :confused:

Alrighty, some Google time later, and after a rabbit hole of firewalld rules that I added, that borked everything, and that I deleted again, I’ve found the culprit :slight_smile:

After checking the “No route to host” error I came across this: Docker - No route to host - Stack Overflow

More specifically:

firewall-cmd --permanent --zone=public --add-rich-rule='rule family=ipv4 source address=172.27.0.0/16 accept'
firewall-cmd --reload

That borked everything completely.
I removed the rule and rebooted the 3 hosts.
After checking the firewalld logs I saw this:

docker-ingress failed iptables no chain/target/match by that name

Down the Google rabbit hole:
Iptables no chain/target/match by that name docker - Quick fix!
Basically, the docker service starts up before the firewalld service does. The docker service adds its firewall rules at startup, but since firewalld isn’t running yet, the rules can’t be added, and the containers therefore have no way of getting out.
A simple `service docker restart` fixes everything :smiley:
(I had to revert to the FQDN instead of the IP address, since the cert is issued for the hostname and obviously doesn’t match the IP address.)
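For anyone who hits the same thing, here is a rough sketch of how to check whether Docker's chains are present before restarting. It assumes the chains show up as `-N <name>` lines in `iptables -S` output. The chain names `DOCKER` and `DOCKER-INGRESS` are taken from the error above and from typical swarm hosts, and the helper names are made up. Hosts where firewalld uses a different backend may list rules differently.

```shell
# has_chain NAME RULES
# Succeeds if RULES (text in `iptables -S` format) declares a
# user-defined chain called NAME, i.e. contains the exact line "-N NAME".
has_chain() {
  printf '%s\n' "$2" | grep -qx -- "-N $1"
}

# docker_chains_status RULES
# Prints "ok" if both chains are declared, otherwise "missing: <names>".
docker_chains_status() {
  missing=""
  for chain in DOCKER DOCKER-INGRESS; do
    has_chain "$chain" "$1" || missing="$missing $chain"
  done
  if [ -n "$missing" ]; then
    echo "missing:$missing"
  else
    echo "ok"
  fi
}
```

On a host this could be called as `docker_chains_status "$(sudo iptables -t nat -S)"`. If it reports missing chains, restarting the docker service as described above should recreate them.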

So case closed and fixed. Thanks again @pablo :slight_smile:

@Scarecrow Thanks for sharing the solution.