Document Level Security at scale

Versions
OpenSearch v2.11.0

Describe the issue:
I have 1000 indices spanning the last 40 days, storing 10 TB (including 1 replica) under the patterns logs-webeos-*, logs-paas-*, logs-app-*. They contain documents like the one below; the important fields are data.cluster_name and data.namespace.

"data": {
  "server_name": "my_server:8080",
  "srvconn": "0",
  "time_backend_response": "0",
  "actconn": "1",
  "time_queue": "0",
  "pid": "1175",
  "program": "haproxy",
  "http_verb": "GET",
  "client_port": "48760",
  "syslog_timestamp": "Nov 27 00:00:37",
  "backend_name": "name:health-checks",
  "beconn": "0",
  "client_ip": "my_client_ip",
  "captured_response_cookie": "-",
  "haproxy_log_type": "HTTP_logs",
  "cluster_name": "test_cluster",
  "http_status_code": "200",
  "captured_request_cookie": "-",
  "termination_state": "--NI",
  "feconn": "1",
  "srv_queue": "0",
  "syslog_server": "my_syslog_server",
  "http_version": "1.1",
  "bytes_read": "821",
  "captured_request_headers": "my_captured_request_headers",
  "retries": "0",
  "backend_queue": "0",
  "time_request": "0",
  "accept_date": "27/Nov/2023:00:00:37.370",
  "namespace": "ingress-health-checks",
  "frontend_name": "public",
  "time_duration": "0",
  "http_request": "/",
  "time_backend_connect": "0"
},
"metadata": {
  "partition": "24",
  "type_prefix": "logs",
  "kafka_timestamp": 1701043240883,
  "host": "11.86.8.9",
  "json": "true",
  "producer": "openshift",
  "topic": "openshift_logs",
  "_id": "15454bf1-d105-c8d3-9912-36a9556123d5",
  "type": "test",
  "timestamp": 1701043237000
}

I then want to set up Document Level Security (DLS) so that people see only their own documents. This is the DLS query to achieve that:

{
  "bool": {
    "must": [
      {
        "term": {
          "data.cluster_name": "$CLUSTER_NAME"
        }
      },
      {
        "term": {
          "data.namespace": "$NAMESPACE"
        }
      }
    ]
  } 
}

And this is an example role, let’s name it my_role1:

"cluster_permissions": [],
"index_permissions": [
  {
    "index_patterns": [
      "logs-webeos-*", "logs-paas-*", "logs-app-*"
    ],
    "dls": $DLS_AS_DEFINED_ABOVE,
    "allowed_actions": ["read"],
    "fls": [],
    "masked_fields": []
  }
],
"tenant_permissions": [
  {
    "tenant_patterns": ["global_tenant"],
    "allowed_actions": ["kibana_all_read"]
  }
]
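With thousands of roles to create, the role body above would in practice be generated rather than written by hand. The following is a minimal sketch of that, assuming Python; the helper names build_dls_query and build_role are hypothetical and not part of any OpenSearch client. It only builds the request body for PUT _plugins/_security/api/roles/<role_name> and sends nothing. Note that the Security REST API expects the dls value as a string containing the JSON query, hence the json.dumps call.

```python
import json

# Index patterns taken from the role definition above.
INDEX_PATTERNS = ["logs-webeos-*", "logs-paas-*", "logs-app-*"]


def build_dls_query(cluster_name: str, namespace: str) -> dict:
    """Return the DLS bool query restricting documents to one cluster/namespace."""
    return {
        "bool": {
            "must": [
                {"term": {"data.cluster_name": cluster_name}},
                {"term": {"data.namespace": namespace}},
            ]
        }
    }


def build_role(cluster_name: str, namespace: str) -> dict:
    """Return a role body equivalent to my_role1 for the given pair.

    The "dls" field is serialized to a string, because the role definition
    stores the query as escaped JSON text rather than as a nested object.
    """
    return {
        "cluster_permissions": [],
        "index_permissions": [
            {
                "index_patterns": list(INDEX_PATTERNS),
                "dls": json.dumps(build_dls_query(cluster_name, namespace)),
                "allowed_actions": ["read"],
                "fls": [],
                "masked_fields": [],
            }
        ],
        "tenant_permissions": [
            {
                "tenant_patterns": ["global_tenant"],
                "allowed_actions": ["kibana_all_read"],
            }
        ],
    }


if __name__ == "__main__":
    # Example values taken from the sample document above.
    print(json.dumps(build_role("test_cluster", "ingress-health-checks"), indent=2))
```

Generating the bodies from one template also makes it easy to experiment with a stripped-down variant (for example, without tenant_permissions) across all roles at once.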

I have many of these roles, each defining the appropriate $CLUSTER_NAME and $NAMESPACE. Since I have integrated LDAP for authorization, I map each role to the appropriate people through backend_roles and/or users: sometimes only a project owner exists, and other times there is a project admin group.

{
  "users": [
    $PROJECT_OWNER
  ],
  "backend_roles": [
    $PROJECT_ADMIN_GROUP
  ],
  "hosts": []
}

So, a PUT _plugins/_security/api/rolesmapping/my_role1 with the above body does the trick.
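The mapping step can be sketched the same way. This is a minimal illustration, assuming Python; build_role_mapping and role_mapping_path are hypothetical helper names, and the owner and group values in the example are made up. It covers the case where only a project owner, only an admin group, or both exist, and returns the body and path for the PUT request without sending it.

```python
import json
from typing import Optional
from urllib.parse import quote


def build_role_mapping(project_owner: Optional[str] = None,
                       project_admin_group: Optional[str] = None) -> dict:
    """Return a role mapping body; missing owner or group yields an empty list."""
    return {
        "users": [project_owner] if project_owner else [],
        "backend_roles": [project_admin_group] if project_admin_group else [],
        "hosts": [],
    }


def role_mapping_path(role_name: str) -> str:
    """Return the Security API path for the role mapping of role_name."""
    return "_plugins/_security/api/rolesmapping/" + quote(role_name, safe="")


if __name__ == "__main__":
    # Hypothetical owner and group, for illustration only.
    body = build_role_mapping("some_owner", "some_admin_group")
    print("PUT", role_mapping_path("my_role1"))
    print(json.dumps(body, indent=2))
```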

I have 6 data nodes, 3 master nodes and 3 client nodes supporting this cluster.

  • Each data node has 31 GB RAM, 16 vCPUs and 2560 GB of local SSD disk space.
  • Each master node has 4 GB RAM and 2 vCPUs.
  • Each client node has 16 GB RAM and 8 vCPUs.

Scaling problem
I want to create a total of 5000 roles and their respective role mappings to make my DLS scenario work. However, already at around 300 roles and mappings, the data nodes become too busy (going over 90% heap utilisation) and can no longer cope with the ingestion load.

Do you have any scaling suggestions on the above?

Would moving the data.cluster_name and data.namespace fields one level up (so they are no longer nested under “data”) have a significant impact on performance?

Should these 5000 roles be as stripped-down as possible? For example, should we remove tenant_permissions from them and grant it once through a separate, global role?

If you have any idea how to scale this better, I would be more than happy :)
Many thanks in advance for your time!

Hi @spapadop,

Have you looked at DLS evaluation modes to optimise the behaviour?
Please see here: Document-level security - OpenSearch documentation

best,
Mantas

Thanks for the suggestion; indeed, we haven’t tried that.
Given that we use a term-level query for DLS, I guess we can only set it to filter-level.

The default adaptive value already seemed optimal to me, but why not: I’ll give it a try to see if “filter-level” helps.

Hi @Mantas,

As per the documentation, I tried to set the DLS mode to “filter-level” in opensearch.yml:

plugins.security.dls.mode: filter-level

However, it seems like the setting is not recognized:

[2023-12-04T13:39:14,659][ERROR][o.o.b.OpenSearchUncaughtExceptionHandler] [node1] uncaught exception in thread [main]
org.opensearch.bootstrap.StartupException: java.lang.IllegalArgumentException: unknown setting [plugins.security.dls.mode] did you mean any of [plugins.security.disabled, plugins.security.audit.type, plugins.security.ssl_only, plugins.security.cert.oid]?
        at org.opensearch.bootstrap.OpenSearch.init(OpenSearch.java:184) ~[opensearch-2.11.1.jar:2.11.1]
        at org.opensearch.bootstrap.OpenSearch.execute(OpenSearch.java:171) ~[opensearch-2.11.1.jar:2.11.1]
        at org.opensearch.cli.EnvironmentAwareCommand.execute(EnvironmentAwareCommand.java:104) ~[opensearch-2.11.1.jar:2.11.1]
        at org.opensearch.cli.Command.mainWithoutErrorHandling(Command.java:138) ~[opensearch-cli-2.11.1.jar:2.11.1]
        at org.opensearch.cli.Command.main(Command.java:101) ~[opensearch-cli-2.11.1.jar:2.11.1]
        at org.opensearch.bootstrap.OpenSearch.main(OpenSearch.java:137) ~[opensearch-2.11.1.jar:2.11.1]
        at org.opensearch.bootstrap.OpenSearch.main(OpenSearch.java:103) ~[opensearch-2.11.1.jar:2.11.1]

I have tested it on OpenSearch v2.0.0 and as far back as OpenSearch v1.3.0, and it looks like this bug was already present there. While doing my research on GitHub, I noticed you have already filed an issue: [BUG] DLS evaluation mode cannot be adapted on opensearch.yml · Issue #3794 · opensearch-project/security · GitHub

Please keep me updated on any progress.

Thanks,
Mantas

Thanks @Mantas.
Also, regarding the original subject of the thread, I guess it is affected by the general performance regression as described here:

I’ll wait for the next release to benchmark DLS again for my use case.