Document Level Security at scale

spapadop · November 27, 2023, 10:14am

Versions
OpenSearch v2.11.0

Describe the issue:
I have 1000 indices spanning the last 40 days, storing 10 TB (including 1 replica) under the patterns logs-webeos-*, logs-paas-* and logs-app-*. They contain documents like the one below; most importantly, each document specifies data.cluster_name and data.namespace.

"data": {
  "server_name": "my_server:8080",
  "srvconn": "0",
  "time_backend_response": "0",
  "actconn": "1",
  "time_queue": "0",
  "pid": "1175",
  "program": "haproxy",
  "http_verb": "GET",
  "client_port": "48760",
  "syslog_timestamp": "Nov 27 00:00:37",
  "backend_name": "name:health-checks",
  "beconn": "0",
  "client_ip": "my_client_ip",
  "captured_response_cookie": "-",
  "haproxy_log_type": "HTTP_logs",
  "cluster_name": "test_cluster",
  "http_status_code": "200",
  "captured_request_cookie": "-",
  "termination_state": "--NI",
  "feconn": "1",
  "srv_queue": "0",
  "syslog_server": "my_syslog_server",
  "http_version": "1.1",
  "bytes_read": "821",
  "captured_request_headers": "my_captured_request_headers",
  "retries": "0",
  "backend_queue": "0",
  "time_request": "0",
  "accept_date": "27/Nov/2023:00:00:37.370",
  "namespace": "ingress-health-checks",
  "frontend_name": "public",
  "time_duration": "0",
  "http_request": "/",
  "time_backend_connect": "0"
},
"metadata": {
  "partition": "24",
  "type_prefix": "logs",
  "kafka_timestamp": 1701043240883,
  "host": "11.86.8.9",
  "json": "true",
  "producer": "openshift",
  "topic": "openshift_logs",
  "_id": "15454bf1-d105-c8d3-9912-36a9556123d5",
  "type": "test",
  "timestamp": 1701043237000
}

I then want to set up document-level security (DLS) so that people see only their own documents and not the rest. This is the DLS query that achieves that:

{
  "bool": {
    "must": [
      {
        "term": {
          "data.cluster_name": "$CLUSTER_NAME"
        }
      },
      {
        "term": {
          "data.namespace": "$NAMESPACE"
        }
      }
    ]
  } 
}

And this is an example role; let’s call it my_role1:

{
  "cluster_permissions": [],
  "index_permissions": [
    {
      "index_patterns": [
        "logs-webeos-*", "logs-paas-*", "logs-app-*"
      ],
      "dls": $DLS_AS_DEFINED_ABOVE,
      "allowed_actions": ["read"],
      "fls": [],
      "masked_fields": []
    }
  ],
  "tenant_permissions": [
    {
      "tenant_patterns": ["global_tenant"],
      "allowed_actions": ["kibana_all_read"]
    }
  ]
}
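As a sketch of how such per-project roles could be generated (Python standard library only): in the security plugin's roles REST API the `dls` field holds the query as a JSON *string*, so the query is serialized with `json.dumps` instead of being spliced in by hand. The helper names `build_dls_query` and `build_role` are hypothetical.

```python
import json


def build_dls_query(cluster_name: str, namespace: str) -> str:
    """Return the DLS query as the JSON string that a role's "dls" field holds."""
    query = {
        "bool": {
            "must": [
                {"term": {"data.cluster_name": cluster_name}},
                {"term": {"data.namespace": namespace}},
            ]
        }
    }
    # json.dumps escapes quotes and backslashes, so unusual names cannot break the query.
    return json.dumps(query)


def build_role(cluster_name: str, namespace: str) -> dict:
    """Assemble a role body shaped like my_role1 for one cluster/namespace pair."""
    return {
        "cluster_permissions": [],
        "index_permissions": [
            {
                "index_patterns": ["logs-webeos-*", "logs-paas-*", "logs-app-*"],
                "dls": build_dls_query(cluster_name, namespace),
                "allowed_actions": ["read"],
                "fls": [],
                "masked_fields": [],
            }
        ],
        "tenant_permissions": [
            {"tenant_patterns": ["global_tenant"], "allowed_actions": ["kibana_all_read"]}
        ],
    }


if __name__ == "__main__":
    print(json.dumps(build_role("test_cluster", "ingress-health-checks"), indent=2))
```

The resulting dictionary would be the body of a PUT to _plugins/_security/api/roles/<role name>, one call per cluster/namespace pair.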

I have many of these roles, each defining the appropriate $CLUSTER_NAME and $NAMESPACE. Since I have integrated LDAP for authorization, I map each role to the appropriate group of people using backend_roles and/or users, because sometimes only the project owner exists and other times there is a project admin group.

{
  "users": [
    $PROJECT_OWNER
  ],
  "backend_roles": [
    $PROJECT_ADMIN_GROUP
  ],
  "hosts": []
}

So, a PUT _plugins/_security/api/rolesmapping/my_role1 with the above body does the trick.
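Scripting thousands of mappings means issuing that PUT once per role. Below is a minimal sketch, using only Python's standard library, that builds (but does not send) one such request; the base URL, user name and group name are hypothetical placeholders, and authentication and TLS setup are deliberately left out.

```python
import json
import urllib.parse
import urllib.request


def build_rolesmapping_request(base_url: str, role_name: str,
                               users: list[str],
                               backend_roles: list[str]) -> urllib.request.Request:
    """Build the PUT _plugins/_security/api/rolesmapping/<role> request for one role."""
    body = {"users": users, "backend_roles": backend_roles, "hosts": []}
    # Quote the role name so unusual characters cannot alter the URL path.
    path = "/_plugins/_security/api/rolesmapping/" + urllib.parse.quote(role_name, safe="")
    return urllib.request.Request(
        url=base_url.rstrip("/") + path,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )


if __name__ == "__main__":
    # Hypothetical values; in practice they would come from the LDAP project data.
    req = build_rolesmapping_request("https://localhost:9200", "my_role1",
                                     users=["jdoe"], backend_roles=["project-admins"])
    print(req.get_method(), req.full_url)
```

Sending it would then be a matter of urllib.request.urlopen with suitable credentials and an SSL context, which depend on the cluster's setup.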

I have 6 data nodes, 3 master nodes and 3 client nodes supporting this cluster.

Each data node has 31 GB RAM, 16 vCPUs and 2560 GB of local SSD disk space.
Each master node has 4 GB RAM and 2 vCPUs.
Each client node has 16 GB RAM and 8 vCPUs.
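A quick back-of-the-envelope check of the storage figures above, assuming the 10 TB is spread evenly over the six data nodes and that 1 TB = 1000 GB (both assumptions, not measurements):

```python
TOTAL_DATA_GB = 10 * 1000  # 10 TB including 1 replica (assumes 1 TB = 1000 GB)
DATA_NODES = 6
DISK_PER_NODE_GB = 2560
INDICES = 1000
RETENTION_DAYS = 40

per_node_gb = TOTAL_DATA_GB / DATA_NODES        # average data held per data node
disk_fraction = per_node_gb / DISK_PER_NODE_GB  # share of each node's local SSD in use
indices_per_day = INDICES / RETENTION_DAYS      # indices created per day, on average

print(f"{per_node_gb:.0f} GB per data node, {disk_fraction:.0%} of disk")
print(f"{indices_per_day:.0f} indices per day")
```

This prints roughly 1667 GB per data node (about 65% of local disk) and 25 new indices per day, so disk capacity itself does not look like the bottleneck described below.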

Scaling problem
I want to create a total of 5000 roles and corresponding role mappings to make my DLS scenario work. However, already at around 300 roles and mappings, the data nodes become too busy (heap utilisation goes above 90%) and can no longer cope with the ingestion load.

Do you have any scaling suggestions on the above?

Would moving the data.cluster_name and data.namespace fields up one level (so they are no longer nested under “data”) have a significant impact on performance?
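For illustration, the restructuring in question could look like the sketch below. The helper `promote_fields` is hypothetical and would presumably run in the ingestion path before documents are indexed; after such a change, the DLS term clauses would reference cluster_name and namespace instead of data.cluster_name and data.namespace. It says nothing about the performance effect itself.

```python
import copy


def promote_fields(doc: dict, fields=("cluster_name", "namespace")) -> dict:
    """Return a copy of doc with the given keys moved from doc["data"] to the top level."""
    out = copy.deepcopy(doc)  # leave the caller's document untouched
    data = out.get("data", {})
    for field in fields:
        if field in data:
            out[field] = data.pop(field)
    return out


if __name__ == "__main__":
    sample = {
        "data": {"cluster_name": "test_cluster", "namespace": "ingress-health-checks"},
        "metadata": {"type": "test"},
    }
    print(promote_fields(sample))
```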

Should these 5000 roles be as stripped-down as possible? For example, should we remove tenant_permissions from them and grant it globally through a separate role?
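A sketch of the split being asked about: each per-project role keeps only the DLS-protected read permission, and the tenant permission lives in one hypothetical shared role that would be mapped once (for example to a common LDAP group). Omitting cluster_permissions, fls and masked_fields assumes the roles API treats them as optional, which is worth checking; whether the split reduces heap pressure is untested here.

```python
# Same index patterns as my_role1.
INDEX_PATTERNS = ["logs-webeos-*", "logs-paas-*", "logs-app-*"]

# Hypothetical shared role: the tenant permission is defined once and mapped to
# everyone who needs it, instead of being repeated in every per-project role.
SHARED_TENANT_ROLE = {
    "tenant_permissions": [
        {"tenant_patterns": ["global_tenant"], "allowed_actions": ["kibana_all_read"]}
    ]
}


def build_project_role(dls: str) -> dict:
    """Per-project role carrying only the DLS-protected read permission.

    `dls` is the JSON-serialized DLS query string for one cluster/namespace pair.
    """
    return {
        "index_permissions": [
            {
                "index_patterns": INDEX_PATTERNS,
                "dls": dls,
                "allowed_actions": ["read"],
            }
        ]
    }
```

The `dls` argument would be the same serialized query string used in my_role1.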

If you have any ideas on how to scale this better, I would be very grateful.
Many thanks in advance for your time!

Mantas · November 27, 2023, 10:36am

Hi @spapadop,

Have you looked at DLS evaluation modes to optimise the behaviour?
Please see here: Document-level security - OpenSearch documentation

best,
Mantas

spapadop · November 27, 2023, 11:02am

Thanks for the suggestion; indeed, we haven’t tried that.
Given that we use a term-level query for DLS, I guess we can only set it to filter-level.

The default adaptive mode already seemed optimal to me, but I’ll give “filter-level” a try to see if it helps.

spapadop · December 4, 2023, 12:50pm

Hi @Mantas,

As per the documentation, I tried to set the DLS mode to “filter-level” in opensearch.yml:

plugins.security.dls.mode: filter-level

However, it seems like the setting is not recognized:

[2023-12-04T13:39:14,659][ERROR][o.o.b.OpenSearchUncaughtExceptionHandler] [node1] uncaught exception in thread [main]
org.opensearch.bootstrap.StartupException: java.lang.IllegalArgumentException: unknown setting [plugins.security.dls.mode] did you mean any of [plugins.security.disabled, plugins.security.audit.type, plugins.security.ssl_only, plugins.security.cert.oid]?
        at org.opensearch.bootstrap.OpenSearch.init(OpenSearch.java:184) ~[opensearch-2.11.1.jar:2.11.1]
        at org.opensearch.bootstrap.OpenSearch.execute(OpenSearch.java:171) ~[opensearch-2.11.1.jar:2.11.1]
        at org.opensearch.cli.EnvironmentAwareCommand.execute(EnvironmentAwareCommand.java:104) ~[opensearch-2.11.1.jar:2.11.1]
        at org.opensearch.cli.Command.mainWithoutErrorHandling(Command.java:138) ~[opensearch-cli-2.11.1.jar:2.11.1]
        at org.opensearch.cli.Command.main(Command.java:101) ~[opensearch-cli-2.11.1.jar:2.11.1]
        at org.opensearch.bootstrap.OpenSearch.main(OpenSearch.java:137) ~[opensearch-2.11.1.jar:2.11.1]
        at org.opensearch.bootstrap.OpenSearch.main(OpenSearch.java:103) ~[opensearch-2.11.1.jar:2.11.1]

Mantas · December 5, 2023, 3:40pm

I have tested it on OpenSearch v2.0.0 and as far back as OpenSearch v1.3.0, and it looks like this bug was already present there. While doing my research on GitHub, I noticed you had already filed an issue: [BUG] DLS evaluation mode cannot be adapted on opensearch.yml · Issue #3794 · opensearch-project/security · GitHub

Please keep me updated on any progress.

Thanks,
Mantas

spapadop · December 11, 2023, 2:17pm

Thanks @Mantas.
Also, regarding the original subject of the thread, I guess it is affected by the general performance regression as described here:

[BUG] DLS performance has regressed with new serialization format (github.com/opensearch-project/security, opened by peternied on 30 Nov 2023, 04:22 UTC; label: bug)

**What is the bug?** Some users of OpenSearch have seen a performance decrease associated with DLS queries. Larger DLS queries (more characters) have a larger impact.

**How can one reproduce the bug?**
1. Checkout my fork with the repro `git clone https://github.com/peternied/security.git`
2. Checkout the most recent security plugin version with the new tests `git checkout dls-perf`
3. Execute the new tests `./gradlew integrationTest --tests org.opensearch.security.DlsTests.testDlsLargerQueryScenarios -x jacocoTestReport`
4. Open the test report `./build/reports/tests/integrationTest/classes/org.opensearch.security.DlsTests.html`
5. View the standard output

```
Creating 5 indices with 1 document
User, Count, Avg, Max, Min, Std ms
reader, 100, 11.07, 17, 8, 1.67
Attached READER with role DLS_ONLY_LONG_VALUE
User, Count, Avg, Max, Min, Std ms
reader, 100, 8.45, 17, 6, 1.64
Finished checks in 5282ms
Creating 50 indices with 1 document
User, Count, Avg, Max, Min, Std ms
reader, 100, 13.68, 21, 10, 2.27
Attached READER with role DLS_ONLY_LONG_VALUE
User, Count, Avg, Max, Min, Std ms
reader, 100, 252.04, 538, 228, 31.20
Finished checks in 30905ms
Creating 100 indices with 1 document
User, Count, Avg, Max, Min, Std ms
reader, 100, 12.56, 19, 10, 1.78
Attached READER with role DLS_ONLY_LONG_VALUE
User, Count, Avg, Max, Min, Std ms
reader, 100, 984.57, 1027, 916, 21.51
Finished checks in 107909ms
```

**What is the expected behavior?** After the `DLS_ONLY_LONG_VALUE` is added, the AVG should not jump up so much.

**Do you have any additional context?** You can collect measures from the 2.3 build by running `git checkout dls-perf` from that branch to collect numbers from that version
- https://github.com/opensearch-project/security/compare/main...peternied:security:dls-perf
- https://github.com/opensearch-project/security/compare/2.3...peternied:security:2.3-dls-perf

This issue was introduced in the following PR which did improve serialization in many scenarios, but seems to be impacting DLS queries as stored in headers.
- https://github.com/opensearch-project/security/pull/2802
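To make the trend in the issue's numbers easier to see, here is a small calculation over the reported averages. It only re-reads the published test output; it is not a new benchmark.

```python
# Average latency (ms) copied from the issue's test output:
# (number of indices, avg without the DLS role, avg with DLS_ONLY_LONG_VALUE attached)
RESULTS = [
    (5, 11.07, 8.45),
    (50, 13.68, 252.04),
    (100, 12.56, 984.57),
]


def slowdown(baseline_ms: float, with_dls_ms: float) -> float:
    """Ratio of the average latency with the DLS role to the baseline average."""
    return with_dls_ms / baseline_ms


for indices, baseline, with_dls in RESULTS:
    print(f"{indices:>3} indices: {slowdown(baseline, with_dls):.1f}x")
```

The ratio goes from about 0.8x at 5 indices to about 18.4x at 50 and 78.4x at 100, i.e. it grows with the number of indices. That may help explain why a cluster with 1000 indices is hit hard, although the issue's setup (one document per index, a single long DLS value) differs from the one in this thread.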

I’ll await the next release to benchmark DLS again for my use-case.
