Versions
OpenSearch v2.11.0
Describe the issue:
I have 1000 indices spanning the last 40 days, storing 10 TB (including 1 replica) under the patterns `logs-webeos-*`, `logs-paas-*`, and `logs-app-*`, containing documents like the one below, most importantly specifying `data.cluster_name` and `data.namespace`:
"data": {
"server_name": "my_server:8080",
"srvconn": "0",
"time_backend_response": "0",
"actconn": "1",
"time_queue": "0",
"pid": "1175",
"program": "haproxy",
"http_verb": "GET",
"client_port": "48760",
"syslog_timestamp": "Nov 27 00:00:37",
"backend_name": "name:health-checks",
"beconn": "0",
"client_ip": "my_client_ip",
"captured_response_cookie": "-",
"haproxy_log_type": "HTTP_logs",
"cluster_name": "test_cluster",
"http_status_code": "200",
"captured_request_cookie": "-",
"termination_state": "--NI",
"feconn": "1",
"srv_queue": "0",
"syslog_server": "my_syslog_server",
"http_version": "1.1",
"bytes_read": "821",
"captured_request_headers": "my_captured_request_headers",
"retries": "0",
"backend_queue": "0",
"time_request": "0",
"accept_date": "27/Nov/2023:00:00:37.370",
"namespace": "ingress-health-checks",
"frontend_name": "public",
"time_duration": "0",
"http_request": "/",
"time_backend_connect": "0"
},
"metadata": {
"partition": "24",
"type_prefix": "logs",
"kafka_timestamp": 1701043240883,
"host": "11.86.8.9",
"json": "true",
"producer": "openshift",
"topic": "openshift_logs",
"_id": "15454bf1-d105-c8d3-9912-36a9556123d5",
"type": "test",
"timestamp": 1701043237000
}
Then, I want to set up Document Level Security (DLS), so that people only see their own documents and not the rest. This is the DLS query to achieve that:
```json
{
  "bool": {
    "must": [
      {
        "term": {
          "data.cluster_name": "$CLUSTER_NAME"
        }
      },
      {
        "term": {
          "data.namespace": "$NAMESPACE"
        }
      }
    ]
  }
}
```
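For example, substituting the values from the sample document above, an instantiated query would look like:

```json
{
  "bool": {
    "must": [
      { "term": { "data.cluster_name": "test_cluster" } },
      { "term": { "data.namespace": "ingress-health-checks" } }
    ]
  }
}
```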
And this is an example role; let's name it `my_role1`:
"cluster_permissions": [],
"index_permissions": [
{
"index_patterns": [
"logs-webeos-*", "logs-paas-*", "logs-app-*"
],
"dls": $DLS_AS_DEFINED_ABOVE,
"allowed_actions": ["read"],
"fls": [],
"masked_fields": []
}
],
"tenant_permissions": [
{
"tenant_patterns": ["global_tenant"],
"allowed_actions": ["kibana_all_read"]
}
]
I have many of these roles, each one defining the appropriate `$CLUSTER_NAME` and `$NAMESPACE`. Since I have integrated LDAP for authorization, I map each role to the appropriate group of people using the backend role and/or the user, as sometimes only the project owner exists and at other times it is the project admin group.
```json
{
  "users": [
    $PROJECT_OWNER
  ],
  "backend_roles": [
    $PROJECT_ADMIN_GROUP
  ],
  "hosts": []
}
```
So, a `PUT _plugins/_security/api/rolesmapping/my_role1` with the above body does the trick.
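For reference, this is roughly how I template the role bodies before PUTting them; a simplified sketch, where the `build_role` helper name is just for illustration (the actual upload via the security REST API is omitted):

```python
import json

def build_role(cluster_name: str, namespace: str) -> dict:
    """Build one role body for a (cluster, namespace) pair.

    Note: the security API expects the DLS query as a JSON *string*
    inside the role body, hence the json.dumps() below.
    """
    dls = {
        "bool": {
            "must": [
                {"term": {"data.cluster_name": cluster_name}},
                {"term": {"data.namespace": namespace}},
            ]
        }
    }
    return {
        "cluster_permissions": [],
        "index_permissions": [
            {
                "index_patterns": ["logs-webeos-*", "logs-paas-*", "logs-app-*"],
                "dls": json.dumps(dls),
                "allowed_actions": ["read"],
                "fls": [],
                "masked_fields": [],
            }
        ],
        "tenant_permissions": [
            {
                "tenant_patterns": ["global_tenant"],
                "allowed_actions": ["kibana_all_read"],
            }
        ],
    }

# Example: the role matching the sample document above.
role = build_role("test_cluster", "ingress-health-checks")
print(json.dumps(role, indent=2))
```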
I have 6 data nodes, 3 master nodes, and 3 client nodes supporting this cluster.
- Each data node has 31 GB RAM, 2560 GB of local SSD disk space, and 16 vCPUs.
- Each master node has 4 GB RAM and 2 vCPUs.
- Each client node has 16 GB RAM and 8 vCPUs.
Scaling problem
I want to create a total of 5000 roles and respective role mappings to make my DLS scenario work. However, already at the scale of 300 roles and mappings, the data nodes become too busy (going over 90% heap utilisation) and can no longer cope with the ingestion load.
Do you have any scaling suggestions on the above?
Would moving the `data.cluster_name` and `data.namespace` fields one level up (i.e. not nested under `data`) have a significant impact on performance?
Should these 5000 roles be as stripped-down as possible? For example, maybe we should remove `tenant_permissions` from them and apply it globally via a separate role?
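Concretely, each per-project role would then shrink to just the DLS-carrying index permission, with the tenant permission factored out into a single shared role mapped to everyone (a sketch of the idea, not something I have benchmarked):

```json
{
  "cluster_permissions": [],
  "index_permissions": [
    {
      "index_patterns": ["logs-webeos-*", "logs-paas-*", "logs-app-*"],
      "dls": $DLS_AS_DEFINED_ABOVE,
      "allowed_actions": ["read"]
    }
  ]
}
```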
If you have any ideas on how to scale this better, I would be more than happy to hear them.
Many thanks in advance for your time!