Security analytics detector matches _ws_ on Text fields but fails on Keywords

I noticed that the detector was triggering many false positives, and that this started after I changed the index’s text fields to keyword. Upon investigation, I found that the rules were replacing whitespace with the “_ws_” escape sequence. To test this, I created two indexes, each with just one attribute: in one index the datatype is keyword, and in the other it is text. I also created a test rule.

Here’s an example of the detection logic in the rule:

detection:
  condition: Selection_1
  Selection_1:
    companyName|all:
      - microsoft corp

Here’s the security analytics generated detection query:

"query": "companyName: \"microsoft_ws_corp\""

Since a keyword field is not analyzed, my understanding is that the keyword detector wasn’t triggered because “_ws_” isn’t present in the ingested document.

{
  "log.attributes.companyName": "microsoft corp"
}

But my text detector worked. I think that’s because text fields have analyzers.

To test my theory about whitespace, I ingested the following document into the keyword index, and a finding was generated.

{
  "log.attributes.companyName": "microsoft_ws_corp"
}

I also ran the exact query string myself, but no documents were returned from either index; even the document present in the finding wasn’t returned. Maybe the way detectors query the indices is different from what I thought. Anyway, that’s a topic for another day.

Shouldn’t OpenSearch handle the difference between text and keyword fields in Security Analytics? I thought the escape sequences were kept in place so that they could be handled per field type. I also found this exact issue raised on GitHub back in May 2024: SIGMA rule translation -> lucene query replaces spaces " " with "_ws_" which lucene doesnt understand. · Issue #1024 · opensearch-project/security-analytics · GitHub. Someone tried fixing the issue, but they ultimately gave up.

What can be done to make the detector work properly for keyword fields? I know reverting to text fields is an option, but do I have any other options? I explored custom analyzers, but my application does a lot of querying against these indexes, so I fear all of that would be affected. Is there any solution to this? And why was the space replaced with “_ws_” in the first place, which ultimately made the detector fail for keyword fields?

@mutant Thank you for the question. One of the workarounds that you can use is a custom analyzer; see the example below:

Create the index with the custom analyzer:

PUT /test-sa-bug
{
  "settings": {
    "analysis": {
      "analyzer": {
        "rule_analyzer": {
          "tokenizer": "keyword",
          "char_filter": ["rule_ws_filter"]
        }
      },
      "char_filter": {
        "rule_ws_filter": {
          "type": "pattern_replace",
          "pattern": "(_ws_)",
          "replacement": " "
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "companyName": {
        "type": "keyword",
        "fields": {
          "text": {
            "type": "text",
            "analyzer": "rule_analyzer"
          }
        }
      }
    }
  }
}

Create a document:

POST /test-sa-bug/_doc
{ "companyName": "microsoft corp" }

Use companyName.text in the query:

GET /test-sa-bug/_search
{
  "query": {
    "query_string": { "query": "companyName.text:\"microsoft corp\"" }
  }
}
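
To check what the analyzer actually produces for the subfield, the _analyze API can be used. This is a minimal sketch, assuming the test-sa-bug index from above already exists with that mapping:

```json
GET /test-sa-bug/_analyze
{
  "field": "companyName.text",
  "text": "microsoft_ws_corp"
}
```

With the mapping above, this should come back as a single token, "microsoft corp": the char filter rewrites "_ws_" to a space, and the keyword tokenizer then emits the whole string as one token. That is why the detector's "_ws_" query can match a document that contains a plain space.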

If you want this to be applied to an existing index, you would need to reindex into a new index that has the custom analyzer.
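
A rough sketch of that reindex step, assuming the destination index test-sa-bug has already been created with the analyzer and mapping shown above. The source index name my-existing-index is a placeholder, not something from this thread:

```json
POST /_reindex
{
  "source": { "index": "my-existing-index" },
  "dest": { "index": "test-sa-bug" }
}
```

The destination should be created before running this, otherwise it would be auto-created with dynamic mappings and without the custom analyzer.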

Hope this helps

Yes, I did try custom analyzers, and they worked with the detectors. My application also uses term and wildcard queries to retrieve data from OpenSearch, and those worked too. But aggregation queries didn’t work; I got an illegal argument exception for them. Is there any way to make aggregation queries work on a text field that uses the keyword tokenizer?

Multi-fields seem to be a solid option too, but that’ll increase our storage costs.

Side note: I also tried using custom normalizers on keyword fields. Unfortunately, detector creation failed via the UI. I then created a detector and added the index via Dev Tools, which made the cluster collapse. I saw lots of alerting exceptions in the Docker logs. I wasn’t able to recover the cluster and had to clear the volume and start from scratch. So I’m ruling out normalizers.

@mutant have you explored using fielddata: true?

You should be able to use the following:

PUT /test-sa
{
  "settings": {
    "analysis": {
      "analyzer": {
        "rule_analyzer": {
          "tokenizer": "keyword",
          "char_filter": ["rule_ws_filter"]
        }
      },
      "char_filter": {
        "rule_ws_filter": {
          "type": "pattern_replace",
          "pattern": "(_ws_)",
          "replacement": " "
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "companyName": {
        "type": "keyword",
        "fields": {
          "text": {
            "type": "text",
            "analyzer": "rule_analyzer",
            "fielddata": true
          }
        }
      }
    }
  }
}

POST /test-sa/_bulk
{ "index": {} }
{ "companyName": "microsoft corp" }
{ "index": {} }
{ "companyName": "microsoft corp" }
{ "index": {} }
{ "companyName": "apple inc" }

GET /test-sa/_search
{
  "size": 0,
  "aggs": {
    "by_company": {
      "terms": { "field": "companyName.text" }
    }
  }
}

This will, however, increase your heap usage: I think fielddata is loaded into the JVM heap on the first aggregation and stays cached there, rather than being read from disk.
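
Since companyName itself is still mapped as keyword in this example, a terms aggregation could also target the parent field directly instead of the .text subfield, which avoids fielddata altogether. A sketch, assuming the multi-field mapping and the test-sa index above:

```json
GET /test-sa/_search
{
  "size": 0,
  "aggs": {
    "by_company": {
      "terms": { "field": "companyName" }
    }
  }
}
```

This only helps if you keep the keyword parent alongside the text subfield, so the extra storage of the multi-field still applies.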

fielddata isn’t a risk that I want to take, haha.

Anyway, can we expect a fix so that detectors work with keyword fields? And is a fix expected for keyword normalizers to work with detectors?
