Kibana Alerting - Randomly Triggering When No Data Returned

Hi there,

We have Kibana running against a 6.2 AWS Elasticsearch Service cluster that we use to store and monitor all of our production logs, with Logstash parsing the logs and posting them to the ES cluster. We recently set up a few alerts using the Alerting feature, and every morning at 9:28 am GMT+1 one of the alerts is triggered unexpectedly, even though manually running the query it uses returns no results.

The alert uses an extraction query that searches all indices matching the wildcard phplogs-*. The extraction query is as follows:

{
"size": 500,
"query": {
    "bool": {
        "must": [
            {
                "match_all": {
                    "boost": 1
                }
            },
            {
                "match_phrase": {
                    "priority": {
                        "query": "CRITICAL",
                        "slop": 0,
                        "boost": 1
                    }
                }
            },
            {
                "range": {
                    "@timestamp": {
                        "from": "now-60s",
                        "to": "now",
                        "include_lower": true,
                        "include_upper": true,
                        "format": "epoch_millis",
                        "boost": 1
                    }
                }
            }
        ],
        "adjust_pure_negative": true,
        "boost": 1
    }
},
"version": true,
"_source": {
    "includes": [],
    "excludes": []
},
"stored_fields": "*",
"docvalue_fields": [
    "@timestamp",
    "git_date"
],
"script_fields": {},
"sort": [
    {
        "@timestamp": {
            "order": "desc",
            "unmapped_type": "boolean"
        }
    }
],
"aggregations": {
    "2": {
        "date_histogram": {
            "field": "@timestamp",
            "time_zone": "Europe/London",
            "interval": "1s",
            "offset": 0,
            "order": {
                "_key": "asc"
            },
            "keyed": false,
            "min_doc_count": 1
        }
    }
},
"highlight": {
    "pre_tags": [
        "@kibana-highlighted-field@"
    ],
    "post_tags": [
        "@/kibana-highlighted-field@"
    ],
    "fragment_size": 2147483647,
    "fields": {
        "*": {}
    }
}

}

All it does is search over the last 60s of logs and check whether there were any occurrences of a log with the priority CRITICAL. The alert is then triggered by the following trigger condition:

ctx.results[0].hits.total > 0
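For context, the trigger script evaluates against the raw search response under ctx.results[0], so the condition is just checking the hits.total field of a response shaped roughly like this (a trimmed sketch of what a 6.x search response looks like, not real output from our cluster):

{
    "took": 3,
    "timed_out": false,
    "hits": {
        "total": 0,
        "max_score": null,
        "hits": []
    },
    "aggregations": {
        "2": {
            "buckets": []
        }
    }
}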

We have had occurrences where the alert has worked as expected and actually caught valid logs with the priority CRITICAL. However, there are times when the alert triggers unexpectedly even though none of the logs match the search criteria. Just wondering if anyone is experiencing similar issues or could shed some light on the problem.

There was also an occurrence of Slack being pinged when no data matched and no alert was triggered. When I looked at the dashboard there was no record of the trigger ever firing during this time.

Thanks!

Hi @dooleyb1,

Could you share some log output from Elasticsearch from when this alert is being triggered?

Also, given that you know a timestamp at which this occurs, could you put that timestamp into the from and to of your range query? I suspect the data may have been there, but by the time you reran the query more than 60 seconds had passed and therefore no data was returned. I could be wrong, though.
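Something roughly like this for the range clause, with the placeholder epoch_millis values replaced by the window around when the notification actually fired (the values below are just placeholders, not your real timestamps):

{
    "range": {
        "@timestamp": {
            "from": "1562916420000",
            "to": "1562916480000",
            "include_lower": true,
            "include_upper": true,
            "format": "epoch_millis",
            "boost": 1
        }
    }
}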

Thanks,
Lucas

Hi @lucaswin-amzn,

I don’t believe this is the case. The alarm was triggered and the alert was sent to Slack at both of the following times:

  • 1:45PM
  • 2:08PM

As you can see from the image below, there were no results returned from the query during this time period:

Also, when you go to look at the monitor within the Alerting plugin there is no record of the alert being triggered within this time period.

There are also no entries within the index pattern .opendistro-* indicating that the monitor went into an error status, etc.

@dooleyb1,
This is an interesting problem. Just to confirm: you are saying no alert was generated, meaning the trigger did not actually evaluate to true (which is intended), but a Slack notification was still sent, with empty contents?

Thanks,
Lucas Winkelmann

Yes, as far as the monitor was concerned no events occurred at either of those times. However, the trigger action to alert Slack of an event was fired twice. This isn’t the first occurrence either; it has happened on multiple occasions.

@dooleyb1, could you please share the Elasticsearch logs from one of the timestamps you mentioned this occurs at (the false positive evaluation)?

We are going to need a bit more information to get to the bottom of this.

I have also created a GitHub issue (More logging · Issue #77 · opendistro-for-elasticsearch/alerting · GitHub) to add more logging during trigger evaluation, specifically to make this easier to debug.

Hi @lucaswin-amzn,

Unfortunately I can’t share the logs, as some of them contain sensitive information. However, we just had another case of the alarm triggering, and this time there was a log entry within the .opendistro-* index pattern. The log was as follows:

  "_index": ".opendistro-alerting-alert-history-2019.07.01-000033",
  "_type": "_doc",
  "_id": "j_w-9msBG_FesKtRm4DM",
  "_score": 1,
  "_routing": "s_fXUGsBN0LxauNhoAZk",
  "_source": {
    "monitor_id": "s_fXUGsBN0LxauNhoAZk",
    "monitor_version": 7,
    "monitor_name": "NGINX-Connection-Timeout",
    "trigger_id": "MBTZUGsBU8IfIsXnFfDg",
    "trigger_name": "Over 10 Hits",
    "state": "COMPLETED",
    "error_message": null,
    "alert_history": [
      {
        "timestamp": 1563204426648,
        "message": "Error fetching inputs:\nElasticsearchTimeoutException[java.util.concurrent.TimeoutException: Timeout waiting for task.]; nested: TimeoutException[Timeout waiting for task.];; java.util.concurrent.TimeoutException: Timeout waiting for task."
      }
    ],
    "severity": "1",
    "start_time": 1563204426497,
    "last_notification_time": 1563204426497,
    "end_time": 1563204457475,
    "acknowledged_time": null
  },
  "fields": {
    "end_time": [
      "2019-07-15T15:27:37.475Z"
    ],
    "last_notification_time": [
      "2019-07-15T15:27:06.497Z"
    ],
    "start_time": [
      "2019-07-15T15:27:06.497Z"
    ]
  }
}

Is there any way to prevent an alert from being triggered when the alarm is in an error state?

Thanks, 
Brandon

Are there any specific steps to reproduce this issue? That would be extremely helpful.

@mihirsoni No, unfortunately there are not. The occurrences are quite sporadic, and it’s unclear why they happen. We had another occurrence today of the same opendistro-alerting-alert-history alert as above, again causing an alarm to be triggered.

Hi @dooleyb1,

Do all of these notifications you are receiving have an associated alert with them? The alert you showed above has an error in its alert history, meaning it would have notified you because of that error. You might be receiving notifications from errors instead. If you still have all the alerts, you could try looking at the alert history on each of them and seeing if they all have errors — a query sketch for that is below.
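Something along these lines against the alert history indices should surface any alerts that recorded an error. This is just a sketch: the index pattern and field names are taken from the document you pasted, and it assumes alert_history.message is indexed as a searchable field:

GET .opendistro-alerting-alert-history-*/_search
{
    "size": 50,
    "query": {
        "exists": { "field": "alert_history.message" }
    },
    "sort": [
        { "start_time": { "order": "desc" } }
    ]
}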

Update on this: as a workaround to prevent these error-driven notifications from firing, I have added && ctx.error == null to the alert trigger conditions.
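Combined with the original condition from earlier in the thread, the full trigger condition now looks like this (the extra clause simply stops the Slack action from firing while the monitor itself is reporting an error):

ctx.results[0].hits.total > 0 && ctx.error == null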