Versions (relevant - OpenSearch/Dashboard/Server OS/Browser):
v2.7.0
Describe the issue:
We have missing logs due to a Filebeat bug: if the Kubernetes API server is unavailable when Filebeat tries to determine which node it is running on, no std stream logs are shipped. From what I can see, this is fixed in 7.15.0.
opened 05:28PM - 07 Jan 21 UTC
closed 11:50AM - 28 Jul 21 UTC
discuss
Team:Integrations
For confirmed bugs, please report:
- Version: 7.9.1
- Operating System: CentOS 7 (we use a very lightly modified version of the public docker image)
- Discuss Forum URL: https://discuss.elastic.co/t/kubernetes-autodiscover-fails-if-filebeat-cannot-determine-its-node-name/260375
- Steps to Reproduce: Difficult to orchestrate, but seems to happen if the K8S API server is unavailable when filebeat queries its own pod data to determine the node it's running on.
If the API server is unavailable when filebeat tries to determine which node it's running on, filebeat might report an error like the following:
`{"level":"error","timestamp":"2021-01-05T09:10:12.114Z","logger":"autodiscover.pod","caller":"kubernetes/util.go:117","message":"kubernetes: Querying for pod failed with error: Get \"{API_SERVER}/api/v1/namespaces/monitoring/pods/filebeat-ds-j6tfs\": dial tcp: i/o timeout"}`
After this, no std stream logs are shipped.
Based on some admittedly naive code inspection (apologies if this is down the wrong path), it seems something like the following might be happening:
- The query against the API server fails in [DiscoverKubernetesNode](https://github.com/elastic/beats/blob/9c09f0a2a3f5f3a12703eef28c25f1022863746d/libbeat/common/kubernetes/util.go#L117), which returns "localhost" as a default
- The k8s autodiscover provider uses this as a watch filter (e.g., in [NewPodEventer](https://github.com/elastic/beats/blob/9c09f0a2a3f5f3a12703eef28c25f1022863746d/libbeat/autodiscover/providers/kubernetes/pod.go#L71))
- Since there is no "localhost" node, no events are received
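The suspected chain of events in the list above can be sketched in a few lines of Go. This is only an illustration of the hypothesis, not the Beats code: `pod`, `discoverNodeLegacy`, `podsOnNode`, and `failingLookup` are hypothetical names, and the pod list stands in for what a node-filtered watch would return.

```go
package main

import (
	"errors"
	"fmt"
)

// pod is a hypothetical, simplified stand-in for a Kubernetes pod.
type pod struct {
	Name     string
	NodeName string
}

// failingLookup simulates the API server being unreachable.
func failingLookup() (string, error) {
	return "", errors.New("dial tcp: i/o timeout")
}

// discoverNodeLegacy mimics the fallback described in the issue:
// the lookup error is swallowed and "localhost" is returned instead.
func discoverNodeLegacy(lookup func() (string, error)) string {
	node, err := lookup()
	if err != nil {
		return "localhost"
	}
	return node
}

// podsOnNode keeps only the pods scheduled on the given node,
// the way a watch filtered by node name would.
func podsOnNode(pods []pod, node string) []pod {
	var out []pod
	for _, p := range pods {
		if p.NodeName == node {
			out = append(out, p)
		}
	}
	return out
}

func main() {
	pods := []pod{
		{Name: "app-1", NodeName: "worker-a"},
		{Name: "app-2", NodeName: "worker-b"},
	}
	node := discoverNodeLegacy(failingLookup)
	// No pod runs on a node called "localhost", so nothing matches.
	fmt.Printf("node=%q matched=%d\n", node, len(podsOnNode(pods, node)))
}
```

Because the error never leaves the discovery function, the caller cannot tell a real node name from the placeholder, which would explain why the only symptom is the single error line in the log.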
Our autodiscover config is along the lines of:
```
filebeat.autodiscover:
  providers:
    - type: kubernetes
      templates:
        - condition:
            equals:
              kubernetes.container.name: "c1"
          config:
            - type: container
              paths:
                - "/var/lib/docker/containers/${data.kubernetes.container.id}/*.log"
              exclude_lines: ['^\x00']
              close_inactive: 1m
        - <a couple more conditions for other containers plus a general fallback>
```
AFAICT the only indication that Filebeat is in this state is the error trace. For our use case, it would be preferable for Filebeat to either:
- retry the API server query rather than fall back to "localhost"
- shut down on this error so that it can retry on restart
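The two options in the list above can be combined: retry the lookup a few times, and exit with an error only if every attempt fails, so that a restart gets another chance. The following is a minimal sketch of that idea, not Filebeat's actual behaviour. `discoverWithRetry` and `flakyLookup` are hypothetical helpers, and the attempt count and wait time are arbitrary.

```go
package main

import (
	"errors"
	"fmt"
	"os"
	"time"
)

// discoverWithRetry calls lookup up to attempts times, doubling the wait
// after each failure. It returns the last error if no attempt succeeds.
func discoverWithRetry(lookup func() (string, error), attempts int, wait time.Duration) (string, error) {
	var lastErr error
	for i := 0; i < attempts; i++ {
		node, err := lookup()
		if err == nil {
			return node, nil
		}
		lastErr = err
		if i < attempts-1 {
			time.Sleep(wait)
			wait *= 2
		}
	}
	return "", fmt.Errorf("node discovery failed after %d attempts: %w", attempts, lastErr)
}

// flakyLookup returns a lookup that fails the first `failures` calls
// and then reports the given node name.
func flakyLookup(failures int, node string) func() (string, error) {
	calls := 0
	return func() (string, error) {
		calls++
		if calls <= failures {
			return "", errors.New("dial tcp: i/o timeout")
		}
		return node, nil
	}
}

func main() {
	node, err := discoverWithRetry(flakyLookup(2, "worker-a"), 3, 10*time.Millisecond)
	if err != nil {
		// Exit non-zero so the container is restarted and discovery runs again.
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Println("node:", node)
}
```

Either way, the point is that the failure becomes visible to the caller instead of being replaced by a placeholder node name.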
elastic:master
← MichaelKatsoulis:update-DiscoverKubernetesNode-with-error-handling
opened 08:00AM - 19 Jul 21 UTC
## What does this PR do?
This PR enhances the `DiscoverKubernetesNode` function with error handling. If the Kubernetes node is not found, the function returns an error. The `defaultNode` was changed from `localhost` to the value of the environment variable `NODE_NAME`.
The callers of the function handle the error according to their own functionality.
1. Kubernetes autodiscover (`NewNodeEventer`, `NewPodEventer`) will fail if the Kubernetes node is not found, and this will cause metricbeat/filebeat/heartbeat to fail. This follows the discussion in https://github.com/elastic/beats/issues/23400
2. The `add_kubernetes_metadata` processor will fail if the node is not found and log the error message. Beats itself will not fail, but events will not be enriched with k8s metadata.
3. Metricbeat `NewContainerMetadataEnricher` and `NewResourceMetadataEnricher` will fail if the node is not found. Metricbeat itself will not fail, but events from the kubernetes metricbeat module will not be enriched with k8s metadata.
4. The Elastic Agent kubernetes dynamic provider will not start if the node is not found. A debug message is logged and the agent skips this provider.
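As a rough illustration of the behaviour this PR describes (fall back to `NODE_NAME`, and return an error instead of `localhost` when no node can be determined), here is a small Go sketch. It is not the code from the PR: `discoverNodeStrict` and `unreachableAPI` are hypothetical names, and the order of the checks is an assumption.

```go
package main

import (
	"errors"
	"fmt"
	"os"
)

// unreachableAPI simulates a failed pod query against the API server.
func unreachableAPI() (string, error) {
	return "", errors.New("dial tcp: i/o timeout")
}

// discoverNodeStrict tries the lookup first, then the NODE_NAME environment
// variable (read through getenv), and returns an error if both come up empty.
// Hypothetical helper; the check order is an assumption.
func discoverNodeStrict(lookup func() (string, error), getenv func(string) string) (string, error) {
	node, err := lookup()
	if err == nil && node != "" {
		return node, nil
	}
	if env := getenv("NODE_NAME"); env != "" {
		return env, nil
	}
	if err == nil {
		err = errors.New("lookup returned an empty node name")
	}
	return "", fmt.Errorf("kubernetes node not found: %w", err)
}

func main() {
	node, err := discoverNodeStrict(unreachableAPI, os.Getenv)
	if err != nil {
		// A caller such as autodiscover can now decide to fail fast.
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Println("node:", node)
}
```

Passing `getenv` as a parameter is only there to make the sketch easy to test. With the error returned, each caller can choose between failing hard (autodiscover) and logging and carrying on (metadata enrichment), as the numbered list describes.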
## Why is it important?
Currently, when k8s autodiscover fails because the node is not found, filebeat and metricbeat just proceed with normal operation without shipping any k8s logs or metrics. This PR fixes that. [Issue](https://github.com/elastic/beats/issues/23400)
Until there is a way in agent/fleet to properly configure the Kubernetes dynamic provider and set the node name correctly, `DiscoverKubernetesNode` needed an update so that it tries to find the node name from the `NODE_NAME` env var instead of returning `localhost` as the default. This will allow us to add a condition that leverages the kubernetes provider for the controllermanager and scheduler data streams [PR](https://github.com/elastic/integrations/pull/1324).
## Checklist
- [x] My code follows the style guidelines of this project
- [x] I have commented my code, particularly in hard-to-understand areas
- [x] I have added tests that prove my fix is effective or that my feature works
- [x] I have added an entry in `CHANGELOG.next.asciidoc` or `CHANGELOG-developer.next.asciidoc`.
## Related issues
Closes https://github.com/elastic/beats/issues/23400
## How to test this PR locally
This PR adds detailed unit tests covering all cases of the `DiscoverKubernetesNode` function.
Inside the beats repo, execute `go test -v github.com/elastic/beats/v7/libbeat/common/kubernetes/...`
For manual testing with e.g. `metricbeat`:
1. Following the steps in [test metricbeat on k8s](https://github.com/elastic/beats/blob/master/metricbeat/module/kubernetes/_meta/test/docs/README.md), update `metricbeat.yml` in the `metricbeat-daemonset-config` ConfigMap with
```
metricbeat.autodiscover:
  providers:
    - type: kubernetes
      hints.enabled: true
```
In the `DaemonSet` env vars, comment out
```
# - name: NODE_NAME
#   valueFrom:
#     fieldRef:
#       fieldPath: spec.nodeName
```
Compile metricbeat, copy the binary into the pod (the pod sleeps, see instructions), and start metricbeat.
Notice that metricbeat fails.
<img width="1782" alt="beat_failure" src="https://user-images.githubusercontent.com/26270880/127178014-a9d708da-fad5-48f2-bedf-e3daf687c12b.png">
2. Uncomment the `NODE_NAME` environment variable in the DaemonSet and rerun metricbeat. Notice the debug and info logs: the node is retrieved from the env var.
<img width="1572" alt="metricbeat_node_from_env" src="https://user-images.githubusercontent.com/26270880/127177612-91ce4046-ad66-43c9-91ec-6a485656b15e.png">
This is a major issue for us: the fix, which I think is in 7.15.0, is in a version that OpenSearch doesn't support. Can support be added for version 7.15.0?