Versions (relevant - OpenSearch/Dashboard/Server OS/Browser):
2.3.0
Describe the issue :
Looking for recommendations for a Kubernetes logging operator to collect logs from all pods and miscellaneous k8s host logs. The one we use now runs into performance issues where log collection stops.
Configuration :
Current setup: fluentbit → fluentd → kafka ↔ logstash → OpenSearch.
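For reference, the kafka ↔ logstash leg of that pipeline is essentially a kafka input feeding the logstash-output-opensearch plugin. A minimal sketch follows; the broker, topic, index, and credentials are placeholders, not our actual values:
```
input {
  kafka {
    bootstrap_servers => "kafka:9092"   # placeholder broker address
    topics            => ["k8s-logs"]   # placeholder topic name
    codec             => json
  }
}

output {
  opensearch {
    hosts    => ["https://opensearch:9200"]   # placeholder OpenSearch endpoint
    index    => "k8s-logs-%{+YYYY.MM.dd}"
    user     => "admin"                       # placeholder credentials
    password => "admin"
  }
}
```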
Relevant Logs or Screenshots :
Hi @tru64jurus ,
The setup you have looks pretty good; it may just need some tuning on the k8s and fluentbit side.
I would like to know more about the performance issues you have.
Thank you,
Nicolae Vartolomei
Elasticsearch/OpenSearch & Solr Consulting, Production Support & Training | Sematext Cloud - Full Stack Observability
@Nicolaegis Here are examples of reported issues where fluentbit does not recover after the forward upstream (fluentd) restarts.
opened 08:09AM - 31 Mar 21 UTC · closed 02:21AM - 26 May 21 UTC · Stale
## Bug Report
**Describe the bug**
Using Fluentbit forwarding into a Fluentd upstream works fine, but when I restart the upstream Fluentd I start getting the following errors, which are expected:
```
[2021/03/31 07:48:25] [ warn] [engine] chunk '7-1617176895.69632323.flb' cannot be retried: task_id=7, input=forward.0 > output=forward.0
[2021/03/31 07:48:25] [error] [net] TCP connection failed: fluentd:32233 (Connection refused)
[2021/03/31 07:48:25] [error] [net] cannot connect to fluentd:32233
[2021/03/31 07:48:25] [error] [output:forward:forward.0] no upstream connections available
[2021/03/31 07:48:25] [ warn] [engine] chunk '7-1617176882.905759681.flb' cannot be retried: task_id=18, input=systemd.1 > output=forward.0
[2021/03/31 07:48:25] [error] [net] TCP connection failed: fluentd:32233 (Connection refused)
[2021/03/31 07:48:25] [error] [net] cannot connect to fluentd:32233
[2021/03/31 07:48:25] [error] [output:forward:forward.0] no upstream connections available
[2021/03/31 07:48:25] [ warn] [engine] chunk '7-1617176889.866633403.flb' cannot be retried: task_id=47, input=emitter_for_rewrite_tag.4 > output=forward.0
[2021/03/31 07:48:25] [error] [net] TCP connection failed: fluentd:32233 (Connection refused)
[2021/03/31 07:48:25] [error] [net] cannot connect to fluentd:32233
[2021/03/31 07:48:25] [error] [output:forward:forward.0] no upstream connections available
[2021/03/31 07:48:25] [ warn] [engine] chunk '7-1617176892.116668507.flb' cannot be retried: task_id=15, input=forward.0 > output=forward.0
```
and then, when the upstream comes back up, it keeps spamming the console forever with connection timed out errors:
```
[2021/03/30 21:54:41] [error] [upstream] connection #172 to fluentd:32233 timed out after 10 seconds
[2021/03/30 21:54:41] [error] [upstream] connection #174 to fluentd:32233 timed out after 10 seconds
[2021/03/30 21:54:41] [error] [upstream] connection #191 to fluentd:32233 timed out after 10 seconds
[2021/03/30 21:54:41] [error] [upstream] connection #174 to fluentd:32233 timed out after 10 seconds
[2021/03/30 21:54:41] [error] [upstream] connection #176 to fluentd:32233 timed out after 10 seconds
[2021/03/30 21:54:41] [error] [upstream] connection #174 to fluentd:32233 timed out after 10 seconds
[2021/03/30 21:54:41] [error] [upstream] connection #171 to fluentd:32233 timed out after 10 seconds
[2021/03/30 21:54:41] [error] [upstream] connection #175 to fluentd:32233 timed out after 10 seconds
[2021/03/30 21:54:41] [error] [upstream] connection #176 to fluentd:32233 timed out after 10 seconds
[2021/03/30 21:54:41] [error] [upstream] connection #173 to fluentd:32233 timed out after 10 seconds
[2021/03/30 21:54:41] [error] [upstream] connection #178 to fluentd:32233 timed out after 10 seconds
[2021/03/30 21:54:41] [error] [upstream] connection #172 to fluentd:32233 timed out after 10 seconds
```
It basically spams the log with 12 messages every second from every affected Fluentbit (there can be hundreds), and it does not recover even after the upstream Fluentd is up again. Once Fluentbit is restarted, the errors go away. I think this is caused by retries of chunks that it failed to submit, because I cannot see any successful retries in the log.
Moreover, it seems the Prometheus metrics are no longer updated with retry counts: `fluentbit_output_errors_total{name="forward.0"}` is still 0, and the same goes for `fluentbit_output_retries_failed_total{name="forward.0"}` and `fluentbit_output_retries_total{name="forward.0"}`.
I tried downgrading to Fluentbit 1.6.10 and it works fine, so I suspect this is caused by the multi-worker feature in 1.7.x.
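(As an aside, if fluent-bit's built-in HTTP server is enabled with `HTTP_Server On` on its default port 2020, the same counters can be read straight from its metrics endpoint on an affected instance, independent of the Prometheus scrape:)
```
# Read the forward output's error/retry counters directly from fluent-bit
# (assumes HTTP_Server On and the default port 2020 in the [SERVICE] section):
curl -s http://127.0.0.1:2020/api/v1/metrics/prometheus | grep 'forward.0'
```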
**To Reproduce**
- Steps to reproduce the problem:
```
output-forward.conf: |
    [OUTPUT]
        Name           forward
        workers        1
        Match          *
        Self_Hostname  fluentbit
        Host           ${FLUENT_FORWARD_HOST}
        Port           ${FLUENT_FORWARD_PORT}
        tls            On
        tls.verify     On
        tls.ca_file    /secrets/identity/server_ca.crt
        tls.crt_file   /secrets/identity/client.crt
        tls.key_file   /secrets/identity/client.key
```
**Expected behavior**
- Fluentbit should not spam the logs so much when the upstream is down.
- Fluentbit should recover properly once the upstream is back.
- Metrics should be updated properly so that the wrong behavior can be detected in monitoring.
**Your Environment**
* Version used: 1.7.2
* Configuration:
* Environment name and version (e.g. Kubernetes? What version?): Kubernetes
* Server type and version:
* Operating System and version:
* Filters and plugins:
**Additional context**
opened 04:52PM - 18 Jun 20 UTC · closed 05:50PM - 11 Jan 21 UTC · troubleshooting
## Bug Report
First of all, thank you for creating this project and for diligently working to make it better. I am trying to create a log pipeline for our metrics tracking using application logs. I installed fluent-bit on the app server, which tails the logs and forwards them (via upstream) to a fluentd box for aggregation.
**Describe the bug**
When fluent-bit is restarted (or stopped and started), logs are lost. I believe the logs that are in the memory buffer are dropped when the service is stopped; when the service resumes, forwarding continues several documents later. By varying the "Mem_Buf_Limit" setting I was able to increase/decrease the number of logs lost across the restart.
**To Reproduce**
1. Create a list of json documents with a serially incremented variable.
```
{ mssg: 'the value is ', i: 1000 }
{ mssg: 'the value is ', i: 1001 }
{ mssg: 'the value is ', i: 1002 }
{ mssg: 'the value is ', i: 1003 }
{ mssg: 'the value is ', i: 1004 }
{ mssg: 'the value is ', i: 1005 }
{ mssg: 'the value is ', i: 1006 }
{ mssg: 'the value is ', i: 1007 }
{ mssg: 'the value is ', i: 1008 }
```
Note: my script created 10M lines to test various features.
2. Update the config to tail the file.
3. Start the fluent bit service.
4. After some time stop the service. (note the last document received by fluentD)
```
2020-06-18T16:09:30+00:00 abhi_event_manager {"log":"{ mssg: 'the value is ', i: 391336 }","td_host":"poc-abhi"}
```
5. After a few seconds start the service again.
6. Check the documents after the restart and notice that the log output has a gap in the series.
```
2020-06-18T16:10:43+00:00 abhi_event_manager {"log":"{ mssg: 'the value is ', i: 399626 }","td_host":"poc-abhi"}
```
In this case 8290 lines were lost; this number increases as the Mem_Buf_Limit increases.
```
.
.
.
2020-06-18T16:09:30+00:00 abhi_event_manager {"log":"{ mssg: 'the value is ', i: 391336 }","td_host":"poc-abhi"}
2020-06-18T16:10:43+00:00 abhi_event_manager {"log":"{ mssg: 'the value is ', i: 399626 }","td_host":"poc-abhi"}
.
.
```
**Expected behavior**
It would be great if logs were not dropped during a restart (maybe the offset stored in the DB should track the position of the last log that was sent out, not just the last one read into the engine). During a restart, logs in the buffer should be flushed to the filesystem before shutdown, and on startup these files should be processed first, before reading new logs.
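(For anyone hitting the same thing: a minimal sketch of fluent-bit filesystem buffering, where queued chunks are persisted to disk and replayed on startup. The paths and limits here are illustrative, and this is not a verified fix for this report:)
```
[SERVICE]
    storage.path              /var/log/flb-storage/   # on-disk chunk directory (illustrative path)
    storage.sync              normal
    storage.backlog.mem_limit 5M                      # memory allowed for replaying backlog chunks

[INPUT]
    Name          tail
    Path          /var/log/test_logs/test15.log
    DB            /var/log/td.db
    storage.type  filesystem                          # buffer this input's chunks on disk, not in memory
```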
**Your Environment**
* Version used: 1.3.5 and 1.4.6
* Configuration:
(Note: I tried changing storage.type between memory and filesystem, and changing Mem_Buf_Limit, Buffer_Chunk_Size, and Buffer_Max_Size, but nothing helped.)
```
[SERVICE]
    Flush                     1
    Daemon                    Off
    Log_Level                 trace
    Log_File                  /var/log/td_bit.log
    Parsers_File              parsers.conf
    Plugins_File              plugins.conf
    HTTP_Server               On
    HTTP_Listen               0.0.0.0
    HTTP_Port                 2020
    storage.path              /var/log/tdbit_storage/
    storage.sync              full
    storage.checksum          off
    storage.backlog.mem_limit 1M

[INPUT]
    Name              tail
    Tag               abhi_event_manager
    Path              /var/log/test_logs/test15.log
    Db                /var/log/td.db
    Mem_Buf_Limit     1M
    Parser            json
    Buffer_Chunk_Size 1k
    Buffer_Max_Size   1k
    storage.type      memory

[FILTER]
    Name   record_modifier
    Match  *
    Record td_host poc-abhi

[OUTPUT]
    Name          forward
    Match         abhi_event_manager
    Upstream      upstream.conf
    Self_Hostname poc-abhi
    Retry_Limit   False
```
* Environment name and version (e.g. Kubernetes? What version?): Installed via rpm (1.3.5) and via build (1.4.6)
* Server type and version: virtual machine on openstack
* Operating System and version: Centos 7
* Filters and plugins: Filter mentioned above, no plugins.
On the fluentd side, for testing I am just writing the logs to a file (config below):
```
<match abhi_event_manager>
@type file
path /var/log/abhi_evm
</match>
```
The goal is to forward logs with fluent-bit from the application servers to a centralized fluentd, where we aggregate the log events and use them for metrics reporting.
So losing logs leads to inaccurate metrics.
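(For completeness, the receiving end of that pipeline is just fluentd's built-in forward input; a minimal listener sketch, with an illustrative port, looks like this:)
```
<source>
  @type forward   # accepts the forward protocol spoken by fluent-bit's forward/Upstream output
  port 24224      # illustrative port; must match the Upstream definition on the fluent-bit side
  bind 0.0.0.0
</source>
```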
opened 05:49PM - 05 Sep 22 UTC · status: waiting-for-triage
## Bug Report
**Describe the bug**
I am running into issues with k8s fluentbit not recovering after a fluentd restart. I have read all the similar GitHub issues opened for this error; none of the workarounds resolved the issue in my case.
The setup I have is fluentbit (1.9.7) --> fluentd --> kafka (3.2.1); the fluentd service is a ClusterIP.
I would appreciate any pointers to remediate this issue.
Thanks
**To Reproduce**
- Restarted the fluentd statefulset
- Example log message if applicable:
```
Errors I see when fluentbit
[tls] error: unexpected EOF
[output:forward:forward.0] no upstream connections available
Then followed by several logs of
[ info] [task] re-schedule retry=0x7ff34206b938 2040 in the next 56 seconds
```
- Steps to reproduce the problem:
Restart fluentd statefulset.
**Expected behavior**
fluentbit should test connectivity to fluentd and then resume sending logs.
**Your Environment**
* Version used: 1.9.7
* Configuration:
* Environment name and version (e.g. Kubernetes? What version?): 1.22
* Server type and version: ubuntu 20
* Operating System and version:
* Filters and plugins:
**Additional context**
Summary of changes tried
```
fluentbit:
net.keepalive_max_recycle 100 and 200
mem_buf_limit 5MB, 10MB and 20MB
buffer_chunk_size 1M
buffer_max_size 1M and 5MB

fluentd:
Tried fluentd statefulsets with 1 and 2 replicas.
```
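One related knob not in the list above is turning keepalive off entirely, so fluent-bit opens a fresh connection instead of reusing a socket that went stale when the upstream restarted. A sketch, with an illustrative host and port rather than my real values:
```
[OUTPUT]
    Name                forward
    Match               *
    Host                fluentd   # illustrative; in this setup it would be the fluentd ClusterIP service
    Port                24224     # illustrative port
    net.keepalive       off       # do not reuse connections across flushes
    net.connect_timeout 10        # seconds to wait when establishing a new connection
```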
This logging operator will enable you to push logs into OpenSearch the same way your current configuration works (fluentbit → fluentd → kafka ↔ logstash → OpenSearch):
Logging → Flow → Output (OpenSearch)
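Assuming this refers to the Banzai Cloud / kube-logging logging-operator (whose CRDs are named Logging, Flow, and Output), the wiring would look roughly like the sketch below. Resource names, the OpenSearch host, and the match rule are placeholders, and the exact output fields should be checked against the operator version you install:
```
apiVersion: logging.banzaicloud.io/v1beta1
kind: Output
metadata:
  name: opensearch-output            # placeholder name
spec:
  opensearch:
    host: opensearch-cluster-master  # placeholder OpenSearch service
    port: 9200
    scheme: https
    ssl_verify: false
---
apiVersion: logging.banzaicloud.io/v1beta1
kind: Flow
metadata:
  name: all-logs                     # placeholder name
spec:
  match:
    - select: {}                     # select every log in the namespace
  localOutputRefs:
    - opensearch-output
```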
@mikeyGlitz this logging operator has an issue with fluentbit failing to reconnect to fluentd after a fluentd statefulset restart.