Versions (relevant - OpenSearch/Dashboard/Server OS/Browser):
2.3.0
Describe the issue :
Looking for recommendations for a Kubernetes logging operator to collect logs from all pods and miscellaneous k8s host logs. The one we use now runs into performance issues where log collection stops.
Configuration :
Current setup: fluentbit → fluentd → kafka ↔ logstash → OpenSearch.
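For reference, the kafka ↔ logstash leg of that pipeline is essentially a kafka input feeding the logstash-output-opensearch plugin. A minimal sketch follows; the broker, topic, index, and credentials are placeholders, not our actual values:
```
input {
  kafka {
    bootstrap_servers => "kafka:9092"   # placeholder broker address
    topics            => ["k8s-logs"]   # placeholder topic name
    codec             => json
  }
}

output {
  opensearch {
    hosts    => ["https://opensearch:9200"]   # placeholder OpenSearch endpoint
    index    => "k8s-logs-%{+YYYY.MM.dd}"
    user     => "admin"                       # placeholder credentials
    password => "admin"
  }
}
```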
Relevant Logs or Screenshots :
Hi @tru64jurus ,
The setup you have looks pretty good; it may just need some tuning on the k8s and fluentbit side.
I would like to know more about the performance issues you have.
Thank you,
Nicolae Vartolomei
Elasticsearch/OpenSearch & Solr Consulting, Production Support & Training | Sematext Cloud - Full Stack Observability
@Nicolaegis Here are examples of reported issues where fluentbit does not recover after the forward upstream (fluentd) restarts.
opened 08:09AM - 31 Mar 21 UTC · closed 02:21AM - 26 May 21 UTC · Stale
## Bug Report
**Describe the bug**
Using Fluentbit forwarding into a Fluentd upstream works fine, but when I restart the upstream Fluentd I start getting the following errors, which are expected:
```
[2021/03/31 07:48:25] [ warn] [engine] chunk '7-1617176895.69632323.flb' cannot be retried: task_id=7, input=forward.0 > output=forward.0
[2021/03/31 07:48:25] [error] [net] TCP connection failed: fluentd:32233 (Connection refused)
[2021/03/31 07:48:25] [error] [net] cannot connect to fluentd:32233
[2021/03/31 07:48:25] [error] [output:forward:forward.0] no upstream connections available
[2021/03/31 07:48:25] [ warn] [engine] chunk '7-1617176882.905759681.flb' cannot be retried: task_id=18, input=systemd.1 > output=forward.0
[2021/03/31 07:48:25] [error] [net] TCP connection failed: fluentd:32233 (Connection refused)
[2021/03/31 07:48:25] [error] [net] cannot connect to fluentd:32233
[2021/03/31 07:48:25] [error] [output:forward:forward.0] no upstream connections available
[2021/03/31 07:48:25] [ warn] [engine] chunk '7-1617176889.866633403.flb' cannot be retried: task_id=47, input=emitter_for_rewrite_tag.4 > output=forward.0
[2021/03/31 07:48:25] [error] [net] TCP connection failed: fluentd:32233 (Connection refused)
[2021/03/31 07:48:25] [error] [net] cannot connect to fluentd:32233
[2021/03/31 07:48:25] [error] [output:forward:forward.0] no upstream connections available
[2021/03/31 07:48:25] [ warn] [engine] chunk '7-1617176892.116668507.flb' cannot be retried: task_id=15, input=forward.0 > output=forward.0
```
and then, when the upstream comes back up, it keeps spamming the console forever with connection timed out errors:
```
[2021/03/30 21:54:41] [error] [upstream] connection #172 to fluentd:32233 timed out after 10 seconds
[2021/03/30 21:54:41] [error] [upstream] connection #174 to fluentd:32233 timed out after 10 seconds
[2021/03/30 21:54:41] [error] [upstream] connection #191 to fluentd:32233 timed out after 10 seconds
[2021/03/30 21:54:41] [error] [upstream] connection #174 to fluentd:32233 timed out after 10 seconds
[2021/03/30 21:54:41] [error] [upstream] connection #176 to fluentd:32233 timed out after 10 seconds
[2021/03/30 21:54:41] [error] [upstream] connection #174 to fluentd:32233 timed out after 10 seconds
[2021/03/30 21:54:41] [error] [upstream] connection #171 to fluentd:32233 timed out after 10 seconds
[2021/03/30 21:54:41] [error] [upstream] connection #175 to fluentd:32233 timed out after 10 seconds
[2021/03/30 21:54:41] [error] [upstream] connection #176 to fluentd:32233 timed out after 10 seconds
[2021/03/30 21:54:41] [error] [upstream] connection #173 to fluentd:32233 timed out after 10 seconds
[2021/03/30 21:54:41] [error] [upstream] connection #178 to fluentd:32233 timed out after 10 seconds
[2021/03/30 21:54:41] [error] [upstream] connection #172 to fluentd:32233 timed out after 10 seconds
```
It basically spams the log with 12 messages every second from every affected Fluentbit (there can be hundreds), and it does not recover even after the upstream Fluentd is up again. Once Fluentbit is restarted, the errors go away. I think this is caused by retries of chunks that it failed to submit, because I cannot see any successful retries in the log.
Moreover, it seems the Prometheus metrics are no longer updated with retry counts: `fluentbit_output_errors_total{name="forward.0"}` is still 0, and the same goes for `fluentbit_output_retries_failed_total{name="forward.0"}` and `fluentbit_output_retries_total{name="forward.0"}`.
I tried downgrading to Fluentbit 1.6.10 and it works fine, so I suspect this is caused by the multi-worker feature in 1.7.x.
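(As an aside, if fluent-bit's built-in HTTP server is enabled with `HTTP_Server On` on its default port 2020, the same counters can be read straight from its metrics endpoint on an affected instance, independent of the Prometheus scrape:)
```
# Read the forward output's error/retry counters directly from fluent-bit
# (assumes HTTP_Server On and the default port 2020 in the [SERVICE] section):
curl -s http://127.0.0.1:2020/api/v1/metrics/prometheus | grep 'forward.0'
```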
**To Reproduce**
- Steps to reproduce the problem:
```
output-forward.conf: |
    [OUTPUT]
        Name           forward
        workers        1
        Match          *
        Self_Hostname  fluentbit
        Host           ${FLUENT_FORWARD_HOST}
        Port           ${FLUENT_FORWARD_PORT}
        tls            On
        tls.verify     On
        tls.ca_file    /secrets/identity/server_ca.crt
        tls.crt_file   /secrets/identity/client.crt
        tls.key_file   /secrets/identity/client.key
```
**Expected behavior**
- Fluentbit should not spam the logs so much when the upstream is down.
- Fluentbit should recover properly once the upstream is back.
- Metrics should be updated properly so that the wrong behavior can be detected in monitoring.
**Your Environment**
* Version used: 1.7.2
* Configuration:
* Environment name and version (e.g. Kubernetes? What version?): Kubernetes
* Server type and version:
* Operating System and version:
* Filters and plugins:
**Additional context**
opened 04:52PM - 18 Jun 20 UTC · closed 05:50PM - 11 Jan 21 UTC · troubleshooting
## Bug Report
First of all, thank you for creating this project and for diligently working to make it better. I am trying to create a log pipeline for our metrics tracking using application logs. I installed fluent-bit on the app server, which tails the logs and forwards them (via upstream) to a fluentd box for aggregation.
**Describe the bug**
When fluent-bit is restarted (or stopped and started), logs are lost. I believe the logs that are in the memory buffer are dropped when the service is stopped; when the service resumes, forwarding continues several documents later. By varying the "Mem_Buf_Limit" setting I was able to increase/decrease the number of logs lost across the restart.
**To Reproduce**
1. Create a list of json documents with a serially incremented variable.
```
{ mssg: 'the value is ', i: 1000 }
{ mssg: 'the value is ', i: 1001 }
{ mssg: 'the value is ', i: 1002 }
{ mssg: 'the value is ', i: 1003 }
{ mssg: 'the value is ', i: 1004 }
{ mssg: 'the value is ', i: 1005 }
{ mssg: 'the value is ', i: 1006 }
{ mssg: 'the value is ', i: 1007 }
{ mssg: 'the value is ', i: 1008 }
```
Note: my script created 10M lines to test various features.
2. Update the config to tail the file.
3. Start the fluent bit service.
4. After some time stop the service. (note the last document received by fluentD)
```
2020-06-18T16:09:30+00:00 abhi_event_manager {"log":"{ mssg: 'the value is ', i: 391336 }","td_host":"poc-abhi"}
```
5. After a few seconds start the service again.
6. Check the documents after the restart and notice that the log output has a gap in the series.
```
2020-06-18T16:10:43+00:00 abhi_event_manager {"log":"{ mssg: 'the value is ', i: 399626 }","td_host":"poc-abhi"}
```
In this case 8290 lines were lost; this number increases as the Mem_Buf_Limit increases.
```
.
.
.
2020-06-18T16:09:30+00:00 abhi_event_manager {"log":"{ mssg: 'the value is ', i: 391336 }","td_host":"poc-abhi"}
2020-06-18T16:10:43+00:00 abhi_event_manager {"log":"{ mssg: 'the value is ', i: 399626 }","td_host":"poc-abhi"}
.
.
```
**Expected behavior**
It would be great if logs were not dropped during a restart (maybe the offset stored in the DB should track the position of the last log that was sent out, not just the last one read into the engine). During a restart, logs in the buffer should be flushed to the filesystem before shutdown, and on startup these files should be processed first, before reading new logs.
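(For anyone hitting the same thing: a minimal sketch of fluent-bit filesystem buffering, where queued chunks are persisted to disk and replayed on startup. The paths and limits here are illustrative, and this is not a verified fix for this report:)
```
[SERVICE]
    storage.path              /var/log/flb-storage/   # on-disk chunk directory (illustrative path)
    storage.sync              normal
    storage.backlog.mem_limit 5M                      # memory allowed for replaying backlog chunks

[INPUT]
    Name          tail
    Path          /var/log/test_logs/test15.log
    DB            /var/log/td.db
    storage.type  filesystem                          # buffer this input's chunks on disk, not in memory
```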
**Your Environment**
* Version used: 1.3.5 and 1.4.6
* Configuration:
(Note: I tried changing storage.type between memory and filesystem, and changing Mem_Buf_Limit, Buffer_Chunk_Size, and Buffer_Max_Size, but nothing helped.)
```
[SERVICE]
    Flush                     1
    Daemon                    Off
    Log_Level                 trace
    Log_File                  /var/log/td_bit.log
    Parsers_File              parsers.conf
    Plugins_File              plugins.conf
    HTTP_Server               On
    HTTP_Listen               0.0.0.0
    HTTP_Port                 2020
    storage.path              /var/log/tdbit_storage/
    storage.sync              full
    storage.checksum          off
    storage.backlog.mem_limit 1M

[INPUT]
    Name              tail
    Tag               abhi_event_manager
    Path              /var/log/test_logs/test15.log
    Db                /var/log/td.db
    Mem_Buf_Limit     1M
    Parser            json
    Buffer_Chunk_Size 1k
    Buffer_Max_Size   1k
    storage.type      memory

[FILTER]
    Name   record_modifier
    Match  *
    Record td_host poc-abhi

[OUTPUT]
    Name          forward
    Match         abhi_event_manager
    Upstream      upstream.conf
    Self_Hostname poc-abhi
    Retry_Limit   False
```
* Environment name and version (e.g. Kubernetes? What version?): Installed via rpm (1.3.5) and via build (1.4.6)
* Server type and version: virtual machine on openstack
* Operating System and version: Centos 7
* Filters and plugins: Filter mentioned above, no plugins.
On the fluentd side, for testing I am just writing the logs to a file (config below):
```
<match abhi_event_manager>
@type file
path /var/log/abhi_evm
</match>
```
The goal is to forward logs with fluent-bit from the application servers to a centralized fluentd, where we aggregate the log events and use them for metrics reporting.
So losing logs leads to inaccurate metrics.
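(For completeness, the receiving end of that pipeline is just fluentd's built-in forward input; a minimal listener sketch, with an illustrative port, looks like this:)
```
<source>
  @type forward   # accepts the forward protocol spoken by fluent-bit's forward/Upstream output
  port 24224      # illustrative port; must match the Upstream definition on the fluent-bit side
  bind 0.0.0.0
</source>
```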
opened 05:49PM - 05 Sep 22 UTC · status: waiting-for-triage
## Bug Report
**Describe the bug**
I am running into issues with k8s fluentbit not recovering after a fluentd restart. I have read all the similar GitHub issues opened for this error; none of the workarounds resolved the issue in my case.
The setup I have is fluentbit (1.9.7) --> fluentd --> kafka (3.2.1); the fluentd service is a ClusterIP.
I would appreciate any pointers to remediate this issue.
Thanks
**To Reproduce**
- Restarted the fluentd statefulset
- Example log message if applicable:
```
Errors I see when fluentbit
[tls] error: unexpected EOF
[output:forward:forward.0] no upstream connections available
Then followed by several logs of
[ info] [task] re-schedule retry=0x7ff34206b938 2040 in the next 56 seconds
```
- Steps to reproduce the problem:
Restart fluentd statefulset.
**Expected behavior**
fluentbit should test connectivity to fluentd and then resume sending logs.
**Your Environment**
* Version used: 1.9.7
* Configuration:
* Environment name and version (e.g. Kubernetes? What version?): 1.22
* Server type and version: ubuntu 20
* Operating System and version:
* Filters and plugins:
**Additional context**
Summary of changes tried
```
fluentbit:
net.keepalive_max_recycle 100 and 200
mem_buf_limit 5MB, 10MB and 20MB
buffer_chunk_size 1M
buffer_max_size 1M and 5MB

fluentd:
Tried fluentd statefulsets with 1 and 2 replicas.
```
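One related knob not in the list above is turning keepalive off entirely, so fluent-bit opens a fresh connection instead of reusing a socket that went stale when the upstream restarted. A sketch, with an illustrative host and port rather than my real values:
```
[OUTPUT]
    Name                forward
    Match               *
    Host                fluentd   # illustrative; in this setup it would be the fluentd ClusterIP service
    Port                24224     # illustrative port
    net.keepalive       off       # do not reuse connections across flushes
    net.connect_timeout 10        # seconds to wait when establishing a new connection
```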
This logging operator will enable you to push logs into OpenSearch the same way your current configuration works (fluentbit → fluentd → kafka ↔ logstash → OpenSearch):
Logging → Flow → Output (OpenSearch)
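Assuming this refers to the Banzai Cloud / kube-logging logging-operator (whose CRDs are named Logging, Flow, and Output), the wiring would look roughly like the sketch below. Resource names, the OpenSearch host, and the match rule are placeholders, and the exact output fields should be checked against the operator version you install:
```
apiVersion: logging.banzaicloud.io/v1beta1
kind: Output
metadata:
  name: opensearch-output            # placeholder name
spec:
  opensearch:
    host: opensearch-cluster-master  # placeholder OpenSearch service
    port: 9200
    scheme: https
    ssl_verify: false
---
apiVersion: logging.banzaicloud.io/v1beta1
kind: Flow
metadata:
  name: all-logs                     # placeholder name
spec:
  match:
    - select: {}                     # select every log in the namespace
  localOutputRefs:
    - opensearch-output
```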
@mikeyGlitz this logging operator has an issue with fluentbit failing to reconnect to fluentd after a fluentd statefulset restart.