I have an index management policy defined that deletes indexes after 3 days. That policy appears to work in most cases. However, a smattering of indexes (maybe 10%) keep re-appearing. At first, I assumed the index management plugin wasn’t doing its job (sorry @dbbaughe!) but now I suspect the problem might be elsewhere.
Here is a screenshot showing the records in the .opendistro-ism-managed-index-history index for one of these ever re-appearing indexes (I hope it is readable).
Based on what I see, it appears the Index Management plug-in is actually doing its job and deleting the index every 3 days. But then, a few hours later, the index is re-created and the IM plug-in (re-)initializes the policy for it. Is that a correct interpretation of these records? I’m confused by the cases where there are two instances of the “Successfully initialized policy…” message without a “Transitioning to doomed” message between them. (For example: the messages on 4/22 and 4/23.)
If so, any idea what’s going on here?
The processing flow is Fluent Bit is collecting records from across a Kubernetes cluster and sending them to Elasticsearch. I have an ingest pipeline that is redirecting the incoming messages based on the namespace and timestamp.
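To make the routing concrete, here is a rough Python sketch of the kind of index-name derivation such a pipeline performs. The `logs-<namespace>-<date>` naming scheme, the `target_index` helper, and the year in the sample timestamps are assumptions for illustration only, not the actual pipeline. The point is that any record carrying an old timestamp is routed to the old day's index, and Elasticsearch will create that index on the fly if it no longer exists.

```python
from datetime import datetime, timezone


def target_index(namespace: str, timestamp: str) -> str:
    """Build a per-namespace, per-day index name (hypothetical naming scheme)."""
    # Parse the record's own timestamp and normalize it to UTC.
    ts = datetime.fromisoformat(timestamp).astimezone(timezone.utc)
    # The index name depends on the record's timestamp, not on arrival time.
    return f"logs-{namespace}-{ts:%Y.%m.%d}"


# A record from today and a late-arriving record from weeks ago
# (sample values; the year is arbitrary).
fresh = target_index("kube-system", "2020-05-06T10:15:00+00:00")
stale = target_index("kube-system", "2020-04-22T23:59:00+00:00")

print(fresh)
print(stale)
```

With this kind of routing, a single delayed record is enough to bring back an index that the policy already deleted.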
Two possibilities that have crossed my mind:
a) Fluent Bit is still sending messages with “old” time stamps many days after that day passed;
b) ES is not completely deleting the index and later rediscovers some “old” data lying around and recreates the shard.
Both theories seem far-fetched. Any other explanations?
I think I’ve sorted out what’s happening here. My Fluent Bit configuration includes Retry_Limit set to false, which means there is no limit on retries. So, Fluent Bit is still trying to resend “chunks” from weeks ago. After the IM plugin deletes the “old” index, messages for that day are still flowing in and the index is re-created…repeat every 3 days. I’ve set a specific retry limit now to avoid this endless cycle.
Hey @GSmith,
ISM definitely doesn’t have any code paths where it creates indices for the cluster except for its own configuration/audit indices.
Were you able to verify if the docs in the recreated indices were indeed from very old retries? Perhaps there’s some timestamp on the doc that can confirm?
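As a rough sketch of that check, something like the following Python could be run over a sample of documents pulled from a recreated index. The `@timestamp` field name, the `stale_docs` helper, and the sample data are assumptions for illustration; substitute whatever timestamp field your records actually carry.

```python
from datetime import datetime, timedelta, timezone


def stale_docs(docs, indexed_at, max_age=timedelta(days=3), field="@timestamp"):
    """Return docs whose event timestamp is older than max_age at indexing time."""
    cutoff = indexed_at - max_age
    # Timestamps are expected as ISO 8601 strings with a UTC offset.
    return [d for d in docs if datetime.fromisoformat(d[field]) < cutoff]


# Hypothetical documents sampled from an index that re-appeared.
sample = [
    {"@timestamp": "2020-04-22T08:00:00+00:00", "message": "old retry"},
    {"@timestamp": "2020-05-05T23:00:00+00:00", "message": "recent"},
]
seen_at = datetime(2020, 5, 6, tzinfo=timezone.utc)

for doc in stale_docs(sample, seen_at):
    print(doc["@timestamp"], doc["message"])
```

If most documents in a recreated index come back as stale, that points to late retries rather than anything ISM is doing.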
As for the logs that show initialized twice… technically we make the audit logs (docs) best effort to reduce random failures on the overall index lifecycle (i.e. we don’t want to fail a managed index just because it couldn’t write a log). Could you check if you see this failure log in your Elasticsearch logs for the index?
Either “Failed to add history” or “failed to index indexMetaData History.”
If that’s there then we at least know why you have a gap.
@dbbaughe Sorry for the lengthy delay in responding…took a while to get back to this and then wanted to make sure the zombie indexes were really gone.
I had the Fluent Bit pods restarted, which forced them to “forget” about any messages that might need to be resent to Elasticsearch. I then deleted all indexes from April. And, now, after a few more days, I’m happy to report that no April indexes have re-appeared. Based on that, I feel comfortable that my Fluent Bit setting (retry forever) was responsible for the relentless return of “old” indexes.
I did NOT find any instances of either of those two error messages in my Elasticsearch logs. However, those logs are only kept around for a day, so I may have missed those messages. If I notice that pattern of records in the index again, I’ll check for those messages in the ES log.
Thanks for the help.