@dbbaughe We’re seeing ERROR messages in the ODFE ES log like this: Failed to get IndexMetaData from master cluster state for index=kubernetes_cluster-kube-system-2020-06-21
followed by a stack trace. The messages appear every 2 hours which corresponds to how frequently we’ve set the ISM job to run.
The odd thing is this index doesn’t exist. It did exist but was assigned a policy that deleted it after 3 days and that appears to have worked as desired. I’ve dumped the records in the .opendistro-ism-managed-index-history-* indexes and there are only two records: one indicates the ISM policy was successfully initialized and the 2nd indicates it was transitioning to the “doomed” state (which is when it is deleted).
So, while the ISM policy appears to have worked as expected, why is the ISM plug-in attempting to get IndexMetaData for it every two hours? Our deployment creates a number of indexes each day, so it is odd that this behavior only occurs for this particular index. I’ve also checked the other ISM related index (.opendistro-ism-config) and it only contains records for the current set of indexes (i.e. only those that currently exist and are not older than 3 days).
Interesting. That somehow means the job is still running even though you say the index was deleted.
That error message is during the execution of the job when it tries to get the index metadata [1][2].
First, there is nothing to worry about in terms of bad side effects: when the job can’t get the IndexMetaData, it just returns early and does nothing. So the worst thing happening right now is that you have a job running every 2 hours and making a cluster state request.
When the index was deleted, what should have happened is the job was deleted with it and then de-scheduled by Job Scheduler. This can definitely fail for any number of reasons (node crashes, networking issues, etc.), but we do have a background process to resolve those which is explained further down. So one of two things could have happened. Either:
1. The index was deleted by the job, but the cleanup step failed to delete the ManagedIndexConfig job document itself.
2. The deletion of the ManagedIndexConfig job was successful, but Job Scheduler failed to de-schedule the job that doesn’t exist anymore.
Can you check in the .opendistro-ism-config index if there is a managed index job for that index?
Should be able to do something like this from Kibana DevTools
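As a rough sketch only (the query originally posted here is not preserved in this thread), a search for the managed index job could be built like this. The field name `managed_index.index` and the helper `build_managed_index_query` are assumptions for illustration; the script prints a request that can be pasted into Dev Tools.

```python
import json

# Index where ISM keeps its managed index job documents (named in this thread).
ISM_CONFIG_INDEX = ".opendistro-ism-config"


def build_managed_index_query(index_name):
    """Return (path, body) for a search that looks up the job document of one index.

    Assumption: the job document stores the managed index name under
    `managed_index.index`. Adjust the field if your mapping differs.
    """
    path = f"/{ISM_CONFIG_INDEX}/_search"
    body = {"query": {"match": {"managed_index.index": index_name}}}
    return path, body


if __name__ == "__main__":
    path, body = build_managed_index_query(
        "kubernetes_cluster-kube-system-2020-06-21"
    )
    # Print in a form that can be pasted into Kibana Dev Tools.
    print("GET " + path)
    print(json.dumps(body, indent=2))
```

If the search returns no hits, the job document is gone and the problem is more likely on the scheduling side.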
In both cases we do have background processes that sweep jobs/indices/documents to clean up, and they run at relatively short intervals (I believe it’s 10 minutes for ISM and 5 minutes for Job Scheduler). Perhaps there is a bug in one of these. Once we know whether a job document exists, we’ll know where to look.
@dbbaughe Thanks for responding. I’ve submitted your query via Dev Tools and it returns no documents. Just to double-check the syntax, I changed the name of the index to one that should still exist and that returned data, so I believe the syntax is correct. I’ve also confirmed the ERRORs continue to be emitted every two hours (the current value for the opendistro.index_state_management.job_interval).
Based on your response, I guess this suggests (2) in your list of possible explanations. I dumped the records in .opendistro-job-scheduler-lock and it returned 27 documents. That happens to be one more than the number of indexes we are maintaining, so I suspect one of them corresponds to the phantom index from the 21st. Index names don’t appear to be contained in that index, but there is a job_id field that appears to be a UUID that matches (in most cases) the managed_index.index_uuid found in the .opendistro-ism-config index. There is only 1 case where there isn’t a match, so I’m guessing that is a sign of the job that should have been deleted but wasn’t. But, at this point, I’ve run out of indexes to dump, so I don’t know where to look for the actual job definition.
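The comparison described above can be sketched as a set difference. This is only an illustration: the helper `find_orphaned_job_ids` and the sample UUIDs are hypothetical, and it assumes the documents have already been dumped into lists of `_source` dictionaries with the `job_id` and `managed_index.index_uuid` fields mentioned in this post.

```python
def find_orphaned_job_ids(lock_docs, config_docs):
    """Return job_ids from the lock index that match no index_uuid in the ISM config.

    lock_docs:   _source dicts from .opendistro-job-scheduler-lock (each has "job_id")
    config_docs: _source dicts from .opendistro-ism-config
                 (each has "managed_index" -> "index_uuid")
    """
    known_uuids = {doc["managed_index"]["index_uuid"] for doc in config_docs}
    lock_job_ids = {doc["job_id"] for doc in lock_docs}
    return sorted(lock_job_ids - known_uuids)


if __name__ == "__main__":
    # Hypothetical sample data, not taken from a real cluster.
    locks = [{"job_id": "uuid-a"}, {"job_id": "uuid-b"}, {"job_id": "uuid-c"}]
    configs = [
        {"managed_index": {"index_uuid": "uuid-a"}},
        {"managed_index": {"index_uuid": "uuid-b"}},
    ]
    print(find_orphaned_job_ids(locks, configs))
```

Since the match only holds "in most cases", an unmatched job_id is a hint rather than proof of a leftover job.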
By the way, I should have mentioned this sooner, but I’m using ODFE 1.7.0.
To resolve your issue while the bug exists, you’ll need a way to flush the in-memory scheduled jobs. The easiest way is to move the shards of the .opendistro-ism-config index that are on the node with the invalid job (the node that logs the error) to a different node. This causes Job Scheduler to de-schedule all the jobs on the original node, including the invalid one, and to reschedule based on the documents on the new node (which won’t include the non-existent job). You can move the shards back afterwards if you need to; a single move is enough to flush the in-memory jobs. Alternatively, if your cluster’s availability requirements allow it, you can just stop and restart the ES process on that node, which has the same effect.
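One way to move a shard is the Elasticsearch cluster reroute API with a `move` command. The following is a minimal sketch that only builds the request body; the helper `build_move_command`, the shard number, and the node names are placeholders to replace with values from your own cluster (for example, from `GET _cat/shards/.opendistro-ism-config`).

```python
import json


def build_move_command(index, shard, from_node, to_node):
    """Build a body for POST /_cluster/reroute that moves one shard between nodes."""
    return {
        "commands": [
            {
                "move": {
                    "index": index,
                    "shard": shard,
                    "from_node": from_node,
                    "to_node": to_node,
                }
            }
        ]
    }


if __name__ == "__main__":
    # Node names below are hypothetical; use the node that logs the error as from_node.
    body = build_move_command(
        index=".opendistro-ism-config",
        shard=0,
        from_node="node-with-invalid-job",
        to_node="another-data-node",
    )
    print("POST /_cluster/reroute")
    print(json.dumps(body, indent=2))
```

After the move, the ERROR messages should stop on the next job interval if the in-memory job was the cause.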
Thanks again for reporting this, we’ll work on getting a fix out.
Thanks @dbbaughe for getting this sorted out so quickly…and for keeping an eye on the support forums. Until the community can grow/mature to the point of having some “critical mass” of expertise, having the project’s actual developers actively engaged is the only way to ensure everyone can be successful. Thanks again.