Version in use: OpenSearch 2.3.0
Sometimes, after a network problem affecting part of the cluster, many tasks remain listed in “_cat/tasks”.
The tasks stay hung until the nodes are restarted.
New data is still written to the indices, no index gets blocked, and the cluster state is green.
GET _cat/tasks?v&h=action,type,running_time,node,task_id,parent_task_id&s=running_time:desc
If I try to look up the parent_task_id of one of these tasks, the parent task doesn’t exist:
GET /_tasks/R9o_zlO_TiaKJSu1Xy6ULw:423965717
{
"completed" : false,
"task" : {
"node" : "R9o_zlO_TiaKJSu1Xy6ULw",
"id" : 423965717,
"type" : "transport",
"action" : "indices:data/write/bulk[s]",
"status" : {
"phase" : "waiting_on_primary"
},
"description" : "requests[3], index[indexname-2023.01.24][36]",
"start_time_in_millis" : 1674588245439,
"running_time_in_nanos" : 32715030040165,
"cancellable" : false,
"cancelled" : false,
"parent_task_id" : "Y4s2yt7VQV25TNPXn2UIfQ:6606528859",
"headers" : { },
"resource_stats" : {
"total" : {
"cpu_time_in_nanos" : 0,
"memory_in_bytes" : 0
}
}
}
}
GET /_tasks/Y4s2yt7VQV25TNPXn2UIfQ:6606528859
{
"error" : {
"root_cause" : [
{
"type" : "resource_not_found_exception",
"reason" : "task [Y4s2yt7VQV25TNPXn2UIfQ:6606528859] isn't running and hasn't stored its results"
}
],
"type" : "resource_not_found_exception",
"reason" : "task [Y4s2yt7VQV25TNPXn2UIfQ:6606528859] isn't running and hasn't stored its results"
},
"status" : 404
}
The nodes holding the hung tasks are the ones that dropped out of the cluster during the network problem. The missing parent task belonged to a node that had no network problems.
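The manual check above (fetch a task, then look up its parent_task_id) could be scripted roughly as follows. This is only a minimal Python sketch, assuming the JSON shape returned by GET _tasks?detailed with the default grouping by node; the function name find_orphaned_tasks and the keys of the returned dictionaries are made up for illustration. It only reports candidate tasks and does not cancel or remove anything.

```python
from typing import Any, Dict, List


def find_orphaned_tasks(tasks_response: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Return child tasks whose parent_task_id is absent from the same listing.

    `tasks_response` is assumed to be the parsed body of GET _tasks?detailed,
    i.e. {"nodes": {<node_id>: {"tasks": {<node_id:task_id>: {...}}}}}.
    """
    # Flatten the per-node task maps into one map keyed by "node_id:task_id".
    all_tasks: Dict[str, Dict[str, Any]] = {}
    for node in tasks_response.get("nodes", {}).values():
        all_tasks.update(node.get("tasks", {}))

    orphans: List[Dict[str, Any]] = []
    for task_id, task in all_tasks.items():
        parent_id = task.get("parent_task_id")
        # A task is a candidate orphan if it names a parent that is not listed.
        if parent_id and parent_id not in all_tasks:
            orphans.append(
                {
                    "task_id": task_id,
                    "parent_task_id": parent_id,
                    "action": task.get("action"),
                    "running_time_in_nanos": task.get("running_time_in_nanos", 0),
                }
            )

    # Longest-running first, mirroring s=running_time:desc in the _cat request.
    orphans.sort(key=lambda t: t["running_time_in_nanos"], reverse=True)
    return orphans


if __name__ == "__main__":
    # Hand-built sample in the assumed response shape, using the IDs from this post.
    sample = {
        "nodes": {
            "R9o_zlO_TiaKJSu1Xy6ULw": {
                "tasks": {
                    "R9o_zlO_TiaKJSu1Xy6ULw:423965717": {
                        "action": "indices:data/write/bulk[s]",
                        "running_time_in_nanos": 32715030040165,
                        "parent_task_id": "Y4s2yt7VQV25TNPXn2UIfQ:6606528859",
                    }
                }
            }
        }
    }
    for orphan in find_orphaned_tasks(sample):
        print(orphan["task_id"], "->", orphan["parent_task_id"], orphan["action"])
```

Because each node reports its tasks at a slightly different moment, a parent that finished while the listing was being collected can make a healthy child look orphaned, so it is worth filtering on a long running time before trusting the result.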
Any idea how to get rid of these tasks without restarting the nodes?
Maybe there is some mechanism for closing child tasks whose parent no longer exists?