Tasks are not completed or canceled

Version in use: OpenSearch 2.3.0

Sometimes, after a network problem affects part of the cluster, many tasks remain listed in "_cat/tasks". They stay hanging until the nodes are restarted.
New data is still being written to the indexes, there is no index lock, and the cluster state is green.

GET _cat/tasks?v&h=action,type,running_time,node,task_id,parent_task_id&s=running_time:desc

If I try to look up the parent_task_id, it doesn't exist:

GET /_tasks/R9o_zlO_TiaKJSu1Xy6ULw:423965717

{
  "completed" : false,
  "task" : {
    "node" : "R9o_zlO_TiaKJSu1Xy6ULw",
    "id" : 423965717,
    "type" : "transport",
    "action" : "indices:data/write/bulk[s]",
    "status" : {
      "phase" : "waiting_on_primary"
    },
    "description" : "requests[3], index[indexname-2023.01.24][36]",
    "start_time_in_millis" : 1674588245439,
    "running_time_in_nanos" : 32715030040165,
    "cancellable" : false,
    "cancelled" : false,
    "parent_task_id" : "Y4s2yt7VQV25TNPXn2UIfQ:6606528859",
    "headers" : { },
    "resource_stats" : {
      "total" : {
        "cpu_time_in_nanos" : 0,
        "memory_in_bytes" : 0
      }
    }
  }
}
GET /_tasks/Y4s2yt7VQV25TNPXn2UIfQ:6606528859

  "error" : {
    "root_cause" : [
      {
        "type" : "resource_not_found_exception",
        "reason" : "task [Y4s2yt7VQV25TNPXn2UIfQ:6606528859] isn't running and hasn't stored its results"
      }
    ],
    "type" : "resource_not_found_exception",
    "reason" : "task [Y4s2yt7VQV25TNPXn2UIfQ:6606528859] isn't running and hasn't stored its results"
  },
  "status" : 404
}

The nodes that have tasks on them are those that disappeared from the cluster during a network problem. And the missing parent task belonged to a node that had no network problems.

Any idea how to get rid of these tasks without restarting nodes?
Maybe there is some mechanism for closing child tasks when the parent no longer exists?

@nateynate @dtaivpp - do either of you know how to assist @mouse on this? A network hiccup causes some odd behavior.

Hi @mouse! I've asked some of the devs to chime in if they can, but I can think of a few ideas just from the screenshot here. This isn't something I'm deeply familiar with.

Ideas first:

Try adding cancellable to the list of headers you request. I think there might be one called timeout as well. There's always the _cancel API operation if you feel comfortable cancelling them. They do all seem to have something in common that matches your description of a network blip.
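For instance, the detailed JSON task list always carries the cancellable flag, so something like this should tell you right away which of them can be cancelled (the actions filter is just a guess at narrowing it to the bulk writes you showed):

# detailed task info includes "cancellable" and "running_time_in_nanos" for each task
GET _tasks?detailed=true&actions=indices:data/write/bulk*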

These all look like index write operations.

If a client was in the middle of a TCP connection to bulk index something and a blip occurred, it's possible some kind of orphaned file handle was left behind (TCP sockets are technically file handles). If the amount of time in the timeout header hasn't been reached, you can cancel all cancellable tasks with:

POST _tasks/_cancel 
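If you'd rather not cancel everything at once, the same endpoint accepts nodes, actions, and parent_task_id filters, so a narrower version could look like this (node ID taken from your example above):

# only cancel bulk-write tasks on the node that held the hanging task
POST _tasks/_cancel?nodes=R9o_zlO_TiaKJSu1Xy6ULw&actions=indices:data/write/bulk*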

Check out the Tasks API reference here: https://opensearch.org/docs/2.3/api-reference/tasks/

Hope that helps!

Nate


@nateynate, yes, you're right, those are write operations. I believe the problem is related to the data node losing network connectivity to the replica (the request to write data to the replica was sent, but the response never came back).

Writes go through Logstash like this:
Server -> Logstash -> OS_Collector -> Datanode (primary shard) -> Datanode (replica shard)
All of these are 5 different servers.

In this case, I do not understand where to add "cancellable". The opensearch output plugin supports a "custom_headers" parameter, but it is not described in the documentation (Ship events to OpenSearch - OpenSearch documentation). Is that where it should go?

I think it would go right there, in the h query parameter, just to check whether any of the tasks can be cancelled. The timeout should tell you how long a task will persist before it times out. If they're past their timeout and still showing in the task list, you may want to file a bug report in our GitHub so someone else can try to replicate it.
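Concretely, I'd first check which columns your version of _cat/tasks actually exposes, then append the ones you want to the request you already have, roughly like this (assuming cancellable and a timeout column show up in the help output):

# list the columns _cat/tasks supports on this version
GET _cat/tasks?help

# then add the extra columns to the h parameter
GET _cat/tasks?v&h=action,type,running_time,node,task_id,parent_task_id,cancellable&s=running_time:desc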

I suggest being as descriptive as possible, especially when considering these network blip kinds of issues. They are by nature unpredictable.

I'll ask again to see if I can get a developer to chime in here. Not my specific area of expertise, that's for sure.

Nate


Just to check if any of the tasks are able to be cancelled.

I tried to cancel the tasks manually, but an error is returned. None of these tasks were cancellable:

...
    "cancellable" : false,
    "cancelled" : false,
...
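
The cancel calls I tried were of this form (task ID taken from the first post), and each one was rejected since the task is not cancellable:

POST _tasks/R9o_zlO_TiaKJSu1Xy6ULw:423965717/_cancel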

The timeout should tell you how long it will persist until it times out

Is this timeout set somewhere in the configuration? Can it be changed?
In the cluster settings (GET _cluster/settings?include_defaults), I did not find any "timeout" longer than an hour, and my tasks had been running for more than 9 hours.
In the output of the "GET /_tasks" command, there is only the start time and the running time:

...
    "start_time_in_millis" : 1674588245439,
    "running_time_in_nanos" : 32715030040165,
...

@nateynate, thanks a lot for your help. A reply from a developer would be great.
But I guess it looks like a bug where something like this happens:
1. A server is assigned a child task to write to a replica shard.
2. The server then drops off the network, and the parent task is closed.
3. The server with the child task rejoins the cluster, but the task can never complete because the parent task no longer exists.
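
A way to check this hypothesis would be to group the running tasks by their parents, or to list the children of the missing parent directly (parent task ID taken from the first post):

# group running tasks under their parent tasks
GET _tasks?group_by=parents

# list only the tasks whose parent is the one that no longer exists
GET _tasks?parent_task_id=Y4s2yt7VQV25TNPXn2UIfQ:6606528859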

This does not always happen; the situation is quite rare. Apparently the best option would be to wait for it to happen again and collect more diagnostic information to open a bug report.

@nateynate, the situation described in the first post has happened again. This time I waited 2 weeks and the tasks are still hanging. I am ready to leave them in this state for some more time for diagnostics.

Can we contact the developers here or do we need to create another bug report?
I am ready to collect the necessary diagnostics, but I do not understand what is required.
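
For now, the state I can easily capture while the tasks are hanging would be something like this (node ID taken from the example below):

# detailed task list for the node that still holds the hanging child tasks
GET _tasks?nodes=G9gVoTdtT8Ons3Fxpisnog&detailed=true

# hot threads on the same node, in case something really is stuck there
GET _nodes/G9gVoTdtT8Ons3Fxpisnog/hot_threads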

Here is an example of the responses for the first line from the screenshot (the rest give a similar result: the task is waiting on the primary, and the parent is missing):

GET /_tasks/G9gVoTdtT8Ons3Fxpisnog:98174152

{
  "completed": false,
  "task": {
    "node": "G9gVoTdtT8Ons3Fxpisnog",
    "id": 98174152,
    "type": "transport",
    "action": "indices:data/write/bulk[s]",
    "status": {
      "phase": "waiting_on_primary"
    },
    "description": "requests[1], index[security-auditlog-2023.07.23][0], refresh[IMMEDIATE]",
    "start_time_in_millis": 1690111112962,
    "running_time_in_nanos": 1344489031813227,
    "cancellable": false,
    "cancelled": false,
    "parent_task_id": "u9p6P23kTLqIaLls5j5GJQ:1554631381",
    "headers": {},
    "resource_stats": {
      "total": {
        "cpu_time_in_nanos": 0,
        "memory_in_bytes": 0
      }
    }
  }
}
GET /_tasks/u9p6P23kTLqIaLls5j5GJQ:1554631381

{
  "error": {
    "root_cause": [
      {
        "type": "resource_not_found_exception",
        "reason": "task [u9p6P23kTLqIaLls5j5GJQ:1554631381] isn't running and hasn't stored its results"
      }
    ],
    "type": "resource_not_found_exception",
    "reason": "task [u9p6P23kTLqIaLls5j5GJQ:1554631381] isn't running and hasn't stored its results"
  },
  "status": 404
}