Tasks are not completed or canceled

Version in use: OpenSearch 2.3.0

Sometimes, after a network problem affects part of the cluster, many tasks remain listed in "_cat/tasks". They stay hanging until the nodes are restarted.
New data is still being written to the indexes, there is no index lock, and the cluster state is green.

GET _cat/tasks?v&h=action,type,running_time,node,task_id,parent_task_id&s=running_time:desc

If I try to look up the parent_task_id, it doesn't exist:

GET /_tasks/R9o_zlO_TiaKJSu1Xy6ULw:423965717

{
  "completed" : false,
  "task" : {
    "node" : "R9o_zlO_TiaKJSu1Xy6ULw",
    "id" : 423965717,
    "type" : "transport",
    "action" : "indices:data/write/bulk[s]",
    "status" : {
      "phase" : "waiting_on_primary"
    },
    "description" : "requests[3], index[indexname-2023.01.24][36]",
    "start_time_in_millis" : 1674588245439,
    "running_time_in_nanos" : 32715030040165,
    "cancellable" : false,
    "cancelled" : false,
    "parent_task_id" : "Y4s2yt7VQV25TNPXn2UIfQ:6606528859",
    "headers" : { },
    "resource_stats" : {
      "total" : {
        "cpu_time_in_nanos" : 0,
        "memory_in_bytes" : 0
      }
    }
  }
}
GET /_tasks/Y4s2yt7VQV25TNPXn2UIfQ:6606528859

  "error" : {
    "root_cause" : [
      {
        "type" : "resource_not_found_exception",
        "reason" : "task [Y4s2yt7VQV25TNPXn2UIfQ:6606528859] isn't running and hasn't stored its results"
      }
    ],
    "type" : "resource_not_found_exception",
    "reason" : "task [Y4s2yt7VQV25TNPXn2UIfQ:6606528859] isn't running and hasn't stored its results"
  },
  "status" : 404
}

The nodes that have tasks on them are those that disappeared from the cluster during a network problem. And the missing parent task belonged to a node that had no network problems.

Any idea how to get rid of these tasks without restarting nodes?
Maybe there is some mechanism for closing child tasks when the parent no longer exists?

@nateynate @dtaivpp - do either of you know how to assist @mouse on this? A network hiccup causes some odd behavior.

Hi @mouse! I've asked some of the devs to chime in if they can, but I can think of a few ideas just from the screenshot here. This isn't something I'm deeply familiar with.

Ideas first:

Try adding cancellable to the list of headers you request. I think there might be one called timeout as well. There's always the _cancel API operation if you feel comfortable cancelling them. They do all seem to have something in common that matches your description of a network blip.
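For instance, the detailed JSON task list always carries the cancellable flag, so something like this should tell you right away which of them can be cancelled (the actions filter is just a guess at narrowing it to the bulk writes you showed):

# detailed task info includes "cancellable" and "running_time_in_nanos" for each task
GET _tasks?detailed=true&actions=indices:data/write/bulk*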

These all look like index write operations.

If a client was in the middle of a TCP connection to bulk index something and a blip occurred, it's possible some kind of orphaned file handle was left behind (TCP sockets are technically file handles). If the amount of time in the timeout header hasn't been reached, you can cancel all cancellable tasks with:

POST _tasks/_cancel 
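If you'd rather not cancel everything at once, the same endpoint accepts nodes, actions, and parent_task_id filters, so a narrower version could look like this (node ID taken from your example above):

# only cancel bulk-write tasks on the node that held the hanging task
POST _tasks/_cancel?nodes=R9o_zlO_TiaKJSu1Xy6ULw&actions=indices:data/write/bulk*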

Check out the Tasks API reference here: https://opensearch.org/docs/2.3/api-reference/tasks/

Hope that helps!

Nate


@nateynate, yes, you're right, those are write operations. I believe the problem is related to the data node losing network connectivity to the replica (the request to write data to the replica was sent, but the response never came back).

Writes go through Logstash like this:
Server -> Logstash -> OS_Collector -> Datanode (primary shard) -> Datanode (replica shard)
All of these are 5 different servers.

In this case, I do not understand where to add "cancellable". The opensearch output plugin supports a "custom_headers" parameter, but it is not described in the documentation (Ship events to OpenSearch - OpenSearch documentation). Is that where it should go?

I think it would go right there, in the h query parameter, just to check whether any of the tasks can be cancelled. The timeout should tell you how long a task will persist before it times out. If they're past their timeout and still showing in the task list, you may want to file a bug report in our GitHub so someone else can try to replicate it.
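Concretely, I'd first check which columns your version of _cat/tasks actually exposes, then append the ones you want to the request you already have, roughly like this (assuming cancellable and a timeout column show up in the help output):

# list the columns _cat/tasks supports on this version
GET _cat/tasks?help

# then add the extra columns to the h parameter
GET _cat/tasks?v&h=action,type,running_time,node,task_id,parent_task_id,cancellable&s=running_time:desc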

I suggest being as descriptive as possible, especially when considering these network blip kinds of issues. They are by nature unpredictable.

I'll ask again to see if I can get a developer to chime in here. Not my specific area of expertise, that's for sure.

Nate


Just to check if any of the tasks are able to be cancelled.

I tried to cancel the tasks manually, but an error is returned. None of these tasks were cancellable:

...
    "cancellable" : false,
    "cancelled" : false,
...
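
The cancel calls I tried were of this form (task ID taken from the first post), and each one was rejected since the task is not cancellable:

POST _tasks/R9o_zlO_TiaKJSu1Xy6ULw:423965717/_cancel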

The timeout should tell you how long it will persist until it times out

Is this timeout set somewhere in the configuration? Can it be changed?
In the cluster settings (GET _cluster/settings?include_defaults), I did not find any "timeout" longer than an hour, and my tasks had been running for more than 9 hours.
In the output of the "GET /_tasks" command, there is only the start time and the running time:

...
    "start_time_in_millis" : 1674588245439,
    "running_time_in_nanos" : 32715030040165,
...

@nateynate, thanks a lot for your help. A reply from a developer would be great.
But I guess it looks like a bug where something like this happens:
1. A server is assigned a child task to write to a replica shard.
2. The server then drops off the network, and the parent task is closed.
3. The server with the child task rejoins the cluster, but the task can never complete because the parent task no longer exists.
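
A way to check this hypothesis would be to group the running tasks by their parents, or to list the children of the missing parent directly (parent task ID taken from the first post):

# group running tasks under their parent tasks
GET _tasks?group_by=parents

# list only the tasks whose parent is the one that no longer exists
GET _tasks?parent_task_id=Y4s2yt7VQV25TNPXn2UIfQ:6606528859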

This does not always happen; the situation is quite rare. Apparently the best option would be to wait for it to happen again and collect more diagnostic information to open a bug report.

@nateynate, the situation described in the first post has happened again. This time I waited 2 weeks and the tasks are still hanging. I am ready to leave them in this state for some more time for diagnostics.

Can we contact the developers here or do we need to create another bug report?
I am ready to collect the necessary diagnostics, but I do not understand what is required.
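
For now, the state I can easily capture while the tasks are hanging would be something like this (node ID taken from the example below):

# detailed task list for the node that still holds the hanging child tasks
GET _tasks?nodes=G9gVoTdtT8Ons3Fxpisnog&detailed=true

# hot threads on the same node, in case something really is stuck there
GET _nodes/G9gVoTdtT8Ons3Fxpisnog/hot_threads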

Here is an example of the responses for the first line from the screenshot (the rest give a similar result: the task is waiting on the primary, and the parent is missing):

GET /_tasks/G9gVoTdtT8Ons3Fxpisnog:98174152

{
  "completed": false,
  "task": {
    "node": "G9gVoTdtT8Ons3Fxpisnog",
    "id": 98174152,
    "type": "transport",
    "action": "indices:data/write/bulk[s]",
    "status": {
      "phase": "waiting_on_primary"
    },
    "description": "requests[1], index[security-auditlog-2023.07.23][0], refresh[IMMEDIATE]",
    "start_time_in_millis": 1690111112962,
    "running_time_in_nanos": 1344489031813227,
    "cancellable": false,
    "cancelled": false,
    "parent_task_id": "u9p6P23kTLqIaLls5j5GJQ:1554631381",
    "headers": {},
    "resource_stats": {
      "total": {
        "cpu_time_in_nanos": 0,
        "memory_in_bytes": 0
      }
    }
  }
}
GET /_tasks/u9p6P23kTLqIaLls5j5GJQ:1554631381

{
  "error": {
    "root_cause": [
      {
        "type": "resource_not_found_exception",
        "reason": "task [u9p6P23kTLqIaLls5j5GJQ:1554631381] isn't running and hasn't stored its results"
      }
    ],
    "type": "resource_not_found_exception",
    "reason": "task [u9p6P23kTLqIaLls5j5GJQ:1554631381] isn't running and hasn't stored its results"
  },
  "status": 404
}