Tasks with "source" : "opendistro-im" stuck in queue, one is executing, rest waiting

Hi,

I am very new to opendistro.
We have it deployed in kubernetes (1.13.2), and since a few days it stops processing the logs after a while (it works for a few days after deployment and stops).

I have checked cluster state and indexes in the kibana dev tools and all is reported green.
Also GET _cluster/allocation/explain returns no unallocated stuff.

The problem is - a lot of rejected writes and a lot of tasks in pending tasks (with wait times going into many hours) all with "source" : "opendistro-im"

GET /_cat/thread_pool?v=true&h=id,name,active,rejected,completed:

Yd1JOURVRaGqRY6PuwfoMA ad-batch-task-threadpool                    0        0         0
Yd1JOURVRaGqRY6PuwfoMA ad-threadpool                               0        0         0
Yd1JOURVRaGqRY6PuwfoMA analyze                                     0        0         0
Yd1JOURVRaGqRY6PuwfoMA fetch_shard_started                         0        0         0
Yd1JOURVRaGqRY6PuwfoMA fetch_shard_store                           0        0         0
Yd1JOURVRaGqRY6PuwfoMA flush                                       0        0         0
Yd1JOURVRaGqRY6PuwfoMA force_merge                                 0        0         0
Yd1JOURVRaGqRY6PuwfoMA generic                                     0        0     55735
Yd1JOURVRaGqRY6PuwfoMA get                                         0        0         0
Yd1JOURVRaGqRY6PuwfoMA listener                                    0        0         0
Yd1JOURVRaGqRY6PuwfoMA management                                  1        0     80405
Yd1JOURVRaGqRY6PuwfoMA open_distro_job_scheduler                   0        0         0
Yd1JOURVRaGqRY6PuwfoMA opendistro_asynchronous_search_generic      0        0       618
Yd1JOURVRaGqRY6PuwfoMA refresh                                     0        0         0
Yd1JOURVRaGqRY6PuwfoMA repository_azure                            0        0         0
Yd1JOURVRaGqRY6PuwfoMA search                                      0        0         0
Yd1JOURVRaGqRY6PuwfoMA search_throttled                            0        0         0
Yd1JOURVRaGqRY6PuwfoMA snapshot                                    0        0         0
Yd1JOURVRaGqRY6PuwfoMA sql-worker                                  0        0         0
Yd1JOURVRaGqRY6PuwfoMA system_read                                 0        0         0
Yd1JOURVRaGqRY6PuwfoMA system_write                                0        0         0
Yd1JOURVRaGqRY6PuwfoMA warmer                                      0        0         0
Yd1JOURVRaGqRY6PuwfoMA write                                       0        0         0
L9UXmY5vSku7r9caSeWzBA ad-batch-task-threadpool                    0        0         0
L9UXmY5vSku7r9caSeWzBA ad-threadpool                               0        0         0
L9UXmY5vSku7r9caSeWzBA analyze                                     0        0         0
L9UXmY5vSku7r9caSeWzBA fetch_shard_started                         0        0      1127
L9UXmY5vSku7r9caSeWzBA fetch_shard_store                           0        0      1983
L9UXmY5vSku7r9caSeWzBA flush                                       0        0      3073
L9UXmY5vSku7r9caSeWzBA force_merge                                 0        0         0
L9UXmY5vSku7r9caSeWzBA generic                                     0        0   7033791
L9UXmY5vSku7r9caSeWzBA get                                         0        0      4008
L9UXmY5vSku7r9caSeWzBA listener                                    0        0         0
L9UXmY5vSku7r9caSeWzBA management                                  1        0   5929836
L9UXmY5vSku7r9caSeWzBA open_distro_job_scheduler                   0        0      4614
L9UXmY5vSku7r9caSeWzBA opendistro_asynchronous_search_generic      0        0       822
L9UXmY5vSku7r9caSeWzBA refresh                                     0        0  35359191
L9UXmY5vSku7r9caSeWzBA repository_azure                            0        0         0
L9UXmY5vSku7r9caSeWzBA search                                      0        0     67743
L9UXmY5vSku7r9caSeWzBA search_throttled                            0        0         0
L9UXmY5vSku7r9caSeWzBA snapshot                                    0        0         0
L9UXmY5vSku7r9caSeWzBA sql-worker                                  0        0         0
L9UXmY5vSku7r9caSeWzBA system_read                                 0        0       117
L9UXmY5vSku7r9caSeWzBA system_write                                0        0        95
L9UXmY5vSku7r9caSeWzBA warmer                                      0        0      1704
L9UXmY5vSku7r9caSeWzBA write                                       0        0   2195576
OPUWg9MRQjG-RbYZYqu6qA ad-batch-task-threadpool                    0        0         0
OPUWg9MRQjG-RbYZYqu6qA ad-threadpool                               0        0         0
OPUWg9MRQjG-RbYZYqu6qA analyze                                     0        0         0
OPUWg9MRQjG-RbYZYqu6qA fetch_shard_started                         0        0      1127
OPUWg9MRQjG-RbYZYqu6qA fetch_shard_store                           0        0      2105
OPUWg9MRQjG-RbYZYqu6qA flush                                       0        0      2008
OPUWg9MRQjG-RbYZYqu6qA force_merge                                 0        0         0
OPUWg9MRQjG-RbYZYqu6qA generic                                     0        0   7607326
OPUWg9MRQjG-RbYZYqu6qA get                                         0        0      4128
OPUWg9MRQjG-RbYZYqu6qA listener                                    0        0         0
OPUWg9MRQjG-RbYZYqu6qA management                                  1        0   5368835
OPUWg9MRQjG-RbYZYqu6qA open_distro_job_scheduler                   0        0         0
OPUWg9MRQjG-RbYZYqu6qA opendistro_asynchronous_search_generic      0        0       855
OPUWg9MRQjG-RbYZYqu6qA refresh                                     2        0  32284254
OPUWg9MRQjG-RbYZYqu6qA repository_azure                            0        0         0
OPUWg9MRQjG-RbYZYqu6qA search                                      0        0     54209
OPUWg9MRQjG-RbYZYqu6qA search_throttled                            0        0         0
OPUWg9MRQjG-RbYZYqu6qA snapshot                                    0        0         0
OPUWg9MRQjG-RbYZYqu6qA sql-worker                                  0        0         0
OPUWg9MRQjG-RbYZYqu6qA system_read                                 0        0       747
OPUWg9MRQjG-RbYZYqu6qA system_write                                0        0      4539
OPUWg9MRQjG-RbYZYqu6qA warmer                                      0        0      2680
OPUWg9MRQjG-RbYZYqu6qA write                                       4        0   1583153
DVlyTw4rQg25wfR1Acixkw ad-batch-task-threadpool                    0        0         0
DVlyTw4rQg25wfR1Acixkw ad-threadpool                               0        0         0
DVlyTw4rQg25wfR1Acixkw analyze                                     0        0         0
DVlyTw4rQg25wfR1Acixkw fetch_shard_started                         0        0         0
DVlyTw4rQg25wfR1Acixkw fetch_shard_store                           0        0         0
DVlyTw4rQg25wfR1Acixkw flush                                       0        0         0
DVlyTw4rQg25wfR1Acixkw force_merge                                 0        0         0
DVlyTw4rQg25wfR1Acixkw generic                                     0        0     62326
DVlyTw4rQg25wfR1Acixkw get                                         0        0         0
DVlyTw4rQg25wfR1Acixkw listener                                    0        0         0
DVlyTw4rQg25wfR1Acixkw management                                  1        0     81941
DVlyTw4rQg25wfR1Acixkw open_distro_job_scheduler                   0        0         0
DVlyTw4rQg25wfR1Acixkw opendistro_asynchronous_search_generic      0        0       629
DVlyTw4rQg25wfR1Acixkw refresh                                     0        0         0
DVlyTw4rQg25wfR1Acixkw repository_azure                            0        0         0
DVlyTw4rQg25wfR1Acixkw search                                      0        0      7999
DVlyTw4rQg25wfR1Acixkw search_throttled                            0        0         0
DVlyTw4rQg25wfR1Acixkw snapshot                                    0        0         0
DVlyTw4rQg25wfR1Acixkw sql-worker                                  0        0         0
DVlyTw4rQg25wfR1Acixkw system_read                                 0        0       196
DVlyTw4rQg25wfR1Acixkw system_write                                0        0         0
DVlyTw4rQg25wfR1Acixkw warmer                                      0        0         0
DVlyTw4rQg25wfR1Acixkw write                                       0        0         6
tQBXjgZiQLehN8sptb11zw ad-batch-task-threadpool                    0        0         0
tQBXjgZiQLehN8sptb11zw ad-threadpool                               0        0         0
tQBXjgZiQLehN8sptb11zw analyze                                     0        0         0
tQBXjgZiQLehN8sptb11zw fetch_shard_started                         0        0         0
tQBXjgZiQLehN8sptb11zw fetch_shard_store                           0        0       177
tQBXjgZiQLehN8sptb11zw flush                                       0        0      1948
tQBXjgZiQLehN8sptb11zw force_merge                                 0        0         0
tQBXjgZiQLehN8sptb11zw generic                                     0        0   3004467
tQBXjgZiQLehN8sptb11zw get                                         0        0      3221
tQBXjgZiQLehN8sptb11zw listener                                    0        0         0
tQBXjgZiQLehN8sptb11zw management                                  1        0   5418273
tQBXjgZiQLehN8sptb11zw open_distro_job_scheduler                   0        0      6269
tQBXjgZiQLehN8sptb11zw opendistro_asynchronous_search_generic      0        0       789
tQBXjgZiQLehN8sptb11zw refresh                                     0        0  35025823
tQBXjgZiQLehN8sptb11zw repository_azure                            0        0         0
tQBXjgZiQLehN8sptb11zw search                                      0        0     47624
tQBXjgZiQLehN8sptb11zw search_throttled                            0        0         0
tQBXjgZiQLehN8sptb11zw snapshot                                    0        0         0
tQBXjgZiQLehN8sptb11zw sql-worker                                  0        0         0
tQBXjgZiQLehN8sptb11zw system_read                                 0        0       446
tQBXjgZiQLehN8sptb11zw system_write                                0        0       385
tQBXjgZiQLehN8sptb11zw warmer                                      0        0      3409
tQBXjgZiQLehN8sptb11zw write                                       4     1581   1700624
_0Ow2ObvQDK2L1-P7wplIQ ad-batch-task-threadpool                    0        0         0
_0Ow2ObvQDK2L1-P7wplIQ ad-threadpool                               0        0         0
_0Ow2ObvQDK2L1-P7wplIQ analyze                                     0        0         0
_0Ow2ObvQDK2L1-P7wplIQ fetch_shard_started                         0        0         0
_0Ow2ObvQDK2L1-P7wplIQ fetch_shard_store                           0        0         0
_0Ow2ObvQDK2L1-P7wplIQ flush                                       0        0         0
_0Ow2ObvQDK2L1-P7wplIQ force_merge                                 0        0         0
_0Ow2ObvQDK2L1-P7wplIQ generic                                     0        0     61864
_0Ow2ObvQDK2L1-P7wplIQ get                                         0        0         0
_0Ow2ObvQDK2L1-P7wplIQ listener                                    0        0         0
_0Ow2ObvQDK2L1-P7wplIQ management                                  1        0     81908
_0Ow2ObvQDK2L1-P7wplIQ open_distro_job_scheduler                   0        0         0
_0Ow2ObvQDK2L1-P7wplIQ opendistro_asynchronous_search_generic      0        0       630
_0Ow2ObvQDK2L1-P7wplIQ refresh                                     0        0         0
_0Ow2ObvQDK2L1-P7wplIQ repository_azure                            0        0         0
_0Ow2ObvQDK2L1-P7wplIQ search                                      0        0    137980
_0Ow2ObvQDK2L1-P7wplIQ search_throttled                            0        0         0
_0Ow2ObvQDK2L1-P7wplIQ snapshot                                    0        0         0
_0Ow2ObvQDK2L1-P7wplIQ sql-worker                                  0        0         0
_0Ow2ObvQDK2L1-P7wplIQ system_read                                 0        0       193
_0Ow2ObvQDK2L1-P7wplIQ system_write                                0        0         0
_0Ow2ObvQDK2L1-P7wplIQ warmer                                      0        0         0
_0Ow2ObvQDK2L1-P7wplIQ write                                       0        0        10
AEJZBJVDQXGJf4XBWXlzpw ad-batch-task-threadpool                    0        0         0
AEJZBJVDQXGJf4XBWXlzpw ad-threadpool                               0        0         0
AEJZBJVDQXGJf4XBWXlzpw analyze                                     0        0         0
AEJZBJVDQXGJf4XBWXlzpw fetch_shard_started                         0        0         0
AEJZBJVDQXGJf4XBWXlzpw fetch_shard_store                           0        0         0
AEJZBJVDQXGJf4XBWXlzpw flush                                       0        0         0
AEJZBJVDQXGJf4XBWXlzpw force_merge                                 0        0         0
AEJZBJVDQXGJf4XBWXlzpw generic                                   128        0    324175
AEJZBJVDQXGJf4XBWXlzpw get                                         0        0         0
AEJZBJVDQXGJf4XBWXlzpw listener                                    0        0         0
AEJZBJVDQXGJf4XBWXlzpw management                                  1        0     87532
AEJZBJVDQXGJf4XBWXlzpw open_distro_job_scheduler                   0        0         0
AEJZBJVDQXGJf4XBWXlzpw opendistro_asynchronous_search_generic      0        0      1237
AEJZBJVDQXGJf4XBWXlzpw refresh                                     0        0         0
AEJZBJVDQXGJf4XBWXlzpw repository_azure                            0        0         0
AEJZBJVDQXGJf4XBWXlzpw search                                      0        0       332
AEJZBJVDQXGJf4XBWXlzpw search_throttled                            0        0         0
AEJZBJVDQXGJf4XBWXlzpw snapshot                                    0        0         0
AEJZBJVDQXGJf4XBWXlzpw sql-worker                                  0        0         0
AEJZBJVDQXGJf4XBWXlzpw system_read                                 0        0         0
AEJZBJVDQXGJf4XBWXlzpw system_write                                0        0         0
AEJZBJVDQXGJf4XBWXlzpw warmer                                      0        0         0
AEJZBJVDQXGJf4XBWXlzpw write                                       0        0         0
dMXQM2xgQ3mzK7kuNnarCw ad-batch-task-threadpool                    0        0         0
dMXQM2xgQ3mzK7kuNnarCw ad-threadpool                               0        0         0
dMXQM2xgQ3mzK7kuNnarCw analyze                                     0        0         0
dMXQM2xgQ3mzK7kuNnarCw fetch_shard_started                         0        0         0
dMXQM2xgQ3mzK7kuNnarCw fetch_shard_store                           0        0         0
dMXQM2xgQ3mzK7kuNnarCw flush                                       0        0         0
dMXQM2xgQ3mzK7kuNnarCw force_merge                                 0        0         0
dMXQM2xgQ3mzK7kuNnarCw generic                                     0        0     51323
dMXQM2xgQ3mzK7kuNnarCw get                                         0        0         0
dMXQM2xgQ3mzK7kuNnarCw listener                                    0        0         0
dMXQM2xgQ3mzK7kuNnarCw management                                  1        0     78307
dMXQM2xgQ3mzK7kuNnarCw open_distro_job_scheduler                   0        0         0
dMXQM2xgQ3mzK7kuNnarCw opendistro_asynchronous_search_generic      0        0       602
dMXQM2xgQ3mzK7kuNnarCw refresh                                     0        0         0
dMXQM2xgQ3mzK7kuNnarCw repository_azure                            0        0         0
dMXQM2xgQ3mzK7kuNnarCw search                                      0        0         0
dMXQM2xgQ3mzK7kuNnarCw search_throttled                            0        0         0
dMXQM2xgQ3mzK7kuNnarCw snapshot                                    0        0         0
dMXQM2xgQ3mzK7kuNnarCw sql-worker                                  0        0         0
dMXQM2xgQ3mzK7kuNnarCw system_read                                 0        0         0
dMXQM2xgQ3mzK7kuNnarCw system_write                                0        0         0
dMXQM2xgQ3mzK7kuNnarCw warmer                                      0        0         0
dMXQM2xgQ3mzK7kuNnarCw write                                       0        0         0

Here is a fragment of pending tasks GET /_cluster/pending_tasks:

{
  "tasks" : [
    {
      "insert_order" : 21163,
      "priority" : "NORMAL",
      "source" : "opendistro-im",
      "executing" : true,
      "time_in_queue_millis" : 29827891,
      "time_in_queue" : "8.2h"
    },
    {
      "insert_order" : 21164,
      "priority" : "NORMAL",
      "source" : "opendistro-im",
      "executing" : false,
      "time_in_queue_millis" : 29827891,
      "time_in_queue" : "8.2h"
    },
    {
      "insert_order" : 21165,
      "priority" : "NORMAL",
      "source" : "opendistro-im",
      "executing" : false,
      "time_in_queue_millis" : 29827891,
      "time_in_queue" : "8.2h"
    },
    {
      "insert_order" : 21167,
      "priority" : "NORMAL",
      "source" : "opendistro-im",
      "executing" : false,
      "time_in_queue_millis" : 29825193,
      "time_in_queue" : "8.2h"
    },
    {
      "insert_order" : 21166,
      "priority" : "NORMAL",
      "source" : "opendistro-im",
      "executing" : false,
      "time_in_queue_millis" : 29825193,
      "time_in_queue" : "8.2h"
    },

I have looked into data and master nodes logs and there is no single ERROR/WARNING log there before or after the time that the logs stopped.

If you have any ideas on what to investigate - please let me know.

Hi @kitlhar,

Were you able to resolve this?

In your pending tasks did you happen to see an index management rollover task? There was a bug that caused a deadlock on the master node because of the rollover operation that happens in the plugin.

If not, then are you able to see other tasks in the pending task queue? We have seen that the IM tasks which are NORMAL priority end up getting repeatedly passed on when a cluster is spamming put settings, mappings, or indexing w/ dynamic mappings such that the master is always processing these tasks which have a higher priority.

If you’re able to upgrade to OpenSearch you would benefit from the newer development of our plugins, of which for Index Management we have moved the metadata updates (the tasks you are seeing) out of the cluster state and into the internal ism config index which means there would be no more opendistro-ism metadata tasks in the pending task queue.

1 Like

Hi,

What we have observed is that the master nodes were being CPU throttled. We have increased the CPU limits for those slightly (they don’t need much, about 1000mCores I think) and then the problem stopped. But from what you explain it probably was caused by a deadlock.

I will mark opendistro for an upgrade - thx for letting me know about a fix for this issue!

Same for me when I delete a big index which is 60 shard ( 3-4Tb )

  "number_of_nodes" : 23,
  "number_of_data_nodes" : 20,
  "active_primary_shards" : 2285,
  "active_shards" : 2728,
  "relocating_shards" : 0,
  "initializing_shards" : 0,
  "unassigned_shards" : 0,
  "delayed_unassigned_shards" : 0,
  "number_of_pending_tasks" : 0,
  "number_of_in_flight_fetch" : 0,
  "task_max_waiting_in_queue_millis" : 0,
  "active_shards_percent_as_number" : 100.0