Hi,
I am very new to opendistro.
We have it deployed in Kubernetes (1.13.2). For the past few days it has stopped processing logs after a while: it works for a few days after deployment and then stops.
I have checked the cluster state and the indices in the Kibana Dev Tools, and everything is reported green.
GET _cluster/allocation/explain also shows no unassigned shards.
The problem is a lot of rejected writes and a lot of tasks in the pending tasks queue (with wait times running to many hours), all with "source" : "opendistro-im".
Output of GET /_cat/thread_pool?v=true&h=id,name,active,rejected,completed
(columns in order: id, name, active, rejected, completed):
Yd1JOURVRaGqRY6PuwfoMA ad-batch-task-threadpool 0 0 0
Yd1JOURVRaGqRY6PuwfoMA ad-threadpool 0 0 0
Yd1JOURVRaGqRY6PuwfoMA analyze 0 0 0
Yd1JOURVRaGqRY6PuwfoMA fetch_shard_started 0 0 0
Yd1JOURVRaGqRY6PuwfoMA fetch_shard_store 0 0 0
Yd1JOURVRaGqRY6PuwfoMA flush 0 0 0
Yd1JOURVRaGqRY6PuwfoMA force_merge 0 0 0
Yd1JOURVRaGqRY6PuwfoMA generic 0 0 55735
Yd1JOURVRaGqRY6PuwfoMA get 0 0 0
Yd1JOURVRaGqRY6PuwfoMA listener 0 0 0
Yd1JOURVRaGqRY6PuwfoMA management 1 0 80405
Yd1JOURVRaGqRY6PuwfoMA open_distro_job_scheduler 0 0 0
Yd1JOURVRaGqRY6PuwfoMA opendistro_asynchronous_search_generic 0 0 618
Yd1JOURVRaGqRY6PuwfoMA refresh 0 0 0
Yd1JOURVRaGqRY6PuwfoMA repository_azure 0 0 0
Yd1JOURVRaGqRY6PuwfoMA search 0 0 0
Yd1JOURVRaGqRY6PuwfoMA search_throttled 0 0 0
Yd1JOURVRaGqRY6PuwfoMA snapshot 0 0 0
Yd1JOURVRaGqRY6PuwfoMA sql-worker 0 0 0
Yd1JOURVRaGqRY6PuwfoMA system_read 0 0 0
Yd1JOURVRaGqRY6PuwfoMA system_write 0 0 0
Yd1JOURVRaGqRY6PuwfoMA warmer 0 0 0
Yd1JOURVRaGqRY6PuwfoMA write 0 0 0
L9UXmY5vSku7r9caSeWzBA ad-batch-task-threadpool 0 0 0
L9UXmY5vSku7r9caSeWzBA ad-threadpool 0 0 0
L9UXmY5vSku7r9caSeWzBA analyze 0 0 0
L9UXmY5vSku7r9caSeWzBA fetch_shard_started 0 0 1127
L9UXmY5vSku7r9caSeWzBA fetch_shard_store 0 0 1983
L9UXmY5vSku7r9caSeWzBA flush 0 0 3073
L9UXmY5vSku7r9caSeWzBA force_merge 0 0 0
L9UXmY5vSku7r9caSeWzBA generic 0 0 7033791
L9UXmY5vSku7r9caSeWzBA get 0 0 4008
L9UXmY5vSku7r9caSeWzBA listener 0 0 0
L9UXmY5vSku7r9caSeWzBA management 1 0 5929836
L9UXmY5vSku7r9caSeWzBA open_distro_job_scheduler 0 0 4614
L9UXmY5vSku7r9caSeWzBA opendistro_asynchronous_search_generic 0 0 822
L9UXmY5vSku7r9caSeWzBA refresh 0 0 35359191
L9UXmY5vSku7r9caSeWzBA repository_azure 0 0 0
L9UXmY5vSku7r9caSeWzBA search 0 0 67743
L9UXmY5vSku7r9caSeWzBA search_throttled 0 0 0
L9UXmY5vSku7r9caSeWzBA snapshot 0 0 0
L9UXmY5vSku7r9caSeWzBA sql-worker 0 0 0
L9UXmY5vSku7r9caSeWzBA system_read 0 0 117
L9UXmY5vSku7r9caSeWzBA system_write 0 0 95
L9UXmY5vSku7r9caSeWzBA warmer 0 0 1704
L9UXmY5vSku7r9caSeWzBA write 0 0 2195576
OPUWg9MRQjG-RbYZYqu6qA ad-batch-task-threadpool 0 0 0
OPUWg9MRQjG-RbYZYqu6qA ad-threadpool 0 0 0
OPUWg9MRQjG-RbYZYqu6qA analyze 0 0 0
OPUWg9MRQjG-RbYZYqu6qA fetch_shard_started 0 0 1127
OPUWg9MRQjG-RbYZYqu6qA fetch_shard_store 0 0 2105
OPUWg9MRQjG-RbYZYqu6qA flush 0 0 2008
OPUWg9MRQjG-RbYZYqu6qA force_merge 0 0 0
OPUWg9MRQjG-RbYZYqu6qA generic 0 0 7607326
OPUWg9MRQjG-RbYZYqu6qA get 0 0 4128
OPUWg9MRQjG-RbYZYqu6qA listener 0 0 0
OPUWg9MRQjG-RbYZYqu6qA management 1 0 5368835
OPUWg9MRQjG-RbYZYqu6qA open_distro_job_scheduler 0 0 0
OPUWg9MRQjG-RbYZYqu6qA opendistro_asynchronous_search_generic 0 0 855
OPUWg9MRQjG-RbYZYqu6qA refresh 2 0 32284254
OPUWg9MRQjG-RbYZYqu6qA repository_azure 0 0 0
OPUWg9MRQjG-RbYZYqu6qA search 0 0 54209
OPUWg9MRQjG-RbYZYqu6qA search_throttled 0 0 0
OPUWg9MRQjG-RbYZYqu6qA snapshot 0 0 0
OPUWg9MRQjG-RbYZYqu6qA sql-worker 0 0 0
OPUWg9MRQjG-RbYZYqu6qA system_read 0 0 747
OPUWg9MRQjG-RbYZYqu6qA system_write 0 0 4539
OPUWg9MRQjG-RbYZYqu6qA warmer 0 0 2680
OPUWg9MRQjG-RbYZYqu6qA write 4 0 1583153
DVlyTw4rQg25wfR1Acixkw ad-batch-task-threadpool 0 0 0
DVlyTw4rQg25wfR1Acixkw ad-threadpool 0 0 0
DVlyTw4rQg25wfR1Acixkw analyze 0 0 0
DVlyTw4rQg25wfR1Acixkw fetch_shard_started 0 0 0
DVlyTw4rQg25wfR1Acixkw fetch_shard_store 0 0 0
DVlyTw4rQg25wfR1Acixkw flush 0 0 0
DVlyTw4rQg25wfR1Acixkw force_merge 0 0 0
DVlyTw4rQg25wfR1Acixkw generic 0 0 62326
DVlyTw4rQg25wfR1Acixkw get 0 0 0
DVlyTw4rQg25wfR1Acixkw listener 0 0 0
DVlyTw4rQg25wfR1Acixkw management 1 0 81941
DVlyTw4rQg25wfR1Acixkw open_distro_job_scheduler 0 0 0
DVlyTw4rQg25wfR1Acixkw opendistro_asynchronous_search_generic 0 0 629
DVlyTw4rQg25wfR1Acixkw refresh 0 0 0
DVlyTw4rQg25wfR1Acixkw repository_azure 0 0 0
DVlyTw4rQg25wfR1Acixkw search 0 0 7999
DVlyTw4rQg25wfR1Acixkw search_throttled 0 0 0
DVlyTw4rQg25wfR1Acixkw snapshot 0 0 0
DVlyTw4rQg25wfR1Acixkw sql-worker 0 0 0
DVlyTw4rQg25wfR1Acixkw system_read 0 0 196
DVlyTw4rQg25wfR1Acixkw system_write 0 0 0
DVlyTw4rQg25wfR1Acixkw warmer 0 0 0
DVlyTw4rQg25wfR1Acixkw write 0 0 6
tQBXjgZiQLehN8sptb11zw ad-batch-task-threadpool 0 0 0
tQBXjgZiQLehN8sptb11zw ad-threadpool 0 0 0
tQBXjgZiQLehN8sptb11zw analyze 0 0 0
tQBXjgZiQLehN8sptb11zw fetch_shard_started 0 0 0
tQBXjgZiQLehN8sptb11zw fetch_shard_store 0 0 177
tQBXjgZiQLehN8sptb11zw flush 0 0 1948
tQBXjgZiQLehN8sptb11zw force_merge 0 0 0
tQBXjgZiQLehN8sptb11zw generic 0 0 3004467
tQBXjgZiQLehN8sptb11zw get 0 0 3221
tQBXjgZiQLehN8sptb11zw listener 0 0 0
tQBXjgZiQLehN8sptb11zw management 1 0 5418273
tQBXjgZiQLehN8sptb11zw open_distro_job_scheduler 0 0 6269
tQBXjgZiQLehN8sptb11zw opendistro_asynchronous_search_generic 0 0 789
tQBXjgZiQLehN8sptb11zw refresh 0 0 35025823
tQBXjgZiQLehN8sptb11zw repository_azure 0 0 0
tQBXjgZiQLehN8sptb11zw search 0 0 47624
tQBXjgZiQLehN8sptb11zw search_throttled 0 0 0
tQBXjgZiQLehN8sptb11zw snapshot 0 0 0
tQBXjgZiQLehN8sptb11zw sql-worker 0 0 0
tQBXjgZiQLehN8sptb11zw system_read 0 0 446
tQBXjgZiQLehN8sptb11zw system_write 0 0 385
tQBXjgZiQLehN8sptb11zw warmer 0 0 3409
tQBXjgZiQLehN8sptb11zw write 4 1581 1700624
_0Ow2ObvQDK2L1-P7wplIQ ad-batch-task-threadpool 0 0 0
_0Ow2ObvQDK2L1-P7wplIQ ad-threadpool 0 0 0
_0Ow2ObvQDK2L1-P7wplIQ analyze 0 0 0
_0Ow2ObvQDK2L1-P7wplIQ fetch_shard_started 0 0 0
_0Ow2ObvQDK2L1-P7wplIQ fetch_shard_store 0 0 0
_0Ow2ObvQDK2L1-P7wplIQ flush 0 0 0
_0Ow2ObvQDK2L1-P7wplIQ force_merge 0 0 0
_0Ow2ObvQDK2L1-P7wplIQ generic 0 0 61864
_0Ow2ObvQDK2L1-P7wplIQ get 0 0 0
_0Ow2ObvQDK2L1-P7wplIQ listener 0 0 0
_0Ow2ObvQDK2L1-P7wplIQ management 1 0 81908
_0Ow2ObvQDK2L1-P7wplIQ open_distro_job_scheduler 0 0 0
_0Ow2ObvQDK2L1-P7wplIQ opendistro_asynchronous_search_generic 0 0 630
_0Ow2ObvQDK2L1-P7wplIQ refresh 0 0 0
_0Ow2ObvQDK2L1-P7wplIQ repository_azure 0 0 0
_0Ow2ObvQDK2L1-P7wplIQ search 0 0 137980
_0Ow2ObvQDK2L1-P7wplIQ search_throttled 0 0 0
_0Ow2ObvQDK2L1-P7wplIQ snapshot 0 0 0
_0Ow2ObvQDK2L1-P7wplIQ sql-worker 0 0 0
_0Ow2ObvQDK2L1-P7wplIQ system_read 0 0 193
_0Ow2ObvQDK2L1-P7wplIQ system_write 0 0 0
_0Ow2ObvQDK2L1-P7wplIQ warmer 0 0 0
_0Ow2ObvQDK2L1-P7wplIQ write 0 0 10
AEJZBJVDQXGJf4XBWXlzpw ad-batch-task-threadpool 0 0 0
AEJZBJVDQXGJf4XBWXlzpw ad-threadpool 0 0 0
AEJZBJVDQXGJf4XBWXlzpw analyze 0 0 0
AEJZBJVDQXGJf4XBWXlzpw fetch_shard_started 0 0 0
AEJZBJVDQXGJf4XBWXlzpw fetch_shard_store 0 0 0
AEJZBJVDQXGJf4XBWXlzpw flush 0 0 0
AEJZBJVDQXGJf4XBWXlzpw force_merge 0 0 0
AEJZBJVDQXGJf4XBWXlzpw generic 128 0 324175
AEJZBJVDQXGJf4XBWXlzpw get 0 0 0
AEJZBJVDQXGJf4XBWXlzpw listener 0 0 0
AEJZBJVDQXGJf4XBWXlzpw management 1 0 87532
AEJZBJVDQXGJf4XBWXlzpw open_distro_job_scheduler 0 0 0
AEJZBJVDQXGJf4XBWXlzpw opendistro_asynchronous_search_generic 0 0 1237
AEJZBJVDQXGJf4XBWXlzpw refresh 0 0 0
AEJZBJVDQXGJf4XBWXlzpw repository_azure 0 0 0
AEJZBJVDQXGJf4XBWXlzpw search 0 0 332
AEJZBJVDQXGJf4XBWXlzpw search_throttled 0 0 0
AEJZBJVDQXGJf4XBWXlzpw snapshot 0 0 0
AEJZBJVDQXGJf4XBWXlzpw sql-worker 0 0 0
AEJZBJVDQXGJf4XBWXlzpw system_read 0 0 0
AEJZBJVDQXGJf4XBWXlzpw system_write 0 0 0
AEJZBJVDQXGJf4XBWXlzpw warmer 0 0 0
AEJZBJVDQXGJf4XBWXlzpw write 0 0 0
dMXQM2xgQ3mzK7kuNnarCw ad-batch-task-threadpool 0 0 0
dMXQM2xgQ3mzK7kuNnarCw ad-threadpool 0 0 0
dMXQM2xgQ3mzK7kuNnarCw analyze 0 0 0
dMXQM2xgQ3mzK7kuNnarCw fetch_shard_started 0 0 0
dMXQM2xgQ3mzK7kuNnarCw fetch_shard_store 0 0 0
dMXQM2xgQ3mzK7kuNnarCw flush 0 0 0
dMXQM2xgQ3mzK7kuNnarCw force_merge 0 0 0
dMXQM2xgQ3mzK7kuNnarCw generic 0 0 51323
dMXQM2xgQ3mzK7kuNnarCw get 0 0 0
dMXQM2xgQ3mzK7kuNnarCw listener 0 0 0
dMXQM2xgQ3mzK7kuNnarCw management 1 0 78307
dMXQM2xgQ3mzK7kuNnarCw open_distro_job_scheduler 0 0 0
dMXQM2xgQ3mzK7kuNnarCw opendistro_asynchronous_search_generic 0 0 602
dMXQM2xgQ3mzK7kuNnarCw refresh 0 0 0
dMXQM2xgQ3mzK7kuNnarCw repository_azure 0 0 0
dMXQM2xgQ3mzK7kuNnarCw search 0 0 0
dMXQM2xgQ3mzK7kuNnarCw search_throttled 0 0 0
dMXQM2xgQ3mzK7kuNnarCw snapshot 0 0 0
dMXQM2xgQ3mzK7kuNnarCw sql-worker 0 0 0
dMXQM2xgQ3mzK7kuNnarCw system_read 0 0 0
dMXQM2xgQ3mzK7kuNnarCw system_write 0 0 0
dMXQM2xgQ3mzK7kuNnarCw warmer 0 0 0
dMXQM2xgQ3mzK7kuNnarCw write 0 0 0
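
The table above is long, so here is a minimal Python sketch for picking out only the rows with a non-zero rejected count. It assumes the _cat/thread_pool output has been saved as plain text with the five columns in the order requested above; the function name rejected_rows and the three-row sample (copied from the output above) are just for illustration.

```python
def rejected_rows(cat_output):
    """Return (node_id, pool, rejected) for every row with rejected > 0."""
    rows = []
    for line in cat_output.splitlines():
        parts = line.split()
        # Skip blank lines and the header row printed by v=true.
        if len(parts) != 5 or not parts[3].isdigit():
            continue
        node_id, pool, _active, rejected, _completed = parts
        if int(rejected) > 0:
            rows.append((node_id, pool, int(rejected)))
    return rows


# Three rows copied from the output above, plus the header line.
sample = """\
id name active rejected completed
OPUWg9MRQjG-RbYZYqu6qA write 4 0 1583153
tQBXjgZiQLehN8sptb11zw write 4 1581 1700624
AEJZBJVDQXGJf4XBWXlzpw generic 128 0 324175
"""

print(rejected_rows(sample))
```

On the full table this should make it easy to see which node and pool the rejections come from.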
Here is a fragment of the pending tasks from GET /_cluster/pending_tasks:
{
"tasks" : [
{
"insert_order" : 21163,
"priority" : "NORMAL",
"source" : "opendistro-im",
"executing" : true,
"time_in_queue_millis" : 29827891,
"time_in_queue" : "8.2h"
},
{
"insert_order" : 21164,
"priority" : "NORMAL",
"source" : "opendistro-im",
"executing" : false,
"time_in_queue_millis" : 29827891,
"time_in_queue" : "8.2h"
},
{
"insert_order" : 21165,
"priority" : "NORMAL",
"source" : "opendistro-im",
"executing" : false,
"time_in_queue_millis" : 29827891,
"time_in_queue" : "8.2h"
},
{
"insert_order" : 21167,
"priority" : "NORMAL",
"source" : "opendistro-im",
"executing" : false,
"time_in_queue_millis" : 29825193,
"time_in_queue" : "8.2h"
},
{
"insert_order" : 21166,
"priority" : "NORMAL",
"source" : "opendistro-im",
"executing" : false,
"time_in_queue_millis" : 29825193,
"time_in_queue" : "8.2h"
},
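
To summarize the whole queue rather than reading it entry by entry, something like the following Python sketch could be used. It assumes the complete JSON response of GET /_cluster/pending_tasks is available as text; the fragment pasted above is cut off, so the sample below is a shortened but complete stand-in built from its first two entries. The function name summarize_pending is hypothetical.

```python
import json
from collections import Counter


def summarize_pending(response_text):
    """Count pending tasks per source, list executing ones, find the longest wait."""
    tasks = json.loads(response_text)["tasks"]
    by_source = Counter(task["source"] for task in tasks)
    executing = [task["insert_order"] for task in tasks if task["executing"]]
    longest_ms = max((task["time_in_queue_millis"] for task in tasks), default=0)
    return by_source, executing, longest_ms


# Shortened stand-in for the real response (first two entries from the fragment above).
sample = json.dumps({
    "tasks": [
        {"insert_order": 21163, "priority": "NORMAL", "source": "opendistro-im",
         "executing": True, "time_in_queue_millis": 29827891, "time_in_queue": "8.2h"},
        {"insert_order": 21164, "priority": "NORMAL", "source": "opendistro-im",
         "executing": False, "time_in_queue_millis": 29827891, "time_in_queue": "8.2h"},
    ]
})

by_source, executing, longest_ms = summarize_pending(sample)
print(dict(by_source), executing, round(longest_ms / 3_600_000, 1), "hours")
```

This only reports what is in the queue; it does not explain why the executing task has not finished.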
I have looked through the data and master node logs, and there is not a single ERROR or WARNING entry before or after the time the logs stopped.
If you have any ideas on what to investigate, please let me know.