Model is Partially responding in case of non-ML node restart

OpenSearch 2.11.1
Cluster configuration:
3 master nodes
3 ingest nodes
3 data nodes
3 ML nodes
The cluster is deployed on Kubernetes.

opensearch.yaml has the following section:

    plugins.ml_commons.only_run_on_ml_node: true
    plugins.ml_commons.task_dispatch_policy: round_robin
    plugins.ml_commons.max_ml_task_per_node: 10
    plugins.ml_commons.max_model_on_node: 10
    plugins.ml_commons.sync_up_job_interval_in_seconds: 3
    plugins.ml_commons.monitoring_request_count: 100
    plugins.ml_commons.max_register_model_tasks_per_node: 10
    plugins.ml_commons.max_deploy_model_tasks_per_node: 10
    plugins.ml_commons.allow_registering_model_via_url: false
    plugins.ml_commons.native_memory_threshold: 90
    plugins.ml_commons.model_auto_redeploy.enable: true
    plugins.ml_commons.model_auto_redeploy.lifetime_retry_times: 5

For some reason, whenever a non-ML node is restarted, the cluster appears to try to deploy the model on it, and the model status becomes Partially responding.

Interestingly, it doesn’t matter how many nodes are restarted: only one (the last one) is shown on the Model status UI.

Is there any option that prevents this behavior?

@zane_neo, Zan, can you help take a look?

@andrii, can you check the model’s planning_worker_nodes and deploy_to_all_nodes values in the index by invoking the /_ml/models/_search or /_ml/models/{your_model_id}?return_content=false API?
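
A minimal sketch of that lookup in Python, using only the standard library. The base URL `http://localhost:9200` and the lack of authentication are assumptions for illustration; a secured cluster will need credentials and TLS settings.

```python
import json
import urllib.request


def model_url(base_url: str, model_id: str) -> str:
    """Build the Get Model URL with return_content=false, as in the request above."""
    return f"{base_url.rstrip('/')}/_ml/models/{model_id}?return_content=false"


def fetch_model(base_url: str, model_id: str) -> dict:
    """GET the model document and parse the JSON body (no auth: an assumption)."""
    with urllib.request.urlopen(model_url(base_url, model_id)) as resp:
        return json.load(resp)
```

Calling `fetch_model("http://localhost:9200", "<your_model_id>")` should return a dictionary whose `planning_worker_nodes` and `deploy_to_all_nodes` keys are the fields of interest.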

@zane_neo here is an output of the command:

{
  "name": "huggingface/sentence-transformers/multi-qa-MiniLM-L6-cos-v1",
  "model_group_id": "T1MSRowBZ52_YC-EEcLS",
  "algorithm": "TEXT_EMBEDDING",
  "model_version": "1",
  "model_format": "TORCH_SCRIPT",
  "model_state": "DEPLOYED",
  "model_content_size_in_bytes": 91794759,
  "model_content_hash_value": "51f09df55d16debce6caaf381004b243a9131c4e64245cbe4bf51e788dc20196",
  "model_config": {
    "model_type": "bert",
    "embedding_dimension": 384,
    "framework_type": "SENTENCE_TRANSFORMERS",
    "all_config": """{"_name_or_path":"nreimers/MiniLM-L6-H384-uncased","architectures":["BertModel"],"attention_probs_dropout_prob":0.1,"gradient_checkpointing":false,"hidden_act":"gelu","hidden_dropout_prob":0.1,"hidden_size":384,"initializer_range":0.02,"intermediate_size":1536,"layer_norm_eps":1e-12,"max_position_embeddings":512,"model_type":"bert","num_attention_heads":12,"num_hidden_layers":6,"pad_token_id":0,"position_embedding_type":"absolute","transformers_version":"4.8.2","type_vocab_size":2,"use_cache":true,"vocab_size":30522}"""
  },
  "created_time": 1701982684193,
  "last_updated_time": 1702334722824,
  "last_registered_time": 1701982695005,
  "last_deployed_time": 1702334722823,
  "auto_redeploy_retry_times": 0,
  "total_chunks": 10,
  "planning_worker_node_count": 3,
  "current_worker_node_count": 3,
  "planning_worker_nodes": [
    "fNjI1HT0RBCd3Pd17j4pLg",
    "N1bKhyYGTle6NukCtdEI3w",
    "Q9_T0r2yTmuEh9xF6xZCPw"
  ],
  "deploy_to_all_nodes": true
}

In planning_worker_nodes I see only ML nodes.
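
The same check can be scripted rather than done by eye. The sketch below is hedged: it assumes `nodes_info` has the shape of a `GET /_nodes` response (a `nodes` map keyed by node ID, each entry carrying a `roles` list) and that ML nodes carry the `ml` role. The role assignments in the sample data are hand-written for illustration, not taken from this cluster.

```python
def non_ml_planned_nodes(model_doc: dict, nodes_info: dict) -> list:
    """Return planned worker node IDs that do not carry the "ml" role."""
    roles_by_id = {
        node_id: set(info.get("roles", []))
        for node_id, info in nodes_info.get("nodes", {}).items()
    }
    return [
        node_id
        for node_id in model_doc.get("planning_worker_nodes", [])
        if "ml" not in roles_by_id.get(node_id, set())
    ]


# Node IDs copied from the model document above.
model_doc = {
    "planning_worker_nodes": [
        "fNjI1HT0RBCd3Pd17j4pLg",
        "N1bKhyYGTle6NukCtdEI3w",
        "Q9_T0r2yTmuEh9xF6xZCPw",
    ]
}

# Hypothetical node listing: roles here are assumed, not real cluster output.
nodes_info = {
    "nodes": {
        "fNjI1HT0RBCd3Pd17j4pLg": {"roles": ["ml"]},
        "N1bKhyYGTle6NukCtdEI3w": {"roles": ["ml"]},
        "Q9_T0r2yTmuEh9xF6xZCPw": {"roles": ["ml"]},
        "hypothetical-data-node": {"roles": ["data"]},
    }
}

unexpected = non_ml_planned_nodes(model_doc, nodes_info)
```

An empty `unexpected` list means every planned worker node is an ML node, which matches the observation above.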

Thanks, @andrii. The current model status looks correct, so this should be a temporary issue. I’ll try to reproduce it and get back to you.

Hi @andrii, this is a bug that only affects the profile API and the ML dashboard. It’s a metadata issue, and the service won’t actually try to deploy the model to a data node. I’ve created this issue to track and fix it: Machine Learning Dashboard shows partially response after restarting non-ml nodes · Issue #1774 · opensearch-project/ml-commons · GitHub

Thanks a lot, @zane_neo!
