Model is Partially responding in case of non-ML node restart

andrii · December 11, 2023, 5:12pm

OpenSearch 2.11.1
Cluster configuration:
3 master nodes
3 ingest nodes
3 data nodes
3 ML nodes
The cluster is deployed on Kubernetes.

opensearch.yml has the following section:

    plugins.ml_commons.only_run_on_ml_node: true
    plugins.ml_commons.task_dispatch_policy: round_robin
    plugins.ml_commons.max_ml_task_per_node: 10
    plugins.ml_commons.max_model_on_node: 10
    plugins.ml_commons.sync_up_job_interval_in_seconds: 3
    plugins.ml_commons.monitoring_request_count: 100
    plugins.ml_commons.max_register_model_tasks_per_node: 10
    plugins.ml_commons.max_deploy_model_tasks_per_node: 10
    plugins.ml_commons.allow_registering_model_via_url: false
    plugins.ml_commons.native_memory_threshold: 90
    plugins.ml_commons.model_auto_redeploy.enable: true
    plugins.ml_commons.model_auto_redeploy.lifetime_retry_times: 5

For some reason, the cluster tries to deploy a model on any non-ML node once that node is restarted, and the model then becomes "Partially responding".

Interestingly, it doesn't matter how many nodes are restarted: only one (the last one) is mentioned in the Model status UI.

Is there any option that prevents this behavior?

ylwu · December 11, 2023, 7:49pm

@zane_neo, Zan, can you help take a look?

zane_neo · December 12, 2023, 3:07am

@andrii, can you check the model's planning worker nodes and deploy_to_all_nodes info in the index by invoking the /_ml/models/_search or /_ml/models/{your_model_id}?return_content=false API?
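
The check requested above could be scripted roughly as follows. This is a minimal sketch, not an official client: it assumes the ML Commons REST endpoints are served under the `/_plugins/_ml` prefix, and the helper names are made up for illustration. Authentication and TLS settings are omitted.

```python
import json
import urllib.request

# Assumption: ML Commons endpoints live under this prefix.
ML_PREFIX = "/_plugins/_ml"

# Fields of the model document that describe where the model should run.
PLANNING_FIELDS = (
    "model_state",
    "planning_worker_node_count",
    "current_worker_node_count",
    "planning_worker_nodes",
    "deploy_to_all_nodes",
)


def model_url(base_url, model_id):
    """Build the URL for fetching one model document (hypothetical helper)."""
    return f"{base_url.rstrip('/')}{ML_PREFIX}/models/{model_id}"


def planning_summary(model_doc):
    """Keep only the deployment-planning fields of a model document."""
    return {k: model_doc[k] for k in PLANNING_FIELDS if k in model_doc}


def fetch_planning_summary(base_url, model_id):
    """Fetch the model document over HTTP and summarize it (no auth handled)."""
    with urllib.request.urlopen(model_url(base_url, model_id)) as resp:
        return planning_summary(json.load(resp))
```

Calling `fetch_planning_summary` with the cluster's base URL and the model ID returns only the fields relevant to this question, which makes it easier to spot an unexpected node ID.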

andrii · December 12, 2023, 2:55pm

@zane_neo here is the output of the command:

{
  "name": "huggingface/sentence-transformers/multi-qa-MiniLM-L6-cos-v1",
  "model_group_id": "T1MSRowBZ52_YC-EEcLS",
  "algorithm": "TEXT_EMBEDDING",
  "model_version": "1",
  "model_format": "TORCH_SCRIPT",
  "model_state": "DEPLOYED",
  "model_content_size_in_bytes": 91794759,
  "model_content_hash_value": "51f09df55d16debce6caaf381004b243a9131c4e64245cbe4bf51e788dc20196",
  "model_config": {
    "model_type": "bert",
    "embedding_dimension": 384,
    "framework_type": "SENTENCE_TRANSFORMERS",
    "all_config": """{"_name_or_path":"nreimers/MiniLM-L6-H384-uncased","architectures":["BertModel"],"attention_probs_dropout_prob":0.1,"gradient_checkpointing":false,"hidden_act":"gelu","hidden_dropout_prob":0.1,"hidden_size":384,"initializer_range":0.02,"intermediate_size":1536,"layer_norm_eps":1e-12,"max_position_embeddings":512,"model_type":"bert","num_attention_heads":12,"num_hidden_layers":6,"pad_token_id":0,"position_embedding_type":"absolute","transformers_version":"4.8.2","type_vocab_size":2,"use_cache":true,"vocab_size":30522}"""
  },
  "created_time": 1701982684193,
  "last_updated_time": 1702334722824,
  "last_registered_time": 1701982695005,
  "last_deployed_time": 1702334722823,
  "auto_redeploy_retry_times": 0,
  "total_chunks": 10,
  "planning_worker_node_count": 3,
  "current_worker_node_count": 3,
  "planning_worker_nodes": [
    "fNjI1HT0RBCd3Pd17j4pLg",
    "N1bKhyYGTle6NukCtdEI3w",
    "Q9_T0r2yTmuEh9xF6xZCPw"
  ],
  "deploy_to_all_nodes": true
}

In planning_worker_nodes I see only ML nodes.
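
One way to confirm this observation programmatically is to compare the planned node IDs against node roles. The sketch below is illustrative only: it assumes a dictionary shaped like the `nodes` section of a `GET /_nodes` response, where each node ID maps to an object with a `roles` list, and that ML nodes carry the `ml` role. The function name and the sample role data are hypothetical.

```python
def non_ml_planning_nodes(model_doc, nodes_info):
    """Return planned worker node IDs that do not have the 'ml' role.

    nodes_info is assumed to map node ID -> {"roles": [...]}.
    A node ID missing from nodes_info is treated as non-ML.
    """
    planned = model_doc.get("planning_worker_nodes", [])
    return [
        node_id
        for node_id in planned
        if "ml" not in nodes_info.get(node_id, {}).get("roles", [])
    ]


# Node IDs from the model document in this thread; the roles are made up.
model_doc = {
    "planning_worker_nodes": [
        "fNjI1HT0RBCd3Pd17j4pLg",
        "N1bKhyYGTle6NukCtdEI3w",
        "Q9_T0r2yTmuEh9xF6xZCPw",
    ]
}
nodes_info = {
    "fNjI1HT0RBCd3Pd17j4pLg": {"roles": ["ml"]},
    "N1bKhyYGTle6NukCtdEI3w": {"roles": ["ml"]},
    "Q9_T0r2yTmuEh9xF6xZCPw": {"roles": ["ml"]},
}
offenders = non_ml_planning_nodes(model_doc, nodes_info)
```

An empty `offenders` list means every planned worker node is an ML node, which matches what the model document shows here.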

zane_neo · December 13, 2023, 5:40am

Thanks, @andrii. The current model status looks correct, so this should be a temporary issue. I'll try to reproduce it and get back to you.

zane_neo · December 18, 2023, 12:46am

Hi @andrii, this is a bug that only affects the profile API and the ML dashboard. It's a metadata issue, and the service won't actually try to deploy the model to a data node. I've created this issue to track and fix it: Machine Learning Dashboard shows partially response after restarting non-ml nodes · Issue #1774 · opensearch-project/ml-commons · GitHub
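
To see whether a cluster shows the metadata mismatch described above, the worker nodes reported by the profile API can be compared with the model's `planning_worker_nodes`. The following is a rough sketch under assumptions: it presumes the profile response has the shape `{"nodes": {node_id: {"models": {model_id: {"worker_nodes": [...]}}}}}`, which should be checked against the actual output of your OpenSearch version. The function name and the sample data are hypothetical.

```python
def unplanned_worker_nodes(profile, model_id, planning_worker_nodes):
    """Return worker node IDs the profile reports but the model does not plan.

    The profile shape used here is an assumption, not a documented contract.
    """
    planned = set(planning_worker_nodes)
    reported = set()
    for node in profile.get("nodes", {}).values():
        model = node.get("models", {}).get(model_id, {})
        reported.update(model.get("worker_nodes", []))
    return sorted(reported - planned)


# Made-up example: the profile lists one extra node next to a planned one.
sample_profile = {
    "nodes": {
        "ml-node-1": {
            "models": {
                "my-model": {"worker_nodes": ["ml-node-1", "data-node-1"]}
            }
        }
    }
}
extra = unplanned_worker_nodes(sample_profile, "my-model", ["ml-node-1"])
```

A non-empty result would point to a stale profile entry rather than an actual deployment on a non-ML node, in line with the explanation in the reply above.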

andrii · December 18, 2023, 5:19pm

thanks a lot @zane_neo

system · February 16, 2024, 5:20pm

This topic was automatically closed 60 days after the last reply. New replies are no longer allowed.
