Support dedicated ML node

ylwu · April 21, 2022, 5:09pm

We released a new ML plugin, ml-commons, in OpenSearch 1.3, and we are planning to add more models.

ML models generally consume more resources, especially during training. We are going to support bigger ML models, which might require more resources and special hardware such as GPUs. Because OpenSearch doesn’t support ML nodes, we dispatch ML tasks to data nodes only. That means a user who wants to train a big model has to scale up all data nodes, which is costly and unreasonable. If we support a dedicated ML node, users don’t need to scale up their data nodes at all; they just configure a new ML node (with different settings and a more powerful instance type) and add it to the cluster. Running ML tasks on dedicated nodes also separates resource usage better, which reduces the impact on other tasks such as search and ingestion.

More generally, we could add a “computation” node for computation-intensive tasks like ML, and we may build a more general solution, such as assigning or changing node roles/tags on the fly. See this GitHub issue for details: Support dynamic node role · Issue #2877 · opensearch-project/OpenSearch · GitHub. Suggestions and questions are welcome! To keep the discussion in one place, please post them on the GitHub issue directly.

jdbright · October 6, 2022, 3:33pm

Following up on what @ylwu mentioned earlier: we closed the loop on ML nodes in 2.1.

github.com/opensearch-project/ml-commons

Support dedicated ml node

opened 12:31AM - 30 Sep 21 UTC

closed 09:39PM - 07 Jul 22 UTC

spbjss

enhancement v2.1.0

**Is your feature request related to a problem?**

We released ml-commons plugin… in OpenSearch 1.3. It supports model training and prediction. ML models generally consume more resources, especially during training. The community wants to support bigger ML models, which might require more resources and special hardware such as GPUs. Because OpenSearch doesn’t support ML nodes, we dispatch ML tasks to data nodes only. That means a user who wants to train a large model has to scale up all data nodes, which can be costly. ML tasks also use shared resources on data nodes, which may impact the core search and indexing functions.

**What solution would you like?**

Support a dedicated ML node so that users don’t need to scale up their data nodes at all. Instead, they just configure a new ML node (with different settings and a more powerful instance type) and add it to the cluster via the YAML file (requires a cluster restart). By doing so, users can separate resource usage better by running ML tasks on dedicated nodes, which reduces the impact on other critical tasks such as search and ingestion.

OpenSearch core checks node roles when a node starts. If a role is not one of the built-in roles, such as the `data` role, it throws an exception and the node can't start. To support a dedicated ML node, this limitation has to be removed from OpenSearch core. That is done in this PR, which adds support for dynamic node roles in OpenSearch: https://github.com/opensearch-project/OpenSearch/pull/3436. With that in place, the ml-commons code can be enhanced to dispatch tasks to `ml` nodes first and fall back to data nodes if there are no `ml` nodes.

**Do you have any additional context?**

[Original Proposal](https://github.com/opensearch-project/OpenSearch/issues/2877)
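The dispatch rule described in the issue (prefer `ml` nodes, fall back to data nodes when none exist) can be sketched in Python roughly as follows. This is only an illustration and not the ml-commons implementation: the `Node` class, the `eligible_nodes` function, and the sample node names are hypothetical, and the sketch assumes each node simply exposes a set of role names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    """Hypothetical stand-in for a cluster node: a name plus its role names."""
    name: str
    roles: frozenset


def eligible_nodes(nodes):
    """Return the nodes an ML task may be dispatched to.

    Nodes with the `ml` role are preferred. If the cluster has none,
    fall back to nodes with the `data` role.
    """
    ml_nodes = [n for n in nodes if "ml" in n.roles]
    if ml_nodes:
        return ml_nodes
    return [n for n in nodes if "data" in n.roles]


# Hypothetical cluster: two data nodes and one dedicated ML node.
cluster = [
    Node("data-1", frozenset({"data", "ingest"})),
    Node("data-2", frozenset({"data"})),
    Node("ml-1", frozenset({"ml"})),
]

print([n.name for n in eligible_nodes(cluster)])      # dedicated ML node is chosen
print([n.name for n in eligible_nodes(cluster[:2])])  # no ML node: data nodes are used
```

With a dedicated ML node in the cluster, only that node is selected, so the data nodes keep their resources for search and ingestion. Without one, the behavior matches the pre-2.1 situation where ML tasks run on data nodes.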
