[Feedback] Machine Learning Model Serving Framework - Experimental Release

In OpenSearch 2.4, we will launch the first phase of the Machine Learning (ML) Model Serving Framework as an experimental feature. The framework aims to make it easier to integrate and operationalize ML models on OpenSearch to support a variety of ML use cases.

This framework provides the foundation for supporting the semantic search features released in 2.4 (Neural Search plugin). It allows users to register externally trained models in an OpenSearch cluster and serve them through the ml-commons API, enabling ML inference within ingest and query processes.
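As a taste of the workflow, here is a minimal sketch of registering and loading a model through the ml-commons REST API from Python. The endpoint paths and request fields follow the 2.4 documentation and may change in later releases; the model name, configuration values, and artifact URL are placeholders, not real artifacts.

```python
# Minimal sketch: register (upload) and load a TorchScript model via ml-commons.
# Endpoint names follow the 2.4 docs; names/URLs below are placeholders.
import time
import requests

OPENSEARCH = "https://localhost:9200"
AUTH = ("admin", "admin")   # demo credentials; substitute your own
VERIFY_TLS = False          # demo clusters ship self-signed certificates

# 1. Upload the model metadata; ml-commons downloads the TorchScript zip itself.
upload = requests.post(
    f"{OPENSEARCH}/_plugins/_ml/models/_upload",
    json={
        "name": "my-text-embedding-model",           # placeholder name
        "version": "1.0.0",
        "model_format": "TORCH_SCRIPT",
        "model_config": {
            "model_type": "bert",
            "embedding_dimension": 384,
            "framework_type": "sentence_transformers",
        },
        "url": "https://example.com/my-model.zip",   # placeholder artifact URL
    },
    auth=AUTH, verify=VERIFY_TLS,
).json()

# 2. Uploading is asynchronous: poll the task until it produces a model_id.
task_id = upload["task_id"]
while True:
    task = requests.get(
        f"{OPENSEARCH}/_plugins/_ml/tasks/{task_id}",
        auth=AUTH, verify=VERIFY_TLS,
    ).json()
    if task.get("model_id"):
        model_id = task["model_id"]
        break
    time.sleep(2)

# 3. Load the model onto the ML node(s) so it can serve inference requests.
requests.post(
    f"{OPENSEARCH}/_plugins/_ml/models/{model_id}/_load",
    auth=AUTH, verify=VERIFY_TLS,
)
print("model_id:", model_id)
```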

The OpenSearch product team is actively prioritizing enhancements such as the ability for you to integrate your choice of model server technology as an alternative to running your inference workloads on the ML Node. Your feedback will help us prioritize our roadmap. Please include a description of your use case so that we understand your context. Thank you for your participation!

This is good news. It is very difficult to give useful feedback before giving it a try, but I suggest the following:

  1. Neural search query capability: I know the plugin is already part of the release, but I want to stress the importance of including highlighting as part of it. A neural search query without highlighting is in most cases going to be misleading. Highlighting can be based on QA models, or it can return the most similar sentence.

  2. The ingestion model should be auto-registered for a neural attribute so that querying and inference over that attribute are always consistent (see the sketch after this list).

  3. Clean documentation on how to include external models (HuggingFace, for instance) is going to be crucial.
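To illustrate point 2: from the announcement, I expect the same model_id to be repeated by hand in the ingest pipeline and again in every neural query, with nothing keeping them in sync. A rough sketch of that pipeline/query pair, with the field names and model_id as placeholders and the syntax taken from the 2.4 neural-search documentation:

```python
# Sketch of the consistency point: the SAME model_id must appear both in the
# ingest pipeline (index-time embedding) and in the neural query (query-time
# embedding). Identifiers below are placeholders.
import requests

OPENSEARCH = "https://localhost:9200"
AUTH = ("admin", "admin")
MODEL_ID = "my-model-id"  # placeholder: the id returned by ml-commons

# Ingest pipeline: embed `passage_text` into `passage_embedding` at index time.
requests.put(
    f"{OPENSEARCH}/_ingest/pipeline/nlp-pipeline",
    json={
        "processors": [
            {"text_embedding": {
                "model_id": MODEL_ID,
                "field_map": {"passage_text": "passage_embedding"},
            }}
        ]
    },
    auth=AUTH, verify=False,
)

# Neural query: the same model_id encodes the query text at search time.
requests.post(
    f"{OPENSEARCH}/my-knn-index/_search",
    json={
        "query": {
            "neural": {
                "passage_embedding": {
                    "query_text": "how do transformers work?",
                    "model_id": MODEL_ID,
                    "k": 10,
                }
            }
        }
    },
    auth=AUTH, verify=False,
)
```

Auto-registering the model against the neural attribute would remove the need to repeat MODEL_ID in both places.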

I look forward to trying it.

I know it is probably early to ask, but when is it planned to become official?

Thanks for the interest! The feature should go public together with 2.4 soon, most likely this week.

So it won’t be experimental on 2.4?

Hi Hasan, it will be an experimental feature in the 2.4 release. We’ll share the roadmap to bring this feature to GA as soon as we’re ready.

Do you mind sharing what tools your team currently uses for building and serving machine learning models?

We use pretrained models from HuggingFace, and we also use Python for training and fine-tuning. We have never used these kinds of models in a Java environment. That is why I would like to know which models are and aren’t supported.

Hi Hasan, we support PyTorch models compiled in the TorchScript format in the first release. You won’t be limited to ML Java libraries. We’ll support more formats, such as ONNX, based on demand.

We will also provide some native models that have been pre-compiled and configured for you. If you only use HuggingFace models, we’ll have you covered. The main limitation in the first release will be the absence of GPU support, which will limit you to using smaller models designed to run on CPU.
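For reference, a rough sketch of what compiling a HuggingFace model to TorchScript looks like. The model name is just an example of a small, CPU-friendly model; the exact packaging ml-commons expects (zip layout, tokenizer files, config) is covered in the documentation.

```python
# Sketch: trace a HuggingFace transformer to TorchScript for CPU inference.
# The model name is an example; adjust packaging to what ml-commons expects.
import torch
from transformers import AutoModel, AutoTokenizer

MODEL_NAME = "sentence-transformers/all-MiniLM-L6-v2"  # example small model

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModel.from_pretrained(MODEL_NAME, torchscript=True)
model.eval()

# Trace with a representative input so the graph is recorded.
example = tokenizer("example sentence", return_tensors="pt")
traced = torch.jit.trace(
    model, (example["input_ids"], example["attention_mask"])
)
torch.jit.save(traced, "all-MiniLM-L6-v2.pt")
```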

How do you serve your ML models today outside of OpenSearch? Do you serve models on an ML platform (e.g., Amazon SageMaker, Databricks, Kubeflow, etc.), just run them in self-packaged containers on Kubernetes, or something else?

Currently we are using pretrained models from HuggingFace, served through FastAPI or run in batch mode.

By the way, where can we find the documentation for this feature? It will be extremely difficult to try it and give useful feedback without documentation.

Quote " If you only use HuggingFace models, we’ll have you covered", does this cover all huggingface tasks? Specifically I am interested in question answering.

Hi Hasan, I should have clarified. The first release is designed for text embedding models. You can import and serve the models through the ml-commons API, which gives you an interface for building an ingest pipeline that populates a k-NN index. Using the Neural Search plugin, you will be able to run semantic queries. The API for text embedding models expects free text as input and generates the text encodings (vectors). Supporting Q&A models will require some enhancements because the underlying inference mechanics are different.
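For illustration, a minimal sketch of that predict call: free text goes in, a dense vector comes out. The request shape follows the ml-commons documentation for this release; the model_id is a placeholder.

```python
# Sketch of a text-embedding predict request against ml-commons.
# The model_id is a placeholder for the id returned at upload time.
import requests

OPENSEARCH = "https://localhost:9200"
MODEL_ID = "my-model-id"  # placeholder

resp = requests.post(
    f"{OPENSEARCH}/_plugins/_ml/_predict/text_embedding/{MODEL_ID}",
    json={
        "text_docs": ["what is semantic search?"],
        "return_number": True,
        "target_response": ["sentence_embedding"],
    },
    auth=("admin", "admin"), verify=False,
)
# Each input document yields one sentence_embedding vector in the response.
print(resp.json())
```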

I’m currently evaluating options for providing more generic query support for models. If you provide me a list of the type of models you need support for, I’ll take that into account in our design.

Question: have you evaluated how well the HuggingFace Q&A models work for you out-of-the-box? Does your team need to fine-tune these models for your use cases?

Regards,
-Dylan

The tasks that I find helpful are:
1. QA
2. Document summarization
3. Zero/few-shot classification
4. Image embedding

I also suggest providing a multi-predict API, similar to multi-search. This could reduce the number of requests to the API and thus reduce the load.
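A purely hypothetical sketch of what I have in mind, modeled on the _msearch NDJSON convention. Neither the _mpredict endpoint nor this request shape exists today; it only illustrates the suggestion.

```python
# Hypothetical multi-predict request (does NOT exist): batch several predict
# calls, possibly against different models, into one NDJSON request.
import json
import requests

OPENSEARCH = "https://localhost:9200"

lines = [
    {"model_id": "embedding-model-id"},                    # header: which model
    {"text_docs": ["first passage", "second passage"]},    # body: its inputs
    {"model_id": "qa-model-id"},                           # header: another model
    {"question": "who wrote it?", "context": "some passage text"},
]
ndjson = "\n".join(json.dumps(line) for line in lines) + "\n"

requests.post(
    f"{OPENSEARCH}/_plugins/_ml/_mpredict",   # hypothetical endpoint
    data=ndjson,
    headers={"Content-Type": "application/x-ndjson"},
    auth=("admin", "admin"), verify=False,
)
```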

The out-of-the-box HuggingFace QA was very good in a general context. I haven’t tested it deeply in specialized contexts such as healthcare, for instance. But I can see there will be a need for fine-tuning at some point.

Awesome direction and feature!
I would like to add a major point of contention though, which is GPU support:

The current text embedding field is VERY focused on transformers, which beat earlier models by an extremely large margin.

Unfortunately, transformers require a GPU.

Word2Vec/fastText/GloVe and the like are awesome, but not close to being enough for the textual search cases that aren’t served well by the base Lucene capabilities.

BERT/RoBERTa and their ilk are amazing for contextual search. Game-changingly amazing. But they usually require a GPU.

This feature is so, so close to being a major game changer for OpenSearch, but without GPU support it doesn’t quite make it.

After GPU support is added, things like integrated models, better search, etc. should be pursued.
But as of now, it’s missing the mark by a bit, which is a shame, because it’s a huge step towards “make your own Google at home”.

Hi Hagay, thanks for your feedback! GPU support is absolutely critical and is a must have on our path to GA.

If we were to provide GPU support by allowing users to use their existing model serving infrastructure, what would that look like for you? For instance, what if we integrated with TorchServe/TensorFlow Serving/Triton running in your Kubernetes cluster and allowed you to run your ML workload there? Or maybe you use an ML platform like Amazon SageMaker, Azure ML, Databricks, or Kubeflow KServe?

Thanks, Hasan, for providing insights into your needs.

We actually have such a system running currently, and we use it by making batched API requests to the model for inference while indexing.

This, however, would not really meet the needs of the community, I believe.
If someone has fully fledged model serving, they can use it while indexing easily enough.
Integrating the embedding and enabling an “embedding tokenizer”, with no further work from the user, should make this an extremely easy solution for search.

The only way I can think of to integrate model serving frameworks is to have some sort of management connection to the framework itself that allows OpenSearch to add a model and create an API on the fly, with the user interacting with OpenSearch alone once the initial configuration is done.

Hi Hagay,

Let’s assume that we have a way of making this possible as an alternative to using dedicated ML node(s). What existing technologies does your organization use today to host your ML models (independent of your OpenSearch workloads)?

I have noticed that models are not automatically loaded when the OpenSearch cluster is rebooted. Is there a way to do that? It would be impractical to manually reload models each time the cluster is rebooted, especially in a containerized setup.
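For now I work around it with something along these lines at container startup, but a built-in option would be much better. The model ids are placeholders, and the _load endpoint name follows the current 2.x documentation.

```python
# Workaround sketch: after the cluster comes back up, re-issue a load call for
# each known model id so inference can resume. Model ids are placeholders.
import requests

OPENSEARCH = "https://localhost:9200"
AUTH = ("admin", "admin")
MODEL_IDS = ["embedding-model-id"]  # placeholder ids recorded at upload time

for model_id in MODEL_IDS:
    r = requests.post(
        f"{OPENSEARCH}/_plugins/_ml/models/{model_id}/_load",
        auth=AUTH, verify=False,
    )
    print(model_id, r.status_code)
```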

We currently use Triton, alongside simple APIs built with a regular web server atop an OpenShift cluster with a GPU scheduler, depending on the use case and how much custom logic needs to be implemented before actually making the inference (lots of custom logic = OpenShift, just serving a model = Triton).