Who can figure out if something is wrong or not?

I am inexperienced with Elastic/OpenSearch; I have only used it twice on some side projects. But my company uses it via AWS OpenSearch. I am a cloud engineer, so my task has been to check costs and optimize.
I came upon an OpenSearch cluster in production that has 26 nodes of size r5.large.search (2 vCPU, 16 GB RAM).

I know for sure that there is a problem with either the setup or how applications use this cluster, because I know we do not have that much data to process, index or search.

The CPU stays around 3% on each node, and memory stays at 97% on each node.
I have attached a screenshot that shows the nodes.


Indexing latency is 160-240 on only one node. All other nodes are at 0, apart from 3-4 nodes which range from 10 to 60.
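If it helps, this is roughly the cat call that I believe shows the same per-node numbers as the AWS console (I haven't verified the exact column names on our version, so treat it as a sketch):

GET /_cat/nodes?v&h=name,node.role,cpu,ram.percent,heap.percent&s=cpu:desc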

Can someone tell from this if we have a problem? If you need anything else from the metrics, let me know :slight_smile:

Hi @MaXiMilIan! I think the best thing to start with would be taking inventory of just how many indices you have, how many documents are in them, and what sources you have sending new documents into OpenSearch.

For a super high level view, I’d start with the API / Dev Console and just check

GET /_cat/indices

There’s likely a GUI page for this too, under Management → Index management, depending on which version of dashboards you’re running.

Either way, it should give you a list of all the indices in your cluster and how many docs are in each. There are likely a number of built-in indices for security logs and whatnot.
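If the plain listing is too noisy, the same cat API can also sort and trim the output; these are the standard _cat parameters as far as I know, but double-check them on your OpenSearch version:

GET /_cat/indices?v&h=index,docs.count,store.size&s=store.size:desc

That should surface the biggest indices first, which makes it much easier to see where the bulk of the data actually lives.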


Since you seem to be on a managed service, can you tell if you have AWS OpenSearch Ingestion configured to send new documents there? Are you aware of any custom UIs that allow web-based searching through your indices? OpenSearch is at heart a search engine, so if traces, metrics, logs, or search data aren't being sent into it, it must be being used for something. Do you have any hints as to what the current use case is for your massive cluster? 26 nodes seems like a lot for something that appears relatively unused.

I’d love to help more - have a look around and let us know what you find!

Hi, thanks for your message.

So I did some research after your message and found out that we have over 12 billion documents in the cluster, stored on EBS volumes, since we only have hot (standard) OpenSearch storage.

I asked my colleague to give me some information on what is stored and which index has the most data.
The index with the most data (600M documents, over 4 TB) is actually just storing application log data.

I found out:

  • The data is immutable, nobody changes it.
  • The data is only retrieved by Power BI; the average search rate is 4-5 docs/s, the max is 51 docs/s, and search latency is not a concern.
  • The data is not cleaned up by any cronjob or process.
  • All they do is send data from SQL to Elasticsearch (so we have a known schema for the data).
  • The main index, with 4 TB of data, is loaded by a job that runs every day at 15:30, and it is split across only 5 shards (see the shard check right after this list).
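For the shard check, this is the kind of command I have in mind (app-logs-* is a placeholder for our real index name, and the column names are the standard _cat ones as far as I can tell):

GET /_cat/shards/app-logs-*?v&h=index,shard,prirep,store&s=store:desc

If 4 TB really sits on only 5 primary shards, that would be roughly 800 GB per shard, which seems far above the usual guidance of keeping shards in the tens of gigabytes.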

What I recommended to my tech lead, which should drastically reduce the cost:

  • Use UltraWarm nodes for data older than 7 days and cold storage for logs older than 30 days.
  • Send everything to Redshift (AWS data warehouse)

What do you think of my research and conclusions?
From what I've read about UltraWarm, it stores data on S3, and since we never update our data (it's effectively read-only), I believe it will be very efficient in terms of storage costs.
For data older than 30 days, we can just keep it in cold storage on S3, without using any compute resources.
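The migration itself would be driven by an ISM policy. Here is a rough sketch of what I have in mind, assuming the AWS managed-service ISM actions warm_migration and cold_migration; the policy name, the app-logs-* index pattern, and the @timestamp field are all placeholders that would need to be checked against our actual cluster and the AWS docs:

PUT _plugins/_ism/policies/app-logs-lifecycle
{
  "policy": {
    "description": "Sketch only, placeholder names: hot for 7 days, then UltraWarm, then cold storage after 30 days",
    "default_state": "hot",
    "states": [
      {
        "name": "hot",
        "actions": [],
        "transitions": [
          { "state_name": "warm", "conditions": { "min_index_age": "7d" } }
        ]
      },
      {
        "name": "warm",
        "actions": [ { "warm_migration": {} } ],
        "transitions": [
          { "state_name": "cold", "conditions": { "min_index_age": "30d" } }
        ]
      },
      {
        "name": "cold",
        "actions": [ { "cold_migration": { "timestamp_field": "@timestamp" } } ],
        "transitions": []
      }
    ],
    "ism_template": {
      "index_patterns": ["app-logs-*"],
      "priority": 100
    }
  }
}

As I understand it, the ism_template part only attaches the policy to newly created indices that match the pattern, so the existing indices would need the policy applied to them explicitly.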

My company is quite lazy when it comes to initiatives, so I don't think they will go with the Redshift option, as you would have to migrate things and set everything up from scratch.

I think you've got a great solution here. I might also suggest, if it's an option, looking into reserved instances for your OpenSearch nodes, which can increase savings quite a bit compared to on-demand nodes.