I apologize in advance if this question has already been asked. I looked through the boards and couldn’t find anything. I also apologize if this question is dumb. I’m completely new to OpenSearch and have been struggling with this part of my journey.
I’m tasked with building an OpenSearch server for our Bitbucket Server repo. I’ve been reading up on this for a while, and what I need to do seems pretty straightforward. It’s going to be installed on Linux, so that part is easy. The problem I’m having is that I can’t figure out what specs the server needs. The entirety of the Bitbucket repo is around 4TB. I’ve figured out what I’ll need in terms of CPU, but I haven’t been able to figure out what I’ll need in terms of RAM or storage. Can anyone help me figure this out, or point me towards an equation that can?
I’m going to skip the “it depends” part
My rule of thumb is that your index size (excluding replicas) is about the size of your original data: the data gets compressed, but metadata is added at the same time. It also depends on what your mapping looks like. If you index everything, you’ll use more storage than if you only index a couple of fields, for example.
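To illustrate the “only index a couple of fields” point, here’s a minimal sketch of an index mapping. The field names are invented for this example, but the `"index": False` parameter is a real mapping option: the field is still stored in `_source` (so you get it back in results) without building search structures for it, which saves disk.

```python
# Hypothetical mapping for a source-code index. Field names are made up;
# the "index": False pattern itself is standard OpenSearch mapping syntax.
mapping = {
    "mappings": {
        "properties": {
            "path":      {"type": "keyword"},                  # exact-match searchable
            "content":   {"type": "text"},                     # full-text indexed
            "commit_id": {"type": "keyword", "index": False},  # kept in _source, not searchable
        }
    }
}
```

You’d pass a body like this when creating the index; every field you mark as not indexed is one less set of postings and doc values on disk.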
As for RAM, there are two aspects: how much you absolutely need and how much is desirable. The first aspect is effectively split in two: some “static” memory, which is mostly caches (a configurable percentage of your heap), but also some in-memory pointers to various data structures (e.g. whether you indexed a field or not). Nowadays that part is typically negligible, say a few GB per TB of index. The bigger (and unknown) part is the “dynamic” memory needed to run queries. This depends on how many queries run at the same time and how expensive they are.
Again, let me throw out a rule of thumb: I’d be surprised if you need more than 30GB of heap on a node that holds this much data. Though I’ve been surprised many times before, so make sure you test and monitor before you go to prod.
So far I’ve only talked about heap memory, not the entire RAM. You’ll also need RAM for the OS to cache files from disk. How much OS cache you need depends on how fast your storage is. With 4TB, it’s not realistic to cache everything in memory and make do with slow storage (as you could if you were serving just a few documents). So you’ll probably want to go with 64GB of RAM and local SSDs (read: not “over-the-network-but-it’s-very-fast-I-promise” storage; disks attached to the machine/VM).
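Putting the rules of thumb above into numbers for the 4TB case — note the 2GB-per-TB figure for static heap is my assumption within the “a few GB per TB” range:

```python
# Back-of-envelope RAM sizing for a node holding ~4 TB of index,
# using the rules of thumb from this thread.
index_tb = 4
static_heap_gb = 2 * index_tb          # "a few GB per TB"; 2 GB/TB assumed here
heap_gb = 30                           # rule-of-thumb heap ceiling for this data size
total_ram_gb = 64
os_cache_gb = total_ram_gb - heap_gb   # what's left for the OS page cache
print(static_heap_gb, heap_gb, os_cache_gb)  # 8 30 34
```

So with a 64GB box you’d have roughly 34GB left over for the page cache, which is why the storage needs to be fast: the cache covers only a small fraction of 4TB.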
I hope this helps. Since you said you’re new to OpenSearch, let me shamelessly plug the training classes I’ll be running in less than a month, in case you’re interested: OpenSearch Training - Sematext
Thank you so much! This is exactly the info I was looking for. I appreciate you accounting for the OS as well. I try to run a pretty lean Linux OS at my job, so that shouldn’t have too large a footprint. This is all great info. I’ve read through a bunch of OpenSearch documentation; do you have anything you’d recommend for starters? Also, thanks for mentioning your training class. I’ll have to see if I can talk my management into covering that.
For starters, if you can’t find OpenSearch-specific material, any Elasticsearch book or tutorial will get you most of the way. For now they’re close enough, and even in many of the places where they’ve diverged (index management, for example), the functionality is similar.
I would normally recommend my book and video tutorial, but they’re rather old