Questions about creating new cluster & old data

I’ve been storing web-scraping data in Elasticsearch for some time. As I was just learning, I kept it as a one-node cluster with the intent to expand at some point. Before I knew it, I’d stored over 300 GB of data in ES. Most of the data is evergreen, and each site I scrape has its own indices that restart at the beginning of every month. Recently, I decided to move over to OpenSearch and expand the cluster. So I took a backup of the ES data and destroyed the cluster. After playing with OpenSearch a bit, I set up a cluster of 2 master nodes, 3 data nodes and 2 coordinating nodes (overkill, I know). I have a couple of questions:

  1. What is the best way to reindex the ES data into OpenSearch? Is it just a matter of registering a snapshot repository and restoring from there?
  2. Some of the indices are large: more than 60-75 GB. Is there a benefit to splitting them? If so, how do I do it?
  3. Which server do I query/index against? In my old one-node cluster, it was easy. Do I query/index against the coordinating nodes? I was thinking about putting a load balancer in front of them.

Thanks in advance!

Hi @thedraketaylor ,

Let me try to address your questions.

  1. What is the best way to reindex the ES data into OpenSearch? Is it just a matter of registering a snapshot repository and restoring from there?

Yes, registering the repository and restoring from it would be the easiest way, unless your data depends on ES X-Pack features. In that case you may want to consider the Reindex API [1] instead.
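To make the repository route a bit more concrete, here is a minimal sketch of the two REST calls involved (`PUT /_snapshot/<repo>` and `POST /_snapshot/<repo>/<snapshot>/_restore`). It only builds the requests and does not send them. The base URL, repository name, snapshot name, filesystem path and index pattern are all placeholders I made up for illustration. It also assumes a shared-filesystem (`fs`) repository whose location is listed under `path.repo` on every node; whether a given snapshot restores cleanly depends on the Elasticsearch version that wrote it, so check the compatibility notes first.

```python
import json
from urllib.request import Request

# Assumption: a coordinating node reachable without TLS/auth on this address.
BASE_URL = "http://localhost:9200"


def json_request(method, path, body):
    """Build (but do not send) a JSON request against the cluster."""
    return Request(
        url=BASE_URL + path,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method=method,
    )


def register_fs_repository(repo, location):
    """PUT /_snapshot/<repo>: register a shared-filesystem repository."""
    body = {"type": "fs", "settings": {"location": location}}
    return json_request("PUT", f"/_snapshot/{repo}", body)


def restore_snapshot(repo, snapshot, indices):
    """POST /_snapshot/<repo>/<snapshot>/_restore: restore selected indices."""
    body = {
        "indices": indices,             # comma-separated names or patterns
        "include_global_state": False,  # do not pull in the old cluster state
    }
    return json_request("POST", f"/_snapshot/{repo}/{snapshot}/_restore", body)


if __name__ == "__main__":
    # Hypothetical names; replace with your own repository and snapshot.
    for req in (
        register_fs_repository("es_backup", "/mnt/snapshots"),
        restore_snapshot("es_backup", "snapshot_1", "site-a-*"),
    ):
        print(req.get_method(), req.full_url, req.data.decode("utf-8"))
```

Passing each request to `urllib.request.urlopen` (or issuing the same calls with curl) would execute them. Restoring a handful of indices first is a cheap way to confirm the snapshot is readable before restoring everything.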

  2. Some of the indices are large: more than 60-75 GB. Is there a benefit to splitting them? If so, how do I do it?

It is difficult to give specific advice without knowing the shard layout, since indices are split into shards which are distributed across the data nodes. If you have 1 shard of 60 GB, it is probably better to split; if you have 10 shards of 6 GB each, it would probably make sense to consolidate. Besides that, there are many factors to take into consideration: how fast do the indices grow? What are the search query patterns? The general recommendations come from Elasticsearch [2] and apply to OpenSearch as well, so please take a look.
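As a rough illustration of the reasoning above, the sketch below estimates a primary shard count from an index size and then builds the two requests a split would need: a write block on the source index, followed by `POST /<source>/_split/<target>`. The 30 GB target shard size is my own assumption (a middle value, not a rule), the index names are placeholders, and the cluster may still reject a shard count depending on the index's routing-shard settings.

```python
import math


def suggest_primary_shards(index_size_gb, current_shards, target_shard_gb=30.0):
    """Estimate a shard count that keeps each shard near target_shard_gb.

    The result is rounded up to a multiple of current_shards, because a split
    can only multiply the existing number of primary shards.
    """
    needed = max(1, math.ceil(index_size_gb / target_shard_gb))
    factor = max(1, math.ceil(needed / current_shards))
    return current_shards * factor


def build_split_requests(source, target, number_of_shards):
    """Return (method, path, body) tuples for a split, in the order to run them."""
    return [
        # The source index must be made read-only before it can be split.
        ("PUT", f"/{source}/_settings", {"index.blocks.write": True}),
        ("POST", f"/{source}/_split/{target}",
         {"settings": {"index.number_of_shards": number_of_shards}}),
    ]


if __name__ == "__main__":
    # Hypothetical example: one 75 GB index that currently has a single shard.
    shards = suggest_primary_shards(75, current_shards=1)
    for step in build_split_requests("site-a-2022.09", "site-a-2022.09-split", shards):
        print(step)
```

If the suggested count equals the current count, as with 10 shards of 6 GB, there is nothing to split and the index is better left alone or consolidated.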

  3. Which server do I query/index against? In my old one-node cluster, it was easy. Do I query/index against the coordinating nodes? I was thinking about putting a load balancer in front of them.

If you have coordinating nodes, they should be the primary point of contact for search and bulk indexing. Indexing a single document should be done on the respective data node. The configuration depends on the client: with the deprecated transport client [4] you could enable sniffing to discover all available nodes, while with the REST High Level Client you may need two clients [3].
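If you would rather not run a separate load balancer, spreading requests across the coordinating nodes can also be done on the client side. The following is only a sketch of simple round-robin selection; the host names are invented, and a real setup would also need health checks and retries, which a load balancer or an official client library would normally handle for you.

```python
from itertools import cycle


class RoundRobinHosts:
    """Hand out base URLs of the coordinating nodes in rotation."""

    def __init__(self, hosts):
        hosts = list(hosts)
        if not hosts:
            raise ValueError("at least one host is required")
        self._hosts = cycle(hosts)

    def next_url(self, path):
        """Return the full URL for `path` on the next host in the rotation."""
        host = next(self._hosts)
        return f"{host.rstrip('/')}/{path.lstrip('/')}"


if __name__ == "__main__":
    # Hypothetical addresses for the two coordinating nodes.
    pool = RoundRobinHosts(["http://coord-1:9200", "http://coord-2:9200"])
    for _ in range(3):
        print(pool.next_url("/_search"))
```

Each search or bulk request then goes to `pool.next_url(...)`, so the two coordinating nodes share the traffic evenly.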

Hope it helps.

[1] Reindex data - OpenSearch documentation
[2] Size your shards | Elasticsearch Guide [8.4] | Elastic
[3] When to use Coordinating vs Data Node - #6 by keithag - Elasticsearch - Discuss the Elastic Stack
[4] Transport Client | Java Transport Client (deprecated) [7.17] | Elastic

@reta Thanks! This is perfect!