Multiple path.data directories/disks

Hello,

Elasticsearch is deprecating (from 7.13) and eventually removing support for adding multiple disks to path.data. Are there any plans for a similar change in OpenSearch?

While trying to find the right part of the manual, I noticed that options like path.data don’t seem to be documented in Get started - OpenSearch documentation at all. Is there other documentation?

Regards,
Matthias


I am also looking for this information. Can anybody help or suggest something?
How can we add multiple data paths?
How can we migrate from one data path to another?

Hi,
I registered here just to ask the same question. Does anyone know about OpenSearch’s plans for multiple data paths?

Losing it would not be nice :-/

I’m not aware of any plans to remove support for multiple data paths. You’d use it like in Elasticsearch, by supplying an array of paths under path.data.
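
To sketch it out, something like this in opensearch.yml (the mount points below are just placeholders for your own disks):

```yaml
# opensearch.yml: list several directories under path.data
# (placeholder mount points; use your own)
path.data:
  - /mnt/disk1/opensearch/data
  - /mnt/disk2/opensearch/data
```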

That said, I wouldn’t recommend doing this, because you might hit edge cases. For example, if shards are not of equal size, disk usage may be uneven between paths (because OpenSearch allocates shards to paths in a round-robin fashion), making it hit the disk watermarks sooner than it should.
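
For context, the watermarks I mean are the cluster-level disk allocation settings, which you can check or adjust via the cluster settings API if needed. A minimal sketch (the percentages are just illustrative values, not a recommendation):

```bash
# Sketch: adjust the disk watermarks that gate shard allocation.
# The percentage values are illustrative only.
curl -X PUT "localhost:9200/_cluster/settings" \
  -H 'Content-Type: application/json' -d'
{
  "persistent": {
    "cluster.routing.allocation.disk.watermark.low": "85%",
    "cluster.routing.allocation.disk.watermark.high": "90%",
    "cluster.routing.allocation.disk.watermark.flood_stage": "95%"
  }
}'
```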

Just for my understanding: in the case of high watermarks, OpenSearch should start rebalancing, but this will not work across different data paths on the same node, only across different nodes.

So the only way out of a single data path becoming full is to delete data, right…?

But as long as we keep the different data paths large enough, and the Elasticsearch database small enough, we should be all right…?

The big advantage I see in using multiple NVMe data paths on the same machine, instead of RAIDing multiple devices, is this (but please comment if you disagree):

  • if one data path fails, you lose only the data on that data path, not the whole data store
  • with RAID0, you lose all the data

With RAID10, the usable disk space is much less (and since we’re running NVMe, we don’t really need RAID10’s increased performance).

Any comments on the thoughts above…? Anyone…? An approach we’re overlooking…?

Correct!

So the only way out of a single data path becoming full is to delete data, right…?

Correct!

But as long as we keep the different data paths large enough, and the Elasticsearch database small enough, we should be all right…?

Yes, if shards are relatively equal in size, disk usage across data paths is relatively equal, too.

The big advantage I see in using multiple NVMe data paths on the same machine, instead of RAIDing multiple devices, is this (but please comment if you disagree):

  • if one data path fails, you lose only the data on that data path, not the whole data store
  • with RAID0, you lose all the data

Correct! Except that if one data path won’t work, the node doesn’t really function properly. Some allocations will fail (those hitting the bad disk), though it may retry on the good disk, I don’t remember the exact behavior (in which case the good disk will fill up). Either way, it’s going to be harder to troubleshoot, because issues aren’t consistent for the whole node. You’ll need to replace the node anyway. But it’s true that you get a bit more reliability, in the sense that you can still read indices from the good disk (e.g. to reindex them).

Given that most deployments have replicas for reliability, I haven’t come across a use case (yet) where the advantages of multiple data paths outweigh the disadvantages.

Actually, I did: I remember a use case where people would search through N indices at once: full-text search, no aggregations, but maaaany terms. This is very IO-latency-dependent. Here, RAID0 won’t help, because you’d have the same latency as a single disk (just better throughput). Meanwhile, with N data paths, N reads can potentially be parallelized better at the same latency.

Hi radu.gheorghe,
Thank you very much for your confirmations and answers, appreciated!

I’m now looking into snapshots and backup/restore, as that would allow us to use multiple data paths AND still be able to recover from single-disk failures.
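
The rough plan, as I understand it, is to whitelist a backup location and register a filesystem snapshot repository, along these lines (the path and repository name are placeholders I made up):

```bash
# 1) In opensearch.yml, whitelist the backup location (placeholder path),
#    then restart the node(s):
#      path.repo: ["/mnt/backups"]
#
# 2) Register a filesystem ("fs") snapshot repository; "my_fs_repo" is a made-up name:
curl -X PUT "localhost:9200/_snapshot/my_fs_repo" \
  -H 'Content-Type: application/json' -d'
{
  "type": "fs",
  "settings": {
    "location": "/mnt/backups"
  }
}'
```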


Just for the archives: I managed to redistribute our data across multiple data paths by creating a snapshot, changing the path.data config, and restoring the snapshot.
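
In a bit more detail, the sequence looked roughly like this (the repository and snapshot names are placeholders, and depending on your setup you may need to exclude security/system indices from the restore):

```bash
# 1) Snapshot everything (repository and snapshot names are placeholders):
curl -X PUT "localhost:9200/_snapshot/my_fs_repo/before_repathing?wait_for_completion=true"

# 2) Stop OpenSearch, change path.data in opensearch.yml to the new list of
#    directories, then start it again (it comes up with empty data paths).

# 3) Restore the snapshot into the reconfigured cluster:
curl -X POST "localhost:9200/_snapshot/my_fs_repo/before_repathing/_restore" \
  -H 'Content-Type: application/json' -d'
{
  "indices": "*",
  "include_global_state": true
}'
```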
