Multiple path.data directories/disks

Hello,

Elasticsearch is deprecating (from 7.13) and eventually removing support for adding multiple disks to path.data. Are there any plans for a similar change in OpenSearch?

While trying to find the right part of the manual, I noticed that options like path.data don’t seem to be documented in Get started - OpenSearch documentation at all. Is there other documentation?

Regards,
Matthias


I am also looking for this information. Can anybody help or suggest something?
How can we add multiple data paths?
How can we migrate from one data path to another?

Hi,
I registered here just to ask the same question. Does anyone know about OpenSearch’s plans for multiple data paths?

Losing it would not be nice :-/

I’m not aware of any plans to remove support for multiple data paths. You’d use it like in Elasticsearch, by supplying an array of paths under path.data.
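
To sketch it out, something like this in opensearch.yml (the mount points below are just placeholders for your own disks):

```yaml
# opensearch.yml: list several directories under path.data
# (placeholder mount points; use your own)
path.data:
  - /mnt/disk1/opensearch/data
  - /mnt/disk2/opensearch/data
```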

That said, I wouldn’t recommend doing this, because you might hit edge cases. For example, if shards are not of equal size, disk usage may be uneven between paths (because OpenSearch allocates shards to paths in a round-robin fashion), making it hit the disk watermarks sooner than it should.
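
For context, the watermarks I mean are the cluster-level disk allocation settings, which you can check or adjust via the cluster settings API if needed. A minimal sketch (the percentages are just illustrative values, not a recommendation):

```bash
# Sketch: adjust the disk watermarks that gate shard allocation.
# The percentage values are illustrative only.
curl -X PUT "localhost:9200/_cluster/settings" \
  -H 'Content-Type: application/json' -d'
{
  "persistent": {
    "cluster.routing.allocation.disk.watermark.low": "85%",
    "cluster.routing.allocation.disk.watermark.high": "90%",
    "cluster.routing.allocation.disk.watermark.flood_stage": "95%"
  }
}'
```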

Just for my understanding: in the case of high watermarks, OpenSearch should start rebalancing, but this will not work across different data paths on the same node, only across different nodes.

So the only way out of a single data path becoming full is to delete data, right…?

But as long as we keep the different data paths large enough, and the Elasticsearch database small enough, we should be all right…?

The big advantage I see in using multiple NVMe data paths on the same machine, instead of RAIDing multiple devices, is this (but please comment if you disagree):

  • if one data path fails, you lose only the data on that data path, not the whole data store
  • with RAID0, you lose all the data

With RAID10, the usable disk space is much less (and since we’re running NVMe, we don’t really need RAID10’s increased performance).

Any comments on the thoughts above…? Anyone…? An approach we’re overlooking…?

Correct!

So the only way out of a single data path becoming full is to delete data, right…?

Correct!

But as long as we keep the different data paths large enough, and the Elasticsearch database small enough, we should be all right…?

Yes, if shards are relatively equal in size, disk usage across data paths is relatively equal, too.

The big advantage I see in using multiple NVMe data paths on the same machine, instead of RAIDing multiple devices, is this (but please comment if you disagree):

  • if one data path fails, you lose only the data on that data path, not the whole data store
  • with RAID0, you lose all the data

Correct! Except that if one data path won’t work, the node doesn’t really function properly. Some allocations will fail (those hitting the bad disk), though it may retry on the good disk, I don’t remember the exact behavior (in which case the good disk will fill up). Either way, it’s going to be harder to troubleshoot, because issues aren’t consistent for the whole node. You’ll need to replace the node anyway. But it’s true that you get a bit more reliability, in the sense that you can still read indices from the good disk (e.g. to reindex them).

Given that most deployments have replicas for reliability, I haven’t come across a use case (yet) where the advantages of multiple data paths outweigh the disadvantages.

Actually, I did: I remember a use case where people would search through N indices at once: full-text search, no aggregations, but maaaany terms. This is very IO-latency-dependent. Here, RAID0 won’t help, because you’d have the same latency as a single disk (just better throughput). Meanwhile, with N data paths, N reads can potentially be parallelized better at the same latency.

Hi radu.gheorghe,
Thank you very much for your confirmations and answers, appreciated!

I’m now looking into snapshots and backup/restore, as that would allow us to use multiple data paths AND still be able to recover from single-disk failures.
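
The rough plan, as I understand it, is to whitelist a backup location and register a filesystem snapshot repository, along these lines (the path and repository name are placeholders I made up):

```bash
# 1) In opensearch.yml, whitelist the backup location (placeholder path),
#    then restart the node(s):
#      path.repo: ["/mnt/backups"]
#
# 2) Register a filesystem ("fs") snapshot repository; "my_fs_repo" is a made-up name:
curl -X PUT "localhost:9200/_snapshot/my_fs_repo" \
  -H 'Content-Type: application/json' -d'
{
  "type": "fs",
  "settings": {
    "location": "/mnt/backups"
  }
}'
```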


Just for the archives: I managed to redistribute our data across multiple data paths by creating a snapshot, changing the path.data config, and restoring the snapshot.
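
In a bit more detail, the sequence looked roughly like this (the repository and snapshot names are placeholders, and depending on your setup you may need to exclude security/system indices from the restore):

```bash
# 1) Snapshot everything (repository and snapshot names are placeholders):
curl -X PUT "localhost:9200/_snapshot/my_fs_repo/before_repathing?wait_for_completion=true"

# 2) Stop OpenSearch, change path.data in opensearch.yml to the new list of
#    directories, then start it again (it comes up with empty data paths).

# 3) Restore the snapshot into the reconfigured cluster:
curl -X POST "localhost:9200/_snapshot/my_fs_repo/before_repathing/_restore" \
  -H 'Content-Type: application/json' -d'
{
  "indices": "*",
  "include_global_state": true
}'
```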
