# Feature Proposal: Add Remote Storage Options for Improved Durability
## Overview
OpenSearch is the search and analytics suite powering popular use cases such as application search, log analytics, and more. While OpenSearch is part of critical business and application workflows, it is seldom used as a primary data store because it offers no strong guarantees on data durability: a cluster is susceptible to data loss in case of hardware failure. In this document, we propose providing strong durability guarantees in OpenSearch using remote storage.
## Challenges
Today, data indexed in OpenSearch clusters is stored on a local disk. To achieve durability, that is, to ensure that committed operations are not lost on hardware failure, users either back up segments using snapshots or add more replica copies. However, snapshots do not fully guarantee durability: data indexed since the last snapshot can be lost in case of failure. Adding replicas, on the other hand, is cost-prohibitive. We therefore propose a more robust durability mechanism.
## Proposed Solution
In addition to storing data on a local disk, we propose providing an option to store indexed data in a remote durable data store, exposed through a new set of APIs. These APIs will enable building integrations with remote durable storage options like Amazon S3, OCI Object Storage, Azure Blob Storage, and GCP Cloud Storage, giving users the flexibility to choose the storage option that best fits their requirements.
Indexed data falls into two categories: committed and uncommitted. After an indexing operation succeeds, and before OpenSearch invokes a commit, the data is stored in the translog. After a commit, the data becomes part of a segment on disk. To make indexed data durable, both committed and uncommitted data need to be stored durably. We propose storing uncommitted data in a remote translog store and committed data in a remote segment store. This will also pave the way to support PITR (Point In Time Recovery) in the future.
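To make the split concrete, here is a minimal sketch of the two plugin points. The interface and method names (`RemoteTranslogStore`, `RemoteSegmentStore`, `upload`, `checkpoint`, `uploadSegment`, `markForDeletion`) are illustrative assumptions, not actual OpenSearch APIs; the real interfaces will be defined in the design doc.

```java
import java.io.IOException;
import java.io.InputStream;

/** Hypothetical plugin point for durably storing uncommitted (translog) data. */
interface RemoteTranslogStore {
    /** Persist translog operations for a generation before the write is acknowledged. */
    void upload(long generation, InputStream operations) throws IOException;

    /** Record the last committed generation so older remote translog data can be trimmed. */
    void checkpoint(long committedGeneration) throws IOException;
}

/** Hypothetical plugin point for durably storing committed (segment) data. */
interface RemoteSegmentStore {
    /** Copy a newly created segment file to the remote store. */
    void uploadSegment(String fileName, InputStream content) throws IOException;

    /** Mark a remote segment for deletion; actual removal happens after a retention period. */
    void markForDeletion(String fileName) throws IOException;
}
```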
### Remote Translog Storage
The translog (**Trans**action **Log**) contains data that has been indexed successfully but not yet committed. Each successful indexing operation makes an entry in the translog, which is a write-ahead transaction log. With remote translog storage, we keep a copy of the translog in a remote store as well. OpenSearch invokes a “flush” operation to perform a Lucene commit, which starts a new translog. At that point we need to checkpoint the remote store with the committed changes to keep it consistent with the primary. See [Feature Proposal - Pluggable Translog](https://github.com/opensearch-project/OpenSearch/issues/1319) for more details.
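A minimal flow sketch of this interaction, assuming the hypothetical `RemoteTranslogStore` interface above; the class and method names here are illustrative, not part of the proposal:

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;

class RemoteTranslogSync {
    private final RemoteTranslogStore remote;
    private long generation = 0;

    RemoteTranslogSync(RemoteTranslogStore remote) {
        this.remote = remote;
    }

    /** Called for every successful indexing operation: the write is acknowledged
        only after the operation is durably persisted to the remote translog. */
    void onIndexOperation(byte[] operation) throws IOException {
        remote.upload(generation, new ByteArrayInputStream(operation));
    }

    /** Called when an OpenSearch "flush" performs a Lucene commit: a new translog
        generation starts, and the remote store is checkpointed with the committed
        changes to keep it consistent with the primary. */
    void onFlush() throws IOException {
        long committed = generation;
        generation++;                  // start a new translog
        remote.checkpoint(committed);  // committed operations now live in segments
    }
}
```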
### Remote Segment Storage
Segments represent committed indexed data. With remote segment storage, all newly created segments on the primary will be stored in the configured remote store. A new segment is created either by an OpenSearch “flush” operation, by a “refresh” operation that creates segments in the page cache, or by merging existing segments. Segment deletions on the primary will mark the corresponding remote segments for deletion; these marked segments will be retained for a configurable period.
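A companion sketch for the segment side, again assuming the hypothetical `RemoteSegmentStore` interface above; all names are illustrative:

```java
import java.io.IOException;
import java.io.InputStream;
import java.util.Collection;
import java.util.Map;

class RemoteSegmentSync {
    private final RemoteSegmentStore remote;

    RemoteSegmentSync(RemoteSegmentStore remote) {
        this.remote = remote;
    }

    /** Called after a flush, a refresh, or a merge creates new segment files on the primary. */
    void onSegmentsCreated(Map<String, InputStream> newSegmentFiles) throws IOException {
        for (Map.Entry<String, InputStream> file : newSegmentFiles.entrySet()) {
            remote.uploadSegment(file.getKey(), file.getValue());
        }
    }

    /** Called when segments are deleted on the primary (for example after a merge);
        the remote copies are only marked and removed after the retention period. */
    void onSegmentsDeleted(Collection<String> deletedFiles) throws IOException {
        for (String file : deletedFiles) {
            remote.markForDeletion(file);
        }
    }
}
```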
While remote storage enables durability, there are tradeoffs for both the remote translog store and the remote segment store.
## Considerations
### Durability
Durability of indexed data depends on the durability of the configured remote store. The user can choose a remote store based on their durability requirements.
### Availability
Data must be consistent between the remote and primary stores. Write availability of indexed data depends on the availability of the data node and the remote translog store; OpenSearch writes will fail if either is unavailable. Read operations can still be served. However, in case of a failover, data availability for reads will depend on the availability of the remote store. The user can choose a remote data store based on their availability requirements.
### Performance
A synchronous call to the remote translog store is needed during each indexing operation to guarantee that all operations have been durably persisted. This increases the average response time of the write API, but there should not be any significant throughput degradation; this will be covered in more detail in the design doc. Once indexed data is durable, users who do not need the extra availability provided by replicas can opt out of them, which removes the synchronous network call from the primary to the replicas and lowers the average response time of the write API. The overhead of the call to the remote store depends on the performance of the remote store.
We will also have one more network call, to the remote segment store, when segments are created. Because segment creation happens in the background, there won’t be any direct impact on the performance of search and indexing. However, the dedicated APIs that create segments (`flush`, `refresh`) will see increased response times because of the extra network call to the remote store.
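The sketch below illustrates where the extra latency lands, reusing the hypothetical stores sketched earlier: the remote translog call sits on the per-document write path, while segment uploads sit only on the flush/refresh path. Class and method names are illustrative.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.Map;

class RemoteStoreWritePath {
    private final RemoteTranslogStore translogStore;
    private final RemoteSegmentStore segmentStore;

    RemoteStoreWritePath(RemoteTranslogStore translogStore, RemoteSegmentStore segmentStore) {
        this.translogStore = translogStore;
        this.segmentStore = segmentStore;
    }

    /** Per-document indexing path: the synchronous remote translog call adds its
        latency to the write API response time. */
    void onIndex(long generation, byte[] operation) throws IOException {
        translogStore.upload(generation, new ByteArrayInputStream(operation));
    }

    /** flush/refresh path: segment uploads happen here, off the per-document path,
        so search and indexing are not directly affected; only these APIs take longer. */
    void onFlushOrRefresh(Map<String, InputStream> newSegmentFiles) throws IOException {
        for (Map.Entry<String, InputStream> file : newSegmentFiles.entrySet()) {
            segmentStore.uploadSegment(file.getKey(), file.getValue());
        }
    }
}
```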
### Consistency
Data in the remote segment store will be eventually consistent with the data on the local disk. It will not impact data visibility for read operations as queries will be served from the local disk.
### Cost
Any change in cost will be determined by two factors: the configured remote stores and the current cluster configuration, in particular the change in the number of replicas and the stores used for the translog and the segments.
As an illustration, consider a user who has configured replicas only for durability and who uses snapshots for periodic backups. Replacing those replicas with remote storage will lower cost. The cost of storing remote segments will remain the same, because the store used for snapshots can be used for storing remote segments. However, the additional storage for the remote translog will increase cost. Since durable storage options like Amazon S3 are inexpensive, this increase should be minimal.
### Recovery
In the current OpenSearch implementation of peer recovery, data is copied from the primary to the replicas, which consumes system resources (disk, network) of the primary. By using the remote store to copy indexed data to replicas instead, the scalability of the primary improves, since the data sync work is performed on the replica; however, the overall recovery time may increase.
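A recovery sketch under stated assumptions: the read-side interfaces below (`RemoteStoreReader`, `LocalShard` and their methods) are not part of this proposal and are assumed only for illustration. It shows a replica restoring committed segments and replaying uncommitted operations from the remote store, without touching the primary’s disk or network.

```java
import java.io.IOException;
import java.io.InputStream;
import java.util.List;

/** Assumed read-side view of the remote stores; not part of the proposal. */
interface RemoteStoreReader {
    List<String> listSegments() throws IOException;                      // committed data
    InputStream readSegment(String fileName) throws IOException;
    List<byte[]> readTranslogSince(long checkpoint) throws IOException;  // uncommitted data
}

/** Assumed local shard operations on the recovering replica. */
interface LocalShard {
    void writeSegment(String fileName, InputStream content) throws IOException;
    void replay(byte[] operation) throws IOException;
}

class ReplicaRecovery {
    private final RemoteStoreReader reader;

    ReplicaRecovery(RemoteStoreReader reader) {
        this.reader = reader;
    }

    /** Copy committed segments, then replay operations recorded after the last commit.
        The sync work runs on the replica, which relieves the primary, but recovery time
        now depends on remote store throughput. */
    void recover(LocalShard shard, long lastCommittedCheckpoint) throws IOException {
        for (String segment : reader.listSegments()) {
            shard.writeSegment(segment, reader.readSegment(segment));
        }
        for (byte[] operation : reader.readTranslogSince(lastCommittedCheckpoint)) {
            shard.replay(operation);
        }
    }
}
```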
## Next steps
Based on the feedback from this Feature Proposal, we’ll be drafting a design (link to be added). We look forward to collaborating with the community.
## FAQs
### 1. With the remote storage option, will the indexed data be stored on the local disk as well?
Yes, we will continue to write data to the local disk as well. This is to ensure that search performance is not impacted. Data in the remote store will be used for restore and recovery operations.
### 2. What would be the durability SLA?
The durability SLA would be the same as that of the configured remote store. If two different remote stores are used for the translog and the segments, the durability of OpenSearch is the minimum of the two.
### 3. Is it possible to use one remote data store for translog and segments?
Yes, it is possible. The translog is an append-only log, whereas a segment is an immutable file. Write throughput to the translog matches the indexing throughput, while segment creation is a background process that is not driven directly by indexing throughput. Because of these differences, we will have different sets of APIs for the translog store and the segment store. If a data store can be integrated with both sets of APIs, it can serve as both the remote translog store and the remote segment store, as sketched below.
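A hypothetical sketch of that case, reusing the interfaces sketched earlier: one backend (here an in-memory map standing in for an object store such as Amazon S3) implements both sets of APIs. All names are illustrative.

```java
import java.io.IOException;
import java.io.InputStream;
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicLong;

class SingleBackendRemoteStore implements RemoteTranslogStore, RemoteSegmentStore {
    // Stand-in for an object store bucket; a real integration would call the store's client.
    private final Map<String, byte[]> objects = new ConcurrentHashMap<>();
    private final Set<String> markedForDeletion = ConcurrentHashMap.newKeySet();
    private final AtomicLong translogSeq = new AtomicLong();

    @Override
    public void upload(long generation, InputStream operations) throws IOException {
        // Append-only log data: each upload goes under a unique key for its generation.
        objects.put("translog/" + generation + "/" + translogSeq.getAndIncrement(),
            operations.readAllBytes());
    }

    @Override
    public void checkpoint(long committedGeneration) throws IOException {
        objects.put("translog/checkpoint", Long.toString(committedGeneration).getBytes());
    }

    @Override
    public void uploadSegment(String fileName, InputStream content) throws IOException {
        objects.put("segments/" + fileName, content.readAllBytes());  // immutable segment files
    }

    @Override
    public void markForDeletion(String fileName) throws IOException {
        markedForDeletion.add("segments/" + fileName);  // removed only after the retention period
    }
}
```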
### 4. Do I still need to use snapshots for backup and restore?
The architecture of snapshots and remote storage will be unified. If you configure the remote store option, taking a snapshot would effectively be a no-op. But if the durability provided by snapshots meets your requirements, you can disable remote storage and continue using snapshots.
### 5. Can the remote data be used across different OpenSearch versions?
Version compatibility would be similar to what OpenSearch supports with snapshots.
## Open Questions
1. Will it be possible to enable remote storage for only a subset of indexes?
2. Will it be possible to change the remote store, or disable the remote storage option, on a live cluster?