### What / Why
**What are you proposing?**
Users store logs, event data, a…nd other data in object stores, like S3, for analysis with batch-based analytics tooling (e.g., Spark). This pattern, often called a data lake, enables people to cost effectively and durably store data for analysis. As this relates to OpenSearch, we see people compliment their OpenSearch deployments with their data lakes for analysis. For example, some people choose to store real-time data in OpenSearch for real-time analytics and use their data lakes for longer-term historical analytics. However, we have also heard from users that they still want to be able to join both data sets for analysis. Because of this, we propose introducing the ability to create virtual indexes on data stored in object stores so that users can visualize in their data lake data alongside their OpenSearch data in OpenSearch Dashboards.
In order to accomplish this, we will introduce the ability to create virtual tables against data in object stores. From there, users will be able to create virtual indexes (aka materialized views) on the virtual tables and define the query used for generating the virtual index. OpenSearch will cache aggregated data into the virtual index and serve that data when users query against the virtual index. Users will be able to query this data as though it is regular OpenSearch data and visualize this data alongside their other OpenSearch data in OpenSearch Dashboards.
This proposal also aims to complement the existing [storage vision](https://github.com/opensearch-project/OpenSearch/issues/2578) by adding an option for querying data in object stores that is already stored in formats like [Parquet](https://parquet.apache.org/) without requiring reindexing the data into the Lucene format.
**Which users have asked for this feature?**
There have been multiple discussions in GitHub and the forums about using Object stores and features like UltraWarm to drive down costs. This proposal adds another option for querying data in object stores against formats the data might already be indexed in. Some example discussions include:
* https://forum.opensearch.org/t/any-plans-to-backport-ultrawarm-for-amazon-elasticsearch-service/2075
* https://forum.opensearch.org/t/whether-ultrawarm-will-be-open-source/5985
* https://github.com/opensearch-project/OpenSearch/issues/740
**What is the developer experience going to be?** _Does this have a REST API? If so, please describe the API and any impact it may have to existing APIs. In a brief summary (not a spec), highlight what new REST APIs or changes to REST APIs are planned. as well as any other API, CLI or Configuration changes that are planned as part of this feature._
This feature will have **both APIs and UX in OpenSearch Dashboards** for configuring, querying, and visualizing data from object stores. The high-level MVP workflow will follow:
1. Configuring connection to an object store and bucket which will auto create a virtual table
2. Creating a materialized view (aka virtual index) of the virtual table to define aggregations on the data you want cached.
3. The ability to create a visualization on the materialized view using any of the existing visualizations
4. The ability to add the visualization to an existing or new dashboard.
**Are there any security considerations?** _What is the security model of the new APIs? Features should be integrated into the OpenSearch security suite and so if they are not, we should highlight the reasons here._
*Answer*: Yes. This feature will have multiple security considerations:
1. Credentials to object stores must be securely stored and used.
2. This feature should adhere to the existing security capabilities of OpenSearch (e.g., Document-level security, index security, field-level security, and more).
3. There will need to be permissions for users to create data sources and create virtual indexes.
*_Are there any breaking changes to the API?_* If Yes, what is the path to minimizing impact? (example, add new API and deprecate the old one)
*Answer*: No. This will all be additive.
**Are there breaking changes to the User Experience?** _Will this change the existing user experience? Will this be a breaking change from a user flow or user experience perspective?_
**Answer:** No. This will all be additive.
**What will it take to execute?** _Are there any assumptions you may be making that could limit scope or add limitations? Are there performance, cost, or technical constraints that may impact the user experience? Does this feature depend on other feature work? What additional risks are there?_
There is risk around performance and stability of querying large datasets stored in object stores. In order to be successful we will need to benchmark for performance, reliability, and recommended caching configurations.
**Any remaining open questions?** _What are known enhancements to this feature? Any enhancements that may be out of scope but that we will want to track long term? List any other open questions that may need to be answered before proceeding with an implementation._
The biggest area of ambiguity is how this will evolve with the [multi-OpenSearch data source effort in OpenSearch Dashboards](https://github.com/opensearch-project/OpenSearch-Dashboards/issues/1388). The goal is to align UXs so that these are all managed and interacted with in similar ways. It is imperative that these experiences don't diverge as it will cause friction for users who want to integrate multiple data sources into OpenSearch Dashboards.
Beyond the initial release this feature can be enhanced by:
1. Integrating more data formats (e.g., JSON, CSV, other JDBC sources).
2. Improved caching mechanisms like prefetching based on Dashboard configurations and frequently accessed aggregations.
1. Note there may be optimization learnings from the storage vision projects.
4. For JDBC, potentially executing queries on the remote systems for higher performance/scale
5. Integrating the connection configuration into the admin panel effort
6. Integrating refresh or other virtual index operations into Index Management
7. Automatically spilling queries to the raw data source to reduce local cache size requirements.
### User stories
* [P0] As an administrator, I can create a connection to a bucket within an object store via the API and OpenSearch Dashboards administration experience. This will automatically create a virtual table inside of OpenSearch.
* [P0] As a developer, I can define a materialized view, including aggregations on the raw fields, and represent that as a virtual index inside of OpenSearch. This can be done both via the API and OpenSearch Dashboards.
* [P0] As a developer, when I create a virtual index in OpenSearch Dashboards, I will have the option to also create an index pattern for my virtual index.
* [P0] As a developer or Dashboards user, I can use the existing Visualization feature to create visualizations on virtual indexes.
* [P0] As a developer or Dashboards, I can add visualizations created with virtual indexes in existing or new dashboards that include data from non-virtual indexes.
* [P0] The existing security features (DLS, FLS, etc..) are compatible with virtual indexes.
* [P0] DSL, SQL, and PPL can be used to query virtual indexes
* [P0] As a developer I can use this feature with data lake formats such as Orc and Parquet.
* [P1] As a developer I can use this feature with any JDBC supported source.
* [P2] As a developer I can use this feature with JSON.
* [P2] As a developer I can use this feature with CSV.
* [P3] As a developer I can use this feature with other popular data lake formats (e.g., XML)
* [P0] As an administrator I can tune the caching options, refresh options, and other settings on materialized views via the API and OpenSearch Dashboards.
* [P0] As a dashboards user I can drill down to see the raw object store data for a given window.
* [P1] As a developer I can define a data enrichment policy as part of the materialized view definition
* [P1] As a dashboards user/developer I can join/build a correlation with data from a virtual index with data stored in OpenSearch
* [P2] As a Dashboards user, I am given a cue to see if a visualization is using a virtual index.