OpenSearch Community Meeting - 2023-0131

kris · February 1, 2023, 4:58pm

Slides:

github.com/opensearch-project/dashboards-maps

[FEATURE] Maps Multi Layer framework with enriched visualization

opened 05:25PM - 19 Apr 22 UTC

closed 06:22PM - 10 Jan 23 UTC

vamshin

enhancement v2.5.0 feature roadmap maps geospatial

**Is your feature request related to a problem?**
Yes. [add layers to a map in OpenSearch Dashboard 1.2.0](https://discuss.opendistrocommunity.dev/t/how-to-add-layers-to-a-map-in-opensearch-dashboard-1-2-0/8654)

**What solution would you like?**
Ability to add layers to a map in OpenSearch Dashboard

github.com/opensearch-project/sql

[FEATURE] Materialized views (aka virtual indexes) on object stores

opened 07:33PM - 16 Nov 22 UTC

elfisher

feature RFC

### What / Why

**What are you proposing?**

Users store logs, event data, and other data in object stores, like S3, for analysis with batch-based analytics tooling (e.g., Spark). This pattern, often called a data lake, enables people to cost-effectively and durably store data for analysis. As this relates to OpenSearch, we see people complement their OpenSearch deployments with their data lakes for analysis. For example, some people choose to store real-time data in OpenSearch for real-time analytics and use their data lakes for longer-term historical analytics. However, we have also heard from users that they still want to be able to join both data sets for analysis.

Because of this, we propose introducing the ability to create virtual indexes on data stored in object stores so that users can visualize their data lake data alongside their OpenSearch data in OpenSearch Dashboards. In order to accomplish this, we will introduce the ability to create virtual tables against data in object stores. From there, users will be able to create virtual indexes (aka materialized views) on the virtual tables and define the query used for generating the virtual index. OpenSearch will cache aggregated data into the virtual index and serve that data when users query against the virtual index. Users will be able to query this data as though it is regular OpenSearch data and visualize this data alongside their other OpenSearch data in OpenSearch Dashboards.

This proposal also aims to complement the existing [storage vision](https://github.com/opensearch-project/OpenSearch/issues/2578) by adding an option for querying data in object stores that is already stored in formats like [Parquet](https://parquet.apache.org/) without requiring reindexing the data into the Lucene format.

**Which users have asked for this feature?**

There have been multiple discussions in GitHub and the forums about using object stores and features like UltraWarm to drive down costs. This proposal adds another option for querying data in object stores against formats the data might already be indexed in. Some example discussions include:

* https://forum.opensearch.org/t/any-plans-to-backport-ultrawarm-for-amazon-elasticsearch-service/2075
* https://forum.opensearch.org/t/whether-ultrawarm-will-be-open-source/5985
* https://github.com/opensearch-project/OpenSearch/issues/740

**What is the developer experience going to be?**

_Does this have a REST API? If so, please describe the API and any impact it may have to existing APIs. In a brief summary (not a spec), highlight what new REST APIs or changes to REST APIs are planned, as well as any other API, CLI or Configuration changes that are planned as part of this feature._

This feature will have **both APIs and UX in OpenSearch Dashboards** for configuring, querying, and visualizing data from object stores. The high-level MVP workflow will follow:

1. Configuring a connection to an object store and bucket, which will auto-create a virtual table.
2. Creating a materialized view (aka virtual index) of the virtual table to define aggregations on the data you want cached.
3. The ability to create a visualization on the materialized view using any of the existing visualizations.
4. The ability to add the visualization to an existing or new dashboard.

**Are there any security considerations?**

_What is the security model of the new APIs? Features should be integrated into the OpenSearch security suite and so if they are not, we should highlight the reasons here._

*Answer*: Yes. This feature will have multiple security considerations:

1. Credentials to object stores must be securely stored and used.
2. This feature should adhere to the existing security capabilities of OpenSearch (e.g., document-level security, index security, field-level security, and more).
3. There will need to be permissions for users to create data sources and create virtual indexes.

**Are there any breaking changes to the API?**

_If yes, what is the path to minimizing impact? (example, add new API and deprecate the old one)_

*Answer*: No. This will all be additive.

**Are there breaking changes to the User Experience?**

_Will this change the existing user experience? Will this be a breaking change from a user flow or user experience perspective?_

*Answer*: No. This will all be additive.

**What will it take to execute?**

_Are there any assumptions you may be making that could limit scope or add limitations? Are there performance, cost, or technical constraints that may impact the user experience? Does this feature depend on other feature work? What additional risks are there?_

There is risk around performance and stability of querying large datasets stored in object stores. In order to be successful we will need to benchmark for performance, reliability, and recommended caching configurations.

**Any remaining open questions?**

_What are known enhancements to this feature? Any enhancements that may be out of scope but that we will want to track long term? List any other open questions that may need to be answered before proceeding with an implementation._

The biggest area of ambiguity is how this will evolve with the [multi-OpenSearch data source effort in OpenSearch Dashboards](https://github.com/opensearch-project/OpenSearch-Dashboards/issues/1388). The goal is to align UXs so that these are all managed and interacted with in similar ways. It is imperative that these experiences don't diverge as it will cause friction for users who want to integrate multiple data sources into OpenSearch Dashboards.

Beyond the initial release this feature can be enhanced by:

1. Integrating more data formats (e.g., JSON, CSV, other JDBC sources).
2. Improved caching mechanisms like prefetching based on Dashboard configurations and frequently accessed aggregations.
   1. Note there may be optimization learnings from the storage vision projects.
3. For JDBC, potentially executing queries on the remote systems for higher performance/scale.
4. Integrating the connection configuration into the admin panel effort.
5. Integrating refresh or other virtual index operations into Index Management.
6. Automatically spilling queries to the raw data source to reduce local cache size requirements.

### User stories

* [P0] As an administrator, I can create a connection to a bucket within an object store via the API and OpenSearch Dashboards administration experience. This will automatically create a virtual table inside of OpenSearch.
* [P0] As a developer, I can define a materialized view, including aggregations on the raw fields, and represent that as a virtual index inside of OpenSearch. This can be done both via the API and OpenSearch Dashboards.
* [P0] As a developer, when I create a virtual index in OpenSearch Dashboards, I will have the option to also create an index pattern for my virtual index.
* [P0] As a developer or Dashboards user, I can use the existing Visualization feature to create visualizations on virtual indexes.
* [P0] As a developer or Dashboards user, I can add visualizations created with virtual indexes in existing or new dashboards that include data from non-virtual indexes.
* [P0] The existing security features (DLS, FLS, etc.) are compatible with virtual indexes.
* [P0] DSL, SQL, and PPL can be used to query virtual indexes.
* [P0] As a developer I can use this feature with data lake formats such as Orc and Parquet.
* [P1] As a developer I can use this feature with any JDBC supported source.
* [P2] As a developer I can use this feature with JSON.
* [P2] As a developer I can use this feature with CSV.
* [P3] As a developer I can use this feature with other popular data lake formats (e.g., XML).
* [P0] As an administrator I can tune the caching options, refresh options, and other settings on materialized views via the API and OpenSearch Dashboards.
* [P0] As a dashboards user I can drill down to see the raw object store data for a given window.
* [P1] As a developer I can define a data enrichment policy as part of the materialized view definition.
* [P1] As a dashboards user/developer I can join/build a correlation with data from a virtual index with data stored in OpenSearch.
* [P2] As a Dashboards user, I am given a cue to see if a visualization is using a virtual index.
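The caching behavior described in the proposal (aggregate once, then serve queries from the virtual index) can be pictured with a small Python sketch. This is only an illustration of the idea, not the design in the RFC: the `VirtualIndex` class, its `refresh` and `query` methods, and the in-memory row format are all hypothetical stand-ins.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable

# Hypothetical row type standing in for records read from a virtual table
# that sits on top of an object store.
Row = Dict[str, object]


class VirtualIndex:
    """Toy materialized view: caches one sum aggregation over a virtual table."""

    def __init__(self, scan: Callable[[], Iterable[Row]], group_by: str, metric: str) -> None:
        self._scan = scan          # callable that reads the raw rows (the expensive part)
        self._group_by = group_by  # field to group on
        self._metric = metric      # numeric field to sum
        self._cache: Dict[object, float] = {}

    def refresh(self) -> None:
        """Re-scan the source and rebuild the cached aggregation."""
        totals: Dict[object, float] = defaultdict(float)
        for row in self._scan():
            totals[row[self._group_by]] += float(row[self._metric])
        self._cache = dict(totals)

    def query(self, key: object) -> float:
        """Answer from the cache only; the object store is not touched here."""
        return self._cache.get(key, 0.0)
```

In this sketch, repeated calls to `query` never trigger a scan; only `refresh` does, which mirrors the P0 story about tuning refresh options separately from querying.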

github.com/opensearch-project/sql

[FEATURE] OpenSearch and Apache Spark Integration

opened 05:54PM - 29 Nov 22 UTC

penghuo

enhancement feature RFC

## Introduction

We received a feature request for query execution on object stores in OpenSearch.

* https://github.com/opensearch-project/sql/issues/1080

We have investigated the possibility of building a new solution that lets OpenSearch users leverage an object store as storage. This includes:

* https://github.com/opensearch-project/sql/issues/948
* https://github.com/opensearch-project/sql/issues/719
* https://github.com/opensearch-project/sql/issues/612

**We found the following challenges:**

* The OpenSearch aggregation framework is a simplified MPP framework and does not support a shuffle stage.
* The OpenSearch query framework is missing key features, e.g. JOIN and subqueries.

These problems have already been solved by general-purpose data preprocessing systems, e.g. Presto, Spark, and Trino, and building such a platform takes years to mature.

## Idea

**The initial idea is:**

1. Use SQL as the interface.
2. Leverage Spark as the query/compute execution engine.

<img width="935" alt="Screen Shot 2023-01-25 at 12 00 11 PM" src="https://user-images.githubusercontent.com/2969395/215843125-0f873cc0-6a01-4ac4-aa79-2ec56d1788bd.png">

### User Experience

1. The user configures a Spark cluster as the computation resource, e.g. https://SPARK:7707.
2. The user submits SQL to the OpenSearch cluster using the `_plugins/_sql` REST API.
   1. The SQL engine parses and analyzes the SQL query.
   2. The SQL engine decides whether to route the query to the Spark cluster or run it locally.
3. In phase 1, [we provide an interface that lets users create a derived dataset from data on an object store and store it in OpenSearch.](https://github.com/opensearch-project/sql/issues/612) Queries will then be optimized automatically at query time based on the derived dataset.
4. In phase 2, we [provide an opt-in optimization choice for users](https://github.com/opensearch-project/sql/issues/612). The derived dataset will be created automatically based on query patterns.

### Epic

* https://github.com/opensearch-project/sql/issues/1295
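The routing step in the user experience above can be sketched in Python. The issue does not say how the SQL engine decides where a query runs, so the heuristic here (send queries containing a JOIN or a subquery to Spark, since those are the features listed as missing locally) is purely an assumption for illustration. `choose_engine` and `build_sql_request` are hypothetical helper names; the `{"query": ...}` body follows the existing SQL plugin request shape.

```python
import json
import re
from typing import Tuple

# Assumed heuristic: the features the issue names as unsupported locally.
_JOIN = re.compile(r"\bJOIN\b", re.IGNORECASE)
_SUBQUERY = re.compile(r"\(\s*SELECT\b", re.IGNORECASE)


def choose_engine(sql: str) -> str:
    """Return "spark" for queries using JOIN or a subquery, else "local"."""
    if _JOIN.search(sql) or _SUBQUERY.search(sql):
        return "spark"
    return "local"


def build_sql_request(sql: str) -> Tuple[str, str]:
    """Return the REST path and JSON body for submitting a query to the SQL plugin."""
    return "/_plugins/_sql", json.dumps({"query": sql})
```

A real engine would make this decision on the analyzed query plan rather than on the raw text; the regex check only keeps the sketch short.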

github.com/opensearch-project/sql

[FEATURE] Object Storage (S3) Data Ingestion through Streaming Query

opened 05:04PM - 21 Oct 22 UTC

dai-chen

feature meta

**Is your feature request related to a problem?**

One of the key technical challenges in https://github.com/opensearch-project/sql/issues/719 is how to maintain consistency between the base table (S3 data) and the derived table (OpenSearch index/materialized view).

**What solution would you like?**

One solution is to refresh new data from S3 to OpenSearch incrementally. We propose enhancing our query engine by unifying batch processing and stream processing capabilities in a single architecture, as existing solutions such as Apache Flink and Spark do. In particular, the enhancement includes changes to query planning, the query execution engine, and the query plan itself. PoC branch: https://github.com/opensearch-project/sql/tree/poc/maximus-m1. A user manual and detailed design doc will be published later as planned below.

**What alternatives have you considered?**

The alternative is to rebuild the derived table (full refresh) on user demand or on a regular basis. This can be done with the current batch processing architecture; however, it will introduce significant overhead for large S3 datasets.

**Do you have any additional context?**

## Phase 1

### Goal:

* Ready for performance evaluation
* Ready for feature evaluation
* Missing
  * Failure recovery
  * Security

### Tasks

- [x] Infra Enhancement
- [x] https://github.com/opensearch-project/sql/pull/822
- [x] https://github.com/opensearch-project/sql/pull/845
- [x] https://github.com/opensearch-project/sql/pull/1085
- [x] https://github.com/opensearch-project/sql/pull/1091
- [x] https://github.com/opensearch-project/sql/issues/968
- [x] https://github.com/opensearch-project/sql/pull/1044
- [x] https://github.com/opensearch-project/sql/pull/1068
- [x] #969
- [x] #974
- [x] https://github.com/opensearch-project/sql/pull/994
- [ ] https://github.com/opensearch-project/sql/issues/1093
- [x] https://github.com/opensearch-project/sql/pull/1094
- [ ] https://github.com/opensearch-project/sql/pull/1139
- [ ] https://github.com/opensearch-project/sql/issues/951
- [x] https://github.com/opensearch-project/sql/pull/950
- [x] https://github.com/opensearch-project/sql/pull/958
- [ ] https://github.com/opensearch-project/sql/pull/1100
- [x] https://github.com/opensearch-project/sql/issues/953
- [x] https://github.com/opensearch-project/sql/pull/959
- [ ] https://github.com/opensearch-project/sql/issues/954
- [x] https://github.com/opensearch-project/sql/pull/990
- [ ] Refactor AggregateOperator to support stream processing
- [ ] https://github.com/opensearch-project/sql/issues/955
- [ ] Add INSERT STREAM statement
- [ ] Add CREATE TABLE statement. https://github.com/penghuo/os-sql/tree/hp/test/maximus-m1
- [ ] #972
- [ ] [S3 impl](https://github.com/penghuo/os-sql/tree/hp/test/maximus-m1) is blocked by https://github.com/opensearch-project/OpenSearch/issues/5359
- [ ] #1151

## Phase 2

### Goal:

* Ready for experimental release
* Missing
  * Pipeline Execution
  * Distributed Execution

### Tasks

- [ ] Enhancement
- [ ] https://github.com/opensearch-project/sql/issues/1071
- [ ] Fault Tolerant
- [ ] https://github.com/opensearch-project/sql/issues/1007
- [ ] https://github.com/opensearch-project/sql/issues/1072
- [ ] Security
- [ ] Use cases related feature
- [ ] object/array support
- [ ] full text search capability in streaming - match
- [ ] Test
- [ ] Documentation
- [ ] User Interface

## Phase 3

### Goal:

* Ready for production deployment

### Tasks

- [ ] Pipeline Execution
- [ ] Distributed Execution
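The difference between the proposed incremental refresh and the full-refresh alternative can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the PoC's implementation: the bucket is modeled as a dict from object key to records, the derived table as a plain list, and `IncrementalRefresher` is a hypothetical name.

```python
from typing import Dict, List, Set


class IncrementalRefresher:
    """Toy incremental refresh: ingest only objects that have not been seen yet."""

    def __init__(self) -> None:
        self._seen: Set[str] = set()   # checkpoint of object keys already ingested
        self.derived: List[str] = []   # stand-in for the derived OpenSearch index

    def refresh(self, bucket: Dict[str, List[str]]) -> int:
        """Append records from new objects only; return how many records were added."""
        added = 0
        for key in sorted(bucket):
            if key in self._seen:
                continue  # skip objects ingested by an earlier refresh
            self.derived.extend(bucket[key])
            self._seen.add(key)
            added += len(bucket[key])
        return added
```

A full refresh would instead clear `derived` and re-read every object on each run, which is the overhead the issue wants to avoid for large S3 datasets. The sketch ignores failure recovery, which the issue lists as missing in Phase 1.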

github.com

opensearch-project/opensearch-build/blob/main/release-notes/opensearch-release-notes-2.5.0.md

# OpenSearch and OpenSearch Dashboards 2.5.0 Release Notes

## Release Highlights

The OpenSearch 2.5.0 release adds new tools and enhancements to help you advance your search, analytics, and observability workloads. This release includes the software’s first Debian distribution and first administrative user interface, along with support for multi-layered maps and the ability to analyze traces in the Jaeger schema. The release also includes indexing and search improvements for Lucene-based k-NN search functionality, and Security Analytics tools are now generally available. Following are some highlights for this release.

### New Features

* You can now perform common administrative operations on your OpenSearch indexes, such as CRUD (Create, Read, Update, and Delete) functions, through an admin user interface.
* OpenSearch 2.5.0 lets you analyze trace data collected by the open-source Jaeger solution. Select Data Prepper or Jaeger as your trace data source as part of the OpenSearch Dashboards Observability feature.
* With this release, Security Analytics for OpenSearch and OpenSearch Dashboards is generally available, offering a number of tools to help users protect their data and infrastructure.
* You can build multi-layer maps from multiple data sources, combining data from different indexes into a single visualization to identify correlations and gain insights into geospatial data.
* New Debian distributions let you deploy OpenSearch and OpenSearch Dashboards directly on servers running Debian-based Linux distributions.
* Administrators can now view the health of their cluster at the awareness attribute level when shard allocation awareness is configured.
* You can now search your rollup indexes using query string search queries.

### Experimental Features

OpenSearch 2.5.0 includes the following experimental features. Experimental features are disabled by default. For instructions on how to enable them, see the version history (https://opensearch.org/docs/latest/version-history/) page which includes links to the documentation.
* Request-level durability allows you to deploy remote-backed storage on a per-index basis, supporting data durability for cloud-based backup and restore operations.

