OpenSearch Lucene Study Group Meeting - Monday, November 27th

msfroh · November 21, 2023, 4:56pm

Sign up to join the meeting at Meetup:

Link to previous meeting’s post: OpenSearch Lucene Study Group Meeting - Monday, November 20th

Welcome to the OpenSearch Lucene Study Group!

Apache Lucene is the open-source search library that powers OpenSearch and many search applications large and small.

Based on last week’s meeting, we’re moving the learning series part earlier in the agenda, since most participants said they were attending in order to learn more about Lucene.

In the second half of the meeting, we will review recent developments in Apache Lucene and discuss their potential impact on OpenSearch, with a particular focus on new and exciting Lucene features that we can (and should) expose through OpenSearch. Since some changes require a deep dive to fully understand, we will ask participants to volunteer for “homework” to dig deeper into changes and report back at the next meeting.

Standing Agenda:

Welcome / introduction (5 minutes)
Lucene learning series - someone will either present a Lucene-related talk or we will do Lucene Q&A (20 minutes, recorded)
Review assigned issues from last time (10 minutes)
Review new Lucene changes and assign homework (20 minutes)

By joining the OpenSearch Lucene Study Group Meeting, you grant OpenSearch and our affiliates the right to record, film, photograph, and capture your voice and image during the OpenSearch Community Meeting (the “Recordings”). You grant to us an irrevocable, nonexclusive, perpetual, worldwide, royalty-free right and license to use, reproduce, modify, distribute, and translate, for any purpose, all or any part of the Recordings and Your Materials. For example, we may distribute Recordings or snippets of Recordings via our social media outlets.

msfroh · November 27, 2023, 2:15am

Category	Description	Link
API Changes	Add HumanReadableQuery which takes a description parameter for debugging purposes.	https://github.com/apache/lucene/issues/12816
API Changes	Consolidate FSTStore and BytesStore in FST. Created FSTReader, which contains the common methods of the two.	https://github.com/apache/lucene/issues/12709
API Changes	Remove public constructor of FSTCompiler. Please use FSTCompiler.Builder instead.	https://github.com/apache/lucene/issues/12695
API Changes	Remove FSTCompiler#getTermCount() and FSTCompiler.UnCompiledNode#inputCount	https://github.com/apache/lucene/issues/12735
API Changes	Make TaskExecutor constructor public and use TaskExecutor for concurrent HNSW graph build.	https://github.com/apache/lucene/issues/12799
Improvements	FSTCompiler can now approximately limit how much RAM it uses to share suffixes during FST construction using the suffixRAMLimitMB method. Larger values result in a more minimal FST (more common suffixes are shared). Pass Double.POSITIVE_INFINITY to use as much RAM as is needed to create a purely minimal FST. Inspired by this Rust FST implementation: https://blog.burntsushi.net/transducers	https://github.com/apache/lucene/issues/12542
Optimizations	Use group-varint encoding for the tail of postings.	https://github.com/apache/lucene/issues/12782
Build	Only enable support for tests.profile if jdk.jfr module is available in Gradle runtime.	https://github.com/apache/lucene/issues/12845
Other	Add demo for faceting with StringValueFacetCounts over KeywordField and SortedDocValuesField.	https://github.com/apache/lucene/issues/12817
Other	Refactor BKD HeapPointWriter to hide the internal data structure.	https://github.com/apache/lucene/issues/12762
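
For anyone unfamiliar with the group-varint encoding mentioned in the Optimizations row, here is a minimal Python sketch of the general technique. It is not Lucene's actual code: the byte layout (one flag byte holding four 2-bit lengths, followed by little-endian payload bytes) and the function names are assumptions made for illustration, and Lucene's on-disk format may differ in detail. The idea is that the decoder reads one flag byte per four values instead of checking a continuation bit on every byte.

```python
def encode_group_varint(values):
    """Encode unsigned 32-bit ints in groups of four (illustrative layout)."""
    if len(values) % 4 != 0:
        raise ValueError("this sketch only handles a multiple of four values")
    out = bytearray()
    for i in range(0, len(values), 4):
        flag = 0
        payload = bytearray()
        for v in values[i:i + 4]:
            if not 0 <= v <= 0xFFFFFFFF:
                raise ValueError("value out of unsigned 32-bit range")
            # Number of bytes needed for this value (at least one, at most four).
            n = max(1, (v.bit_length() + 7) // 8)
            # Pack (n - 1) into two bits; the first value ends up in the top bits.
            flag = (flag << 2) | (n - 1)
            payload += v.to_bytes(n, "little")
        out.append(flag)
        out += payload
    return bytes(out)


def decode_group_varint(data, count):
    """Decode `count` values (a multiple of four) produced by the encoder above."""
    values = []
    pos = 0
    while len(values) < count:
        flag = data[pos]
        pos += 1
        for shift in (6, 4, 2, 0):
            n = ((flag >> shift) & 0x3) + 1
            values.append(int.from_bytes(data[pos:pos + n], "little"))
            pos += n
    return values


doc_deltas = [1, 300, 70000, 2**31]
encoded = encode_group_varint(doc_deltas)
print(len(encoded), decode_group_varint(encoded, len(doc_deltas)))
```

In this sketch, a leftover run of fewer than four values would need a fallback such as plain variable-length ints; that case is left out to keep the example short.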

msfroh · November 27, 2023, 5:38pm

For apache/lucene issue #12542 (Lucene's FST Builder should have a simpler "knob" to trade off memory/CPU required against minimality), do we want to expose this as a setting (perhaps an index setting) in OpenSearch? It could be a new knob for expert users.

Can we use it during indexing circuit-breaking to get a better sense of how much RAM an indexing request will take?

msfroh · November 27, 2023, 5:43pm

TODO (msfroh): Find and post the API that tells you what a query parsed to. Maybe we can use this HumanReadableQuery to make that API nicer?

Thanks Saurabh – the API is _validate
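
For reference, a minimal Python sketch of building a request to the validate query API (`_validate/query`) with `explain=true`, which asks OpenSearch to include a description of what the query parsed to in the response. The host, index name, and helper function name are hypothetical, and the sketch only constructs the URL and body without sending anything.

```python
import json
from urllib.parse import quote, urlencode


def build_validate_request(base_url, index, query):
    """Return (url, body) for a validate-query call with explanations enabled."""
    path = f"{base_url.rstrip('/')}/{quote(index, safe='')}/_validate/query"
    params = urlencode({"explain": "true"})
    body = json.dumps({"query": query})
    return f"{path}?{params}", body


# Hypothetical local cluster and index name, for illustration only.
url, body = build_validate_request(
    "http://localhost:9200",
    "my-index",
    {"match": {"title": "lucene study group"}},
)
print(url)
print(body)
```

Sending the body to that URL with any HTTP client should return the parsed form of the query as a string, which is where a query description such as the one HumanReadableQuery carries might make the output easier to read.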

msfroh · November 27, 2023, 5:48pm

TODO: Ask Navneet about apache/lucene pull request #12799 by shubhamvishu (Make TaskExecutor cx public and use TaskExecutor for concurrent HNSW graph build).

For OpenSearch we should consider exposing a knob to let users use multiple cores to merge their HNSW graphs.
