Welcome to the first public meeting of the OpenSearch Lucene Study Group!
Apache Lucene is the open-source search library that powers OpenSearch and many other search applications, large and small.
In this meeting, we will review recent developments in Apache Lucene and discuss their potential impact on OpenSearch, with a particular focus on new and exciting Lucene features that we can (and should) expose through OpenSearch. Since some changes require a deep dive to fully understand, we will ask participants to volunteer for “homework” to dig deeper into changes and report back at the next meeting.
Standing Agenda:
Welcome / introduction (5 minutes)
Review assigned issues from last time (10 minutes)
Review new Lucene changes and assign homework (20 minutes)
Lucene learning series - someone will either present a Lucene-related talk or we will do Lucene Q&A (20 minutes, recorded)
By joining the OpenSearch Lucene Study Group Meeting, you grant OpenSearch and our affiliates the right to record, film, photograph, and capture your voice and image during the OpenSearch Community Meeting (the “Recordings”). You grant to us an irrevocable, nonexclusive, perpetual, worldwide, royalty-free right and license to use, reproduce, modify, distribute, and translate, for any purpose, all or any part of the Recordings and Your Materials. For example, we may distribute Recordings or snippets of Recordings via our social media outlets.
Lucene now records if documents have been indexed as blocks in SegmentInfo. This is recorded on a per segment basis and maintained across merges. The property is exposed via LeafReaderMetadata.
Add new Lucene99FlatVectorsFormat for writing vectors in a flat format and refactor Lucene99HnswVectorsFormat to reuse the flat format. Add new Lucene99HnswQuantizedVectorsFormat that uses quantized vectors for its flat storage.
This was previously an internal Amazon meeting, but we’ve taken it public. This is the “homework” assigned during the last internal meeting that we’ll review in this week’s meeting.
Category
Description
Link
Owner
API Changes
Automata#makeStringUnion and #makeBinaryStringUnion now accept Iterable<BytesRef> instead of Collection<BytesRef>. They also now explicitly throw IllegalArgumentException if the input data is not properly sorted, instead of relying on an assert.
Add int8 scalar quantization to the HNSW vector format. This optionally allows for more compact lossy storage for the vectors, requiring about 75% less memory for fast HNSW search.
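For anyone picking up the scalar quantization item as homework, the memory claim can be illustrated with a small standalone sketch. It is not Lucene's implementation and uses no Lucene classes: the class name, method names, and the fixed [min, max] bounds are all assumptions made for illustration. It maps each 4-byte float linearly onto one of 128 levels stored in a single byte, which is where the roughly 75% saving comes from, and then maps the bytes back to show the rounding error the lossy storage introduces.

```java
public class ScalarQuantizationSketch {

    // Hypothetical helper: map each float in [min, max] linearly onto a level in [0, 127].
    // Values outside the bounds are clamped. One byte is stored per dimension.
    public static byte[] quantize(float[] vector, float min, float max) {
        float scale = 127f / (max - min);
        byte[] out = new byte[vector.length];
        for (int i = 0; i < vector.length; i++) {
            float clamped = Math.max(min, Math.min(max, vector[i]));
            out[i] = (byte) Math.round((clamped - min) * scale);
        }
        return out;
    }

    // Hypothetical helper: recover an approximation of the original floats.
    public static float[] dequantize(byte[] quantized, float min, float max) {
        float step = (max - min) / 127f;
        float[] out = new float[quantized.length];
        for (int i = 0; i < quantized.length; i++) {
            out[i] = min + quantized[i] * step;
        }
        return out;
    }

    public static void main(String[] args) {
        float[] vector = {-1.0f, -0.25f, 0.0f, 0.5f, 1.0f};
        byte[] quantized = quantize(vector, -1f, 1f);
        float[] approx = dequantize(quantized, -1f, 1f);

        int floatBytes = vector.length * Float.BYTES; // 4 bytes per dimension
        int int8Bytes = quantized.length;             // 1 byte per dimension
        System.out.println("float32 bytes: " + floatBytes + ", int8 bytes: " + int8Bytes);

        for (int i = 0; i < vector.length; i++) {
            System.out.println(vector[i] + " -> " + quantized[i] + " -> " + approx[i]);
        }
    }
}
```

With fixed bounds, the per-dimension error is at most half a quantization step, which is the "lossy" part of the trade-off. How the bounds should be chosen for real data is a separate question and a reasonable topic for the homework report.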
I recently opened a ticket in Lucene and would like to get some feedback on it. It could be a good starting point for explaining some basic concepts and the testing framework: