OpenSearch Lucene Study Group Meeting - Monday, November 20th

msfroh · November 16, 2023, 10:23pm

Welcome to the first public meeting of the OpenSearch Lucene Study Group!

Apache Lucene is the open-source search library that powers OpenSearch and many other search applications, large and small.

In this meeting, we will review recent developments in Apache Lucene and discuss their potential impact on OpenSearch, with a particular focus on new and exciting Lucene features that we can (and should) expose through OpenSearch. Since some changes require a deep dive to fully understand, we will ask participants to volunteer for “homework”: digging deeper into a change and reporting back at the next meeting.

Standing Agenda:

Welcome / introduction (5 minutes)
Review assigned issues from last time (10 minutes)
Review new Lucene changes and assign homework (20 minutes)
Lucene learning series - someone will either present a Lucene-related talk or we will do Lucene Q&A (20 minutes, recorded)

By joining the OpenSearch Lucene Study Group Meeting, you grant OpenSearch, and our affiliates the right to record, film, photograph, and capture your voice and image during the OpenSearch Community Meeting (the “Recordings”). You grant to us an irrevocable, nonexclusive, perpetual, worldwide, royalty-free right and license to use, reproduce, modify, distribute, and translate, for any purpose, all or any part of the Recordings and Your Materials. For example, we may distribute Recordings or snippets of Recordings via our social media outlets.

msfroh · November 20, 2023, 4:39am

New issues to review this week:

Category	Description	Link
API Changes	Add TaxonomyReader#getBulkOrdinals method to more efficiently retrieve facet ordinals for multiple FacetLabel at once.	https://github.com/apache/lucene/issues/12180
New Features	Added similarityToQueryVector API to compute vector similarity scores with DoubleValuesSource.	https://github.com/apache/lucene/issues/12548
New Features	Lucene now records if documents have been indexed as blocks in SegmentInfo. This is recorded on a per segment basis and maintained across merges. The property is exposed via LeafReaderMetadata.	https://github.com/apache/lucene/issues/12685
New Features	Add new Lucene99FlatVectorsFormat for writing vectors in a flat format and refactor Lucene99HnswVectorsFormat to reuse the flat format. Added new Lucene99HnswQuantizedVectorsFormat that uses quantized vectors for its flat storage.	https://github.com/apache/lucene/issues/12729
Improvements	Remove possible contention on a ReentrantReadWriteLock in Monitor which could result in searches waiting for commits.	https://github.com/apache/lucene/issues/12801
Improvements	LUCENE-10241: Upgrade OpenNLP to 1.9.4.	https://github.com/apache/lucene/issues/11277
Optimizations	Skip docs with DocValues in NumericLeafComparator.	https://github.com/apache/lucene/issues/12381
Optimizations	Specialize arc store for continuous label in FST.	https://github.com/apache/lucene/issues/12748
Optimizations	Cache buckets to speed up BytesRefHash#sort.	https://github.com/apache/lucene/issues/12784
Optimizations	Utilize exact kNN search when gathering k >= numVectors in a segment	https://github.com/apache/lucene/issues/12806
Bug Fixes	TestIndexWriterOnVMError.testUnknownError times out (fixes potential IndexWriter deadlock with tragic exceptions).	https://github.com/apache/lucene/issues/12654
Bug Fixes	Stop exploring HNSW graph if scores are not getting better.	https://github.com/apache/lucene/issues/12770
Bug Fixes	Ensure #finish is called on all drill-sideways collectors even if one throws a CollectionTerminatedException	https://github.com/apache/lucene/issues/12640
Bug Fixes	Fix segmentInfos replace to set userData	https://github.com/apache/lucene/issues/12626
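
To make the "exact kNN search when gathering k >= numVectors" item above more concrete, here is a minimal sketch of the idea in plain Python. It does not call Lucene; `exact_knn`, `search_segment`, and the `approximate_search` callback are hypothetical names used only for illustration, and dot product is an assumed similarity. The point is that when `k` covers every vector in a segment, a graph traversal cannot skip any work, so scoring everything directly is at least as good.

```python
import heapq


def dot(a, b):
    """Dot-product similarity (an assumption; any similarity would do)."""
    return sum(x * y for x, y in zip(a, b))


def exact_knn(vectors, query, k):
    """Score every vector against the query and return the top-k (doc_id, score) pairs."""
    scored = ((doc_id, dot(vec, query)) for doc_id, vec in enumerate(vectors))
    return heapq.nlargest(k, scored, key=lambda pair: pair[1])


def search_segment(vectors, query, k, approximate_search):
    """Pick exact or approximate search for one segment (hypothetical helper)."""
    # If k is at least the number of vectors, every vector will be returned
    # anyway, so brute-force scoring avoids the graph traversal overhead.
    if k >= len(vectors):
        return exact_knn(vectors, query, k)
    return approximate_search(vectors, query, k)
```

For example, `search_segment([[1, 0], [0, 1], [2, 2]], [1, 0], 5, some_hnsw_search)` never invokes `some_hnsw_search`, because the segment holds only three vectors.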

msfroh · November 20, 2023, 4:46am

This was previously an internal Amazon meeting, but we’ve taken it public. This is the “homework” assigned during the last internal meeting that we’ll review in this week’s meeting.

Category	Description	Link	Owner
API Changes	Automata#makeStringUnion #makeBinaryStringUnion now accept Iterable instead of Collection. They also now explicitly throw IllegalArgumentException if input data is not properly sorted instead of relying on assert.	https://github.com/apache/lucene/issues/12427	Rishabh Kumar Maurya
New Features	Add int8 scalar quantization to the HNSW vector format. This optionally allows for more compact lossy storage for the vectors, requiring about 75% less memory for fast HNSW search.	https://github.com/apache/lucene/issues/12582	Navneet Verma
New Features	HNSW graphs can now be merged with multiple threads. Configurable in Lucene99HnswVectorsFormat.	https://github.com/apache/lucene/issues/12660	Navneet Verma
Optimizations	Disjunctions now sometimes run as conjunctions when the minimum competitive score requires multiple clauses to match.	https://github.com/apache/lucene/issues/12589	Rishabh Kumar Maurya
Optimizations	Top-level conjunctions that are not sorted by score now have a specialized bulk scorer.	https://github.com/apache/lucene/issues/12719	Saurabh Singh
Optimizations	Faster merging of terms enums.	https://github.com/apache/lucene/issues/1052	Michael Froh
Optimizations	Faster sort on high-cardinality string fields.	https://github.com/apache/lucene/issues/11903	Harsha Vamsi Kalluri
Bug Fixes	Ensure negative scores are not returned by vector similarity functions	https://github.com/apache/lucene/issues/12727	Jack Mazanec
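
One way to read the 75% figure in the int8 scalar quantization item above: a float32 component takes 4 bytes and a quantized component takes 1 byte, so the quantized copy is a quarter of the size. The sketch below illustrates that arithmetic with a simple linear min/max quantizer in plain Python. It is not Lucene's implementation; the function names, the `[lo, hi]` clipping range, and the 256-level mapping are assumptions made for illustration.

```python
def quantize(vector, lo, hi, levels=255):
    """Map each float in [lo, hi] to one byte (0..levels); values outside are clipped."""
    scale = levels / (hi - lo)
    return bytes(round((min(max(x, lo), hi) - lo) * scale) for x in vector)


def dequantize(codes, lo, hi, levels=255):
    """Recover approximate floats from the byte codes (lossy)."""
    step = (hi - lo) / levels
    return [lo + c * step for c in codes]


def bytes_saved_fraction(float_bytes=4, quantized_bytes=1):
    """Fraction of per-component storage saved by quantizing."""
    return 1 - quantized_bytes / float_bytes
```

With these definitions, `bytes_saved_fraction()` evaluates to 0.75, and round-tripping a value through `quantize` and `dequantize` changes it by at most half a quantization step, which is the "lossy" part of the trade-off.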

lukas-vlcek · November 20, 2023, 5:15pm

I recently opened one ticket in Lucene and I would like to get some feedback on it. It can be a good start to explain some basic concepts and testing framework:

github.com/apache/lucene

BaseTokenStreamTestCase.assertAnalyzesTo fails when Analyzer contains PathHierarchy tokenizer

apache:main ← lukas-vlcek:PathHierarchyAnalyzerTest

opened 05:05PM - 02 Nov 23 UTC by lukas-vlcek (+14 -0)

Description: This PR is expected to fail. It demonstrates an issue with the `BaseTokenStreamTestCase.assertAnalyzesTo()` method when it is used with `PathHierarchyTokenizer`. Is there any reason why `PathHierarchyTokenizer` should not be used in a test like this? There are definitely other tokenizers that are tested this way, i.e. they are wrapped in an Analyzer and then `assertAnalyzesTo()` is called to check the tokens. What is special about the PathHierarchy tokenizer that it does not work? I think the problem might not be in the tokenizer but in the test method itself or in the way I call it (maybe I need to pass in more parameters/flags to get rid of the issue?). The testing method is complex, especially when it gets to the `checkAnalysisConsistency()` part. I am looking for any useful tips. Thank you!
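
For readers new to this tokenizer, the following sketch shows the kind of token list such a test would assert on. With its default settings, `PathHierarchyTokenizer` turns `/a/b/c` into `/a`, `/a/b`, and `/a/b/c`. The plain-Python function below only mimics that default output; it is a hypothetical helper, not Lucene code, and it says nothing about the offsets or position increments that the consistency checks also verify.

```python
def path_hierarchy_tokens(path, delimiter="/"):
    """Return every ancestor prefix of `path`, shortest first, ending with the full path."""
    tokens = []
    # Start searching at index 1 so a leading delimiter stays attached to the first token.
    end = path.find(delimiter, 1)
    while end != -1:
        tokens.append(path[:end])
        end = path.find(delimiter, end + 1)
    if path:
        tokens.append(path)
    return tokens
```

An `assertAnalyzesTo`-style check then amounts to comparing this list with the expected tokens, for example `["/a", "/a/b", "/a/b/c"]` for the input `/a/b/c`.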

msfroh · November 20, 2023, 5:23pm

On Issues · apache/lucene · GitHub

Increase number of hits fetched.
Check impact on search_after after we upgrade to Lucene 9.9

msfroh · November 20, 2023, 6:04pm

Suggestion from Ankit Jain – let’s collect Lucene videos. Add links to this post.

sohami · November 20, 2023, 6:21pm

PR link	Assignees	Notes
[12180](https://github.com/apache/lucene/issues/12180)		Discuss issue in OS if we need taxonomy index support
[12548](https://github.com/apache/lucene/pull/12548)	samuel-oci	KNN - Similarity to query vector
[12685](https://github.com/apache/lucene/pull/12685)	reta	For nested docs, segment info will have a flag. See if we can skip segments using this for nested queries
[12729](https://github.com/apache/lucene/pull/12729)	amistrn	FlatVectorFormat: New internal Lucene storage format for HNSW vectors
[12405](https://github.com/apache/lucene/pull/12405)	jainankitk	See if it can help with aggregation related optimizations
[12784](https://github.com/apache/lucene/pull/12784)	sohami	Cache bucket optimization for BytesRefHash#sort. Understand this optimization

msfroh · November 20, 2023, 10:59pm

I’ll create a GH issue in OpenSearch to discuss this.

hasnain2808 · November 22, 2023, 4:42am

Hi @sohami

I see we missed adding an entry that assigns “Does OpenSearch need a constant_keyword field type?” to me (GitHub → hasnain2808).

msfroh · November 27, 2023, 11:50pm

Issue created: Should we consider adding support for taxonomy indices? (Issue #11355, opensearch-project/OpenSearch)
