OpenSearch Lucene Study Group Meeting - Monday, June 17th, 2024

msfroh · June 14, 2024, 5:38pm

Sign up to join the meeting at Meetup:

Link to previous meeting’s post: OpenSearch Lucene Study Group Meeting - Friday, May 31st, 2024

I failed to set up a meeting on Friday, June 7th, as I came down with a cold that week and wanted to do my best to recover before traveling to the Berlin Buzzwords conference. I was mostly successful, so I think it was worth it.

Since it’s been two weeks since our last meetup, let’s move back to our regular Monday time slot.

Welcome to the OpenSearch Lucene Study Group!

Apache Lucene is the open-source search library that powers OpenSearch and many search applications large and small.

We start the meeting with a Lucene learning topic or Q&A session. In the second half of the meeting, we review recent developments in Apache Lucene and discuss their potential impact on OpenSearch, with a particular focus on new and exciting Lucene features that we can (and should) expose through OpenSearch. Since some changes require a deep dive to fully understand, we sometimes ask participants to volunteer for “homework” to dig deeper into changes and report back at the next meeting.

Standing Agenda:

Welcome / introduction (5 minutes)
Lucene learning series - someone will either present a Lucene-related talk or we will do Lucene Q&A (20 minutes, recorded)
Review assigned issues from last time (10 minutes)
Review new Lucene changes and assign homework (20 minutes)

By joining the OpenSearch Lucene Study Group Meeting, you grant OpenSearch, and our affiliates the right to record, film, photograph, and capture your voice and image during the OpenSearch Community Meeting (the “Recordings”). You grant to us an irrevocable, nonexclusive, perpetual, worldwide, royalty-free right and license to use, reproduce, modify, distribute, and translate, for any purpose, all or any part of the Recordings and Your Materials. For example, we may distribute Recordings or snippets of Recordings via our social media outlets.

msfroh · June 17, 2024, 3:30pm

Here are this week’s Lucene changes for review:

Version	Category	Description	Link
Lucene 10.0.0	API Changes	Removed Scorer#getWeight	https://github.com/apache/lucene/issues/13410
Lucene 10.0.0	New Features	Sparse index: optional skip list on top of doc values which is exposed via the DocValuesSkipper abstraction. A new flag is added to FieldType.java that configures whether to create a "skip index" for doc values.	https://github.com/apache/lucene/issues/13449
Lucene 10.0.0	Other	Merges all immutable attributes in FieldInfos.FieldNumbers into one Hashmap saving memory when writing big indices. Fixes an exotic bug when calling clear where not all attributes were cleared.	https://github.com/apache/lucene/issues/13459
Lucene 9.12.0	API Changes	Mark COSINE VectorSimilarityFunction as deprecated.	https://github.com/apache/lucene/issues/13281
Lucene 9.12.0	Optimizations	Avoid unnecessary memory allocation in PackedLongValues#Iterator.	https://github.com/apache/lucene/issues/13439
Lucene 9.12.0	Optimizations	Rewrite SortedNumericDocValuesRangeQuery to MatchNoDocsQuery when the upper bound is smaller than the lower bound.	https://github.com/apache/lucene/issues/13425
Lucene 9.12.0	Optimizations	Implement Weight#count for vector values in the FieldExistsQuery.	https://github.com/apache/lucene/issues/13322
Lucene 9.12.0	Optimizations	MultiTermQuery returns null ScoreSupplier in cases where no query terms are present in the index segment	https://github.com/apache/lucene/issues/13454
Lucene 9.12.0	Optimizations	Replace TreeMap and use compiled Patterns in Japanese UserDictionary.	https://github.com/apache/lucene/issues/13431
Lucene 9.12.0	Optimizations	Don't preserve auxiliary buffer contents in LSBRadixSorter if it grows.	https://github.com/apache/lucene/issues/12941
Lucene 9.11.0	New Features	Add new option when calculating scalar quantiles. The new option of setting `confidenceInterval` to `0` will now dynamically determine the quantiles through a grid search over multiple quantiles calculated by multiple intervals.	https://github.com/apache/lucene/issues/13445
Lucene 9.11.0	Optimizations	Replace Map<Character> by CharObjectHashMap and Set<Character> by CharHashSet.	https://github.com/apache/lucene/issues/13420

harshavamsi · June 17, 2024, 4:37pm

Rewrite newSlowRangeQuery to MatchNoDocsQuery when upper > lower by ioanatia · Pull Request #13425 · apache/lucene · GitHub – the link posted for this change has an issue. This should be the link.

msfroh · June 17, 2024, 5:21pm

Here is what we discussed this week:

MultiTermQuery return null for ScoreSupplier by mayya-sharipova · Pull Request #13454 · apache/lucene · GitHub – I think this one has some implications for the fix that I attempted in Get better cost estimate on MultiTermQuery over few terms by msfroh · Pull Request #13201 · apache/lucene · GitHub. Specifically, I think I can/should modify my PR to take advantage of the fact that Mayya’s PR expands out the first few terms of the MultiTermQuery before producing the ScorerSupplier. If the expansion is exhaustive, I think we can use it to produce a better cost estimate.
Should FieldInfo#FieldNumbers hold one map with index properties instead of a map for each property? · Issue #13459 · apache/lucene · GitHub – I was curious about this one when I saw the reference to fixing a leak in FieldNumbers#clear, since I vaguely remembered fixing a different leak there years ago (LUCENE-9617: Reset lowestUnassignedFieldNumber in FieldNumbers.clear(… · apache/lucene@8e162e2 · GitHub). In this case, it was a different leak (related to vector infos), but the overall code cleanup is really nice.
Rewrite newSlowRangeQuery to MatchNoDocsQuery when upper > lower by ioanatia · Pull Request #13425 · apache/lucene · GitHub – @harshavamsi noticed this one. We figured out that it’s probably only relevant to OpenSearch users when a numeric field is not indexed, but does have doc values, and a range query has lower > upper (there’s a typo in the PR title). We get the optimization “for free” on all numeric field types except unsigned_long. @harshavamsi will open an issue on OpenSearch to address this.
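
To make the range-query item above concrete, here is a minimal sketch of the idea behind the rewrite: a range whose lower bound is greater than its upper bound can never match, so it can be swapped for a query that matches nothing. The `Query`, `RangeQuery`, and `MatchNoDocs` types below are hypothetical stand-ins written for this illustration. They are not Lucene's classes, and this is not the code from the pull request.

```java
public class RangeRewriteSketch {
    // Hypothetical stand-ins for illustration only; these are NOT Lucene classes.
    interface Query {
        boolean matches(long value);
    }

    // Inclusive numeric range over a single value per document.
    record RangeQuery(long lower, long upper) implements Query {
        public boolean matches(long value) {
            return value >= lower && value <= upper;
        }
    }

    // A query that never matches, so it needs no per-document work.
    record MatchNoDocs() implements Query {
        public boolean matches(long value) {
            return false;
        }
    }

    // An inverted range (lower > upper) is empty, so replace it up front.
    static Query rewrite(Query query) {
        if (query instanceof RangeQuery range && range.lower() > range.upper()) {
            return new MatchNoDocs();
        }
        return query;
    }

    public static void main(String[] args) {
        System.out.println(rewrite(new RangeQuery(10, 5))); // inverted range
        System.out.println(rewrite(new RangeQuery(5, 10))); // valid range, unchanged
    }
}
```

The benefit in this toy model is the same in spirit as the one discussed: the empty case is detected once at rewrite time instead of being rediscovered for every document.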

We also talked a little bit about how Lucene 10 changes are increasingly making use of newer Java features, like switch-expressions, records, type inference of local variables, etc. The change seems to have happened once discussion about releasing Lucene 10 started, so people no longer see the main branch as a waypoint to 9.x. On OpenSearch, our main still primarily exists as a stop on the way to 2.x, so we’re not ready to embrace new Java language features yet.
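
For anyone who has not used the language features mentioned above, here is a small self-contained sketch showing a record, a switch expression, and `var` together. The `Change` record and `priority` method are made up for this example and do not come from Lucene or OpenSearch.

```java
public class ModernJavaSketch {
    // Record: a compact immutable data carrier (Java 16+).
    record Change(String version, String category) {}

    // Switch expression: yields a value directly, with no fall-through (Java 14+).
    static int priority(Change change) {
        return switch (change.category()) {
            case "API Changes" -> 1;
            case "New Features" -> 2;
            default -> 3;
        };
    }

    public static void main(String[] args) {
        // var: local variable type inference (Java 10+).
        var change = new Change("10.0.0", "API Changes");
        System.out.println(change.version() + " has priority " + priority(change));
    }
}
```

Code written this way will not compile on older Java language levels, which is one reason a branch that still gets backported tends to avoid these features.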
