OpenSearch Lucene Study Group Meeting - Monday, March 25th, 2024

Sign up to join the meeting at Meetup:

Link to previous meeting’s post: OpenSearch Lucene Study Group Meeting - Monday, March 18th, 2024

Welcome to the OpenSearch Lucene Study Group!

Apache Lucene is the open-source search library that powers OpenSearch and many search applications large and small.

We start the meeting with a Lucene learning topic or Q&A session. In the second half of the meeting, we review recent developments in Apache Lucene and discuss their potential impact on OpenSearch, with a particular focus on new and exciting Lucene features that we can (and should) expose through OpenSearch. Since some changes require a deep dive to fully understand, we sometimes ask participants to volunteer for “homework” to dig deeper into changes and report back at the next meeting.

Standing Agenda:

  • Welcome / introduction (5 minutes)
  • Lucene learning series - someone will either present a Lucene-related talk or we will do Lucene Q&A (20 minutes, recorded)
  • Review assigned issues from last time (10 minutes)
  • Review new Lucene changes and assign homework (20 minutes)

By joining the OpenSearch Lucene Study Group Meeting, you grant OpenSearch, and our affiliates the right to record, film, photograph, and capture your voice and image during the OpenSearch Community Meeting (the “Recordings”). You grant to us an irrevocable, nonexclusive, perpetual, worldwide, royalty-free right and license to use, reproduce, modify, distribute, and translate, for any purpose, all or any part of the Recordings and Your Materials. For example, we may distribute Recordings or snippets of Recordings via our social media outlets.

This week’s Lucene changes:

  • Lucene 9.11.0 - Improvements: Support getMaxScore of DisjunctionSumScorer for non top level scoring clause
  • Lucene 9.11.0 - Improvements: MergeScheduler can now provide an executor for intra-merge parallelism. The first implementation is the ConcurrentMergeScheduler and the Lucene99HnswVectorsFormat will use it if no other executor is provided.
  • Lucene 9.11.0 - Optimizations: Make the HitQueue size more appropriate for KNN exact search
  • Lucene 9.11.0 - Bug Fixes: Fix equals/hashCode of IOContext.
  • Lucene 9.11.0 - Build: Upgrade forbiddenapis to version 3.7 and ASM for APIJAR extraction to 9.7.

We didn’t review the change entries above. We’ll need to revisit them next week.

Instead, the whole meeting turned into a nice deep-dive on the challenges of efficient query cost estimation, motivated by Improve AbstractMultiTermQueryConstantScoreWrapper#RewritingWeight ScorerSupplier cost estimation · Issue #13029 · apache/lucene · GitHub and my PR to try to improve at least some cases: Get better cost estimate on MultiTermQuery over few terms by msfroh · Pull Request #13201 · apache/lucene · GitHub.

As a summary of the issue, the reported cost of “multi-term queries” (i.e. PrefixQuery, WildcardQuery, TermInSetQuery, FuzzyQuery, …) was changed in 2023. Lucene uses query cost for a couple of things. In the original issue, a WildcardQuery was no longer being cached because the reported cost was too high. In an issue I (along with @harshavamsi) encountered last week, a TermInSetQuery was receiving a high cost estimate, which led to an adjacent PointRangeQuery executing directly, rather than getting rewritten to a doc-value lookup.

Previously, the cost estimate was calculated by expanding the matching set of terms and summing the number of docs matched by each term (I think). The key change in Lucene is that the new logic specifically wants to avoid the term expansion, since that can be expensive – suppose you have employee IDs indexed as EMP000001, EMP000002, etc. and someone searches for EMP*. Unfortunately, it does so by taking a worst-case estimate, which is often “This query matches every document with the given field”.

The PR that I opened tries to trade a little up-front work for more accurate cost estimates by doing “some” term expansion (up to 128 terms), though we throw the effort away if there are more than 128 terms.
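The idea in the PR can be sketched roughly as follows. This is a hypothetical, simplified illustration (the class, method names, and the plain `List<Integer>` of doc frequencies are mine, not Lucene's actual API): expand matching terms up to a fixed limit and sum their doc frequencies for a tighter estimate; past the limit, fall back to the worst case.

```java
import java.util.List;

// Hypothetical sketch of the PR's cost-estimation idea; not Lucene code.
public class CostEstimate {
    static final int TERM_EXPANSION_LIMIT = 128;

    // docFreqs: doc frequency of each term the multi-term query matches.
    // fieldDocCount: the worst-case cost (every doc that has the field).
    static long estimateCost(List<Integer> docFreqs, long fieldDocCount) {
        if (docFreqs.size() > TERM_EXPANSION_LIMIT) {
            // Too many terms: give up and report the worst case.
            return fieldDocCount;
        }
        long sum = 0;
        for (int df : docFreqs) {
            // Summing doc freqs may double-count docs that match several
            // terms, so this is still an upper bound, just a tighter one.
            sum += df;
        }
        return Math.min(sum, fieldDocCount);
    }

    public static void main(String[] args) {
        // Three matching terms with small doc freqs: estimate is 10,
        // far below the worst case of 1,000,000.
        System.out.println(estimateCost(List.of(3, 5, 2), 1_000_000));
    }
}
```

The 128-term cutoff stands in for the arbitrary threshold discussed below.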

There were some really good points raised in the discussion:

Kiran Reddy asked if instead of an arbitrary threshold of 128 terms, could we set a maximum time to spend on term expansion to compute a better estimate? This could be a significant change in Lucene, since this time would need to be plumbed through (as this cost logic is used for all query types and most queries have really simple cost estimates).

Tomás Fernández Löbbe pointed out that there are cases where a prefix/wildcard query matches maybe 1000 terms (like employeeId:EMP000*), each matching a single document; such a query would generally still be considered “low-cost”, but my change wouldn’t help.

@lukas-vlcek suggested that we could add some information at indexing time that would provide better statistics, based on knowledge of the type of queries that would be sent. I pointed out that this is conceptually a step toward using the index_prefixes option on a text field, for example, which stores statistics for individual prefixes, because it indexes the prefixes directly.

As an in-between option, maybe we could index statistics about the individual characters (or probably bytes) that appear at the start of each term and within each term. At a maximum cost of 512 longs per segment (4KB), we could track how many documents contain terms that start with each possible byte and how many documents contain terms that contain each possible byte. So, an upper bound on the cost of abc* would be min(startsWith('a'), contains('b'), contains('c')). If you extended that to 2-byte combinations (at a cost of 1MB per segment), you could say the cost of abc* would be min(startsWith('ab'), contains('bc')).
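The single-byte version of that idea is small enough to sketch. Everything here is hypothetical (the arrays, method, and class names are illustrative, and nothing like this exists in Lucene today): two 256-entry tables per segment, and the prefix bound is the minimum across the relevant entries.

```java
// Hypothetical sketch of per-byte statistics for bounding prefix-query cost.
public class ByteStats {
    // startsWith[b]: how many docs contain a term whose first byte is b.
    // contains[b]:   how many docs contain a term that includes byte b anywhere.
    // Two long[256] tables = 512 longs = 4KB per segment.
    static long prefixCostUpperBound(byte[] prefix, long[] startsWith, long[] contains) {
        long bound = startsWith[prefix[0] & 0xFF];
        for (int i = 1; i < prefix.length; i++) {
            // Any doc matching the prefix must also contain each later byte.
            bound = Math.min(bound, contains[prefix[i] & 0xFF]);
        }
        return bound;
    }

    public static void main(String[] args) {
        long[] startsWith = new long[256];
        long[] contains = new long[256];
        // Pretend 40 docs have a term starting with 'a', 900 contain 'b',
        // but only 7 contain 'c': abc* can match at most 7 docs.
        startsWith['a'] = 40;
        contains['b'] = 900;
        contains['c'] = 7;
        System.out.println(prefixCostUpperBound("abc".getBytes(), startsWith, contains));
    }
}
```

The 2-byte variant would be the same shape with 65,536-entry tables, trading ~1MB per segment for tighter bounds.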

Thanks all for a really great discussion! I’ll try to get the video up on YouTube soon.

(Note that the above summary was written from memory a few hours after the Meetup. If there’s anything that I failed to capture, please call it out in replies below!)

Here is the recording of the meeting: