OpenSearch Lucene Study Group Meeting - Monday, March 18th, 2024

Sign up to join the meeting at Meetup:

Link to previous meeting’s post: OpenSearch Lucene Study Group Meeting - Monday, March 11th, 2024

Welcome to the OpenSearch Lucene Study Group!

Apache Lucene is the open-sourced search library that powers OpenSearch and many search applications large and small.

We start the meeting with a Lucene learning topic or Q&A session. In the second half of the meeting, we review recent developments in Apache Lucene and discuss their potential impact to OpenSearch, with a particular focus on new and exciting Lucene features that we can (and should) expose through OpenSearch. Since some changes require a deep dive to fully understand, we sometimes ask participants to volunteer for “homework” to dig deeper into changes and report back for the next meeting.

Standing Agenda:

  • Welcome / introduction (5 minutes)
  • Lucene learning series - someone will either present a Lucene-related talk or we will do Lucene Q&A (20 minutes, recorded)
  • Review assigned issues from last time (10 minutes)
  • Review new Lucene changes and assign homework (20 minutes)

By joining the OpenSearch Lucene Study Group Meeting, you grant OpenSearch, and our affiliates the right to record, film, photograph, and capture your voice and image during the OpenSearch Community Meeting (the “Recordings”). You grant to us an irrevocable, nonexclusive, perpetual, worldwide, royalty-free right and license to use, reproduce, modify, distribute, and translate, for any purpose, all or any part of the Recordings and Your Materials. For example, we may distribute Recordings or snippets of Recordings via our social media outlets.

This week’s Lucene change log:

Lucene 9.11.0New FeaturesRecursive graph bisection is now supported on indexes that have blocks, as long as they configure a parent field via `IndexWriterConfig#setParentField`.
Lucene 9.11.0New FeaturesAdd new token filters for Japanese sutegana (捨て仮名). This introduces JapaneseHiraganaUppercaseFilter and JapaneseKatakanaUppercaseFilter.
Lucene 9.11.0Bug FixesFix potential race condition in DocumentsWriter & DocumentsWriterDeleteQueue

See also Rishabh’s changes regarding BP reordering here: Doc reordering avoid iterations: cooling using simulated annealing by rishabhmaurya · Pull Request #13186 · apache/lucene · GitHub

Other topics:

TwoPhaseIterator and DocIdSetIterator

Kiran Reddy asked about Lucene’s TwoPhaseIterator interface and how it differs from the regular DocIdSetIterator. The short answer is that TwoPhaseIterator supports an approximation (the first phase), which may point to some false positive documents (but never has false negatives). Matches should be confirmed by calling the matches method (the second phase). By comparison, regular DocIdSetIterators only return “true” matches. The TwoPhaseIterator is useful in situations where a cheap(er) approximation exists. For example, a phrase query like "hamburger cat" can be approximated as hamburger AND cat (since every document containing the phrase also contains the individual words), then the matches method can check the positions of the terms a document to see if they appear in order.

You can adapt a TwoPhaseIterator into a DocIdSetIterator by checking matches before returning a doc (skipping to the next approximate match until you find a “true” match). See TwoPhaseIteratorAsDocIdSetIterator for details. Based on this bug and this bug, we’re speculating that there may be a bug in ReqOptSumScorer’s logic around TwoPhaseIterators (where it may be exposing an approximation in place of a “real” DocIdSetIterator). We need to dig into that more.

Tiered indexing

I brought up an idea based in part on a discussion I had with @penghuo last week and partly based on rereading Designing Data-Intensive Applications. Essentially, we always make tradeoffs between indexing effort, storage size, and efficiency of retrieval. While OpenSearch uses Lucene indices as a “one size fits all” solution, I was wondering if we could (or should) use something else at either end of the “realtime versus archive” spectrum.

On the realtime end, I was thinking about our default 1 second refresh interval that seems to exist to create the vague illusion of read-after-write consistency. This results in a lot of small Lucene segments getting flushed, which then need to get merged. We can save some effort by flushing less frequently, but then we lose realtime search. What if we implement realtime search with an inefficiently-searchable translog (akin to the naive append-only database at the start of Chapter 3 of Designing Data-Intensive Applications)? Even if we brute-force search through each element of the translog, it would only hold enough elements to last until the next refresh completes. (Similarly, if we separate writers and searchers with segment replication, this naive realtime search could bridge the gap until writes have replicated.)

At the other extreme, we could have a batch consumer that processes updates specifically for object storage – maybe writing doc values using Parquet, for example.

I gave the example of a service that I worked on a few years ago that had three different data stores consuming from the same update stream: an in-memory data store that only held the last couple of hours of data (and had many replicas to serve traffic), a mid-range store that held the last couple of weeks’ worth of data, and a long-term blob store that held objects constructed via a batch job (written as Parquet or something Parquet-like) essentially forever.

This is really cool idea, we should explore this.