OpenSearch Lucene Study Group Meeting - Monday, January 15th, 2024

Sign up to join the meeting at Meetup:

Link to previous meeting’s post: OpenSearch Lucene Study Group Meeting - Monday, January 8th, 2024

Welcome to the OpenSearch Lucene Study Group!

Apache Lucene is the open-source search library that powers OpenSearch and many search applications large and small.

We start the meeting with a Lucene learning topic or Q&A session. In the second half of the meeting, we review recent developments in Apache Lucene and discuss their potential impact on OpenSearch, with a particular focus on new and exciting Lucene features that we can (and should) expose through OpenSearch. Since some changes require a deep dive to fully understand, we sometimes ask participants to volunteer for “homework” to dig deeper into changes and report back at the next meeting.

Standing Agenda:

  • Welcome / introduction (5 minutes)
  • Lucene learning series - someone will either present a Lucene-related talk or we will do Lucene Q&A (20 minutes, recorded)
  • Review assigned issues from last time (10 minutes)
  • Review new Lucene changes and assign homework (20 minutes)

By joining the OpenSearch Lucene Study Group Meeting, you grant OpenSearch, and our affiliates the right to record, film, photograph, and capture your voice and image during the OpenSearch Community Meeting (the “Recordings”). You grant to us an irrevocable, nonexclusive, perpetual, worldwide, royalty-free right and license to use, reproduce, modify, distribute, and translate, for any purpose, all or any part of the Recordings and Your Materials. For example, we may distribute Recordings or snippets of Recordings via our social media outlets.

Here are the Lucene changes since last Monday:

  • Lucene 10.0.0, New Features: For indices newly created as of 10.0.0 onwards, IndexWriter preserves document blocks indexed via IndexWriter#addDocuments or IndexWriter#updateDocuments even when index sorting is configured. Document blocks are maintained alongside their parent documents during sort and merge. IndexWriterConfig now requires a parent field to be specified if index sorting is used together with document blocks.
  • Lucene 10.0.0, Changes in Backwards Compatibility Policy: IndexWriter#addDocuments and IndexWriter#updateDocuments now require a parent field name to be specified in IndexWriterConfig if document blocks are indexed and index-time sorting is configured.
  • Lucene 9.10.0, Improvements: Use Automaton for SurroundQuery prefix/pattern matching.
  • Lucene 9.10.0, Optimizations: Avoid resetting BlockDocsEnum#freqBuffer when indexHasFreq is false.
  • Lucene 9.10.0, Bug Fixes: Fixed the bug that JapaneseReadingFormFilter cannot convert some hiragana to romaji.

While we didn’t have a formal “learning” topic this week, we ended up having a great impromptu chat and code dive, trying to figure out exactly how phrase queries do their position-matching.

Essentially, a phrase query like “quick brown fox” starts like a BooleanQuery for “quick AND brown AND fox”, skipping through doc IDs for each term until it finds a document with all three terms. Then it tries to find the terms in consecutive positions. I had previously guessed that positions were stored as a skip-list, like the doc IDs, but it looks like positions don’t support skipping, just one-by-one iteration. @radu.gheorghe cleared up the confusion by pointing us to the implementation in ExactPhraseMatcher::nextMatch. It does use “skipping” logic by calling the advancePosition method, but that method is implemented as a while loop that steps through positions one at a time until it reaches the target.
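To make the position-matching step concrete, here is a rough, self-contained sketch of the idea for a single document. It is not Lucene's actual code: it models each term's positions as a plain sorted int array, and the names PhraseMatchSketch, PositionCursor, and findPhraseStarts are made up for illustration. The advancePosition helper borrows its name from the method mentioned above, but is only a simplified stand-in for it.

```java
import java.util.ArrayList;
import java.util.List;

/** Simplified in-memory model of exact phrase matching inside one document (not Lucene code). */
final class PhraseMatchSketch {

    /** Hypothetical cursor over one term's ascending positions in the current document. */
    static final class PositionCursor {
        final int[] positions; // ascending, non-negative term positions
        final int offset;      // index of this term within the phrase
        int upTo = 0;          // number of positions consumed so far
        int pos = -1;          // current position; -1 before the first read

        PositionCursor(int[] positions, int offset) {
            this.positions = positions;
            this.offset = offset;
        }
    }

    /** "Skips" to the first position >= target by stepping one position at a time. */
    static boolean advancePosition(PositionCursor c, int target) {
        while (c.pos < target) {
            if (c.upTo == c.positions.length) {
                return false; // this term has no more positions in the document
            }
            c.pos = c.positions[c.upTo++];
        }
        return true;
    }

    /** Returns the start position of every exact occurrence of the phrase. */
    static List<Integer> findPhraseStarts(int[][] positionsPerTerm) {
        List<Integer> starts = new ArrayList<>();
        PositionCursor[] cursors = new PositionCursor[positionsPerTerm.length];
        for (int i = 0; i < cursors.length; i++) {
            cursors[i] = new PositionCursor(positionsPerTerm[i], i);
        }
        PositionCursor lead = cursors[0];
        int target = 0;
        outer:
        while (advancePosition(lead, target)) {
            int phraseStart = lead.pos; // the lead term has offset 0
            for (int j = 1; j < cursors.length; j++) {
                PositionCursor c = cursors[j];
                int expected = phraseStart + c.offset;
                if (!advancePosition(c, expected)) {
                    break outer; // a term ran out of positions: no further matches
                }
                if (c.pos != expected) {
                    // Overshot: move the lead forward so it lines up with this term.
                    target = c.pos - c.offset;
                    continue outer;
                }
            }
            starts.add(phraseStart);
            target = phraseStart + 1;
        }
        return starts;
    }
}
```

For the text “the quick brown fox and the quick fox”, the terms “quick”, “brown”, and “fox” occur at positions {1, 6}, {2}, and {3, 7}, so findPhraseStarts(new int[][] {{1, 6}, {2}, {3, 7}}) returns [1]. Every cursor only ever moves forward, so each position is read at most once per document, even though the “skip” is just a loop.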

It was a fun investigation and I think we all learned a bit about how phrase queries work.
