OpenSearch Lucene Study Group Meeting - Monday, January 8th, 2024

msfroh · January 2, 2024, 6:01pm

Sign up to join the meeting at Meetup:

Link to previous meeting’s post: OpenSearch Lucene Study Group Meeting - Monday, December 18th, 2023

Welcome to the OpenSearch Lucene Study Group!

Apache Lucene is the open-source search library that powers OpenSearch and many search applications large and small.

We start the meeting with a Lucene learning topic or Q&A session. In the second half of the meeting, we review recent developments in Apache Lucene and discuss their potential impact on OpenSearch, with a particular focus on new and exciting Lucene features that we can (and should) expose through OpenSearch. Since some changes require a deep dive to fully understand, we sometimes ask participants to volunteer for “homework” to dig deeper into changes and report back at the next meeting.

Standing Agenda:

Welcome / introduction (5 minutes)
Lucene learning series - someone will either present a Lucene-related talk or we will do Lucene Q&A (20 minutes, recorded)
Review assigned issues from last time (10 minutes)
Review new Lucene changes and assign homework (20 minutes)

By joining the OpenSearch Lucene Study Group Meeting, you grant OpenSearch, and our affiliates the right to record, film, photograph, and capture your voice and image during the OpenSearch Community Meeting (the “Recordings”). You grant to us an irrevocable, nonexclusive, perpetual, worldwide, royalty-free right and license to use, reproduce, modify, distribute, and translate, for any purpose, all or any part of the Recordings and Your Materials. For example, we may distribute Recordings or snippets of Recordings via our social media outlets.

msfroh · January 8, 2024, 4:05pm

I’m not going to bother with the fancy HTML table for changes this week, because there is only one change list entry for this week:

github.com/apache/lucene

Move group-varint encoding/decoding logic to DataOutput/DataInput

apache:main ← easyice:group_vint_mmap

opened 11:49AM - 24 Nov 23 UTC

easyice

+451 -194

From issue: https://github.com/apache/lucene/issues/12826

The JMH benchmark with this PR on my Mac (Intel chip):

java 17

```
Benchmark                                   (size)   Mode  Cnt   Score   Error   Units
GroupVIntBenchmark.byteArrayReadGroupVInt       64  thrpt    5   5.519 ± 0.270  ops/us
GroupVIntBenchmark.byteArrayReadVInt            64  thrpt    5   4.075 ± 2.868  ops/us
GroupVIntBenchmark.byteBufferReadGroupVInt      64  thrpt    5   7.464 ± 1.618  ops/us
GroupVIntBenchmark.byteBufferReadVInt           64  thrpt    5   5.179 ± 0.470  ops/us
```

java 21

```
Benchmark                                   (size)   Mode  Cnt   Score   Error   Units
GroupVIntBenchmark.byteArrayReadGroupVInt       64  thrpt    5   5.768 ± 0.305  ops/us
GroupVIntBenchmark.byteArrayReadVInt            64  thrpt    5   5.255 ± 0.110  ops/us
GroupVIntBenchmark.byteBufferReadGroupVInt      64  thrpt    5  11.551 ± 0.252  ops/us
GroupVIntBenchmark.byteBufferReadVInt           64  thrpt    5   5.611 ± 0.266  ops/us
```

It’s in the “Optimizations” section for Lucene 9.10.
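
For anyone who hasn't run into group-varint before, here is a minimal sketch of the general idea in plain Java. It is not the code from the PR: the class name, the method names, and the exact byte layout (one flag byte holding four 2-bit lengths, followed by the four values in little-endian order) are assumptions made for illustration. The point is that the decoder learns all four lengths from a single byte, instead of checking a continuation bit on every byte as VInt does.

```java
import java.io.ByteArrayOutputStream;
import java.util.Arrays;

/** Illustrative group-varint codec (hypothetical; not Lucene's implementation). */
final class GroupVarIntSketch {

    /** Number of bytes (1 to 4) needed to store v as an unsigned 32-bit value. */
    static int byteLength(int v) {
        if ((v & 0xFFFFFF00) == 0) return 1;
        if ((v & 0xFFFF0000) == 0) return 2;
        if ((v & 0xFF000000) == 0) return 3;
        return 4;
    }

    /** Encodes values in groups of four; values.length must be a multiple of 4. */
    static byte[] encode(int[] values) {
        if (values.length % 4 != 0) {
            throw new IllegalArgumentException("length must be a multiple of 4");
        }
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (int i = 0; i < values.length; i += 4) {
            // Flag byte: four 2-bit fields, each holding (byte length - 1).
            int flag = 0;
            for (int j = 0; j < 4; j++) {
                flag = (flag << 2) | (byteLength(values[i + j]) - 1);
            }
            out.write(flag);
            // Data bytes: each value written little-endian with its minimal length.
            for (int j = 0; j < 4; j++) {
                int v = values[i + j];
                int n = byteLength(v);
                for (int b = 0; b < n; b++) {
                    out.write(v >>> (8 * b)); // write() keeps only the low 8 bits
                }
            }
        }
        return out.toByteArray();
    }

    /** Decodes count values (a multiple of 4) produced by encode. */
    static int[] decode(byte[] bytes, int count) {
        int[] values = new int[count];
        int pos = 0;
        for (int i = 0; i < count; i += 4) {
            int flag = bytes[pos++] & 0xFF;
            for (int j = 0; j < 4; j++) {
                int n = ((flag >>> (6 - 2 * j)) & 0x3) + 1;
                int v = 0;
                for (int b = 0; b < n; b++) {
                    v |= (bytes[pos++] & 0xFF) << (8 * b);
                }
                values[i + j] = v;
            }
        }
        return values;
    }

    public static void main(String[] args) {
        int[] values = {1, 300, 70_000, 20_000_000};
        byte[] encoded = encode(values);
        // 1 flag byte + (1 + 2 + 3 + 4) data bytes
        System.out.println(encoded.length); // prints 11
        System.out.println(Arrays.equals(values, decode(encoded, values.length))); // prints true
    }
}
```

A real implementation also has to handle a tail of fewer than four values, which this sketch skips by requiring a multiple of four.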

msfroh · January 12, 2024, 6:43pm

For the learning part of the meeting, @reta followed up on his homework from the previous meeting, which was to review https://github.com/apache/lucene/pull/12873. He opened issue #11800 in opensearch-project/OpenSearch, "[Feature Request] Explore the use of hidden classes (JDK-15 and above) for Painless script classes generation", to investigate how we can apply a similar approach to compilation of Painless scripts in OpenSearch.

I then did a quick code dive, looking at how Lucene layers abstractions to get from files on disk to structures that help process search queries. As a quick recap, those abstractions are:

Directory: Models a (platform-independent) file system directory, with simple operations like listing available files, deleting files, and opening files for read/write.
IndexInput / IndexOutput: Abstractions for reading/writing a file, returned from a Directory. IndexInput supports "slicing", where you can get an IndexInput corresponding to a byte range of another IndexInput. This is how "compound file segment" (.cfs) files work. Essentially, they're a bunch of segment files concatenated together, with a table of contents (with byte offsets) that current Lucene versions store in a companion .cfe entries file. When reading, Lucene passes a slice from the CFS index input to a reader for a particular data structure, and it's as though that byte range were its own file. Note that IndexInput and IndexOutput also provide (via inheritance from DataInput/DataOutput) methods to read/write frequently-used primitives as bytes (e.g. VInt, ZInt, string maps, arrays of numeric types).
Codecs: Codecs are the (Lucene version-specific) accumulation of readers/writers for various data structures used by the higher-level query logic. Each data structure tends to have an associated "[Version][Type]Format" class that either directly implements reading/writing or can return readers and writers.
Tying it together: Many Lucene data structures are stored off-heap, which is achieved by making the reader implement the data structure's interface, so navigating the data structure makes the (Codec-specific) reader move through the IndexInput, which is usually a memory-mapped file supplied by MMapDirectory.
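
To make the slicing and memory-mapping ideas a bit more concrete, here is a rough sketch that uses only the JDK (java.nio), not Lucene itself. The class and method names are hypothetical, and the "table of contents" is just a pair of hard-coded offsets, which is an assumption made to keep the example short. It concatenates two pretend segment files into one file, memory-maps it, and hands each reader a slice that behaves as though it were its own file starting at position 0.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

/** Illustrative slicing over a memory-mapped file (hypothetical; not Lucene code). */
final class SliceSketch {

    /** Returns a view of [offset, offset + length) whose own position starts at 0. */
    static ByteBuffer slice(ByteBuffer parent, int offset, int length) {
        ByteBuffer view = parent.duplicate(); // shares bytes, independent position/limit
        view.position(offset);
        view.limit(offset + length);
        return view.slice();
    }

    /** Reads the remaining bytes of buf as a UTF-8 string. */
    static String readAll(ByteBuffer buf) {
        byte[] bytes = new byte[buf.remaining()];
        buf.get(bytes);
        return new String(bytes, StandardCharsets.UTF_8);
    }

    /** Builds a two-part "compound" file, maps it, and reads each part via a slice. */
    static String[] demo() {
        byte[] first = "postings".getBytes(StandardCharsets.UTF_8);
        byte[] second = "doc values".getBytes(StandardCharsets.UTF_8);
        byte[] all = new byte[first.length + second.length];
        System.arraycopy(first, 0, all, 0, first.length);
        System.arraycopy(second, 0, all, first.length, second.length);
        try {
            Path file = Files.createTempFile("compound", ".bin");
            file.toFile().deleteOnExit();
            Files.write(file, all);
            try (FileChannel ch = FileChannel.open(file, StandardOpenOption.READ)) {
                // The mapped bytes live outside the Java heap.
                ByteBuffer mapped = ch.map(FileChannel.MapMode.READ_ONLY, 0, ch.size());
                // Stand-in table of contents: (offset, length) for each part.
                ByteBuffer firstSlice = slice(mapped, 0, first.length);
                ByteBuffer secondSlice = slice(mapped, first.length, second.length);
                return new String[] {readAll(firstSlice), readAll(secondSlice)};
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        String[] parts = demo();
        System.out.println(parts[0]); // prints postings
        System.out.println(parts[1]); // prints doc values
    }
}
```

Each slice has its own position, so two readers can move through different parts of the same mapped file without interfering with each other.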
