Apache Lucene is the open-source search library that powers OpenSearch and many other search applications, large and small.
We start the meeting with a Lucene learning topic or Q&A session. In the second half of the meeting, we review recent developments in Apache Lucene and discuss their potential impact on OpenSearch, with a particular focus on new and exciting Lucene features that we can (and should) expose through OpenSearch. Since some changes require a deep dive to fully understand, we sometimes ask participants to volunteer for “homework” to dig deeper into changes and report back at the next meeting.
Welcome / introduction (5 minutes)
Lucene learning series - someone will either present a Lucene-related talk or we will do Lucene Q&A (20 minutes, recorded)
Review assigned issues from last time (10 minutes)
Review new Lucene changes and assign homework (20 minutes)
By joining the OpenSearch Lucene Study Group Meeting, you grant OpenSearch and our affiliates the right to record, film, photograph, and capture your voice and image during the OpenSearch Community Meeting (the “Recordings”). You grant to us an irrevocable, nonexclusive, perpetual, worldwide, royalty-free right and license to use, reproduce, modify, distribute, and translate, for any purpose, all or any part of the Recordings and Your Materials. For example, we may distribute Recordings or snippets of Recordings via our social media outlets.
I then did a quick code dive, looking at how Lucene layers abstractions to get from files on disk to structures that help process search queries. As a quick recap, those abstractions are:
Directory: Models a (platform-independent) file system directory, with simple operations like listing available files, deleting files, and opening files for read/write.
IndexInput / IndexOutput: Abstractions for reading and writing a file, returned from a Directory. IndexInput supports "slicing", where you can get an IndexInput corresponding to a byte range of another IndexInput. This is how "compound file segment" (.cfs) files work. Essentially, they're a bunch of segment files concatenated together, plus a table of contents (file names with byte offsets and lengths), which recent Lucene codecs store in a companion .cfe file. When reading, Lucene passes a slice of the CFS IndexInput to the reader for a particular data structure, and it's as though that byte range were its own file. Note that IndexInput and IndexOutput also provide (via inheritance from DataInput/DataOutput) methods to read/write frequently used primitives as bytes (e.g. VInt, ZInt, string maps, arrays of numeric types).
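To make the slicing and variable-length encoding ideas above concrete, here is a minimal sketch in plain Java. It is a toy model, not Lucene's API or on-disk format: the class CompoundSliceSketch, the file names a.bin and b.bin, and the in-memory table of contents are all made up for illustration. The VInt and zig-zag helpers follow the general scheme (7 data bits per byte, with the high bit as a continuation flag) rather than any particular codec's byte layout.

```java
import java.io.ByteArrayOutputStream;
import java.nio.ByteBuffer;
import java.util.LinkedHashMap;
import java.util.Map;

/** Toy model of a compound file. NOT Lucene's actual format or API. */
final class CompoundSliceSketch {
    /** Variable-length int: 7 data bits per byte, high bit set means "more bytes follow". */
    static void writeVInt(ByteArrayOutputStream out, int value) {
        while ((value & ~0x7F) != 0) {
            out.write((value & 0x7F) | 0x80);
            value >>>= 7;
        }
        out.write(value);
    }

    static int readVInt(ByteBuffer in) {
        int result = 0;
        for (int shift = 0; shift < 35; shift += 7) {
            byte b = in.get();
            result |= (b & 0x7F) << shift;
            if ((b & 0x80) == 0) {
                return result;
            }
        }
        throw new IllegalStateException("malformed VInt");
    }

    /** Zig-zag maps small negative numbers to small non-negative ones, so they stay short as VInts. */
    static int zigZagEncode(int v) { return (v >> 31) ^ (v << 1); }

    static int zigZagDecode(int v) { return (v >>> 1) ^ -(v & 1); }

    private final byte[] blob;                                    // all "files" concatenated
    private final Map<String, int[]> toc = new LinkedHashMap<>(); // name -> {offset, length}

    CompoundSliceSketch(Map<String, byte[]> files) {
        ByteArrayOutputStream data = new ByteArrayOutputStream();
        for (Map.Entry<String, byte[]> f : files.entrySet()) {
            toc.put(f.getKey(), new int[] {data.size(), f.getValue().length});
            data.write(f.getValue(), 0, f.getValue().length);
        }
        blob = data.toByteArray();
    }

    /** Returns a read-only view of one file's byte range; position 0 is that file's first byte. */
    ByteBuffer slice(String name) {
        int[] range = toc.get(name);
        if (range == null) {
            throw new IllegalArgumentException("no such file: " + name);
        }
        return ByteBuffer.wrap(blob, range[0], range[1]).slice().asReadOnlyBuffer();
    }

    public static void main(String[] args) {
        ByteArrayOutputStream first = new ByteArrayOutputStream();
        writeVInt(first, 5);
        writeVInt(first, 300);
        ByteArrayOutputStream second = new ByteArrayOutputStream();
        writeVInt(second, zigZagEncode(-3));

        Map<String, byte[]> files = new LinkedHashMap<>();
        files.put("a.bin", first.toByteArray());
        files.put("b.bin", second.toByteArray());
        CompoundSliceSketch compound = new CompoundSliceSketch(files);

        // Each reader sees only its own byte range, as if it were a separate file.
        ByteBuffer a = compound.slice("a.bin");
        System.out.println(readVInt(a) + ", " + readVInt(a));
        System.out.println(zigZagDecode(readVInt(compound.slice("b.bin"))));
    }
}
```

The point of the sketch is that a reader handed the slice for b.bin never needs to know where that range starts inside the larger blob, which is the property that lets the same reader code work on standalone files and on compound files.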
Codecs: A Codec is the (Lucene version-specific) collection of readers/writers for the various data structures used by the higher-level query logic. Each data structure tends to have an associated "[Version][Type]Format" class (for example, Lucene90DocValuesFormat) that either directly implements reading/writing or can return readers and writers.
Tying it together: Many Lucene data structures are stored off-heap. This works because the (Codec-specific) reader itself implements the data structure's interface, so navigating the data structure moves the reader through the IndexInput, which is usually a memory-mapped file supplied by MMapDirectory.
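The off-heap pattern can be sketched with nothing but the JDK. The example below is an illustration under stated assumptions, not Lucene code: OffHeapSketch, SortedInts, and MappedSortedInts are hypothetical names, and the file layout (an int count followed by sorted int values) is invented for the example. The reader implements the interface that the search logic uses, and each call reads directly from a memory-mapped file instead of first copying the values into a Java array.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

/** Toy illustration of an off-heap structure. None of these names are Lucene classes. */
final class OffHeapSketch {
    /** The data structure's interface, as higher-level logic would see it. */
    interface SortedInts {
        int size();
        int get(int index);
    }

    /** Reader that implements the interface by reading at offsets within a mapped file. */
    static final class MappedSortedInts implements SortedInts {
        private final ByteBuffer buffer; // backed by the mapped file, not by a Java array
        private final int size;

        MappedSortedInts(ByteBuffer buffer) {
            this.buffer = buffer;
            this.size = buffer.getInt(0); // made-up header: number of values
        }

        @Override
        public int size() { return size; }

        @Override
        public int get(int index) {
            if (index < 0 || index >= size) {
                throw new IndexOutOfBoundsException("index " + index);
            }
            return buffer.getInt(Integer.BYTES * (1 + index)); // absolute read, no bulk copy
        }
    }

    /** Binary search written purely against the interface. */
    static int indexOf(SortedInts values, int target) {
        int lo = 0;
        int hi = values.size() - 1;
        while (lo <= hi) {
            int mid = (lo + hi) >>> 1;
            int v = values.get(mid);
            if (v < target) {
                lo = mid + 1;
            } else if (v > target) {
                hi = mid - 1;
            } else {
                return mid;
            }
        }
        return -1;
    }

    /** Writes the count followed by the (already sorted) values to a temporary file. */
    static Path write(int... sortedValues) {
        try {
            ByteBuffer buf = ByteBuffer.allocate(Integer.BYTES * (1 + sortedValues.length));
            buf.putInt(sortedValues.length);
            for (int v : sortedValues) {
                buf.putInt(v);
            }
            Path file = Files.createTempFile("offheap-sketch", ".bin");
            Files.write(file, buf.array());
            return file;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Maps the file read-only; the mapping stays valid after the channel is closed. */
    static SortedInts open(Path file) {
        try (FileChannel channel = FileChannel.open(file, StandardOpenOption.READ)) {
            MappedByteBuffer mapped = channel.map(FileChannel.MapMode.READ_ONLY, 0, channel.size());
            return new MappedSortedInts(mapped);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        SortedInts values = open(write(2, 3, 5, 8, 13));
        System.out.println(indexOf(values, 8));
        System.out.println(indexOf(values, 4));
    }
}
```

Because indexOf only depends on the SortedInts interface, the same search code would work unchanged against an on-heap implementation; the choice of backing storage is confined to the reader.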