We didn’t review the change entries above. We’ll need to revisit them next week.
Instead, the whole meeting turned into a nice deep-dive on the challenges of efficient query cost estimation, motivated by the issue Improve AbstractMultiTermQueryConstantScoreWrapper#RewritingWeight ScorerSupplier cost estimation (apache/lucene issue #13029) and my PR to try to improve at least some cases: Get better cost estimate on MultiTermQuery over few terms (apache/lucene pull request #13201).
As a summary of the issue: the reported cost of “multi-term queries” (e.g. PrefixQuery, WildcardQuery, TermInSetQuery, FuzzyQuery, …) was changed in 2023. Lucene uses query cost for a couple of things. In the original issue, a WildcardQuery was no longer being cached because the reported cost was too high. In an issue I (along with @harshavamsi) encountered last week, a TermInSetQuery was receiving a high cost estimate, which led to an adjacent PointRangeQuery executing directly rather than getting rewritten to a doc-value lookup.
Previously, the cost estimate was calculated by expanding the matching set of terms and summing the number of docs matched by each term (I think). The key change in Lucene is that the new logic specifically wants to avoid the term expansion, since that can be expensive – suppose you have employee IDs indexed as EMP000001, EMP000002, etc. and someone searches for EMP*. Unfortunately, it does so by taking a worst-case estimate, which is often “this query matches every document with the given field”.
The PR that I opened trades a little up-front work for more accurate cost estimates by doing “some” term expansion (up to 128 terms), though we throw that effort away if there are more than 128 matching terms.
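To make the shape of that idea concrete, here is a minimal sketch – my own simplification, not the actual change in the PR. The class and method names are hypothetical, and it assumes matchingTerms enumerates only the terms the multi-term query would match:

```java
import java.io.IOException;

import org.apache.lucene.index.Terms;
import org.apache.lucene.index.TermsEnum;

final class CappedCostEstimate {
  // Threshold discussed above; purely illustrative here.
  private static final int TERM_EXPANSION_CAP = 128;

  /**
   * Estimate the cost of a multi-term query by expanding up to
   * TERM_EXPANSION_CAP matching terms and summing their docFreq.
   * If more terms than that match, give up and fall back to the
   * worst case: "every document that has this field".
   */
  static long estimate(Terms fieldTerms, TermsEnum matchingTerms) throws IOException {
    long cost = 0;
    int expanded = 0;
    while (matchingTerms.next() != null) {
      if (++expanded > TERM_EXPANSION_CAP) {
        return fieldTerms.getDocCount(); // worst-case estimate for the field
      }
      cost += matchingTerms.docFreq(); // docs containing this one term
    }
    return cost; // may double-count docs that contain several matching terms
  }
}
```

The appealing property is that the expansion loop bails out after at most 128 terms, so in the EMP* case above the estimate never gets much more expensive than the existing worst-case shortcut.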
There were some really good points raised in the discussion:
Kiran Reddy asked whether, instead of an arbitrary threshold of 128 terms, we could set a maximum time to spend on term expansion to compute a better estimate. This could be a significant change in Lucene, since that time budget would need to be plumbed through (this cost logic is used for all query types, and most queries have really simple cost estimates).
Tomás Fernández Löbbe pointed out that there are cases where a prefix/wildcard query matches maybe 1000 terms (like employeeId:EMP000*), each of which matches a single document. Such a query would generally still be considered “low-cost”, but my change wouldn’t help.
@lukas-vlcek suggested that we could add some information at indexing time that would provide better statistics, based on knowledge of the type of queries that would be sent. I pointed out that this is conceptually a step toward using the index_prefixes option on a text field, for example, which stores statistics for individual prefixes because it indexes the prefixes directly. As an in-between option, maybe we could index statistics about the individual characters (or probably bytes) that appear at the start of each term and within each term. At a maximum cost of 512 longs per segment (4KB), we could track how many documents contain terms that start with each possible byte and how many documents contain terms that contain each possible byte. So an upper bound on the cost of abc* would be min(startsWith('a'), contains('b'), contains('c')). If you extended that to 2-byte combinations (at a cost of 1MB per segment), you could say the cost of abc* would be min(startsWith('ab'), contains('bc')).
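For anyone who prefers code to prose, here is a hedged sketch of what those per-segment byte statistics might look like. Nothing like this exists in Lucene today – all the names are hypothetical, and it glosses over how the counts would actually be gathered and stored during indexing:

```java
import java.nio.charset.StandardCharsets;

// Hypothetical per-segment statistics: 512 longs (~4KB) tracking, for each
// possible byte value, how many documents contain a term that *starts with*
// that byte and how many contain a term that *contains* that byte.
final class PrefixByteStats {
  private final long[] startsWith = new long[256];
  private final long[] contains = new long[256];

  // Called once per document at indexing time with all of that document's
  // terms for the field, so each document is counted at most once per byte.
  void recordDoc(Iterable<byte[]> terms) {
    boolean[] firstSeen = new boolean[256];
    boolean[] anySeen = new boolean[256];
    for (byte[] term : terms) {
      if (term.length == 0) continue;
      firstSeen[term[0] & 0xFF] = true;
      for (byte b : term) {
        anySeen[b & 0xFF] = true;
      }
    }
    for (int i = 0; i < 256; i++) {
      if (firstSeen[i]) startsWith[i]++;
      if (anySeen[i]) contains[i]++;
    }
  }

  // Upper bound on documents matching prefix*: any matching document must
  // contain a term that starts with the first byte of the prefix and contains
  // every later byte of the prefix somewhere.
  long upperBoundCost(String prefix) {
    byte[] p = prefix.getBytes(StandardCharsets.UTF_8);
    if (p.length == 0) return Long.MAX_VALUE; // no information for a match-all prefix
    long bound = startsWith[p[0] & 0xFF];
    for (int i = 1; i < p.length; i++) {
      bound = Math.min(bound, contains[p[i] & 0xFF]);
    }
    return bound;
  }
}
```

With this, upperBoundCost("abc") computes exactly the min(startsWith('a'), contains('b'), contains('c')) bound described above. The 2-byte variant would be the same structure with 65,536 counters per table (roughly 1MB per segment), keyed on adjacent byte pairs instead of single bytes.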
Thanks all for a really great discussion! I’ll try to get the video up on YouTube soon.
(Note that the above summary was written from memory a few hours after the Meetup. If there’s anything that I failed to capture, please call it out in replies below!)