Avoid re-sorting when initializing TermInSetQuery

Versions (relevant - OpenSearch/Dashboard/Server OS/Browser):

OpenSearch version 2.16

Describe the issue:

In our OpenSearch cluster, we’ve noticed that a significant portion of CPU time is spent sorting terms when initializing TermInSetQuery objects (specifically, this sort call in Lucene’s TermInSetQuery.packTerms() function). However, we make sure to presort the terms before constructing our retrieval query, so the extra sort is unexpected and redundant.

Looking through the code a bit more, I see Lucene will skip sorting if the terms are passed as a SortedSet object (see code here), but it doesn’t look like OpenSearch has any option to do this. I see we always pass a BytesRef here.
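To make the behavior I’m describing concrete, here is a minimal, stdlib-only sketch of the check as I understand it. It is not Lucene’s actual code: `PresortedTermsSketch`, `prepareTerms`, and `sortCalls` are hypothetical names, and `String` stands in for `BytesRef`. The assumption is that sorting is skipped only when the input is a SortedSet using natural ordering, so a plain list is re-sorted even if it is already in order.

```java
import java.util.Arrays;
import java.util.Collection;
import java.util.Comparator;
import java.util.List;
import java.util.SortedSet;
import java.util.TreeSet;

class PresortedTermsSketch {
    // Hypothetical counter, only here to show when the fallback sort runs.
    static int sortCalls = 0;

    // Returns the terms in sorted order, sorting only when the input type
    // does not already guarantee natural ordering.
    static String[] prepareTerms(Collection<String> terms) {
        // A SortedSet with a null comparator is ordered by natural ordering.
        boolean alreadySorted = terms instanceof SortedSet
                && ((SortedSet<?>) terms).comparator() == null;
        String[] out = terms.toArray(new String[0]);
        if (!alreadySorted) {
            sortCalls++;
            Arrays.sort(out);
        }
        return out;
    }

    public static void main(String[] args) {
        // Presorted, but passed as a List: the type gives no ordering guarantee.
        List<String> presortedList = Arrays.asList("a", "b", "c");
        prepareTerms(presortedList);
        System.out.println("sorts after List input: " + sortCalls);

        // Same terms as a natural-order SortedSet: the sort is skipped.
        SortedSet<String> sortedSet = new TreeSet<>(presortedList);
        prepareTerms(sortedSet);
        System.out.println("sorts after SortedSet input: " + sortCalls);

        // A SortedSet with a custom comparator is not trusted and is re-sorted.
        SortedSet<String> reversed = new TreeSet<>(Comparator.reverseOrder());
        reversed.addAll(presortedList);
        prepareTerms(reversed);
        System.out.println("sorts after custom-comparator input: " + sortCalls);
    }
}
```

The point of the sketch is that the caller’s effort to presort is invisible unless the collection type itself carries the ordering guarantee, which is why passing a list of terms always pays for the sort.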

I wanted to confirm that my understanding here is correct. Is there any way to skip re-sorting terms if we’ve presorted them in the retrieval query, or would it require a code change to add this behavior?

Configuration:

Relevant Logs or Screenshots:

Hi,
I’m afraid it requires a code change. It’s also worth checking which Lucene version ships with 2.16. I think this could be a valuable improvement for OpenSearch.

Hold my beer: pass in order terms as sorted to TermInSetQuery() by mkhludnev · Pull Request #17714 · opensearch-project/OpenSearch · GitHub

One more optimization idea, for a certain edge case: Reuse packedTerms between two TermInSetQuery what combined with IndexOrDocValuesQuery · Issue #14425 · apache/lucene · GitHub