Describe the issue:
With 2.19, a new parameter pagination_depth has been introduced, see #1048.
I am now wondering about the meaning of this parameter in the practical sense. I understood from the docs that it limits the number of results retrieved from each shard per subquery before anything else happens like filtering, score normalisation etc., see the docs.
So I made a small experiment with a hybrid query for one index (shard), setting pagination_depth to 1. This gives me 2 results since each subquery is limited to one result.
So if I want the client to be able to page through a long list of results, I’d have to set pagination_depth to a very high value (i.e. the expected max. number of results for one subquery) since it represents the number of documents from which a portion (i.e. a page) would be returned?
In general, I try to use search_after for performance reasons and because pagination with from is limited to 10’000 results. This is also possible for hybrid search, but the sort criteria cannot combine _score with another field like _id to make the order deterministic when two docs have the same score. The error is: "_score sort criteria cannot be applied with any other criteria. Please select one sort criteria out of them." I understand that _score here is the overall score calculated from the two subqueries’ scores.
So it comes down to two questions:
Is my understanding of pagination_depth correct?
Is there a way to make sorting with search_after deterministic without ignoring the score?
Hi @tobe - I tapped Varun on the shoulder to see if he could add a helpful reply here. Outside of that, this actually might be a good question to put into GitHub as an issue or even a comment on the PR itself. Plenty of devs there might have different takes on it.
What is pagination_depth? pagination_depth defines how many of the top search results of each subquery are picked from every shard for hybridization. The parameter was introduced because, previously, increasing from and size could change the result ordering as a side effect of normalization and combination.
Context around why pagination_depth is introduced:
For standard hybrid search without pagination, each subquery always captures from + size results from every individual shard, where from = 0 and size is whatever the user provides. The challenge is that when you increase from and size to see more results, the newly captured results might end up on earlier pages, because after normalization they can be ranked higher depending on the weights. This changes the ground truth on which the user is paginating.
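To make the ground truth problem concrete, here is a small single-shard simulation. It is only a sketch under simplifying assumptions: plain min-max normalization, arithmetic-mean combination with equal weights, and made-up documents and scores. The function names are hypothetical and this is not the plugin's actual code.

```python
def min_max(scores):
    """Plain min-max normalization of one subquery's scores (simplified)."""
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {doc: 1.0 for doc in scores}
    return {doc: (s - lo) / (hi - lo) for doc, s in scores.items()}


def top_k(scores, k):
    """Keep the k best-scoring docs of one subquery."""
    return dict(sorted(scores.items(), key=lambda kv: -kv[1])[:k])


def hybrid_ranking(subquery_scores, depth):
    """Normalize the top `depth` docs of each subquery, then average the scores."""
    combined = {}
    for scores in subquery_scores:
        for doc, s in min_max(top_k(scores, depth)).items():
            combined[doc] = combined.get(doc, 0.0) + s / len(subquery_scores)
    # Sort by combined score, doc id as tie-breaker for a stable order.
    return sorted(combined, key=lambda d: (-combined[d], d))


def page_with_growing_depth(subquery_scores, from_, size):
    """Old behaviour as described above: depth grows with from + size."""
    ranking = hybrid_ranking(subquery_scores, depth=from_ + size)
    return ranking[from_:from_ + size]


# Made-up scores for illustration only.
MATCH = {"A": 9.0, "B": 8.0, "C": 7.0, "D": 6.9, "E": 3.0}
KNN = {"F": 0.95, "A": 0.90, "G": 0.80, "D": 0.79, "H": 0.40}

page1 = page_with_growing_depth([MATCH, KNN], from_=0, size=3)
page2 = page_with_growing_depth([MATCH, KNN], from_=3, size=3)
print(page1)  # ['A', 'F', 'B']
print(page2)  # ['B', 'G', 'C']

# With one fixed depth, both pages are slices of the same ranking.
fixed = hybrid_ranking([MATCH, KNN], depth=5)
print(fixed[0:3], fixed[3:6])  # ['A', 'D', 'F'] ['B', 'G', 'C']
```

In this toy data, B shows up on both pages and D is never shown when the depth grows with from + size, because D only enters the candidate set on the second request and then ranks second overall. With a fixed depth, every page is cut from the same list.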
What does pagination_depth do?
To counter the ground truth disruption that occurs when paginating with from and size, as discussed above, we introduced a new parameter called pagination_depth. It lets the user decide how many search results from each shard are considered for hybridization, which fixes the reference set of search results on which they can paginate consistently. from and size are then used to navigate within that reference set.
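As a rough sketch of how such a request could be put together, the helper below builds a hybrid query body with pagination_depth set inside the hybrid clause. The helper name, the field names text and embedding, and the k value are assumptions for illustration; the request would also need to be sent with a search pipeline that has a normalization processor configured.

```python
import json


def build_hybrid_request(query_text, query_vector, from_, size, pagination_depth, k):
    """Build a hybrid search body (hypothetical helper, field names are assumed)."""
    return {
        "from": from_,
        "size": size,
        "query": {
            "hybrid": {
                # Top results taken per subquery from every shard.
                "pagination_depth": pagination_depth,
                "queries": [
                    {"match": {"text": {"query": query_text}}},
                    {"knn": {"embedding": {"vector": query_vector, "k": k}}},
                ],
            }
        },
    }


body = build_hybrid_request("wild west", [0.1, 0.2, 0.3], from_=0, size=10,
                            pagination_depth=20, k=20)
print(json.dumps(body, indent=2))
```

To move to the next page, only from changes; pagination_depth stays the same so that every page is cut from the same reference set.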
For example:
Consider an index with 3 shards. A user performs a hybrid search that contains two subqueries: match and knn. The user provides from = 10, size = 30 and pagination_depth = 20.
For the match query, the coordinator node can capture up to pagination_depth x number of shards results, provided every shard has that many results. In this case that is up to 20 x 3 = 60. Assume every shard does have that many results for the match query, so the coordinator node gets 60 results for it.
Similarly, for the knn query the coordinator node can receive up to 60 results.
60 + 60 = 120 results.
Out of these 120 results, let's say 50 documents are common to both the match and the knn query.
After normalization and combination, the total result count will be 120 - 50 = 70.
So by using pagination_depth = 20, the user ends up with 70 unique hybridized search results.
Since from was 10 and size was 30, the response returns 30 results from those 70 documents, starting at offset 10.
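The arithmetic of this example can be checked with a few lines of Python. This is only a sketch of the counting, with hypothetical function names; it assumes two subqueries, so each shared document is counted exactly twice before deduplication.

```python
def hybrid_candidate_counts(num_shards, num_subqueries, pagination_depth, shared_docs):
    """Upper bounds on collected results, assuming every shard has enough hits."""
    per_subquery = pagination_depth * num_shards
    collected = per_subquery * num_subqueries
    # Valid for two subqueries: each shared doc appears once in each list.
    unique = collected - shared_docs
    return per_subquery, collected, unique


def page_bounds(unique, from_, size):
    """Slice of the hybridized list that from/size selects, clamped to its length."""
    start = min(from_, unique)
    end = min(from_ + size, unique)
    return start, end


per_subquery, collected, unique = hybrid_candidate_counts(
    num_shards=3, num_subqueries=2, pagination_depth=20, shared_docs=50)
start, end = page_bounds(unique, from_=10, size=30)

print(per_subquery, collected, unique)  # 60 120 70
print(start, end, end - start)          # 10 40 30
```

The clamping in page_bounds also shows the limit of paging: once from reaches 70 in this example, the slice is empty, so pagination_depth bounds how far the user can page.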
Thanks a lot for clarifying this. What about pagination with search_after? I noticed that it is possible to use it with hybrid search (see docs), but it is not possible to combine _score with another criterion like _id to make the order deterministic. The error is: "_score sort criteria cannot be applied with any other criteria. Please select one sort criteria out of them." I understand that the cursor refers to the resulting list after normalisation, and I wonder whether this leads to issues similar to those with offset-based pagination (the ground truth issue). Is this the reason for the limitation, and if so, are you planning to find a solution for this as well?