Understanding OpenSearch Scaling

Versions (relevant - OpenSearch/Dashboard/Server OS/Browser):
Ubuntu 22.04, OpenSearch 2.12

Describe the issue:
I am currently testing OpenSearch performance by running a single-node cluster on my desktop via Docker. The machine is a fairly beefy Ubuntu 22.04 box (20 cores, 64 GB RAM, 2 TB NVMe SSD). I have a 100 GB index and have copied it several times, each copy with a different number of primary shards (1, 2, 4, 8).

I am running some fairly complex OpenSearch queries against these indices via the Python client (~30K lines, with and without aggregations) and am finding that scaling levels off after 4 shards. My test takes a given query and re-runs it 50 times against each index, then compares the minimum time per index. From 1 → 2 → 4 shards the minimum query time scales roughly linearly (1.5 s → 800 ms → 450 ms), but beyond 4 shards the time stays the same as for the 4-shard index.

In iotop I can see a decent amount of disk reads on the very first query, but none for the remaining 49 queries. I’m assuming everything gets loaded into an OpenSearch cache or operating system memory. What is the next step for analyzing the bottleneck? Is this expected behavior?
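For reference, the measurement can be sketched roughly as follows. This is a minimal sketch rather than my exact script: `min_latency_ms` and `speedup_table` are illustrative helper names, and the index names in the commented usage are hypothetical. It takes the minimum wall-clock time over repeated runs and expresses each shard count as a speedup over the 1-shard index.

```python
import time
from typing import Callable, Dict, Tuple


def min_latency_ms(run_query: Callable[[], object], repeats: int = 50) -> float:
    """Run the query `repeats` times and return the fastest wall-clock time in ms."""
    if repeats < 1:
        raise ValueError("repeats must be at least 1")
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        run_query()
        elapsed_ms = (time.perf_counter() - start) * 1000.0
        best = min(best, elapsed_ms)
    return best


def speedup_table(min_ms_by_shards: Dict[int, float]) -> Dict[int, Tuple[float, float]]:
    """Map shard count -> (speedup vs. 1 shard, speedup divided by shard count)."""
    baseline = min_ms_by_shards[1]
    table = {}
    for shards, ms in sorted(min_ms_by_shards.items()):
        speedup = baseline / ms
        table[shards] = (speedup, speedup / shards)
    return table


# Hypothetical usage with the opensearch-py client (index names are made up):
#   from opensearchpy import OpenSearch
#   client = OpenSearch(hosts=[{"host": "localhost", "port": 9200}])
#   for shards in (1, 2, 4, 8):
#       index = f"myindex-{shards}shards"
#       print(shards, min_latency_ms(lambda: client.search(index=index, body=query)))

if __name__ == "__main__":
    # Minimum times reported above, in milliseconds.
    for shards, (speedup, efficiency) in speedup_table({1: 1500.0, 2: 800.0, 4: 450.0}).items():
        print(f"{shards} shard(s): speedup {speedup:.2f}x, efficiency {efficiency:.2f}")
```

With the numbers above, this gives a speedup of about 1.88x for 2 shards and 3.33x for 4 shards. Note that wall-clock time measured on the client side also includes request serialization and network overhead, which the `took` field in the search response does not.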

I have refresh_interval set to -1 and have tried force-merging each shard down to a single segment, but have seen no difference in query performance.
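The index preparation is roughly the following. It is a sketch under assumptions: `prepare_index_for_benchmark` is an illustrative helper name, and `client` is assumed to be an opensearch-py `OpenSearch` instance (any object exposing the same `indices.put_settings` and `indices.forcemerge` methods would work).

```python
from typing import Any


def prepare_index_for_benchmark(client: Any, index: str) -> None:
    """Disable periodic refresh and merge every shard down to one segment."""
    # refresh_interval = -1 turns off automatic refreshes, so no new
    # segments are created while the read-only benchmark runs.
    client.indices.put_settings(
        index=index,
        body={"index": {"refresh_interval": "-1"}},
    )
    # Force-merge so each shard holds a single segment.
    client.indices.forcemerge(index=index, max_num_segments=1)


# Hypothetical usage (index names are made up):
#   from opensearchpy import OpenSearch
#   client = OpenSearch(hosts=[{"host": "localhost", "port": 9200}])
#   for shards in (1, 2, 4, 8):
#       prepare_index_for_benchmark(client, f"myindex-{shards}shards")
```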

Any help / commentary would be appreciated!