TL;DR: we don’t get anywhere near those speeds. We top out at about 10k shapes/s, but in a very different scenario: the cluster is very big, 10k/s uses all of its resources, and 50M shapes add about 800G to the index size. It might be interesting to benchmark a controlled test case; the gap may come from OpenSearch overhead rather than from Lucene itself.
That’s an interesting read!
In our real-world usage, indexing doesn’t come close to this speed. That might be overhead from OpenSearch rather than from Lucene itself.
When I am back at work (Sunday) I’ll try to post actual settings and configurations, but posting the use case here from memory:
We’re indexing geoshapes over a very big area (think continent size). Shape sizes vary, but they mostly average no more than 100 square metres.
It appears that the more vertices a geoshape has, or the bigger it is, the heavier the indexing workload becomes (which makes sense considering the indexing method).
As an improvement step, we’re simplifying our polygons and lines to have far fewer vertices, which makes indexing much faster and lowers resource usage considerably.
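To illustrate the kind of vertex reduction meant here, below is a minimal sketch of Douglas-Peucker line simplification in plain Python. This is an assumption for illustration only: I’m not claiming it is the exact algorithm or tolerance we use, and the function names are made up. It treats coordinates as planar, so the tolerance is in the same units as the input (degrees for lon/lat), and polygon rings would need an extra validity check after simplification.

```python
import math


def _point_segment_distance(p, a, b):
    """Planar distance from point p to the segment a-b."""
    (px, py), (ax, ay), (bx, by) = p, a, b
    dx, dy = bx - ax, by - ay
    seg_len_sq = dx * dx + dy * dy
    if seg_len_sq == 0.0:
        # Degenerate segment (a == b): fall back to point-to-point distance.
        return math.hypot(px - ax, py - ay)
    # Project p onto the segment and clamp the projection to its endpoints.
    t = ((px - ax) * dx + (py - ay) * dy) / seg_len_sq
    t = max(0.0, min(1.0, t))
    return math.hypot(px - (ax + t * dx), py - (ay + t * dy))


def simplify_line(points, tolerance):
    """Douglas-Peucker: drop vertices closer than `tolerance` to the simplified line."""
    if len(points) < 3:
        return list(points)
    first, last = points[0], points[-1]
    max_dist, index = 0.0, 0
    for i in range(1, len(points) - 1):
        d = _point_segment_distance(points[i], first, last)
        if d > max_dist:
            max_dist, index = d, i
    if max_dist <= tolerance:
        # Every interior vertex is within tolerance: keep only the endpoints.
        return [first, last]
    # Keep the farthest vertex and simplify both halves recursively.
    left = simplify_line(points[: index + 1], tolerance)
    right = simplify_line(points[index:], tolerance)
    return left[:-1] + right


line = [(0, 0), (1, 0.01), (2, 0), (3, 0.02), (4, 0), (4, 4)]
print(simplify_line(line, 0.1))  # [(0, 0), (4, 0), (4, 4)]
```

A larger tolerance removes more vertices at the cost of shape accuracy, so the right value depends on how precise the queries need to be.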
Still, even with all of that, we get at most 10k shapes per second, and that’s on a cluster with about 3 TB of RAM in total and about 150 SSDs as the storage backend (with RAID giving a sustained speed of about 60k IOPS per node). Resource usage peaks at barely manageable levels when we run at 10k per second (usually a reindex scenario), and the cluster requires supervision to make sure it remains stable.
Unless I am mistaken, the benchmarks shown cover a much smaller area (city size) and use Lucene directly. It might be interesting to benchmark OpenSearch over a similar use case when I have some spare time at work.
I will mention that when Lucene originally improved the indexing speed (I think around ES 7.5?) by changing the indexing method to be tree-based, it helped enormously. Before that, we could barely get up to 1k docs per second.
As for index size: with about 50M geoshape fields, we get an index of about 900G, compared to about 100G for the same documents without the geoshape. Once again, that makes it really interesting to benchmark different scenarios, as configurations usually don’t affect index size that much.
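For a rough sense of scale, here is the back-of-the-envelope arithmetic on those numbers as a small Python sketch. The helper names are made up, and it assumes "G" means decimal gigabytes and one geoshape field per document; the figures themselves are the approximate ones quoted from memory above.

```python
def geoshape_overhead_kb_per_shape(with_gb, without_gb, shapes):
    """Extra index size attributable to geoshapes, in KB per shape."""
    extra_bytes = (with_gb - without_gb) * 1_000_000_000
    return extra_bytes / shapes / 1_000


def reindex_minutes(shapes, shapes_per_second):
    """Time to index all shapes at a sustained rate, in minutes."""
    return shapes / shapes_per_second / 60


# ~900G with geoshapes vs ~100G without, over ~50M shapes
print(geoshape_overhead_kb_per_shape(900, 100, 50_000_000))  # 16.0
# ~50M shapes at the 10k/s peak rate
print(round(reindex_minutes(50_000_000, 10_000), 1))  # 83.3
```

So each shape costs roughly 16 KB of index on average, and a full reindex at peak rate takes about an hour and a half.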