Support for geo_distance queries/sorting on geo_shape fields?

Support for geo_distance queries and sorting on geo_shape fields was added in Elasticsearch 7.11 (this is separate from the already supported geo_distance queries on geo_point fields).
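For anyone unfamiliar with the request, here is a rough sketch of the kind of search body in question, built as a plain Python dict. The `geo_distance` filter and `_geo_distance` sort are the existing DSL forms used with geo_point fields; the field name `geometry`, the coordinates, and the distance are made up for illustration, and whether a given cluster accepts this against a geo_shape field is exactly what this thread is asking.

```python
import json

# Hypothetical field name: "geometry" is assumed to be mapped as geo_shape.
FIELD = "geometry"

def build_distance_search(lat, lon, distance):
    """Build a search body that filters and sorts by distance from a point."""
    return {
        "query": {
            "bool": {
                "filter": {
                    # Same geo_distance syntax that is used for geo_point fields.
                    "geo_distance": {
                        "distance": distance,
                        FIELD: {"lat": lat, "lon": lon},
                    }
                }
            }
        },
        "sort": [
            {
                "_geo_distance": {
                    FIELD: {"lat": lat, "lon": lon},
                    "order": "asc",
                    "unit": "km",
                }
            }
        ],
    }

body = build_distance_search(lat=40.0, lon=-70.0, distance="200km")
print(json.dumps(body, indent=2))
```

The printed JSON is what would be sent to the `_search` endpoint of an index.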

I was looking through the roadmap and git for any mention of this feature being added to OpenSearch, and failed to find any. I also took a quick dive into the code base to see whether this support made it over (it was merged before 7.11 was released).

Does anyone have any idea if this is on the radar?

Any thoughts @nknize?

I am actually always surprised by how many people use geo fields in ES, considering the poor support it has for them at the query level (real fast, but the DSL sucks and indexing is resource intensive AF).

Investing in this could be a killer feature for OpenSearch

Can you expand on this comment? 400k shapes per second and 640k points per second per single core indexing thread is pretty darn fast and not that intensive, IMHO.

Also, 500 MB and 1.25 GB index sizes are pretty compact for 60.8M points and 60.8M shapes.

What are your expectations here?

TL;DR: we don’t get the same speeds; we only get 10k/s, but in a very different scenario. The cluster is very big and 10k/s uses all of its resources, and 50M shapes make for an index of about 800G. It might be interesting to benchmark this in a controlled test case; the gap may have something to do with overhead from OpenSearch rather than Lucene itself.

That’s an interesting read!
As far as our real-world usage goes, it doesn’t even come close to this speed. That might be overhead from OpenSearch rather than from Lucene directly.
When I am back at work (Sunday) I’ll try to post actual settings and configurations, but posting the use case here from memory:
We’re indexing geoshapes over a very big area (think continent size), with the size of each shape being different - but averaging no more than 100 square metres mostly.
It appears the more vertices you use in your geoshape field, or the bigger it is - the heavier the workload becomes (which makes sense considering the indexing method).
As a step towards improvement, we’re simplifying our polygons and lines to have far fewer vertices, so the indexing speed is much better and resource usage is much lower.
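To make the simplification step concrete, here is a minimal sketch of vertex reduction using the Douglas-Peucker algorithm. That choice is only an assumption for illustration, not necessarily the algorithm we run in production. It treats coordinates as planar (x, y) pairs, so the tolerance is in the same units as the coordinates, and it works on open lines; closed polygon rings need extra care to stay valid.

```python
from math import hypot

def point_segment_distance(p, a, b):
    """Distance from point p to the segment a-b, in planar coordinates."""
    (px, py), (ax, ay), (bx, by) = p, a, b
    dx, dy = bx - ax, by - ay
    if dx == 0 and dy == 0:
        # Degenerate segment: a and b are the same point.
        return hypot(px - ax, py - ay)
    # Project p onto the segment and clamp to its endpoints.
    t = ((px - ax) * dx + (py - ay) * dy) / (dx * dx + dy * dy)
    t = max(0.0, min(1.0, t))
    return hypot(px - (ax + t * dx), py - (ay + t * dy))

def simplify(points, tolerance):
    """Douglas-Peucker: drop vertices closer than tolerance to the chord."""
    if len(points) < 3:
        return list(points)
    first, last = points[0], points[-1]
    index, max_dist = 0, 0.0
    for i in range(1, len(points) - 1):
        d = point_segment_distance(points[i], first, last)
        if d > max_dist:
            index, max_dist = i, d
    if max_dist <= tolerance:
        # Every interior vertex is within tolerance, keep only the endpoints.
        return [first, last]
    # Keep the farthest vertex and simplify both halves around it.
    left = simplify(points[: index + 1], tolerance)
    right = simplify(points[index:], tolerance)
    return left[:-1] + right

line = [(0, 0), (1, 0.01), (2, 0), (3, 1), (4, 0)]
simplified = simplify(line, tolerance=0.1)
print(len(line), "->", len(simplified), "vertices")
```

A larger tolerance removes more vertices at the cost of shape fidelity, so it has to be tuned against how precise the queries need to be.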

Still, even with all of that, we get speeds of 10k shapes per second at most, and that’s with a cluster of about 3 TB of RAM and about 150 SSDs as the storage backend (with RAID giving a sustained speed of about 60k IOPS per node). Resource usage peaks at barely manageable levels when we go at 10k per second (usually a reindex scenario), and the cluster requires supervision to make sure it remains stable.

Reading the benchmarks shown, unless I am mistaken, they’re for use cases over a much smaller plane (city size) and run directly against Lucene. It might be interesting to benchmark OpenSearch over a similar use case when I have some spare time at work.

I will mention that when Lucene originally improved the indexing speed (I think around ES 7.5?) by changing the indexing method to be tree based, it helped enormously. Beforehand we could barely get up to 1k docs per second.

As far as index size goes, for about 50M geoshape fields we get an index size of about 900G (compared to about 100G for the same documents without the geoshape), which once again makes it really interesting to benchmark different scenarios, as configurations usually don’t affect index size that much.
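A quick back-of-the-envelope comparison of the two sets of numbers in this thread, as a sketch. It assumes that "G" means decimal gigabytes, that the 1.25 GB benchmark figure is the one for the 60.8M shapes, and that the extra 800G in our index is entirely due to the geoshape field.

```python
GB = 10**9  # assumption: decimal gigabytes

def bytes_per_doc(index_bytes, doc_count):
    """Average index footprint per document."""
    return index_bytes / doc_count

# Benchmark figures quoted earlier in the thread (assumed to be the shape index).
benchmark = bytes_per_doc(1.25 * GB, 60_800_000)

# Our cluster: 900G with geoshapes minus 100G without, over about 50M documents.
cluster = bytes_per_doc((900 - 100) * GB, 50_000_000)

print(f"benchmark: {benchmark:.1f} bytes per shape")
print(f"cluster:   {cluster:.1f} bytes per shape")
print(f"ratio:     {cluster / benchmark:.0f}x")
```

If those assumptions hold, the per-shape footprint differs by a few hundred times, which is why the size and vertex count of the shapes look like the first thing to compare between the two setups.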

Adding on top of that - I believe even before indexing speeds, investing in allowing all queries and aggs to work over geoshape fields would be amazing.

Most queries in ES only work over point fields, unless you go X-Pack, which opens up geoshape. OpenSearch could draw a whole lot of people by allowing all of them over geoshape for free.

Hi @hagayg,

We are working on building aggregations over geo_shape fields, as well as introducing new XYPoint and XYShape for indexing and queries on a Cartesian coordinate system. We are targeting some aggregations, like GeoBound and GeoCentroid, for the 2.4 release, and will proceed with other aggregations in subsequent releases. I will have a clearer timeline for each aggregation on the public roadmap here: https://github.com/orgs/opensearch-project/projects/1.

Meanwhile, for any new feature requests, please feel free to create GitHub issues in the geospatial repo here