for the sake of simplicity we’ll focus on just one in this discussion; whatever we do applies equally to all of them
if an SCS has a use-case that needs opensearch, it gets its own opensearch cluster (i.e. different SCSs don’t share a common opensearch cluster, to avoid coupling)
functionality in the SCS is built as microservices
these microservices are scalable, so it is entirely possible that more than a single pod of a service is running at any time
our applications are not run by the teams building them (of which i’m a part) but by other companies; there are dozens of deployments out there, with varying versions of different services in use, and accordingly we have zero access to the production systems
downtimes for maintenance windows should be avoided wherever possible; for some future use-cases they may be outright unacceptable
it is thus imperative that all actions are fully automated and fail-safe (i.e. re-runnable). no human interaction may be required for anything but unrecoverable errors (which in turn shouldn’t happen).
the opensearch cluster being deployed in an SCS starts out with a minimal configuration needed just to get it online
some of the minimal setup is customer/landscape specific (e.g. authentication realms)
the rest of the setup is use-case specific:
setup of indices
setup of roles and role mappings for these indices
the actual data is fed in asynchronously once the setup is done (and there’s a constant flow of data afterwards)
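to make the “re-runnable” requirement from above concrete, here’s a minimal sketch of what idempotent use-case setup means. a plain dict stands in for the opensearch cluster, and the index/role names (`logs-v1`, `logs-reader`) are invented for illustration — against a real cluster this would be create-if-absent calls via the index and security APIs:

```python
# Sketch: idempotent use-case setup. A plain dict stands in for the
# OpenSearch cluster; "logs-v1" and "logs-reader" are illustrative names.

def apply_setup(cluster: dict) -> dict:
    """Create indices/roles only if they are missing, so re-runs are harmless."""
    indices = cluster.setdefault("indices", {})
    roles = cluster.setdefault("roles", {})

    # create-if-absent: a second run (e.g. a second pod starting) is a no-op
    indices.setdefault("logs-v1", {"mappings": {"ts": "date", "msg": "text"}})
    roles.setdefault("logs-reader", {"indices": ["logs-v1"], "perms": ["read"]})
    return cluster

cluster = {}
apply_setup(cluster)
apply_setup(cluster)  # second run changes nothing
```

this only covers “doing it twice is harmless”, not “doing it twice concurrently is harmless” — which is exactly the gap described below.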
in a naïve implementation the use-case specific setup can be handled by the data ingestion system simply doing it whenever it starts. but that means that if a new pod starts (e.g. when scaling from 1 to 2 pods) it’ll try to do it all again.
a slightly more advanced approach builds a setup-management system around this and versions the changes, storing the version information in opensearch so that a change is only applied if it hasn’t been done before (think “liquibase for opensearch”). however, this still leaves the issue of multiple pods starting at the same time (in general they’re a Deployment and not a StatefulSet, i.e. all pods start together rather than one after the other).
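the concurrent-startup race can be closed by guarding each migration with a compare-and-set on the stored version. here’s a sketch of that idea; the version “document” is a plain dict and `compare_and_set` is trivially atomic in this single-threaded toy, whereas against a real cluster you’d use opensearch’s optimistic concurrency control (`if_seq_no`/`if_primary_term`) on the version document, and you’d still need a story for a pod dying between claiming and applying a migration:

```python
# Sketch: "liquibase for opensearch" with a compare-and-set guard so that
# concurrent pods cannot apply the same migration twice. The migration
# names and actions are invented for illustration.

MIGRATIONS = [
    ("001-create-logs-index", lambda state: state["indices"].add("logs-v1")),
    ("002-add-reader-role", lambda state: state["roles"].add("logs-reader")),
]

def compare_and_set(doc: dict, expected: int, new: int) -> bool:
    """Stand-in for OpenSearch optimistic concurrency control on the version doc."""
    if doc["version"] != expected:
        return False  # another pod got there first
    doc["version"] = new
    return True

def migrate(version_doc: dict, state: dict, applied: list) -> None:
    """Safe to run from any pod at any time; each migration runs exactly once."""
    for i, (name, action) in enumerate(MIGRATIONS):
        if version_doc["version"] > i:
            continue  # already applied, by us or by another pod
        if compare_and_set(version_doc, i, i + 1):
            # NOTE: claiming before applying glosses over crash recovery;
            # a real implementation needs to handle a pod dying right here.
            action(state)
            applied.append(name)

version_doc = {"version": 0}
state = {"indices": set(), "roles": set()}
applied = []
migrate(version_doc, state, applied)  # first pod does the work
migrate(version_doc, state, applied)  # a second pod finds nothing left to do
```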
we also considered a single configuration service (running as a singleton per SCS) through which all config updates are handled: the other systems call it with their required changes, and it queues them and applies them strictly sequentially.
note that we can’t package the configuration with opensearch itself as we don’t know which use-cases will end up running against it (it must be possible to update a use-case specific component/service without updating opensearch and vice-versa). we do have a versioning & dependency management system in place to ensure that if an update of a component requires a newer version of another component this gets pulled in and deployed as well.
i presume we’re not the first ones facing this issue; how have others solved it? what are your recommendations?
I wanted to thank you for joining our community meeting today and speaking up about your issue here. I’ve sent up a flare to some of the developers on the project hoping to get some suggestions.
To me it sounds like this would be an awesome extensibility option - ‘index version management’ or some kind of ‘index migration’ plugin where before/after mappings can be defined, very much like a Ruby on Rails migration. The cluster would perform some kind of self-check at regular intervals and perform these migrations on any indices that need to ‘roll forward’, so to speak.
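As a rough sketch of what such a plugin’s migrations could look like (all of this is invented for illustration - there is no such OpenSearch API today): each migration declares the mapping it expects and the mapping it produces, and the periodic self-check rolls any index forward through every migration whose ‘before’ matches its current mapping:

```python
# Sketch of the proposed "index migration" idea: before/after mappings,
# like a Rails migration. Purely illustrative; not an existing OpenSearch API.

MIGRATIONS = [
    {"before": {}, "after": {"ts": "date"}},
    {"before": {"ts": "date"}, "after": {"ts": "date", "msg": "text"}},
]

def roll_forward(index_mapping: dict) -> dict:
    """The periodic self-check: apply each migration whose 'before' matches."""
    for migration in MIGRATIONS:
        if index_mapping == migration["before"]:
            index_mapping = dict(migration["after"])
    return index_mapping
```

An index still on the oldest mapping would roll all the way forward, while an already-migrated index passes through unchanged - which also gives you the idempotency the original question asks for.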
I personally think it’s an awesome idea. We should file an issue on this if there’s no best practice or some kind of index state management option that could be used.