We have a bit of a complicated setup, so let me introduce it first:
- everything runs on kubernetes
- we have a bunch of self-contained systems (SCS)
- for simplicity we’ll focus on a single SCS in this discussion; everything said here applies to each of them in the same way
- if an SCS has a use-case that needs opensearch, it gets its own opensearch cluster (i.e. different SCS don’t share a common opensearch cluster, to avoid coupling)
- functionality in the SCS is built as microservices
- these microservices are scalable, so it is quite possible that more than one pod of a service is running at any given time
- our applications are not operated by the teams building them (i’m part of one of those teams) but by other companies. there are dozens of deployments out there, with varying versions of different services in use, and accordingly we have zero access to the production systems
- downtimes for maintenance windows should be avoided wherever possible; in some future use-cases they might be outright unacceptable
it is thus imperative that all actions are fully automated and fail-safe (i.e. re-runnable). no human interaction may be required for anything but unrecoverable errors (which in turn shouldn’t happen).
- the opensearch cluster being deployed in an SCS starts out with a minimal configuration needed just to get it online
- some of the minimal setup is customer/landscape specific (e.g. authentication realms)
- the rest of the setup is use-case specific:
  - setup of indices
  - setup of roles and role mappings for these indices
- the actual data is fed in asynchronously once the setup is done (and there’s a constant flow of data afterwards)
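to make the use-case specific part concrete, here is a sketch of what that setup boils down to in terms of opensearch REST calls. the index name, role name, backend role and mappings are invented for illustration; the security endpoints are the standard `/_plugins/_security/api/...` ones from the security plugin:

```python
# Declarative sketch of one use-case's setup, expressed as the REST
# requests it maps onto. All names below are made up for illustration.

def setup_requests(index: str, role: str, backend_role: str):
    """Return the (method, path, body) triples needed to set up one use-case."""
    return [
        # create the index with its mappings
        ("PUT", f"/{index}", {
            "mappings": {"properties": {"created": {"type": "date"}}},
        }),
        # create a role limited to that index (security plugin REST API)
        ("PUT", f"/_plugins/_security/api/roles/{role}", {
            "index_permissions": [{
                "index_patterns": [index],
                "allowed_actions": ["read", "write"],
            }],
        }),
        # map an authentication backend role onto that role
        ("PUT", f"/_plugins/_security/api/rolesmapping/{role}", {
            "backend_roles": [backend_role],
        }),
    ]
```

whichever component ends up owning the setup only has to replay these requests against the cluster, which keeps the “what” (this data) separate from the “who applies it when” question below.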
in a naïve implementation the data ingestion system could handle the use-case specific setup by simply performing it on every startup. but that means that whenever a new pod starts (e.g. when scaling from 1 to 2 pods) it will try to perform the setup again.
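even the naïve startup approach can at least be made re-runnable. a minimal sketch, assuming a client object with an exists/create pair roughly like opensearch-py’s `client.indices.exists` / `client.indices.create` (the method names here are simplified stand-ins):

```python
# Re-runnable index setup: every pod may call ensure_index() at startup,
# but only the first call actually creates the index. `client` is any
# object offering indices_exists()/indices_create(); with opensearch-py
# these would be client.indices.exists / client.indices.create, and
# AlreadyExists corresponds to the resource_already_exists_exception error.

class AlreadyExists(Exception):
    pass

def ensure_index(client, index, body):
    """Create `index` if missing; treat 'already exists' as success."""
    if client.indices_exists(index):
        return False  # setup was already done earlier
    try:
        client.indices_create(index, body)
        return True
    except AlreadyExists:
        # another pod won the race between our exists() check and create()
        return False
```

this handles re-runs, but as the text says it does not version changes, so it can’t evolve an index that already exists.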
a slightly more advanced variant builds a setup-management system around this and versions the changes, storing the version information in opensearch so that a change is only applied if it hasn’t been applied before (think “liquibase for opensearch”). however, this still leaves the problem of multiple pods starting at the same time (in general the services are a Deployment, not a StatefulSet, i.e. all pods start together rather than one after the other).
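one common way to close that race is optimistic concurrency on the version store itself: opensearch’s create-only indexing (`op_type=create`, or the `PUT /<index>/_create/<id>` endpoint) is atomic and fails with 409 if the document id already exists, so marker documents can act as per-migration locks. a sketch with an in-memory stand-in for such a “.setup-versions” index (index and document names are made up):

```python
# "Liquibase for opensearch" made safe for concurrent pods: each applied
# change is claimed by *creating* a marker document. Creation is atomic,
# so even pods starting simultaneously apply every migration exactly once.

class Conflict(Exception):
    """Stands in for the 409 a create-only indexing request returns."""

class VersionStore:
    # in-memory stand-in for an opensearch index used with op_type=create
    def __init__(self):
        self._docs = set()

    def create(self, doc_id):
        if doc_id in self._docs:
            raise Conflict(doc_id)
        self._docs.add(doc_id)

def run_migrations(store, migrations):
    """Apply each (version, fn) migration at most once across all pods."""
    applied = []
    for version, fn in migrations:
        try:
            store.create(f"migration-{version}")  # atomic claim
        except Conflict:
            continue  # already applied (or being applied) by another pod
        fn()
        applied.append(version)
    return applied
```

note the trade-off: claiming before applying means a pod that crashes mid-migration leaves a marker behind without the change being done; a status field on the marker (pending/done, plus retrying stale pending entries) would address that.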
we also considered a single configuration service (running as a singleton per SCS) through which all config updates would be funnelled: the other systems call it with the required config changes, and it queues the updates and ensures they are applied sequentially.
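the core of such a singleton service is fairly small; a sketch of the queueing part in python (the submit/apply names are invented, and a real service would add error handling and persistence):

```python
# Singleton config service core: callers submit config changes, a single
# worker thread applies them strictly one at a time, in arrival order.
import queue
import threading

class ConfigService:
    def __init__(self, apply_change):
        self._q = queue.Queue()
        self._apply = apply_change  # e.g. replays REST calls against opensearch
        self._worker = threading.Thread(target=self._run, daemon=True)
        self._worker.start()

    def submit(self, change):
        """Enqueue a change; returns an Event the caller can wait on."""
        done = threading.Event()
        self._q.put((change, done))
        return done

    def _run(self):
        while True:
            change, done = self._q.get()
            self._apply(change)
            done.set()
```

the catch, of course, is keeping it an actual singleton (one replica, or leader election) and deciding what happens to queued changes if the pod dies, which is why we haven’t committed to this design yet.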
note that we can’t package the configuration with opensearch itself, as we don’t know which use-cases will end up running against it (it must be possible to update a use-case specific component/service without updating opensearch, and vice versa). we do have a versioning & dependency management system in place that ensures that if a component update requires a newer version of another component, that one gets pulled in and deployed as well.
i presume we’re not the first ones facing this issue - how have others solved it? what are your recommendations?
thanks a lot for your feedback!