Rolling upgrade from version 1.x to 2.2.1 does not work

Rolling upgrade from 1.2.4 to 2.2.1:
The upgraded 2.2.1 node comes up and is listed in _cat/nodes from the non-upgraded (1.2.4) nodes, but any API call on the upgraded node fails with

{
  "error" : {
    "root_cause" : [
      {
        "type" : "class_not_found_exception",
        "reason" : "class_not_found_exception: com.amazon.opendistroforelasticsearch.security.user.User"
      }
    ],
    "type" : "exception",
    "reason" : "java.lang.ClassNotFoundException: com.amazon.opendistroforelasticsearch.security.user.User",
    "caused_by" : {
      "type" : "class_not_found_exception",
      "reason" : "class_not_found_exception: com.amazon.opendistroforelasticsearch.security.user.User"
    }
  },
  "status" : 500
}

Next tried rolling upgrade from 1.2.4 → 1.3.5 → 2.2.1 (based on OS upgrade from 1.0.0 to 2.2).
The upgrade to 1.3.5 worked fine. Once the entire cluster was on 1.3.5, I started the rolling upgrade to 2.2.1. The first upgraded node starts up with this error

2022-09-23T22:11:54,777 Thread-8     [E] ope.sec.con.ConfigurationRepository - [UID=] - Cannot apply default config (this is maybe not an error!)
com.fasterxml.jackson.core.JsonGenerationException: No current event to copy
	at com.fasterxml.jackson.core.JsonGenerator._reportError(JsonGenerator.java:2710) ~[jackson-core-2.13.3.jar:2.13.3]
	at com.fasterxml.jackson.core.JsonGenerator.copyCurrentEvent(JsonGenerator.java:2433) ~[jackson-core-2.13.3.jar:2.13.3]
	at com.fasterxml.jackson.core.JsonGenerator.copyCurrentStructure(JsonGenerator.java:2555) ~[jackson-core-2.13.3.jar:2.13.3]
	at org.opensearch.common.xcontent.json.JsonXContentGenerator.copyCurrentStructure(JsonXContentGenerator.java:418) ~[opensearch-x-content-2.2.1.jar:2.2.1]
	at org.opensearch.common.xcontent.XContentBuilder.copyCurrentStructure(XContentBuilder.java:1013) ~[opensearch-x-content-2.2.1.jar:2.2.1]
	at org.opensearch.security.support.ConfigHelper.readXContent(ConfigHelper.java:125) ~[opensearch-security-2.2.1.0.jar:2.2.1.0]
	at org.opensearch.security.support.ConfigHelper.uploadFile(ConfigHelper.java:78) ~[opensearch-security-2.2.1.0.jar:2.2.1.0]
	at org.opensearch.security.configuration.ConfigurationRepository$1.run(ConfigurationRepository.java:144) [opensearch-security-2.2.1.0.jar:2.2.1.0]
	at java.lang.Thread.run(Thread.java:887) [?:?]

After adding an empty allowlist.yml security config file (as per [BUG] AccessControlException: access denied on start · Issue #2065 · opensearch-project/security · GitHub) the error changes to

2022-09-23T22:15:44,803 worker][T#1] [E] org.ope.sec.aut.BackendRegistry     - [UID=] - Not yet initialized (you may need to run securityadmin)

Updated opensearch.yml to use

  • cluster.initial_cluster_manager_nodes instead of cluster.initial_master_nodes
  • node.roles: ["cluster_manager"] instead of node.master: true

With this the upgraded node starts up fine without any errors in the logs. It is also listed in _cat/nodes when invoked from any of the 1.3.5 nodes. The cluster health is also green. But any API call to the upgraded 2.2.1 node gives the following error.
{
  "error" : {
    "root_cause" : [
      {
        "type" : "security_exception",
        "reason" : "no permissions for [cluster:monitor/state] and User [name=admin, backend_roles=[], requestedTenant=null]"
      }
    ],
    "type" : "security_exception",
    "reason" : "no permissions for [cluster:monitor/state] and User [name=admin, backend_roles=[], requestedTenant=null]"
  },
  "status" : 403
}
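
For reference, the two opensearch.yml renames described above can be sketched as a small helper. This is only an illustration: the function name migrate_settings is hypothetical, it works on an already-parsed flat dict rather than on the YAML file, and it only covers the two settings mentioned in this post. As the 403 response above shows, the renames alone did not make the upgraded node usable.

```python
def migrate_settings(settings: dict) -> dict:
    """Return a copy of `settings` with the two 1.x names replaced by their 2.x names.

    Hypothetical helper: only the two renames mentioned in the post are handled.
    """
    migrated = dict(settings)

    # cluster.initial_master_nodes -> cluster.initial_cluster_manager_nodes
    if "cluster.initial_master_nodes" in migrated:
        migrated["cluster.initial_cluster_manager_nodes"] = migrated.pop(
            "cluster.initial_master_nodes"
        )

    # node.master: true -> node.roles contains "cluster_manager".
    # Any other value of node.master is left untouched on purpose.
    if migrated.get("node.master") is True:
        del migrated["node.master"]
        roles = list(migrated.get("node.roles", []))
        if "cluster_manager" not in roles:
            roles.append("cluster_manager")
        migrated["node.roles"] = roles

    return migrated


if __name__ == "__main__":
    old = {"cluster.initial_master_nodes": ["node-1", "node-2"], "node.master": True}
    print(migrate_settings(old))
```

The node names in the usage example are placeholders, not values from this cluster.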

Upgrading a data node to 2.2.1 makes its data unavailable to the cluster, and the cluster state goes yellow because of the unavailable shards.
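
A rolling upgrade normally waits for the cluster to recover before moving on to the next node, so a yellow state like the one described here is the signal to stop. A minimal sketch of that gate, assuming the caller has already fetched and parsed a _cluster/health response (the function name safe_to_upgrade_next_node is hypothetical):

```python
def safe_to_upgrade_next_node(health: dict) -> bool:
    """Return True only when a parsed _cluster/health response shows a settled, green cluster."""
    return (
        health.get("status") == "green"
        and health.get("unassigned_shards", 0) == 0
        and health.get("initializing_shards", 0) == 0
        and health.get("relocating_shards", 0) == 0
    )


if __name__ == "__main__":
    # Illustrative values only, not taken from the cluster in this thread.
    yellow = {"status": "yellow", "unassigned_shards": 4}
    print(safe_to_upgrade_next_node(yellow))
```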

NOTE: The cluster upgrade from 1.2.4 to 2.2.1 was successful with a full cluster restart instead of a rolling restart, without any config changes to opensearch.yml.

What is the recommended way to do a rolling-restart upgrade from 1.2.4 to 2.2.1? Is that even possible?

What kind of installation is this?
Using the Kubernetes operator I am able to easily upgrade from 1.3.4 to 2.x (tested with 2.2.1 as well as 2.3.0) performing rolling upgrades.

@anubisg1 thanks for your reply!

This is a tarball-based installation.
The upgrade process is the same as in Upgrade from Elasticsearch OSS to OpenSearch - OpenSearch documentation. Pretty much the same as the k8s-operator (opensearch-k8s-operator/upgrade.go at opensearch-operator-2.0.4 · Opster/opensearch-k8s-operator · GitHub)

Although the upgraded 2.2.1 node does not show any error in the logs during server startup, when I check the _nodes/_all API from another 1.3.5 node, there is a node failure for the upgraded node:

{
  "_nodes" : {
    "total" : 6,
    "successful" : 5,
    "failed" : 1,
    "failures" : [
      {
        "type" : "failed_node_exception",
        "reason" : "Failed node [F8CAMr99RPaP1pll7wlfGQ]",
        "node_id" : "F8CAMr99RPaP1pll7wlfGQ",
        "caused_by" : {
          "type" : "exception",
          "reason" : "java.lang.ClassNotFoundException: com.amazon.opendistroforelasticsearch.security.user.User",
          "caused_by" : {
            "type" : "class_not_found_exception",
            "reason" : "class_not_found_exception: com.amazon.opendistroforelasticsearch.security.user.User"
          }
        }
      }
    ]
  },
// truncated
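
A response like this is easy to miss because the request itself succeeds. The following sketch pulls the failed node IDs and their innermost cause out of a parsed _nodes response; the function name failed_nodes is hypothetical, and it only relies on the fields visible in the output above.

```python
def failed_nodes(nodes_response: dict) -> list:
    """Return (node_id, innermost cause type, innermost cause reason) per failed node."""
    failures = nodes_response.get("_nodes", {}).get("failures", [])
    result = []
    for failure in failures:
        cause = failure.get("caused_by", {})
        # Walk down the nested caused_by chain to the innermost cause.
        while "caused_by" in cause:
            cause = cause["caused_by"]
        result.append((failure.get("node_id"), cause.get("type"), cause.get("reason")))
    return result


if __name__ == "__main__":
    sample = {
        "_nodes": {
            "total": 6,
            "successful": 5,
            "failed": 1,
            "failures": [
                {
                    "type": "failed_node_exception",
                    "node_id": "F8CAMr99RPaP1pll7wlfGQ",
                    "caused_by": {
                        "type": "exception",
                        "caused_by": {
                            "type": "class_not_found_exception",
                            "reason": "class_not_found_exception: com.amazon.opendistroforelasticsearch.security.user.User",
                        },
                    },
                }
            ],
        }
    }
    for node_id, cause_type, reason in failed_nodes(sample):
        print(node_id, cause_type, reason)
```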

This cluster is set up with a few internal users, roles, etc. No external users/roles.

Based on the error above, it looked like the 2.2.1 node is not able to read the User information stored in the security index. By the way, I had the security index named .security in OpenSearch 1.2.4 itself, i.e. not the default .opendistro-security.

So I tried an additional step of reinitializing the security index using securityadmin.sh after upgrading the cluster to OpenSearch 1.3.5, hoping that it would use the latest serialization class for storing the user in the security index. Unfortunately that didn't help, and I get the same “java.lang.ClassNotFoundException: com.amazon.opendistroforelasticsearch.security.user.User” error for an upgraded 2.2.1 node.

This error is the same as in [BUG] [class_not_found_exception] during rolling upgrade on security enabled cluster · Issue #1259 · opensearch-project/security · GitHub but this cluster was not upgraded from OpenDistro; rather it was started fresh as OpenSearch 1.2.4.

Do let me know if you have any pointers on this.

The 1.3.5 nodes serialize the User with the base package “com.amazon.opendistroforelasticsearch” when sending a request to the 2.2.1 node. The 2.2.1 security module has no idea how to handle this base package. I had to revert back Replace opensearch class names with opendistro class names during serialization and restore them back during deserialization by vrozov · Pull Request #1278 · opensearch-project/security · GitHub partially, i.e. at least the deserialization part in Base64Helper.java, to get this rolling upgrade to work.
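
To make the naming mismatch concrete, here is a sketch of the two-way class-name mapping that the title of Pull Request #1278 describes. It is not the plugin's code (that lives in Java, in Base64Helper.java); the function names are hypothetical, and the prefixes are taken from the class names in the error message and the stack trace earlier in this thread.

```python
OPENDISTRO_PREFIX = "com.amazon.opendistroforelasticsearch.security"
OPENSEARCH_PREFIX = "org.opensearch.security"


def _swap_prefix(class_name: str, old: str, new: str) -> str:
    """Replace a leading package prefix, leaving unrelated class names unchanged."""
    if class_name == old or class_name.startswith(old + "."):
        return new + class_name[len(old):]
    return class_name


def to_opensearch_name(class_name: str) -> str:
    """Map a name as written by a 1.x node to the name a 2.x node can load."""
    return _swap_prefix(class_name, OPENDISTRO_PREFIX, OPENSEARCH_PREFIX)


def to_opendistro_name(class_name: str) -> str:
    """Map a 2.x class name back to the legacy name that 1.x nodes put on the wire."""
    return _swap_prefix(class_name, OPENSEARCH_PREFIX, OPENDISTRO_PREFIX)


if __name__ == "__main__":
    legacy = "com.amazon.opendistroforelasticsearch.security.user.User"
    print(to_opensearch_name(legacy))
```

Without the first mapping on the receiving side, the 2.2.1 node tries to load the legacy name as-is, which matches the ClassNotFoundException shown above.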

I wonder how the k8s-operator can do the rolling upgrade for this setup, since any newly added 2.2.1 node will not be able to process requests from any 1.x node due to this deserialization issue! Looks like a bug from the rolling upgrade standpoint.

Have you opened a bug in GitHub - opensearch-project/security: 🔐 Secure your cluster with TLS, numerous authentication backends, data masking, audit logging as well as role-based access control on indices, documents, and fields yet? I think you should, this doesn’t look like the expected behavior.

Sure @dblock . Created [BUG] User object deserialization prevents Rolling upgrade from version 1.x to 2.2.1 · Issue #2168 · opensearch-project/security · GitHub to track this.