Reindex API Unexpected Timeouts

Hello, I am trying to reindex data from an (old) Elasticsearch 7.16.2 cluster to a (new) OpenSearch 1.2.2 cluster. I have quite a bit of data to migrate, so I've been experimenting with a small index on the old cluster.

The reindex operation only works about 50% of the time; the other 50% of the time it fails with an obscure connection timeout error.

Walkthrough Scenario:

  1. I send the following cURL request:
curl -X POST "https://<opensearch_node>:9200/_reindex/?pretty=true&wait_for_completion=true&timeout=2m" -H 'Content-Type: application/json' --data @reindex_body.json --cacert ca-certs.pem -u admin

Here are the contents of reindex_body.json:

{
  "source": {
    "remote": {
      "host": "https://<elasticsearch_node>:9200",
      "socket_timeout": "2m",
      "connect_timeout": "2m",
      "username": "<redacted>",
      "password": "<redacted>"
    },
    "index": "sw-reports-new-test"
  },
  "dest": {
    "index": "sw-reports-new-test92"
  }
}
  2. The first time it usually works, and I get this:

{
  "took" : 443,
  "timed_out" : false,
  "total" : 9,
  "updated" : 9,
  "created" : 0,
  "deleted" : 0,
  "batches" : 1,
  "version_conflicts" : 0,
  "noops" : 0,
  "retries" : {
    "bulk" : 0,
    "search" : 0
  },
  "throttled_millis" : 0,
  "requests_per_second" : -1.0,
  "throttled_until_millis" : 0,
  "failures" : [ ]
}
  3. A few seconds later, I send the same request again, and after about 10 seconds I get this:

{
  "error" : {
    "root_cause" : [
      {
        "type" : "i_o_exception",
        "reason" : "Connection timed out"
      }
    ],
    "type" : "i_o_exception",
    "reason" : "Connection timed out"
  },
  "status" : 500
}
  4. When I check the server logs, I see a very non-specific connection timeout stack trace with no real context; I can't even tell which connection timed out:

Connection timed out
        ... (several JDK-internal socket-read frames; the class names were lost when pasting) ...
        at org.apache.http.nio.reactor.ssl.SSLIOSession.receiveEncryptedData( ~[?:?]
        at org.apache.http.nio.reactor.ssl.SSLIOSession.isAppInputReady( ~[?:?]
        at org.apache.http.impl.nio.reactor.AbstractIODispatch.inputReady( ~[?:?]
        at org.apache.http.impl.nio.reactor.BaseIOReactor.readable( ~[?:?]
        at org.apache.http.impl.nio.reactor.AbstractIOReactor.processEvent( ~[?:?]
        at org.apache.http.impl.nio.reactor.AbstractIOReactor.processEvents( ~[?:?]
        at org.apache.http.impl.nio.reactor.AbstractIOReactor.execute( ~[?:?]
        at org.apache.http.impl.nio.reactor.BaseIOReactor.execute( ~[?:?]
        at org.apache.http.impl.nio.reactor.AbstractMultiworkerIOReactor$ ~[?:?]
        ... (final frame; class name lost when pasting) ...

What I’ve Tried:
I’ve tried the following things:

  • Reindexing with wait_for_completion=false, then polling the task ID. Same result: it works about 50% of the time.
  • Used multiple timeout values: I’ve set the timeout query parameter as well as the source.remote.socket_timeout and source.remote.connect_timeout values in the body. No differences observed.
  • Tried reindexing against different source/destination hosts.
  • Tried restarting all nodes in the source and destination clusters.
  • Tried using a different destination index for each reindex test.
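For reference, here is roughly how I ran the "don't wait for completion" variant from the first bullet. This is a sketch: the hosts and credentials are placeholders as above, and the task ID shown is not a real value.

```shell
# Submitting with wait_for_completion=false returns a task ID immediately
# instead of holding the HTTP connection open for the whole copy:
curl -X POST "https://<opensearch_node>:9200/_reindex?wait_for_completion=false" \
  -H 'Content-Type: application/json' --data @reindex_body.json \
  --cacert ca-certs.pem -u admin
# The response contains something like {"task": "<node_id>:<task_number>"}.

# Then poll the task until "completed" is true; any failures appear in the response:
curl "https://<opensearch_node>:9200/_tasks/<node_id>:<task_number>?pretty" \
  --cacert ca-certs.pem -u admin
```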

What I’ve Thought About Doing:

  • Using different Java versions on the old and/or new cluster. The old cluster runs Java 8, and the new cluster runs the bundled JDK (Java 15, I think).

Has anyone had experience with something like this? Any advice is much appreciated.


I have reindexed massive indices successfully with Logstash instead of the built-in _reindex API.
I've found the Logstash approach to be a bit more flexible and more responsive to backpressure from the target cluster. Basically, you set up an elasticsearch{} input and an elasticsearch{} output in a Logstash pipeline. It handles one big scroll search for you, which you can tune with a refined query and/or the size of each scroll chunk. Heck, you could even parallelize this approach by splitting the query's timeframe across different Logstash pipelines.
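For anyone curious, here is a minimal sketch of such a pipeline. All hosts, index names, and credentials below are placeholders, and the exact SSL option names vary a bit between Logstash versions:

```conf
# pipeline.conf (sketch): scroll the source index into the target cluster.
input {
  elasticsearch {
    hosts    => ["https://<elasticsearch_node>:9200"]
    index    => "<source_index>"
    query    => '{ "query": { "match_all": {} } }'
    size     => 1000        # documents per scroll page; tune for throughput
    scroll   => "5m"        # keep the scroll context alive between pages
    docinfo  => true        # expose _index/_id in [@metadata]
    user     => "<redacted>"
    password => "<redacted>"
    ssl      => true
    ca_file  => "ca-certs.pem"
  }
}
output {
  elasticsearch {
    hosts       => ["https://<opensearch_node>:9200"]
    index       => "<target_index>"
    document_id => "%{[@metadata][_id]}"   # preserve original document IDs
    user        => "admin"
    password    => "<redacted>"
    ssl         => true
    cacert      => "ca-certs.pem"
  }
}
```

Narrowing the query (e.g. to a time range) is also how you'd split the job across parallel pipelines.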

If it's really a big job and you want it done right, I also recommend throwing Kafka into the mix. Have one Logstash pipeline pull from the source cluster and output into a Kafka topic, and have another Logstash pipeline pull from that topic and write into your target cluster.
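Sketched out, the Kafka-buffered version is just two pipelines; the broker address and topic name here are made up for illustration:

```conf
# Pipeline 1 (sketch): source cluster -> Kafka topic
input {
  elasticsearch { hosts => ["https://<elasticsearch_node>:9200"] index => "<source_index>" docinfo => true }
}
output {
  kafka { bootstrap_servers => "kafka:9092" topic_id => "reindex-buffer" codec => json }
}

# Pipeline 2 (sketch): Kafka topic -> target cluster
input {
  kafka { bootstrap_servers => "kafka:9092" topics => ["reindex-buffer"] codec => json }
}
output {
  elasticsearch { hosts => ["https://<opensearch_node>:9200"] index => "<target_index>" }
}
```

The topic decouples the two sides, so the consumer pipeline can be stopped and restarted (or scaled out) without re-running the scroll against the source cluster.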
