Reindex API Unexpected Timeouts

Hello, I am trying to reindex data from an (old) Elasticsearch 7.16.2 cluster to a (new) OpenSearch 1.2.2 cluster. I have quite a bit of data to migrate, so I've been experimenting with a small index on the old cluster.

The reindex operation only works about 50% of the time; the other 50% of the time it fails with an obscure connection timeout error.

Walkthrough Scenario:

  1. I send the following cURL request:
curl -X POST "https://<opensearch_node>:9200/_reindex/?pretty=true&wait_for_completion=true&timeout=2m" -H 'Content-Type: application/json' --data @reindex_body.json --cacert ca-certs.pem -u admin

Here are the contents of reindex_body.json:

{
  "source": {
    "remote": {
      "host": "https://<elasticsearch_node>:9200",
      "socket_timeout": "2m",
      "connect_timeout": "2m",
      "username": "<redacted>",
      "password": "<redacted>"
    },
    "index": "sw-reports-new-test"
  },
  "dest": {
    "index": "sw-reports-new-test92"
  }
}
  2. The first time it usually works, and I get this:

{
  "took" : 443,
  "timed_out" : false,
  "total" : 9,
  "updated" : 9,
  "created" : 0,
  "deleted" : 0,
  "batches" : 1,
  "version_conflicts" : 0,
  "noops" : 0,
  "retries" : {
    "bulk" : 0,
    "search" : 0
  },
  "throttled_millis" : 0,
  "requests_per_second" : -1.0,
  "throttled_until_millis" : 0,
  "failures" : [ ]
}
  3. A few seconds later, I send the same request again, and after about 10 seconds I get this:

{
  "error" : {
    "root_cause" : [
      {
        "type" : "i_o_exception",
        "reason" : "Connection timed out"
      }
    ],
    "type" : "i_o_exception",
    "reason" : "Connection timed out"
  },
  "status" : 500
}
  4. When I check the server logs, I see a very non-specific connection timeout stack trace with no real context; I can't even tell which connection timed out:

Connection timed out
        ... (several JDK-internal socket-read frames; the class names were lost when pasting) ...
        at org.apache.http.nio.reactor.ssl.SSLIOSession.receiveEncryptedData( ~[?:?]
        at org.apache.http.nio.reactor.ssl.SSLIOSession.isAppInputReady( ~[?:?]
        at org.apache.http.impl.nio.reactor.AbstractIODispatch.inputReady( ~[?:?]
        at org.apache.http.impl.nio.reactor.BaseIOReactor.readable( ~[?:?]
        at org.apache.http.impl.nio.reactor.AbstractIOReactor.processEvent( ~[?:?]
        at org.apache.http.impl.nio.reactor.AbstractIOReactor.processEvents( ~[?:?]
        at org.apache.http.impl.nio.reactor.AbstractIOReactor.execute( ~[?:?]
        at org.apache.http.impl.nio.reactor.BaseIOReactor.execute( ~[?:?]
        at org.apache.http.impl.nio.reactor.AbstractMultiworkerIOReactor$ ~[?:?]
        ... (final frame; class name lost when pasting) ...

What I’ve Tried:
I’ve tried the following things:

  • Reindexing with wait_for_completion=false, then polling the task ID. Same result: it works about 50% of the time.
  • Used multiple timeout values: I’ve set the timeout query parameter as well as the source.remote.socket_timeout and source.remote.connect_timeout values in the body. No differences observed.
  • Tried reindexing against different source/destination hosts.
  • Tried restarting all nodes in the source and destination clusters.
  • Tried using a different destination index for each reindex test.
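For reference, here is roughly how I ran the "don't wait for completion" variant from the first bullet. This is a sketch: the hosts and credentials are placeholders as above, and the task ID shown is not a real value.

```shell
# Submitting with wait_for_completion=false returns a task ID immediately
# instead of holding the HTTP connection open for the whole copy:
curl -X POST "https://<opensearch_node>:9200/_reindex?wait_for_completion=false" \
  -H 'Content-Type: application/json' --data @reindex_body.json \
  --cacert ca-certs.pem -u admin
# The response contains something like {"task": "<node_id>:<task_number>"}.

# Then poll the task until "completed" is true; any failures appear in the response:
curl "https://<opensearch_node>:9200/_tasks/<node_id>:<task_number>?pretty" \
  --cacert ca-certs.pem -u admin
```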

What I’ve Thought About Doing:

  • Using different Java versions on the old and/or new cluster. The old cluster runs Java 8, and the new cluster runs the bundled JDK (Java 15, I think).

Has anyone had experience with something like this? Any advice is much appreciated.


I have reindexed massive indices successfully with Logstash instead of the built-in _reindex API.
I've found the Logstash approach to be a bit more flexible and more responsive to backpressure from the target cluster. Basically, you set up an elasticsearch{} input and an elasticsearch{} output in a Logstash pipeline. It handles one big scroll search for you, which you can tune with a refined query and/or the size of each scroll chunk. Heck, you could even parallelize this approach by splitting the query's timeframe across different Logstash pipelines.
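For anyone curious, here is a minimal sketch of such a pipeline. All hosts, index names, and credentials below are placeholders, and the exact SSL option names vary a bit between Logstash versions:

```conf
# pipeline.conf (sketch): scroll the source index into the target cluster.
input {
  elasticsearch {
    hosts    => ["https://<elasticsearch_node>:9200"]
    index    => "<source_index>"
    query    => '{ "query": { "match_all": {} } }'
    size     => 1000        # documents per scroll page; tune for throughput
    scroll   => "5m"        # keep the scroll context alive between pages
    docinfo  => true        # expose _index/_id in [@metadata]
    user     => "<redacted>"
    password => "<redacted>"
    ssl      => true
    ca_file  => "ca-certs.pem"
  }
}
output {
  elasticsearch {
    hosts       => ["https://<opensearch_node>:9200"]
    index       => "<target_index>"
    document_id => "%{[@metadata][_id]}"   # preserve original document IDs
    user        => "admin"
    password    => "<redacted>"
    ssl         => true
    cacert      => "ca-certs.pem"
  }
}
```

Narrowing the query (e.g. to a time range) is also how you'd split the job across parallel pipelines.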

If it's really a big job and you want it done right, I also recommend throwing Kafka into the mix. Have one Logstash pipeline pull from the source cluster and output into a Kafka topic, and have another Logstash pipeline pull from that topic and write into your target cluster.
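Sketched out, the Kafka-buffered version is just two pipelines; the broker address and topic name here are made up for illustration:

```conf
# Pipeline 1 (sketch): source cluster -> Kafka topic
input {
  elasticsearch { hosts => ["https://<elasticsearch_node>:9200"] index => "<source_index>" docinfo => true }
}
output {
  kafka { bootstrap_servers => "kafka:9092" topic_id => "reindex-buffer" codec => json }
}

# Pipeline 2 (sketch): Kafka topic -> target cluster
input {
  kafka { bootstrap_servers => "kafka:9092" topics => ["reindex-buffer"] codec => json }
}
output {
  elasticsearch { hosts => ["https://<opensearch_node>:9200"] index => "<target_index>" }
}
```

The topic decouples the two sides, so the consumer pipeline can be stopped and restarted (or scaled out) without re-running the scroll against the source cluster.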
