Versions (relevant - OpenSearch/Dashboard/Server OS/Browser):
OpenSearch 2.8 hosted in AWS
OS - Windows 11
Browser - Chrome
Describe the issue:
I am trying to use Apache Spark's newAPIHadoopRDD function to read an index (by sending a query) and convert the results to an RDD for further transformations. However, calling the function always results in the error below:
Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.newAPIHadoopRDD.
: org.opensearch.hadoop.OpenSearchHadoopIllegalArgumentException: Cannot detect OpenSearch version - typically this happens if the network/OpenSearch cluster is not accessible or when targeting a WAN/Cloud instance without the proper setting 'opensearch.nodes.wan.only'
at org.opensearch.hadoop.rest.InitializationUtils.discoverAndValidateClusterInfo(InitializationUtils.java:348)
Below is the code snippet:
from pyspark.sql import SparkSession

spark = (SparkSession.builder.master("local[*]")
    .config("spark.jars", r"Dir..\opensearch-spark-30_2.12-1.3.0.jar")
    # ... further builder options elided ...
    .getOrCreate())

es_conf = {
    "opensearch.nodes": "https://…amazonaws.com",
    "opensearch.port": "9200",
    "opensearch.net.ssl": "true",
    "opensearch.resource": "TestResource",
    "opensearch.query": "{TestQuery}",
    "opensearch.nodes.wan.only": "true",
    "opensearch.net.http.auth.user": os_user,
    "opensearch.net.http.auth.pass": os_password,
    "opensearch.batch.size.bytes": "300000000",
    "opensearch.batch.size.entries": "10000",
    "opensearch.batch.write.refresh": "false",
    "opensearch.batch.write.retry.count": "2",
    "opensearch.batch.write.retry.wait": "500",
    "opensearch.http.timeout": "30m",
    "opensearch.http.retries": "2",
    "opensearch.action.heart.beat.lead": "50",
    "opensearch.scroll.size": "10000",
    "opensearch.net.ssl.cert.allow.self.signed": "true",
    "opensearch.net.ssl.protocol": "TLS"
}

# ... create Spark context ...
sc = spark.sparkContext
sc.setLogLevel("ALL")

es_rdd = sc.newAPIHadoopRDD(
    inputFormatClass="org.opensearch.hadoop.mr.OpenSearchInputFormat",
    keyClass="org.apache.hadoop.io.NullWritable",
    valueClass="org.opensearch.hadoop.mr.LinkedMapWritable",
    conf=es_conf)
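Since the exception says the cluster cannot be reached ("Cannot detect OpenSearch version"), I can check basic connectivity and credentials from the same machine without involving Spark. A minimal sketch using the requests library; the endpoint is a placeholder for the actual AWS domain, and os_user/os_password are the same variables used in es_conf:

import requests

# Placeholder endpoint; verify=False mirrors allowing self-signed certificates.
url = "https://xxxxxx.amazonaws.com:9200"
resp = requests.get(url, auth=(os_user, os_password), verify=False)
print(resp.status_code, resp.json())  # a reachable cluster returns its name/version info

If this request also failed, the problem would be network-level rather than in the connector configuration.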
However, simply reading the index using 'load' works without any issues. Code snippet below:
# Test OS
from pyspark.sql import SparkSession

osurl = "xxxxxx.amazonaws.com:9200"
osusr = "admin"
ospwd = "testpasswd"

spark = (SparkSession.builder.master("local[*]")
    .config("spark.jars", r"LocalDir..\opensearch-spark-30_2.12-1.3.0.jar")
    .config("opensearch.nodes", osurl)
    .config("opensearch.net.http.auth.user", osusr)
    .config("opensearch.net.http.auth.pass", ospwd)
    .config("opensearch.net.ssl", "true")
    .config("opensearch.nodes.wan.only", "true")
    .config("opensearch.net.ssl.cert.allow.self.signed", "true")
    .getOrCreate())

try:
    df = (spark.read
        .format("org.opensearch.spark.sql")
        .option("opensearch.nodes", osurl)
        .option("opensearch.port", "9200")
        .option("opensearch.net.http.auth.user", osusr)
        .option("opensearch.net.http.auth.pass", ospwd)
        .option("opensearch.net.ssl", "true")
        .option("opensearch.nodes.wan.only", "true")
        .option("opensearch.net.ssl.cert.allow.self.signed", "true")
        .load("TestIndex"))
    df.show(5)
    print("Read successful!")
except Exception as e:
    print(f"Read failed: {e}")
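Since the DataFrame path works, one workaround I am considering is converting the loaded DataFrame to an RDD for the downstream transformations instead of going through newAPIHadoopRDD. A minimal sketch; the resulting records are plain dicts keyed by the index's field names rather than (NullWritable, LinkedMapWritable) pairs:

# Convert the successfully loaded DataFrame into an RDD of dicts.
os_rdd = df.rdd.map(lambda row: row.asDict())
print(os_rdd.take(2))

Still, I would like to understand why newAPIHadoopRDD cannot connect while the DataFrame reader can, given that both use the same endpoint, credentials, and wan.only setting.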
Configuration:
Windows 11 Enterprise
Jupyter Notebook (JupyterLab 4.4.0)
Python 3.9
Spark 3.5.5 (Scala 2.12.18)
Java - 17.0.2
opensearch-spark-30_2.12-1.3.0.jar
Relevant Logs or Screenshots:
Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.newAPIHadoopRDD.
: org.opensearch.hadoop.OpenSearchHadoopIllegalArgumentException: Cannot detect OpenSearch version - typically this happens if the network/OpenSearch cluster is not accessible or when targeting a WAN/Cloud instance without the proper setting 'opensearch.nodes.wan.only'
	at org.opensearch.hadoop.rest.InitializationUtils.discoverAndValidateClusterInfo(InitializationUtils.java:348)
	at org.opensearch.hadoop.rest.RestService.findPartitions(RestService.java:229)
	at org.opensearch.hadoop.mr.OpenSearchInputFormat.getSplits(OpenSearchInputFormat.java:425)
	at org.opensearch.hadoop.mr.OpenSearchInputFormat.getSplits(OpenSearchInputFormat.java:405)
	at org.apache.spark.rdd.NewHadoopRDD.getPartitions(NewHadoopRDD.scala:138)
	at org.apache.spark.rdd.RDD.$anonfun$partitions$2(RDD.scala:294)
	at scala.Option.getOrElse(Option.scala:189)
	at org.apache.spark.rdd.RDD.partitions(RDD.scala:290)
	at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:49)
	at org.apache.spark.rdd.RDD.$anonfun$partitions$2(RDD.scala:294)
	at scala.Option.getOrElse(Option.scala:189)
	at org.apache.spark.rdd.RDD.partitions(RDD.scala:290)
	at org.apache.spark.rdd.RDD.$anonfun$take$1(RDD.scala:1471)
	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
	at org.apache.spark.rdd.RDD.withScope(RDD.scala:410)
	at org.apache.spark.rdd.RDD.take(RDD.scala:1465)
	at org.apache.spark.api.python.SerDeUtil$.pairRDDToPython(SerDeUtil.scala:173)
	at org.apache.spark.api.python.PythonRDD$.newAPIHadoopRDD(PythonRDD.scala:400)
	at org.apache.spark.api.python.PythonRDD.newAPIHadoopRDD(PythonRDD.scala)
	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:77)
	at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.base/java.lang.reflect.Method.invoke(Method.java:568)
	at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
	at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:374)
	at py4j.Gateway.invoke(Gateway.java:282)
	at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
	at py4j.commands.CallCommand.execute(CallCommand.java:79)
	at py4j.ClientServerConnection.waitForCommands(ClientServerConnection.java:182)
	at py4j.ClientServerConnection.run(ClientServerConnection.java:106)
	at java.base/java.lang.Thread.run(Thread.java:833)
Caused by: org.opensearch.hadoop.rest.OpenSearchHadoopNoNodesLeftException: Connection error (check network and/or proxy settings)- all nodes failed; tried [[https://xxxxxx.amazonaws.com:9200]]
	at org.opensearch.hadoop.rest.NetworkClient.execute(NetworkClient.java:170)
	at org.opensearch.hadoop.rest.RestClient.execute(RestClient.java:443)
	at org.opensearch.hadoop.rest.RestClient.execute(RestClient.java:439)
	at org.opensearch.hadoop.rest.RestClient.execute(RestClient.java:407)
	at org.opensearch.hadoop.rest.RestClient.mainInfo(RestClient.java:720)
	at org.opensearch.hadoop.rest.InitializationUtils.discoverAndValidateClusterInfo(InitializationUtils.java:340)
	... 31 more