Issue with async bulk request

Versions (relevant - OpenSearch/Dashboard/Server OS/Browser):
1.2.3

Describe the issue:
"All shards failed" errors on one thread are impacting an async bulk request on another thread, causing a socket timeout.

Configuration:

We have a single OpenSearch node and 0 replicas.

Relevant Logs or Screenshots:

We are using OpenSearch in our application. During application startup we create an OpenSearch index template and then index a few objects into a particular index. If we query for any document while the index is being created and its mappings updated, the query fails with an "all shards failed" error. Examples of the query failure from the OpenSearch logs and application logs are below.

Opensearch logs

[2024-03-19T08:24:41,099][INFO ][o.o.c.m.MetadataCreateIndexService] [node] [index1] creating index, cause [auto(bulk api)], templates [index_template], shards [1]/[0]
[2024-03-19T08:24:42,000][DEBUG][o.o.a.s.TransportSearchAction] [node] [eMocDAmPT92sqvwJNayyOA][index1][0]: Failed to execute [SearchRequest{searchType=QUERY_THEN_FETCH, indices=[index1, index2], indicesOptions=IndicesOptions[ignore_unavailable=true, allow_no_indices=true, expand_wildcards_open=true, expand_wildcards_closed=true, expand_wildcards_hidden=false, allow_aliases_to_multiple_indices=true, forbid_closed_indices=true, ignore_aliases=false, ignore_throttled=false], types=[], routing='null', preference='null', requestCache=null, scroll=null, maxConcurrentShardRequests=5, batchedReduceSize=512, preFilterShardSize=null, allowPartialSearchResults=true, localClusterAlias=null, getOrCreateAbsoluteStartMillis=-1, ccsMinimizeRoundtrips=true, source={"size":1,"query":{"query_string":{"query":"path:\"abc\"","fields":[],"type":"best_fields","default_operator":"or","max_determinized_states":10000,"enable_position_increments":true,"fuzziness":"AUTO","fuzzy_prefix_length":0,"fuzzy_max_expansions":50,"phrase_slop":0,"escape":false,"auto_generate_synonyms_phrase_query":true,"fuzzy_transpositions":true,"boost":1.0}},"_source":{"includes":["resource_type"],"excludes":[]}}, cancelAfterTimeInterval=null}] lastShard [true]
[2024-03-19T08:24:42,013][WARN ][r.suppressed             ] [node] path: /index1,index2/_search, params: {typed_keys=true, max_concurrent_shard_requests=5, ignore_unavailable=true, expand_wildcards=open,closed, allow_no_indices=true, ignore_throttled=false, index=index1,index2, search_type=query_then_fetch, batched_reduce_size=512, ccs_minimize_roundtrips=true}
[2024-03-19T08:25:22,112][INFO ][o.o.c.r.a.AllocationService] [node] Cluster health status changed from [YELLOW] to [GREEN] (reason: [shards started [[index1][0]]]).
[2024-03-19T08:25:22,142][INFO ][o.o.c.m.MetadataMappingService] [node] [index1/w4BFjXlDQnCyPcpATNeuTw] update_mapping [_doc]

Application Logs

      Suppressed: org.opensearch.client.ResponseException: method [POST], host [http://localhost:9200], URI [/index1,index2/_search?typed_keys=true&max_concurrent_shard_requests=5&ignore_unavailable=true&expand_wildcards=open%2Cclosed&allow_no_indices=true&ignore_throttled=false&search_type=query_then_fetch&batched_reduce_size=512&ccs_minimize_roundtrips=true], status line [HTTP/1.1 503 Service Unavailable]
{"error":{"root_cause":[],"type":"search_phase_execution_exception","reason":"all shards failed","phase":"query","grouped":true,"failed_shards":[]},"status":503}
                at org.opensearch.client.RestClient.convertResponse(RestClient.java:344) ~[?:?]
                at org.opensearch.client.RestClient.performRequest(RestClient.java:314) ~[?:?]
                at org.opensearch.client.RestClient.performRequest(RestClient.java:289) ~[?:?]
                at org.opensearch.client.RestHighLevelClient.internalPerformRequest(RestHighLevelClient.java:1762) ~[?:?]
                at org.opensearch.client.RestHighLevelClient.performRequest(RestHighLevelClient.java:1728) ~[?:?]
                at org.opensearch.client.RestHighLevelClient.performRequestAndParseEntity(RestHighLevelClient.java:1696) ~[?:?]
                at org.opensearch.client.RestHighLevelClient.search(RestHighLevelClient.java:1087) ~[?:?]

The above queries run on a single thread, where we observed that almost all of them fail with the "all shards failed" error.

Then, suddenly, after about 30 seconds of these failures on thread 1, the async bulk request we perform on another thread via openSearchClient.getRestClient().bulkAsync failed with a socket timeout error.

2024-03-19T08:25:11.436Z ERROR I/O dispatcher 7 IndexingServiceImpl 77297 - [nsx@6876 comp="nsx-manager" errorCode="MP60503" level="ERROR" subcomp="manager"] [Indexing: BatchProcessing] The Bulk indexing request could not be processed: org.opensearch.action.bulk.BulkRequest@26463834
java.net.SocketTimeoutException: 30,000 milliseconds timeout on connection http-outgoing-1 [ACTIVE]
        at org.apache.http.nio.protocol.HttpAsyncRequestExecutor.timeout(HttpAsyncRequestExecutor.java:387) ~[?:?]
        at org.apache.http.impl.nio.client.InternalIODispatch.onTimeout(InternalIODispatch.java:92) ~[?:?]
        at org.apache.http.impl.nio.client.InternalIODispatch.onTimeout(InternalIODispatch.java:39) ~[?:?]
        at org.apache.http.impl.nio.reactor.AbstractIODispatch.timeout(AbstractIODispatch.java:175) ~[?:?]
        at org.apache.http.impl.nio.reactor.BaseIOReactor.sessionTimedOut(BaseIOReactor.java:263) ~[?:?]
        at org.apache.http.impl.nio.reactor.AbstractIOReactor.timeoutCheck(AbstractIOReactor.java:492) ~[?:?]
        at org.apache.http.impl.nio.reactor.BaseIOReactor.validate(BaseIOReactor.java:213) ~[?:?]
        at org.apache.http.impl.nio.reactor.AbstractIOReactor.execute(AbstractIOReactor.java:280) ~[?:?]
        at org.apache.http.impl.nio.reactor.BaseIOReactor.execute(BaseIOReactor.java:104) ~[?:?]
        at org.apache.http.impl.nio.reactor.AbstractMultiworkerIOReactor$Worker.run(AbstractMultiworkerIOReactor.java:591) ~[?:?]
        at java.lang.Thread.run(Unknown Source) ~[?:?]
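
For reference, the async bulk call is invoked roughly like this (a simplified sketch; openSearchClient.getRestClient() is our own wrapper returning the RestHighLevelClient, the BulkRequest is built elsewhere, and the listener bodies are illustrative):

import org.opensearch.action.ActionListener;
import org.opensearch.action.bulk.BulkRequest;
import org.opensearch.action.bulk.BulkResponse;
import org.opensearch.client.RequestOptions;

// Transport-level failures such as the SocketTimeoutException above are
// delivered to onFailure; per-item indexing failures arrive in onResponse.
openSearchClient.getRestClient().bulkAsync(bulkRequest, RequestOptions.DEFAULT,
        new ActionListener<BulkResponse>() {
            @Override
            public void onResponse(BulkResponse response) {
                if (response.hasFailures()) {
                    // handle per-item failures (e.g. mapping conflicts)
                }
            }

            @Override
            public void onFailure(Exception e) {
                // e is the java.net.SocketTimeoutException from the log above
            }
        });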

This is how we create the REST client; the same client is used for both querying and indexing by autowiring the restClient below.

RestClientBuilder builder = RestClient.builder(new HttpHost(searchHostName, 9200, protocol));
restClient = new RestHighLevelClient(builder);
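
As a side note, the 30,000 ms in the SocketTimeoutException matches the REST client's default socket timeout. If slow responses during startup are expected, the timeout can be raised when building the client; this is a sketch with hypothetical values, and it treats the symptom rather than the root cause:

import org.apache.http.HttpHost;
import org.opensearch.client.RestClient;
import org.opensearch.client.RestClientBuilder;
import org.opensearch.client.RestHighLevelClient;

RestClientBuilder builder = RestClient.builder(new HttpHost(searchHostName, 9200, protocol))
        // The default socket timeout is 30 s; raise it to tolerate slow startup phases.
        .setRequestConfigCallback(requestConfig -> requestConfig
                .setConnectTimeout(5_000)       // hypothetical value
                .setSocketTimeout(60_000));     // hypothetical value
RestHighLevelClient restClient = new RestHighLevelClient(builder);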

One solution we thought would be good is to check the cluster/index status before executing queries, so that we don't run into the "all shards failed" error; see the sketch below.
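
Something along these lines, using the high-level client's cluster health API (the index names and timeout are illustrative):

import org.opensearch.action.admin.cluster.health.ClusterHealthRequest;
import org.opensearch.action.admin.cluster.health.ClusterHealthResponse;
import org.opensearch.client.RequestOptions;
import org.opensearch.common.unit.TimeValue;

// waitForYellowStatus() returns once the status is YELLOW or better, i.e.
// all primary shards of the target indices are active, which should avoid
// the "all shards failed" error during index creation.
ClusterHealthRequest healthRequest = new ClusterHealthRequest("index1", "index2");
healthRequest.waitForYellowStatus();
healthRequest.timeout(TimeValue.timeValueSeconds(10)); // illustrative timeout
ClusterHealthResponse health = restClient.cluster().health(healthRequest, RequestOptions.DEFAULT);
if (health.isTimedOut()) {
    // Shards are still initializing; delay or retry the first query.
}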

But we are curious to understand why it impacted the async bulk request on the other thread, given that the OpenSearch Java client is thread safe.

I think the bulk request is not impacted by the query directly, but it could be impacted indirectly by high CPU/memory consumption or high I/O utilization in the cluster caused by the query.
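
One way to verify that is to sample node resource stats while the failures are happening, e.g. via the low-level client (a sketch):

import org.apache.http.util.EntityUtils;
import org.opensearch.client.Request;
import org.opensearch.client.Response;

// Sample CPU, JVM heap, and filesystem stats during the failure window to see
// whether the node is resource-bound while the index is being created.
Request statsRequest = new Request("GET", "/_nodes/stats/os,jvm,fs");
Response statsResponse = restClient.getLowLevelClient().performRequest(statsRequest);
System.out.println(EntityUtils.toString(statsResponse.getEntity()));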