Filtering across indices: Return IDs which are in one index but not in another

Hello community,

I am struggling with filtering across two indices.
The example is as follows: In the first index (index_1), there are data with values. I take the data values, multiply them by 10 and store them in index_2 with the same ID as they have in index_1. Then new data arrives in index_1. I do not want to recalculate all values for index_2 again but I only want to do the calculation for the new arrived data. Therefore, I need a query that only returns the IDs of index_1 which are not in index_2.

The first index looks like
index_1:
ID = 11, value_1 = 2.3
ID = 32, value_1 = 3.4
ID = 43, value_1 = 2.2
ID = 50, value_1 = 2.0

The second index looks like
index_2
ID = 11, value_2 = 23
ID = 23, value_2 = 34
ID = 43, value_2 = 22

The query should return ID = 50, value_1 = 2.0 from index_1 because ID50 is not in index_2.

I already tried a left join and a postfiltering, but as discussed in this topic, the postfiltering is not properly working.

Does anyone have an idea how to query this? It seems like a basic query but so far I didnt find any solution.

Best,
Jan

@JanK If your indices share a pattern, probably you can try out index pattern, ex. SELECT ID FROM index_*. Thanks!

@daichen Thanks for our reply!
I tried your suggestion. However, SELECT ID FROM index_* returns the union of the IDs from index_1 and index_2. I am looking for something like: index_1.ID not in index_2.ID. Do you have another idea how to query this?

Got it. I think what you need is set intersect operator. Unfortunately, only union and minus (difference) are supported in old engine and JSON format only. Not sure if this is useful to you or not. We will add full support to the new engine in future. You can open an issue for tracking if this is what you need. Thanks!

POST index1/_doc
{
  "id": 11
}

POST index1/_doc
{
  "id": 50
}

POST index2/_doc
{
  "id": 11
}

POST _opendistro/_sql?format=json
{
  "query": """
    SELECT id FROM index1
    MINUS
    SELECT id FROM index2
  """
}
{
  "took" : 0,
  "timed_out" : false,
  "_shards" : {
    "total" : 0,
    "successful" : 0,
    "failed" : 0
  },
  "hits" : {
    "hits" : [
      {
        "_type" : "_doc",
        "_source" : {
          "id" : 50
        },
        "_id" : "eJU8ZXgBL9gQF7tvnJj8",
        "_score" : "NaN"
      }
    ],
    "total" : {
      "value" : 1,
      "relation" : "EQUAL_TO"
    },
    "max_score" : 1.0
  }
}

@daichen thanks for your answer!
The minus operator is exactly what I am looking for. However, it did not work with my data as expected even though it worked with the example you provided. It took me some time to figure out what’s wrong. The problem is that in the indexes there are more than 10,000 entries.
For example
index_1:
ID = 11, value_1 = 2.3
ID = 32, value_1 = 3.4
ID = 43, value_1 = 2.2
ID = 50, value_1 = 2.0

index_2:
ID = 11, value_1 = 23
ID = 32, value_1 = 34
ID = 32, value_2 = 33
ID = 32, value_3 = 35

ID = 32, value_10000 = 38
ID = 43, value_1 = 22

ID is here just a placeholder for any identifier, each entry still has an uuid.
The Query

POST _opendistro/_sql?format=jso
 {
      "query": """
        SELECT id FROM index1
        MINUS
        SELECT id FROM index2
      """
}

only takes the first 10,000 results of index_2 into account: ID 11 and ID 32. The thus query returns ID 43 and ID 50. Although ID 43 is in index 2.

I tried using distinct but this also did not result in the correct output.

POST _opendistro/_sql?format=jso
 {
      "query": """
        SELECT id FROM index1
        MINUS
        SELECT DISTINCT id FROM index2
      """
}

Is there a solution how to circumvent the limitation of 10,000 elements?

Best,
Jan