I have 2 indices, index_a and index_b.

The 2 indices have documents with almost exactly the same template. index_b has some extra fields, which were introduced as part of a new feature, but it also has all the fields already present in index_a.

We have noticed that index_b is missing about 2000 documents that are present in index_a. We found this out by using the _count API.

Now the question is: is there a way to find out which documents are actually missing? Just the missing IDs would be enough for a start.

Both indices have a field called member_id, which is unique for each document and is the same as the document ID, so retrieving only the missing member_id values would also be enough.

I cannot compare the index directly to the source database because this data comes from an external API.

What I would do:

  • have a script that scrolls through all data in index_a
  • for every page, take the list of IDs and run a terms query for them against index_b (set the search size to the page size, otherwise only the default 10 hits come back)
  • if you get the whole page back (say your page size is 1000 and the search returns 1000 docs), then all the documents in that page are in index_b. If not, compare the two ID lists in your script to identify the missing docs, and add them (or their IDs) to a list in your script
  • repeat all of the above until you have scrolled through all the data. In the end, the list in your script should contain your missing documents

All this shouldn’t put too much load on OpenSearch, since the script only issues one scroll request and one terms query per page.
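
The steps above could be sketched roughly like this in Python. This is only a sketch under a few assumptions: `client` is expected to behave like an opensearch-py `OpenSearch` client (only `search`, `scroll` and `clear_scroll` are used), member_id is mapped so that a terms query matches it exactly (for example as a keyword), and the function names `iter_id_pages` and `find_missing_ids` are made up for illustration.

```python
from typing import Any, Iterator, List


def iter_id_pages(client: Any, index: str, page_size: int = 1000) -> Iterator[List[str]]:
    """Yield pages of document IDs from `index` using the scroll API."""
    resp = client.search(
        index=index,
        scroll="2m",  # keep the scroll context alive between pages
        body={
            "size": page_size,
            "_source": False,  # only the IDs are needed
            "query": {"match_all": {}},
            "sort": ["_doc"],  # cheapest order for a full scroll
        },
    )
    scroll_id = resp.get("_scroll_id")
    try:
        while resp["hits"]["hits"]:
            yield [hit["_id"] for hit in resp["hits"]["hits"]]
            resp = client.scroll(scroll_id=scroll_id, scroll="2m")
            scroll_id = resp.get("_scroll_id", scroll_id)
    finally:
        # Free the scroll context on the cluster once we are done.
        if scroll_id:
            client.clear_scroll(scroll_id=scroll_id)


def find_missing_ids(
    client: Any,
    source_index: str,
    target_index: str,
    id_field: str = "member_id",  # assumption: same value as the document ID
    page_size: int = 1000,
) -> List[str]:
    """Return IDs that exist in `source_index` but not in `target_index`."""
    missing: List[str] = []
    for ids in iter_id_pages(client, source_index, page_size):
        resp = client.search(
            index=target_index,
            body={
                "size": len(ids),  # ask for the whole page back
                "_source": False,
                "query": {"terms": {id_field: ids}},
            },
        )
        found = {hit["_id"] for hit in resp["hits"]["hits"]}
        # A full page back means nothing is missing for this page.
        if len(found) < len(ids):
            missing.extend(i for i in ids if i not in found)
    return missing
```

With a real client, the call would look something like `find_missing_ids(client, "index_a", "index_b")`. If member_id turns out to be mapped as analyzed text, an `ids` query on the document IDs might be a safer choice than the terms query used here.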