Versions (relevant - OpenSearch/Dashboard/Server OS/Browser):
2.14.0
Describe the issue:
We have Logstash creating daily indices of collected logs. I used curl to hit _cat/indices/$INDEX twice in about 30 seconds. The first time it reported docs.count 24347115 and docs.deleted 562865. The second time it reported docs.count 24357773 and docs.deleted 560369.
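Roughly what I ran, for reference (host and index name here are placeholders):

```bash
# First call
curl -s 'http://localhost:9200/_cat/indices/logstash-2024.06.01?v&h=index,docs.count,docs.deleted'

# ~30 seconds later, same index
curl -s 'http://localhost:9200/_cat/indices/logstash-2024.06.01?v&h=index,docs.count,docs.deleted'
```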
How is that possible?
I was initially investigating why we have such a high number of deleted documents for an index that should only ever be appended to. Am I misunderstanding what docs.deleted means?
Actually, neither OpenSearch nor Elasticsearch removes deleted documents right away. When you delete a document, the cluster only has to make sure it no longer shows up in searches, so there is no need to merge segments in real time; Lucene simply marks the deletion in a per-segment bitset of live documents.
(The tasks that merge segments need quite a lot of CPU and disk I/O, but you can still trigger one yourself with the _forcemerge API.)
Data nodes periodically merge segments in the background to keep performance up. When that happens, deleted documents are not copied into the new segments (only live documents are), and that is when they are physically removed.
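You can watch this per segment with the _cat/segments API. A minimal sketch, assuming a local cluster and a hypothetical index my-index:

```bash
# Delete one document: the segment holding it is not rewritten,
# the doc is only flagged in that segment's live-docs bitset
curl -s -XDELETE 'http://localhost:9200/my-index/_doc/1'

# Per-segment counts: the flagged doc shows up under docs.deleted
# and stays there until the segment is merged away
curl -s 'http://localhost:9200/my-index/_cat/segments?v&h=segment,docs.count,docs.deleted'
```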
Are you saying that docs.deleted means the number of documents marked for deletion, not the number of documents actually deleted?
Yes, until each data node finishes merging its segments. Once the old segments are combined into a new one, the documents that were marked as deleted are gone for good.
If that’s the case, then why does calling _forcemerge on an index whose docs.deleted value is in the hundreds of thousands have no effect on that value?
I can guarantee that nothing is being written to that index.
I can also guarantee that there is no way hundreds of thousands of documents were deleted from that index in the first place.
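For reference, this is roughly the sequence (host and index name are placeholders):

```bash
# Force a merge on the (idle) index; without max_num_segments the
# merge follows the normal merge policy
curl -s -XPOST 'http://localhost:9200/logstash-2024.06.01/_forcemerge'

# Check again afterwards: docs.deleted is still in the hundreds
# of thousands
curl -s 'http://localhost:9200/_cat/indices/logstash-2024.06.01?v&h=docs.count,docs.deleted'
```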
Can you share the number and size of your indices?
Also, your indices become read-only after a day, right? (As you explained, Logstash collects logs daily.) The reason I ask is that force-merging an index that is still receiving writes can produce very large segments and make snapshots more expensive than before. (Force merge API | Elasticsearch Guide [8.15] | Elastic)
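If they really are immutable after the day rolls over, you can also block writes explicitly before force-merging. A sketch, assuming a local cluster and a placeholder index name (index.blocks.write is a standard index setting):

```bash
# Block writes on yesterday's index so a force merge cannot race
# against ongoing indexing
curl -s -XPUT 'http://localhost:9200/logstash-2024.06.01/_settings' \
  -H 'Content-Type: application/json' \
  -d '{"index.blocks.write": true}'
```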
When you call the _forcemerge API, I recommend attaching the wait_for_completion=false query parameter so the merge runs asynchronously: the task keeps running in the background instead of dying with a lost connection.
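Something like this; host, index name, and task id are placeholders, and whether _forcemerge accepts wait_for_completion depends on your version:

```bash
# Kick off the merge asynchronously; the response body contains a task id
curl -s -XPOST 'http://localhost:9200/logstash-2024.06.01/_forcemerge?wait_for_completion=false'

# Poll the task (the id below is made up) until it reports "completed": true
curl -s 'http://localhost:9200/_tasks/oTUltX4IQMOUUVeiohTt8A:12345'
```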
TieredMergePolicy, which is allowed to merge non-adjacent segments (it sorts them by size), is the default merge policy in Lucene.
In my opinion (and I’m not sure), the force-merge API doesn’t always reclaim deleted documents.
As you can see in the graph in Changing Bits: Visualizing Lucene’s segment merges, the dark-grey band on top of each segment bar grows, i.e. the proportion of deletions in the segment keeps increasing, until TieredMergePolicy decides on an optimizing merge based on its “budget”.
TieredMergePolicy first computes the allowed “budget” of how many segments should be in the index, by counting how many steps the “perfect logarithmic staircase” would require given total index size, minimum segment size (floored), mergeAtOnce, and a new configuration maxSegmentsPerTier that lets you set the allowed width (number of segments) of each stair in the staircase. This is nice because it decouples how many segments to merge at a time from how wide the staircase can be.
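For what it’s worth, those knobs surface as index-level merge settings in OpenSearch/Elasticsearch (e.g. index.merge.policy.segments_per_tier for maxSegmentsPerTier, index.merge.policy.floor_segment for the floor size; exact names may vary by version). Assuming a local cluster, you can inspect the effective values like this:

```bash
# Dump the merge-policy settings for one index; include_defaults
# also shows values you haven't overridden
curl -s 'http://localhost:9200/my-index/_settings?include_defaults=true&filter_path=*.settings.index.merge,*.defaults.index.merge'
```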