We just updated to 1.13.1 and we seem to be having heap memory issues.
We can see this on our data-only node:
CircuitBreakingException[[parent] Data too large, data for [internal:index/shard/recovery/start_recovery] would be [33575573184/31.2gb], which is larger than the limit of [32641751449/30.3gb], real usage: [33575572272/31.2gb], new bytes reserved: [912/912b], usages [request=0/0b, fielddata=0/0b, in_flight_requests=9908/9.6kb, accounting=4222558184/3.9gb]];
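That 30.3gb limit looks like the default for the parent breaker: with the real-memory circuit breaker enabled (the 7.x default), indices.breaker.total.limit is 95% of the heap, and 34359738368 * 0.95 = 32641751449 bytes, which matches exactly. To double-check we haven't overridden anything, I believe the effective values can be pulled with something like:

GET _cluster/settings?include_defaults=true&filter_path=defaults.indices.breaker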
So I looked at the node stats and found the following:
"jvm" : {
"timestamp" : 1616102416988,
"uptime_in_millis" : 25389419,
"mem" : {
"heap_used_in_bytes" : 32834308920,
"heap_used_percent" : 95,
"heap_committed_in_bytes" : 34359738368,
"heap_max_in_bytes" : 34359738368,
"non_heap_used_in_bytes" : 198294112,
"non_heap_committed_in_bytes" : 207482880,
"pools" : {
"young" : {
"used_in_bytes" : 402653184,
"max_in_bytes" : 0,
"peak_used_in_bytes" : 20501757952,
"peak_max_in_bytes" : 0
},
"old" : {
"used_in_bytes" : 32408545328,
"max_in_bytes" : 34359738368,
"peak_used_in_bytes" : 34307804680,
"peak_max_in_bytes" : 34359738368
},
"survivor" : {
"used_in_bytes" : 23110408,
"max_in_bytes" : 0,
"peak_used_in_bytes" : 1583291440,
"peak_max_in_bytes" : 0
}
}
},
"threads" : {
"count" : 227,
"peak_count" : 368
},
"gc" : {
"collectors" : {
"young" : {
"collection_count" : 3560,
"collection_time_in_millis" : 107619
},
"old" : {
"collection_count" : 360,
"collection_time_in_millis" : 1131408
}
}
},
"buffer_pools" : {
"mapped" : {
"count" : 157716,
"used_in_bytes" : 3579872863310,
"total_capacity_in_bytes" : 3579872863310
},
"direct" : {
"count" : 190,
"used_in_bytes" : 68222478,
"total_capacity_in_bytes" : 68222477
},
"mapped - 'non-volatile memory'" : {
"count" : 0,
"used_in_bytes" : 0,
"total_capacity_in_bytes" : 0
}
}
"breakers" : {
"request" : {
"limit_size_in_bytes" : 20615843020,
"limit_size" : "19.1gb",
"estimated_size_in_bytes" : 0,
"estimated_size" : "0b",
"overhead" : 1.0,
"tripped" : 0
},
"fielddata" : {
"limit_size_in_bytes" : 13743895347,
"limit_size" : "12.7gb",
"estimated_size_in_bytes" : 0,
"estimated_size" : "0b",
"overhead" : 1.03,
"tripped" : 0
},
"in_flight_requests" : {
"limit_size_in_bytes" : 34359738368,
"limit_size" : "32gb",
"estimated_size_in_bytes" : 563,
"estimated_size" : "563b",
"overhead" : 2.0,
"tripped" : 0
},
"accounting" : {
"limit_size_in_bytes" : 34359738368,
"limit_size" : "32gb",
"estimated_size_in_bytes" : 4222558184,
"estimated_size" : "3.9gb",
"overhead" : 1.0,
"tripped" : 0
},
"parent" : {
"limit_size_in_bytes" : 32641751449,
"limit_size" : "30.3gb",
"estimated_size_in_bytes" : 32834308920,
"estimated_size" : "30.5gb",
"overhead" : 1.0,
"tripped" : 13593254
}
}
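For reference, these are the jvm and breakers sections of the node stats, which I believe come from a request along these lines:

GET _nodes/stats/jvm,breaker?human

One thing I notice: the parent breaker's estimated_size_in_bytes (32834308920) is exactly heap_used_in_bytes, which I assume means indices.breaker.total.use_real_memory is still at its default of true, so the parent breaker trips whenever real heap usage stays above the ~95% limit.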
I looked in the gc.log and didn't find any further clues as to why the memory usage is so high:
[2021-03-18T19:39:04.440+0000][1][gc,start ] GC(3275) Pause Young (Concurrent Start) (G1 Humongous Allocation)
[2021-03-18T19:39:04.440+0000][1][gc,task ] GC(3275) Using 43 workers of 43 for evacuation
[2021-03-18T19:39:04.440+0000][1][gc,age ] GC(3275) Desired survivor size 109051904 bytes, new threshold 15 (max threshold 15)
[2021-03-18T19:39:04.461+0000][1][gc,age ] GC(3275) Age table with threshold 15 (max threshold 15)
[2021-03-18T19:39:04.461+0000][1][gc,age ] GC(3275) - age 1: 12946144 bytes, 12946144 total
[2021-03-18T19:39:04.461+0000][1][gc,phases ] GC(3275) Pre Evacuate Collection Set: 0.9ms
[2021-03-18T19:39:04.461+0000][1][gc,phases ] GC(3275) Merge Heap Roots: 1.5ms
[2021-03-18T19:39:04.461+0000][1][gc,phases ] GC(3275) Evacuate Collection Set: 13.8ms
[2021-03-18T19:39:04.461+0000][1][gc,phases ] GC(3275) Post Evacuate Collection Set: 3.9ms
[2021-03-18T19:39:04.461+0000][1][gc,phases ] GC(3275) Other: 1.0ms
[2021-03-18T19:39:04.461+0000][1][gc,heap ] GC(3275) Eden regions: 22->0(101)
[2021-03-18T19:39:04.461+0000][1][gc,heap ] GC(3275) Survivor regions: 0->1(13)
[2021-03-18T19:39:04.461+0000][1][gc,heap ] GC(3275) Old regions: 1972->1972
[2021-03-18T19:39:04.461+0000][1][gc,heap ] GC(3275) Archive regions: 0->0
[2021-03-18T19:39:04.461+0000][1][gc,heap ] GC(3275) Humongous regions: 1->0
[2021-03-18T19:39:04.461+0000][1][gc,metaspace ] GC(3275) Metaspace: 112607K(115708K)->112607K(115708K) NonClass: 99340K(101500K)->99340K(101500K) Class: 13267K(14208K)->13267K(14208K)
[2021-03-18T19:39:04.461+0000][1][gc ] GC(3275) Pause Young (Concurrent Start) (G1 Humongous Allocation) 31546M->31202M(32768M) 21.182ms
[2021-03-18T19:39:04.461+0000][1][gc,cpu ] GC(3275) User=0.64s Sys=0.04s Real=0.02s
[2021-03-18T19:39:04.461+0000][1][gc ] GC(3276) Concurrent Cycle
[2021-03-18T19:39:04.461+0000][1][safepoint ] Safepoint "G1TryInitiateConcMark", Time since last: 960189638 ns, Reaching safepoint: 582197 ns, At safepoint: 21526113 ns, Total: 22108310 ns
[2021-03-18T19:39:04.462+0000][1][gc,marking ] GC(3276) Concurrent Clear Claimed Marks
[2021-03-18T19:39:04.462+0000][1][gc,marking ] GC(3276) Concurrent Clear Claimed Marks 0.137ms
[2021-03-18T19:39:04.462+0000][1][gc,marking ] GC(3276) Concurrent Scan Root Regions
[2021-03-18T19:39:04.482+0000][1][gc,marking ] GC(3276) Concurrent Scan Root Regions 19.986ms
[2021-03-18T19:39:04.482+0000][1][gc,marking ] GC(3276) Concurrent Mark (19316.913s)
[2021-03-18T19:39:04.482+0000][1][gc,marking ] GC(3276) Concurrent Mark From Roots
[2021-03-18T19:39:04.482+0000][1][gc,task ] GC(3276) Using 11 workers of 11 for marking
[2021-03-18T19:39:05.462+0000][1][safepoint ] Safepoint "Cleanup", Time since last: 1000186658 ns, Reaching safepoint: 603855 ns, At safepoint: 43795 ns, Total: 647650 ns
[2021-03-18T19:39:08.564+0000][1][gc,marking ] GC(3276) Concurrent Mark From Roots 4082.489ms
[2021-03-18T19:39:08.564+0000][1][gc,marking ] GC(3276) Concurrent Preclean
[2021-03-18T19:39:08.617+0000][1][gc,marking ] GC(3276) Concurrent Preclean 52.501ms
[2021-03-18T19:39:08.617+0000][1][gc,marking ] GC(3276) Concurrent Mark (19316.913s, 19321.048s) 4135.083ms
[2021-03-18T19:39:08.617+0000][1][gc,start ] GC(3276) Pause Remark
[2021-03-18T19:39:08.624+0000][1][gc ] GC(3276) Pause Remark 31709M->31702M(32768M) 7.209ms
[2021-03-18T19:39:08.625+0000][1][gc,cpu ] GC(3276) User=0.17s Sys=0.04s Real=0.00s
[2021-03-18T19:39:08.625+0000][1][safepoint ] Safepoint "G1Concurrent", Time since last: 3154514507 ns, Reaching safepoint: 353958 ns, At safepoint: 7443548 ns, Total: 7797506 ns
[2021-03-18T19:39:08.625+0000][1][gc,marking ] GC(3276) Concurrent Rebuild Remembered Sets
[2021-03-18T19:39:09.504+0000][1][gc,start ] GC(3277) Pause Young (Normal) (G1 Humongous Allocation)
[2021-03-18T19:39:09.504+0000][1][gc,task ] GC(3277) Using 43 workers of 43 for evacuation
[2021-03-18T19:39:09.504+0000][1][gc,age ] GC(3277) Desired survivor size 109051904 bytes, new threshold 15 (max threshold 15)
[2021-03-18T19:39:09.541+0000][1][gc,age ] GC(3277) Age table with threshold 15 (max threshold 15)
[2021-03-18T19:39:09.541+0000][1][gc ] GC(3277) To-space exhausted
[2021-03-18T19:39:09.541+0000][1][gc,phases ] GC(3277) Pre Evacuate Collection Set: 0.8ms
[2021-03-18T19:39:09.541+0000][1][gc,phases ] GC(3277) Merge Heap Roots: 1.6ms
[2021-03-18T19:39:09.541+0000][1][gc,phases ] GC(3277) Evacuate Collection Set: 14.4ms
[2021-03-18T19:39:09.541+0000][1][gc,phases ] GC(3277) Post Evacuate Collection Set: 20.3ms
[2021-03-18T19:39:09.541+0000][1][gc,phases ] GC(3277) Other: 0.5ms
[2021-03-18T19:39:09.541+0000][1][gc,heap ] GC(3277) Eden regions: 34->0(102)
[2021-03-18T19:39:09.541+0000][1][gc,heap ] GC(3277) Survivor regions: 1->0(0)
[2021-03-18T19:39:09.541+0000][1][gc,heap ] GC(3277) Old regions: 1967->1995
[2021-03-18T19:39:09.541+0000][1][gc,heap ] GC(3277) Archive regions: 0->0
[2021-03-18T19:39:09.541+0000][1][gc,heap ] GC(3277) Humongous regions: 46->0
[2021-03-18T19:39:09.541+0000][1][gc,metaspace ] GC(3277) Metaspace: 112607K(115708K)->112607K(115708K) NonClass: 99340K(101500K)->99340K(101500K) Class: 13267K(14208K)->13267K(14208K)
[2021-03-18T19:39:09.541+0000][1][gc ] GC(3277) Pause Young (Normal) (G1 Humongous Allocation) 32465M->31617M(32768M) 37.608ms
[2021-03-18T19:39:09.541+0000][1][gc,cpu ] GC(3277) User=0.85s Sys=0.05s Real=0.04s
[2021-03-18T19:39:09.541+0000][1][safepoint ] Safepoint "G1CollectForAllocation", Time since last: 877640595 ns, Reaching safepoint: 1084860 ns, At safepoint: 37973528 ns, Total: 39058388 ns
[2021-03-18T19:39:11.416+0000][1][safepoint ] Safepoint "PrintThreads", Time since last: 1860186604 ns, Reaching safepoint: 953011 ns, At safepoint: 13170786 ns, Total: 14123797 ns
[2021-03-18T19:39:11.417+0000][1][safepoint ] Safepoint "PrintJNI", Time since last: 180516 ns, Reaching safepoint: 1018775 ns, At safepoint: 37052 ns, Total: 1055827 ns
[2021-03-18T19:39:11.418+0000][1][safepoint ] Safepoint "FindDeadlocks", Time since last: 70564 ns, Reaching safepoint: 940848 ns, At safepoint: 57897 ns, Total: 998745 ns
[2021-03-18T19:39:12.842+0000][1][gc,marking ] GC(3276) Concurrent Rebuild Remembered Sets 4217.641ms
[2021-03-18T19:39:12.843+0000][1][gc,start ] GC(3276) Pause Cleanup
[2021-03-18T19:39:12.856+0000][1][gc ] GC(3276) Pause Cleanup 31877M->31877M(32768M) 12.297ms
[2021-03-18T19:39:12.856+0000][1][gc,cpu ] GC(3276) User=0.03s Sys=0.05s Real=0.01s
[2021-03-18T19:39:12.856+0000][1][safepoint ] Safepoint "G1Concurrent", Time since last: 1424551309 ns, Reaching safepoint: 848540 ns, At safepoint: 12507145 ns, Total: 13355685 ns
[2021-03-18T19:39:12.856+0000][1][gc,marking ] GC(3276) Concurrent Cleanup for Next Mark
[2021-03-18T19:39:12.872+0000][1][gc,marking ] GC(3276) Concurrent Cleanup for Next Mark 16.241ms
[2021-03-18T19:39:12.872+0000][1][gc ] GC(3276) Concurrent Cycle 8410.749ms
It's running in Docker containers (the official image) with the default G1GC parameters:
-XX:+UseG1GC
-XX:G1ReservePercent=25
-XX:InitiatingHeapOccupancyPercent=30
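The GC log above shows the collections being triggered by G1 Humongous Allocation. As far as I understand G1, with a 32 GB heap the default region size works out to 16 MB, so any single allocation larger than 8 MB (half a region) is humongous and goes directly into old-gen regions, which would fit the old generation filling up between concurrent cycles. One thing I'm tempted to try (my own idea, not something from the docs) is raising the region size through an extra options file, which I believe recent 7.x images pick up from config/jvm.options.d:

# config/jvm.options.d/gc.options  (the file name is my choice)
-XX:G1HeapRegionSize=32m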
I know we have a high shard count: 12,000 shards running on 6 data nodes, 3 ingest nodes, and 3 master nodes.
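That works out to roughly 2,000 shards per data node; the rule of thumb I've seen from Elastic is to stay below about 20 shards per GB of heap, i.e. around 640 shards for a 32 GB heap, so we are far above that. To see how shards and segment memory are spread across the nodes, I've been using something like this (column names from memory, they may need adjusting):

GET _cat/allocation?v
GET _cat/nodes?v&h=name,heap.percent,segments.count,segments.memory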
Does anyone have an idea where to look to resolve this issue?