JVM发生频繁 CMS GC,罪魁祸首是这个参数!
了解 CMS GC 的同學(xué),一定知道 -XX:CMSScavengeBeforeRemark 參數(shù),它是用來開啟或關(guān)閉在 CMS-remark 階段之前的清除(Young GC)嘗試。
大家都知道CMS GC 只會回收 OldGen 的對象,那為什么需要這個(gè)參數(shù)? 由于 YoungGen 存在引用 OldGen 對象的情況,因此 CMS-remark 階段會將 YoungGen 作為 OldGen 的 “GC ROOTS” 進(jìn)行掃描,防止回收了不該回收的對象。而配置 -XX:+CMSScavengeBeforeRemark 參數(shù),在 CMS GC 的 CMS-remark 階段開始前先進(jìn)行一次 Young GC,有利于減少 Young Gen 對 Old Gen 的無效引用,降低 CMS-remark 階段的時(shí)間開銷。
這篇文章的內(nèi)容是業(yè)務(wù)開發(fā)同學(xué)遇到的奇怪的頻繁 CMS GC 問題,我們一起定位排查,最終發(fā)現(xiàn)跟 -XX:CMSScavengeBeforeRemark 參數(shù)相關(guān)。
問題
頻繁 Full GC
業(yè)務(wù)開發(fā)同學(xué)通過監(jiān)控發(fā)現(xiàn)線上一臺機(jī)器頻繁 CMS GC,下圖是 CMS GC 監(jiān)控圖,大約從 20 點(diǎn) 5-15 分,每分鐘 8-11 次的持續(xù) CMS GC。
說明:公司監(jiān)控對 Old GC 與 Full GC 是不區(qū)分的,案例中講的其實(shí)是 CMS GC。
OldGen 使用空間占比
從下圖 OldGen 的使用監(jiān)控圖來看,剛開始 OldGen 對象占用 OldGen 約 80% 的空間,經(jīng)過 CMS GC 后,幾乎立馬空間使用的占用比例約在 30% 以下。
JVM 參數(shù)
//JDK 版本 "1.8.0_45" -Xms5324m -Xmx5324m -Xss512k -XX:PermSize=384m -XX:MaxPermSize=384m -XX:MetaspaceSize=256m -XX:MaxMetaspaceSize=256m -XX:NewSize=2048m -XX:MaxNewSize=2048m -XX:SurvivorRatio=8 -XX:MaxTenuringThreshold=9 -XX:+UseConcMarkSweepGC -XX:+UseCMSInitiatingOccupancyOnly -XX:+CMSScavengeBeforeRemark -XX:+ScavengeBeforeFullGC -XX:+UseCMSCompactAtFullCollection -XX:+CMSParallelRemarkEnabled -XX:CMSFullGCsBeforeCompaction=9 -XX:CMSInitiatingOccupancyFraction=80 -XX:+CMSClassUnloadingEnabled -XX:SoftRefLRUPolicyMSPerMB=0 -XX:-ReduceInitialCardMarks -XX:+CMSPermGenSweepingEnabled -XX:CMSInitiatingPermOccupancyFraction=80 -XX:+ExplicitGCInvokesConcurrent -XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:+PrintGCApplicationConcurrentTime -XX:+PrintGCApplicationStoppedTime -XX:+PrintHeapAtGC -Xloggc:/data/applogs/heap_trace.txt -XX:-HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/data/applogs/HeapDumpOnOutOfMemoryError -XX:+IgnoreUnrecognizedVMOptions結(jié)合 OldGen 的使用空間占比與 JVM 參數(shù)(-XX:+UseCMSInitiatingOccupancyOnly -XX:CMSInitiatingOccupancyFraction=80),幾乎可以斷定第一次 CMS GC 是因?yàn)?OldGen 的使用占比到達(dá)了 OldGen 總量的 80%。
疑惑
第一次觸發(fā) CMS GC 可以通過 OldGen 的使用占比到達(dá)了 OldGen 總量的 80% 來解釋,但是通過監(jiān)控可以看到后來 OldGen 使用占比降低到 30% 以下,為什么還一直頻繁進(jìn)行 CMS GC?
分析
GC 監(jiān)控圖展示的還不夠全面,具體問題還是要通過 GC 日志進(jìn)行定位,因?yàn)?GC 日志中的信息更豐富。
GC 日志
為了分析問題,這里選取了第一次、第二次、第三次-第 N 次的 CMS GC 日志。
第一次 CMS GC日志
2019-03-28T20:05:06.906+0800: 3644459.373: [GC (CMS Initial Mark) [1 CMS-initial-mark: 2935428K(3354624K)] 3160044K(5242112K), 0.0586708 secs] [Times: user=0.22 sys=0.00, real=0.06 secs]2019-03-28T20:05:06.965+0800: 3644459.432: Total time for which application threads were stopped: 0.0616049 seconds, Stopping threads took: 0.0001381 seconds2019-03-28T20:05:06.965+0800: 3644459.432: [CMS-concurrent-mark-start]2019-03-28T20:05:08.066+0800: 3644460.533: [CMS-concurrent-mark: 1.101/1.101 secs] [Times: user=1.57 sys=0.05, real=1.10 secs]2019-03-28T20:05:08.066+0800: 3644460.533: [CMS-concurrent-preclean-start]2019-03-28T20:05:08.076+0800: 3644460.543: [CMS-concurrent-preclean: 0.010/0.010 secs] [Times: user=0.01 sys=0.01, real=0.01 secs]2019-03-28T20:05:08.076+0800: 3644460.543: [CMS-concurrent-abortable-preclean-start]2019-03-28T20:05:10.177+0800: 3644462.645: Application time: 3.2124140 seconds{Heap before GC invocations=18476 (full 731):par new generation total 1887488K, used 1887488K [0x0000000673400000, 0x00000006f3400000, 0x00000006f3400000)eden space 1677824K, 100% used [0x0000000673400000, 0x00000006d9a80000, 0x00000006d9a80000)from space 209664K, 100% used [0x00000006e6740000, 0x00000006f3400000, 0x00000006f3400000)to space 209664K, 0% used [0x00000006d9a80000, 0x00000006d9a80000, 0x00000006e6740000)concurrent mark-sweep generation total 3354624K, used 2935428K [0x00000006f3400000, 0x00000007c0000000, 0x00000007c0000000)Metaspace used 90225K, capacity 91504K, committed 91776K, reserved 1130496Kclass space used 9517K, capacity 9806K, committed 9856K, reserved 1048576K2019-03-28T20:05:10.179+0800: 3644462.647: [GC (Allocation Failure) 3644462.647: [ParNew: 1887488K->201195K(1887488K), 0.4228807 secs] 4822916K->3279195K(5242112K), 0.4231546 secs] [Times: user=1.54 sys=0.00, real=0.42 secs]Heap after GC invocations=18477 (full 731):par new generation total 1887488K, used 201195K [0x0000000673400000, 0x00000006f3400000, 0x00000006f3400000)eden space 1677824K, 0% used [0x0000000673400000, 0x0000000673400000, 0x00000006d9a80000)from space 209664K, 95% used [0x00000006d9a80000, 0x00000006e5efae68, 0x00000006e6740000)to space 209664K, 0% used [0x00000006e6740000, 0x00000006e6740000, 0x00000006f3400000)concurrent mark-sweep generation total 3354624K, used 3078000K [0x00000006f3400000, 0x00000007c0000000, 0x00000007c0000000)Metaspace used 90225K, capacity 91504K, committed 91776K, reserved 1130496Kclass space used 9517K, capacity 9806K, committed 9856K, reserved 1048576K}2019-03-28T20:05:10.603+0800: 3644463.070: Total time for which application threads were stopped: 0.4258929 seconds, Stopping threads took: 0.0001722 seconds2019-03-28T20:05:10.904+0800: 3644463.372: [CMS-concurrent-abortable-preclean: 2.397/2.828 secs] [Times: user=6.22 sys=0.10, real=2.83 secs]2019-03-28T20:05:10.904+0800: 3644463.372: Application time: 0.3012271 seconds2019-03-28T20:05:10.907+0800: 3644463.374: [GC (CMS Final Remark) [YG occupancy: 434406 K (1887488 K)]{Heap before GC invocations=18477 (full 731):par new generation total 1887488K, used 434406K [0x0000000673400000, 0x00000006f3400000, 0x00000006f3400000)eden space 1677824K, 13% used [0x0000000673400000, 0x00000006817bed10, 0x00000006d9a80000)from space 209664K, 95% used [0x00000006d9a80000, 0x00000006e5efae68, 0x00000006e6740000)to space 209664K, 0% used [0x00000006e6740000, 0x00000006e6740000, 0x00000006f3400000)concurrent mark-sweep generation total 3354624K, used 3078000K [0x00000006f3400000, 0x00000007c0000000, 0x00000007c0000000)Metaspace used 90225K, capacity 91504K, committed 91776K, reserved 1130496Kclass space used 9517K, capacity 9806K, committed 9856K, reserved 1048576K2019-03-28T20:05:10.907+0800: 3644463.375: [GC (CMS Final Remark) 3644463.375: [ParNew (promotion failed): 434406K->315478K(1887488K), 5.8407801 secs] 3512406K->3486710K(5242112K), 5.8410096 secs] [Times: user=6.84 sys=1.31, real=5.84 secs]Heap after GC invocations=18478 (full 731):par new generation total 1887488K, used 315478K [0x0000000673400000, 0x00000006f3400000, 0x00000006f3400000)eden space 1677824K, 13% used [0x0000000673400000, 0x00000006817bed10, 0x00000006d9a80000)from space 209664K, 39% used [0x00000006e6740000, 0x00000006eb796e60, 0x00000006f3400000)to space 209664K, 95% used [0x00000006d9a80000, 0x00000006e5efae68, 0x00000006e6740000)concurrent mark-sweep generation total 3354624K, used 3171231K [0x00000006f3400000, 0x00000007c0000000, 0x00000007c0000000)Metaspace used 90225K, capacity 91504K, committed 91776K, reserved 1130496Kclass space used 9517K, capacity 9806K, committed 9856K, reserved 1048576K}3644469.216: [Rescan (parallel) , 0.3096135 secs]3644469.525: [weak refs processing, 0.0009228 secs]3644469.526: [class unloading, 0.0797710 secs]3644469.606: [scrub symbol table, 0.0229535 secs]3644469.629: [scrub string table, 0.0020416 secs][1 CMS-remark: 3171231K(3354624K)] 3486710K(5242112K), 6.2593934 secs] [Times: user=8.10 sys=1.36, real=6.26 secs]2019-03-28T20:05:17.166+0800: 3644469.634: Total time for which application threads were stopped: 6.2622888 seconds, Stopping threads took: 0.0002099 seconds2019-03-28T20:05:17.167+0800: 3644469.634: [CMS-concurrent-sweep-start]2019-03-28T20:05:17.176+0800: 3644469.644: Application time: 0.0100218 seconds2019-03-28T20:05:17.179+0800: 3644469.647: Total time for which application threads were stopped: 0.0025500 seconds, Stopping threads took: 0.0001934 seconds2019-03-28T20:05:18.179+0800: 3644470.647: Application time: 1.0001731 seconds2019-03-28T20:05:18.182+0800: 3644470.649: Total time for which application threads were stopped: 0.0026811 seconds, Stopping threads took: 0.0001358 seconds2019-03-28T20:05:21.000+0800: 3644473.468: Application time: 2.8185985 seconds2019-03-28T20:05:21.003+0800: 3644473.471: Total time for which application threads were stopped: 0.0029238 seconds, Stopping threads took: 0.0001172 seconds2019-03-28T20:05:21.013+0800: 3644473.481: Application time: 0.0097451 seconds2019-03-28T20:05:21.019+0800: 3644473.487: Total time for which application threads were stopped: 0.0060990 seconds, Stopping threads took: 0.0002775 seconds2019-03-28T20:05:21.734+0800: 3644474.201: Application time: 0.7144315 seconds2019-03-28T20:05:21.736+0800: 3644474.204: Total time for which application threads were stopped: 0.0026804 seconds, Stopping threads took: 0.0001238 seconds2019-03-28T20:05:22.203+0800: 3644474.671: [CMS-concurrent-sweep: 5.019/5.037 secs] [Times: user=5.28 sys=0.27, real=5.03 secs]2019-03-28T20:05:22.204+0800: 3644474.671: [CMS-concurrent-reset-start]2019-03-28T20:05:22.211+0800: 3644474.678: [CMS-concurrent-reset: 0.007/0.007 secs] [Times: user=0.01 sys=0.00, real=0.01 secs]2019-03-28T20:05:22.238+0800: 3644474.706: Application time: 0.5016696 seconds2019-03-28T20:05:22.241+0800: 3644474.708: Total time for which application threads were stopped: 0.0026876 seconds, Stopping threads took: 0.0001305 seconds2019-03-28T20:05:22.438+0800: 3644474.905: Application time: 0.1970764 seconds2019-03-28T20:05:22.440+0800: 3644474.908: Total time for which application threads were stopped: 0.0027034 seconds, Stopping threads took: 0.0001344 seconds2019-03-28T20:05:23.441+0800: 3644475.908: Application time: 1.0001304 seconds2019-03-28T20:05:23.443+0800: 3644475.911: Total time for which application threads were stopped: 0.0024875 seconds, Stopping threads took: 0.0001316 seconds2019-03-28T20:05:24.210+0800: 3644476.678: Application time: 0.7671567 seconds看第一次 CMS GC 日志,有以下四個(gè)發(fā)現(xiàn):
由日志 “CMS-initial-mark: 2935428K(3354624K)”可知,第一次 CMS GC 是因?yàn)?2935428 / 3354624 = 87.5% > 80%,與此前監(jiān)控圖分析一致。
由日志 “2019-03-28T20:05:22.211+0800: 3644474.678: [CMS-concurrent-reset: 0.007/0.007 secs]” 可知,第一次 CMS GC 完成具體時(shí)間是 20:05:22.211。
由日志 “[GC (Allocation Failure) 3644462.647: [ParNew: 1887488K->201195K(1887488K), 0.4228807 secs]” 和日志 “[GC (CMS Final Remark) 3644463.375: [ParNew (promotion failed): 434406K->315478K(1887488K), 5.8407801 secs]”可知,第一次 CMS GC 日志中包含兩次 Young GC,并且第一次 YoungGC 是由于 Allocation Failure,而第二次是因?yàn)槭裁茨?#xff0c;其實(shí)是因?yàn)榕渲昧?XX:+CMSScavengeBeforeRemark 參數(shù),因此在 CMS-remark 階段前進(jìn)行了一次 YoungGC。
除了以上的信息,還有個(gè)奇怪的現(xiàn)象是,Young GC 后 eden、from、to 三個(gè) space 的使用量都不是 0 使用的情況,正常情況 Young GC 后 eden 和 to space 的使用量應(yīng)該是 0。?
這里其實(shí)不奇怪,通過日志 “concurrent mark-sweep generation total 3354624K, used 3171231K” 可知,OldGen 所剩無幾,而且還可能存在碎片,這會導(dǎo)致 Young GC 晉升的對象,無處安放,導(dǎo)致 Young GC 回收停止了。
第二次 CMS GC 日志
2019-03-28T20:05:24.213+0800: 3644476.680: [GC (CMS Initial Mark) [1 CMS-initial-mark: 899032K(3354624K)] 1548664K(5242112K), 0.2663899 secs] [Times: user=0.82 sys=0.00, real=0.27 secs]2019-03-28T20:05:24.479+0800: 3644476.947: Total time for which application threads were stopped: 0.2691825 seconds, Stopping threads took: 0.0001300 seconds2019-03-28T20:05:24.480+0800: 3644476.947: [CMS-concurrent-mark-start]2019-03-28T20:05:25.453+0800: 3644477.921: [CMS-concurrent-mark: 0.973/0.973 secs] [Times: user=0.98 sys=0.00, real=0.97 secs]2019-03-28T20:05:25.453+0800: 3644477.921: [CMS-concurrent-preclean-start]2019-03-28T20:05:25.463+0800: 3644477.931: [CMS-concurrent-preclean: 0.010/0.010 secs] [Times: user=0.01 sys=0.00, real=0.01 secs]2019-03-28T20:05:25.463+0800: 3644477.931: [CMS-concurrent-abortable-preclean-start]2019-03-28T20:05:25.463+0800: 3644477.931: [CMS-concurrent-abortable-preclean: 0.000/0.000 secs] [Times: user=0.00 sys=0.00, real=0.00 secs]2019-03-28T20:05:25.464+0800: 3644477.931: Application time: 0.9841796 seconds2019-03-28T20:05:25.466+0800: 3644477.934: [GC (CMS Final Remark) [YG occupancy: 649871 K (1887488 K)]{Heap before GC invocations=18478 (full 732):par new generation total 1887488K, used 649871K [0x0000000673400000, 0x00000006f3400000, 0x00000006f3400000)eden space 1677824K, 33% used [0x0000000673400000, 0x0000000695e4cfa0, 0x00000006d9a80000)from space 209664K, 39% used [0x00000006e6740000, 0x00000006eb796e60, 0x00000006f3400000)to space 209664K, 95% used [0x00000006d9a80000, 0x00000006e5efae68, 0x00000006e6740000)concurrent mark-sweep generation total 3354624K, used 899032K [0x00000006f3400000, 0x00000007c0000000, 0x00000007c0000000)Metaspace used 90199K, capacity 91456K, committed 91776K, reserved 1130496Kclass space used 9512K, capacity 9798K, committed 9856K, reserved 1048576K2019-03-28T20:05:25.466+0800: 3644477.934: [GC (CMS Final Remark) 3644477.934: [ParNew: 649871K->649871K(1887488K), 0.0000289 secs] 1548903K->1548903K(5242112K), 0.0001785 secs] [Times: user=0.00 sys=0.00, real=0.00 secs]Heap after GC invocations=18479 (full 732):par new generation total 1887488K, used 649871K [0x0000000673400000, 0x00000006f3400000, 0x00000006f3400000)eden space 1677824K, 33% used [0x0000000673400000, 0x0000000695e4cfa0, 0x00000006d9a80000)from space 209664K, 39% used [0x00000006e6740000, 0x00000006eb796e60, 0x00000006f3400000)to space 209664K, 95% used [0x00000006d9a80000, 0x00000006e5efae68, 0x00000006e6740000)concurrent mark-sweep generation total 3354624K, used 899032K [0x00000006f3400000, 0x00000007c0000000, 0x00000007c0000000)Metaspace used 90199K, capacity 91456K, committed 91776K, reserved 1130496Kclass space used 9512K, capacity 9798K, committed 9856K, reserved 1048576K}3644477.934: [Rescan (parallel) , 0.6932186 secs]3644478.627: [weak refs processing, 0.0000973 secs]3644478.628: [class unloading, 0.0527698 secs]3644478.680: [scrub symbol table, 0.0170169 secs]3644478.697: [scrub string table, 0.0019041 secs][1 CMS-remark: 899032K(3354624K)] 1548903K(5242112K), 0.7680979 secs] [Times: user=2.75 sys=0.00, real=0.76 secs]2019-03-28T20:05:26.234+0800: 3644478.702: Total time for which application threads were stopped: 0.7708180 seconds, Stopping threads took: 0.0001262 seconds2019-03-28T20:05:26.235+0800: 3644478.703: [CMS-concurrent-sweep-start]2019-03-28T20:05:27.235+0800: 3644479.702: Application time: 1.0001322 seconds2019-03-28T20:05:27.237+0800: 3644479.705: Total time for which application threads were stopped: 0.0026851 seconds, Stopping threads took: 0.0001390 seconds2019-03-28T20:05:27.441+0800: 3644479.908: Application time: 0.2034629 seconds2019-03-28T20:05:27.445+0800: 3644479.912: Total time for which application threads were stopped: 0.0038960 seconds, Stopping threads took: 0.0014157 seconds2019-03-28T20:05:28.445+0800: 3644480.912: Application time: 1.0001430 seconds2019-03-28T20:05:28.447+0800: 3644480.915: Total time for which application threads were stopped: 0.0026665 seconds, Stopping threads took: 0.0001339 seconds2019-03-28T20:05:31.156+0800: 3644483.624: Application time: 2.7089917 seconds2019-03-28T20:05:31.159+0800: 3644483.627: Total time for which application threads were stopped: 0.0025883 seconds, Stopping threads took: 0.0001411 seconds2019-03-28T20:05:32.468+0800: 3644484.936: [CMS-concurrent-sweep: 6.221/6.233 secs] [Times: user=6.35 sys=0.06, real=6.24 secs]2019-03-28T20:05:32.468+0800: 3644484.936: [CMS-concurrent-reset-start]2019-03-28T20:05:32.476+0800: 3644484.943: [CMS-concurrent-reset: 0.007/0.007 secs] [Times: user=0.01 sys=0.00, real=0.00 secs]2019-03-28T20:05:33.159+0800: 3644485.627: Application time: 2.0002143 seconds2019-03-28T20:05:33.162+0800: 3644485.630: Total time for which application threads were stopped: 0.0026832 seconds, Stopping threads took: 0.0001180 seconds2019-03-28T20:05:34.162+0800: 3644486.630: Application time: 1.0000781 seconds2019-03-28T20:05:34.164+0800: 3644486.632: Total time for which application threads were stopped: 0.0023807 seconds, Stopping threads took: 0.0001220 seconds2019-03-28T20:05:34.476+0800: 3644486.943: Application time: 0.3111430 seconds看第二次 CMS GC 日志,有以下四個(gè)發(fā)現(xiàn):
由日志 “CMS-initial-mark: 899032K(3354624K)” 可知,其實(shí)第一次 CMS GC 是已經(jīng)回收了 OldGen,而且釋放了大量空間,OldGen 的使用占比只有 899032 / 3354624 = 26.8%,很奇怪為什么會進(jìn)行 CMS GC?
由日志 “2019-03-28T20:05:24.213+0800: 3644476.680: [GC (CMS Initial Mark)” 可知,第二次 CMS GC 開始的具體時(shí)間是 20:05:24.213,上次 CMS GC 結(jié)束時(shí)間 20:05:22.211 相差 2s。
由日志 “[GC (CMS Final Remark) 3644477.934: [ParNew: 649871K->649871K(1887488K), 0.0000289 secs]” 可知,第二次 CMS GC 日志中包含一次 Young GC,毫無疑問是因?yàn)榕渲昧?XX:+CMSScavengeBeforeRemark 參數(shù)導(dǎo)致的。
Young GC 后 eden、from、to 三個(gè) space 的使用量都不是 0 的情況依然存在,只是 eden space 由使用比率 13% 增加到 33%。
很奇怪,此時(shí)通過日志 “concurrent mark-sweep generation total 3354624K, used 899032K” 可知,OldGen 空閑空間很大,為什么 Young GC 好像沒起作用。
第三次-第 N 次 CMS GC 日志
2019-03-28T20:05:34.478+0800: 3644486.945: [GC (CMS Initial Mark) [1 CMS-initial-mark: 573449K(3354624K)] 1247191K(5242112K), 0.5737527 secs] [Times: user=0.79 sys=0.00, real=0.57 secs]2019-03-28T20:05:35.052+0800: 3644487.519: Total time for which application threads were stopped: 0.5762441 seconds, Stopping threads took: 0.0001088 seconds2019-03-28T20:05:35.052+0800: 3644487.520: [CMS-concurrent-mark-start]............看第三次-第 N 次 CMS GC 日志,有三個(gè)發(fā)現(xiàn):
由日志 “CMS-initial-mark: 573449K(3354624K)” 可知,OldGen 的使用占比只有 573449 / 3354624= 17.1%,很奇怪為什么會進(jìn)行 CMS GC?
由日志 “2019-03-28T20:05:34.478+0800: 3644486.945: [GC (CMS Initial Mark)” 可知,第三次 CMS GC 的開始時(shí)間 20:05:34.478 與 第二次 CMS GC 結(jié)束時(shí)間 20:05:32.476 又相差 2s。
由于配置了 -XX:+CMSScavengeBeforeRemark 參數(shù),CMS GC 過程中依然包含一次 Young GC。
Young GC 后 eden、from、to 三個(gè) space 的使用量都不是 0 的情況依然存在,只是 eden space 由使用比率增長。 很奇怪,OldGen 空閑空間很大,為什么 Young GC 好像沒起作用?
根源定位
通過日志分析,,大家很容易發(fā)現(xiàn)三個(gè)問題:
-
每次 CMS GC 都是相隔 2s? 這其實(shí)是 CMS background collector 的策略,每隔 CMSWaitDuration(默認(rèn)為2000ms) 時(shí)間進(jìn)行一次檢測,若發(fā)現(xiàn)滿足 CMS GC 觸發(fā)條件,就進(jìn)行一次 CMS background collector。
-
第二次及后面的 CMS GC,OldGen 的使用占比情況都沒有達(dá)到 80%,很疑惑是什么導(dǎo)致了 CMS GC? 通過上面的分析,其實(shí)只要知道是什么滿足了 CMS GC 觸發(fā)條件而導(dǎo)致了 CMS GC,就能回答第二個(gè)問題。
-
Young GC 后 eden、from、to 三個(gè) space 的使用量都不是 0 的情況,而且 OldGen 空閑空間很大,為什么 Young GC 好像沒起作用。
源碼排查
OldGen 的使用占比情況都沒有達(dá)到 80%,什么原因?qū)е碌?CMS GC
下面我們來看下 CMS GC 觸發(fā)條件,觸發(fā)條件都在 shouldConcurrentCollect 函數(shù)里,返回 true 的都是可能的情況,這里分別解釋下。
-
“if (fullgc_requested)” 這是由 System.gc() 調(diào)用且配置了 -XX:+ExplicitGCInvokesConcurrent 參數(shù)的情況下,會觸發(fā)一次 CMS GC。但如果是 System.gc(),每次 CMS GC 的間隔時(shí)間不可能一直是 2s,故顯然不符合。
-
“if (!UseCMSInitiatingOccupancyOnly)” 這是在沒有配置 -XX:+UseCMSInitiatingOccupancyOnly 參數(shù)的情況下,可能觸發(fā) CMS GC 的情況,故顯然不符合。
-
“if (cmsGen->shouldconcurrent_collect())” 這是 -XX:+UseCMSInitiatingOccupancyOnly 參數(shù)的情況下,如果 OldGen 使用占比達(dá)到 -XX:CMSInitiatingOccupancyFraction 參數(shù)設(shè)置值,就會觸發(fā) CMS GC,但第二次、第三-第 N 次明顯不符合情況。
-
“if (gch->incrementalcollectionwillfail(true /* consultyoung */))” 這是一種悲觀策略,判斷新生代回收是否會失敗,如果最近一次 Young GC 已經(jīng)失敗或者可能會失敗,就會觸發(fā)一次 CMS GC。這是符合本文說的情況的。
-
“if (MetaspaceGC::shouldconcurrentcollect())” 這是 Metaspace 滿足 CMS GC 觸發(fā)條件的情況,根據(jù)日志 “ Metaspace used 90199K, capacity 91456K, committed 91776K, reserved 1130496K” 中 Metaspace 空間使用情況,顯然不符合。
-
“if (CMSTriggerInterval >= 0)” 這是配置了 -XX:CMSTriggerInterval 參數(shù)的情況,顯然不符合。
接下來,我們具體分析下 incrementalcollectionwillfail(true) 函數(shù),這個(gè)函數(shù)有兩個(gè)判斷條件 incrementalcollectionfailed() 或者 !younggen->collectionattemptissafe(),有一個(gè)成立就會返回 true。
// Returns true if an incremental collection is likely to fail. // We optionally consult the young gen, if asked to do so; // otherwise we base our answer on whether the previous incremental // collection attempt failed with no corrective action as of yet. bool incremental_collection_will_fail(bool consult_young) {// The first disjunct remembers if an incremental collection failed, even// when we thought (second disjunct) that it would not.return incremental_collection_failed() ||(consult_young && !_young_gen->collection_attempt_is_safe());}bool incremental_collection_failed() const {return _incremental_collection_failed; }我們先來看 incrementalcollectionfailed() 函數(shù),這個(gè)函數(shù)返回的是incrementalcollectionfailed 這個(gè)成員的值,這個(gè)值只有兩個(gè)情況下會通過 setincrementalcollectionfailed() 函數(shù)設(shè)置成 true,并且會在 CMS GC 的 sweep 階段會設(shè)置為 false。 第一種情況是: 晉升失敗 Promotion failed,但是只有第一次 CMS GC 出現(xiàn)過一次,后續(xù)的Young GC 都不是 promotion failed,說明不是這種情況。
void ParNewGeneration::handle_promotion_failed(GenCollectedHeap* gch, ParScanThreadStateSet& thread_state_set) {assert(_promo_failure_scan_stack.is_empty(), "post condition");_promo_failure_scan_stack.clear(true); // Clear cached segments.remove_forwarding_pointers();log_info(gc, promotion)("Promotion failed");// All the spaces are in play for mark-sweep.swap_spaces(); // Make life simpler for CMS || rescan; see 6483690.from()->set_next_compaction_space(to());gch->set_incremental_collection_failed();// Inform the next generation that a promotion failure occurred._old_gen->promotion_failure_occurred();// Trace promotion failure in the parallel GC threadsthread_state_set.trace_promotion_failed(gc_tracer());// Single threaded code may have reported promotion failure to the global stateif (_promotion_failed_info.has_failed()) {_gc_tracer.report_promotion_failed(_promotion_failed_info);}// Reset the PromotionFailureALot counters.NOT_PRODUCT(gch->reset_promotion_should_fail();) }第二種情況是: Young GC 過程中,if (!collectionattemptissafe()) 為 true,也會通過 setincrementalcollectionfailed() 函數(shù)設(shè)置。
void ParNewGeneration::collect(bool full,bool clear_all_soft_refs,size_t size,bool is_tlab) {assert(full || size > 0, "otherwise we don't want to collect");GenCollectedHeap* gch = GenCollectedHeap::heap();_gc_timer->register_gc_start();AdaptiveSizePolicy* size_policy = gch->gen_policy()->size_policy();WorkGang* workers = gch->workers();assert(workers != NULL, "Need workgang for parallel work");uint active_workers =AdaptiveSizePolicy::calc_active_workers(workers->total_workers(),workers->active_workers(),Threads::number_of_non_daemon_threads());active_workers = workers->update_active_workers(active_workers);log_info(gc,task)("Using %u workers of %u for evacuation", active_workers, workers->total_workers());_old_gen = gch->old_gen();// If the next generation is too full to accommodate worst-case promotion// from this generation, pass on collection; let the next generation// do it.if (!collection_attempt_is_safe()) {gch->set_incremental_collection_failed(); // slight lie, in that we did not even attempt onereturn;}我們再看看 collectionattemptissafe() 函數(shù)的實(shí)現(xiàn),會讓你豁然開朗,if (!to()->isempty()) return false,剛好滿足了每次 YoungGC to space 不為空。因此,是在這里 incrementalcollection_failed 被設(shè)置成 true,導(dǎo)致每隔 2s 觸發(fā)一次 CMS GC,這就解釋了為什么 OldGen 的使用占比情況都沒有達(dá)到 80%,也會觸發(fā) CMS GC。
bool DefNewGeneration::collection_attempt_is_safe() {if (!to()->is_empty()) {log_trace(gc)(":: to is not empty ::");return false;}if (_old_gen == NULL) {GenCollectedHeap* gch = GenCollectedHeap::heap();_old_gen = gch->old_gen();}return _old_gen->promotion_attempt_is_safe(used()); }Young GC 后 eden、from、to 三個(gè) space 的使用量都不是 0 的情況
看到這里,其實(shí)這個(gè)問題也很好解釋了,我們看 ParNewGeneration::collect 函數(shù)中的這段代碼就明白了,YoungGC 遇到 to space 不為空的情況下,直接 setincrementalcollection_failed() 完就返回了,并沒有進(jìn)行真正的 Young GC。
if (!collection_attempt_is_safe()) {gch->set_incremental_collection_failed(); // slight lie, in that we did not even attempt onereturn;}罪魁禍?zhǔn)?/h3>
看到這里,你一定在想,那罪魁禍?zhǔn)椎降资钦l呢?表面上看是 to space 不為空導(dǎo)致觸發(fā)了 Young GC,然后設(shè)置了 incrementalcollectionfailed 參數(shù),進(jìn)而滿足了 CMS GC 觸發(fā)條件。實(shí)質(zhì)上是因?yàn)榕渲昧?-XX:CMSScavengeBeforeRemark 參數(shù),CMS GC 階段強(qiáng)制進(jìn)行了 Young GC,導(dǎo)致 to space 不為空,因此這個(gè)鍋得由 -XX:CMSScavengeBeforeRemark 參數(shù)來背。
你可能要問即使不設(shè)置 -XX:CMSScavengeBeforeRemark 參數(shù) CMS GC 階段也是有可能會觸發(fā) Young GC,憑什么要讓 -XX:CMSScavengeBeforeRemark 參數(shù)來背鍋。
如果是 Allocation Failure 觸發(fā)的 Young GC 也會有問題嗎?
答案是不會,這里可以借助最后一次 CMS GC 日志來分析。
最后一次 CMS GC 日志
2019-03-28T20:14:15.470+0800: 3645007.937: Application time: 2.4315795 seconds2019-03-28T20:14:15.472+0800: 3645007.940: [GC (CMS Initial Mark) [1 CMS-initial-mark: 572987K(3354624K)] 2318654K(5242112K), 1.7733417 secs] [Times: user=1.94 sys=0.05, real=1.77 secs]2019-03-28T20:14:17.246+0800: 3645009.714: Total time for which application threads were stopped: 1.7762440 seconds, Stopping threads took: 0.0001283 seconds2019-03-28T20:14:17.246+0800: 3645009.714: [CMS-concurrent-mark-start]2019-03-28T20:14:17.257+0800: 3645009.725: Application time: 0.0112116 seconds2019-03-28T20:14:17.260+0800: 3645009.728: Total time for which application threads were stopped: 0.0027625 seconds, Stopping threads took: 0.0001346 seconds2019-03-28T20:14:17.260+0800: 3645009.728: Application time: 0.0001755 seconds2019-03-28T20:14:17.263+0800: 3645009.730: Total time for which application threads were stopped: 0.0027375 seconds, Stopping threads took: 0.0000825 seconds2019-03-28T20:14:17.263+0800: 3645009.731: Application time: 0.0000958 seconds2019-03-28T20:14:17.265+0800: 3645009.733: Total time for which application threads were stopped: 0.0025920 seconds, Stopping threads took: 0.0000832 seconds2019-03-28T20:14:17.274+0800: 3645009.741: Application time: 0.0081685 seconds2019-03-28T20:14:17.277+0800: 3645009.744: Total time for which application threads were stopped: 0.0028536 seconds, Stopping threads took: 0.0001305 seconds2019-03-28T20:14:17.845+0800: 3645010.312: Application time: 0.5681527 seconds{Heap before GC invocations=18561 (full 815):par new generation total 1887488K, used 1760091K [0x0000000673400000, 0x00000006f3400000, 0x00000006f3400000)eden space 1677824K, 100% used [0x0000000673400000, 0x00000006d9a80000, 0x00000006d9a80000)from space 209664K, 39% used [0x00000006e6740000, 0x00000006eb796e60, 0x00000006f3400000)to space 209664K, 95% used [0x00000006d9a80000, 0x00000006e5efae68, 0x00000006e6740000)concurrent mark-sweep generation total 3354624K, used 572987K [0x00000006f3400000, 0x00000007c0000000, 0x00000007c0000000)Metaspace used 90185K, capacity 91432K, committed 91776K, reserved 1130496Kclass space used 9510K, capacity 9794K, committed 9856K, reserved 1048576K2019-03-28T20:14:17.847+0800: 3645010.315: [GC (Allocation Failure) 3645010.315: [ParNew: 1760091K->1760091K(1887488K), 0.0000231 secs]3645010.315: [CMS2019-03-28T20:14:18.223+0800: 3645010.691: [CMS-concurrent-mark: 0.961/0.977 secs] [Times: user=1.04 sys=0.00, real=0.97 secs](concurrent mode failure): 572987K->554147K(3354624K), 3.2531090 secs] 2333078K->554147K(5242112K), [Metaspace: 90185K->90185K(1130496K)], 3.2534483 secs] [Times: user=3.26 sys=0.00, real=3.25 secs]Heap after GC invocations=18562 (full 816):new generation total 1887488K, used 0K [0x0000000673400000, 0x00000006f3400000, 0x00000006f3400000)eden space 1677824K, 0% used [0x0000000673400000, 0x0000000673400000, 0x00000006d9a80000)from space 209664K, 0% used [0x00000006e6740000, 0x00000006e6740000, 0x00000006f3400000)to space 209664K, 0% used [0x00000006d9a80000, 0x00000006d9a80000, 0x00000006e6740000)concurrent mark-sweep generation total 3354624K, used 554147K [0x00000006f3400000, 0x00000007c0000000, 0x00000007c0000000)Metaspace used 90165K, capacity 91402K, committed 91776K, reserved 1130496Kclass space used 9507K, capacity 9789K, committed 9856K, reserved 1048576K}2019-03-28T20:14:21.101+0800: 3645013.568: Total time for which application threads were stopped: 3.2561084 seconds, Stopping threads took: 0.0001151 seconds看上面的日志,你會發(fā)現(xiàn)這次 CMS GC 日志跟以往的都不太一樣,CMS-concurrent-mark-start 日志出現(xiàn)后,后面的日志都不是按照 CMS GC 的各個(gè)階段打出來的。 是的,后面其實(shí)是由于 Allocation Failure 而發(fā)生了一次 Young GC,從而中斷了 CMS background collector,而進(jìn)行了一次 CMS foreground collector,有 “concurrent mode failure” 為證。
也就是說一般的 Allocation Failure 引起的 YoungGC 在這種情況下,不會出現(xiàn)頻繁 CMS GC,因此,把問題歸結(jié)到 -XX:CMSScavengeBeforeRemark 參數(shù)不為過吧。
總結(jié)
本文主要是由于 -XX:CMSScavengeBeforeRemark 參數(shù)觸發(fā)了 Young GC,但該 YoungGC 并沒有成功進(jìn)行的,反而促使 CMS background collector 觸發(fā)條件滿足,進(jìn)而引發(fā)了頻繁 CMS GC。
該怎么避免呢
一時(shí)也沒有想到很好的辦法,兩個(gè)參考方案:
-
去掉 -XX:CMSScavengeBeforeRemark 參數(shù)
-
降低 YoungGen 大小,加快因 Allocation Failure 而觸發(fā)正常 Young GC
總結(jié)
以上是生活随笔為你收集整理的JVM发生频繁 CMS GC,罪魁祸首是这个参数!的全部內(nèi)容,希望文章能夠幫你解決所遇到的問題。
- 上一篇: MySQL 如何查找删除重复行?
- 下一篇: 我逛了一下JDK一条街,发现了不少好东西