A .NET memory leak analysis: the backend service of a fire-protection IoT system
I: Background
1. The story
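Last October a friend reached out to me on WeChat and said his program's memory was about to blow up.
It has been a while and the screenshots from that conversation have been cleaned up. The ironic part is that the program itself is a monitoring service, and it was the one that got into trouble. Rather embarrassing 🤣🤣🤣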
II: WinDbg analysis
1. Managed or unmanaged?
This is the first step in triaging any memory problem. The two commands !address -summary and !eeheap -gc are usually enough to settle it.
0:000> !address -summary

Mapping file section regions...
Mapping module regions...
Mapping PEB regions...
Mapping TEB and stack regions...
Mapping heap regions...
Mapping page heap regions...
Mapping other regions...
Mapping stack trace database regions...
Mapping activation context regions...

--- Usage Summary ---------------- RgnCount ----------- Total Size -------- %ofBusy %ofTotal
Free                                    237     7ffc`1e222000 ( 127.985 TB)           99.99%
<unknown>                               594        3`b9b20000 (  14.901 GB)  95.96%    0.01%
Heap                                    370        0`12a2a000 ( 298.164 MB)   1.88%    0.00%
Image                                  1248        0`0ee5a000 ( 238.352 MB)   1.50%    0.00%
Stack                                   315        0`06780000 ( 103.500 MB)   0.65%    0.00%
Other                                    13        0`001d7000 (   1.840 MB)   0.01%    0.00%
TEB                                     105        0`000d2000 ( 840.000 kB)   0.01%    0.00%
PEB                                       1        0`00001000 (   4.000 kB)   0.00%    0.00%

--- Type Summary (for busy) ------ RgnCount ----------- Total Size -------- %ofBusy %ofTotal
MEM_PRIVATE                            1178        3`ce03d000 (  15.219 GB)  98.00%    0.01%
MEM_IMAGE                              1409        0`0f6fd000 ( 246.988 MB)   1.55%    0.00%
MEM_MAPPED                               59        0`04694000 (  70.578 MB)   0.44%    0.00%

--- State Summary ---------------- RgnCount ----------- Total Size -------- %ofBusy %ofTotal
MEM_FREE                                237     7ffc`1e222000 ( 127.985 TB)           99.99%
MEM_COMMIT                             2326        3`c7543000 (  15.115 GB)  97.33%    0.01%
MEM_RESERVE                             320        0`1a88b000 ( 424.543 MB)   2.67%    0.00%

0:000> !eeheap -gc
Number of GC Heaps: 1
generation 0 starts at 0x0000009902B57670
generation 1 starts at 0x0000009902A56810
generation 2 starts at 0x00000095318C1000
ephemeral segment allocation context: (0x0000009902D724A8, 0x0000009902D724C0)
segment             begin         allocated         committed    allocated size    committed size
...
00000098FFBE0000  00000098FFBE1000  0000009902D724A8  0000009902D7D000  0x31914a8(51975336)  0x319c000(52019200)
Large object heap starts at 0x00000095418C1000
segment             begin         allocated         committed    allocated size    committed size
00000095418C0000  00000095418C1000  00000095475D8D98  00000095475D9000  0x5d17d98(97615256)  0x5d18000(97615872)
Total Allocated Size:              Size: 0x398e6cbc8 (15450164168) bytes.
Total Committed Size:              Size: 0x398e7b000 (15450222592) bytes.
------------------------------
GC Allocated Heap Size:    Size: 0x398e6cbc8 (15450164168) bytes.
GC Committed Heap Size:    Size: 0x398e7b000 (15450222592) bytes.

Judging from the output, wow: the process has about 15 GB of committed memory (MEM_COMMIT), and the managed heap is also about 15 GB. That tells us this is the comparatively simple case, a managed memory leak.
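The comparison made in this step can be written down as a small calculation. The sketch below only reuses two numbers from the output above (MEM_COMMIT and GC Committed Heap Size); the function name and the 0.8 cutoff are illustrative assumptions, not anything WinDbg or SOS defines.

```python
# Figures copied from the !address -summary and !eeheap -gc output.
MEM_COMMIT_BYTES = 0x3C7543000      # State Summary, MEM_COMMIT (3`c7543000)
GC_COMMITTED_BYTES = 15450222592    # GC Committed Heap Size


def managed_share(gc_committed: int, process_committed: int) -> float:
    """Fraction of the process's committed memory that belongs to the GC heap."""
    return gc_committed / process_committed


share = managed_share(GC_COMMITTED_BYTES, MEM_COMMIT_BYTES)

# The 0.8 cutoff is an arbitrary threshold chosen for illustration only.
verdict = "managed leak" if share > 0.8 else "check unmanaged memory as well"

print(f"GC heap share of committed memory: {share:.1%} -> {verdict}")
```

When the GC heap accounts for nearly all committed memory, as it does here, the rest of the investigation can stay on the managed side (!dumpheap, !gcroot and friends).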
2. What exactly is leaking?
To find out what is leaking, start by looking at the managed heap for abnormal objects, using the !dumpheap -stat command.
0:000> !dumpheap -stat
Statistics:
              MT    Count    TotalSize Class Name
...
00007ff8815d0f88  7260233    290409320 System.Collections.ArrayList
00007ff8815e6830  7313696    326240826 System.String
000000952fbbd2b0 12141398    509369998      Free
00007ff880685cf0  7254983    928637824 System.Diagnostics.ProcessInfo
00007ff88065f7d0  7256845   2031916600 System.Diagnostics.Process
00007ff8815e6ea8  7391338   2230744600 System.Object[]
00007ff88068fa70 186800748   8966435904 System.Diagnostics.ThreadInfo

This reading is really odd. If you know the Process class, you know that ProcessInfo and ThreadInfo both hang off Process, and here there are more than 180 million ThreadInfo objects, which is quite something 🐂👃. It looks as if the program keeps doing Process.Start over and over.
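A few ratios make these statistics easier to read. The following is a minimal sketch that parses the rows shown above and derives the per-object size, the number of ThreadInfo objects per Process object, and ThreadInfo's share of the GC heap; the helper name parse_stat is made up for this example.

```python
# Rows copied from the !dumpheap -stat output: MT, Count, TotalSize, Class Name.
STAT_ROWS = """\
00007ff8815d0f88  7260233    290409320 System.Collections.ArrayList
00007ff8815e6830  7313696    326240826 System.String
000000952fbbd2b0 12141398    509369998      Free
00007ff880685cf0  7254983    928637824 System.Diagnostics.ProcessInfo
00007ff88065f7d0  7256845   2031916600 System.Diagnostics.Process
00007ff8815e6ea8  7391338   2230744600 System.Object[]
00007ff88068fa70 186800748   8966435904 System.Diagnostics.ThreadInfo
"""
GC_ALLOCATED_BYTES = 15450164168  # "GC Allocated Heap Size" from !eeheap -gc


def parse_stat(text):
    """Map class name -> (instance count, total bytes)."""
    stats = {}
    for line in text.splitlines():
        _mt, count, total, name = line.split(None, 3)
        stats[name] = (int(count), int(total))
    return stats


stats = parse_stat(STAT_ROWS)
ti_count, ti_bytes = stats["System.Diagnostics.ThreadInfo"]
proc_count, _proc_bytes = stats["System.Diagnostics.Process"]

bytes_per_thread_info = ti_bytes / ti_count        # size of one ThreadInfo
thread_infos_per_process = ti_count / proc_count   # ThreadInfo objects per Process
thread_info_share = ti_bytes / GC_ALLOCATED_BYTES  # share of the managed heap
largest = max(stats, key=lambda name: stats[name][1])

print(f"{largest}: {bytes_per_thread_info:.0f} bytes each, "
      f"{thread_infos_per_process:.1f} per Process, "
      f"{thread_info_share:.0%} of the GC heap")
```

Roughly 26 ThreadInfo objects per Process object, with over 7 million Process objects alive, is consistent with the observation that these objects hang off Process instances that were never cleaned up.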
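The next question is: who is holding on to these ThreadInfo objects? Pick a few and run !gcroot on them. I sampled several from the head and the tail of the list, and none of them had a root, as shown below: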
0:000> !gcroot 0000009531e8f760
Found 0 unique roots (run '!GCRoot -all' to see all roots).
0:000> !gcroot 0000009531e8f670
Found 0 unique roots (run '!GCRoot -all' to see all roots).
0:000> !gcroot 0000009531e8f378
Found 0 unique roots (run '!GCRoot -all' to see all roots).

3. Why are unrooted objects not being collected?
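If the objects have no roots, why is the GC not reclaiming them? This is where experience comes in. In the many memory-leak dumps I had analyzed before, I always looked for the problem on the producer side; I had not yet come across a case where the problem was on the consumer side. For reference, think of the managed heap as sitting between a producer side and a consumer side.
If the producer side is fine, what could be wrong on the consumer side? Think about it: the managed heap has at least two consumers.
GC
The Finalizer thread
The GC itself is very unlikely to be the problem, so the remaining suspect is the Finalizer thread. The !t command lists all the managed threads.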
0:000> !t
ThreadCount:      9566
UnstartedThread:  0
BackgroundThread: 88
PendingThread:    0
DeadThread:       9471
Hosted Runtime:   no
                                                                                                        Lock
       ID OSID ThreadOBJ           State GC Mode     GC Alloc Context                  Domain           Count Apt Exception
   0    1 40e18 000000952fbdfe50    26020 Preemptive  0000000000000000:0000000000000000 000000952fbd2e40 0     STA
   2    2 41ce8 000000952fc0e4d0    2b220 Preemptive  0000000000000000:0000000000000000 000000952fbd2e40 0     MTA (Finalizer)
   4    4 41cb4 000000954c324970    27220 Preemptive  0000000000000000:0000000000000000 000000952fbd2e40 0     STA
   5    5 41cb8 000000954c36d1e0  2027220 Preemptive  0000000000000000:0000000000000000 000000952fbd2e40 0     STA
   6    6 41c58 000000954c33f070  2027220 Preemptive  0000000000000000:0000000000000000 000000952fbd2e40 0     STA
   7    7 41c38 000000954c33f840    27220 Preemptive  0000000000000000:0000000000000000 000000952fbd2e40 0     STA
   8    8 41e0c 000000954c333580    27220 Preemptive  0000000000000000:0000000000000000 000000952fbd2e40 0     STA
   9    9 41e2c 000000954c354440    27220 Preemptive  0000000000000000:0000000000000000 000000952fbd2e40 0     STA
  10   10 41f24 000000954c355840    27220 Preemptive  0000000000000000:0000000000000000 000000952fbd2e40 0     STA
...
XXXX 9446    0 000000957a233410  8039820 Preemptive  0000000000000000:0000000000000000 000000952fbd2e40 0     MTA (Threadpool Completion Port)
XXXX 9447    0 0000009579f83e30  8039820 Preemptive  0000000000000000:0000000000000000 000000952fbd2e40 0     MTA (Threadpool Completion Port)
XXXX 9450    0 000000957a46dcf0  8039820 Preemptive  0000000000000000:0000000000000000 000000952fbd2e40 0     MTA (Threadpool Completion Port)
XXXX 9449    0 000000957967c6e0  8039820 Preemptive  0000000000000000:0000000000000000 000000952fbd2e40 0     MTA (Threadpool Completion Port)
XXXX 9448    0 000000957aee0010  8039820 Preemptive  0000000000000000:0000000000000000 000000952fbd2e40 0     MTA (Threadpool Completion Port)
XXXX 9452    0 00000095796824a0  8039820 Preemptive  0000000000000000:0000000000000000 000000952fbd2e40 0     MTA (Threadpool Completion Port)
XXXX 9451    0 000000957af05df0  8039820 Preemptive  0000000000000000:0000000000000000 000000952fbd2e40 0     MTA (Threadpool Completion Port)

Quite a shock: the process has 9566 managed threads, and 9471 of them are dead. Going by past experience, my guess is that this fellow skips the thread pool and uses new Thread instead 🐂👃🦆. Rant over. Next, let's see what the Finalizer thread is doing, using ~2s followed by !dumpstack.
0:002> ~2s
ntdll!NtWaitForSingleObject+0xa:
00007ff8`8e220c8a c3              ret
0:002> !dumpstack
OS Thread Id: 0x41ce8 (2)
Current frame: ntdll!NtWaitForSingleObject+0xa
Child-SP         RetAddr          Caller, Callee
0000009549e4e120 00007ff88b591118 KERNELBASE!WaitForSingleObjectEx+0x94, calling ntdll!NtWaitForSingleObject
0000009549e4e1c0 00007ff88da3e334 combase!MTAThreadWaitForCall+0x54 [d:\9147\com\combase\dcomrem\channelb.cxx:5657], calling KERNELBASE!WaitForSingleObject
0000009549e4e210 00007ff88d8fe089 combase!MTAThreadDispatchCrossApartmentCall+0x75 [d:\9147\com\combase\dcomrem\chancont.cxx:193], calling combase!MTAThreadWaitForCall [d:\9147\com\combase\dcomrem\channelb.cxx:5619]
0000009549e4e240 00007ff88da3e13d combase!CRpcChannelBuffer::SendReceive2+0x64b [d:\9147\com\combase\dcomrem\channelb.cxx:4796], calling combase!MTAThreadDispatchCrossApartmentCall [d:\9147\com\combase\dcomrem\chancont.cxx:156]
0000009549e4e2b0 00007ff88e1bb6f7 ntdll!RtlAllocateHeap+0xd7, calling ntdll!RtlpLowFragHeapAllocFromContext
...
0000009549e4f5d0 00007ff8827d79cd clr!ManagedThreadBase_DispatchOuter+0x75, calling clr!ManagedThreadBase_DispatchMiddle
0000009549e4f5e0 00007ff8828601af clr!EEConfig::GetConfigDWORD_DontUse_+0x3b, calling clr!EEConfig::GetConfiguration_DontUse_
0000009549e4f660 00007ff8828574fa clr!FinalizerThread::FinalizerThreadStart+0x10a, calling clr!ManagedThreadBase_DispatchOuter
0000009549e4f6a0 00007ff8827d55b9 clr!EEHeapFreeInProcessHeap+0x45, calling kernel32!HeapFreeStub
0000009549e4f700 00007ff882882e8f clr!Thread::intermediateThreadProc+0x86
0000009549e4f780 00007ff882882e6f clr!Thread::intermediateThreadProc+0x66, calling clr!_chkstk
0000009549e4f7c0 00007ff88dcd13d2 kernel32!BaseThreadInitThunk+0x22
0000009549e4f7f0 00007ff88e2003c4 ntdll!RtlUserThreadStart+0x34

The stack shows that the finalizer thread is stuck. Judging by the MTAThreadDispatchCrossApartmentCall frame, it looks like the MTA thread has made a call into an STA. Experienced readers will recognize at this point that COM components are involved: the Finalizer thread cannot release a COM object that was created on an STA thread. So which thread created the COM component that was never properly released?
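To see why a stalled consumer shows up as a growing heap even though nothing is wrong on the producer side, here is a toy model. It is not how the CLR finalizer queue is implemented; the function, its parameters and the numbers are all invented for illustration. It only tracks how many objects are waiting for finalization after each tick.

```python
from collections import deque


def simulate(ticks, produced_per_tick, finalized_per_tick, stall_at=None):
    """Return the finalization backlog size after each tick.

    stall_at: tick index from which the consumer (the "finalizer thread"
    in this toy model) stops making progress, or None if it never stalls.
    """
    backlog = deque()
    history = []
    next_id = 0
    for tick in range(ticks):
        # Producer side: the application keeps creating finalizable objects.
        for _ in range(produced_per_tick):
            backlog.append(next_id)
            next_id += 1
        # Consumer side: drains the backlog unless it is stuck.
        stalled = stall_at is not None and tick >= stall_at
        if not stalled:
            for _ in range(min(finalized_per_tick, len(backlog))):
                backlog.popleft()
        history.append(len(backlog))
    return history


healthy = simulate(ticks=10, produced_per_tick=5, finalized_per_tick=5)
stuck = simulate(ticks=10, produced_per_tick=5, finalized_per_tick=5, stall_at=3)

print("healthy:", healthy)
print("stuck:  ", stuck)
```

With an identical producer in both runs, the backlog stays flat while the consumer keeps up and grows without bound once it stalls, which matches the situation in this dump: allocation behaviour is ordinary, but nothing behind the blocked finalizer ever gets cleaned up.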
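4. Finding the thread that created the COM component
Honestly, without a good understanding of COM this is hard to answer. But there is always a way out: when I went back to the thread list, I noticed that it contained no fewer than 38 STA threads.
That many STA threads in a background service is suspicious, so next I picked one of them and looked at its call stack.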
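Counting apartment types by eye in a list of more than 9000 threads is error-prone. One option is to save the !t output to a text file and tally the Apt column with a short script. The sketch below runs on a handful of rows taken from the listing in this article and assumes the column layout shown there (Apt is the tenth whitespace-separated field); other SOS versions may lay the columns out differently.

```python
from collections import Counter

# A few rows copied from the !t listing in this article.
T_ROWS = """\
   0    1 40e18 000000952fbdfe50    26020 Preemptive  0000000000000000:0000000000000000 000000952fbd2e40 0     STA
   2    2 41ce8 000000952fc0e4d0    2b220 Preemptive  0000000000000000:0000000000000000 000000952fbd2e40 0     MTA (Finalizer)
   4    4 41cb4 000000954c324970    27220 Preemptive  0000000000000000:0000000000000000 000000952fbd2e40 0     STA
   5    5 41cb8 000000954c36d1e0  2027220 Preemptive  0000000000000000:0000000000000000 000000952fbd2e40 0     STA
XXXX 9446    0 000000957a233410  8039820 Preemptive  0000000000000000:0000000000000000 000000952fbd2e40 0     MTA (Threadpool Completion Port)
"""

APT_FIELD = 9  # assumed position of the Apt column in this layout


def apartment_counts(text):
    """Tally threads by apartment type and collect debugger ids of live STA threads."""
    counts = Counter()
    live_sta_ids = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) <= APT_FIELD:
            continue  # header, blank or truncated line
        apt = fields[APT_FIELD]
        counts[apt] += 1
        # Dead threads have "XXXX" in place of a debugger thread id.
        if apt == "STA" and fields[0] != "XXXX":
            live_sta_ids.append(int(fields[0]))
    return counts, live_sta_ids


counts, live_sta_ids = apartment_counts(T_ROWS)
print(dict(counts), live_sta_ids)
```

The debugger ids collected this way can then be fed to ~Ns and !clrstack one at a time to inspect each STA thread's stack.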
0:004> !clrstack
OS Thread Id: 0x41cb4 (4)
        Child SP               IP Call Site
000000954da1ee38 00007ff88e220c8a [HelperMethodFrame: 000000954da1ee38] System.Threading.Thread.SleepInternal(Int32)
000000954da1ef30 00007ff88138c20a *** WARNING: Unable to verify checksum for mscorlib.ni.dll
System.Threading.Thread.Sleep(Int32)
000000954da1ef60 00007ff82322437f xxx.CFileLogTask.DoWork()
000000954da1f160 00007ff8232234d6 xxx.CTask.InitStart()
000000954da1f240 00007ff8813c31d3 System.Threading.ExecutionContext.RunInternal(System.Threading.ExecutionContext, System.Threading.ContextCallback, System.Object, Boolean)
000000954da1f310 00007ff8813c3064 System.Threading.ExecutionContext.Run(System.Threading.ExecutionContext, System.Threading.ContextCallback, System.Object, Boolean)
000000954da1f340 00007ff8813c3032 System.Threading.ExecutionContext.Run(System.Threading.ExecutionContext, System.Threading.ContextCallback, System.Object)
000000954da1f390 00007ff8813bc812 System.Threading.ThreadHelper.ThreadStart()
000000954da1f5e8 00007ff8827d6bb3 [GCFrame: 000000954da1f5e8]
000000954da1f938 00007ff8827d6bb3 [DebuggerU2MCatchHandlerFrame: 000000954da1f938]

Next, decompile the xxx.CFileLogTask.DoWork() method to see how it ends up being run by a Thread.
At this point everything falls into place: the culprit is the statement CurrThread.SetApartmentState(ApartmentState.STA);. I have no idea why a plain worker Thread needed to be given STA...
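III: Summary
The incident happened mainly because COM components were used on STA threads, so the Finalizer thread, which lives in the MTA, got stuck when it tried to release them. The Thread in question never started a message loop with Application.Run, and the STA thread itself was sitting in Sleep, so my personal impression is that the two simply could not communicate. My advice to my friend was to remove the STA setting from the Thread.
The nice takeaway here is that when memory balloons, the problem is not necessarily on the producer side; it may just as well be on the consumer side.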