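The pdflush kernel thread pool and its hidden race conditions
The pdflush kernel thread pool is the process-context work environment that Linux creates for writing filesystem data back to disk. The implementation is compact: the whole file is under 250 lines.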

      1 /*
      2  * mm/pdflush.c - worker threads for writing back filesystem data
      3  *
      4  * Copyright (C) 2002, Linus Torvalds.
      5  *
      6  * 09Apr2002    akpm@zip.com.au
      7  *      Initial version
      8  * 29Feb2004    kaos@sgi.com
      9  *      Move worker thread creation to kthread to avoid chewing
     10  *      up stack space with nested calls to kernel_thread.
     11  */

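The file header carries the copyright notice and the main change log entries. kaos@sgi.com moved the creation of the worker threads over to kthread, mainly so that nested calls to kernel_thread would not eat up the stack of the parent worker thread. The effect of this change is visible in the output of ps: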

    root         5     1     5  0    1 21:31 ?        00:00:00 [kthread]
    root       114     5   114  0    1 21:31 ?        00:00:00 [pdflush]
    root       115     5   115  0    1 21:31 ?        00:00:00 [pdflush]

Every pdflush kernel thread has the kthread process (PID 5) as its parent.

     13 #include <linux/sched.h>
     14 #include <linux/list.h>
     15 #include <linux/signal.h>
     16 #include <linux/spinlock.h>
     17 #include <linux/gfp.h>
     18 #include <linux/init.h>
     19 #include <linux/module.h>
     20 #include <linux/fs.h>       // Needed by writeback.h
     21 #include <linux/writeback.h>    // Prototypes pdflush_operation()
     22 #include <linux/kthread.h>
     23 #include <linux/cpuset.h>

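These are the necessary header files. One small gripe: C++-style line comments have made their way into C, but seeing them in kernel code still feels wrong to me. Maybe I am being too picky; there is nothing actually wrong with them, and I probably need to move with the times.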

     26 /*
     27  * Minimum and maximum number of pdflush instances
     28  */
     29 #define MIN_PDFLUSH_THREADS 2
     30 #define MAX_PDFLUSH_THREADS 8
     31
     32 static void start_one_pdflush_thread(void);

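Lines 29 and 30 define the minimum and maximum number of pdflush kernel thread instances, 2 and 8 respectively. The minimum keeps the latency of an operation low; the maximum stops too many threads from degrading system performance. The maximum has a problem, though, and we will come back to it when we analyse the race conditions below.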

     35 /*
     36  * The pdflush threads are worker threads for writing back dirty data.
     37  * Ideally, we'd like one thread per active disk spindle.  But the disk
     38  * topology is very hard to divine at this level.   Instead, we take
     39  * care in various places to prevent more than one pdflush thread from
     40  * performing writeback against a single filesystem.  pdflush threads
     41  * have the PF_FLUSHER flag set in current->flags to aid in this.
     42  */

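The comment above is a short explanation of the pdflush thread pool. In essence: the pdflush threads are worker threads that write dirty data back. Ideally there would be one thread per active disk spindle, but the disk topology is hard to determine at this level, so care is taken in various places to keep more than one pdflush thread from writing back to the same filesystem. The PF_FLUSHER flag in current->flags helps with this.
The kernel developers are clearly quite "stingy" when it comes to efficiency and think things through carefully. They are just as careful about layering, and never step over the boundary between levels.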

     44 /*
     45  * All the pdflush threads.  Protected by pdflush_lock
     46  */
     47 static LIST_HEAD(pdflush_list);
     48 static DEFINE_SPINLOCK(pdflush_lock);
     49
     50 /*
     51  * The count of currently-running pdflush threads.  Protected
     52  * by pdflush_lock.
     53  *
     54  * Readable by sysctl, but not writable.  Published to userspace at
     55  * /proc/sys/vm/nr_pdflush_threads.
     56  */
     57 int nr_pdflush_threads = 0;
     58
     59 /*
     60  * The time at which the pdflush thread pool last went empty
     61  */
     62 static unsigned long last_empty_jifs;

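These are the necessary global variables. To avoid polluting the kernel namespace, the variables that do not need to be exported are declared static, which limits their scope to this compilation unit (the current pdflush.c file). All idle pdflush threads are linked on the doubly linked list pdflush_list. The variable nr_pdflush_threads counts the current pdflush threads (both active and idle). last_empty_jifs records the jiffies time at which the pool last went empty (that is, no thread was available). Every operation on the pool that needs mutual exclusion is protected by the spinlock pdflush_lock.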

     64 /*
     65  * The pdflush thread.
     66  *
     67  * Thread pool management algorithm:
     68  *
     69  * - The minimum and maximum number of pdflush instances are bound
     70  *   by MIN_PDFLUSH_THREADS and MAX_PDFLUSH_THREADS.
     71  *
     72  * - If there have been no idle pdflush instances for 1 second, create
     73  *   a new one.
     74  *
     75  * - If the least-recently-went-to-sleep pdflush thread has been asleep
     76  *   for more than one second, terminate a thread.
     77  */

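Yet another long comment. I don't know whether you are tired of them, but I am getting a little weary myself; I only meant to say a few words about the races and did not expect to drag in this much! The comment describes the thread pool algorithm:
1. The number of pdflush instances stays between MIN_PDFLUSH_THREADS and MAX_PDFLUSH_THREADS.
2. If there has been no idle pdflush instance for 1 second, a new one is created.
3. If the pdflush thread that went to sleep longest ago has been asleep for more than 1 second, a thread is terminated.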
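The rules of the algorithm can be written as two small predicates. This is only a userspace sketch under stated assumptions: `struct pool_state`, `should_spawn()` and `should_exit()` are hypothetical names that do not appear in mm/pdflush.c, and `HZ` is set to an arbitrary 100 ticks per second. The unsigned subtraction mirrors the `jiffies - x > 1 * HZ` tests in the kernel code.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins for the kernel constants. */
#define HZ 100UL            /* ticks per second, arbitrary for this sketch */
#define MIN_THREADS 2
#define MAX_THREADS 8

/* Snapshot of the pool state that the policy looks at (illustrative only). */
struct pool_state {
    unsigned long now;          /* current tick count, like jiffies */
    unsigned long last_empty;   /* tick at which the idle list last became empty */
    unsigned long oldest_sleep; /* tick at which the longest-idle thread went to sleep */
    int nr_threads;             /* all threads, busy and idle */
    int nr_idle;                /* threads currently on the idle list */
};

/* Rule 2: no idle thread for more than one second, and room to grow. */
bool should_spawn(const struct pool_state *s)
{
    assert(s->nr_idle >= 0 && s->nr_idle <= s->nr_threads);
    return s->now - s->last_empty > 1 * HZ &&
           s->nr_idle == 0 &&
           s->nr_threads < MAX_THREADS;
}

/* Rule 3: the longest-idle thread slept for over a second, and we are above the minimum. */
bool should_exit(const struct pool_state *s)
{
    assert(s->nr_idle >= 0 && s->nr_idle <= s->nr_threads);
    return s->nr_idle > 0 &&
           s->nr_threads > MIN_THREADS &&
           s->now - s->oldest_sleep > 1 * HZ;
}
```

Rule 1 shows up as the `MIN_THREADS` and `MAX_THREADS` bounds inside the two predicates. Note that both predicates are only evaluated by a thread that has just finished a job, which matters for the second race discussed later.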

     79 /*
     80  * A structure for passing work to a pdflush thread.  Also for passing
     81  * state information between pdflush threads.  Protected by pdflush_lock.
     82  */
     83 struct pdflush_work {
     84         struct task_struct *who;        /* The thread */
     85         void (*fn)(unsigned long);      /* A callback function */
     86         unsigned long arg0;             /* An argument to the callback */
     87         struct list_head list;          /* On pdflush_list, when idle */
     88         unsigned long when_i_went_to_sleep;
     89 };

This defines the node structure kept for each thread instance. It is simple enough to need no further comment.
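For readers unfamiliar with the embedded `struct list_head`, here is a minimal userspace imitation of the idea. It is a sketch, not the kernel's list.h: `list_init()`, `list_add_front()`, `list_is_empty()`, `struct work` and `oldest_idle()` are names made up for this example, and `list_entry` is re-implemented with `offsetof`.

```c
#include <assert.h>
#include <stddef.h>

/* Minimal userspace imitation of the kernel's struct list_head. */
struct list_head {
    struct list_head *next, *prev;
};

/* Recover the enclosing structure from a pointer to its embedded member. */
#define list_entry(ptr, type, member) \
    ((type *)((char *)(ptr) - offsetof(type, member)))

void list_init(struct list_head *h)
{
    h->next = h;
    h->prev = h;
}

int list_is_empty(const struct list_head *h)
{
    return h->next == h;
}

/* Insert entry right after head, i.e. at the front of the list. */
void list_add_front(struct list_head *entry, struct list_head *head)
{
    entry->next = head->next;
    entry->prev = head;
    head->next->prev = entry;
    head->next = entry;
}

/* Simplified counterpart of struct pdflush_work: the list node lives inside the record. */
struct work {
    int id;
    unsigned long when_i_went_to_sleep;
    struct list_head list;
};

/* The entry at the tail (head->prev) is the one that was added first. */
struct work *oldest_idle(struct list_head *head)
{
    assert(!list_is_empty(head));
    return list_entry(head->prev, struct work, list);
}
```

Because the node is embedded in the record, putting a thread on the idle list needs no extra allocation, and the record itself can live on the thread's own stack.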
Now that we have gone through the basic data structures and variables, we start the analysis from the module_init entry point:

    232 static int __init pdflush_init(void)
    233 {
    234         int i;
    235
    236         for (i = 0; i < MIN_PDFLUSH_THREADS; i++)
    237                 start_one_pdflush_thread();
    238         return 0;
    239 }
    240
    241 module_init(pdflush_init);

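This creates MIN_PDFLUSH_THREADS pdflush thread instances. Note that there is only a module_init() and no module_exit(). The implication is that even if this code were built as a kernel module, it could be loaded but never removed. See the implementation of sys_delete_module():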
File: kernel/module.c

    609     /* If it has an init func, it must have an exit func to unload */
    610     if ((mod->init != NULL && mod->exit == NULL)
    611         || mod->unsafe) {
    612         forced = try_force(flags);
    613         if (!forced) {
    614             /* This module can't be removed */
    615             ret = -EBUSY;
    616             goto out;
    617         }
    618     }

    498 #ifdef CONFIG_MODULE_FORCE_UNLOAD
    499 static inline int try_force(unsigned int flags)
    500 {
    501     int ret = (flags & O_TRUNC);
    502     if (ret)
    503         add_taint(TAINT_FORCED_MODULE);
    504     return ret;
    505 }
    506 #else
    507 static inline int try_force(unsigned int flags)
    508 {
    509     return 0;
    510 }
    511 #endif /* CONFIG_MODULE_FORCE_UNLOAD */

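The two excerpts above can be condensed into a single decision function. The following is a userspace model only: `struct fake_module` and `may_unload()` are hypothetical, and the compile-time option CONFIG_MODULE_FORCE_UNLOAD is modelled as a run-time parameter purely so that both cases can be exercised.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdbool.h>
#include <stddef.h>

/* Illustrative model of a module; the real struct module has many more fields. */
struct fake_module {
    bool has_init;
    bool has_exit;
    bool unsafe;
};

/*
 * Returns 0 if unloading may proceed, -EBUSY otherwise.
 * force_unload_built_in models CONFIG_MODULE_FORCE_UNLOAD as a run-time
 * parameter, which is an assumption made for this sketch.
 */
int may_unload(const struct fake_module *mod, unsigned int flags,
               bool force_unload_built_in)
{
    bool needs_force;

    assert(mod != NULL);
    needs_force = (mod->has_init && !mod->has_exit) || mod->unsafe;

    if (!needs_force)
        return 0;
    if (force_unload_built_in && (flags & O_TRUNC))
        return 0;   /* the kernel would also taint itself at this point */
    return -EBUSY;
}
```

A module shaped like pdflush (an init function but no exit function) is refused unless forced unloading is both compiled in and requested with O_TRUNC.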
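So unless forced module unloading was selected at build time (note: this option is dangerous, do not try it) and the caller asks for it with O_TRUNC, a module like this cannot be unloaded. Back to pdflush: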

    227 static void start_one_pdflush_thread(void)
    228 {
    229         kthread_run(pdflush, NULL, "pdflush");
    230 }

kthread_run asks the kthread helper thread to create a pdflush kernel thread instance:

    164 /*
    165  * Of course, my_work wants to be just a local in __pdflush().  It is
    166  * separated out in this manner to hopefully prevent the compiler from
    167  * performing unfortunate optimisations against the auto variables.  Because
    168  * these are visible to other tasks and CPUs.  (No problem has actually
    169  * been observed.  This is just paranoia).
    170  */

This comment is rather interesting. To keep the compiler from optimising the local variable my_work in unwanted ways (for example by keeping it in registers), the processing is split into pdflush wrapping __pdflush. Compared with dynamically allocated memory, a local variable is in fact a win in both space and time.

    171 static int pdflush(void *dummy)
    172 {
    173         struct pdflush_work my_work;
    174         cpumask_t cpus_allowed;
    175
    176         /*
    177          * pdflush can spend a lot of time doing encryption via dm-crypt.  We
    178          * don't want to do that at keventd's priority.
    179          */
    180         set_user_nice(current, 0);

Fine-tunes the priority to improve the overall responsiveness of the system.

    181
    182         /*
    183          * Some configs put our parent kthread in a limited cpuset,
    184          * which kthread() overrides, forcing cpus_allowed == CPU_MASK_ALL.
    185          * Our needs are more modest - cut back to our cpusets cpus_allowed.
    186          * This is needed as pdflush's are dynamically created and destroyed.
    187          * The boottime pdflush's are easily placed w/o these 2 lines.
    188          */
    189         cpus_allowed = cpuset_cpus_allowed(current);
    190         set_cpus_allowed(current, cpus_allowed);

Sets the mask of CPUs on which the thread is allowed to run.

    191
    192         return __pdflush(&my_work);
    193 }


     91 static int __pdflush(struct pdflush_work *my_work)
     92 {
     93         current->flags |= PF_FLUSHER;
     94         my_work->fn = NULL;
     95         my_work->who = current;
     96         INIT_LIST_HEAD(&my_work->list);

Some initialisation.

     97
     98         spin_lock_irq(&pdflush_lock);

nr_pdflush_threads and pdflush_list are about to be modified, so the lock is needed. To be safe (pdflush work may be queued from hard interrupt context), hard interrupts are disabled as well.

     99         nr_pdflush_threads++;

Increments nr_pdflush_threads, because there is now one more pdflush kernel thread instance.

    100         for ( ; ; ) {
    101                 struct pdflush_work *pdf;
    102
    103                 set_current_state(TASK_INTERRUPTIBLE);
    104                 list_move(&my_work->list, &pdflush_list);
    105                 my_work->when_i_went_to_sleep = jiffies;
    106                 spin_unlock_irq(&pdflush_lock);
    107
    108                 schedule();

The thread puts itself on the idle list pdflush_list, then gives up the CPU and waits to be woken.

    109                 if (try_to_freeze()) {
    110                         spin_lock_irq(&pdflush_lock);
    111                         continue;
    112                 }

If the current process is being frozen, go round the loop again.

    113
    114                 spin_lock_irq(&pdflush_lock);
    115                 if (!list_empty(&my_work->list)) {
    116                         printk("pdflush: bogus wakeup!\n");
    117                         my_work->fn = NULL;
    118                         continue;
    119                 }
    120                 if (my_work->fn == NULL) {
    121                         printk("pdflush: NULL work function\n");
    122                         continue;
    123                 }
    124                 spin_unlock_irq(&pdflush_lock);

The code above handles unexpected wakeups.

    125
    126                 (*my_work->fn)(my_work->arg0);
    127

Runs the work function with the argument arg0.

    128                 /*
    129                  * Thread creation: For how long have there been zero
    130                  * available threads?
    131                  */
    132                 if (jiffies - last_empty_jifs > 1 * HZ) {
    133                         /* unlocked list_empty() test is OK here */
    134                         if (list_empty(&pdflush_list)) {
    135                                 /* unlocked test is OK here */
    136                                 if (nr_pdflush_threads < MAX_PDFLUSH_THREADS)
    137                                         start_one_pdflush_thread();
    138                         }
    139                 }

If pdflush_list has been empty for more than 1 second and the thread count still has room to grow, a new pdflush thread instance is started.

    140
    141                 spin_lock_irq(&pdflush_lock);
    142                 my_work->fn = NULL;
    143
    144                 /*
    145                  * Thread destruction: For how long has the sleepiest
    146                  * thread slept?
    147                  */
    148                 if (list_empty(&pdflush_list))
    149                         continue;

If pdflush_list is still empty, continue the loop.

    150                 if (nr_pdflush_threads <= MIN_PDFLUSH_THREADS)
    151                         continue;

If the thread count is not above the minimum, continue the loop.

    152                 pdf = list_entry(pdflush_list.prev, struct pdflush_work, list);
    153                 if (jiffies - pdf->when_i_went_to_sleep > 1 * HZ) {
    154                         /* Limit exit rate */
    155                         pdf->when_i_went_to_sleep = jiffies;
    156                         break;                                  /* exeunt */
    157                 }

If the last kernel thread on pdflush_list has been asleep for more than 1 second, the system has probably become fairly quiet, so this thread exits. Why the last one? Because the list is used as a stack, so the element at the bottom is necessarily the oldest.

    158         }
    159         nr_pdflush_threads--;
    160         spin_unlock_irq(&pdflush_lock);
    161         return 0;

Decrements nr_pdflush_threads and exits this thread.

    162 }

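Has something been left out? Yes: nothing seems to handle SIGCHLD. In fact, threads created through kthread clean up after themselves. The parent never has to wait for them and no zombie processes are produced. See: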
File: kernel/workqueue.c

    200     /* SIG_IGN makes children autoreap: see do_notify_parent(). */
    201     sa.sa.sa_handler = SIG_IGN;
    202     sa.sa.sa_flags = 0;
    203     siginitset(&sa.sa.sa_mask, sigmask(SIGCHLD));
    204     do_sigaction(SIGCHLD, &sa, (struct k_sigaction *)0);

The sigaction manual page describes the "consequences" of ignoring SIGCHLD in detail:

    POSIX.1-1990 disallowed setting the action for SIGCHLD to SIG_IGN.
    POSIX.1-2001 allows this possibility, so that ignoring SIGCHLD can
    be used to prevent the creation of zombies (see wait(2)).
    Nevertheless, the historical BSD and System V behaviours for ignoring
    SIGCHLD differ, so that the only completely portable method of
    ensuring that terminated children do not become zombies is to catch
    the SIGCHLD signal and perform a wait(2) or similar.

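The Linux kernel clearly follows the newer POSIX standard, which also gives us a "simple" way to avoid zombie processes. Be aware, though, that this technique is not portable.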
Now go back to the function __pdflush() once more, this time paying attention to its races:

    135                                 /* unlocked test is OK here */
    136                                 if (nr_pdflush_threads < MAX_PDFLUSH_THREADS)
    137                                         start_one_pdflush_thread();

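Testing the thread count without the lock does not corrupt any data. But if several threads evaluate nr_pdflush_threads concurrently and all conclude that there is still room to grow, each of them calls start_one_pdflush_thread() to create a new pdflush thread instance. The counter is only incremented later, by the new thread itself when it enters __pdflush(), which widens the window. The thread count can therefore exceed MAX_PDFLUSH_THREADS, in the worst case reaching about twice that value.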
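One way to close this kind of check-then-act window is to test the limit and claim the slot in a single atomic step, before the new thread is created. The sketch below illustrates the idea in user space with C11 atomics. It is not the kernel's code and not a patch for it: `reserve_worker_slot()`, `release_worker_slot()` and `MAX_WORKERS` are hypothetical names, and a kernel version would more likely do the same thing under pdflush_lock.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

#define MAX_WORKERS 8           /* plays the role of MAX_PDFLUSH_THREADS */

static atomic_int nr_workers;   /* zero-initialised */

/*
 * Check the limit and claim a slot in one atomic step, so two callers can
 * never both see "room for one more" for the same slot.
 */
bool reserve_worker_slot(void)
{
    int seen = atomic_load(&nr_workers);

    while (seen < MAX_WORKERS) {
        /*
         * Succeeds only if nobody changed the counter since we read it.
         * On failure, seen is refreshed with the current value and the
         * limit is checked again.
         */
        if (atomic_compare_exchange_weak(&nr_workers, &seen, seen + 1))
            return true;
    }
    return false;
}

/* Give the slot back if thread creation fails, or when a worker exits. */
void release_worker_slot(void)
{
    int before = atomic_fetch_sub(&nr_workers, 1);

    assert(before > 0);
    (void)before;               /* silence the warning when NDEBUG is set */
}
```

A caller would create the thread only after `reserve_worker_slot()` returns true, so the counter already includes threads that have been requested but have not started running yet.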
Now look at the lines that follow:

    152                 pdf = list_entry(pdflush_list.prev, struct pdflush_work, list);
    153                 if (jiffies - pdf->when_i_went_to_sleep > 1 * HZ) {
    154                         /* Limit exit rate */
    155                         pdf->when_i_went_to_sleep = jiffies;
    156                         break;                                  /* exeunt */
    157                 }

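Consider a sudden burst of requests that all finish at the same moment. None of the threads satisfies the test on line 153 when it finishes, so they all go back to sleep. If no new request arrives during the following n seconds, the number of pdflush kernel threads stays at its peak for those n seconds, because the exit check is only made by a thread that has just finished a job. This does not meet rule 3 of the original design (terminate a thread once the longest-sleeping thread has been asleep for more than one second).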
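The burst scenario can be replayed with a small deterministic model. This is a simplification written for this article, not kernel code: `threads_left_after_burst()` is hypothetical, it serialises the exit checks of threads that finish at the same tick, and `HZ` is an arbitrary 100.

```c
#include <assert.h>

#define HZ 100UL
#define MIN_THREADS 2
#define MAX_THREADS 8

/*
 * All nr_threads workers finish their job at tick `now` and run the exit
 * check one after another. Returns how many threads remain afterwards.
 */
int threads_left_after_burst(int nr_threads, unsigned long now)
{
    unsigned long sleep_start[MAX_THREADS];  /* idle stack, index 0 = oldest */
    int nr_idle = 0;
    int remaining = nr_threads;

    assert(nr_threads >= 0 && nr_threads <= MAX_THREADS);

    for (int i = 0; i < nr_threads; i++) {
        /* Same three conditions as lines 148 to 153 of __pdflush(). */
        int may_exit = nr_idle > 0 &&
                       remaining > MIN_THREADS &&
                       now - sleep_start[0] > 1 * HZ;

        if (may_exit) {
            sleep_start[0] = now;       /* limit exit rate */
            remaining--;
            continue;
        }
        sleep_start[nr_idle++] = now;   /* go idle at the same tick */
    }
    return remaining;
}
```

Since every thread goes idle at the same tick, the oldest sleeper has slept for zero ticks whenever the check runs, so in this model no thread exits and the pool stays at its peak until another job wakes one of them.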

    195 /*
    196  * Attempt to wake up a pdflush thread, and get it to do some work for you.
    197  * Returns zero if it indeed managed to find a worker thread, and passed your
    198  * payload to it.
    199  */
    200 int pdflush_operation(void (*fn)(unsigned long), unsigned long arg0)
    201 {
    202         unsigned long flags;
    203         int ret = 0;
    204
    205         if (fn == NULL)
    206                 BUG();          /* Hard to diagnose if it's deferred */
    207
    208         spin_lock_irqsave(&pdflush_lock, flags);
    209         if (list_empty(&pdflush_list)) {
    210                 spin_unlock_irqrestore(&pdflush_lock, flags);
    211                 ret = -1;
    212         } else {
    213                 struct pdflush_work *pdf;
    214
    215                 pdf = list_entry(pdflush_list.next, struct pdflush_work, list);
    216                 list_del_init(&pdf->list);
    217                 if (list_empty(&pdflush_list))
    218                         last_empty_jifs = jiffies;
    219                 pdf->fn = fn;
    220                 pdf->arg0 = arg0;
    221                 wake_up_process(pdf->who);
    222                 spin_unlock_irqrestore(&pdflush_lock, flags);
    223         }
    224         return ret;
    225 }

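The function above hands work to a pdflush thread. If an idle thread is available, the task is assigned to it and the thread is woken up to run it; otherwise the function returns -1.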
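The dispatch logic can be sketched in user space as a stack of idle workers. Everything here is an assumption made for illustration: `struct idle_stack`, `go_idle()`, `dispatch()` and the sample job `record_arg()` are hypothetical, an array replaces the linked list, and the wakeup is reduced to a comment.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_WORKERS 8

typedef void (*work_fn)(unsigned long);

struct worker {
    work_fn fn;             /* NULL while idle */
    unsigned long arg0;
};

/* Idle workers, most recently idled on top (array-based for simplicity). */
struct idle_stack {
    struct worker *slot[MAX_WORKERS];
    int depth;
};

/* Sample job used to exercise the sketch: remembers its argument. */
static unsigned long last_arg;
void record_arg(unsigned long arg)
{
    last_arg = arg;
}

/* Push a worker onto the idle stack; returns -1 if the stack is full. */
int go_idle(struct idle_stack *s, struct worker *w)
{
    if (s->depth == MAX_WORKERS)
        return -1;
    w->fn = NULL;
    s->slot[s->depth++] = w;
    return 0;
}

/*
 * Hand (fn, arg0) to the most recently idled worker.
 * Returns 0 on success, -1 if no worker is idle.
 */
int dispatch(struct idle_stack *s, work_fn fn, unsigned long arg0)
{
    struct worker *w;

    assert(fn != NULL);     /* mirrors the BUG() check in pdflush_operation() */
    if (s->depth == 0)
        return -1;
    w = s->slot[--s->depth];
    w->fn = fn;
    w->arg0 = arg0;
    /* A real pool would wake the worker thread here. */
    return 0;
}
```

Taking the most recently idled worker leaves the longest sleeper at the bottom of the stack, which is why the exit check in __pdflush() only needs to look at the tail of pdflush_list.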
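Summary:
Kernel programming demands rigorous thinking. The slightest carelessness can cause surprises, and however short your code is, you must be extremely careful. The pdflush thread pool does contain the two races described above, but neither has serious consequences. They merely fail to meet the design requirements, so the implementation cannot be held up as a model one.
Note:
This article uses "kernel thread", "thread" and "process" interchangeably, but they all mean "kernel thread". There is nothing wrong with that: "thread" is short for "kernel thread", and a kernel thread is essentially one of a group of "processes" that share the kernel data space, so swapping the terms in places does no harm.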
Original: http://blog.chinaunix.net/u/5251/showart_320793.html
Reposted from: https://www.cnblogs.com/yuanfang/archive/2010/12/24/1916227.html