systemd at 100% CPU, with a Flood of Zombie Processes
One day, a large number of CentOS 7 servers suddenly started misbehaving: systemd was pegged at 100% CPU and zombie processes were piling up. (top output shown in the original screenshot.)
As the zombies accumulated, system resources were gradually exhausted and the machines eventually went down.
On CentOS 7, systemd runs as PID 1 and is responsible for reaping orphaned processes. In this incident, systemd spinning at 100% CPU is the cause and the zombie build-up is the effect, so the first step is to find out why systemd is burning all that CPU.
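The zombie mechanism itself is easy to demonstrate with a few lines of Python (a minimal, Linux-only sketch; the function names are mine): a child that has exited but has not yet been wait()ed on shows state Z in /proc, and disappears once its parent reaps it. If the reaper (here the parent; for orphans, PID 1) is too busy to call wait(), zombies accumulate exactly as observed.

```python
import os
import time

def proc_state(pid):
    """Return the one-letter state field from /proc/<pid>/stat (e.g. 'R', 'S', 'Z')."""
    with open("/proc/%d/stat" % pid) as f:
        data = f.read()
    # The comm field sits in parentheses and may contain spaces; split after it.
    return data.rsplit(")", 1)[1].split()[0]

def make_zombie_and_reap():
    """Fork a child that exits immediately, observe it as a zombie, then reap it."""
    pid = os.fork()
    if pid == 0:
        os._exit(0)               # child exits; the parent has not called wait() yet
    deadline = time.time() + 2.0
    state = proc_state(pid)
    while state != "Z" and time.time() < deadline:
        time.sleep(0.01)          # poll until the child's exit is visible
        state = proc_state(pid)
    os.waitpid(pid, 0)            # reaping removes the zombie's /proc entry
    return state
```

Running `make_zombie_and_reap()` returns `'Z'`: the process table entry lingers in zombie state for exactly as long as the reaper fails to collect it.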
After installing the systemd debuginfo package and profiling systemd with perf, the user-space functions burning the most CPU turned out to be endswith, hidden_file_allow_backup, dirent_ensure_type, hidden_file, and find_symlinks_fd; on the kernel side, dcache_readdir was hot. This suggests the kernel was busy reading directories. (perf report output shown in the original screenshot.)
Next, write a hidden_file.stp script that prints the user-space stack whenever hidden_file is called:
probe process("/usr/lib/systemd/systemd").function("hidden_file").call {
    print_usyms(ubacktrace())
}
Run stap hidden_file.stp; the output:
0x7f927f3b4b35 : 0x7f927f3b4b35 [/usr/lib64/libc-2.17.so+0x21b35/0x3bc000]
0x7f9280d04b20 : hidden_file+0x0/0x50 [/usr/lib/systemd/systemd]
0x7f9280cdeaf8 : find_symlinks_fd+0xc8/0x510 [/usr/lib/systemd/systemd]
0x7f9280ce386e : unit_file_lookup_state+0x26e/0x410 [/usr/lib/systemd/systemd]
0x7f9280ce4be7 : unit_file_get_list+0x1b7/0x390 [/usr/lib/systemd/systemd]
0x7f9280cc77ce : method_list_unit_files+0xae/0x220 [/usr/lib/systemd/systemd]
0x7f9280d294c7 : object_find_and_run+0xac7/0x1670 [/usr/lib/systemd/systemd]
0x7f9280d2a189 : bus_process_object+0x119/0x310 [/usr/lib/systemd/systemd]
0x7f9280d32e53 : bus_process_internal+0xdb3/0x1210 [/usr/lib/systemd/systemd]
0x7f9280d332d3 : io_callback+0x13/0x50 [/usr/lib/systemd/systemd]
0x7f9280d387d0 : source_dispatch+0x1c0/0x320 [/usr/lib/systemd/systemd]
0x7f9280d3986a : sd_event_dispatch+0x6a/0x1b0 [/usr/lib/systemd/systemd]
0x7f9280c992e3 : manager_loop+0x403/0x500 [/usr/lib/systemd/systemd]
0x7f9280c8d72b : main+0x1e7b/0x3e00 [/usr/lib/systemd/systemd]
Judging from the code of systemd's source_dispatch function, a SOURCE_IO event is being dispatched here:
static int source_dispatch(sd_event_source *s) {
        int r = 0;

        assert(s);
        assert(s->pending || s->type == SOURCE_EXIT);

        if (s->type != SOURCE_DEFER && s->type != SOURCE_EXIT) {
                r = source_set_pending(s, false);
                if (r < 0)
                        return r;
        }

        if (s->type != SOURCE_POST) {
                sd_event_source *z;
                Iterator i;

                /* If we execute a non-post source, let's mark all
                 * post sources as pending */

                SET_FOREACH(z, s->event->post_sources, i) {
                        if (z->enabled == SD_EVENT_OFF)
                                continue;

                        r = source_set_pending(z, true);
                        if (r < 0)
                                return r;
                }
        }

        if (s->enabled == SD_EVENT_ONESHOT) {
                r = sd_event_source_set_enabled(s, SD_EVENT_OFF);
                if (r < 0)
                        return r;
        }

        s->dispatching = true;

        switch (s->type) {

        case SOURCE_IO:
                r = s->io.callback(s, s->io.fd, s->io.revents, s->userdata);
                break;

        case SOURCE_TIME_REALTIME:
        case SOURCE_TIME_BOOTTIME:
        case SOURCE_TIME_MONOTONIC:
        case SOURCE_TIME_REALTIME_ALARM:
        case SOURCE_TIME_BOOTTIME_ALARM:
                r = s->time.callback(s, s->time.next, s->userdata);
                break;
So the suspicion was that a large number of SOURCE_IO events were being triggered on the machine. The function that registers new IO events is sd_event_add_io; probing it with systemtap and dumping user stacks showed it was never even entered — so nothing was flooding systemd with new SOURCE_IO events when the problem occurred. That means systemd must be stuck in some loop. Going back to the user stack and reading the code for loops, this one looked suspicious; roughly, it iterates over every directory in paths.unit_path:
int unit_file_get_list(
                UnitFileScope scope,
                const char *root_dir,
                Hashmap *h) {

        _cleanup_lookup_paths_free_ LookupPaths paths = {};
        char **i;
        int r;

        assert(scope >= 0);
        assert(scope < _UNIT_FILE_SCOPE_MAX);
        assert(h);

        r = verify_root_dir(scope, &root_dir);
        if (r < 0)
                return r;

        r = lookup_paths_init_from_scope(&paths, scope, root_dir);
        if (r < 0)
                return r;

        STRV_FOREACH(i, paths.unit_path) {
                _cleanup_closedir_ DIR *d = NULL;
                _cleanup_free_ char *units_dir;

                units_dir = path_join(root_dir, *i, NULL);
                if (!units_dir)
                        return -ENOMEM;

                d = opendir(units_dir);
                if (!d) {
                        if (errno == ENOENT)
                                continue;

                        return -errno;
                }

                for (;;) {
                        _cleanup_(unit_file_list_free_onep) UnitFileList *f = NULL;
                        struct dirent *de;

                        errno = 0;
                        de = readdir(d);
                        if (!de && errno != 0)
                                return -errno;

                        if (!de)
                                break;

                        if (hidden_file(de->d_name))
                                continue;
To confirm the loop's location, another systemtap script was used to test whether a given source line was actually executing; the code spinning endlessly was indeed that for (;;) loop. For example, probing line 2461 of src/shared/install.c:
probe process("/usr/lib/systemd/systemd").function("*@src/shared/install.c:2461").call {
    printf("hit\n")
}
Printing the directory being scanned showed it was /run/systemd/system/session-***.scope.d/
A look inside /run/systemd/ confirmed it: the directory was enormous. On some machines the directory itself had grown to 32 MB and contained several hundred thousand entries.
Googling for causes of large numbers of /run/systemd/system/session entries turned up a dbus bug with exactly the same symptoms: https://bugs.freedesktop.org/show_bug.cgi?id=95263. According to the maintainer's comment in that thread:
I think what I'm going to do here is:
* For 1.11.x (master): apply the patches you used, and revert the
"uid 0" workaround.
* For 1.10.x: stick with the "uid 0" workaround, because that workaround
is enough to address this for logind, which is the most important impact.
We can consider additionally backporting the patches from Bug #95619
and this bug later, once they have had more testing in master; but
if we do, we will also keep the workaround.
So the bug had been fixed on both the 1.11.x and 1.10.x branches. Yet the problem machines were running dbus-1.10.24-13.el7_6.x86_64, which should already contain the fix... puzzling. The next step was to check when dbus had been upgraded.
Sorting the session files on a problem machine by timestamp showed that entries from March and April were still present in bulk, while from May onward there were suddenly none — which matches the dbus upgrade time exactly.
So these sessions had accumulated under the old dbus before the upgrade, and upgrading dbus did not clean them up. That alone caused no harm — until one day some systemd command triggered a SOURCE_IO event and set off the problem. Deleting the session*.scope entries under /run/systemd/system/ older than the dbus upgrade time resolved the issue.
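The cleanup step can be sketched as a small script (a hypothetical helper; the pattern and cutoff are assumptions matching the diagnosis above — in practice a find/rm one-liner against /run/systemd/system does the same job):

```python
import fnmatch
import os

def stale_session_entries(root, cutoff_ts):
    """Return entries under `root` matching session-*.scope* whose mtime
    predates `cutoff_ts` (e.g. the dbus upgrade time)."""
    stale = []
    for entry in os.scandir(root):
        if (fnmatch.fnmatch(entry.name, "session-*.scope*")
                and entry.stat(follow_symlinks=False).st_mtime < cutoff_ts):
            stale.append(entry.path)
    return sorted(stale)

# Actual deletion would then iterate the result with shutil.rmtree()
# (for *.scope.d directories) or os.remove() (for plain files).
```

Selecting by mtime rather than deleting everything matters here: sessions created after the dbus upgrade are managed correctly and must be left alone.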