System got high load average and it wasn’t responding for long which is a typical hang situation.
crash> sys | egrep -e LOAD -e CPUS
CPUS: 14
LOAD AVERAGE: 520.69, 210.35, 79.69
crash> hangcheck
[0 00:00:00.003] [UN] PID: 5507 TASK: ffff8d257723cf10 CPU: 6 COMMAND: "ora_dia0_gladp6"
[0 00:00:00.006] [UN] PID: 6068 TASK: ffff8d266239cf10 CPU: 7 COMMAND: "ora_ckpt_feslog"
...
[0 00:01:45.282] [UN] PID: 3212 TASK: ffff8d483fe04f10 CPU: 13 COMMAND: "cbdaemon"
[0 00:02:05.238] [UN] PID: 27266 TASK: ffff8d2686304f10 CPU: 9 COMMAND: "oraagent.bin"
[0 00:02:05.240] [UN] PID: 28076 TASK: ffff8d26fb221fa0 CPU: 6 COMMAND: "oraagent.bin"
[0 00:02:05.831] [UN] PID: 27390 TASK: ffff8d26633e3f40 CPU: 7 COMMAND: "oraagent.bin"
[0 00:02:05.832] [UN] PID: 27249 TASK: ffff8d483fca0000 CPU: 7 COMMAND: "oraagent.bin"
[0 00:02:05.922] [UN] PID: 28186 TASK: ffff8d2535b05ee0 CPU: 12 COMMAND: "oraagent.bin"
============================================================
Total 56 tasks were in D state. 5 tasks were in D state longer than 120 seconds\
It shows total 56 tasks in D (Uninterruptible) state and 5 tasks were in D state longer than 120 seconds which is considered as a hung task.
Let’s see what it was waiting for in this process.
crash> bt
PID: 28186 TASK: ffff8d2535b05ee0 CPU: 12 COMMAND: "oraagent.bin"
#0 [ffff8d2574b4bd38] __schedule at ffffffff86b19aa4
#1 [ffff8d2574b4bdc0] schedule at ffffffff86b1a119
#2 [ffff8d2574b4bdd0] rwsem_down_read_failed at ffffffff86b1b74d
#3 [ffff8d2574b4be50] call_rwsem_down_read_failed at ffffffff8675fb08
#4 [ffff8d2574b4bea0] down_read at ffffffff86b19240
#5 [ffff8d2574b4beb8] __do_page_fault at ffffffff86b2175c
#6 [ffff8d2574b4bf20] do_page_fault at ffffffff86b218d5
#7 [ffff8d2574b4bf50] page_fault at ffffffff86b1d758
RIP: 00007f08802f9c8d RSP: 00007f08306a3fc8 RFLAGS: 00010246
RAX: 0000000000000000 RBX: 00007f0885c8b890 RCX: 0000000000000004
RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000001
RBP: 00007f08306a44c0 R8: 00000000022f4230 R9: 00007f0885c83404
R10: 00007f07c4023420 R11: 00007f0885739d0e R12: 0000000000000000
R13: 0000000000201000 R14: 0000000000000000 R15: 00007f08306b3700
ORIG_RAX: ffffffffffffffff CS: 0033 SS: 002b
Alright, so, it was waiting for a semaphore. It is time to find the semaphore before find out who was holding this semaphore.
crash> edis -lr ffffffff86b2175c
0xffffffff86b213b0 <__do_page_fault>: nopl 0x0(%rax,%rax,1) [FTRACE NOP]
...
/usr/src/debug/kernel-3.10.0-862.27.1.el7/linux-3.10.0-862.27.1.el7.x86_64/arch/x86/mm/fault.c: 1238
1238 down_read(&mm->mmap_sem);
0xffffffff86b21753 <__do_page_fault+0x3a3>: mov -0x38(%rbp),%rdi ; 0xffff8d445a9df0f8
0xffffffff86b21757 <__do_page_fault+0x3a7>: callq 0xffffffff86b19220 <down_read>
0xffffffff86b2175c <__do_page_fault+0x3ac>: jmpq 0xffffffff86b214d3 <__do_page_fault+0x123>
Luckily ‘%rdi’ which is the first argument for down_read() was saved in stack. This stack value is automatically retrieved by ‘edis’ command and can be seen next to the op code.
crash> down_read
down_read = $2 =
{void (struct rw_semaphore *)} 0xffffffff86b19220
crash> crashinfo --rwsemaphore=0xffff8d445a9df0f8
<struct rw_semaphore 0xffff8d445a9df0f8>
27249 oraagent.bin
27266 oraagent.bin
27390 oraagent.bin
27815 oraagent.bin
28056 oraagent.bin
28061 oraagent.bin
28070 oraagent.bin
28076 oraagent.bin
28151 oraagent.bin
28186 oraagent.bin
28187 oraagent.bin
So, we can see there were some other tasks awaiting for this semaphore, but we have no idea who is holding it. With ‘psinfo’ command, we can search each stack to find out specific value (list) using ‘-i’ while ignore any tasks with the value (list) specified with ‘-e’.
crash> psinfo -s -i ffff8d445a9df0f8 -e down_read,do_sys_poll
PID: 27309 TASK: ffff8d483fc50000 CPU: 2 COMMAND: "oraagent.bin"
#0 [ffff8d25b1397b68] __schedule at ffffffff86b19aa4
ffff8d25b1397b70: 0000000000000082 ffff8d483fc50000
ffff8d25b1397b80: ffff8d25b1397fd8 ffff8d25b1397fd8
ffff8d25b1397b90: ffff8d25b1397fd8 ffff8d483fc50000
ffff8d25b1397ba0: 0000000000000001 0000000000000000
ffff8d25b1397bb0: ffff8d25a4378798 000000000280a1f1
ffff8d25b1397bc0: 8000000000000845 ffff8d25b1394000
ffff8d25b1397bd0: ffff8d2564bf2660 ffff8d22dffab660
ffff8d25b1397be0: 00007f07f42cc000 0000000000000000
ffff8d25b1397bf0: ffff8d25b1397c08 ffffffff864cebd6
#1 [ffff8d25b1397bf8] __cond_resched at ffffffff864cebd6
ffff8d25b1397c00: 8000000f2599d845 ffff8d25b1397c18
ffff8d25b1397c10: ffffffff86b1a3ba
#2 [ffff8d25b1397c10] _cond_resched at ffffffff86b1a3ba
ffff8d25b1397c18: ffff8d25b1397ce8 ffffffff865c7b2d
#3 [ffff8d25b1397c20] copy_pte_range at ffffffff865c7b2d
ffff8d25b1397c28: 0000000000000000 00000000ffffffff
ffff8d25b1397c38: 0000000180400040 ffff8d2636653f18
ffff8d25b1397c48: ffff8d445a9df118 ffff8d483fc50000
ffff8d25b1397c58: ffff8d25b1397cb4 fffff6d35492fcb0
ffff8d25b1397c68: ffff8d259ff95d08 ffff8d22f360fd08
ffff8d25b1397c78: ffff8d445a9df080 fffff6d34a7feaf0
ffff8d25b1397c88: ffff8d2636653e80 ffff8d25b1397fd8
ffff8d25b1397c98: ffff8d25a4378798 0000000000000000
ffff8d25b1397ca8: 0000000800000000 0000000000000000
ffff8d25b1397cb8: 000000000280a1f1 00007f07f4378000
ffff8d25b1397cc8: ffff8d259ff95d08 ffff8d22f360fd08
ffff8d25b1397cd8: 00007f07f4200000 00007f07f4378000
ffff8d25b1397ce8: ffff8d25b1397da0 ffffffff865c9b06
#4 [ffff8d25b1397cf0] copy_page_range at ffffffff865c9b06
ffff8d25b1397cf8: 00007f07f4378000 01ffffff865d5117
ffff8d25b1397d08: 00007f07f4000000 00007f07f4377fff
ffff8d25b1397d18: ffff8d1c85bee7f0 ffff8d25d1e7b0f8
ffff8d25b1397d28: 00007f07f4378000 ffff8d25db7d27f0
ffff8d25b1397d38: 00007f07f4377fff ffff8d259a3cc0f8
ffff8d25b1397d48: 00007f07f4378000 ffff8d25a4378798
ffff8d25b1397d58: ffff8d2636653e80 ffff8d445a9df080
ffff8d25b1397d68: fffffffd5367a000 00007f07f4377fff
ffff8d25b1397d78: ffff8d25a4378798 ffff8d2636653e80
ffff8d25b1397d88: 0000000000000000 0000000000000000
ffff8d25b1397d98: ffff8d21dbb31a28 ffff8d25b1397e28
ffff8d25b1397da8: ffffffff86491750
#5 [ffff8d25b1397da8] dup_mm at ffffffff86491750
ffff8d25b1397db0: ffff8d2636653ef8 ffff8d445a9df0f8
ffff8d25b1397dc0: ffff8d445a9df080 ffff8d21dbb31a48
ffff8d25b1397dd0: ffff8d21dbb31a50 ffff8d21dbb31a38
ffff8d25b1397de0: ffff8d21dbb30f30 ffff8d25b1397de8
ffff8d25b1397df0: ffff8d25b1397de8 000000000280a1f1
ffff8d25b1397e00: 0000000001200011 ffff8d20a3e29fa0
ffff8d25b1397e10: 00007f08325f29d0 0000000000000000
ffff8d25b1397e20: 0000000000000000 ffff8d25b1397eb8
ffff8d25b1397e30: ffffffff86492f42
#6 [ffff8d25b1397e30] copy_process at ffffffff86492f42
ffff8d25b1397e38: 000000000280a1f1 ffff8d20a3e2a478
ffff8d25b1397e48: 0000000000000000 0000000000000000
ffff8d25b1397e58: 0000000000000000 0000000000000000
ffff8d25b1397e68: 0000000000000000 0000000000000000
ffff8d25b1397e78: fffffff48642a621 0000000000000000
ffff8d25b1397e88: 000000000280a1f1 0000000000000000
ffff8d25b1397e98: 0000000000000000 0000000001200011
ffff8d25b1397ea8: 0000000000000000 0000000000000000
ffff8d25b1397eb8: ffff8d25b1397f30 ffffffff864936e1
#7 [ffff8d25b1397ec0] do_fork at ffffffff864936e1
ffff8d25b1397ec8: 0000000000000000 0000000000000000
ffff8d25b1397ed8: 0000000000000000 0000000000000000
ffff8d25b1397ee8: 0000000000000000 0000000000000000
ffff8d25b1397ef8: 0000000000000000 000000000280a1f1
ffff8d25b1397f08: 0000000000000000 0000000000000000
ffff8d25b1397f18: 0000000000000000 0000000000000000
ffff8d25b1397f28: 0000000000000000 ffff8d25b1397f40
ffff8d25b1397f38: ffffffff864939f6
#8 [ffff8d25b1397f38] sys_clone at ffffffff864939f6
ffff8d25b1397f40: 0000000000000000 ffffffff86b26b64
#9 [ffff8d25b1397f48] stub_clone at ffffffff86b26b64
RIP: 00007f087d377612 RSP: 00007f08325c9690 RFLAGS: 00000246
RAX: 0000000000000038 RBX: 00007f08325c9690 RCX: ffffffffffffffff
RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000001200011
RBP: 00007f08325c96d0 R8: 00007f08325f2700 R9: 0000000000006a41
R10: 00007f08325f29d0 R11: 0000000000000246 R12: 0000000000000000
R13: 0000000000000000 R14: 0000000000000000 R15: 00007f08325f2700
ORIG_RAX: 0000000000000038 CS: 0033 SS: 002b
crash> edis -lrn ffffffff86491750
0xffffffff864913b0 <dup_mm>: nopl 0x0(%rax,%rax,1) [FTRACE NOP]
0xffffffff864913b5 <dup_mm+0x5>: push %rbp ; 0xffff8d25b1397eb8
...
0xffffffff864914e2 <dup_mm+0x132>: lea 0x78(%r12),%rax
0xffffffff864914e7 <dup_mm+0x137>: mov %rax,%rdi
0xffffffff864914ea <dup_mm+0x13a>: mov %rax,-0x78(%rbp) ; 0xffff8d2636653ef8
0xffffffff864914ee <dup_mm+0x13e>: callq 0xffffffff86b19260 <down_write>
...
0xffffffff8649174b <dup_mm+0x39b>: callq 0xffffffff865c9690 <copy_page_range>
0xffffffff86491750 <dup_mm+0x3a0>: mov 0x90(%r15),%rdx
Now we can tell that this process 27309 was holding the semaphore and was doing other operation.
Leave a Reply