Sungju's Slow Life

Personal journal


An example case with some of my commands

System got high load average and it wasn’t responding for long which is a typical hang situation.

crash> sys | egrep -e LOAD -e CPUS
        CPUS: 14
LOAD AVERAGE: 520.69, 210.35, 79.69
crash> hangcheck
[0 00:00:00.003] [UN]  PID: 5507   TASK: ffff8d257723cf10  CPU: 6   COMMAND: "ora_dia0_gladp6"
[0 00:00:00.006] [UN]  PID: 6068   TASK: ffff8d266239cf10  CPU: 7   COMMAND: "ora_ckpt_feslog"
...
[0 00:01:45.282] [UN]  PID: 3212   TASK: ffff8d483fe04f10  CPU: 13  COMMAND: "cbdaemon"
[0 00:02:05.238] [UN]  PID: 27266  TASK: ffff8d2686304f10  CPU: 9   COMMAND: "oraagent.bin"
[0 00:02:05.240] [UN]  PID: 28076  TASK: ffff8d26fb221fa0  CPU: 6   COMMAND: "oraagent.bin"
[0 00:02:05.831] [UN]  PID: 27390  TASK: ffff8d26633e3f40  CPU: 7   COMMAND: "oraagent.bin"
[0 00:02:05.832] [UN]  PID: 27249  TASK: ffff8d483fca0000  CPU: 7   COMMAND: "oraagent.bin"
[0 00:02:05.922] [UN]  PID: 28186  TASK: ffff8d2535b05ee0  CPU: 12  COMMAND: "oraagent.bin"
============================================================
Total 56 tasks were in D state. 5 tasks were in D state longer than 120 seconds\

It shows total 56 tasks in D (Uninterruptible) state and 5 tasks were in D state longer than 120 seconds which is considered as a hung task.

Let’s see what it was waiting for in this process.

crash> bt
PID: 28186  TASK: ffff8d2535b05ee0  CPU: 12  COMMAND: "oraagent.bin"
 #0 [ffff8d2574b4bd38] __schedule at ffffffff86b19aa4
 #1 [ffff8d2574b4bdc0] schedule at ffffffff86b1a119
 #2 [ffff8d2574b4bdd0] rwsem_down_read_failed at ffffffff86b1b74d
 #3 [ffff8d2574b4be50] call_rwsem_down_read_failed at ffffffff8675fb08
 #4 [ffff8d2574b4bea0] down_read at ffffffff86b19240
 #5 [ffff8d2574b4beb8] __do_page_fault at ffffffff86b2175c
 #6 [ffff8d2574b4bf20] do_page_fault at ffffffff86b218d5
 #7 [ffff8d2574b4bf50] page_fault at ffffffff86b1d758
    RIP: 00007f08802f9c8d  RSP: 00007f08306a3fc8  RFLAGS: 00010246
    RAX: 0000000000000000  RBX: 00007f0885c8b890  RCX: 0000000000000004
    RDX: 0000000000000000  RSI: 0000000000000000  RDI: 0000000000000001
    RBP: 00007f08306a44c0   R8: 00000000022f4230   R9: 00007f0885c83404
    R10: 00007f07c4023420  R11: 00007f0885739d0e  R12: 0000000000000000
    R13: 0000000000201000  R14: 0000000000000000  R15: 00007f08306b3700
    ORIG_RAX: ffffffffffffffff  CS: 0033  SS: 002b

Alright, so, it was waiting for a semaphore. It is time to find the semaphore before find out who was holding this semaphore.

crash> edis -lr ffffffff86b2175c
0xffffffff86b213b0 <__do_page_fault>:	nopl   0x0(%rax,%rax,1) [FTRACE NOP]
...
/usr/src/debug/kernel-3.10.0-862.27.1.el7/linux-3.10.0-862.27.1.el7.x86_64/arch/x86/mm/fault.c: 1238
    1238 		down_read(&mm->mmap_sem);
0xffffffff86b21753 <__do_page_fault+0x3a3>:	mov    -0x38(%rbp),%rdi    ; 0xffff8d445a9df0f8
0xffffffff86b21757 <__do_page_fault+0x3a7>:	callq  0xffffffff86b19220 <down_read>
0xffffffff86b2175c <__do_page_fault+0x3ac>:	jmpq   0xffffffff86b214d3 <__do_page_fault+0x123>

Luckily ‘%rdi’ which is the first argument for down_read() was saved in stack. This stack value is automatically retrieved by ‘edis’ command and can be seen next to the op code.

crash> down_read
down_read = $2 = 
 {void (struct rw_semaphore *)} 0xffffffff86b19220
crash> crashinfo --rwsemaphore=0xffff8d445a9df0f8
<struct rw_semaphore 0xffff8d445a9df0f8>
	   27249  oraagent.bin
	   27266  oraagent.bin
	   27390  oraagent.bin
	   27815  oraagent.bin
	   28056  oraagent.bin
	   28061  oraagent.bin
	   28070  oraagent.bin
	   28076  oraagent.bin
	   28151  oraagent.bin
	   28186  oraagent.bin
	   28187  oraagent.bin

So, we can see there were some other tasks awaiting for this semaphore, but we have no idea who is holding it. With ‘psinfo’ command, we can search each stack to find out specific value (list) using ‘-i’ while ignore any tasks with the value (list) specified with ‘-e’.

crash> psinfo -s -i ffff8d445a9df0f8 -e down_read,do_sys_poll

PID: 27309  TASK: ffff8d483fc50000  CPU: 2   COMMAND: "oraagent.bin"
 #0 [ffff8d25b1397b68] __schedule at ffffffff86b19aa4
    ffff8d25b1397b70: 0000000000000082 ffff8d483fc50000 
    ffff8d25b1397b80: ffff8d25b1397fd8 ffff8d25b1397fd8 
    ffff8d25b1397b90: ffff8d25b1397fd8 ffff8d483fc50000 
    ffff8d25b1397ba0: 0000000000000001 0000000000000000 
    ffff8d25b1397bb0: ffff8d25a4378798 000000000280a1f1 
    ffff8d25b1397bc0: 8000000000000845 ffff8d25b1394000 
    ffff8d25b1397bd0: ffff8d2564bf2660 ffff8d22dffab660 
    ffff8d25b1397be0: 00007f07f42cc000 0000000000000000 
    ffff8d25b1397bf0: ffff8d25b1397c08 ffffffff864cebd6 
 #1 [ffff8d25b1397bf8] __cond_resched at ffffffff864cebd6
    ffff8d25b1397c00: 8000000f2599d845 ffff8d25b1397c18 
    ffff8d25b1397c10: ffffffff86b1a3ba 
 #2 [ffff8d25b1397c10] _cond_resched at ffffffff86b1a3ba
    ffff8d25b1397c18: ffff8d25b1397ce8 ffffffff865c7b2d 
 #3 [ffff8d25b1397c20] copy_pte_range at ffffffff865c7b2d
    ffff8d25b1397c28: 0000000000000000 00000000ffffffff 
    ffff8d25b1397c38: 0000000180400040 ffff8d2636653f18 
    ffff8d25b1397c48: ffff8d445a9df118 ffff8d483fc50000 
    ffff8d25b1397c58: ffff8d25b1397cb4 fffff6d35492fcb0 
    ffff8d25b1397c68: ffff8d259ff95d08 ffff8d22f360fd08 
    ffff8d25b1397c78: ffff8d445a9df080 fffff6d34a7feaf0 
    ffff8d25b1397c88: ffff8d2636653e80 ffff8d25b1397fd8 
    ffff8d25b1397c98: ffff8d25a4378798 0000000000000000 
    ffff8d25b1397ca8: 0000000800000000 0000000000000000 
    ffff8d25b1397cb8: 000000000280a1f1 00007f07f4378000 
    ffff8d25b1397cc8: ffff8d259ff95d08 ffff8d22f360fd08 
    ffff8d25b1397cd8: 00007f07f4200000 00007f07f4378000 
    ffff8d25b1397ce8: ffff8d25b1397da0 ffffffff865c9b06 
 #4 [ffff8d25b1397cf0] copy_page_range at ffffffff865c9b06
    ffff8d25b1397cf8: 00007f07f4378000 01ffffff865d5117 
    ffff8d25b1397d08: 00007f07f4000000 00007f07f4377fff 
    ffff8d25b1397d18: ffff8d1c85bee7f0 ffff8d25d1e7b0f8 
    ffff8d25b1397d28: 00007f07f4378000 ffff8d25db7d27f0 
    ffff8d25b1397d38: 00007f07f4377fff ffff8d259a3cc0f8 
    ffff8d25b1397d48: 00007f07f4378000 ffff8d25a4378798 
    ffff8d25b1397d58: ffff8d2636653e80 ffff8d445a9df080 
    ffff8d25b1397d68: fffffffd5367a000 00007f07f4377fff 
    ffff8d25b1397d78: ffff8d25a4378798 ffff8d2636653e80 
    ffff8d25b1397d88: 0000000000000000 0000000000000000 
    ffff8d25b1397d98: ffff8d21dbb31a28 ffff8d25b1397e28 
    ffff8d25b1397da8: ffffffff86491750 
 #5 [ffff8d25b1397da8] dup_mm at ffffffff86491750
    ffff8d25b1397db0: ffff8d2636653ef8 ffff8d445a9df0f8 
    ffff8d25b1397dc0: ffff8d445a9df080 ffff8d21dbb31a48 
    ffff8d25b1397dd0: ffff8d21dbb31a50 ffff8d21dbb31a38 
    ffff8d25b1397de0: ffff8d21dbb30f30 ffff8d25b1397de8 
    ffff8d25b1397df0: ffff8d25b1397de8 000000000280a1f1 
    ffff8d25b1397e00: 0000000001200011 ffff8d20a3e29fa0 
    ffff8d25b1397e10: 00007f08325f29d0 0000000000000000 
    ffff8d25b1397e20: 0000000000000000 ffff8d25b1397eb8 
    ffff8d25b1397e30: ffffffff86492f42 
 #6 [ffff8d25b1397e30] copy_process at ffffffff86492f42
    ffff8d25b1397e38: 000000000280a1f1 ffff8d20a3e2a478 
    ffff8d25b1397e48: 0000000000000000 0000000000000000 
    ffff8d25b1397e58: 0000000000000000 0000000000000000 
    ffff8d25b1397e68: 0000000000000000 0000000000000000 
    ffff8d25b1397e78: fffffff48642a621 0000000000000000 
    ffff8d25b1397e88: 000000000280a1f1 0000000000000000 
    ffff8d25b1397e98: 0000000000000000 0000000001200011 
    ffff8d25b1397ea8: 0000000000000000 0000000000000000 
    ffff8d25b1397eb8: ffff8d25b1397f30 ffffffff864936e1 
 #7 [ffff8d25b1397ec0] do_fork at ffffffff864936e1
    ffff8d25b1397ec8: 0000000000000000 0000000000000000 
    ffff8d25b1397ed8: 0000000000000000 0000000000000000 
    ffff8d25b1397ee8: 0000000000000000 0000000000000000 
    ffff8d25b1397ef8: 0000000000000000 000000000280a1f1 
    ffff8d25b1397f08: 0000000000000000 0000000000000000 
    ffff8d25b1397f18: 0000000000000000 0000000000000000 
    ffff8d25b1397f28: 0000000000000000 ffff8d25b1397f40 
    ffff8d25b1397f38: ffffffff864939f6 
 #8 [ffff8d25b1397f38] sys_clone at ffffffff864939f6
    ffff8d25b1397f40: 0000000000000000 ffffffff86b26b64 
 #9 [ffff8d25b1397f48] stub_clone at ffffffff86b26b64
    RIP: 00007f087d377612  RSP: 00007f08325c9690  RFLAGS: 00000246
    RAX: 0000000000000038  RBX: 00007f08325c9690  RCX: ffffffffffffffff
    RDX: 0000000000000000  RSI: 0000000000000000  RDI: 0000000001200011
    RBP: 00007f08325c96d0   R8: 00007f08325f2700   R9: 0000000000006a41
    R10: 00007f08325f29d0  R11: 0000000000000246  R12: 0000000000000000
    R13: 0000000000000000  R14: 0000000000000000  R15: 00007f08325f2700
    ORIG_RAX: 0000000000000038  CS: 0033  SS: 002b


crash> edis -lrn ffffffff86491750 
0xffffffff864913b0 <dup_mm>:	nopl   0x0(%rax,%rax,1) [FTRACE NOP]
0xffffffff864913b5 <dup_mm+0x5>:	push   %rbp    ; 0xffff8d25b1397eb8
...
0xffffffff864914e2 <dup_mm+0x132>:	lea    0x78(%r12),%rax
0xffffffff864914e7 <dup_mm+0x137>:	mov    %rax,%rdi
0xffffffff864914ea <dup_mm+0x13a>:	mov    %rax,-0x78(%rbp)    ; 0xffff8d2636653ef8
0xffffffff864914ee <dup_mm+0x13e>:	callq  0xffffffff86b19260 <down_write>
...

0xffffffff8649174b <dup_mm+0x39b>:	callq  0xffffffff865c9690 <copy_page_range>
0xffffffff86491750 <dup_mm+0x3a0>:	mov    0x90(%r15),%rdx

Now we can tell that this process 27309 was holding the semaphore and was doing other operation.



Leave a Reply

Please log in using one of these methods to post your comment:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s

This site uses Akismet to reduce spam. Learn how your comment data is processed.

%d bloggers like this: