Sungju's Slow Life

Personal journal


Jump into vmcore analysis – Step 3

Now we are ready to deep dive into vmcore analysis to confirm what went wrong in the system. In general, first thing I’m checking is the system’s log.

crash> log
....
....
end_request: I/O error, dev sdajm, sector 0
end_request: I/O error, dev sdajm, sector 8
end_request: I/O error, dev sdajm, sector 0
NMI Watchdog detected LOCKUP on CPU 2
CPU 2 
Modules linked in: mptctl(U) ipmi_watchdog ipmi_devintf ipmi_si ipmi_msghandler oracleasm(U) autofs4 hidp ocfs2(U) nfs fscache nfs_acl rfcomm l2cap bluetooth ocfs2_dlmfs(U) ocfs2_dlm(U) ocfs2_nodemanager(U) configfs lockd sunrpc cpufreq_ondemand powernow_k8 freq_table bonding ipv6 xfrm_nalgo crypto_api be2iscsi(U) ib_iser rdma_cm ib_cm iw_cm ib_sa ib_mad ib_addr iscsi_tcp cxgb3i iw_cxgb3 ib_core cxgb3 8021q libiscsi_tcp bnx2i(U) libiscsi2 scsi_transport_iscsi2 scsi_transport_iscsi cnic(U) emcpvlumd(PU) emcpxcrypt(PU) emcpdm(PU) emcpgpx(PU) emcpmpx(PU) emcp(PU) dm_multipath scsi_dh video backlight sbs power_meter i2c_ec i2c_core dell_wmi wmi button battery asus_acpi acpi_memhotplug ac parport_pc lp parport sg bnx2(U) tg3(U) k8_edac k8temp hpilo hwmon serio_raw shpchp edac_mc pcspkr dm_raid45 dm_message dm_region_hash dm_mem_cache dm_snapshot dm_zero dm_mirror dm_log dm_mod mptspi(U) scsi_transport_spi mptsas(U) mptscsih(U) scsi_transport_sas usb_storage qla2xxx scsi_transport_fc cciss(U) ext3 jbd uhci_hcd ohci_hcd ehci_hcd hpahcisr(PU) sd_mod scsi_mod mptbase(U)
Pid: 5543, comm: ocfs2dc Tainted: P      2.6.18-194.el5 #1
RIP: 0010:[]  [] .text.lock.spinlock+0x5/0x30
RSP: 0000:ffff81010fcebbe8  EFLAGS: 00000082
RAX: 0000000000000005 RBX: ffff81045b1a9000 RCX: 0000000000000000
RDX: 0000000000000003 RSI: ffff810f805b58c8 RDI: ffff81045b1a9670
RBP: 0000000000000003 R08: ffff81010535f7e0 R09: 0000000000000000
R10: ffff810bb35dc300 R11: ffffffff8004533c R12: ffff810f805b58d8
R13: 0000000000000246 R14: ffff81045b1a9000 R15: ffff8107d6c85448
FS:  00002b037b8b8780(0000) GS:ffff81088e0f44c0(0000) knlGS:00000000f7aefb90
CS:  0010 DS: 0018 ES: 0018 CR0: 000000008005003b
CR2: 00000000bca34004 CR3: 00000007cbae1000 CR4: 00000000000006e0
Process ocfs2dc (pid: 5543, threadinfo ffff810c72776000, task ffff81087e14b0c0)
Stack:  ffffffff88c4e2b7 ffff810f805b58c8 ffffffff88c4e928 0000000000000003
 ffff810f805b5d20 0000000000000000 ffff810f805b58c0 ffff81045b1a9000
 ffffffff88c5690b 0000000000000000 ffff8108d31acc00 0000000000008000
Call Trace:
   [] :ocfs2:ocfs2_wake_downconvert_thread+0x10/0x3d
 [] :ocfs2:ocfs2_cluster_unlock+0x2ff/0x30d
 [] :ocfs2:ocfs2_rw_unlock+0x127/0x197
 [] dio_complete+0x62/0x99
 [] dio_bio_end_aio+0x8e/0xbf
 [] :dm_mod:clone_endio+0xbf/0xce
 [] :emcp:PowerPlatformTopIodone+0x74/0x210
 [] :emcp:PowerTopIodone+0x22/0x150
 [] :emcp:PowerProcessTopIodonePirps+0x4e/0xa0
 [] :emcp:PowerBottomIodoneNew+0x2a3/0x4d0
 [] __end_that_request_first+0x23c/0x5bf
 [] __activate_task+0x56/0x6d
 [] :scsi_mod:scsi_end_request+0x27/0xcd
 [] :scsi_mod:scsi_io_completion+0x14e/0x324
 [] :scsi_mod:scsi_delete_timer+0x12/0x59
 [] :sd_mod:sd_rw_intr+0x252/0x28c
 [] :scsi_mod:scsi_device_unbusy+0x67/0x81
 [] blk_done_softirq+0x5f/0x6d
 [] __do_softirq+0x89/0x133
 [] call_softirq+0x1c/0x28
 [] do_softirq+0x2c/0x85
 [] do_IRQ+0xec/0xf5
 [] ret_from_intr+0x0/0xa
   [] :ocfs2:ocfs2_downconvert_thread+0xbf5/0xcdd
 [] autoremove_wake_function+0x0/0x2e
 [] :ocfs2:ocfs2_downconvert_thread+0x0/0xcdd
 [] keventd_create_kthread+0x0/0xc4
 [] kthread+0xfe/0x132
 [] child_rip+0xa/0x11
 [] keventd_create_kthread+0x0/0xc4
 [] pci_mmcfg_read+0x0/0x92
 [] kthread+0x0/0x132
 [] child_rip+0x0/0x11


Code: 7e f9 e9 7f fe ff ff f3 90 83 3f 00 7e f9 e9 f9 fe ff ff f3 

http://pagead2.googlesyndication.com/pagead/show_ads.js

In this case, it shows watchdog error messages just after some IO related error messages. To check more clear evident at the time of crash, you might want to check backtrace of the process that was involved in the crash. You can get details with ‘bt’ command.

crash> bt
PID: 5543   TASK: ffff81087e14b0c0  CPU: 2   COMMAND: "ocfs2dc"
 #0 [ffff810c8006cdc0] crash_kexec at ffffffff800aeb6b
 #1 [ffff810c8006ce80] die_nmi at ffffffff800662a5
 #2 [ffff810c8006cea0] nmi_watchdog_tick at ffffffff80066a0b
 #3 [ffff810c8006cef0] default_do_nmi at ffffffff80066629
 #4 [ffff810c8006cf40] do_nmi at ffffffff80066896
 #5 [ffff810c8006cf50] nmi at ffffffff80065eef
    [exception RIP: .text.lock.spinlock+5]
    RIP: ffffffff80065bff  RSP: ffff81010fcebbe8  RFLAGS: 00000082
    RAX: 0000000000000005  RBX: ffff81045b1a9000  RCX: 0000000000000000
    RDX: 0000000000000003  RSI: ffff810f805b58c8  RDI: ffff81045b1a9670
    RBP: 0000000000000003   R8: ffff81010535f7e0   R9: 0000000000000000
    R10: ffff810bb35dc300  R11: ffffffff8004533c  R12: ffff810f805b58d8
    R13: 0000000000000246  R14: ffff81045b1a9000  R15: ffff8107d6c85448
    ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0000
---  ---
 #6 [ffff81010fcebbe8] .text.lock.spinlock at ffffffff80065bff (via _spin_lock)
 #7 [ffff81010fcebbe8] ocfs2_wake_downconvert_thread at ffffffff88c4e2b7 [ocfs2]
 #8 [ffff81010fcebbf8] ocfs2_cluster_unlock at ffffffff88c4e928 [ocfs2]
...

So, from above, we can see that it was in ‘.text.lock.spinlock’ which smells some lock contention issue.

If you want to check other current processes in the other CPUs, you can use ‘bt -a’. With this you can check what kinds of lock contention had happened.

crash> bt -a
PID: 0      TASK: ffffffff80309b60  CPU: 0   COMMAND: "swapper"
 #0 [ffffffff8044bf20] crash_nmi_callback at ffffffff8007bf44
 #1 [ffffffff8044bf40] do_nmi at ffffffff8006688a
 #2 [ffffffff8044bf50] nmi at ffffffff80065eef
    [exception RIP: default_idle+41]
...

If you just want to check what processes are on each CPUs, you can use below command.

crash> 
crash> bt -a | grep CPU
PID: 0      TASK: ffffffff80309b60  CPU: 0   COMMAND: "swapper"
PID: 5544   TASK: ffff810c7fce8080  CPU: 1   COMMAND: "dlm_thread"
PID: 5543   TASK: ffff81087e14b0c0  CPU: 2   COMMAND: "ocfs2dc"
PID: 0      TASK: ffff81107ffc97e0  CPU: 3   COMMAND: "swapper"
PID: 0      TASK: ffff81107ffbf0c0  CPU: 4   COMMAND: "swapper"
PID: 0      TASK: ffff81048e1c30c0  CPU: 5   COMMAND: "swapper"
PID: 24960  TASK: ffff8102088b2080  CPU: 6   COMMAND: "oracle"
PID: 0      TASK: ffff81048e1c4100  CPU: 7   COMMAND: "swapper"


Leave a Reply

Please log in using one of these methods to post your comment:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s

This site uses Akismet to reduce spam. Learn how your comment data is processed.

%d bloggers like this: