Now we are ready to deep dive into vmcore analysis to confirm what went wrong in the system. In general, first thing I’m checking is the system’s log.
crash> log .... .... end_request: I/O error, dev sdajm, sector 0 end_request: I/O error, dev sdajm, sector 8 end_request: I/O error, dev sdajm, sector 0 NMI Watchdog detected LOCKUP on CPU 2 CPU 2 Modules linked in: mptctl(U) ipmi_watchdog ipmi_devintf ipmi_si ipmi_msghandler oracleasm(U) autofs4 hidp ocfs2(U) nfs fscache nfs_acl rfcomm l2cap bluetooth ocfs2_dlmfs(U) ocfs2_dlm(U) ocfs2_nodemanager(U) configfs lockd sunrpc cpufreq_ondemand powernow_k8 freq_table bonding ipv6 xfrm_nalgo crypto_api be2iscsi(U) ib_iser rdma_cm ib_cm iw_cm ib_sa ib_mad ib_addr iscsi_tcp cxgb3i iw_cxgb3 ib_core cxgb3 8021q libiscsi_tcp bnx2i(U) libiscsi2 scsi_transport_iscsi2 scsi_transport_iscsi cnic(U) emcpvlumd(PU) emcpxcrypt(PU) emcpdm(PU) emcpgpx(PU) emcpmpx(PU) emcp(PU) dm_multipath scsi_dh video backlight sbs power_meter i2c_ec i2c_core dell_wmi wmi button battery asus_acpi acpi_memhotplug ac parport_pc lp parport sg bnx2(U) tg3(U) k8_edac k8temp hpilo hwmon serio_raw shpchp edac_mc pcspkr dm_raid45 dm_message dm_region_hash dm_mem_cache dm_snapshot dm_zero dm_mirror dm_log dm_mod mptspi(U) scsi_transport_spi mptsas(U) mptscsih(U) scsi_transport_sas usb_storage qla2xxx scsi_transport_fc cciss(U) ext3 jbd uhci_hcd ohci_hcd ehci_hcd hpahcisr(PU) sd_mod scsi_mod mptbase(U) Pid: 5543, comm: ocfs2dc Tainted: P 2.6.18-194.el5 #1 RIP: 0010:[] [] .text.lock.spinlock+0x5/0x30 RSP: 0000:ffff81010fcebbe8 EFLAGS: 00000082 RAX: 0000000000000005 RBX: ffff81045b1a9000 RCX: 0000000000000000 RDX: 0000000000000003 RSI: ffff810f805b58c8 RDI: ffff81045b1a9670 RBP: 0000000000000003 R08: ffff81010535f7e0 R09: 0000000000000000 R10: ffff810bb35dc300 R11: ffffffff8004533c R12: ffff810f805b58d8 R13: 0000000000000246 R14: ffff81045b1a9000 R15: ffff8107d6c85448 FS: 00002b037b8b8780(0000) GS:ffff81088e0f44c0(0000) knlGS:00000000f7aefb90 CS: 0010 DS: 0018 ES: 0018 CR0: 000000008005003b CR2: 00000000bca34004 CR3: 00000007cbae1000 CR4: 00000000000006e0 Process ocfs2dc (pid: 5543, threadinfo ffff810c72776000, task ffff81087e14b0c0) Stack: ffffffff88c4e2b7 ffff810f805b58c8 ffffffff88c4e928 0000000000000003 ffff810f805b5d20 0000000000000000 ffff810f805b58c0 ffff81045b1a9000 ffffffff88c5690b 0000000000000000 ffff8108d31acc00 0000000000008000 Call Trace: [] :ocfs2:ocfs2_wake_downconvert_thread+0x10/0x3d [] :ocfs2:ocfs2_cluster_unlock+0x2ff/0x30d [] :ocfs2:ocfs2_rw_unlock+0x127/0x197 [] dio_complete+0x62/0x99 [] dio_bio_end_aio+0x8e/0xbf [] :dm_mod:clone_endio+0xbf/0xce [] :emcp:PowerPlatformTopIodone+0x74/0x210 [] :emcp:PowerTopIodone+0x22/0x150 [] :emcp:PowerProcessTopIodonePirps+0x4e/0xa0 [] :emcp:PowerBottomIodoneNew+0x2a3/0x4d0 [] __end_that_request_first+0x23c/0x5bf [] __activate_task+0x56/0x6d [] :scsi_mod:scsi_end_request+0x27/0xcd [] :scsi_mod:scsi_io_completion+0x14e/0x324 [] :scsi_mod:scsi_delete_timer+0x12/0x59 [] :sd_mod:sd_rw_intr+0x252/0x28c [] :scsi_mod:scsi_device_unbusy+0x67/0x81 [] blk_done_softirq+0x5f/0x6d [] __do_softirq+0x89/0x133 [] call_softirq+0x1c/0x28 [] do_softirq+0x2c/0x85 [] do_IRQ+0xec/0xf5 [] ret_from_intr+0x0/0xa [] :ocfs2:ocfs2_downconvert_thread+0xbf5/0xcdd [] autoremove_wake_function+0x0/0x2e [] :ocfs2:ocfs2_downconvert_thread+0x0/0xcdd [] keventd_create_kthread+0x0/0xc4 [] kthread+0xfe/0x132 [] child_rip+0xa/0x11 [] keventd_create_kthread+0x0/0xc4 [] pci_mmcfg_read+0x0/0x92 [] kthread+0x0/0x132 [] child_rip+0x0/0x11 Code: 7e f9 e9 7f fe ff ff f3 90 83 3f 00 7e f9 e9 f9 fe ff ff f3
http://pagead2.googlesyndication.com/pagead/show_ads.js
In this case, it shows watchdog error messages just after some IO related error messages. To check more clear evident at the time of crash, you might want to check backtrace of the process that was involved in the crash. You can get details with ‘bt’ command.
crash> bt PID: 5543 TASK: ffff81087e14b0c0 CPU: 2 COMMAND: "ocfs2dc" #0 [ffff810c8006cdc0] crash_kexec at ffffffff800aeb6b #1 [ffff810c8006ce80] die_nmi at ffffffff800662a5 #2 [ffff810c8006cea0] nmi_watchdog_tick at ffffffff80066a0b #3 [ffff810c8006cef0] default_do_nmi at ffffffff80066629 #4 [ffff810c8006cf40] do_nmi at ffffffff80066896 #5 [ffff810c8006cf50] nmi at ffffffff80065eef [exception RIP: .text.lock.spinlock+5] RIP: ffffffff80065bff RSP: ffff81010fcebbe8 RFLAGS: 00000082 RAX: 0000000000000005 RBX: ffff81045b1a9000 RCX: 0000000000000000 RDX: 0000000000000003 RSI: ffff810f805b58c8 RDI: ffff81045b1a9670 RBP: 0000000000000003 R8: ffff81010535f7e0 R9: 0000000000000000 R10: ffff810bb35dc300 R11: ffffffff8004533c R12: ffff810f805b58d8 R13: 0000000000000246 R14: ffff81045b1a9000 R15: ffff8107d6c85448 ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0000 --- --- #6 [ffff81010fcebbe8] .text.lock.spinlock at ffffffff80065bff (via _spin_lock) #7 [ffff81010fcebbe8] ocfs2_wake_downconvert_thread at ffffffff88c4e2b7 [ocfs2] #8 [ffff81010fcebbf8] ocfs2_cluster_unlock at ffffffff88c4e928 [ocfs2] ...
So, from above, we can see that it was in ‘.text.lock.spinlock’ which smells some lock contention issue.
If you want to check other current processes in the other CPUs, you can use ‘bt -a’. With this you can check what kinds of lock contention had happened.
crash> bt -a PID: 0 TASK: ffffffff80309b60 CPU: 0 COMMAND: "swapper" #0 [ffffffff8044bf20] crash_nmi_callback at ffffffff8007bf44 #1 [ffffffff8044bf40] do_nmi at ffffffff8006688a #2 [ffffffff8044bf50] nmi at ffffffff80065eef [exception RIP: default_idle+41] ...
If you just want to check what processes are on each CPUs, you can use below command.
crash> crash> bt -a | grep CPU PID: 0 TASK: ffffffff80309b60 CPU: 0 COMMAND: "swapper" PID: 5544 TASK: ffff810c7fce8080 CPU: 1 COMMAND: "dlm_thread" PID: 5543 TASK: ffff81087e14b0c0 CPU: 2 COMMAND: "ocfs2dc" PID: 0 TASK: ffff81107ffc97e0 CPU: 3 COMMAND: "swapper" PID: 0 TASK: ffff81107ffbf0c0 CPU: 4 COMMAND: "swapper" PID: 0 TASK: ffff81048e1c30c0 CPU: 5 COMMAND: "swapper" PID: 24960 TASK: ffff8102088b2080 CPU: 6 COMMAND: "oracle" PID: 0 TASK: ffff81048e1c4100 CPU: 7 COMMAND: "swapper"
Leave a Reply