Situation: an MPI application (tens of processes) seems to freeze after completion. Once it freezes, I can’t attach to any of its processes via gdb -p (that freezes too). I can’t SIGTERM or even SIGKILL the processes. ps aux (my standard set of ps flags) shows the processes in either an Is state or a Ds state.

Apparently, D processes are waiting for I/O in an uninterruptible state, while I processes are idle and in an interruptible state. And if I tried to poke around in /proc/<pid> too much, even procfs would freeze.
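
A process’s state can also be read straight out of procfs, which is handy when even running ps feels risky. A minimal check, using a placeholder <pid>:

$ grep State /proc/<pid>/status
# prints e.g. "State:  D (disk sleep)" for an uninterruptible process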

wchan

The first useful thing I learned was the wchan column in ps. wchan means wait channel - it names the kernel function (or event, or whatever the definition of a channel is) that a process is waiting on.

$ ps -eo ppid,pid,user,stat,pcpu,comm,wchan -p 21061 | grep stoch
  21028   21060 ankushj  Zs    0.3 stoch <defunct> -
  21028   21061 ankushj  Is    0.4 stochastic_subg ptlrpc_set_wait
  21028   21062 ankushj  Zs    0.2 stoch <defunct> -
  21028   21068 ankushj  Zs    0.2 stoch <defunct> -
  21028   21069 ankushj  Ds    0.4 stochastic_subg rwsem_down_write_slowpath
  21028   21070 ankushj  Ds    0.4 stochastic_subg rwsem_down_write_slowpath
  21028   21071 ankushj  Zs    0.4 stoch <defunct> -
  21028   21072 ankushj  Zs    0.4 stoch <defunct> -
  ...
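
The same wait channel is exposed per process at /proc/<pid>/wchan, so a single PID can be checked without another round of ps:

$ cat /proc/<pid>/wchan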

Bingo - we have some function names to blame. And apparently we can get more context around these without attaching a debugger, which we can’t do anyway.

We can cat /proc/<pid>/stack to get a kernel stack trace from a process without a debugger. Thankfully this works.

$ sudo cat /proc/21061/stack

[<0>] ptlrpc_set_wait+0x5e8/0x730 [ptlrpc]
[<0>] ptlrpc_queue_wait+0x88/0x230 [ptlrpc]
[<0>] ldlm_cli_enqueue+0x436/0x990 [ptlrpc]
[<0>] mdc_enqueue_base+0x2f2/0x1c90 [mdc]
[<0>] mdc_intent_lock+0x212/0x530 [mdc]
[<0>] lmv_intent_lock+0x385/0x16b0 [lmv]
[<0>] ll_lookup_it+0x7ad/0x20e0 [lustre]
[<0>] ll_atomic_open+0x198/0xff0 [lustre]
[<0>] lookup_open+0x364/0x6e0
[<0>] do_last+0x2cb/0x900
[<0>] path_openat+0x8d/0x290
[<0>] do_filp_open+0x91/0x100
[<0>] do_sys_open+0x17e/0x290
[<0>] __x64_sys_openat+0x20/0x30
[<0>] do_syscall_64+0x57/0x190
[<0>] entry_SYSCALL_64_after_hwframe+0x44/0xa9
$ sudo cat /proc/21065/stack

[<0>] rwsem_down_write_slowpath+0x244/0x4d0
[<0>] do_last+0x2b1/0x900
[<0>] path_openat+0x8d/0x290
[<0>] do_filp_open+0x91/0x100
[<0>] do_sys_open+0x17e/0x290
[<0>] __x64_sys_openat+0x20/0x30
[<0>] do_syscall_64+0x57/0x190
[<0>] entry_SYSCALL_64_after_hwframe+0x44/0xa9

This is beautiful! Instead of gaslighting ourselves and slipping into an existential crisis, we can point our fingers elsewhere.
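
Since several ranks are stuck, it’s worth grabbing these traces in bulk. A rough sketch that dumps the kernel stack of every D-state process (the ps/awk filter is just one way to pick them out):

$ ps -eo pid=,stat= | awk '$2 ~ /^D/ {print $1}' | while read pid; do echo "=== $pid ==="; sudo cat /proc/$pid/stack; done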

Conclusion

Both these stack traces indicate a process stuck in a file I/O call - an openat(), going by the syscall frames. The first trace is more specific: the ptlrpc/mdc/lustre frames tell us it is Lustre I/O. Lustre is a parallel filesystem.

Turns out that one of my Lustre OSTs was dead. That would explain this behavior - the Lustre client sits in the kernel as a module, and is blocked trying to reach that OST. Maybe there are good reasons a timeout there cannot be handled gracefully, maybe there are not. But a kernel module freezing is tough - a whole bunch of debugging options go for a toss. Thankfully some still work!
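
For what it’s worth, the client-side Lustre tools can confirm a suspicion like this - exact subcommands depend on the Lustre version, so treat these as a sketch:

$ lfs check osts   # asks the client whether each OST is active
$ lfs df -h        # per-OST usage; a dead or inactive target tends to stand out here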