Bug ID 1230062
Summary RISC-V: virtio-console sometimes gets stuck in set_termios.part.0
Classification openSUSE
Product openSUSE Tumbleweed
Version Current
Hardware RISC-V
OS Other
Status NEW
Severity Normal
Priority P5 - None
Component Kernel
Assignee kernel-bugs@opensuse.org
Reporter fvogt@suse.com
QA Contact qa-bugs@suse.de
Target Milestone ---
Found By ---
Blocker ---

Created attachment 877101 [details]
/proc/*/task/*/stack + strace of bash

openQA types in commands to the serial console, sends a newline and then waits
until the prompt appears. Running this in a loop causes it to get stuck
sometimes: https://openqa.opensuse.org/tests/4445068#step/http_srv/16

In the QEMU chardev log it appears like this:

# systemctl --no-pager is-active apache2; echo hjhkw-$?-
active
hjhkw-0-
echo Lorem ipsum dolor sit amet > /srv/www/htdocs/index.html

openQA sends 'systemctl --no-pager is-active apache2; echo hjhkw-$?-\n',
receives 'active\nhjhkw-0-\n', then sends 'echo Lorem ipsum dolor sit amet >
/srv/www/htdocs/index.html\n' and waits for '# '. That does not arrive however
and the wait times out. In most cases that's just ignored by openQA and it
moves on to the next command. When it sends that ("curl -f ..." here), the
missing part suddenly appears:

# echo Lorem ipsum dolor sit amet > /srv/www/htdocs/index.html
# curl -f http://localhost/ | grep 'Lorem ipsum dolor sit amet'; echo _Z5Xg-$?-

This issue appears to happen on riscv64 only, I have not observed it anywhere
else so far.

Thankfully the arbitrarily long delay of the prompt allows to debug this. I
created a testcase that exercises the serial terminal in a loop and so far it
always gets stuck before the 20th attempt. I logged onto the graphical console
and collected some info. The /bin/bash process on the console is stuck in the
kernel:

/proc/1624/task/1624/stack:[<0>] set_termios.part.0+0x130/0x206
/proc/1624/task/1624/stack:[<0>] tty_mode_ioctl+0x46e/0x576
/proc/1624/task/1624/stack:[<0>] n_tty_ioctl_helper+0x2a/0xe4
/proc/1624/task/1624/stack:[<0>] n_tty_ioctl+0x9a/0x148
/proc/1624/task/1624/stack:[<0>] tty_ioctl+0x170/0x64e
/proc/1624/task/1624/stack:[<0>] __riscv_sys_ioctl+0x82/0xaa
/proc/1624/task/1624/stack:[<0>] do_trap_ecall_u+0x200/0x212
/proc/1624/task/1624/stack:[<0>] handle_exception+0xde/0xde

FWICT this is inside the "retry_write_wait" loop:
https://github.com/torvalds/linux/blob/67784a74e258a467225f0e68335df77acd67b7ab/drivers/tty/tty_ioctl.c#L487

No other process trace has any related tty call pending. When I attached to the
process with strace, it immediately continued and the missing output appeared
on the serial console. /proc/*/task/*/stack and strace log are attached.

The first syscall that strace reports is rt_sigprocmask(SIG_BLOCK, ...), which
usually follows ioctl(0, TCSETSW, ...), so it was probably stuck in that ioctl.

As it helps to provide more input or attach with strace, it appears like some
wakeup signal is missing or got lost that would notify the wait that the write
has completed.


You are receiving this mail because: