Mailinglist Archive: opensuse-bugs (4669 mails)

< Previous Next >
[Bug 636175] New: Dom0 'looses' BIOs
  • From: bugzilla_noreply@xxxxxxxxxx
  • Date: Wed, 1 Sep 2010 10:13:47 +0000
  • Message-id: <bug-636175-21960@xxxxxxxxxxxxxxxxxxxxxxxx/>

https://bugzilla.novell.com/show_bug.cgi?id=636175

https://bugzilla.novell.com/show_bug.cgi?id=636175#c0


Summary: Dom0 'looses' BIOs
Classification: openSUSE
Product: openSUSE 11.3
Version: Final
Platform: x86-64
OS/Version: openSUSE 11.3
Status: NEW
Severity: Critical
Priority: P5 - None
Component: Kernel
AssignedTo: kernel-maintainers@xxxxxxxxxxxxxxxxxxxxxx
ReportedBy: novell-bugzilla-6mgxae@xxxxxxxxx
QAContact: qa@xxxxxxx
Found By: ---
Blocker: ---


User-Agent: Mozilla/5.0 (X11; U; Linux x86_64; de; rv:1.9.2.8)
Gecko/20100723 Ubuntu/10.04 (lucid) Firefox/3.6.8

DOM0 Setup: blkback -> DRBD (local primary, no remote) -> LVM2 -> MD RAID1 ->
SATA

Symptom: the md raid1 device hangs during raid resync (resync hangs, accesses
to the md raid1 device are hanging, accesses to the underlying SATA devices are
ok). There is a deadlock in the *_barrier functions of raid1.c. The resync
process is waiting for a pending request to finish (but which either never
finishes or at least 'forgets' to decrease the pending count related to the
resync barrier handling in raid1.c. While the resync process waits for pending
regular I/O to complete, it has already risen the resync barrier and all
further normal I/O is therefore waiting for the resync op to lower its barrier.
(see call trace below)

The bug has been tested and verified on totally different x86-64 platforms (AMD
Opterion 1214HE + MCP55 chipset, Intel Core2Duo Notebook ICH9M Chipset), so it
is unlikely to be a hardware issue.

It has been verified using OpenSUSE 11.2 (2.6.31.12-0.2-xen) and 11.3 (2.6.34)
dom0 kernels, running on xen hypervisor 3.4.1, 3.4.2 and 3.4.3.

I could not reproduce the bug with kernels without any xen dom0 patches.

The situation seems to occur preferably when crashing the hardware node and the
VMs therefore start a file system journal replay (ext3). That is also the
potential reason why I could not reproduce the bug with regular kernels (ie.
without xen patches) -- I have no definitive clue whether this is a
xen-specific problem.


Also reported at http://bugzilla.xensource.com/bugzilla/show_bug.cgi?id=1659.

Also effects OpenSUSE 11.2.




[ 603.229215] INFO: task md1_resync:1441 blocked for more than 120 seconds.
[ 603.229294] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this
message.
[ 603.229413] md1_resync D 0000000000000000 0 1441 2 0x00000000
[ 603.229505] ffff88003d967bb0 0000000000000246 ffff88003d967b10
ffff88003d967b30
[ 603.229627] 0000000000000000 ffff88003d967b78 000000000000a380
ffff88003dba8be8
[ 603.229753] 000000000000a380 000000000000a380 000000000000a380
0000000000007d00
[ 603.229880] Call Trace:
[ 603.229964] [<ffffffffa002c63e>] raise_barrier+0xde/0x2e0 [raid1]
[ 603.230037] [<ffffffffa002d5cb>] sync_request+0x12b/0x680 [raid1]
[ 603.230112] [<ffffffff80399de9>] md_do_sync+0x669/0xc40
[ 603.230180] [<ffffffff8039ac54>] md_thread+0x54/0x150
[ 603.230249] [<ffffffff8006fac6>] kthread+0xb6/0xc0
[ 603.230318] [<ffffffff8000d38a>] child_rip+0xa/0x20
[ 603.230401] INFO: task python:5365 blocked for more than 120 seconds.
[ 603.230467] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this
message.
[ 603.230584] python D 00000000c7c9e55f 0 5365 5348 0x00000000
[ 603.230657] ffff8800382595b8 0000000000000282 ffff880038259518
ffff880038259538
[ 603.230783] ffff8800382594e8 ffff880038259580 000000000000a380
ffff8800381e88e8
[ 603.230909] 000000000000a380 000000000000a380 000000000000a380
0000000000007d00
[ 603.231035] Call Trace:
[ 603.231098] [<ffffffffa002c99d>] wait_barrier+0x15d/0x1f0 [raid1]
[ 603.231169] [<ffffffffa002fef8>] make_request+0x58/0x690 [raid1]
[ 603.231239] [<ffffffff80399498>] md_make_request+0xc8/0x140
[ 603.231309] [<ffffffff802224db>] generic_make_request+0x19b/0x4c0
[ 603.231380] [<ffffffff8022287d>] submit_bio+0x7d/0x110
[ 603.231448] [<ffffffff80153535>] mpage_bio_submit+0x35/0x50
[ 603.231517] [<ffffffff80153aa3>] do_mpage_readpage+0x383/0x710
[ 603.231595] [<ffffffff80153fb3>] mpage_readpages+0xf3/0x150
[ 603.231664] [<ffffffff801b8ccb>] ext2_readpages+0x2b/0x50
[ 603.231733] [<ffffffff800e0353>] read_pages+0x43/0x110
[ 603.231801] [<ffffffff800e05ac>] __do_page_cache_readahead+0x18c/0x1b0
[ 603.231871] [<ffffffff800e05ff>] ra_submit+0x2f/0x50
[ 603.231936] [<ffffffff800e087d>] ondemand_readahead+0x11d/0x260
[ 603.232005] [<ffffffff800e0a60>] page_cache_async_readahead+0xa0/0xc0
[ 603.232082] [<ffffffff800d7311>] T.731+0x1f1/0x440
[ 603.232149] [<ffffffff800d7626>] generic_file_aio_read+0xc6/0x1f0
[ 603.232218] [<ffffffff80118da2>] do_sync_read+0x102/0x160
[ 603.232286] [<ffffffff801192d5>] vfs_read+0xd5/0x1c0
[ 603.232352] [<ffffffff801199fb>] sys_read+0x5b/0xa0
[ 603.233212] [<ffffffff8000c868>] system_call_fastpath+0x16/0x1b
[ 603.233279] [<00007fab60821a90>] 0x7fab60821a90
[ 603.233340] INFO: task blkid:5393 blocked for more than 120 seconds.
[ 603.233402] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this
message.
[ 603.233508] blkid D 0000000000000001 0 5393 1 0x00000000
[ 603.233573] ffff8800406d96e8 0000000000000282 ffff8800406d9648
ffff8800406d9668
[ 603.233685] ffff8800406d9618 ffff8800406d96b0 000000000000a380
ffff8800382fe4a8
[ 603.233798] 000000000000a380 000000000000a380 000000000000a380
0000000000007d00
[ 603.233914] Call Trace:
[ 603.233969] [<ffffffffa002c99d>] wait_barrier+0x15d/0x1f0 [raid1]
[ 603.234037] [<ffffffffa002fef8>] make_request+0x58/0x690 [raid1]
[ 603.234106] [<ffffffff80399498>] md_make_request+0xc8/0x140
[ 603.234176] [<ffffffff802224db>] generic_make_request+0x19b/0x4c0
[ 603.234244] [<ffffffff8022287d>] submit_bio+0x7d/0x110
[ 603.234250] [<ffffffff80147c12>] submit_bh+0x102/0x150
[ 603.234257] [<ffffffff8014adac>] block_read_full_page+0x23c/0x3b0
[ 603.234262] [<ffffffff801509b6>] blkdev_readpage+0x26/0x50
[ 603.234268] [<ffffffff800e03f6>] read_pages+0xe6/0x110
[ 603.234273] [<ffffffff800e05ac>] __do_page_cache_readahead+0x18c/0x1b0
[ 603.234278] [<ffffffff800e0839>] ondemand_readahead+0xd9/0x260
[ 603.234284] [<ffffffff800e0aad>] page_cache_sync_readahead+0x2d/0x50
[ 603.234288] [<ffffffff800d73d6>] T.731+0x2b6/0x440
[ 603.234293] [<ffffffff800d7626>] generic_file_aio_read+0xc6/0x1f0
[ 603.234300] [<ffffffff80118da2>] do_sync_read+0x102/0x160
[ 603.234305] [<ffffffff801192d5>] vfs_read+0xd5/0x1c0
[ 603.234310] [<ffffffff801199fb>] sys_read+0x5b/0xa0
[ 603.234315] [<ffffffff8000c868>] system_call_fastpath+0x16/0x1b
[ 603.234320] [<00007fbbf8f1ea90>] 0x7fbbf8f1ea90
[ 603.234323] INFO: task blkid:5397 blocked for more than 120 seconds.
[ 603.234324] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this
message.
[ 603.234326] blkid D 0000000098d5b5f5 0 5397 1 0x00000000
[ 603.234330] ffff8800381656c8 0000000000000286 ffff880038165628
ffff880038165648
[ 603.234333] 0000000000000000 ffff880038165690 000000000000a380
ffff8800382f8be8
[ 603.234336] 000000000000a380 000000000000a380 000000000000a380
0000000000007d00
[ 603.234339] Call Trace:
[ 603.234345] [<ffffffffa002c99d>] wait_barrier+0x15d/0x1f0 [raid1]
[ 603.234351] [<ffffffffa002fef8>] make_request+0x58/0x690 [raid1]
[ 603.234358] [<ffffffff80399498>] md_make_request+0xc8/0x140
[ 603.234363] [<ffffffff802224db>] generic_make_request+0x19b/0x4c0
[ 603.234369] [<ffffffff8022287d>] submit_bio+0x7d/0x110
[ 603.234373] [<ffffffff80147c12>] submit_bh+0x102/0x150
[ 603.234379] [<ffffffff8014adac>] block_read_full_page+0x23c/0x3b0
[ 603.234383] [<ffffffff801509b6>] blkdev_readpage+0x26/0x50
[ 603.234388] [<ffffffff800e03f6>] read_pages+0xe6/0x110
[ 603.234393] [<ffffffff800e05ac>] __do_page_cache_readahead+0x18c/0x1b0
[ 603.234398] [<ffffffff800e05ff>] ra_submit+0x2f/0x50
[ 603.234403] [<ffffffff800e087d>] ondemand_readahead+0x11d/0x260
[ 603.234408] [<ffffffff800e0aad>] page_cache_sync_readahead+0x2d/0x50
[ 603.234412] [<ffffffff800d73d6>] T.731+0x2b6/0x440
[ 603.234417] [<ffffffff800d7626>] generic_file_aio_read+0xc6/0x1f0
[ 603.234422] [<ffffffff80118da2>] do_sync_read+0x102/0x160
[ 603.234427] [<ffffffff801192d5>] vfs_read+0xd5/0x1c0
[ 603.234432] [<ffffffff801199fb>] sys_read+0x5b/0xa0
[ 603.234437] [<ffffffff8000c868>] system_call_fastpath+0x16/0x1b
[ 603.234442] [<00007fb60ac89a90>] 0x7fb60ac89a90
[ 723.225805] INFO: task md1_resync:1441 blocked for more than 120 seconds.
[ 723.225892] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this
message.
[ 723.226010] md1_resync D 0000000000000000 0 1441 2 0x00000000
[ 723.226108] ffff88003d967bb0 0000000000000246 ffff88003d967b10
ffff88003d967b30
[ 723.226239] 0000000000000000 ffff88003d967b78 000000000000a380
ffff88003dba8be8
[ 723.226372] 000000000000a380 000000000000a380 000000000000a380
0000000000007d00
[ 723.226504] Call Trace:
[ 723.226600] [<ffffffffa002c63e>] raise_barrier+0xde/0x2e0 [raid1]
[ 723.226684] [<ffffffffa002d5cb>] sync_request+0x12b/0x680 [raid1]
[ 723.226766] [<ffffffff80399de9>] md_do_sync+0x669/0xc40
[ 723.226841] [<ffffffff8039ac54>] md_thread+0x54/0x150
[ 723.226913] [<ffffffff8006fac6>] kthread+0xb6/0xc0
[ 723.226987] [<ffffffff8000d38a>] child_rip+0xa/0x20
[ 723.227078] INFO: task python:5365 blocked for more than 120 seconds.
[ 723.227147] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this
message.
[ 723.227265] python D 00000000c7c9e55f 0 5365 5348 0x00000000
[ 723.227346] ffff8800382595b8 0000000000000282 ffff880038259518
ffff880038259538
[ 723.227482] ffff8800382594e8 ffff880038259580 000000000000a380
ffff8800381e88e8
[ 723.227616] 000000000000a380 000000000000a380 000000000000a380
0000000000007d00
[ 723.227750] Call Trace:
[ 723.227820] [<ffffffffa002c99d>] wait_barrier+0x15d/0x1f0 [raid1]
[ 723.227904] [<ffffffffa002fef8>] make_request+0x58/0x690 [raid1]
[ 723.227983] [<ffffffff80399498>] md_make_request+0xc8/0x140
[ 723.228059] [<ffffffff802224db>] generic_make_request+0x19b/0x4c0
[ 723.228135] [<ffffffff8022287d>] submit_bio+0x7d/0x110
[ 723.228209] [<ffffffff80153535>] mpage_bio_submit+0x35/0x50
[ 723.228284] [<ffffffff80153aa3>] do_mpage_readpage+0x383/0x710
[ 723.228362] [<ffffffff80153fb3>] mpage_readpages+0xf3/0x150
[ 723.228437] [<ffffffff801b8ccb>] ext2_readpages+0x2b/0x50
[ 723.228512] [<ffffffff800e0353>] read_pages+0x43/0x110
[ 723.228586] [<ffffffff800e05ac>] __do_page_cache_readahead+0x18c/0x1b0
[ 723.228667] [<ffffffff800e05ff>] ra_submit+0x2f/0x50
[ 723.228746] [<ffffffff800e087d>] ondemand_readahead+0x11d/0x260
[ 723.228828] [<ffffffff800e0a60>] page_cache_async_readahead+0xa0/0xc0
[ 723.228906] [<ffffffff800d7311>] T.731+0x1f1/0x440
[ 723.228978] [<ffffffff800d7626>] generic_file_aio_read+0xc6/0x1f0
[ 723.229059] [<ffffffff80118da2>] do_sync_read+0x102/0x160
[ 723.229133] [<ffffffff801192d5>] vfs_read+0xd5/0x1c0
[ 723.229209] [<ffffffff801199fb>] sys_read+0x5b/0xa0
[ 723.229285] [<ffffffff8000c868>] system_call_fastpath+0x16/0x1b
[ 723.229362] [<00007fab60821a90>] 0x7fab60821a90
[ 723.229431] INFO: task blkid:5393 blocked for more than 120 seconds.
[ 723.229500] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this
message.
[ 723.229619] blkid D 0000000000000001 0 5393 1 0x00000000
[ 723.229700] ffff8800406d96e8 0000000000000282 ffff8800406d9648
ffff8800406d9668
[ 723.229857] ffff8800406d9618 ffff8800406d96b0 000000000000a380
ffff8800382fe4a8
[ 723.229998] 000000000000a380 000000000000a380 000000000000a380
0000000000007d00
[ 723.230083] Call Trace:
[ 723.230097] [<ffffffffa002c99d>] wait_barrier+0x15d/0x1f0 [raid1]
[ 723.230109] [<ffffffffa002fef8>] make_request+0x58/0x690 [raid1]
[ 723.230126] [<ffffffff80399498>] md_make_request+0xc8/0x140
[ 723.230142] [<ffffffff802224db>] generic_make_request+0x19b/0x4c0
[ 723.230152] [<ffffffff8022287d>] submit_bio+0x7d/0x110
[ 723.230162] [<ffffffff80147c12>] submit_bh+0x102/0x150
[ 723.230173] [<ffffffff8014adac>] block_read_full_page+0x23c/0x3b0
[ 723.230184] [<ffffffff801509b6>] blkdev_readpage+0x26/0x50
[ 723.230193] [<ffffffff800e03f6>] read_pages+0xe6/0x110
[ 723.230203] [<ffffffff800e05ac>] __do_page_cache_readahead+0x18c/0x1b0
[ 723.230213] [<ffffffff800e0839>] ondemand_readahead+0xd9/0x260
[ 723.230223] [<ffffffff800e0aad>] page_cache_sync_readahead+0x2d/0x50
[ 723.230235] [<ffffffff800d73d6>] T.731+0x2b6/0x440
[ 723.230244] [<ffffffff800d7626>] generic_file_aio_read+0xc6/0x1f0
[ 723.230254] [<ffffffff80118da2>] do_sync_read+0x102/0x160
[ 723.230263] [<ffffffff801192d5>] vfs_read+0xd5/0x1c0
[ 723.230278] [<ffffffff801199fb>] sys_read+0x5b/0xa0
[ 723.230288] [<ffffffff8000c868>] system_call_fastpath+0x16/0x1b
[ 723.230298] [<00007fbbf8f1ea90>] 0x7fbbf8f1ea90
[ 723.230303] INFO: task blkid:5397 blocked for more than 120 seconds.
[ 723.230306] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this
message.
[ 723.230310] blkid D 0000000098d5b5f5 0 5397 1 0x00000000
[ 723.230316] ffff8800381656c8 0000000000000286 ffff880038165628
ffff880038165648
[ 723.230322] 0000000000000000 ffff880038165690 000000000000a380
ffff8800382f8be8
[ 723.230327] 000000000000a380 000000000000a380 000000000000a380
0000000000007d00
[ 723.230333] Call Trace:
[ 723.230344] [<ffffffffa002c99d>] wait_barrier+0x15d/0x1f0 [raid1]
[ 723.230355] [<ffffffffa002fef8>] make_request+0x58/0x690 [raid1]
[ 723.230381] [<ffffffff80399498>] md_make_request+0xc8/0x140
[ 723.230391] [<ffffffff802224db>] generic_make_request+0x19b/0x4c0
[ 723.230401] [<ffffffff8022287d>] submit_bio+0x7d/0x110
[ 723.230410] [<ffffffff80147c12>] submit_bh+0x102/0x150
[ 723.230423] [<ffffffff8014adac>] block_read_full_page+0x23c/0x3b0
[ 723.230434] [<ffffffff801509b6>] blkdev_readpage+0x26/0x50
[ 723.230443] [<ffffffff800e03f6>] read_pages+0xe6/0x110
[ 723.230453] [<ffffffff800e05ac>] __do_page_cache_readahead+0x18c/0x1b0
[ 723.230463] [<ffffffff800e05ff>] ra_submit+0x2f/0x50
[ 723.230475] [<ffffffff800e087d>] ondemand_readahead+0x11d/0x260
[ 723.230487] [<ffffffff800e0aad>] page_cache_sync_readahead+0x2d/0x50
[ 723.230495] [<ffffffff800d73d6>] T.731+0x2b6/0x440
[ 723.230504] [<ffffffff800d7626>] generic_file_aio_read+0xc6/0x1f0
[ 723.230514] [<ffffffff80118da2>] do_sync_read+0x102/0x160
[ 723.230523] [<ffffffff801192d5>] vfs_read+0xd5/0x1c0
[ 723.230535] [<ffffffff801199fb>] sys_read+0x5b/0xa0
[ 723.230547] [<ffffffff8000c868>] system_call_fastpath+0x16/0x1b
[ 723.230556] [<00007fb60ac89a90>] 0x7fb60ac89a90
[ 723.230563] INFO: task lvscan:5451 blocked for more than 120 seconds.
[ 723.230566] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this
message.
[ 723.230570] lvscan D 000000003a1ec879 0 5451 5449 0x00000000
[ 723.230576] ffff880038207898 0000000000000282 ffff8800382077f8
ffff880038207818
[ 723.230581] ffff8800382077e8 ffff880038207860 000000000000a380
ffff8800407047e8
[ 723.230587] 000000000000a380 000000000000a380 000000000000a380
0000000000007d00
[ 723.230592] Call Trace:
[ 723.230606] [<ffffffffa002c99d>] wait_barrier+0x15d/0x1f0 [raid1]
[ 723.230618] [<ffffffffa002fef8>] make_request+0x58/0x690 [raid1]
[ 723.230631] [<ffffffff80399498>] md_make_request+0xc8/0x140
[ 723.230641] [<ffffffff802224db>] generic_make_request+0x19b/0x4c0
[ 723.230650] [<ffffffff8022287d>] submit_bio+0x7d/0x110
[ 723.230659] [<ffffffff8015210b>] dio_bio_submit+0x6b/0xc0
[ 723.230668] [<ffffffff80152d28>] direct_io_worker+0x258/0x3c0
[ 723.230678] [<ffffffff801530be>] __blockdev_direct_IO+0x22e/0x4d0
[ 723.230687] [<ffffffff80150838>] blkdev_direct_IO+0x58/0x80
[ 723.230695] [<ffffffff800d7737>] generic_file_aio_read+0x1d7/0x1f0
[ 723.230708] [<ffffffff80118da2>] do_sync_read+0x102/0x160
[ 723.230718] [<ffffffff801192d5>] vfs_read+0xd5/0x1c0
[ 723.230727] [<ffffffff801199fb>] sys_read+0x5b/0xa0
[ 723.230736] [<ffffffff8000c868>] system_call_fastpath+0x16/0x1b
[ 723.230745] [<00007f78d4b85a80>] 0x7f78d4b85a80
[ 843.222203] INFO: task md1_resync:1441 blocked for more than 120 seconds.
[ 843.222290] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this
message.
[ 843.222406] md1_resync D 0000000000000000 0 1441 2 0x00000000
[ 843.222508] ffff88003d967bb0 0000000000000246 ffff88003d967b10
ffff88003d967b30
[ 843.222641] 0000000000000000 ffff88003d967b78 000000000000a380
ffff88003dba8be8
[ 843.222777] 000000000000a380 000000000000a380 000000000000a380
0000000000007d00
[ 843.222912] Call Trace:
[ 843.223009] [<ffffffffa002c63e>] raise_barrier+0xde/0x2e0 [raid1]
[ 843.223091] [<ffffffffa002d5cb>] sync_request+0x12b/0x680 [raid1]
[ 843.223175] [<ffffffff80399de9>] md_do_sync+0x669/0xc40
[ 843.223250] [<ffffffff8039ac54>] md_thread+0x54/0x150
[ 843.223325] [<ffffffff8006fac6>] kthread+0xb6/0xc0
[ 843.223400] [<ffffffff8000d38a>] child_rip+0xa/0x20


Reproducible: Always

Steps to Reproduce:
1. DOM0 Setup: blkback -> DRBD -> LVM2 -> MD RAID1 -> SATA
2. Maybe run some disk load inside a few VMs and set min raid sync speed higher
quite a bit to force the situation more frequently.
3. Crash the hardware node (echo b > /proc/sysrq-trigger) and then restart the
VMs. Crashing is not required, disk stress is enough. (dd writes)
Actual Results:
md raid1 device with VMs' LVM2 devices hangs. SATA layer works perfectly.

Expected Results:
I/O continues, resync completes, no I/O hanging at the softraid level.

--
Configure bugmail: https://bugzilla.novell.com/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are on the CC list for the bug.

< Previous Next >
Follow Ups