[Bug 775694] New: I/O subsystem completely broken; no multitasking possible
https://bugzilla.novell.com/show_bug.cgi?id=775694
https://bugzilla.novell.com/show_bug.cgi?id=775694#c0

Summary: I/O subsystem completely broken; no multitasking possible
Classification: openSUSE
Product: openSUSE 12.1
Version: Final
Platform: x86-64
OS/Version: Linux
Status: NEW
Severity: Major
Priority: P5 - None
Component: Kernel
AssignedTo: kernel-maintainers@forge.provo.novell.com
ReportedBy: mantel@suse.com
QAContact: qa-bugs@suse.de
Found By: Development
Blocker: ---

When one process is doing significant I/O, no other process can read or write data and is suspended until the first process finally finishes.

How to reproduce:

Play some music with your favorite music player. Then in some shell, type something like:

cp -a hugefile.in hugefile.out

after some seconds, music will stop playing and only continue after the cp has finished. I'm currently copying a 53 GB file and music stopped more than one minute ago!

Currently the Linux multitasking is worse than it ever was in Windows a decade ago when it still had "cooperative multitasking".

Oh, and it's not the hugetables bug:

Mandelbrot:/sys/kernel/mm/transparent_hugepage # cat enabled
always madvise [never]
Mandelbrot:/sys/kernel/mm/transparent_hugepage # uname -a
Linux Mandelbrot 3.1.10-1.9-desktop #1 SMP PREEMPT Thu Apr 5 18:48:38 UTC 2012 (4a97ec8) x86_64 x86_64 x86_64 GNU/Linux
Mandelbrot:/sys/kernel/mm/transparent_hugepage #

Meanwhile the "cp" has finished and music continues to play. After 2 (TWO!!!) minutes. On a system with 8 GB RAM and core-7 processor.

--
Configure bugmail: https://bugzilla.novell.com/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are on the CC list for the bug.
https://bugzilla.novell.com/show_bug.cgi?id=775694
https://bugzilla.novell.com/show_bug.cgi?id=775694#c
Jeff Mahoney
https://bugzilla.novell.com/show_bug.cgi?id=775694
https://bugzilla.novell.com/show_bug.cgi?id=775694#c1
--- Comment #1 from Mel Gorman
When one process is doing significant I/O, no other process can read or write data and is suspended until the first process finally finishes.
1. The bug is against openSUSE 12.1 but what kernel version are you using exactly?
2. What filesystem?
3. Is the filesystem you are doing the significant IO on and the music the same filesystem?
4. Are barriers enabled? If so, what is the experience if they are disabled?
5. What is the value of /proc/sys/vm/dirty_ratio? If it's high (like 40), what is the experience if it's lower like 10?
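For question 5, a minimal sketch of how the current value could be checked against a threshold. The function name `check_dirty_ratio` and its two optional arguments are hypothetical, introduced only for illustration; the only thing taken from the thread is the path /proc/sys/vm/dirty_ratio and the example values 40 and 10.

```shell
#!/bin/sh
# Hypothetical helper (not from the bug report): read a dirty ratio from a
# file and say whether it is above a chosen limit.
#   $1 = file to read, defaults to /proc/sys/vm/dirty_ratio
#   $2 = limit to compare against, defaults to 10
check_dirty_ratio() {
    file="${1:-/proc/sys/vm/dirty_ratio}"
    limit="${2:-10}"
    ratio=$(cat "$file") || return 1
    if [ "$ratio" -gt "$limit" ]; then
        echo "dirty_ratio=$ratio (above $limit)"
    else
        echo "dirty_ratio=$ratio (at or below $limit)"
    fi
}
```

Calling `check_dirty_ratio` with no arguments reads the live kernel setting. Lowering the value for a test run needs root, for example by writing 10 into the same file.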
How to reproduce:
Play some music with your favorite music player. Then in some shell, type something like:
cp -a hugefile.in hugefile.out
after some seconds, music will stop playing and only continue after the cp has finished. I'm currently copying a 53 GB file and music stopped more than one minute ago!
I do not see the same problem as such but I do know that IO interactivity for the 3.1 kernel was particularly bad. If there are a lot of different IO requests then I would sometimes see a process stall for a long period of time waiting on log space for ext4. This was later fixed but not all the fixes were backported. I recently backported a second round of patches related to excessive reclaim and stalls while copying to USB. I do not believe a kernel has been released with them yet but I could build a test kernel if you wanted to try it. However, it's not exactly the same problem.
On a similar note, have you tried the tumbleweed kernel? It should be 3.4 based.
<SNIP> Meanwhile the "cp" has finished and music continues to play. After 2 (TWO!!!) minutes. On a system with 8 GB RAM and core-7 processor.
The number of cores is not as important if they are stalled waiting on access to storage.
https://bugzilla.novell.com/show_bug.cgi?id=775694
https://bugzilla.novell.com/show_bug.cgi?id=775694#c2
Jan Kara
https://bugzilla.novell.com/show_bug.cgi?id=775694
https://bugzilla.novell.com/show_bug.cgi?id=775694#c3
--- Comment #3 from Hubert Mantel
https://bugzilla.novell.com/show_bug.cgi?id=775694
https://bugzilla.novell.com/show_bug.cgi?id=775694#c4
--- Comment #4 from Jan Kara
https://bugzilla.novell.com/show_bug.cgi?id=775694
https://bugzilla.novell.com/show_bug.cgi?id=775694#c5
--- Comment #5 from Mel Gorman
OK, then it gets more interesting and my patch isn't going to help. So the music is on a separate, otherwise unused disk.
And the information on the USB disk is being read, so it's not like we're waiting on data to be written to slow storage. Is there any possibility that a barrier due to the write on ext4 can cause a read on an ext3 USB stick to be delayed? Hubert, what are the mount options used for the USB stick? In particular, is it mounted with atime? That would mean that a write is necessary when accessing a file, and that write could be delayed for a long time.
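The mount options asked about here can be read from /proc/mounts. The following is a rough sketch only: the helper names `mount_options` and `atime_mode` are hypothetical, and the mount point used in the usage note is a placeholder, not the reporter's actual path.

```shell
#!/bin/sh
# Hypothetical helper: print the options field for a mount point.
#   $1 = mount point
#   $2 = mount table in /proc/mounts format, defaults to /proc/mounts
# If the mount point is listed more than once, the last entry wins.
mount_options() {
    mnt="$1"
    table="${2:-/proc/mounts}"
    awk -v m="$mnt" '$2 == m { opts = $4 } END { if (opts != "") print opts }' "$table"
}

# Hypothetical helper: classify the atime behaviour from an options string.
# Assumption: when neither noatime nor relatime is listed, access times are
# updated on every read, which is the case the question is about.
atime_mode() {
    case ",$1," in
        *,noatime,*)  echo "noatime" ;;
        *,relatime,*) echo "relatime" ;;
        *)            echo "atime" ;;
    esac
}
```

For example, `atime_mode "$(mount_options /media/usbstick)"` would report how a stick mounted at the placeholder path /media/usbstick handles access times.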
That means that the problem shouldn't be caused by the disk being busy due to other IO. The only idea I have is that reclaim is too eager and reclaims some of the loaded music data / player executable (due to dirty pages from the large copy putting stress on reclaim).
That seems unlikely; the amount of data needed by the music player should be so low that I find it hard to believe it's getting reclaimed that quickly. That said, the most recent round of backports for the 3.1 kernel included a fix for keeping mapped file pages resident. I don't think it'll fix this particular problem, but if you build a test kernel with your ext3 latency fixes then it'll also include those reclaim fixes.
But to confirm this we'd need some back traces of blocked processes when audio stalls happen I guess. Mel, do you have other ideas?
If Hubert could capture vmstat -n 1 during the whole test it would be useful to see if swapping is actually happening. Second, capturing a log of the following, substituting MUSICPLAYER for whatever the music player process is called, should tell us where it is stalling:

while [ 1 ]; do date; cat /proc/`pidof MUSICPLAYER`/stack; sleep 1; done

We also know from https://lkml.org/lkml/2012/7/10/132 that read latencies for ext3 can be very bad. Hubert, to see if this is ext3 specific would it be possible for you to test with a similarly sized USB stick formatted with ext4 instead of ext3 please?
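Once the vmstat output has been saved to a file, a sketch like the one below could answer the swapping question quickly. The function name `count_swap_samples` is hypothetical, and the column positions are an assumption: it expects the usual vmstat layout in which si and so are the 7th and 8th columns, so check the header of the captured log before trusting the count.

```shell
#!/bin/sh
# Hypothetical helper: count the samples in a saved "vmstat -n 1" log that
# show swap activity.
#   $1 = log file
# Assumption: si (swap in) is column 7 and so (swap out) is column 8.
# Header lines are skipped because their first field is not a number.
count_swap_samples() {
    awk '$1 ~ /^[0-9]+$/ && ($7 + 0 > 0 || $8 + 0 > 0) { n++ }
         END { print n + 0 }' "$1"
}
```

A result of 0 for a log covering the whole copy would suggest that the stall is not caused by the player being swapped out.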
https://bugzilla.novell.com/show_bug.cgi?id=775694
https://bugzilla.novell.com/show_bug.cgi?id=775694#c6
--- Comment #6 from Jan Kara
(In reply to comment #4)
OK, then it gets more interesting and my patch isn't going to help. So the music is on a separate, otherwise unused disk.
And the information on the USB disk is being read, so it's not like we're waiting on data to be written to slow storage. Is there any possibility that a barrier due to the write on ext4 can cause a read on an ext3 USB stick to be delayed?
No. Barriers (cache flushes in fact) are per disk and completely independent.
That means that the problem shouldn't be caused by the disk being busy due to other IO. The only idea I have is that reclaim is too eager and reclaims some of the loaded music data / player executable (due to dirty pages from the large copy putting stress on reclaim).
That seems unlikely; the amount of data needed by the music player should be so low that I find it hard to believe it's getting reclaimed that quickly. That said, the most recent round of backports for the 3.1 kernel included a fix for keeping mapped file pages resident. I don't think it'll fix this particular problem, but if you build a test kernel with your ext3 latency fixes then it'll also include those reclaim fixes.
http://beta.suse.com/private/jack/775694/ has the test kernel with my ext3 latency fix, built from the openSUSE-12.1 branch as of yesterday. Hubert, can you try running it?
We also know from https://lkml.org/lkml/2012/7/10/132 that read latencies for ext3 can be very bad.
Yes, but only in the presence of heavier writing to the same filesystem. So I don't think ext3 on the stick really matters. I guess gathering stack traces is a good first step, then we will see where to go from there...
https://bugzilla.novell.com/show_bug.cgi?id=775694
https://bugzilla.novell.com/show_bug.cgi?id=775694#c7
--- Comment #7 from Jan Kara
https://bugzilla.novell.com/show_bug.cgi?id=775694
https://bugzilla.novell.com/show_bug.cgi?id=775694#c8
Hubert Mantel