[Bug 390384] New: Dom0 kernel oops while processing aio operation
https://bugzilla.novell.com/show_bug.cgi?id=390384

User kwolf@novell.com added comment
https://bugzilla.novell.com/show_bug.cgi?id=390384#c380514

           Summary: Dom0 kernel oops while processing aio operation
           Product: openSUSE 11.0
           Version: Beta 2
          Platform: Other
        OS/Version: Other
            Status: NEW
          Severity: Normal
          Priority: P5 - None
         Component: Xen
        AssignedTo: cgriffin@novell.com
        ReportedBy: kwolf@novell.com
         QAContact: qa@suse.de
          Found By: ---

Created an attachment (id=215278)
 --> (https://bugzilla.novell.com/attachment.cgi?id=215278)
/var/log/messages snippet

Another oops that occurred on my development box while debugging the tap:aio problems. Note that I selected Xen because it's a Dom0 kernel, but I can't rule out that it is a general kernel problem.

This one happened somewhere in the middle of a PV VM installation using tap:aio both for the virtual hard disk and the installation DVD ISO. The IO requests of the guest are handled by tapdisk, which in turn accesses the image file using Linux aio. Obviously something went wrong with one of these accesses.

This could even be directly related to bug #380514. The usual symptom I get there is an EIO return value from random aio operations (say, up to five failing operations per VM installation). In most cases it worked to simply repeat a failed request in tapdisk. In the logfile the oops immediately follows such a repeated request.

I've not been able to reproduce this yet. A snippet from /var/log/messages containing the stack trace is attached.

-- 
Configure bugmail: https://bugzilla.novell.com/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are on the CC list for the bug.
https://bugzilla.novell.com/show_bug.cgi?id=390384

User carnold@novell.com added comment
https://bugzilla.novell.com/show_bug.cgi?id=390384#c1

Charles Arnold <carnold@novell.com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |carnold@novell.com, lbendixs@novell.com,
                   |                            |jbeulich@novell.com
             Status|NEW                         |NEEDINFO
      Info Provider|                            |kwolf@novell.com
          QAContact|qa@suse.de                  |jdouglas@novell.com

--- Comment #1 from Charles Arnold <carnold@novell.com> 2008-05-21 09:19:12 MST ---
Kevin,

Can you reproduce this on the latest kernel for openSUSE 11.0 and provide the same kind of trace for Jan? It still may not be Xen-specific. If we can't reproduce this on a current kernel, we will probably close this bug.
https://bugzilla.novell.com/show_bug.cgi?id=390384

User kwolf@novell.com added comment
https://bugzilla.novell.com/show_bug.cgi?id=390384#c2

Kevin Wolf <kwolf@novell.com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEEDINFO                    |NEW
      Info Provider|kwolf@novell.com            |

--- Comment #2 from Kevin Wolf <kwolf@novell.com> 2008-05-26 11:36:08 MDT ---
Created an attachment (id=218185)
 --> (https://bugzilla.novell.com/attachment.cgi?id=218185)
Another /var/log/messages snippet for a newer kernel

I was able to reproduce it on the latest kernel. Please see the attached log for the new trace.
https://bugzilla.novell.com/show_bug.cgi?id=390384

User jbeulich@novell.com added comment
https://bugzilla.novell.com/show_bug.cgi?id=390384#c3

Jan Beulich <jbeulich@novell.com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEW                         |NEEDINFO
      Info Provider|                            |kwolf@novell.com

--- Comment #3 from Jan Beulich <jbeulich@novell.com> 2008-05-28 04:42:27 MDT ---
Where do the "TAPDISK[4223]: EIO. Repeating op = 1, 0x602e5a/8/5" messages originate? As they appeared right before the crash in both cases, I suspect they might be key to understanding what's going on here. As I wasn't able to locate a message like this anywhere, I wonder whether it is from a local patch of yours...

Also, what build does kernel 2.6.25.4-6 come from - all I can find are builds with 2.6.25.4-2, and in order to analyze the back trace I need the matching kernel RPM.
https://bugzilla.novell.com/show_bug.cgi?id=390384

User kwolf@novell.com added comment
https://bugzilla.novell.com/show_bug.cgi?id=390384#c4

Kevin Wolf <kwolf@novell.com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEEDINFO                    |NEW
      Info Provider|kwolf@novell.com            |

--- Comment #4 from Kevin Wolf <kwolf@novell.com> 2008-05-28 05:42:03 MDT ---
Created an attachment (id=218574)
 --> (https://bugzilla.novell.com/attachment.cgi?id=218574)
Local patch to tapdisk

Yes, this error message is from my local changes (attaching the patch), which I made to debug the tap:aio problems. Besides the debug messages, the real change is that whenever an AIO operation returns -EIO, this operation is repeated (I know that this is probably not a proper solution, I just wanted to see if it helps - and it certainly shouldn't cause an oops). This is exactly the entry you're seeing in the logfiles.

As Charles asked me to take the latest kernel, the kernel is from /mounts/dist/next-x86_64.
https://bugzilla.novell.com/show_bug.cgi?id=390384

User jbeulich@novell.com added comment
https://bugzilla.novell.com/show_bug.cgi?id=390384#c5

Jan Beulich <jbeulich@novell.com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
         AssignedTo|cgriffin@novell.com         |kernel-maintainers@forge.provo.novell.com

--- Comment #5 from Jan Beulich <jbeulich@novell.com> 2008-05-28 07:15:07 MDT ---
From the little analysis we were able to do this looks like a problem not specific to the Xen kernel. And even if it is, we'd need someone with much better AIO/direct-io knowledge to assist here.
https://bugzilla.novell.com/show_bug.cgi?id=390384

Lars Marowsky-Bree <lmb@novell.com> changed:

           What    |Removed                                   |Added
----------------------------------------------------------------------------
         AssignedTo|kernel-maintainers@forge.provo.novell.com |jack@novell.com
https://bugzilla.novell.com/show_bug.cgi?id=390384

User jack@novell.com added comment
https://bugzilla.novell.com/show_bug.cgi?id=390384#c6

Jan Kara <jack@novell.com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEW                         |NEEDINFO
      Info Provider|                            |kwolf@novell.com

--- Comment #6 from Jan Kara <jack@novell.com> 2008-05-28 15:03:50 MDT ---
Hmm, it looks like the oops happened in do_direct_IO(), somewhere in zero_user(). Just to verify: your program was doing direct IO on a block device with a block size of 512 bytes? Somehow the page the user provided got freed before we got to it (that could possibly be a fault of Xen's tap:aio if it plays some dirty tricks an ordinary userspace process couldn't do). So is tap:aio just an ordinary process using AIO/dio or does it do anything special?

Regarding those EIO errors before - what filesystem do you use? There's a problem in ext3 which could cause an EIO return from direct IO writes when they are combined with buffered writes.
https://bugzilla.novell.com/show_bug.cgi?id=390384

User kwolf@novell.com added comment
https://bugzilla.novell.com/show_bug.cgi?id=390384#c7

Kevin Wolf <kwolf@novell.com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEEDINFO                    |NEW
      Info Provider|kwolf@novell.com            |

--- Comment #7 from Kevin Wolf <kwolf@novell.com> 2008-05-29 01:49:31 MDT ---
tapdisk is just an ordinary process (running as root, though). It gets requests from the blktap frontend and then uses the normal AIO functions to actually read/write to the image.

Yes, the images are on ext3.

Btw, looking at my local patch to tapdisk I saw that it doesn't make too much sense. Please forget everything I said about repeating a failed request. What actually happens is that an AIO write fails with EIO (for an unknown reason, this would be the key to bug #380514 I suspect) and tapdisk then issues a _read_ with otherwise the same parameters. And it seems to be this read which causes the oops - I can try to verify this tomorrow.
https://bugzilla.novell.com/show_bug.cgi?id=390384

User jack@novell.com added comment
https://bugzilla.novell.com/show_bug.cgi?id=390384#c8

--- Comment #8 from Jan Kara <jack@novell.com> 2008-05-29 02:04:01 MDT ---
I see. So it's indeed a bug in the AIO/DIO code in the kernel (a userspace process shouldn't be able to oops the kernel). I'll have a look into that.

Anyway, for the EIO problem, I'll attach a fix which just went into the -mm tree. Can you please try it and see whether it fixes the problem for you?
https://bugzilla.novell.com/show_bug.cgi?id=390384

User jack@novell.com added comment
https://bugzilla.novell.com/show_bug.cgi?id=390384#c9

--- Comment #9 from Jan Kara <jack@novell.com> 2008-05-29 02:07:47 MDT ---
Created an attachment (id=218811)
 --> (https://bugzilla.novell.com/attachment.cgi?id=218811)
Patch fixing EIO errors returned by ext3 when DIO used together with buffered IO
https://bugzilla.novell.com/show_bug.cgi?id=390384

User jack@novell.com added comment
https://bugzilla.novell.com/show_bug.cgi?id=390384#c10

--- Comment #10 from Jan Kara <jack@novell.com> 2008-05-29 04:01:25 MDT ---
To decrease some uncertainty I have a few questions:

The Xen device that tap:aio has been accessing was backed by a regular file, wasn't it?

From what I understood from reading the tap:aio sources, do_cow_read() issues a direct IO read and provides an mmapped region of the same file as the buffer, correct?

I'm trying to reproduce the problem here (with a simple program) and I've been unsuccessful so far...
https://bugzilla.novell.com/show_bug.cgi?id=390384

User jack@novell.com added comment
https://bugzilla.novell.com/show_bug.cgi?id=390384#c11

Jan Kara <jack@novell.com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEW                         |NEEDINFO
      Info Provider|                            |kwolf@novell.com

--- Comment #11 from Jan Kara <jack@novell.com> 2008-05-29 04:01:59 MDT ---
Forgot to set NEEDINFO...
https://bugzilla.novell.com/show_bug.cgi?id=390384

User kwolf@novell.com added comment
https://bugzilla.novell.com/show_bug.cgi?id=390384#c12

Kevin Wolf <kwolf@novell.com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEEDINFO                    |NEW
      Info Provider|kwolf@novell.com            |

--- Comment #12 from Kevin Wolf <kwolf@novell.com> 2008-05-29 12:24:52 MDT ---
Yes, I use a regular file as image.

I think the buffer is a page from the ring buffer shared by the Dom0 backend and DomU frontend blktap drivers. If I understand correctly, the mmap is not on the image file but on /dev/xen/blktapN.

I will check whether your patch helps with the EIO returns.
https://bugzilla.novell.com/show_bug.cgi?id=390384

User kwolf@novell.com added comment
https://bugzilla.novell.com/show_bug.cgi?id=390384#c13

--- Comment #13 from Kevin Wolf <kwolf@novell.com> 2008-05-30 02:12:45 MDT ---
The patch doesn't seem to help, I'm still getting the EIO returns.
https://bugzilla.novell.com/show_bug.cgi?id=390384

User jack@novell.com added comment
https://bugzilla.novell.com/show_bug.cgi?id=390384#c14

Jan Kara <jack@novell.com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEW                         |NEEDINFO
      Info Provider|                            |kwolf@novell.com

--- Comment #14 from Jan Kara <jack@novell.com> 2008-06-02 07:37:03 MDT ---
OK, thanks for testing. Would it be possible to create a test program that doesn't use Xen's kernel blktap driver (only regular files) and still triggers the bug (either the EIO or the oops)? At least the oops could well be caused by the blktap kernel driver: direct-io oopses because the pages it was given to do IO to suddenly disappear, and these are pages mmapped from the blktap driver, if I understand the situation correctly.
https://bugzilla.novell.com/show_bug.cgi?id=390384

User jbeulich@novell.com added comment
https://bugzilla.novell.com/show_bug.cgi?id=390384#c15

--- Comment #15 from Jan Beulich <jbeulich@novell.com> 2008-06-02 07:52:36 MDT ---
Not exactly: The pages don't disappear, they become read-only. Unless blktap re-maps pages read-only, this is a pretty good sign that the page in question meanwhile got handed back to the allocator and got re-allocated as a page table page (verifying this would require looking at the contents of the page).

(Of course, if the page was read-only even earlier, maybe this could explain the -EIO - the error value wouldn't be well chosen if so, but it's a possibility anyway.)
https://bugzilla.novell.com/show_bug.cgi?id=390384

User jack@novell.com added comment
https://bugzilla.novell.com/show_bug.cgi?id=390384#c16

--- Comment #16 from Jan Kara <jack@novell.com> 2008-06-02 08:09:50 MDT ---
How have you found out the page is RO (I always like to learn how to find more info from the oops ;)? Anyway yes, what you write would cause the oops as well and it would be a bug in the blktap driver. Thanks for pointing that out.

Regarding the EIO: EIO from direct-io can result when the invalidate_inode_pages2_range() call we do before the actual direct IO fails for some reason (this call evicts page-cache pages in the area where the direct write is going to happen). This is usually caused by someone still holding references to the buffers where we want to write with direct IO. The fact that simply retrying the IO usually helps also suggests that this could be the case here. But all the cases we were aware of (and were able to trigger in our testing) should be fixed by the patch I've attached, so I'm currently not sure what we could be missing (and therefore I'd like to have a program to reproduce it without the blktap interference).
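Editorial sketch: the kind of blktap-free test program asked for above could start out roughly like the Python below. It alternates a buffered write and an O_DIRECT write over the same range of one regular file and counts how many direct writes come back with EIO. Everything here is an assumption made for illustration: the function name mixed_io_round, the 4096-byte block size, and the strictly sequential access pattern are not taken from tapdisk or from any patch in this bug. A single-threaded loop like this may well never hit the failure, since the comment attributes it to something else still holding references to the buffers at that moment; it only shows the shape of such a test.

```python
import errno
import mmap
import os
import tempfile

BLOCK = 4096  # assumed to be a multiple of the filesystem's logical block size


def mixed_io_round(path, blocks=32):
    """Alternate buffered and O_DIRECT writes on one regular file.

    Returns the number of direct writes that failed with EIO, or None if
    O_DIRECT cannot be used on this platform or filesystem.
    """
    o_direct = getattr(os, "O_DIRECT", None)
    if o_direct is None:
        return None
    buffered_fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o600)
    try:
        try:
            direct_fd = os.open(path, os.O_RDWR | o_direct)
        except OSError:
            return None  # filesystem refuses O_DIRECT
        aligned = mmap.mmap(-1, BLOCK)  # anonymous mappings are page-aligned
        aligned.write(b"D" * BLOCK)
        eio_count = 0
        try:
            for i in range(blocks):
                offset = i * BLOCK
                # The buffered write leaves dirty page-cache pages over the range...
                os.pwrite(buffered_fd, b"B" * BLOCK, offset)
                try:
                    # ...which the kernel has to flush and invalidate for the direct write.
                    os.pwrite(direct_fd, aligned, offset)
                except OSError as exc:
                    if exc.errno == errno.EIO:
                        eio_count += 1
                    elif exc.errno == errno.EINVAL:
                        return None  # alignment or filesystem restriction
                    else:
                        raise
            return eio_count
        finally:
            aligned.close()
            os.close(direct_fd)
    finally:
        os.close(buffered_fd)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        print(mixed_io_round(os.path.join(tmp, "image.raw")))
```

To get closer to the tapdisk workload, the direct writes would presumably have to be submitted through Linux AIO and run concurrently with the buffered writer; that is left out here to keep the sketch short.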
https://bugzilla.novell.com/show_bug.cgi?id=390384

User jbeulich@novell.com added comment
https://bugzilla.novell.com/show_bug.cgi?id=390384#c17

--- Comment #17 from Jan Beulich <jbeulich@novell.com> 2008-06-02 08:21:41 MDT ---
> How have you found out the page is RO (I always like to learn how to find more info from the oops ;)?

Quite early in the oops there is a line like this

PGD 1559067 PUD 175b067 PMD 18fb067 PTE 70ce165

which says that all upper page table levels have their write bit set, just the leaf entry (PTE) doesn't.
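Editorial sketch of the reading described above: in an x86-64 page-table entry, bit 0 is the present bit, bit 1 the read/write bit and bit 2 the user/supervisor bit. The short Python below decodes the four values quoted from the oops. The helper name decode_entry and the flag table are illustrative choices, not anything taken from the kernel sources.

```python
# Low flag bits of an x86-64 page-table entry (same positions at every level).
FLAGS = {
    0: "present",
    1: "writable",
    2: "user",
}


def decode_entry(hex_value):
    """Return the names of the low flag bits set in an entry given as hex text."""
    value = int(hex_value, 16)
    return {name for bit, name in FLAGS.items() if value & (1 << bit)}


oops_line = "PGD 1559067 PUD 175b067 PMD 18fb067 PTE 70ce165"
tokens = oops_line.split()
levels = dict(zip(tokens[0::2], tokens[1::2]))  # {"PGD": "1559067", ...}

for level, raw in levels.items():
    flags = decode_entry(raw)
    print(level, raw, "writable" if "writable" in flags else "read-only")
```

The three upper-level values end in 0x067, which has bit 1 set, while the PTE ends in 0x165, which does not, so only the leaf entry is reported as read-only.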
https://bugzilla.novell.com/show_bug.cgi?id=390384

User kwolf@novell.com added comment
https://bugzilla.novell.com/show_bug.cgi?id=390384#c18

--- Comment #18 from Kevin Wolf <kwolf@novell.com> 2008-06-04 11:25:39 MDT ---
I agree that a small test program independent of blktap would be helpful. I haven't succeeded yet in creating one, though.

Jan, I don't think blktap would intentionally free the page; after all, it's a ring buffer and should be reused. It seems to map DomU blkfront pages read-only for write operations, though. As I'm not too familiar with the kernel code, could you take a look at the blktap kernel module source?

And while we're at it, just out of curiosity with respect to oops reading... Is the 0003 in "Oops: 0003 [1] SMP" the error code of the page fault? That's the way I interpreted it, and then it would be another hint that the page is read-only.
https://bugzilla.novell.com/show_bug.cgi?id=390384

User jbeulich@novell.com added comment
https://bugzilla.novell.com/show_bug.cgi?id=390384#c19

--- Comment #19 from Jan Beulich <jbeulich@novell.com> 2008-06-05 04:42:36 MDT ---
Ah, indeed, I wasn't aware of them mapping pages for write requests readonly. But that shouldn't cause any problems - nothing should try to access these pages other than for reading in that path.

And yes, the 0003 is the fault error code, saying it was a present page that had an access rights violation.
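Editorial sketch: the low three bits of the x86 page-fault error code say whether the page was present, whether the access was a write, and whether it came from user mode. The Python below decodes the 0003 from "Oops: 0003 [1] SMP" on that basis; the function name describe_fault and the constant names are illustrative, not kernel identifiers.

```python
# Low three bits of the x86 page-fault error code.
PF_PROT = 1 << 0   # set: fault on a present page (protection violation)
PF_WRITE = 1 << 1  # set: the faulting access was a write
PF_USER = 1 << 2   # set: the access was made in user mode


def describe_fault(code_text):
    """Decode the low three bits of the error code printed after "Oops:"."""
    code = int(code_text, 16)
    return {
        "present": bool(code & PF_PROT),
        "write": bool(code & PF_WRITE),
        "user": bool(code & PF_USER),
    }


print(describe_fault("0003"))
# {'present': True, 'write': True, 'user': False}
```

Read this way, 0003 is a write from kernel mode to a page that is mapped but not writable, which matches the read-only PTE discussed earlier in the thread.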
https://bugzilla.novell.com/show_bug.cgi?id=390384

User kwolf@novell.com added comment
https://bugzilla.novell.com/show_bug.cgi?id=390384#c20

--- Comment #20 from Kevin Wolf <kwolf@novell.com> 2008-06-06 11:26:18 MDT ---
> Ah, indeed, I wasn't aware of them mapping pages for write requests readonly. But that shouldn't cause any problems - nothing should try to access these pages other than for reading in that path.

Thinking about it again, this _must_ cause problems in combination with my patch. As I wrote above it issues a read when the write fails - and then of course we are in the wrong code path. So I think we can forget about the oops...

Anyway, there is still the question why I'm getting EIO in the first place.
https://bugzilla.novell.com/show_bug.cgi?id=390384

User jack@novell.com added comment
https://bugzilla.novell.com/show_bug.cgi?id=390384#c21

--- Comment #21 from Jan Kara <jack@novell.com> 2008-06-18 10:42:04 MDT ---
Hmm, I don't understand that either. I'll create a debug patch for you to try. I wanted to run Xen on my test 11.0 machine and try to reproduce it, but I was not able to create a virtual machine - YaST complains "Not enough memory" when creating the host (it has 512 MB, which seems enough to me ;().

BTW, just to make sure: You applied the patch I provided to the *host* kernel, didn't you?
https://bugzilla.novell.com/show_bug.cgi?id=390384

User jack@novell.com added comment
https://bugzilla.novell.com/show_bug.cgi?id=390384#c22

--- Comment #22 from Jan Kara <jack@novell.com> 2008-06-18 11:09:48 MDT ---
Oops, sorry. I've just looked at the patch I originally posted here and it was a patch for ext4! No wonder it didn't help with ext3... I'll attach the right patch.
https://bugzilla.novell.com/show_bug.cgi?id=390384

User jack@novell.com added comment
https://bugzilla.novell.com/show_bug.cgi?id=390384#c23

Jan Kara <jack@novell.com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
 Attachment #218811|0                           |1
        is obsolete|                            |

--- Comment #23 from Jan Kara <jack@novell.com> 2008-06-18 11:11:38 MDT ---
Created an attachment (id=222848)
 --> (https://bugzilla.novell.com/attachment.cgi?id=222848)
The right fix for EIO when doing DIO

So does this patch fix the problem for you?
https://bugzilla.novell.com/show_bug.cgi?id=390384

User kwolf@novell.com added comment
https://bugzilla.novell.com/show_bug.cgi?id=390384#c24

Kevin Wolf <kwolf@novell.com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEEDINFO                    |NEW
      Info Provider|kwolf@novell.com            |

--- Comment #24 from Kevin Wolf <kwolf@novell.com> 2008-06-19 06:21:56 MDT ---
Looks good. I did three test installations and didn't get a single EIO. So chances are high that your patch is a fix for this problem.
https://bugzilla.novell.com/show_bug.cgi?id=390384

User jack@novell.com added comment
https://bugzilla.novell.com/show_bug.cgi?id=390384#c25

Jan Kara <jack@novell.com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEW                         |RESOLVED
         Resolution|                            |FIXED

--- Comment #25 from Jan Kara <jack@novell.com> 2008-06-19 06:35:49 MDT ---
Glad to hear it. I have committed the fix to the openSUSE 11 tree. HEAD will get the fix from mainline. Closing the bug. Please reopen if you see the bug again.
https://bugzilla.novell.com/show_bug.cgi?id=390384

User kwolf@novell.com added comment
https://bugzilla.novell.com/show_bug.cgi?id=390384#c26

Kevin Wolf <kwolf@novell.com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |mmeeks@novell.com

--- Comment #26 from Kevin Wolf <kwolf@novell.com> 2008-06-19 06:45:37 MDT ---
*** Bug 380514 has been marked as a duplicate of this bug. ***
https://bugzilla.novell.com/show_bug.cgi?id=380514
https://bugzilla.novell.com/show_bug.cgi?id=390384

User meissner@novell.com added comment
https://bugzilla.novell.com/show_bug.cgi?id=390384#c27

--- Comment #27 from Marcus Meissner <meissner@novell.com> 2008-07-08 08:51:06 MDT ---
11.0 update kernel released, version-release is 2.6.25.9-0.2