[Bug 208782] New: When accessing the tape drive the system "crashes" sometimes with an Oops
https://bugzilla.novell.com/show_bug.cgi?id=208782

Summary: When accessing the tape drive the system "crashes" sometimes with an Oops
Product: SUSE Linux 10.1
Version: Final
Platform: 32bit
OS/Version: SuSE Linux 10.1
Status: NEW
Severity: Major
Priority: P5 - None
Component: Kernel
AssignedTo: kernel-maintainers@forge.provo.novell.com
ReportedBy: kern@sibbald.com
QAContact: qa@suse.de

When accessing the tape drive (several machines) with Bacula (backup software) the keyboard input goes away, then shortly afterwards the mouse, always requiring a hard reboot. This happens on several machines, and once I have a program (under development) that fails, the failure is 100% reproducible. A small change in the program can make the problem go away, so it seems to be position dependent or size dependent.

I have tested this on two machines with different SCSI cards and three different kernels:

  kernel-smp-2.6.16.21-0.13 on i686 SCSI Adaptec AIC-7892A U160/m (rev 02)
  kernel-smp-2.6.16.21-0.21 on i686 SCSI Adaptec AIC-7892A U160/m (rev 02)
  kernel-default-2.6.16.13-4 on i586 SCSI Adaptec AHA-2940U2/U2W

Basic scenario:
  open() tape drive
  do a few I/O's
  read() 64524 bytes (on tape with EOF at beginning)
  read returns errno=EBUSY
  re-read a few times
  system crashes

The crash does not always produce an oops (probably too severe).

I've straced the code and everything looks pretty normal except the return value from the read() (EBUSY). Changing a few lines in the program more or less at random can make the problem go away. It is terribly frustrating ...

I have numerous Oopses of various types. Here is an example (I have wrapped a few lines).

Sep 28 10:30:02 rufus kernel: Unable to handle kernel NULL pointer dereference at virtual address 00000000
Sep 28 10:30:02 rufus kernel: printing eip:
Sep 28 10:30:02 rufus kernel: 00000000
Sep 28 10:30:02 rufus kernel: *pde = 00000000
Sep 28 10:30:02 rufus kernel: Oops: 0000 [#1]
Sep 28 10:30:02 rufus kernel: SMP
Sep 28 10:30:02 rufus kernel: last sysfs file: /devices/pci0000:00/0000:00:01.0/0000:01:00.0/resource
Sep 28 10:30:02 rufus kernel: Modules linked in: appletalk ax25 ipx p8023 autofs4 ipv6 nfsd exportfs lockd nfs_acl sunrpc snd_pcm_oss snd_mixer_oss snd_seq snd_seq_device edd button battery ac apparmor aamatch_pcre loop usbhid dm_mod hw_random intel_agp agpgart shpchp pci_hotplug ehci_hcd uhci_hcd usbcore i8xx_tco e100 mii i2c_i801 snd_intel8x0 snd_ac97_codec i2c_core snd_ac97_bus snd_pcm snd_timer snd soundcore snd_page_alloc ide_cd cdrom parport_pc lp parport ext3 jbd fan thermal processor sg st aic7xxx scsi_transport_spi ata_piix libata piix sd_mod scsi_mod ide_disk ide_core
Sep 28 10:30:02 rufus kernel: CPU: 0
Sep 28 10:30:02 rufus kernel: EIP: 0060:[<00000000>] Tainted: G U VLI
Sep 28 10:30:02 rufus kernel: EFLAGS: 00010016 (2.6.16.21-0.21-smp #1)
Sep 28 10:30:02 rufus kernel: EIP is at _stext+0x3feffd40/0x29
Sep 28 10:30:02 rufus kernel: eax: f6d7eff4 ebx: f6d7eff4 ecx: 00000000 edx: 00000003
Sep 28 10:30:02 rufus kernel: esi: c1a76000 edi: 00000001 ebp: df66bd5c esp: df66bd3c
Sep 28 10:30:02 rufus kernel: ds: 007b es: 007b ss: 0068
Sep 28 10:30:02 rufus kernel: Process cc1plus (pid: 5293, threadinfo=df66a000 task=c1a62130)
Sep 28 10:30:02 rufus kernel: Stack: <0>c0118d94 df66bd8c 00000003 c1800014 00000000 c1800014 df66bd8c 00000001
Sep 28 10:30:02 rufus kernel: df66bd80 c011ad45 00000000 df66bd8c 00000003 00000296 c1800014 00000000
Sep 28 10:30:02 rufus kernel: c13ec060 00001000 c0130964 df66bd8c c13ec060 00000000 00001000 c0142fcc
Sep 28 10:30:02 rufus kernel: Call Trace:
Sep 28 10:30:02 rufus kernel: [<c0118d94>] __wake_up_common+0x2f/0x53
Sep 28 10:30:02 rufus kernel: [<c011ad45>] __wake_up+0x2a/0x3d
Sep 28 10:30:02 rufus kernel: [<c0130964>] __wake_up_bit+0x29/0x2e
Sep 28 10:30:02 rufus kernel: [<c0142fcc>] generic_file_buffered_write+0x462/0x58b
Sep 28 10:30:02 rufus kernel: [<f95e1538>] ext3_permission+0x0/0xa [ext3]
Sep 28 10:30:02 rufus kernel: [<c01241f0>] current_fs_time+0x4c/0x58
Sep 28 10:30:02 rufus kernel: [<c01434cd>] __generic_file_aio_write_nolock+0x3d8/0x425
Sep 28 10:30:02 rufus kernel: [<c0168f75>] link_path_walk+0xb3/0xbd
Sep 28 10:30:02 rufus kernel: [<c0143921>] generic_file_aio_write+0x57/0xab
Sep 28 10:30:02 rufus kernel: [<f95d2d0d>] ext3_file_write+0x19/0x84 [ext3]
Sep 28 10:30:02 rufus kernel: [<c015b5fb>] do_sync_write+0xb8/0xf3
Sep 28 10:30:02 rufus kernel: [<c013097f>] autoremove_wake_function+0x0/0x2d
Sep 28 10:30:02 rufus kernel: [<c0133088>] hrtimer_run_queues+0x55/0xf0
Sep 28 10:30:02 rufus kernel: [<c015b543>] do_sync_write+0x0/0xf3
Sep 28 10:30:02 rufus kernel: [<c015beac>] vfs_write+0xaa/0x14f
Sep 28 10:30:02 rufus kernel: [<c015c4b1>] sys_write+0x3c/0x63
Sep 28 10:30:02 rufus kernel: [<c0103bdb>] sysenter_past_esp+0x54/0x79
Sep 28 10:30:02 rufus kernel: Code: Bad EIP value.

I can send more oopses and strace output. I also have a relatively simple version of the program that I can send that shows the problem, but Bacula is a big program, and with an Oops like this, I am just not going to be able to simplify it to a 20 line program.

This is critical for me, because at this point the project has been dead since approximately 8 Sept when the problem first showed up. I haven't been able to isolate exactly what is triggering this.

--
Configure bugmail: https://bugzilla.novell.com/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are on the CC list for the bug, or are watching someone who is.
https://bugzilla.novell.com/show_bug.cgi?id=208782

------- Comment #1 from kern@sibbald.com 2006-09-28 10:46 MST -------
(In reply to comment #0)
> Basic scenario:
>   open() tape drive
>   do a few I/O's

Oops, the above should read: do a few ioctl()s.
https://bugzilla.novell.com/show_bug.cgi?id=208782

kern@sibbald.com changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |kern@sibbald.com

------- Comment #2 from kern@sibbald.com 2006-09-28 13:12 MST -------
Kernel kernel-smp-2.6.16.21-0.25 dies even faster than previous kernels, but did not produce an oops in the log.

Here is the interesting part of the output of strace for the program that crashes the system:

stat64("/dev/nst0", {st_mode=S_IFCHR|0660, st_rdev=makedev(9, 128), ...}) = 0
write(1, "bscan: dev.c:242 init_dev: tape="..., 53) = 53
write(1, "bscan: butil.c:174 Acquire devic"..., 43) = 43
write(1, "bscan: acquire.c:52 jcr->dcr=80b"..., 37) = 37
write(1, "bscan: acquire.c:84 MediaType dc"..., 45) = 45
write(1, "bscan: acquire.c:155 dir_get_vol"..., 41) = 41
write(1, "bscan: bscan.c:1275 Fake dir_get"..., 45) = 45
write(1, "bscan: autochanger.c:108 Device "..., 74) = 74
write(1, "bscan: acquire.c:178 bstored: op"..., 53) = 53
write(1, "bscan: dev.c:283 open dev: type="..., 101) = 101
time(NULL) = 1159470131
write(1, "bscan: dev.c:331 Open dev: devic"..., 42) = 42
write(1, "bscan: dev.c:344 Try open \"DDS-4"..., 66) = 66
open("/dev/nst0", O_RDONLY|O_NONBLOCK|O_LARGEFILE) = 3
write(1, "bscan: dev.c:366 Rewind after op"..., 35) = 35
ioctl(3, MGSL_IOCGPARAMS or MTIOCTOP or SNDCTL_MIDI_MPUMODE, 0xbfa2e030) = 0
close(3) = 0
open("/dev/nst0", O_RDONLY|O_LARGEFILE) = 3
ioctl(3, MGSL_IOCGPARAMS or MTIOCTOP or SNDCTL_MIDI_MPUMODE, 0xbfa2dfe0) = 0
write(1, "bscan: dev.c:2223 In set_os_devi"..., 46) = 46
write(1, "bscan: dev.c:2229 Set block size"..., 41) = 41
ioctl(3, MGSL_IOCGPARAMS or MTIOCTOP or SNDCTL_MIDI_MPUMODE, 0xbfa2dfdc) = 0
write(1, "bscan: dev.c:417 open dev: tape "..., 41) = 41
write(1, "bscan: dev.c:297 preserve=0x0 fd"..., 35) = 35
write(1, "bscan: acquire.c:184 opened dev "..., 55) = 55
write(1, "bscan: acquire.c:188 calling rea"..., 44) = 44
write(1, "bscan: label.c:68 Enter read_vol"..., 108) = 108
ioctl(3, MGSL_IOCGPARAMS or MTIOCTOP or SNDCTL_MIDI_MPUMODE, 0xbfa2e004) = 0
write(1, "bscan: label.c:137 Big if statem"..., 57) = 57
write(1, "bscan: block.c:921 Full read in "..., 67) = 67
read(3, 0x80b3e00, 64512) = -1 EBUSY (Device or resource busy)
open("/usr/share/locale/en_US.UTF-8/LC_MESSAGES/libc.mo", O_RDONLY) = -1 ENOENT (No suc
open("/usr/share/locale/en_US.utf8/LC_MESSAGES/libc.mo", O_RDONLY) = -1 ENOENT (No such
open("/usr/share/locale/en_US/LC_MESSAGES/libc.mo", O_RDONLY) = -1 ENOENT (No such file
open("/usr/share/locale/en.UTF-8/LC_MESSAGES/libc.mo", O_RDONLY) = -1 ENOENT (No such f
open("/usr/share/locale/en.utf8/LC_MESSAGES/libc.mo", O_RDONLY) = -1 ENOENT (No such fi
open("/usr/share/locale/en/LC_MESSAGES/libc.mo", O_RDONLY) = -1 ENOENT (No such file or
write(1, "bscan: block.c:967 ===== read re"..., 84) = 84
nanosleep({10, 0},

System crash -- the nanosleep output did not even make it to disk.

Note: the read() of a tape containing one EOF mark at the beginning returns an EBUSY -- odd.
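The ioctl that strace prints above under three alternative names is, on a tape device, presumably MTIOCTOP (the debug output just before the first one reads "Rewind after op..."). As an illustration only, and not taken from Bacula, the sketch below rebuilds that request number the way the Linux `_IOW('m', 1, struct mtop)` macro does on x86 and packs a rewind request. The helper names `iow` and `rewind_request` are hypothetical, and the bit layout assumed here (8-bit number, 8-bit type, 14-bit size, 2-bit direction) is the common Linux one; some architectures use a different layout.

```python
import struct

IOC_WRITE = 1          # direction value used by _IOW on x86 (assumed layout)
MTREW = 6              # rewind opcode from <sys/mtio.h>
MTOP_FORMAT = "hi"     # struct mtop { short mt_op; int mt_count; }


def iow(ioc_type, number, size):
    """Encode an _IOW request number using the generic Linux bit layout."""
    return (IOC_WRITE << 30) | (size << 16) | (ord(ioc_type) << 8) | number


# Request number for MTIOCTOP, derived rather than hard-coded.
MTIOCTOP = iow("m", 1, struct.calcsize(MTOP_FORMAT))


def rewind_request():
    """Return the packed struct mtop asking for a single rewind."""
    return struct.pack(MTOP_FORMAT, MTREW, 1)
```

With a descriptor open on the drive, the call itself would be `fcntl.ioctl(fd, MTIOCTOP, rewind_request())`; it is left out of the block so that the example runs without tape hardware.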
https://bugzilla.novell.com/show_bug.cgi?id=208782

kern@sibbald.com changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
           Severity|Major                       |Blocker
https://bugzilla.novell.com/show_bug.cgi?id=208782

------- Comment #3 from kern@sibbald.com 2006-09-29 01:35 MST -------
Last night, I loaded the Fedora FC4 kernel 2.6.17-1.2142_FC4 on my SuSE 10.1 system and booted it. Aside from the fact that AppArmor obviously doesn't run, and mtx fails to work, everything seems to run fine.

Results: using the *identical* binaries that failed on the SuSE kernels, I am unable to reproduce the bug on the Fedora kernel. Running my full tape regression script, no crash.

Conclusion: you have a badly broken SCSI driver. My best *wild* guess from the symptoms is that you are forgetting to lock down my pages, which are probably getting swapped out during I/O.
https://bugzilla.novell.com/show_bug.cgi?id=208782

------- Comment #4 from kern@sibbald.com 2006-09-29 02:49 MST -------
Created an attachment (id=99917)
 --> (https://bugzilla.novell.com/attachment.cgi?id=99917&action=view)
Kernel log entries for 16 oopses

I've pulled out the 16 oopses found in my system log for my main development machine.
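For anyone wanting to pull oopses out of a system log in a similar way, here is a minimal sketch. It is not the method used to produce the attachment; it assumes every oops starts with an "Unable to handle kernel" line and ends with a "Code:" line, as in the example in the original report, and the names `extract_oopses` and `call_trace` are made up for illustration.

```python
import re

# Assumed markers, taken from the example oops in the original report.
START = re.compile(r"kernel: Unable to handle kernel ")
END = re.compile(r"kernel: Code: ")
# Matches frames such as "[<c0118d94>] __wake_up_common+0x2f/0x53".
FRAME = re.compile(r"\[<[0-9a-f]+>\]\s+(\w+)\+0x[0-9a-f]+/0x[0-9a-f]+")


def extract_oopses(lines):
    """Group syslog lines into oops reports, each a list of lines.

    A report cut short by a crash (no "Code:" line) is kept as well.
    """
    reports, current = [], None
    for line in lines:
        if START.search(line):
            if current:                  # previous report was truncated
                reports.append(current)
            current = []
        if current is not None:
            current.append(line.rstrip("\n"))
            if END.search(line):         # normal end of a report
                reports.append(current)
                current = None
    if current:                          # log ended in the middle of a report
        reports.append(current)
    return reports


def call_trace(report):
    """Return the function names from the call-trace frames of one report."""
    return [m.group(1) for line in report for m in FRAME.finditer(line)]
```

Passing an open log file to `extract_oopses()` and printing `call_trace()` for each report gives a quick way to see whether the oopses share a code path or are, as reported here, of various types.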
https://bugzilla.novell.com/show_bug.cgi?id=208782

------- Comment #5 from kern@sibbald.com 2006-09-29 03:02 MST -------
Created an attachment (id=99918)
 --> (https://bugzilla.novell.com/attachment.cgi?id=99918&action=view)
6 kernel log oopses for 2.6.16.13-4-default

Here are 6 more oopses from my other machine. They look a bit different and are for a slightly older kernel (not upgraded if I remember right).
https://bugzilla.novell.com/show_bug.cgi?id=208782

lmb@novell.com changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
         AssignedTo|kernel-                     |hare@novell.com
                   |maintainers@forge.provo.nove|
                   |ll.com                      |
           Severity|Blocker                     |Major

------- Comment #6 from lmb@novell.com 2006-09-29 03:54 MST -------
Sorry, this can't be a shipment blocker; 10.1 _has_ already been shipped. I'm not downgrading this to indicate it isn't important, just that the severity doesn't apply.

1. Hannes, is this something for you?

2. Kern: Does the latest kernel of the day from ftp://ftp.suse.com/pub/projects/kernel/kotd/sles10-i386/SLES10_GA_BRANCH solve the issue? (10.1 and SLES10 share the same kernel, so this is the most recent one you can get.)
https://bugzilla.novell.com/show_bug.cgi?id=208782

------- Comment #7 from kern@sibbald.com 2006-09-29 04:46 MST -------
Concerning the severity: OK, that makes sense; I understand.

Concerning the kotd: as I am a relatively new convert from Fedora, I didn't know about kotd, so I am downloading it and will test ASAP.

Can you tell me where to find the 10.2 kernel in binary rpm format? I found the .src.rpm, but would rather not have to compile it, and I would like to file a "blocker" :-) against 10.2 if it fails so that we are sure it will get fixed. At worst, I can extract it from the 10.2 iso I guess.
https://bugzilla.novell.com/show_bug.cgi?id=208782

------- Comment #8 from kern@sibbald.com 2006-09-29 07:20 MST -------
The kotd (2.6.16.21-SLES10_GA_BRANCH_20060928132156-smp) has the same bug.
https://bugzilla.novell.com/show_bug.cgi?id=208782

------- Comment #9 from kern@sibbald.com 2006-09-29 09:31 MST -------
Good news: the kernel 2.6.18-rc5-git6-2-bigsmp taken from your 10.2 iso does not have the bug.

My only concern is the following (the ata1 and ata2 lines) from the 2.6.18-rc5-git6-2-bigsmp boot messages (I included a few extra lines before and after ...):

ide0 at 0x1f0-0x1f7,0x3f6 on irq 14
hda: max request size: 128KiB
hda: Host Protected Area detected.
  current capacity is 234375000 sectors (120000 MB)
  native capacity is 234441648 sectors (120034 MB)
hda: Host Protected Area disabled.
hda: 234441648 sectors (120034 MB) w/8192KiB Cache, CHS=65535/16/63, UDMA(100)
hda: cache flushes not supported
 hda: hda1 hda2 hda3 hda4 < hda5 hda6 hda7 hda8 >
Probing IDE interface ide1...
hdc: HL-DT-ST GCE-8483B, ATAPI CD/DVD-ROM drive
hdd: PHILIPS DVD+/-RW DVD8631, ATAPI CD/DVD-ROM drive
ide1 at 0x170-0x177,0x376 on irq 15
libata version 2.00 loaded.
ata_piix 0000:00:1f.2: version 2.00
ata_piix 0000:00:1f.2: MAP [ P0 -- P1 -- ]
ACPI: PCI Interrupt 0000:00:1f.2[A] -> GSI 18 (level, low) -> IRQ 169
PCI: Setting latency timer of device 0000:00:1f.2 to 64
ata1: SATA max UDMA/133 cmd 0xFE00 ctl 0xFE12 bmdma 0xFEA0 irq 169
ata2: SATA max UDMA/133 cmd 0xFE20 ctl 0xFE32 bmdma 0xFEA8 irq 169
scsi0 : ata_piix
ata1: port is slow to respond, please be patient
ata1: port failed to respond (30 secs)
ata1: SRST failed (status 0xFF)
ata1: SRST failed (err_mask=0x100)
ata1: softreset failed, retrying in 5 secs
ata1: SRST failed (status 0xFF)
ata1: SRST failed (err_mask=0x100)
ata1: softreset failed, retrying in 5 secs
ata1: SRST failed (status 0xFF)
ata1: SRST failed (err_mask=0x100)
ata1: reset failed, giving up
scsi1 : ata_piix
ata2: port is slow to respond, please be patient
ata2: port failed to respond (30 secs)
ata2: SRST failed (status 0xFF)
ata2: SRST failed (err_mask=0x100)
ata2: softreset failed, retrying in 5 secs
ata2: SRST failed (status 0xFF)
ata2: SRST failed (err_mask=0x100)
ata2: softreset failed, retrying in 5 secs
ata2: SRST failed (status 0xFF)
ata2: SRST failed (err_mask=0x100)
ata2: reset failed, giving up
ACPI: PCI Interrupt 0000:02:01.0[A] -> GSI 22 (level, low) -> IRQ 177
scsi2 : Adaptec AIC7XXX EISA/VLB/PCI SCSI HBA DRIVER, Rev 7.0
        <Adaptec 29160 Ultra160 SCSI adapter>
        aic7892: Ultra160 Wide Channel A, SCSI Id=7, 32/253 SCBs

All my SCSI devices do seem to work CDROM and DVD ...
https://bugzilla.novell.com/show_bug.cgi?id=208782

hare@novell.com changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEW                         |ASSIGNED
https://bugzilla.novell.com/show_bug.cgi?id=208782

------- Comment #10 from hare@novell.com 2006-10-04 07:07 MST -------
These error messages should be fixed with the next snapshot for 10.2. We now have the new libata error handling in there, which should fix the above problem. Please open another bugzilla if the ahci issue persists.

Weird thing is that the difference between 2.6.16 and 2.6.18 for 'st.c' is almost zero ...
https://bugzilla.novell.com/show_bug.cgi?id=208782

------- Comment #11 from hare@novell.com 2006-10-04 08:54 MST -------
Hmm. I suspect our 'st-ioctl-idlun-support' patch. This was forward-ported from SLES9, apparently without any checking whether it a) is required or b) breaks anything.
https://bugzilla.novell.com/show_bug.cgi?id=208782

------- Comment #12 from kern@sibbald.com 2006-10-04 15:16 MST -------
OK for the ahci messages. No problem.

I haven't had any crashes with kernel 2.6.18, so I am happy (and a lot less stressed), but some Bacula users may not want to upgrade. I'm releasing the code for beta testing tomorrow, so ...

Can I get a 10.1 kernel with the 'st-ioctl-idlun-support' patch backed out for testing?
https://bugzilla.novell.com/show_bug.cgi?id=208782

garloff@novell.com changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |garloff@novell.com

------- Comment #13 from garloff@novell.com 2006-10-04 16:30 MST -------
Hannes, see bug 60446 for an explanation of the necessity of this patch.

I can't see the patch breaking anything. The trace shows

  ioctl(3, MGSL_IOCGPARAMS or MTIOCTOP or SNDCTL_MIDI_MPUMODE, 0xbfa2e004) = 0

and the switch (cmd_in) statement does not match MTIOCTOP. Locking is consistent as well: in the original code path, the semaphore is upped again before executing the non-MTIOCTOP ioctls.

So, I still suspect something else in the SCSI (most probably SATA) layer.

Anyway, I'm building a kernel to test now. Hannes, see /abuild/buildsystem.kalman.garloff-sles10-i386 on kalman and feel free to push it to Kern as soon as it has finished building.
https://bugzilla.novell.com/show_bug.cgi?id=208782

------- Comment #14 from hare@novell.com 2006-10-05 03:27 MST -------
No, sorry, locking is _not_ consistent. With the original st.c driver we first take the semaphore, check for the device being in use, check for the drive status and then _release_ the semaphore prior to handling the ioctls. So we get an implicit synchronization by taking the lock first. The idlun patch does _not_ have that, as it doesn't take the lock in the ioctl case.
https://bugzilla.novell.com/show_bug.cgi?id=208782

------- Comment #15 from hare@novell.com 2006-10-05 03:31 MST -------
Created an attachment (id=100436)
 --> (https://bugzilla.novell.com/attachment.cgi?id=100436&action=view)
st-no-tape.diff

Fixup 'No Medium' handling for tapes.

Currently the st driver doesn't handle 'no medium' situations well. The driver issues a TUR (TEST UNIT READY) to get the status of the device, but retries it in case of failure. In addition, the error code ENOMEDIUM is not set. This patch fixes both.
https://bugzilla.novell.com/show_bug.cgi?id=208782

------- Comment #16 from hare@novell.com 2006-10-05 03:33 MST -------
With the above patch (and the st-ioctl-idlun-support patch backed out) I'm able to call SCSI_IOCTL_IDLUN even with no medium present.

I propose to use this patch instead of st-ioctl-idlun-support.
https://bugzilla.novell.com/show_bug.cgi?id=208782

------- Comment #17 from kern@sibbald.com 2006-10-05 04:36 MST -------
Here are a few points that may help you pinpoint where the problem (at least for me) is coming from.

- All my tests are with a tape in the drive.
- The sequence of events (simplified) that causes the error is:
    fd = open(/dev/nst0)
    a couple of ioctl()
    read(fd)
    errno == EBUSY (system is probably damaged at this point)
    loop rapidly 30 times redoing the read()
    system dies quickly
- The OS never crashes unless an EBUSY is returned from the read().
- IMO, I should never be getting an EBUSY, and when I see it, it indicates that the problem is present.
- If I reissue the read() multiple times, the problem quickly becomes serious.
- If I stop the program after the first EBUSY, the OS most likely will continue running, but is probably damaged, i.e. it will probably crash within the next 10 minutes.
- I don't think (a bit uncertain) that the ioctl()s are important, as I think I eliminated them but the failure still occurred.
- The failure occurs on both dual-core Pentium and single-core Pentium machines.
- Whether or not the EBUSY is returned seems to be totally bizarre. I can add an unused 8-byte field to a structure, recompile the program, and the OS will crash. I remove the 8-byte field, and the OS will run fine (no EBUSY returned). I guess this is indicative of a kernel locking problem, but why it is repeatably sensitive to an 8-byte position change in the program's data is a mystery to me.
- All buffers presented for the read() are from the heap and properly aligned (see the strace output).
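The read-and-retry part of the sequence of events listed above can be sketched in user space as follows. This is a minimal illustration, not Bacula's code: the function name `read_block_with_retry`, its parameters, and the injectable `reader` are hypothetical, and Python's os module stands in for the C calls. Only the 64512-byte read size (from the strace output) and the 30 retries come from this bug.

```python
import errno
import os
import time


def read_block_with_retry(fd, nbytes=64512, max_retries=30, delay=0.0,
                          reader=os.read):
    """Read one block, retrying only when the read fails with EBUSY.

    Returns (data, ebusy_count). Any other error is raised at once, and
    EBUSY is raised too if it persists after max_retries retries.
    `reader` exists so the logic can be exercised without a tape drive.
    """
    ebusy_count = 0
    for _ in range(max_retries + 1):     # first attempt plus the retries
        try:
            return reader(fd, nbytes), ebusy_count
        except OSError as exc:
            if exc.errno != errno.EBUSY:
                raise
            ebusy_count += 1             # the symptom described in this bug
            time.sleep(delay)
    raise OSError(errno.EBUSY, "read still returns EBUSY after retries")
```

Counting the EBUSY results rather than silently retrying matters here: according to the points above, the first EBUSY is already the sign that something has gone wrong, and each further read() makes the situation worse on an affected kernel.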
https://bugzilla.novell.com/show_bug.cgi?id=208782

hare@novell.com changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|ASSIGNED                    |NEEDINFO
      Info Provider|                            |kern@sibbald.com

------- Comment #18 from hare@novell.com 2006-10-05 04:47 MST -------
I've made a test kernel with the patch from comment #15. Please fetch it from

http://beta.suse.de/private/hare/testing/bug208782/

It might take some time before it gets mirrored to the outside world.

Please test with it; it seemed to fix the problem here.
https://bugzilla.novell.com/show_bug.cgi?id=208782

------- Comment #19 from kern@sibbald.com 2006-10-05 05:55 MST -------
I don't seem to be able to access that URL. The response is: "Timeout on server".

The rest of this comment you can ignore if you wish ..

=======
Since you guys have brought up the problem of no medium present, please permit me to do a bit of general complaining that is not directly related to this bug. Also, I don't imagine you are responsible for my issues with "no medium present" ...

Basically, during development of kernel 2.5.x the way the SCSI tape driver responds to an open() call was changed, from not so good to very bad. Prior to the change, the OS would return a file descriptor that could be used, but if there was no medium present it would return an error when you attempted to access the medium (certain ioctl()s were permitted). The change made in the 2.5.x kernel causes the SCSI driver to wait approximately 2 minutes if there is no medium in the drive, then to return an error. This change was a *very* bad idea IMO, even if it does seem to meet some new standard (the people who came up with that obviously never wrote a backup program).

This new bad behavior can be avoided by adding O_NONBLOCK to the modes of the open() call, in which case the kernel will return immediately with a file descriptor (apparently, as in the old way of doing things). The problem is that it is not clear that O_NONBLOCK behaves the same on all OSes, nor is it clear that using O_NONBLOCK does not actually put the fd in non-blocking mode when doing a read() or a write(), which is not generally desired.

In the end, my life is vastly complicated, because I am forced to attempt some I/O (rewind) to see if the device has a tape in it or not, and then to close() and reopen the file descriptor without O_NONBLOCK to ensure that on someone's system I don't have a file descriptor in a bizarre state (for a time, I cleared the O_NONBLOCK bit with fcntl(), but this was in the end more complicated).

What a programmer really wants is a *universal* way to simply know if there is a tape in the drive or not, then to open() a file descriptor without the OS imposing a wait. In addition to knowing if there is a tape in the drive, the programmer wants to know, when opening the drive, whether or not it is busy (i.e. is it in the process of doing a rewind).

ENOMEDIUM is a big help; unfortunately it is Linux specific. The O_NONBLOCK is IMO a kludge (i.e. mistake) that just makes opening a drive much more complicated.

If you wonder why a 2 minute wait is bad, then you are probably thinking about a "batch" job trying to do a tar or something. With a modern backup program like Bacula, it is often the user who is mounting/unmounting tapes via a "console", and he gets very impatient and confused if he does some operation and it takes 2 minutes for the program to tell him that there is no medium ... This forces me to use O_NONBLOCK and consequently have a much more complicated program.

Sorry to bother you with this, but if you ever happen to run into the SCSI designers, I hope someone can help them understand this problem.
======
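The two workarounds described above (reopening without O_NONBLOCK, or clearing the flag with fcntl()) can be sketched as follows. This is an illustration rather than Bacula's code: the function names are hypothetical, Python's os and fcntl modules stand in for the C calls, and the tape-specific probe (the rewind) is passed in as a callback instead of being spelled out. Whether a given tape driver honours a flag that is cleared after open() is exactly the uncertainty raised above, so the second function is a sketch and not a recommendation.

```python
import fcntl
import os


def reopen_blocking(path, probe=lambda fd: None):
    """Open with O_NONBLOCK, probe the device, close, reopen in blocking mode.

    `probe` is a caller-supplied check (for a tape, for example a rewind)
    that should raise OSError when no medium is present.
    """
    fd = os.open(path, os.O_RDONLY | os.O_NONBLOCK)
    try:
        probe(fd)
    finally:
        os.close(fd)                     # never reuse the non-blocking fd
    return os.open(path, os.O_RDONLY)


def clear_nonblock(fd):
    """Clear O_NONBLOCK on an open descriptor and return the new flags."""
    flags = fcntl.fcntl(fd, fcntl.F_GETFL)
    fcntl.fcntl(fd, fcntl.F_SETFL, flags & ~os.O_NONBLOCK)
    return fcntl.fcntl(fd, fcntl.F_GETFL)
```

The first function costs an extra open()/close() pair but leaves no doubt about the state of the descriptor that is finally used, which is the reason given above for preferring it.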
https://bugzilla.novell.com/show_bug.cgi?id=208782 ------- Comment #20 from hare@novell.com 2006-10-05 07:12 MST ------- Well, if you had looked at the patch you would have seen that it addresses specifically the 'ENOMEDIUM' issue ... IOW with the patch you don't have to wait anymore in open() if no tape is present. So at least that is solved.
https://bugzilla.novell.com/show_bug.cgi?id=208782 ------- Comment #21 from hare@novell.com 2006-10-05 07:18 MST ------- Ah, just discovered that I've not been able to write down the correct URL. Try this: http://beta.suse.com/private/hare/testing/bug208782
https://bugzilla.novell.com/show_bug.cgi?id=208782 ------- Comment #22 from kern@sibbald.com 2006-10-05 07:36 MST ------- Concerning the patch: Yes, I did look at it, but not being a kernel guy, it was not totally obvious especially since it involved testing on O_NONBLOCK. However, I am happy to hear that it resolves the ENOMEDIUM issue, and I hope it is a fix that applies to all distros. I was able to download the test kernel, thanks :-) I'm verifying that the old kernel still crashes, then will install the test one and run the identical binary.
https://bugzilla.novell.com/show_bug.cgi?id=208782 ------- Comment #23 from kern@sibbald.com 2006-10-05 09:49 MST ------- Well, things didn't go so well. Installing the kernel got a lot of errors (mostly requires problems), so I ran it with --nodeps then after installing noticed that /boot/grub was gone. After rebooting, things went from bad to worse -- the only filesystem is /, so it couldn't bring up much of anything. As far as I can tell, the other partitions no longer have valid filesystems. I suspect this was due to my SCSI testing rather than the new kernel. Anyway, I'll reload the system (groan) and try again over the weekend. If you just happen to have the kernel-src.rpm ("binary" rpm with all patches applied and loads in standard kernel build location), I'll try building it here and manually installing it. If not, no problem. PS: I see that part of the installation of the rpm involves using perl. This kind of requirement makes it more difficult to implement disaster recovery CDs to re-install a kernel. Whenever possible, it would be better for the base rpms to rely only on traditional Unix utilities (sed, awk, ...).
https://bugzilla.novell.com/show_bug.cgi?id=208782 ------- Comment #26 from hare@novell.com 2006-10-06 01:33 MST ------- Hmm. Actually, the kernel is a SLES10 kernel. Shouldn't be a problem, though, but you never know. I'll build a 10.1 kernel to boot.
https://bugzilla.novell.com/show_bug.cgi?id=208782 ------- Comment #27 from kern@sibbald.com 2006-10-06 01:59 MST ------- Yes, as I mentioned, I suspect that my testing of the failure prior to loading your rpm corrupted something. I was able to reproduce the EBUSY error, but the kernel didn't want to Oops, so I suspect things went downhill before installing your kernel. Fortunately I didn't run it on my development machine.
From your comments, I'm beginning to think that I did not explain the problem well enough. The kernel Oops that I am seeing *always* occurs when there is a tape in the drive. Bacula should never attempt to read() a drive that is empty -- it may attempt a rewind(), but *all* my tests and *all* my Oops were done with a tape in the drive.
So, just one question: Is the st-no-tape.diff patch supposed to fix the kernel Oops problem? If so, I can try applying it and building it manually. I've manually built/installed hundreds of RH/Fedora kernels, so it shouldn't be a problem in principle since I see that you have a kernel-source rpm, which I have just loaded.
https://bugzilla.novell.com/show_bug.cgi?id=208782 ------- Comment #28 from hare@novell.com 2006-10-06 02:14 MST ------- Well, it's quite tricky. My current thinking is that the patch mentioned in comment #14 somehow triggered a race condition in that it allowed some ioctl's to pass through without any synchronisation with the tape driver. And as these beasts are notoriously sensitive to command ordering (ie I'm not sure it's a good idea to pass down a 'Test Unit Ready' whilst a read() is in progress) I would think that one ioctl previous to the actual read() messed up the driver and things went downhill from there ... So, I would think that the new kernel-source should not exhibit the oops. If it still does, can you please back out the patch from comment #15 and re-run the tests? If it still oopses, well, back to the drawing board. Speaking of which: what do you set the block size at? Anything larger than 512kB?
https://bugzilla.novell.com/show_bug.cgi?id=208782 ------- Comment #29 from kern@sibbald.com 2006-10-06 03:02 MST ------- OK, I understand, and as you say, it is a bit tricky. I'll work on getting my machine operational and testing again. Concerning the block size: by default (the user can change it), and this is what I used for my tests, Bacula puts the drive in variable block size mode (one of the ioctl()s after the open); most read()/write() requests will then be 64512 bytes. The exceptions are perhaps the label record at the beginning of the tape, and the last record of some of the files, which may be shorter than 64512 bytes, but they all should be rounded up to a multiple of 512.
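[Editor's sketch] The rounding described in comment #29 is plain integer arithmetic; note that 64512 is itself 126 * 512. The function below is a minimal illustration of "round up to a multiple of 512". The name round_up_to_512() and the TAPE_ALIGN constant are hypothetical and are not taken from Bacula.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TAPE_ALIGN 512u /* the multiple mentioned in comment #29 */

/* Round a record length up to the next multiple of TAPE_ALIGN.
 * Lengths that are already aligned (e.g. 64512) are returned unchanged. */
size_t round_up_to_512(size_t len)
{
    assert(len <= SIZE_MAX - (TAPE_ALIGN - 1)); /* guard against wrap-around */
    return (len + TAPE_ALIGN - 1) / TAPE_ALIGN * TAPE_ALIGN;
}
```

For example, a 1000-byte label record would be padded to 1024 bytes under this scheme.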
https://bugzilla.novell.com/show_bug.cgi?id=208782 ------- Comment #30 from kern@sibbald.com 2006-10-06 05:05 MST ------- My disks were in a funny sense wiped out, but getting my machine operational was a bit easier than I thought thanks to your "repair" section of the installation disk. It claimed that fstab was broken, because of ext3. Apparently, the 2.6.16.21-0.13-default rpm that I had installed did not have ext3 either in the kernel or in the initrd, which means that none of my partitions except / and /boot were able to be mounted, and without /usr, not much works. Why that kernel doesn't have ext3 support I don't know. So, I converted all partitions to ext2 (for the moment) and the machine came back. Bottom line: Bad news, I re-ran my tests on your 2.6.16.21-0.25-default kernel, and it also fails the read() with EBUSY. So, unfortunately your patch did not fix my particular bug.
https://bugzilla.novell.com/show_bug.cgi?id=208782 ------- Comment #31 from kern@sibbald.com 2006-10-08 02:29 MST -------
Update: I wanted to load the 10.2 rc5 kernel to verify with the same binaries that it does not show the problem, but apparently running my disks with ext2 was a bad idea, because after running my program with your test kernel on the 6th (comment #30), the disks were seriously damaged; even manual fsck runs repairing *thousands* of errors could not get them back in shape. Result: a system that will not boot properly. Conclusion: this bug is more critical than I realized. I now must re-install, so I will re-install with 10.2 rc5 (or whatever is currently available), and test to ensure this problem is gone, as is the case on my development machine. I suspect that if you compare the changes from your test kernel to the 2.6.18-rc5-git6-2-bigsmp kernel, and focus on what returns EBUSY, you will find or get close to the problem.
=====
Possible new problem with kernel 2.6.18-rc5-git6-2-bigsmp: I am now occasionally seeing the following error output from mtx when operating my autochangers on two different systems running 2.6.18-rc5-git6-2-bigsmp. This problem has never occurred before.

Unloading Data Transfer Element into Storage Element 1...mtx: Request Sense: Long Report=yes
mtx: Request Sense: Valid Residual=no
mtx: Request Sense: Error Code=70 (Current)
mtx: Request Sense: Sense Key=Illegal Request
mtx: Request Sense: FileMark=no
mtx: Request Sense: EOM=no
mtx: Request Sense: ILI=no
mtx: Request Sense: Additional Sense Code = 53
mtx: Request Sense: Additional Sense Qualifier = 02
mtx: Request Sense: BPV=no
mtx: Request Sense: Error in CDB=no
mtx: Request Sense: SKSV=no
MOVE MEDIUM from Element Address 1 to 2 Failed

Not being a SCSI guy, I have little idea what the above means, but I can assure you I wasn't requesting a MOVE MEDIUM but rather an unload.
If this occurred on one system only, I could imagine it was an autochanger failure, but since it occurs on two systems and only *after* installing the 10.2 kernel, I am suspicious. I can correct this condition only by powering down the autochanger AND rebooting the machine. Do you have any suggestions what I should do? (i.e. a new bug report, some way to get more info).
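[Editor's sketch] For readers wondering what the mtx output in comment #31 means: in fixed-format sense data (response code 70h), the sense key is the low nibble of byte 2 and the additional sense code and qualifier are bytes 12 and 13. In the standard SCSI ASC/ASCQ assignments, 53h/02h reads "MEDIUM REMOVAL PREVENTED". The decoder below is a small illustration only; the function names and the four-entry table are hypothetical and cover just a few codes, not the full list.

```c
#include <stddef.h>

struct asc_entry {
    unsigned char asc;
    unsigned char ascq;
    const char *text;
};

/* A few well-known ASC/ASCQ pairs; the real table is much longer. */
static const struct asc_entry asc_table[] = {
    { 0x29, 0x00, "POWER ON, RESET, OR BUS DEVICE RESET OCCURRED" },
    { 0x3a, 0x00, "MEDIUM NOT PRESENT" },
    { 0x53, 0x00, "MEDIA LOAD OR EJECT FAILED" },
    { 0x53, 0x02, "MEDIUM REMOVAL PREVENTED" },
};

/* Sense key from fixed-format sense data, or -1 if the buffer is too short. */
int sense_key(const unsigned char *sense, size_t len)
{
    return (sense != NULL && len > 2) ? (sense[2] & 0x0f) : -1;
}

/* Text for the ASC/ASCQ pair in bytes 12 and 13, or "unknown". */
const char *sense_asc_text(const unsigned char *sense, size_t len)
{
    size_t i;

    if (sense == NULL || len < 14)
        return "unknown";
    for (i = 0; i < sizeof asc_table / sizeof asc_table[0]; i++)
        if (asc_table[i].asc == sense[12] && asc_table[i].ascq == sense[13])
            return asc_table[i].text;
    return "unknown";
}
```

Fed the values from the mtx report (sense key 5h, ASC 53h, ASCQ 02h), this sketch yields Illegal Request with "MEDIUM REMOVAL PREVENTED".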
https://bugzilla.novell.com/show_bug.cgi?id=208782

brunofr@ioda.net changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |brunofr@ioda.net

------- Comment #32 from brunofr@ioda.net 2006-10-09 01:46 MST -------
Just a word to say that since the latest kernel update, 2.6.16.21-0.25-default (previously 2.6.16.21-0.21-default), we have been encountering strange errors with the tape subsystem. This kernel was installed on the system on 1 October, and since that date we can no longer use our tape drive.

mt -f /dev/st0 eject weof rewind -> ok
tar -cvf /dev/st0 /usr -> fails immediately

sequoia:~ # tar -cvf /dev/st0 /usr
tar: Removing leading `/' from member names
/usr/
/usr/X11R6/
/usr/X11R6/lib/
/usr/X11R6/lib/X11/
/usr/X11R6/lib/X11/icons/
/usr/X11R6/lib/X11/icons/crystalblue/
/usr/X11R6/lib/X11/icons/crystalblue/cursors/
/usr/X11R6/lib/X11/icons/crystalblue/cursors/base_arrow_down
/usr/X11R6/lib/X11/icons/crystalblue/cursors/X_cursor
/usr/X11R6/lib/X11/icons/crystalblue/cursors/00008160000006810000408080010102
tar: /dev/st0: Cannot write: Input/output error
tar: Error is not recoverable: exiting now
-----------
As this system is a production server, we can't do as much testing as Kern has done, especially as we can't do the data backup :-)) which was scheduled after the kernel update and reboot. Please feel free to ask for any information you need.
------- loading kernel module ------
Oct 8 14:45:43 sequoia kernel: PCI: Enabling device 0000:01:03.0 (0156 -> 0157)
Oct 8 14:45:43 sequoia kernel: ACPI: PCI Interrupt 0000:01:03.0[A] -> GSI 28 (level, low) -> IRQ 177
Oct 8 14:45:48 sequoia kernel: scsi3 : Adaptec AIC7XXX EISA/VLB/PCI SCSI HBA DRIVER, Rev 7.0
Oct 8 14:45:48 sequoia kernel: <Adaptec aic7892 Ultra160 SCSI adapter>
Oct 8 14:45:48 sequoia kernel: aic7892: Ultra160 Wide Channel A, SCSI Id=7, 32/253 SCBs
Oct 8 14:45:48 sequoia kernel:
Oct 8 14:45:50 sequoia kernel: Vendor: ARCHIVE Model: Python 06240-XXX Rev: 8240
Oct 8 14:45:50 sequoia kernel: Type: Sequential-Access ANSI SCSI revision: 03
Oct 8 14:45:50 sequoia kernel: target3:0:5: Beginning Domain Validation
Oct 8 14:45:50 sequoia kernel: target3:0:5: FAST-40 SCSI 40.0 MB/s ST (25 ns, offset 32)
Oct 8 14:45:50 sequoia kernel: target3:0:5: Domain Validation skipping write tests
Oct 8 14:45:50 sequoia kernel: target3:0:5: Ending Domain Validation
Oct 8 14:45:50 sequoia kernel: st 3:0:5:0: Attached scsi tape st0<4>st0: try direct i/o: yes (alignment 512 B)
Oct 8 14:45:50 sequoia kernel: st 3:0:5:0: Attached scsi generic sg0 type 1
Oct 8 14:46:59 sequoia kernel: st0: Block limits 1 - 16777215 bytes.

------ error during read/write tape operation ----------
Oct 9 09:41:57 sequoia kernel: (scsi3:A:5:0): Unexpected busfree in Data-out phase
Oct 9 09:41:57 sequoia kernel: SEQADDR == 0x84
Oct 9 09:41:57 sequoia kernel: st0: Error 70000 (sugg. bt 0x0, driver bt 0x0, host bt 0x7).
Oct 9 09:41:57 sequoia kernel: st0: Error with sense data: <6>st: Current: sense key: Unit Attention
Oct 9 09:41:57 sequoia kernel: Additional sense: Power on, reset, or bus device reset occurred
Oct 9 09:41:57 sequoia kernel: st0: Error on write filemark.

----- mt status ----------
mt -f /dev/st0 status
drive type = Generic SCSI-2 tape
drive status = 637534720
sense key error = 0
residue count = 0
file number = 0
block number = 0
Tape block size 512 bytes. Density code 0x26 (unknown).
Soft error count since last status=0l
General status bits on (41010000):
 BOT ONLINE IM_REP_EN
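[Editor's sketch] The numbers in the mt status output above are consistent with each other, assuming "drive status" is the st driver's mt_dsreg word (block size in the low 24 bits, density code in the top 8) and the "general status bits" are the usual GMT_* flags. The constants below mirror the values in Linux's mtio.h as the editor recalls them; they are redefined here under ST_* names, which are hypothetical, so that the example stands alone.

```c
#include <stdint.h>

/* Assumed layout of mt_dsreg for the Linux st driver. */
#define ST_BLKSIZE_MASK   0x00ffffffu
#define ST_DENSITY_SHIFT  24

/* Assumed values of the generic status bits (GMT_* in mtio.h). */
#define ST_GMT_BOT        0x40000000u /* at beginning of tape */
#define ST_GMT_ONLINE     0x01000000u /* drive is online */
#define ST_GMT_IM_REP_EN  0x00010000u /* immediate report mode enabled */

/* Block size encoded in the low 24 bits of the drive status word. */
uint32_t dsreg_block_size(uint32_t dsreg)
{
    return dsreg & ST_BLKSIZE_MASK;
}

/* Density code encoded in the top 8 bits of the drive status word. */
uint32_t dsreg_density(uint32_t dsreg)
{
    return dsreg >> ST_DENSITY_SHIFT;
}

/* Non-zero if the given status bit is set. */
int gstat_has(uint32_t gstat, uint32_t bit)
{
    return (gstat & bit) != 0;
}
```

With drive status 637534720 this gives block size 512 and density code 0x26, and 41010000 (hex) decomposes into exactly the three flags mt printed: BOT, ONLINE and IM_REP_EN.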
https://bugzilla.novell.com/show_bug.cgi?id=208782 ------- Comment #33 from brunofr@ioda.net 2006-10-09 02:26 MST ------- Created an attachment (id=100938) --> (https://bugzilla.novell.com/attachment.cgi?id=100938&action=view) All information about the hardware on which st fails. The attachment contains information about the hardware that fails.
https://bugzilla.novell.com/show_bug.cgi?id=208782 ------- Comment #34 from brunofr@ioda.net 2006-10-09 02:28 MST ------- Created an attachment (id=100939) --> (https://bugzilla.novell.com/attachment.cgi?id=100939&action=view) Hardware info on which st works. I ran a test on my own desktop, and on this hardware all tape operations go well. So I am submitting the same information as before. Perhaps someone will have an idea ...
https://bugzilla.novell.com/show_bug.cgi?id=208782

hare@novell.com changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEEDINFO                    |ASSIGNED
      Info Provider|kern@sibbald.com            |

------- Comment #35 from hare@novell.com 2006-10-09 05:04 MST -------
Hmm. So you're telling me that the same testcase works on one machine, but not on the other? That would be _really_ curious.
https://bugzilla.novell.com/show_bug.cgi?id=208782

hare@novell.com changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|ASSIGNED                    |NEEDINFO
      Info Provider|                            |brunofr@ioda.net

------- Comment #36 from hare@novell.com 2006-10-09 05:06 MST -------
The only difference I can see is the tape speed. Can you limit the Transfer-rate to 20MB (or even 10 MB) and see whether the error still occurs?
https://bugzilla.novell.com/show_bug.cgi?id=208782 ------- Comment #37 from brunofr@ioda.net 2006-10-09 06:56 MST ------- Sorry, but I don't know how to limit the transfer speed of the bus without changing hardware settings, and obviously this couldn't be a solution. Furthermore, the server is some 80 miles away from me :-)). Anyway, with the failing hardware we had no problem before the latest kernel update from -0.21 (ran fine) to -0.25 (doesn't want to work). The other big difference between the two machines is that one is running -default and the other -smp. Should I try installing the smp kernel on this host and see if it works?
https://bugzilla.novell.com/show_bug.cgi?id=208782 ------- Comment #38 from kern@sibbald.com 2006-10-13 03:23 MST ------- Please ignore my "possible" problem report at the end of my comment #31. This was a false alert caused by the drive door being locked, which caused mtx to fail unloading a tape.
https://bugzilla.novell.com/show_bug.cgi?id=208782 ------- Comment #40 from brunofr@ioda.net 2006-11-04 04:36 MST ------- Ran the test with the smp kernel and had the same issue. Changed the DAT tape and everything now works normally. This is a strange case, but it closes the problem for me.
https://bugzilla.novell.com/show_bug.cgi?id=208782 ------- Comment #41 from kern@sibbald.com 2006-11-10 02:03 MST ------- Can you give me a clear summary of where we stand with fixing this bug including when it will be fixed?
https://bugzilla.novell.com/show_bug.cgi?id=208782

hare@novell.com changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEEDINFO                    |ASSIGNED
      Info Provider|brunofr@ioda.net            |

------- Comment #42 from hare@novell.com 2006-11-10 03:09 MST -------
I have found some errors in the tape driver (or rather, the commands that the tape driver uses). I'll post them here for reference.
https://bugzilla.novell.com/show_bug.cgi?id=208782 ------- Comment #43 from hare@novell.com 2006-11-10 03:25 MST ------- Created an attachment (id=104624) --> (https://bugzilla.novell.com/attachment.cgi?id=104624&action=view) Count number of pages for sg array properly. This patch is from mainline; occasionally the number of pages will be counted wrong when sending a request. Oops.
https://bugzilla.novell.com/show_bug.cgi?id=208782

hare@novell.com changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
 Attachment #104624|application/octet-stream    |text/plain
          mime type|                            |
 Attachment #104624|0                           |1
           is patch|                            |
https://bugzilla.novell.com/show_bug.cgi?id=208782 ------- Comment #44 from hare@novell.com 2006-11-10 03:40 MST ------- I'm building a kernel for you to test with.
https://bugzilla.novell.com/show_bug.cgi?id=208782 ------- Comment #45 from hare@novell.com 2006-11-17 00:30 MST ------- Sorry for the delay. Kernel rpms are at: http://beta.suse.com/private/hare/testing/bug208782 I already have one happy customer who claims the tape driver works for him with these patches. Please test. And again, thanks for your patience and support. It is very much appreciated.
https://bugzilla.novell.com/show_bug.cgi?id=208782 ------- Comment #46 from kern@sibbald.com 2006-11-17 07:26 MST ------- I've downloaded the kernel-default that you put up in comment #45, loaded it on my development machine, and ran all my Bacula tests with no problem. So, it looks like it resolves the problem, thanks. I say it "looks like" because from my experience with the bug, it seemed to depend on the exact size and position of Bacula in memory (probably what other programs were running as well), which means I cannot be 100% sure it is resolved. I have however posted a link to this bug and the fixed kernel to the Bacula users list, where there were several people experiencing crashes from this problem, and I will replace the SUSE 10.2 kernel on my "production backup" machine with your test kernel tonight. Bottom line: for me the problem appears to be fixed, and you can close the bug, but if you want to leave it open another week or two, if I get any additional feedback, I'll post it here.
https://bugzilla.novell.com/show_bug.cgi?id=208782 ------- Comment #47 from hare@novell.com 2006-11-20 01:50 MST ------- <Big sigh of relief> That's him nailed finally. What a nasty bugger. Thanks for your feedback & testing. I don't think I want to have this one lingering around any longer; and as I had already received one positive report, I think we can lay it to rest.
https://bugzilla.novell.com/show_bug.cgi?id=208782 ------- Comment #48 from kern@sibbald.com 2006-11-20 03:02 MST ------- OK, thanks for the fix. No problem closing this bug. My production setup ran quite fine over the weekend. I would appreciate it if you could get an official 10.1 kernel update released with this fix as soon as possible. This will reduce the bug reports that both you and I get.
https://bugzilla.novell.com/show_bug.cgi?id=208782

kgw@novell.com changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
  Status Whiteboard|kernel:sles10               |fixreleased:kernel:sles10

------- Comment #51 from kgw@novell.com 2007-01-05 04:11 MST -------
Patches referred to in changelog entry:

Wed Nov 22 12:25:50 CET 2006 - hare@suse.de
- patches.drivers/st-no-medium-handling: split off patch
- patches.drivers/st-return-ENOMEDIUM: add new patch to return ENOMEDIUM correctly for SCSI tapes (208782)

published in SLE10 kernelupdate 2.6.16.27-0.6, dated Dec 13, 2006 & released Dec 21, 2006.

Setting Whiteboard Status -> fixreleased