time & filesystem problem(s?)
I have a problem I don't understand on a new Opteron box running 9.2. I have two supposedly identical boxes and one is running fine whilst the clock on the other drifts into the future (2 days as of this morning). This morning I've also discovered that the root filesystem has become read-only over the weekend. I don't know whether these symptoms are caused by the same problem or if they are two separate problems? I don't really know where to look.

I have ntp running on both machines and that seems like a candidate for the source of the clock trouble at least. The daemon still seemed to be running:

ntp 4440 1 0 May19 ? 00:00:00 /usr/sbin/ntpd -p /var/lib/ntp/var/run/ntp/ntpd.pid -u ntp -i /var/lib/ntp

I rebooted and /var/log/messages now contains:

May 25 04:56:23 suse2 kernel: Losing some ticks... checking if CPU frequency changed.
May 23 10:59:01 suse2 ntpdate[4434]: step time server 10.1.0.0 offset -þ|1000°^E:°9414735363.521992 sec
May 23 10:59:01 suse2 ntpd[4439]: ntpd 4.2.0a@1.1190-r Wed Jan 26 17:35:54 UTC 2005 (1)
May 23 10:59:01 suse2 ntpd[4439]: signal_no_reset: signal 13 had flags 4000000
May 23 10:59:01 suse2 ntpd[4439]: precision = 1.000 usec
May 23 10:59:01 suse2 ntpd[4439]: Listening on interface wildcard, 0.0.0.0#123
May 23 10:59:01 suse2 ntpd[4439]: Listening on interface wildcard, ::#123
May 23 10:59:01 suse2 ntpd[4439]: Listening on interface lo, 127.0.0.1#123
May 23 10:59:01 suse2 ntpd[4439]: Listening on interface eth0, 192.168.1.2#123
May 23 10:59:01 suse2 ntpd[4439]: Listening on interface eth1, 10.11.0.1#123
May 23 10:59:01 suse2 ntpd[4439]: Listening on interface eth2, 192.168.4.2#123
May 23 10:59:01 suse2 ntpd[4439]: kernel time sync status 0040
May 23 10:59:01 suse2 ntpd[4439]: frequency initialized 0.000 PPM from /var/lib/ntp/drift/ntp.drift
May 23 10:59:01 suse2 kernel: warning: many lost ticks.
May 23 10:59:01 suse2 kernel: Your time source seems to be instable or some driver is hogging interupts

/var/log/ntp has this (and similar for each previous boot):

23 May 11:02:39 ntpd[4439]: synchronized to LOCAL(0), stratum 10
23 May 11:02:39 ntpd[4439]: kernel time sync disabled 0041
23 May 11:03:51 ntpd[4439]: kernel time sync enabled 0001

and ntpq -p says this:

suse2:/var/log # ntpq -p
     remote           refid      st t when poll reach   delay   offset  jitter
==============================================================================
*LOCAL(0)        LOCAL(0)        10 l   67   64  377    0.000    0.000   0.001
 server1.lmb.int 131.111.12.21    3 u   11 1024  377    0.001  -272394 163702.

The clock on this machine is about ten minutes fast around 40 minutes after I rebooted.

I don't have a clue what all the output above actually means. I'm reading through all the NTP docs, but that's going to take me a while :) Can anybody give me any clues where to start looking?

A disk error appears to be another possibility for the ro file system. Here's the last part of /var/log/messages before the reboot:

May 25 02:26:24 suse2 -- MARK --
May 25 02:57:04 suse2 -- MARK --
May 25 02:59:34 suse2 /usr/sbin/cron[22122]: (root) CMD ( rm -f /var/spool/cron/lastrun/cron.hourly)
May 25 03:28:42 suse2 -- MARK --
May 25 03:59:36 suse2 /usr/sbin/cron[22245]: (root) CMD ( rm -f /var/spool/cron/lastrun/cron.hourly)
May 25 04:14:36 suse2 /usr/sbin/cron[22283]: (root) CMD ( rm -f /var/spool/cron/lastrun/cron.daily)
May 25 04:17:11 suse2 su: (to nobody) root on none
May 25 04:17:11 suse2 su: pam_unix2: session started for user nobody, service su
May 25 04:17:46 suse2 su: pam_unix2: session finished for user nobody, service su
May 25 04:17:46 suse2 su: (to nobody) root on none
May 25 04:17:46 suse2 su: pam_unix2: session started for user nobody, service su
May 25 04:17:12 suse2 su: pam_unix2: session finished for user nobody, service su
May 25 04:17:47 suse2 su: (to nobody) root on none
May 25 04:17:47 suse2 su: pam_unix2: session started for user nobody, service su
May 25 04:17:47 suse2 su: pam_unix2: session finished for user nobody, service su
May 25 04:20:25 suse2 kernel: scsi0:0:0:0: Attempting to abort cmd 0000010004851040: 0x2a 0x0 0x0 0x41 0x3a 0x16 0x0 0x0 0x8 0x0
May 25 04:20:25 suse2 kernel: scsi0: At time of recovery, card was not paused
May 25 04:20:25 suse2 kernel: >>>>>>>>>>>>>>>>>> Dump Card State Begins <<<<<<<<<<<<<<<<<
May 25 04:20:25 suse2 kernel: scsi0: Dumping Card State at program address 0x24 Mode 0x0
May 25 04:20:25 suse2 kernel: Card was paused
May 25 04:20:25 suse2 kernel: HS_MAILBOX[0x0] INTCTL[0xc0]:(SWTMINTEN|SWTMINTMASK)
May 25 04:20:25 suse2 kernel: SEQINTSTAT[0x0] SAVED_MODE[0x11] DFFSTAT[0x33]:(CURRFIFO_NONE|FIFO0FREE|FIFO1FREE)
May 25 04:20:25 suse2 kernel: SCSISIGI[0x0]:(P_DATAOUT) SCSIPHASE[0x0] SCSIBUS[0x0]
May 25 04:20:25 suse2 kernel: LASTPHASE[0x1]:(P_DATAOUT|P_BUSFREE) SCSISEQ0[0x0]
May 25 04:20:25 suse2 kernel: SCSISEQ1[0x12]:(ENAUTOATNP|ENRSELI) SEQCTL0[0x0] SEQINTCTL[0x0]
May 25 04:20:25 suse2 kernel: SEQ_FLAGS[0x0] SEQ_FLAGS2[0x0] SSTAT0[0x0] SSTAT1[0x8]:(BUSFREE)
May 25 04:20:25 suse2 kernel: SSTAT2[0x0] SSTAT3[0x0] PERRDIAG[0x0] SIMODE1[0xa4]:(ENSCSIPERR|ENSCSIRST|ENSELTIMO)
May 25 04:21:09 suse2 syslogd: /var/log/warn: Input/output error
May 25 04:20:25 suse2 kernel: LQISTAT0[0x0] LQISTAT1[0x0] LQISTAT2[0x0] LQOSTAT0[0x0]
May 25 04:21:09 suse2 kernel: LQOSTAT1[0x8]:(LQOSTOPI2) LQOSTAT2[0xe1]:(LQOSTOP0|LQOPKT)
May 25 04:21:09 suse2 kernel:
May 25 04:21:09 suse2 kernel: SCB Count = 32 CMDS_PENDING = 1 LASTSCB 0x10 CURRSCB 0x2 NEXTSCB 0xff80
May 25 04:21:09 suse2 kernel: qinstart = 36187 qinfifonext = 36187
May 25 04:21:09 suse2 kernel: QINFIFO:
May 25 04:21:09 suse2 kernel: WAITING_TID_QUEUES:
May 25 04:21:09 suse2 kernel: Pending list:
May 25 04:21:09 suse2 kernel: 2 FIFO_USE[0x0] SCB_CONTROL[0x60]:(TAG_ENB|DISCENB) SCB_SCSIID[0x7]
May 25 04:21:09 suse2 kernel: Total 1
May 25 04:21:09 suse2 kernel: Kernel Free SCB list: 16 1 21 25 0 10 6 27 15 17 14 5 9 13 3 26 18 28 4 22 11 31 12 19 30 23 24 20 29 8 7
May 25 04:21:09 suse2
May 25 04:21:09
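
To put rough numbers on the clock symptoms described above, here is a small back-of-envelope sketch. It assumes the "ten minutes fast after about 40 minutes" observation can be taken literally (both figures are approximate), that the ntpq offset column is in milliseconds, and that ntpd can only correct frequency errors of up to roughly 500 PPM. The function and constant names are made up for illustration; this is only arithmetic on the figures quoted, not a diagnosis.

```python
# Rough drift arithmetic on the figures quoted in this post.
# Assumptions (not measurements): ~10 minutes gained in ~40 minutes of uptime.

# Assumed upper bound on the frequency error ntpd can discipline, in PPM.
NTPD_MAX_FREQ_CORRECTION_PPM = 500.0


def drift_ppm(gained_seconds: float, elapsed_seconds: float) -> float:
    """Clock gain expressed in parts per million of elapsed time."""
    return gained_seconds / elapsed_seconds * 1_000_000


def offset_ms_to_seconds(offset_ms: float) -> float:
    """Convert an ntpq -p offset (milliseconds) to seconds."""
    return offset_ms / 1000.0


if __name__ == "__main__":
    rate = drift_ppm(10 * 60, 40 * 60)
    print(f"approximate drift: {rate:.0f} PPM")
    print(f"times the assumed ntpd limit: {rate / NTPD_MAX_FREQ_CORRECTION_PPM:.0f}x")
    # Offset reported against server1.lmb.int in the ntpq -p output above.
    print(f"offset to server1.lmb.int: {offset_ms_to_seconds(-272394):.1f} s")
```

If those assumptions hold, the drift is several hundred times larger than anything ntpd could slew away, which would fit with it settling on LOCAL(0) rather than server1.lmb.int and with the kernel's "many lost ticks" warnings.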
Dave Howorth