time & filesystem problem(s?)

23 May 2005

I have a problem I don't understand on a new Opteron box running 9.2. I 
have two supposedly identical boxes: one is running fine, whilst the 
clock on the other drifts into the future (2 days ahead as of this 
morning). This morning I also discovered that its root filesystem had 
become read-only over the weekend. I don't know whether these symptoms 
share a cause or are two separate problems.

I don't really know where to look. I have ntp running on both machines 
and that seems like a candidate for the source of the clock trouble at 
least. The daemon still seemed to be running:

   ntp       4440     1  0 May19 ?        00:00:00 /usr/sbin/ntpd -p /var/lib/ntp/var/run/ntp/ntpd.pid -u ntp -i /var/lib/ntp

I rebooted and /var/log/messages now contains:

May 25 04:56:23 suse2 kernel: Losing some ticks... checking if CPU frequency changed.
May 23 10:59:01 suse2 ntpdate[4434]: step time server 10.1.0.0 offset -Ã¾|1000Â°^E:Â°9414735363.521992 sec
May 23 10:59:01 suse2 ntpd[4439]: ntpd 4.2.0a@1.1190-r Wed Jan 26 17:35:54 UTC 2005 (1)
May 23 10:59:01 suse2 ntpd[4439]: signal_no_reset: signal 13 had flags 4000000
May 23 10:59:01 suse2 ntpd[4439]: precision = 1.000 usec
May 23 10:59:01 suse2 ntpd[4439]: Listening on interface wildcard, 0.0.0.0#123
May 23 10:59:01 suse2 ntpd[4439]: Listening on interface wildcard, ::#123
May 23 10:59:01 suse2 ntpd[4439]: Listening on interface lo, 127.0.0.1#123
May 23 10:59:01 suse2 ntpd[4439]: Listening on interface eth0, 192.168.1.2#123
May 23 10:59:01 suse2 ntpd[4439]: Listening on interface eth1, 10.11.0.1#123
May 23 10:59:01 suse2 ntpd[4439]: Listening on interface eth2, 192.168.4.2#123
May 23 10:59:01 suse2 ntpd[4439]: kernel time sync status 0040
May 23 10:59:01 suse2 ntpd[4439]: frequency initialized 0.000 PPM from /var/lib/ntp/drift/ntp.drift
May 23 10:59:01 suse2 kernel: warning: many lost ticks.
May 23 10:59:01 suse2 kernel: Your time source seems to be instable or some driver is hogging interupts

/var/log/ntp has this (and similar for each previous boot):

23 May 11:02:39 ntpd[4439]: synchronized to LOCAL(0), stratum 10
23 May 11:02:39 ntpd[4439]: kernel time sync disabled 0041
23 May 11:03:51 ntpd[4439]: kernel time sync enabled 0001

and ntpq -p says this:

suse2:/var/log # ntpq -p
     remote           refid      st t when poll reach   delay   offset  jitter
==============================================================================
*LOCAL(0)        LOCAL(0)        10 l   67   64  377    0.000    0.000   0.001
 server1.lmb.int 131.111.12.21    3 u   11 1024  377    0.001  -272394 163702.
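
As a rough aid to reading the table above, here is a small sketch that 
decodes one peer line. It assumes the usual ntpq conventions: the first 
column is a tally code ("*" marks the peer ntpd is currently following), 
"reach" is an octal bitmask of the last eight polls, and delay, offset 
and jitter are in milliseconds. The function name and dictionary keys 
are made up for illustration.

```python
# Sketch: decode one peer line from `ntpq -p`.
# parse_peer_line and its dict keys are hypothetical names, not part of ntp.

def parse_peer_line(line):
    """Split an unstripped `ntpq -p` peer line into named fields.

    The line must keep its first column, which holds the tally code
    (a space when the peer has no tally).
    """
    tally, rest = line[0], line[1:].split()
    remote, refid, st, t, when, poll, reach, delay, offset, jitter = rest
    reach_bits = int(reach, 8)  # reach is printed in octal
    return {
        "selected": tally == "*",            # peer ntpd is synced to
        "remote": remote,
        "stratum": int(st),
        "polls_answered": bin(reach_bits).count("1"),  # out of the last 8
        "offset_s": float(offset) / 1000.0,  # assumed ms -> seconds
        "jitter_s": float(jitter) / 1000.0,  # assumed ms -> seconds
    }

lines = (
    "*LOCAL(0)        LOCAL(0)        10 l   67   64  377    0.000    0.000   0.001",
    " server1.lmb.int 131.111.12.21    3 u   11 1024  377    0.001  -272394 163702.",
)
for line in lines:
    print(parse_peer_line(line))
```

Read this way, server1.lmb.int answered all of the last eight polls but 
is not the selected peer, and the offset against it is roughly 272 
seconds. The selected peer is LOCAL(0), which matches the "synchronized 
to LOCAL(0)" line in /var/log/ntp.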

About 40 minutes after the reboot, the clock on this machine is already 
about ten minutes fast. I don't have a clue what all the output above 
actually means. I'm reading through all the NTP docs, but that's going 
to take me a while :)  Can anybody give me any clues where to start 
looking?
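
To put the drift in perspective, here is a back-of-the-envelope sketch. 
It assumes the clock was correct at boot (ntpdate stepped it at 
startup) and uses the figures above: ten minutes gained in 40 minutes. 
The 500 PPM constant is ntpd's maximum frequency correction as I 
understand it from the NTP documentation; the function name is made up.

```python
# Sketch: express the observed clock drift in parts per million (PPM).
# Assumes the clock was right at boot; drift_ppm is a hypothetical helper.

NTPD_MAX_SLEW_PPM = 500.0  # assumed limit of ntpd's frequency correction

def drift_ppm(gained_seconds, elapsed_seconds):
    """Seconds gained per second of real time, scaled to PPM."""
    return gained_seconds / elapsed_seconds * 1_000_000

rate = drift_ppm(10 * 60, 40 * 60)  # ten minutes fast after 40 minutes
print(f"{rate:.0f} PPM, {rate / NTPD_MAX_SLEW_PPM:.0f}x the ntpd limit")
```

If those assumptions hold, the clock is running about 25% fast, which 
is far more than ntpd could correct by adjusting the frequency. That 
points at the kernel's timekeeping (see the "lost ticks" messages) 
rather than at the ntp configuration.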

A disk error looks like another possible cause of the read-only 
filesystem. Here's the last part of /var/log/messages before the reboot:

May 25 02:26:24 suse2 -- MARK --
May 25 02:57:04 suse2 -- MARK --
May 25 02:59:34 suse2 /usr/sbin/cron[22122]: (root) CMD ( rm -f /var/spool/cron/lastrun/cron.hourly)
May 25 03:28:42 suse2 -- MARK --
May 25 03:59:36 suse2 /usr/sbin/cron[22245]: (root) CMD ( rm -f /var/spool/cron/lastrun/cron.hourly)
May 25 04:14:36 suse2 /usr/sbin/cron[22283]: (root) CMD ( rm -f /var/spool/cron/lastrun/cron.daily)
May 25 04:17:11 suse2 su: (to nobody) root on none
May 25 04:17:11 suse2 su: pam_unix2: session started for user nobody, service su
May 25 04:17:46 suse2 su: pam_unix2: session finished for user nobody, service su
May 25 04:17:46 suse2 su: (to nobody) root on none
May 25 04:17:46 suse2 su: pam_unix2: session started for user nobody, service su
May 25 04:17:12 suse2 su: pam_unix2: session finished for user nobody, service su
May 25 04:17:47 suse2 su: (to nobody) root on none
May 25 04:17:47 suse2 su: pam_unix2: session started for user nobody, service su
May 25 04:17:47 suse2 su: pam_unix2: session finished for user nobody, service su
May 25 04:20:25 suse2 kernel: scsi0:0:0:0: Attempting to abort cmd 0000010004851040: 0x2a 0x0 0x0 0x41 0x3a 0x16 0x0 0x0 0x8 0x0
May 25 04:20:25 suse2 kernel: scsi0: At time of recovery, card was not paused
May 25 04:20:25 suse2 kernel: >>>>>>>>>>>>>>>>>> Dump Card State Begins <<<<<<<<<<<<<<<<<
May 25 04:20:25 suse2 kernel: scsi0: Dumping Card State at program address 0x24 Mode 0x0
May 25 04:20:25 suse2 kernel: Card was paused
May 25 04:20:25 suse2 kernel: HS_MAILBOX[0x0] INTCTL[0xc0]:(SWTMINTEN|SWTMINTMASK)
May 25 04:20:25 suse2 kernel: SEQINTSTAT[0x0] SAVED_MODE[0x11] DFFSTAT[0x33]:(CURRFIFO_NONE|FIFO0FREE|FIFO1FREE)
May 25 04:20:25 suse2 kernel: SCSISIGI[0x0]:(P_DATAOUT) SCSIPHASE[0x0] SCSIBUS[0x0]
May 25 04:20:25 suse2 kernel: LASTPHASE[0x1]:(P_DATAOUT|P_BUSFREE) SCSISEQ0[0x0]
May 25 04:20:25 suse2 kernel: SCSISEQ1[0x12]:(ENAUTOATNP|ENRSELI) SEQCTL0[0x0] SEQINTCTL[0x0]
May 25 04:20:25 suse2 kernel: SEQ_FLAGS[0x0] SEQ_FLAGS2[0x0] SSTAT0[0x0] SSTAT1[0x8]:(BUSFREE)
May 25 04:20:25 suse2 kernel: SSTAT2[0x0] SSTAT3[0x0] PERRDIAG[0x0] SIMODE1[0xa4]:(ENSCSIPERR|ENSCSIRST|ENSELTIMO)
May 25 04:21:09 suse2 syslogd: /var/log/warn: Input/output error
May 25 04:20:25 suse2 kernel: LQISTAT0[0x0] LQISTAT1[0x0] LQISTAT2[0x0] LQOSTAT0[0x0]
May 25 04:21:09 suse2 kernel: LQOSTAT1[0x8]:(LQOSTOPI2) LQOSTAT2[0xe1]:(LQOSTOP0|LQOPKT)
May 25 04:21:09 suse2 kernel:
May 25 04:21:09 suse2 kernel: SCB Count = 32 CMDS_PENDING = 1 LASTSCB 0x10 CURRSCB 0x2 NEXTSCB 0xff80
May 25 04:21:09 suse2 kernel: qinstart = 36187 qinfifonext = 36187
May 25 04:21:09 suse2 kernel: QINFIFO:
May 25 04:21:09 suse2 kernel: WAITING_TID_QUEUES:
May 25 04:21:09 suse2 kernel: Pending list:
May 25 04:21:09 suse2 kernel:   2 FIFO_USE[0x0] SCB_CONTROL[0x60]:(TAG_ENB|DISCENB) SCB_SCSIID[0x7]
May 25 04:21:09 suse2 kernel: Total 1
May 25 04:21:09 suse2 kernel: Kernel Free SCB list: 16 1 21 25 0 10 6 27 15 17 14 5 9 13 3 26 18 28 4 22 11 31 12 19 30 23 24 20 29 8 7
May 25 04:21:09 suse2 May 25 04:21:09

Dave Howorth