[Bug 628298] New: api.o.o flakes out when remote bs_srcserver requests
http://bugzilla.novell.com/show_bug.cgi?id=628298
http://bugzilla.novell.com/show_bug.cgi?id=628298#c0

Summary: api.o.o flakes out when remote bs_srcserver requests
Classification: openSUSE
Product: openSUSE.org
Version: unspecified
Platform: All
OS/Version: Linux
Status: NEW
Severity: Major
Priority: P5 - None
Component: BuildService
AssignedTo: bnc-team-screening@forge.provo.novell.com
ReportedBy: jengelh@medozas.de
QAContact: adrian@novell.com
Found By: Beta-Customer
Blocker: ---

When an OBS local instance is booted, the local scheduler queries the local srcserver for projpacks.

# rcobssrcserver status
Checking for obssrcserver and running processes:
R 4287 29613 GET /getprojpack?withsrcmd5&withdeps&withrepos&withconfig&arch=x86_64
~same line~ ...=i586
~same line~ ...=sparcv9
~same line~ ...=sparc64

All but one of these getprojpack calls hang, that is,

# strace -p 29613
read(4, ^C

The srcserver logfile usually shows little more than this:

2010-08-04 12:52:08: bs_srcserver started on port 5352
2010-08-04 12:52:12 [29613]: GET /getprojpack?withsrcmd5&withdeps&withrepos&withconfig&arch=x86_64
2010-08-04 12:52:12 [29615]: GET /getprojpack?withsrcmd5&withdeps&withrepos&withconfig&arch=sparcv9
2010-08-04 12:52:12 [29616]: GET /getprojpack?withsrcmd5&withdeps&withrepos&withconfig&arch=sparc64
2010-08-04 12:52:12 [29619]: GET /getprojpack?withsrcmd5&withdeps&withrepos&withconfig&arch=i586
reading config for Black_Ares:11.2-update/standard x86_64
reading config for Black_Ares:11.2-update/standard i586
reading config for Black_Ares:11.2-update/standard sparcv9
reading config for Black_Ares:11.3/standard sparc64

There are no further traces.

-- 
Configure bugmail: http://bugzilla.novell.com/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are on the CC list for the bug.
http://bugzilla.novell.com/show_bug.cgi?id=628298#c1

--- Comment #1 from Jan Engelhardt <jengelh@medozas.de> 2010-08-04 12:21:33 UTC ---
Created an attachment (id=380539)
--> (http://bugzilla.novell.com/attachment.cgi?id=380539)
strace -p affected process

Here is a log where this hang occurs and where a.o.o was able to continue. That is usually not the case: normally the log just ends after read(4,.

1280923968.837960 write(4, "\200y\1\3\1\0`\0\0\0\20\0\0009\0\0008\0\0005\0\0\210\0\0\207\0\0\204\0\0\26"..., 123)
1280923968.838691 read(4, "\26\3\0\0:\2\0", 7) = 7
1280924115.310855 time(NULL) = 1280924115
1280924115.311186 time(NULL) = 1280924115

This means that almost 45 seconds passed between writing the SSL query and getting a response. That alone is _highly_ suspect, given that most other queries are instantaneous.
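With strace -ttt, each line starts with the epoch time at which that syscall began. A long pause can therefore be found mechanically by comparing consecutive timestamps. The following is a minimal Python sketch. The function name find_stalls and the 5-second threshold are illustrative choices, not part of any existing tool.

Applied to the four lines above, it reports a single gap after the read(4, ...) call. Because strace stamps each syscall when it starts, the gap includes the time the read itself spent blocked. Going by the raw timestamps, that is about 146 s (1280924115.310855 - 1280923968.838691).

```python
def find_stalls(lines, threshold=5.0):
    """Return (gap_seconds, earlier_call, later_call) for each pair of
    consecutive strace -ttt lines whose start times differ by more than
    `threshold` seconds. Lines without a leading timestamp are skipped."""
    stalls = []
    prev_ts = prev_call = None
    for line in lines:
        stamp, _, call = line.strip().partition(" ")
        try:
            ts = float(stamp)
        except ValueError:
            continue  # not an strace -ttt line
        if prev_ts is not None and ts - prev_ts > threshold:
            stalls.append((ts - prev_ts, prev_call, call))
        prev_ts, prev_call = ts, call
    return stalls


# The excerpt from the attached log (write payload abbreviated).
log = [
    '1280923968.837960 write(4, "\\200y\\1\\3"..., 123)',
    '1280923968.838691 read(4, "\\26\\3\\0\\0:\\2\\0", 7) = 7',
    '1280924115.310855 time(NULL) = 1280924115',
    '1280924115.311186 time(NULL) = 1280924115',
]
for gap, before, after in find_stalls(log):
    print(f"{gap:.1f} s elapsed after: {before}")
```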
http://bugzilla.novell.com/show_bug.cgi?id=628298#c2

Adrian Schröter <adrian@novell.com> changed:

What       |Removed                                   |Added
----------------------------------------------------------------------------
AssignedTo |bnc-team-screening@forge.provo.novell.com |mls@novell.com

--- Comment #2 from Adrian Schröter <adrian@novell.com> 2010-08-04 14:31:58 UTC ---
45 seconds may be okay; it depends on the request. (Is this the one for a project link to openSUSE:Factory?)
http://bugzilla.novell.com/show_bug.cgi?id=628298#c3

--- Comment #3 from Jan Engelhardt <jengelh@medozas.de> 2010-08-04 14:37:15 UTC ---
Project links: openSUSE.org -> api.opensuse.org/
Package links: Black_Ares:11.3/aaa_base -> openSUSE.org:openSUSE:11.3, +8000 more

The problem is that the BSRPCs are not time-limited AFAICS. If the kernel does not give my local OBS any data back (for whatever reason, be it a network breakdown or a screwup of the Novell proxies), it hangs basically forever. With a timeout, it could at least retry by itself, so that I would not have to manually kill the hanging srcserver subprocesses to get things going again.
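The timeout-and-retry behaviour proposed above can be sketched in Python as follows. This is for illustration only. The OBS client code (BSRPC) is Perl and is not reproduced here. The names recv_with_timeout and call_with_retry, the request callable, and the backoff values are all hypothetical.

```python
import socket
import time


def recv_with_timeout(sock: socket.socket, nbytes: int, timeout: float) -> bytes:
    """Read up to nbytes, giving up after `timeout` seconds without data.

    A plain blocking recv() on a dead connection waits indefinitely, which
    is what the hanging read(4, ... in the strace output looks like.
    """
    sock.settimeout(timeout)
    return sock.recv(nbytes)  # raises socket.timeout when the timer expires


def call_with_retry(request, attempts: int = 3, backoff: float = 1.0):
    """Call `request()` up to `attempts` times, retrying on timeouts.

    `request` is a hypothetical zero-argument callable that opens a fresh
    connection, sends the query, and returns the reply.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(1, attempts + 1):
        try:
            return request()
        except (socket.timeout, ConnectionError):
            if attempt == attempts:
                raise  # out of retries: let the caller see the failure
            time.sleep(backoff * attempt)  # wait a little longer after each failure
```

A per-read timeout only bounds each individual wait. A server that trickles data slowly could still stretch one request out for a long time, so a real client might also want an overall deadline per request.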
https://bugzilla.novell.com/show_bug.cgi?id=628298#c4

--- Comment #4 from Jan Engelhardt <jengelh@medozas.de> 2010-09-19 12:01:31 UTC ---
*** Bug 637481 has been marked as a duplicate of this bug. ***
http://bugzilla.novell.com/show_bug.cgi?id=637481
https://bugzilla.novell.com/show_bug.cgi?id=628298#c5

--- Comment #5 from Michael Schröder <mls@novell.com> 2010-09-20 08:09:45 UTC ---
(Actually the code sets SO_KEEPALIVE, so it should detect a broken connection after a while.)
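How long "after a while" is can be worked out from the three Linux keepalive sysctls:

- The kernel waits tcp_keepalive_time seconds of silence before probing.
- It then sends up to tcp_keepalive_probes probes, tcp_keepalive_intvl seconds apart.
- It drops the connection one interval after the last unanswered probe.

The following is a small Python sketch of that arithmetic. The report only quotes tcp_keepalive_intvl = 75, so the other two values are assumed to be the stock Linux defaults (7200 s and 9 probes).

```python
def keepalive_give_up_after(idle: int, interval: int, probes: int) -> int:
    """Seconds of peer silence before Linux abandons a keepalive connection.

    idle:     tcp_keepalive_time   - silence before the first probe
    interval: tcp_keepalive_intvl  - gap between unanswered probes
    probes:   tcp_keepalive_probes - unanswered probes tolerated
    The connection is dropped one interval after the last unanswered probe.
    """
    return idle + probes * interval


# Assumed Linux defaults; only the 75 s interval is quoted in the report.
total = keepalive_give_up_after(idle=7200, interval=75, probes=9)
hours, rest = divmod(total, 3600)
minutes, seconds = divmod(rest, 60)
print(f"{total} s = {hours} h {minutes} min {seconds} s")  # 7875 s = 2 h 11 min 15 s
```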
https://bugzilla.novell.com/show_bug.cgi?id=628298#c6

--- Comment #6 from Jan Engelhardt <jengelh@medozas.de> 2010-09-20 12:02:24 UTC ---
According to /proc/sys/net, tcp_keepalive_intvl is 75. However, running tcpdump on my end shows that no ACKs are sent for more than 5 minutes. /proc/net/tcp contains:

sl local_address rem_address st tx_queue rx_queue tr tm->when retrnsmt uid timeout inode
27: 864C5305:B173 C387DD21:01BB 01 00000000:00000000 02:0000B112 00000000 403 0 34891101 2 fffff801ce8dd500 22 4 1 3 2
29: 864C5305:CCDC C387DD21:01BB 01 00000000:00000000 02:000037A8 00000000 403 0 34889329 2 fffff801ed464700 23 4 8 4 12
32: 864C5305:CC75 C387DD21:01BB 01 00000000:00000000 02:00002E40 00000000 403 0 34889234 2 fffff801ed461aa0 22 4 7 3 2
33: 864C5305:D794 C387DD21:01BB 01 00000000:00000000 02:0001E4F5 00000000 403 0 34892072 2 fffff801ed6e61a0 22 4 6 2 2

Since the timer is closing in on the magic boundary of 2 hours, I'll wait a few seconds:

# rcobssrcserver status
R 7164 23866 GET /getprojpack?withsrcmd5&withdeps&withrepos&withconfig&arch=x86_64&project=Black_Ares:11.3&package=844-ksc-pcf&package=Botan&package=CASA&package=CASA-kwallet&package=CASA_auth_token_client&package=CASA_auth_token_server&package=CASA_auth_t
running

13:59:48.330402 IP 134.76.83.5.52341 > 195.135.221.33.443: Flags [.], ack 3525918430, win 92, length 0

# rcobssrcserver status
R 7247 23866 GET /getprojpack?withsrcmd5&withdeps&withrepos&withconfig&arch=x86_64&project=Black_Ares:11.3&package=844-ksc-pcf&package=Botan&package=CASA&package=CASA-kwallet&package=CASA_auth_token_client&package=CASA_auth_token_server&package=CASA_auth_t

So that makes three things. Keepalives are being sent, but very few of them. api.o.o does not respond to them. And my connection remains open. I don't even know if the TCB still exists on api.o.o.
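The hex columns in the /proc/net/tcp excerpt above can be decoded with a short parser. This is a minimal Python sketch, resting on three assumptions:

- The standard IPv4 column layout is used.
- Addresses are printed as the host's integer representation. The rows decode correctly as big-endian, which fits a sparc64 host; on little-endian x86 the bytes appear reversed.
- tm->when counts USER_HZ = 100 ticks per second.

Under these assumptions, row 32 decodes to 134.76.83.5:52341 -> 195.135.221.33:443. That is the same connection as in the tcpdump line, and its timer type is 2 (keepalive) with about 118 s left before it fires. The helper names are hypothetical.

```python
import socket
from dataclasses import dataclass
from typing import Tuple

# Assumption: tm->when is printed in USER_HZ clock ticks (100 per second on Linux).
USER_HZ = 100

# Meaning of the "tr" (timer) column for IPv4 TCP sockets.
TIMERS = {0: "none", 1: "retransmit", 2: "keepalive", 3: "time_wait", 4: "zero_window_probe"}


@dataclass
class TcpSocket:
    local: Tuple[str, int]
    remote: Tuple[str, int]
    state: int            # 0x01 = ESTABLISHED
    timer: str
    timer_seconds: float  # time left until that timer fires


def decode_endpoint(field: str, byteorder: str) -> Tuple[str, int]:
    """Turn 'AABBCCDD:PPPP' into ('a.b.c.d', port)."""
    hex_ip, hex_port = field.split(":")
    raw = bytes.fromhex(hex_ip)
    if byteorder == "little":
        raw = raw[::-1]  # little-endian hosts print the address bytes reversed
    return socket.inet_ntoa(raw), int(hex_port, 16)


def parse_row(line: str, byteorder: str = "big") -> TcpSocket:
    """Parse one data row of /proc/net/tcp (not the header line)."""
    fields = line.split()
    # [0]=sl, [1]=local, [2]=remote, [3]=st, [4]=tx:rx queue, [5]=tr:tm->when
    timer_code, when = fields[5].split(":")
    return TcpSocket(
        local=decode_endpoint(fields[1], byteorder),
        remote=decode_endpoint(fields[2], byteorder),
        state=int(fields[3], 16),
        timer=TIMERS.get(int(timer_code, 16), "unknown"),
        timer_seconds=int(when, 16) / USER_HZ,
    )
```

For example, parse_row("32: 864C5305:CC75 C387DD21:01BB 01 00000000:00000000 02:00002E40 ...") yields the endpoints and timer values given above.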
https://bugzilla.novell.com/show_bug.cgi?id=628298#c7

--- Comment #7 from Jan Engelhardt <jengelh@medozas.de> 2010-09-20 12:16:05 UTC ---
Now this looks interesting.

13:59:48.330402 IP 134.76.83.5.52341 > 195.135.221.33.443: Flags [.], ack 3525918430, win 92, length 0
14:02:18.330383 IP 134.76.83.5.52341 > 195.135.221.33.443: Flags [.], ack 1, win 92, length 0
14:03:33.330382 IP 134.76.83.5.52341 > 195.135.221.33.443: Flags [.], ack 1, win 92, length 0
14:04:48.330361 IP 134.76.83.5.52341 > 195.135.221.33.443: Flags [.], ack 1, win 92, length 0
14:06:03.330381 IP 134.76.83.5.52341 > 195.135.221.33.443: Flags [.], ack 3525918430, win 92, length 0
14:07:18.330387 IP 134.76.83.5.52341 > 195.135.221.33.443: Flags [.], ack 1, win 92, length 0
14:08:33.330387 IP 134.76.83.5.52341 > 195.135.221.33.443: Flags [.], ack 1, win 92, length 0
14:09:48.330376 IP 134.76.83.5.52341 > 195.135.221.33.443: Flags [.], ack 1, win 92, length 0
14:11:03.330382 IP 134.76.83.5.52341 > 195.135.221.33.443: Flags [R.], seq 1, ack 1, win 92, length 0

But if it takes 2 hours before it even starts sending keepalives, that needs improvement. (Or, better yet, someone should figure out why api.o.o forgets the connection.)
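The capture above is consistent with the stock Linux keepalive settings. The first probe shown is at 13:59:48 and the reset at 14:11:03, 675 s later, which is exactly 9 probes × 75 s.

The 2-hour idle period is a system-wide default, so one way to shorten it for a single connection is to override the keepalive timings on that socket. The following is a minimal Python sketch for illustration only; the OBS code itself is Perl. enable_keepalive is a hypothetical helper, and the 60/15/4 values are arbitrary examples rather than recommendations from this bug.

```python
import socket


def enable_keepalive(sock: socket.socket, idle: int = 60, interval: int = 15, count: int = 4) -> None:
    """Turn on TCP keepalive with per-socket timings instead of the system defaults.

    idle:     seconds of silence before the first probe   (TCP_KEEPIDLE)
    interval: seconds between unanswered probes           (TCP_KEEPINTVL)
    count:    unanswered probes before the kernel gives up (TCP_KEEPCNT)
    The three TCP_* options are Linux-specific, so each one is only set
    when the running Python exposes it.
    """
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    for name, value in (("TCP_KEEPIDLE", idle), ("TCP_KEEPINTVL", interval), ("TCP_KEEPCNT", count)):
        opt = getattr(socket, name, None)
        if opt is not None:
            sock.setsockopt(socket.IPPROTO_TCP, opt, value)
```

With these example values, a silent peer would be given up on after roughly 60 + 4 × 15 = 120 s instead of more than two hours. Setting the options per socket keeps the change local. Lowering the net.ipv4.tcp_keepalive_* sysctls instead would change the behaviour for every connection on the host.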
https://bugzilla.novell.com/show_bug.cgi?id=628298#c8

Jan Engelhardt <jengelh@inai.de> changed:

What       |Removed |Added
----------------------------------------------------------------------------
Status     |NEW     |RESOLVED
Resolution |        |FIXED

--- Comment #8 from Jan Engelhardt <jengelh@inai.de> 2014-02-04 16:58:03 CET ---
Didn't happen anymore in 2013. Good thing.
https://bugzilla.novell.com/show_bug.cgi?id=628298#c

Jan Engelhardt <jengelh@inai.de> changed:

What   |Removed  |Added
----------------------------------------------------------------------------
Status |RESOLVED |VERIFIED