[heroes] [RED] rsync.opensuse.org down
Hi all,

unfortunately I have to inform you that rsync.opensuse.org has been down since this evening, ~17.00 CEST.

The machine is a server which was donated by SUSE Linux GmbH. The rack space, power and uplink were donated by the QSC AG datacenter in Nuremberg (formerly IP-Exchange).

What happened: I tried to do a kernel update of the machine, and now it doesn't come back to life. As this machine is in an external DC, we need to drive there, since we don't have out-of-band management in the sponsored rack of QSC AG.

What's the impact of this: rsync.opensuse.org is down, which means that all unauthorized rsync mirrors of openSUSE (those not in the whitelist on stage.opensuse.org) are unable to sync from us at the moment. Furthermore, rsync.opensuse.org is usually used by the OBS as the first push mirror, which means that all clients accessing download.opensuse.org via HTTP get redirected to rsync.opensuse.org by MirrorBrain for as long as the other mirrors around the world are still syncing the newest updates and don't have them ready yet.

In summary:
- non-whitelisted rsync mirrors can't mirror at the moment
- Users will feel a performance drop when installing and downloading updates, because they are redirected to downloadcontent.opensuse.org, the mirror in our office datacenter in Nuremberg with limited uplink capacity, until the other mirrors and the scanning catch up with the changes.
- This affects - indeed - every package updated in the OBS which is not published below https://download.opensuse.org/repositories/ - including Updates, Tumbleweed and Leap snapshots.

Mitigation: In the end, running this on an old SUSE-donated machine with a lot of slow spinning disks does not match the performance needs of the service. So there needs to be a final, well-engineered solution for this - which was, until now, always out of scope of any budget.

As a quick fix, somebody needs to drive to the data centre and investigate the machine. I'm willing to help, but I'm not driving those 60km at my own expense again, as I did last time. Sorry.

On behalf of the openSUSE Heroes team.

Best regards,
--
Thorsten Bro <tbro@opensuse.org>
Member of openSUSE Heroes
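As an aside, the MirrorBrain redirect is easy to observe from the client side: download.opensuse.org answers a file request with a 302 pointing at the chosen mirror. A quick sketch (the package path is a placeholder for any real file on the server):

  # show which mirror download.opensuse.org redirects a request to;
  # <path-to-some-rpm> is a placeholder for any real package path
  curl -sI https://download.opensuse.org/<path-to-some-rpm> | grep -i '^location'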
Hi Thorsten,

On 25/06/2019 18:17, Thorsten Bro | openSUSE Heroes wrote:
unfortunately I have to inform you that rsync.opensuse.org has been down since this evening, ~17.00 CEST.
The machine is a server which was donated by SUSE Linux GmbH. The rack space, power and uplink were donated by the QSC AG datacenter in Nuremberg (formerly IP-Exchange).
What happened: I tried to do a kernel update of the machine, and now it doesn't come back to life. As this machine is in an external DC, we need to drive there, since we don't have out-of-band management in the sponsored rack of QSC AG.
What's the impact of this: rsync.opensuse.org is down, which means that all unauthorized rsync mirrors of openSUSE (those not in the whitelist on stage.opensuse.org) are unable to sync from us at the moment.
Furthermore, rsync.opensuse.org is usually used by the OBS as the first push mirror, which means that all clients accessing download.opensuse.org via HTTP get redirected to rsync.opensuse.org by MirrorBrain for as long as the other mirrors around the world are still syncing the newest updates and don't have them ready yet.
In summary:
- non-whitelisted rsync mirrors can't mirror at the moment
- Users will feel a performance drop when installing and downloading updates, because they are redirected to downloadcontent.opensuse.org, the mirror in our office datacenter in Nuremberg with limited uplink capacity, until the other mirrors and the scanning catch up with the changes.
- This affects - indeed - every package updated in the OBS which is not published below https://download.opensuse.org/repositories/ - including Updates, Tumbleweed and Leap snapshots.
Mitigation: In the end, running this on an old SUSE-donated machine with a lot of slow spinning disks does not match the performance needs of the service. So there needs to be a final, well-engineered solution for this - which was, until now, always out of scope of any budget.

Does someone have a good understanding of what the needs would be?

As a quick fix, somebody needs to drive to the data centre and investigate the machine. I'm willing to help, but I'm not driving those 60km at my own expense again, as I did last time. Sorry.
I don't know why this had to be a private expense when you drove there last time. But definitely not this time -- if you go there, this will be your working time and business travel, which can of course be reimbursed. I would appreciate you going there at short notice.

Best,
--
Kurt Garloff <garloff@suse.com>
VP Engineering Software Defined Infrastructure
SUSE Linux GmbH, GF: Felix Imendörffer, Mary Higgins, Sri Rasiah, HRB 21284 (AG Nürnberg)
Kurt Garloff wrote:
In the end, running this on an old SUSE-donated machine with a lot of slow spinning disks does not match the performance needs of the service. So there needs to be a final, well-engineered solution for this - which was, until now, always out of scope of any budget.
Does someone have a good understanding of what the needs would be?
With a 1Gbit uplink, it takes about 100MByte/s to saturate. Thorsten ran some numbers in April showing the actual usage is less than half that: daily 200-300Mbit/s with peaks of up to 500Mbit/s, so 50MByte/s.

Being fairly conservative, I think any modern machine with one array controller (an Adaptec 1000, for instance) and 8 drives in RAID6 should easily do the job, assuming we get e.g. 10MByte/s from each drive, which should be well within reach even with our random-access pattern.

--
Per Jessen, Zürich (27.2°C)
Member, openSUSE Heroes
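A minimal sketch for validating the ~10MByte/s-per-drive assumption with fio (device name, block size and runtime here are placeholders, not recommendations):

  # read-only random-read benchmark against one member drive;
  # /dev/sdX is a placeholder for the drive under test
  fio --name=mirror-randread --filename=/dev/sdX --readonly \
      --rw=randread --bs=1M --direct=1 --ioengine=libaio \
      --iodepth=8 --runtime=60 --time_based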
Hi Per,

On 25.06.19 at 21:27, Per Jessen wrote:
Kurt Garloff wrote:
In the end, running this on an old SUSE-donated machine with a lot of slow spinning disks does not match the performance needs of the service. So there needs to be a final, well-engineered solution for this - which was, until now, always out of scope of any budget.
Does someone have a good understanding of what the needs would be?
With a 1Gbit uplink, it takes about 100MByte/s to saturate. Thorsten ran some numbers in April showing the actual usage is less than half that: daily 200-300Mbit/s with peaks of up to 500Mbit/s, so 50MByte/s.
Being fairly conservative, I think any modern machine with one array controller (an Adaptec 1000, for instance) and 8 drives in RAID6 should easily do the job, assuming we get e.g. 10MByte/s from each drive, which should be well within reach even with our random-access pattern.
I go with your calculation, but we had 2 system disks in a hardware RAID there, and 6 x 4TB disks in a software MD-RAID in RAID5 mode. So with this software RAID setup we top out at around 50-60MByte/s :)

Furthermore, we have a 1GBit downlink (with a specific route) to download new content from stage.o.o, and we get fired at by the repopusher via this route as well. Plus we have the rsync clients on another 1GBit interface, downloading in parallel. Yes, the stats showed something like ~500MBit/s - but per interface. And that's why I often saw "stat D" on the rsync processes: we were stuck because of disk performance.

I would like to put SSDs in there - which are super expensive at 4TB and above. And on top of that, we are also running out of space: with the current setup we planned for 19TB (back then, 15TB for all repos); now, ~1 year later, we peak at around 18.51TB on stage.o.o [1].

All this needs a hardware upgrade - unfortunately.

Do you think those are reasonable numbers? What size should we expect? 30TB? Or even more?

Best regards,
Thorsten

[1] https://mirrors.opensuse.org/list/rsyncinfo-stage.o.o.txt

--
Thorsten Bro <Thorsten.Bro@suse.com>
SUSE SDI -> Engineering Infrastructure
SUSE Linux GmbH, Maxfeldstrasse 5, 90409 Nuernberg, Germany
GF: Felix Imendörffer, Mary Higgins, Sri Rasiah, HRB 21284 (AG Nürnberg)
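For reference, the "stat D" observation can be confirmed with standard tools - a rough sketch, nothing specific to this box:

  # list processes stuck in uninterruptible sleep (D state) and what they wait on
  ps -eo pid,stat,wchan:32,comm | awk '$2 ~ /^D/'
  # per-device utilisation; %util pinned near 100 means the disks are the bottleneck
  iostat -x 5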
Thorsten Bro wrote:
Hi Per,
On 25.06.19 at 21:27, Per Jessen wrote:
Kurt Garloff wrote:
In the end, running this on an old SUSE-donated machine with a lot of slow spinning disks does not match the performance needs of the service. So there needs to be a final, well-engineered solution for this - which was, until now, always out of scope of any budget.
Does someone have a good understanding of what the needs would be?
With a 1Gbit uplink, it takes about 100MByte/s to saturate. Thorsten ran some numbers in April showing the actual usage is less than half that: daily 200-300Mbit/s with peaks of up to 500Mbit/s, so 50MByte/s.
Being fairly conservative, I think any modern machine with one array controller (an Adaptec 1000, for instance) and 8 drives in RAID6 should easily do the job, assuming we get e.g. 10MByte/s from each drive, which should be well within reach even with our random-access pattern.
I go with your calculation, but we had 2 system disks in a hardware RAID there, and 6 x 4TB disks in a software MD-RAID in RAID5 mode.
Ah, thanks for adding that, I didn't know the current config.
So with this software RAID setup we top out at around 50-60MByte/s :) Furthermore, we have a 1GBit downlink (with a specific route) to download new content from stage.o.o, and we get fired at by the repopusher via this route as well. Plus we have the rsync clients on another 1GBit interface, downloading in parallel. Yes, the stats showed something like ~500MBit/s - but per interface.
I'm surprised we can even push that much from stage.o.o - so we should plan for 50MByte/s in and 50MByte/s out. Writing 50MByte/s on RAID5 or 6 with hardware support is also fine.
And that's why I often saw "stat D" on the rsync processes: we were stuck because of disk performance. I would like to put SSDs in there - which are super expensive at 4TB and above. And on top of that, we are also running out of space: with the current setup we planned for 19TB (back then, 15TB for all repos); now, ~1 year later, we peak at around 18.51TB on stage.o.o.
I tried to balance speed, size and cost. With cost-effective hardware (= cheaper spinning drives), our desired speed is easily achievable, and we also get more space for less money - room to grow. I specifically say RAID6 because of the time needed to recover a 4TB+ drive: bigger drive = longer recovery time = longer window of risk.
All this needs a hardware upgrade - unfortunately.
Do you think those are reasonable numbers? What size should we expect? 30TB? Or even more?
More :-) - pontifex has about 20TB right now and is 94% full, so it would make sense to go for 30TB or more.

Thinking out loud - comments very welcome - I would go for a storage server from e.g. Thomas Krenn, with IPMI and redundant power supplies. One of those is EUR 2000-2500, depending on spec. A 3U box has room for 16 drives, which gives us room to expand. I would want two smaller 2.5" SSDs in RAID1 to boot from - they're cheap - and then start with 8 x 8TB drives (~48TB usable). Seagate Enterprise drives are EUR 300-350 apiece, I think. Add a hardware RAID controller, EUR 500. So for about EUR 5000 we would have a system that is reliable and has plenty of room to grow. We could probably order it custom-built from TK.

When we run out of space in a couple of years, it would be easy to add another controller and migrate from one array to the other (with more disk space).

--
Per Jessen, Zürich (23.3°C)
Member, openSUSE Heroes
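The back-of-the-envelope arithmetic behind those numbers, as a quick sketch (the 100MB/s sustained rebuild rate is an assumption, not a measurement):

  # RAID6 usable capacity: (n - 2) * drive size
  echo $(( (8 - 2) * 8 ))                   # 8 x 8TB drives -> 48TB usable
  # rough rebuild time for one 8TB drive at an assumed 100MB/s sustained
  echo $(( 8 * 1000 * 1000 / 100 / 3600 ))  # 8000000MB / 100MB/s / 3600 -> ~22 hours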
Hi all,

On 25.06.19 at 18:17, Thorsten Bro | openSUSE Heroes wrote:
As a quick fix, somebody needs to drive to the data centre and investigate the machine.
Thorsten and I have arranged to go there tomorrow at around 10 o'clock, so that we can analyze (and hopefully resolve) the issue. Will let you know once we know more.

Best regards,
Karol Babioch
On 27/06/2019 11.20, Karol Babioch wrote:
Hi all,
On 25.06.19 at 18:17, Thorsten Bro | openSUSE Heroes wrote:
As a quick fix, somebody needs to drive to the data centre and investigate the machine.
Thorsten and I have arranged to go there tomorrow at around 10 o'clock, so that we can analyze (and hopefully resolve) the issue.
Will let you know once we know more.
Just in case getting the replacement takes longer: on my private headless remote servers, I do this during boot:

  modprobe netconsole netconsole=@176.9.1.211/eth0,6666@144.76.96.7/00:31:46:0d:24:04

so I get kernel messages over UDP. It helps narrow down causes. Hetzner still gives me a net-booted recovery system, though.
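For the receiving end: the parameter format is <local-port>@<local-ip>/<dev>,<remote-port>@<remote-ip>/<remote-mac>, so in this example the kernel messages arrive on UDP port 6666 of 144.76.96.7, and any UDP listener will do - a minimal sketch:

  # capture kernel messages arriving over netconsole (flags vary by netcat flavour)
  nc -u -l 6666
  # or with socat:
  socat -u udp-recv:6666 -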
Bernhard M. Wiedemann wrote:
On 27/06/2019 11.20, Karol Babioch wrote:
Hi all,
On 25.06.19 at 18:17, Thorsten Bro | openSUSE Heroes wrote:
As a quick fix, somebody needs to drive to the data centre and investigate the machine.
Thorsten and I have arranged to go there tomorrow at around 10 o'clock, so that we can analyze (and hopefully resolve) the issue.
Will let you know once we know more.
Just in case getting the replacement takes longer: on my private headless remote servers, I do this during boot:
modprobe netconsole netconsole=@176.9.1.211/eth0,6666@144.76.96.7/00:31:46:0d:24:04
Nice tip, thanks! I most often use a serial console over IPMI, but netconsole is nice when there's no IPMI.

--
Per Jessen, Zürich (29.9°C)
Member, openSUSE Heroes
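For reference, the IPMI serial-over-LAN variant looks like this with ipmitool (host and credentials are placeholders):

  # attach to the remote machine's serial console via its BMC
  ipmitool -I lanplus -H bmc.example.org -U admin -P secret sol activate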
Hi all,

On 27.06.19 at 11:20, Karol Babioch wrote:
Thorsten and I have arranged to go there tomorrow at around 10 o'clock, so that we can analyze (and hopefully resolve) the issue.
Will let you know once we know more.
We visited the datacenter and were able to resolve the issue. It was related to the network management service (wicked) not coming up properly due to a dependency cycle in a unit file. The machine can now be rebooted safely, so this shouldn't be a problem any longer. Thorsten will probably tell you more details about this, but I wanted to announce the availability of the system for now.

Also, we've discussed that we want to install another server there in order to have access to the IPMI interface of widehat.opensuse.org. We have some spare hardware in Nuremberg, so this shouldn't be too much of a problem, but it will take some further planning and configuration in advance.

Best regards,
Karol Babioch
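A minimal sketch of how such a dependency cycle can be tracked down with the usual systemd tooling (the unit name is the one from this incident; systemd itself logs "Found ordering cycle" when it has to break one):

  # check a unit and its dependencies for problems, including ordering issues
  systemd-analyze verify wicked.service
  # search the current boot's journal for cycles systemd detected and broke
  journalctl -b | grep -i "ordering cycle"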