[opensuse-kubic] High IO activity for master node
Hi,

I have taken Kubic for a spin with a local QEMU/KVM setup - 1 master node and 3 worker nodes. The cluster is up and running (ingress, load balancer, storage, monitoring). It is basically idle - except for the management pods nothing is deployed.

I'm using the qcow2 files from [1], and launching the machines with

  qemu-system-x86_64 \
    -machine type=pc,accel=kvm \
    -cpu host \
    -smp $cpus \
    -drive file=$disk,if=virtio,l2-cache-size=3145728 \
    -cdrom $cdrom \
    -m 1024 \
    -netdev bridge,id=hostnet0 \
    -device e1000,netdev=hostnet0,mac=$mac \
    -nographic

$cpus is 2 for the master node and 1 for the others.

I have one recurring problem with the master becoming unresponsive after some time. Looking at the grafana charts I can see that there is steadily increasing disk I/O for the master node. At about ~30 minutes after launching all 4 nodes I see the load average is 6 on the master node, with disk read I/O increasing steadily. The 5 minute average as collected by prometheus is at about 350 MB/s.

The situation seems to be "fixed" when the kube-scheduler-master and kube-controller-manager-master pods crash and are not rescheduled. Looking at those pods' logs does not show much; they seem to exit because of a failed leader election, e.g.

for kube-controller-manager-master:

  I0326 15:59:39.521403 1 leaderelection.go:249] failed to renew lease kube-system/kube-controller-manager: failed to tryAcquireOrRenew context deadline exceeded

for kube-scheduler-master:

  I0326 15:59:39.589619 1 leaderelection.go:249] failed to renew lease kube-system/kube-scheduler: failed to tryAcquireOrRenew context deadline exceeded

And looking at the I/O graphs it seems that there is a cycle:

- cluster is ok
- IO increases
- kube-*-master pods fail
- IO decreases
- kube-*-master pods recover

Right now I'm at a loss about where to look next. Does anyone have any ideas about what might cause this or how to better debug this kind of situation?

Thanks,
Robert

[1]: https://download.opensuse.org/repositories/devel:/kubic:/images/openSUSE_Tum...
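(For reference, one minimal way to attribute this kind of load is to look at per-device and per-process disk I/O from inside the master VM; the commands below are a sketch and assume the sysstat and iotop packages can be installed in the guest, which is not necessarily the case on a stock Kubic image.)

  # per-device utilisation and read/write throughput, 5 second samples
  iostat -xd 5

  # per-process disk read/write attribution (run as root)
  pidstat -d 5

  # or interactively, showing only processes actually doing I/O
  iotop -o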
On Tue, Mar 26, Robert Munteanu wrote:
I have one recurring problem with the master becoming unresponsive after some time. Looking at the grafana charts I can see that there is steadily increasing disk I/O for the master node. At about ~30 minutes after launching all 4 nodes I see the load average is 6 on the master node, with disk read I/O increasing steadily. The 5 minute average as collected by prometheus is at about 350 MB/s.
This pretty much sounds like etcd, which is continuously writing to disk for leader election. The usual advice is to run etcd only on an SSD; it could be that in our case the disk I/O is the problem. In my opinion the way etcd implements this algorithm is a mis-design - HA can do the same without it ... But I don't know enough about etcd to say whether there is anything you could do.

Thorsten

--
Thorsten Kukuk, Distinguished Engineer, Senior Architect SLES & MicroOS
SUSE Linux GmbH, Maxfeldstr. 5, 90409 Nuernberg, Germany
GF: Felix Imendoerffer, Mary Higgins, Sri Rasiah, HRB 21284 (AG Nuernberg)
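(Since the symptom points at etcd, one quick check is whether etcd itself complains about slow disk syncs - a sketch, and the exact log wording varies between etcd versions.)

  # warnings etcd emits when a WAL fsync takes longer than expected
  kubectl -n kube-system logs etcd-master | grep -c 'sync duration'

  # missed heartbeats, which typically accompany slow disks
  kubectl -n kube-system logs etcd-master | grep -c 'failed to send out heartbeat'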
On Tue, 2019-03-26 at 17:46 +0100, Thorsten Kukuk wrote:
On Tue, Mar 26, Robert Munteanu wrote:
I have one recurring problem with the master becoming unresponsive after some time. Looking at the grafana charts I can see that there is steadily increasing disk I/O for the master node. At about ~30 minutes after launching all 4 nodes I see the load average is 6 on the master node, with disk read I/O increasing steadily. The 5 minute average as collected by prometheus is at about 350 MB/s.
This pretty much sounds like etcd, which is continuously writing to disk for leader election. The usual advice is to run etcd only on an SSD; it could be that in our case the disk I/O is the problem. In my opinion the way etcd implements this algorithm is a mis-design - HA can do the same without it ...
That seems to be the case, or at least etcd is slow:

  $ kubectl logs etcd-master -n kube-system | grep -c 'took too long'
  1357

But it's surprising to me since I'm running this off an NVMe SSD, so etcd should not be starved for I/O. Is this maybe hinting at my QEMU setup not being performant enough in terms of disk I/O?

Thanks,
Robert
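(One way to quantify whether the virtual disk is actually fast enough for etcd is to measure fsync latency from inside the master VM with fio - a sketch, assuming fio can be installed in the guest and that the etcd data lives in the kubeadm default location /var/lib/etcd; the size and block size are just values commonly used for etcd-like benchmarks.)

  # sequential writes with an fdatasync after each one, on the etcd data volume
  fio --name=etcd-fsync-test --directory=/var/lib/etcd \
      --rw=write --ioengine=sync --fdatasync=1 \
      --size=22m --bs=2300

  # the interesting number is the fsync/fdatasync 99th percentile in the
  # output - for etcd it should ideally stay in the low milliseconds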
On Tue, 2019-03-26 at 19:25 +0100, Robert Munteanu wrote:
But it's surprising to me since I'm running this off a NVME SSD, so etcd should not be starved for IO. Is this maybe hinting at my QEMU setup not being performant enough in terms of disk IO?
Following up - I switched the master to use a raw disk file

  -drive file=master.raw,if=virtio \

and indeed responsiveness is much better and the pods are mostly stable.

Two questions:

1. The total disk reads are still increasing and peaked at about 1 GB/s (now at about 800 MB/s). Is that value expected or should I go looking for something wrong?

2. Should the wiki contain a note regarding the disk image format? I would propose something like

-----------x8-----------
Note that the Kubernetes master requires large amounts of disk I/O to function properly and that the default qcow2 format may not offer that. For best performance it is recommended that the supplied images are converted to other formats, such as the 'raw' one.
-----------x8-----------

Thanks,
Robert
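(For anyone wanting to try the same, the published qcow2 image can be converted with qemu-img; the file names below are placeholders.)

  # convert the downloaded qcow2 image to a raw file
  qemu-img convert -f qcow2 -O raw openSUSE-Tumbleweed-Kubic.qcow2 master.raw

  # then point the VM at it, e.g.
  #   -drive file=master.raw,if=virtio,format=raw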
On Wed, Mar 27, Robert Munteanu wrote:
On Tue, 2019-03-26 at 19:25 +0100, Robert Munteanu wrote:
But it's surprising to me since I'm running this off a NVME SSD, so etcd should not be starved for IO. Is this maybe hinting at my QEMU setup not being performant enough in terms of disk IO?
Following up - I switched the master to use a raw disk file
-drive file=master.raw,if=virtio \
and indeed responsiveness is much better and the pods are mostly stable.
Two questions:
1. The total disk reads are still increasing and peaked at about 1 GB/s (now at about 800 MB/s). Is that value expected or should I go looking for something wrong?
I would expect disk writes, not disk reads. And your values are much, much higher than the ones on my test clusters. There must be something else going on.
2. Should the wiki contain a note regarding the disk image format? I would propose something like
While this helps a little bit in your case, I don't think that it is a generally correct statement. And to me it looks like this mitigates your problem rather than addressing the root cause.

There are many knobs in a virtualisation stack. E.g. the filesystem setup where you store the images has a much higher impact than raw vs. qcow2. I'm using libvirt/kvm on SLES15 with xfs, or btrfs with NoCOW and qgroups disabled. If you use e.g. btrfs with COW and qgroups enabled, all other optimizations will not help; your performance will always be worse. But this is something for virtualisation experts.

What you could try is to install the master node with YaST from the openSUSE Kubic iso image and not use a pre-built qcow2 image. Our YaST installer uses a slightly different setup than kiwi, maybe this could also make a difference.

But I think the problem is somewhere else in your virtualisation stack; I don't see this high load and disk I/O on any of my three Kubic clusters, and one is even running on my notebook.

Thorsten

--
Thorsten Kukuk, Distinguished Engineer, Senior Architect SLES & MicroOS
SUSE Linux GmbH, Maxfeldstr. 5, 90409 Nuernberg, Germany
GF: Felix Imendoerffer, Mary Higgins, Sri Rasiah, HRB 21284 (AG Nuernberg)
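(For a btrfs host, the setup described above corresponds roughly to the commands below - a sketch only, with placeholder paths, and keeping in mind that chattr +C only affects files created after the flag is set, so it has to be applied to an empty image directory.)

  # mark the VM image directory NoCOW so newly created images are not copy-on-write
  chattr +C /var/lib/libvirt/images

  # verify the flag on the directory itself
  lsattr -d /var/lib/libvirt/images

  # disable btrfs quota groups on the filesystem holding the images
  btrfs quota disable /var/lib/libvirt/images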
On Wed, 2019-03-27 at 10:27 +0100, Thorsten Kukuk wrote:
There are many knobs in a virtualisation stack. E.g. the filesystem setup where you store the images has a much higher impact than raw vs. qcow2. I'm using libvirt/kvm on SLES15 with xfs or btrfs with NoCOW and qgroups disabled. If you use e.g. btrfs with COW and qgroups enabled, all other optimizations will not help, your performance will always be worse. But this is something for virtualisation experts.
FWIW, the raw disk images are stored on an XFS partition (I am running Tumbleweed).
What you could try to do is to install the master node with YaST from the openSUSE Kubic iso image and don't use a pre-build qcow2 image. Our YaST installer uses a slightly different setup then kiwi, maybe this could also make a difference. But I think the problem is somewhere else in your virtualisation stack, I don't see this high load and disk I/O on any of my three kubic clusters, and one is even running on my notebook.
I would tend to agree, I think the problem is in the virtualisation stack.

One of the processes that is reported to generate a lot of IO reads is kube-apiserver. However, lsof does not report any meaningful open files:

  # lsof -p 2444 | grep -v IPv
  COMMAND    PID USER  FD      TYPE DEVICE  SIZE/OFF  NODE NAME
  kube-apis 2444 root cwd       DIR   0,98       152   256 /
  kube-apis 2444 root rtd       DIR   0,98       152   256 /
  kube-apis 2444 root txt       REG   0,98 138710240  2317 /usr/local/bin/kube-apiserver
  kube-apis 2444 root   0u      CHR    1,3       0t0 29379 /dev/null
  kube-apis 2444 root   1w     FIFO   0,12       0t0 29318 pipe
  kube-apis 2444 root   2w     FIFO   0,12       0t0 29319 pipe
  kube-apis 2444 root   4u  a_inode   0,13         0 10449 [eventpoll]

Running fatrace | grep kube-api does not show any operations. Additionally, running iotop on the host shows very few disk reads but some disk writes.

Do you (or anyone else) have any suggestions related to where I could go digging into the qemu setup?

Thanks,
Robert
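(One QEMU knob worth experimenting with here is the drive cache/AIO mode - the default cache=writeback goes through the host page cache, while cache=none with native AIO bypasses it, which often behaves more predictably for images on fast local storage. A sketch, not something verified against this particular setup:)

  -drive file=master.raw,if=virtio,format=raw,cache=none,aio=native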
On Wed, 27 Mar 2019 at 15:57, Robert Munteanu <rombert@apache.org> wrote:
On Wed, 2019-03-27 at 10:27 +0100, Thorsten Kukuk wrote:
There are many knobs in a virtualisation stack. E.g. the filesystem setup where you store the images has a much higher impact than raw vs. qcow2. I'm using libvirt/kvm on SLES15 with xfs or btrfs with NoCOW and qgroups disabled. If you use e.g. btrfs with COW and qgroups enabled, all other optimizations will not help, your performance will always be worse. But this is something for virtualisation experts.
FWIW, the raw disk images are stored on an XFS partition ( I am running Tumbleweed ).
Were the qcow2 disk images also held on an XFS partition?

I'm personally curious as to how many layers of 'CoWing' might be in place.

All of us in this discussion so far are using a host for our VMs with no CoW, either as a result of the filesystem choice or NoCoW being explicitly set for the VM location.

On an openSUSE Kubic installed using the official media (as I and Thorsten do), the /var in the VM is also NoCoW as a result of the installer doing its job right.

But I'm not 100% sure the kiwi images do that properly, so it's possible there is an extra layer of CoW introduced by the way kiwi sets up the disks.

In the VMs can you run `lsattr / | grep /var` and confirm that "C" is set?

Thanks
Hi, Am Mittwoch, 27. März 2019, 16:39:28 CET schrieb Richard Brown:
On Wed, 27 Mar 2019 at 15:57, Robert Munteanu <rombert@apache.org> wrote:
On Wed, 2019-03-27 at 10:27 +0100, Thorsten Kukuk wrote:
There are many knobs in a virtualisation stack. E.g. the filesystem setup where you store the images has a much higher impact than raw vs. qcow2. I'm using libvirt/kvm on SLES15 with xfs or btrfs with NoCOW and qgroups disabled. If you use e.g. btrfs with COW and qgroups enabled, all other optimizations will not help, your performance will always be worse. But this is something for virtualisation experts.
FWIW, the raw disk images are stored on an XFS partition ( I am running Tumbleweed ).
Were the qcow2 disk images also held on an XFS partition?
I'm personally curious as to how many layers of 'CoWing' might be in place
All of us in this discussion so far are using a host for our VMs with No CoW either as a result of the filesystem choice or NoCoW being explicitly set for the VM location
On an openSUSE Kubic installed using the official media (as I and Thorsten do), the /var in the VM is also NoCoW as a result of the installer doing its job right
But I'm not 100% sure the kiwi images do that properly,
They do. The .kiwi has

  <volume name="var" copy_on_write="false"/>

and kiwi sets the flag correctly according to lsattr. It's also visible in the buildlog:

  EXEC: [chattr +C /tmp/kiwi_volumes.9jbuba54/@/var]

Cheers,
Fabian
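(The same flag is easy to verify from inside a running VM built from the image; -d queries the directory itself rather than its contents.)

  # should list "C" among the attributes of /var
  lsattr -d /var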
so it's possible there is an extra layer of CoW introduced by the way kiwi sets up the disks.

In the VMs can you run `lsattr / | grep /var` and confirm that "C" is set?
Thanks
On Wed, 2019-03-27 at 16:39 +0100, Richard Brown wrote:
On Wed, 27 Mar 2019 at 15:57, Robert Munteanu <rombert@apache.org> wrote:
On Wed, 2019-03-27 at 10:27 +0100, Thorsten Kukuk wrote:
There are many knobs in a virtualisation stack. E.g. the filesystem setup where you store the images has a much higher impact than raw vs. qcow2. I'm using libvirt/kvm on SLES15 with xfs or btrfs with NoCOW and qgroups disabled. If you use e.g. btrfs with COW and qgroups enabled, all other optimizations will not help, your performance will always be worse. But this is something for virtualisation experts.
FWIW, the raw disk images are stored on an XFS partition ( I am running Tumbleweed ).
Were the qcow2 disk images also held on an XFS partition?
Yes.
I'm personally curious as to how many layers of 'CoWing' might be in place
All of us in this discussion so far are using a host for our VMs with No CoW either as a result of the filesystem choice or NoCoW being explicitly set for the VM location
On an openSUSE Kubic installed using the official media (as I and Thorsten do), the /var in the VM is also NoCoW as a result of the installer doing its job right
But I'm not 100% sure the kiwi images do that properly, so it's possible there is an extra layer of CoW introduced by the way kiwi sets up the disks
In the VMs can you run `lsattr / | grep /var` and confirm that "C" is set?
That's a good point. Yes, 'C' is set:

  # lsattr / | grep /var
  lsattr: Inappropriate ioctl for device While reading flags on /dev
  lsattr: Inappropriate ioctl for device While reading flags on /etc
  lsattr: Inappropriate ioctl for device While reading flags on /proc
  lsattr: Inappropriate ioctl for device While reading flags on /run
  lsattr: Inappropriate ioctl for device While reading flags on /sys
  ---------------C--- /var

Note that the same kind of IO load was reported with both qcow2 and raw images, it's just that the raw images seem to cope much better with it.

Thanks,
Robert
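(For completeness: if staying with qcow2 is preferable, its overhead can sometimes be reduced by recreating the image with metadata preallocation and lazy refcounts - a sketch with placeholder file names, not something verified against the Kubic images.)

  qemu-img convert -f qcow2 -O qcow2 \
      -o preallocation=metadata,lazy_refcounts=on \
      openSUSE-Tumbleweed-Kubic.qcow2 master-tuned.qcow2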
participants (4): Fabian Vogt, Richard Brown, Robert Munteanu, Thorsten Kukuk