[Bug 1190632] New: Failure to reboot guests consistently at least 100 times
http://bugzilla.opensuse.org/show_bug.cgi?id=1190632

            Bug ID: 1190632
           Summary: Failure to reboot guests consistently at least 100
                    times
    Classification: openSUSE
           Product: openSUSE Tumbleweed
           Version: Current
          Hardware: Other
                OS: Other
            Status: NEW
          Severity: Normal
          Priority: P5 - None
         Component: Virtualization:Tools
          Assignee: virt-bugs@suse.de
          Reporter: luis.chamberlain@suse.com
        QA Contact: qa-bugs@suse.de
          Found By: ---
           Blocker: ---

Upstream kdevops [0] has a workflow test called "reboot-limit". The goal is
to spawn a guest and then reboot it, confirming each reboot, at least 100
times. If that fails, the test fails. On top of this, the real goal is to
run this simple test 100 times, for a full set of 10,000 reboots. This is
called achieving steady state for the reboot-limit kdevops test.

To spawn guests, vagrant is used; right now I'm using vagrant with libvirt,
so behind the scenes this is all libvirt. I've been following Tumbleweed for
a while now, and across all releases so far I've only been able to
accomplish a successful reboot-limit test once, so it would seem that
success might just be chance.

kdevops uses ansible to control guests. There are 3 simple steps to the
reboot:

1) The ansible reboot module first sshs to the guest, queries
   /proc/sys/kernel/random/boot_id, and saves this random UUID.
2) Ansible sshs to the guest and issues '/sbin/shutdown -r 0' to reboot
   the guest.
3) Ansible then loops trying to ssh to the guest for up to 600 seconds;
   once it logs in, it queries /proc/sys/kernel/random/boot_id again. If
   the UUIDs match, the reboot is determined to have failed.

Now, /proc/sys/kernel/random/boot_id does not change in containers, so in
ansible 2.10 boot_time_command was added to the reboot module to let you
specify another check. But for guests, boot_id should work fine.

openSUSE Tumbleweed is running ansible-2.9.24-1.1.noarch.
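The three steps can be sketched in shell. This is only a rough approximation
of what the ansible reboot module does, not kdevops code; the ssh helpers are
illustrative and are not invoked below, and only the pure pass/fail rule is
exercised:

```shell
#!/bin/sh
# Sketch of the three reboot steps described above (approximation only).

# Steps 1 and 3 both read the random boot UUID from the guest.
query_boot_id() {   # $1 = guest ssh alias (hypothetical)
    ssh -o ConnectTimeout=5 "$1" cat /proc/sys/kernel/random/boot_id
}

# Step 2: trigger the reboot the same way ansible does.
trigger_reboot() {  # $1 = guest ssh alias
    ssh "$1" sudo /sbin/shutdown -r 0
}

# The verdict: the reboot succeeded only if boot_id changed.
reboot_succeeded() {  # $1 = pre-reboot boot_id, $2 = post-reboot boot_id
    [ -n "$1" ] && [ -n "$2" ] && [ "$1" != "$2" ]
}

# Exercise only the comparison, with made-up UUIDs:
reboot_succeeded "aaaa-1111" "bbbb-2222" && echo "reboot succeeded"
reboot_succeeded "aaaa-1111" "aaaa-1111" || echo "reboot failed"
```

Note the comparison treats an empty boot_id as a failure too, since an empty
read means ssh never got an answer.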
With this, the reboot-limit test can't pass more than 100 reboots easily;
sometimes it reaches 200 or 500, but that's it. A workaround I noticed was
to use:

diff --git a/playbooks/roles/reboot-limit/tasks/do-reboot.yml b/playbooks/roles/reboot-limit/tasks/do-reboot.yml
index a6876d2..2841bf6 100644
--- a/playbooks/roles/reboot-limit/tasks/do-reboot.yml
+++ b/playbooks/roles/reboot-limit/tasks/do-reboot.yml
@@ -12,6 +12,7 @@
       become: yes
       become_method: sudo
       reboot:
+        post_reboot_delay: 10
       when:
         - reboot_limit_test_type == "ansible"
       tags: [ 'run_tests' ]

I've had to merge the above fix to kdevops because otherwise this test
fails easily... With this we are able to get 10,000 reboots on openSUSE
Tumbleweed.

But on Debian testing, using ansible 2.10.7 and linux-next, I'm easily able
to get 1,000,000 reboots, and without the above hack. Which begs the
question: where is the issue?

I checked ansible with verbose output (ansible-playbook -vvv) for the test,
and it would seem everything checks out, except the timeout really happens.
But the odd thing is that you are still able to ssh to the guest when you
go and check to see what happened. Which would seem to indicate the issue
might be something else.

[0] https://github.com/mcgrof/kdevops

--
You are receiving this mail because:
You are on the CC list for the bug.
http://bugzilla.opensuse.org/show_bug.cgi?id=1190632#c1

--- Comment #1 from Andreas Stieger <Andreas.Stieger@gmx.de> ---
*** Bug 1190633 has been marked as a duplicate of this bug. ***
http://bugzilla.opensuse.org/show_bug.cgi?id=1190632#c2

James Fehlig <jfehlig@suse.com> changed:
           What    |Removed |Added
----------------------------------------------------------------------
                CC |        |jfehlig@suse.com

--- Comment #2 from James Fehlig <jfehlig@suse.com> ---
(In reply to Luis Chamberlain from comment #0)
> Upstream kdevops [0] has a workflow test called "reboot-limit". The goal
> is to spawn a guest and then reboot it, confirming each reboot, at least
> 100 times. If that fails, the test fails. On top of this, the real goal
> is to run this simple test 100 times, for a full set of 10,000 reboots.
> This is called achieving steady state for the reboot-limit kdevops test.
>
> To spawn guests, vagrant is used; right now I'm using vagrant with
> libvirt, so behind the scenes this is all libvirt. I've been following
> Tumbleweed for a while now, and across all releases so far I've only been
> able to accomplish a successful reboot-limit test once, so it would seem
> that success might just be chance.
>
> kdevops uses ansible to control guests. There are 3 simple steps to the
> reboot:
>
> 1) The ansible reboot module first sshs to the guest, queries
>    /proc/sys/kernel/random/boot_id, and saves this random UUID.
> 2) Ansible sshs to the guest and issues '/sbin/shutdown -r 0' to reboot
>    the guest.
> 3) Ansible then loops trying to ssh to the guest for up to 600 seconds;
>    once it logs in, it queries /proc/sys/kernel/random/boot_id again. If
>    the UUIDs match, the reboot is determined to have failed.
I created a similar test that uses libvirt directly. It is running but has only completed 36 iterations.
> Now, /proc/sys/kernel/random/boot_id does not change in containers, so in
> ansible 2.10 boot_time_command was added to the reboot module to let you
> specify another check. But for guests, boot_id should work fine.
>
> openSUSE Tumbleweed is running ansible-2.9.24-1.1.noarch. With this, the
> reboot-limit test can't pass more than 100 reboots easily; sometimes it
> reaches 200 or 500, but that's it.
>
> A workaround I noticed was to use:
>
> diff --git a/playbooks/roles/reboot-limit/tasks/do-reboot.yml b/playbooks/roles/reboot-limit/tasks/do-reboot.yml
> index a6876d2..2841bf6 100644
> --- a/playbooks/roles/reboot-limit/tasks/do-reboot.yml
> +++ b/playbooks/roles/reboot-limit/tasks/do-reboot.yml
> @@ -12,6 +12,7 @@
>        become: yes
>        become_method: sudo
>        reboot:
> +        post_reboot_delay: 10
>        when:
>          - reboot_limit_test_type == "ansible"
>        tags: [ 'run_tests' ]
>
> I've had to merge the above fix to kdevops because otherwise this test
> fails easily... With this we are able to get 10,000 reboots on openSUSE
> Tumbleweed.
>
> But on Debian testing, using ansible 2.10.7 and linux-next, I'm easily
> able to get 1,000,000 reboots, and without the above hack. Which begs the
> question: where is the issue?
Hopefully my test will help sort that out. If it runs fine at the libvirt
level, then I'd suspect something in ansible or vagrant-libvirt. BTW,
although TW has older ansible (and perhaps vagrant stuff) than Debian
testing, it has a newer libvirt (7.7.0 vs 7.6.0) and qemu (6.1.0 vs 1.6.1).
Luis Chamberlain <luis.chamberlain@suse.com> changed:
           What    |Removed |Added
----------------------------------------------------------------------
                CC |        |dcermak@suse.com
http://bugzilla.opensuse.org/show_bug.cgi?id=1190632#c3

Dan Čermák <dcermak@suse.com> changed:
           What    |Removed |Added
----------------------------------------------------------------------
                CC |        |luis.chamberlain@suse.com
             Flags |        |needinfo?(luis.chamberlain@suse.com)

--- Comment #3 from Dan Čermák <dcermak@suse.com> ---
What is the actual issue at hand here? I am honestly puzzled as to what the
problem actually is: do your VMs time out on reboot? Do you reboot them via
vagrant? Can you reproduce this consistently?
http://bugzilla.opensuse.org/show_bug.cgi?id=1190632#c4

--- Comment #4 from James Fehlig <jfehlig@suse.com> ---
(In reply to James Fehlig from comment #2)
> I created a similar test that uses libvirt directly. It is running but
> has only completed 36 iterations.
Update: The test has now completed 1680 iterations without encountering an
issue.
http://bugzilla.opensuse.org/show_bug.cgi?id=1190632#c5

Luis Chamberlain <luis.chamberlain@suse.com> changed:
           What    |Removed                              |Added
----------------------------------------------------------------------
             Flags |needinfo?(luis.chamberlain@suse.com) |

--- Comment #5 from Luis Chamberlain <luis.chamberlain@suse.com> ---
(In reply to Dan Čermák from comment #3)
> What is the actual issue at hand here?

After a number of reboots using vagrant-libvirt on openSUSE Tumbleweed,
ssh access times out (over 600 seconds, the ansible default).

> I am honestly puzzled as to what the problem actually is: do your VMs
> time out on reboot? Do you reboot them via vagrant?

No, ansible is used to reboot, as noted in the first comment. I also looked
at how ansible does this, and it is: /sbin/shutdown -r 0

I also noted the exact algorithm used, in comment #1, about how it
determines whether the reboot was successful or not.

> Can you reproduce this consistently?

Yes. Absolutely. At what reboot number? Well, it varies: sometimes within
10, sometimes within 80, sometimes after 600. On Debian testing I have no
issues and got up to 1,000,000 easily.
http://bugzilla.opensuse.org/show_bug.cgi?id=1190632#c12

Dan Čermák <dcermak@suse.com> changed:
           What    |Removed |Added
----------------------------------------------------------------------
             Flags |        |needinfo?(luis.chamberlain@suse.com)

--- Comment #12 from Dan Čermák <dcermak@suse.com> ---
(In reply to Luis Chamberlain from comment #5)
> (In reply to Dan Čermák from comment #3)
> > What is the actual issue at hand here?
>
> After a number of reboots using vagrant-libvirt on openSUSE Tumbleweed,
> ssh access times out (over 600 seconds, the ansible default).
>
> > I am honestly puzzled as to what the problem actually is: do your VMs
> > time out on reboot? Do you reboot them via vagrant?
>
> No, ansible is used to reboot, as noted in the first comment. I also
> looked at how ansible does this, and it is: /sbin/shutdown -r 0
>
> I also noted the exact algorithm used, in comment #1, about how it
> determines whether the reboot was successful or not.
>
> > Can you reproduce this consistently?
>
> Yes. Absolutely. At what reboot number? Well, it varies: sometimes within
> 10, sometimes within 80, sometimes after 600.
>
> On Debian testing I have no issues and got up to 1,000,000 easily.
Could you export the environment variable `VAGRANT_LOG=DEBUG` during your
tests, capture stdout & stderr and upload the last logfile when the box
times out? Also, could you please provide the version numbers of the
respective components (vagrant, vagrant-libvirt, libvirt, the ruby gems,
etc.) on openSUSE and on Debian?
http://bugzilla.opensuse.org/show_bug.cgi?id=1190632#c14

Luis Chamberlain <luis.chamberlain@suse.com> changed:
           What    |Removed                              |Added
----------------------------------------------------------------------
             Flags |needinfo?(luis.chamberlain@suse.com) |

--- Comment #14 from Luis Chamberlain <luis.chamberlain@suse.com> ---
(In reply to Dan Čermák from comment #12)
> (In reply to Luis Chamberlain from comment #5)
> > (In reply to Dan Čermák from comment #3)
> > > What is the actual issue at hand here?
> >
> > After a number of reboots using vagrant-libvirt on openSUSE Tumbleweed,
> > ssh access times out (over 600 seconds, the ansible default).
> >
> > > I am honestly puzzled as to what the problem actually is: do your VMs
> > > time out on reboot? Do you reboot them via vagrant?
> >
> > No, ansible is used to reboot, as noted in the first comment. I also
> > looked at how ansible does this, and it is: /sbin/shutdown -r 0
> >
> > I also noted the exact algorithm used, in comment #1, about how it
> > determines whether the reboot was successful or not.
> >
> > > Can you reproduce this consistently?
> >
> > Yes. Absolutely. At what reboot number? Well, it varies: sometimes
> > within 10, sometimes within 80, sometimes after 600.
> >
> > On Debian testing I have no issues and got up to 1,000,000 easily.
>
> Could you export the environment variable `VAGRANT_LOG=DEBUG` during your
> tests, capture stdout & stderr and upload the last logfile when the box
> times out?
There are no vagrant commands run for the reboot test. The provisioning all works fine. So do I still set this up when I run the ansible command to run the reboot-limit playbook?
> Also, could you please provide the version numbers of the respective
> components (vagrant, vagrant-libvirt, libvirt, the ruby gems, etc.) on
> openSUSE and on Debian?
OpenSUSE: ruby3.0-rubygem-logging-2.3.0-1.5.x86_64 ruby3.0-rubygem-fog-json-1.2.0-1.8.x86_64 libruby3_0-3_0-3.0.1-3.1.x86_64 ruby2.7-2.7.3-3.1.x86_64 ruby-common-2.6-6.1.noarch system-group-libvirt-20170617-23.1.noarch yast2-ruby-bindings-4.4.4-1.2.x86_64 ruby-solv-0.7.19-1.2.x86_64 ruby2.7-rubygem-nokogiri-1.12.3-1.1.x86_64 ruby2.7-rubygem-gem2rpm-0.10.1-17.5.x86_64 ruby3.0-rubygem-little-plugger-1.1.4-1.17.x86_64 ruby2.7-rubygem-unf_ext-0.0.7.7-1.8.x86_64 ruby2.7-rubygem-ruby-dbus-0.16.0-1.2.x86_64 ruby2.7-rubygem-ruby-augeas-0.5.0-3.23.x86_64 ruby2.7-rubygem-fast_gettext-2.0.3-1.5.x86_64 ruby2.7-rubygem-abstract_method-1.2.1-2.21.x86_64 ruby-2.7-1.5.x86_64 ruby3.0-rubygem-listen-3.7.0-1.1.x86_64 ruby2.7-rubygem-unf-0.1.4-1.20.x86_64 ruby2.7-rubygem-cfa-1.0.2-1.8.x86_64 ruby2.7-rubygem-cheetah-0.5.2-1.8.x86_64 ruby2.7-rubygem-simpleidn-0.2.1-1.3.x86_64 ruby2.7-rubygem-cfa_grub2-2.0.0-1.8.x86_64 ruby3.0-rubygem-hashicorp-checkpoint-0.1.5-1.12.x86_64 vagrant-vim-2.2.18-1.2.noarch libstorage-ng-ruby-4.4.36-1.1.x86_64 ruby3.0-rubygem-nokogiri-1.12.3-1.1.x86_64 libvirt-7.7.0-1.1.x86_64 vagrant-bash-completion-2.2.18-1.2.noarch ruby3.0-3.0.1-3.1.x86_64 libruby2_7-2_7-2.7.3-3.1.x86_64 libvirt-daemon-7.7.0-1.1.x86_64 ruby3.0-rubygem-rubyzip-2.3.2-1.1.x86_64 ruby3.0-rubygem-rubyntlm-0.6.3-1.1.x86_64 ruby3.0-rubygem-rb-fsevent-0.11.0-1.1.x86_64 ruby3.0-rubygem-nori-2.6.0-1.16.x86_64 ruby3.0-rubygem-net-ssh-6.1-6.1.0-1.1.x86_64 ruby3.0-rubygem-multi_json-1.15.0-1.5.x86_64 ruby3.0-rubygem-mime-types-data-3.2021.0704-1.1.x86_64 ruby3.0-rubygem-log4r-1.1.10-9.23.x86_64 ruby3.0-rubygem-httpclient-2.8.3-1.21.x86_64 ruby3.0-rubygem-formatador-0.3.0-1.1.x86_64 ruby3.0-rubygem-ffi-1.15.3-1.1.x86_64 ruby3.0-rubygem-excon-0.85.0-1.1.x86_64 ruby3.0-rubygem-erubi-1.10.0-1.4.x86_64 ruby3.0-rubygem-ed25519-1.2.4-1.15.x86_64 ruby3.0-rubygem-concurrent-ruby-1.1.9-1.1.x86_64 ruby3.0-rubygem-childprocess-4.1.0-1.1.x86_64 ruby3.0-rubygem-builder-3.2.4-1.8.x86_64 
ruby3.0-rubygem-bcrypt_pbkdf-1.1.0-1.4.x86_64 ruby3.0-rubygem-ruby-libvirt-0.7.1-2.5.x86_64 ruby3.0-rubygem-net-sftp-3.0.0-1.6.x86_64 ruby3.0-rubygem-net-scp-3.0.0-1.6.x86_64 ruby3.0-rubygem-mime-types-3.3.1-1.8.x86_64 ruby3.0-rubygem-rb-kqueue-0.2.6-1.1.x86_64 ruby3.0-rubygem-rb-inotify-0.10.1-1.8.x86_64 ruby3.0-rubygem-gssapi-1.3.1-1.3.x86_64 ruby3.0-rubygem-vagrant_cloud-3.0.5-1.1.x86_64 ruby3.0-rubygem-i18n-1.8.10-1.1.x86_64 ruby3.0-rubygem-gyoku-1.3.1-1.16.x86_64 ruby3.0-rubygem-fog-core-2.2.4-1.1.x86_64 ruby3.0-rubygem-winrm-2.3.6-1.1.x86_64 ruby3.0-rubygem-fog-xml-0.1.3-1.8.x86_64 ruby3.0-rubygem-winrm-fs-1.3.5-1.5.x86_64 libvirt-libs-7.7.0-1.1.x86_64 ruby3.0-rubygem-fog-libvirt-0.9.0-1.1.x86_64 ruby3.0-rubygem-winrm-elevated-1.2.3-1.3.x86_64 libvirt-daemon-driver-libxl-7.7.0-1.1.x86_64 libvirt-daemon-driver-network-7.7.0-1.1.x86_64 libvirt-daemon-driver-secret-7.7.0-1.1.x86_64 libvirt-daemon-driver-qemu-7.7.0-1.1.x86_64 libvirt-daemon-driver-storage-rbd-7.7.0-1.1.x86_64 libvirt-daemon-config-network-7.7.0-1.1.x86_64 libvirt-daemon-driver-lxc-7.7.0-1.1.x86_64 libvirt-daemon-driver-nodedev-7.7.0-1.1.x86_64 libvirt-daemon-driver-storage-mpath-7.7.0-1.1.x86_64 libvirt-daemon-driver-storage-logical-7.7.0-1.1.x86_64 libvirt-daemon-driver-storage-iscsi-direct-7.7.0-1.1.x86_64 libvirt-daemon-driver-storage-iscsi-7.7.0-1.1.x86_64 libvirt-daemon-driver-storage-disk-7.7.0-1.1.x86_64 libvirt-daemon-config-nwfilter-7.7.0-1.1.x86_64 libvirt-client-7.7.0-1.1.x86_64 libvirt-daemon-driver-storage-7.7.0-1.1.x86_64 libvirt-daemon-driver-nwfilter-7.7.0-1.1.x86_64 libvirt-daemon-driver-storage-scsi-7.7.0-1.1.x86_64 vagrant-libvirt-0.5.3-2.1.x86_64 libvirt-daemon-driver-interface-7.7.0-1.1.x86_64 vagrant-2.2.18-1.2.x86_64 libvirt-daemon-driver-storage-core-7.7.0-1.1.x86_64 Debian testing: ii libruby2.7:amd64 2.7.4-1 amd64 Libraries necessary to run Ruby 2.7 ii libsys-virt-perl 7.0.0-1 amd64 Perl module providing an extension for the libvirt library ii libvirt-clients 7.0.0-3 
amd64 Programs for the libvirt library ii libvirt-daemon 7.0.0-3 amd64 Virtualization daemon ii libvirt-daemon-config-network 7.0.0-3 all Libvirt daemon configuration files (default network) ii libvirt-daemon-config-nwfilter 7.0.0-3 all Libvirt daemon configuration files (default network filters) ii libvirt-daemon-driver-lxc 7.0.0-3 amd64 Virtualization daemon LXC connection driver ii libvirt-daemon-driver-qemu 7.0.0-3 amd64 Virtualization daemon QEMU connection driver ii libvirt-daemon-driver-vbox 7.0.0-3 amd64 Virtualization daemon VirtualBox connection driver ii libvirt-daemon-driver-xen 7.0.0-3 amd64 Virtualization daemon Xen connection driver ii libvirt-daemon-system 7.0.0-3 amd64 Libvirt daemon configuration files ii libvirt-daemon-system-systemd 7.0.0-3 all Libvirt daemon configuration files (systemd) ii libvirt-dev:amd64 7.0.0-3 amd64 development files for the libvirt library ii libvirt0:amd64 7.0.0-3 amd64 library for interfacing with different virtualization systems ii rake 13.0.3-1 all ruby make-like utility ii ruby 1:2.7+2 amd64 Interpreter of object-oriented scripting language Ruby (default version) ii ruby-bcrypt-pbkdf 1.1.0-1 amd64 Ruby library for OpenBSD's bcrypt_pdkfd ii ruby-builder 3.2.4-1 all Builders for XML MarkUps ii ruby-childprocess 4.0.0-1 all Ruby library for controlling external programs running in the background ii ruby-concurrent 1.1.6+dfsg-3 all modern concurrency tools for Ruby ii ruby-domain-name 0.5.20190701-1 all Domain Name manipulation library for Ruby ii ruby-ed25519 1.2.4-2+b1 amd64 efficient digital signature library providing the Ed25519 algorithm ii ruby-erubi 1.9.0-1 all Small ERB Implementation ii ruby-erubis 2.7.0-3.1 all fast and extensible eRuby implementation which supports multi-language ii ruby-excon 0.79.0-2 all Ruby library for creating fast, persistent, HTTP(S) connections ii ruby-ffi 1.12.2+dfsg-2+b3 amd64 load dynamic libraries, bind functions from within ruby code ii ruby-fog-core 2.1.0-3.1 all shared classes 
and tests for fog providers and services ii ruby-fog-json 1.2.0-1 all JSON parsing for fog providers ii ruby-fog-libvirt 0.8.0-1 all Module for the 'fog' gem to support libvirt ii ruby-fog-xml 0.1.1-5.1 all XML parsing for fog providers ii ruby-formatador 0.2.5-2.1 all Ruby STDOUT text formatting library ii ruby-http-accept 1.7.0-2 all Ruby parser for Accept and Accept-Language HTTP headers ii ruby-http-cookie 1.0.3-1 all Ruby library to handle HTTP Cookies based on RFC 6265 ii ruby-i18n 1.8.8-1 all I18n and localization solution for Ruby ii ruby-json 2.3.0+dfsg-1+b3 amd64 JSON library for Ruby ii ruby-libvirt 0.7.1-1+b3 amd64 Ruby bindings for libvirt ii ruby-listen 3.2.1-1 all Ruby library listening to file modifications ii ruby-log4r 1.1.10-4.1 all Logging library for Ruby ii ruby-mime-types 3.3.1-1 all guess MIME type of files ii ruby-mime-types-data 3.2015.1120-1.1 all registry for information about MIME media type definitions ii ruby-minitest 5.13.0-1 all Ruby test tools supporting TDD, BDD, mocking, and benchmarking ii ruby-multi-json 1.14.1-1 all Ruby library to provide easy switching between different JSON backends ii ruby-net-scp 3.0.0-1 all pure ruby implementation of the SCP protocol ii ruby-net-sftp 1:3.0.0-1 all Ruby implementation of the SFTP protocol ii ruby-net-ssh 1:6.1.0-2 all Ruby implementation of the SSH protocol ii ruby-net-telnet 0.1.1-2 all telnet client library ii ruby-netrc 0.11.0-3 all Ruby library to read and write netrc files ii ruby-nokogiri 1.11.1+dfsg-2 amd64 HTML, XML, SAX, and Reader parser for Ruby ii ruby-oj:amd64 3.11.0-1 amd64 fast JSON parser and serializer for Ruby ii ruby-pkg-config 1.4.4-1 all pkg-config implementation for Ruby ii ruby-power-assert 1.1.7-2 all library showing values of variables and method calls in an expression ii ruby-rb-inotify 0.9.10-2 all Ruby wrapper for Linux's inotify, using FFI ii ruby-rest-client 2.1.0-1 all simple REST client for Ruby ii ruby-rubygems 3.2.5-2 all Package management framework for 
Ruby ii ruby-sqlite3 1.4.2-3 amd64 SQLite3 interface for Ruby ii ruby-test-unit 3.3.9-1 all unit testing framework for Ruby ii ruby-unf 0.1.4-2 all Wrapper library to bring Unicode Normalization Form support to Ruby ii ruby-unf-ext 0.0.7.6-1+b3 amd64 Unicode Normalization Form support library for CRuby ii ruby-vagrant-cloud 3.0.2-1 all Vagrant Cloud API Library ii ruby-xmlrpc 0.3.0-2 all XMLRPC library for Ruby ii ruby-zip 2.0.0-2 all Ruby module for reading and writing zip files ii ruby2.7 2.7.4-1 amd64 Interpreter of object-oriented scripting language Ruby ii rubygems-integration 1.18 all integration of Debian Ruby packages with Rubygems ii vagrant 2.2.14+dfsg-1 all Tool for building and distributing virtualized development environments ii vagrant-libvirt 0.3.0-1 all Vagrant plugin that adds an Libvirt provider to Vagrant
http://bugzilla.opensuse.org/show_bug.cgi?id=1190632#c16

Dan Čermák <dcermak@suse.com> changed:
           What    |Removed |Added
----------------------------------------------------------------------
             Flags |        |needinfo?(luis.chamberlain@suse.com)

--- Comment #16 from Dan Čermák <dcermak@suse.com> ---
(In reply to Luis Chamberlain from comment #14)
> (In reply to Dan Čermák from comment #12)
> > (In reply to Luis Chamberlain from comment #5)
> > > (In reply to Dan Čermák from comment #3)
> > > > What is the actual issue at hand here?
> > >
> > > After a number of reboots using vagrant-libvirt on openSUSE
> > > Tumbleweed, ssh access times out (over 600 seconds, the ansible
> > > default).
> > >
> > > > I am honestly puzzled as to what the problem actually is: do your
> > > > VMs time out on reboot? Do you reboot them via vagrant?
> > >
> > > No, ansible is used to reboot, as noted in the first comment. I also
> > > looked at how ansible does this, and it is: /sbin/shutdown -r 0
> > >
> > > I also noted the exact algorithm used, in comment #1, about how it
> > > determines whether the reboot was successful or not.
> > >
> > > > Can you reproduce this consistently?
> > >
> > > Yes. Absolutely. At what reboot number? Well, it varies: sometimes
> > > within 10, sometimes within 80, sometimes after 600.
> > >
> > > On Debian testing I have no issues and got up to 1,000,000 easily.
> >
> > Could you export the environment variable `VAGRANT_LOG=DEBUG` during
> > your tests, capture stdout & stderr and upload the last logfile when
> > the box times out?
>
> There are no vagrant commands run for the reboot test. The provisioning
> all works fine. So do I still set this up when I run the ansible command
> to run the reboot-limit playbook?
Ah, ok. So vagrant just performs the `vagrant up` and that's it? Then I
honestly don't think that vagrant itself is the culprit here. Also, given
James' tests, I would expect that neither is libvirt, which leaves Ansible.

Could you retry your tests with Ansible installed from pip (preferably the
same version that you are using on Debian)? If you are really motivated,
you could retry the whole ordeal with the VirtualBox provider.
http://bugzilla.opensuse.org/show_bug.cgi?id=1190632#c17

--- Comment #17 from Luis Chamberlain <luis.chamberlain@suse.com> ---
(In reply to Dan Čermák from comment #16)
> (In reply to Luis Chamberlain from comment #14)
> > (In reply to Dan Čermák from comment #12)
> > > (In reply to Luis Chamberlain from comment #5)
> > > > (In reply to Dan Čermák from comment #3)
> > > > > What is the actual issue at hand here?
> > > >
> > > > After a number of reboots using vagrant-libvirt on openSUSE
> > > > Tumbleweed, ssh access times out (over 600 seconds, the ansible
> > > > default).
> > > >
> > > > > I am honestly puzzled as to what the problem actually is: do your
> > > > > VMs time out on reboot? Do you reboot them via vagrant?
> > > >
> > > > No, ansible is used to reboot, as noted in the first comment. I
> > > > also looked at how ansible does this, and it is: /sbin/shutdown -r 0
> > > >
> > > > I also noted the exact algorithm used, in comment #1, about how it
> > > > determines whether the reboot was successful or not.
> > > >
> > > > > Can you reproduce this consistently?
> > > >
> > > > Yes. Absolutely. At what reboot number? Well, it varies: sometimes
> > > > within 10, sometimes within 80, sometimes after 600.
> > > >
> > > > On Debian testing I have no issues and got up to 1,000,000 easily.
> > >
> > > Could you export the environment variable `VAGRANT_LOG=DEBUG` during
> > > your tests, capture stdout & stderr and upload the last logfile when
> > > the box times out?
> >
> > There are no vagrant commands run for the reboot test. The provisioning
> > all works fine. So do I still set this up when I run the ansible
> > command to run the reboot-limit playbook?
> Ah, ok. So vagrant just performs the `vagrant up` and that's it?

Right.

> Then I honestly don't think that vagrant itself is the culprit here.
> Also, given James' tests, I would expect that neither is libvirt, which
> leaves Ansible.

It was not clear to me whether vagrant uses a different dnsmasq or not,
but sure, we can try this approach.

> Could you retry your tests with Ansible installed from pip (preferably
> the same version that you are using on Debian)?

This required:

zypper remove ansible
pip install ansible==2.10.7

Tests are on their way.
http://bugzilla.opensuse.org/show_bug.cgi?id=1190632#c18

--- Comment #18 from James Fehlig <jfehlig@suse.com> ---
I checked in on my test today:

  Got pre-reboot uuid = f39b6cb8-86d5-46a3-8b6d-299630e74999
  Rebooting VM...
  Reboot scheduled for Sun 2021-09-26 09:17:32 MDT, use 'shutdown -c' to cancel.
  waiting for guest to boot...
  VM is up!!
  Got post-reboot uuid = ae38d87b-8251-4205-8687-e30150185015
  Boot UUIDs are different, reboot succeeded!
  Iteration: 12292 done

After more than 12k iterations I killed it. Will reboot the host to TW,
update it, and start the test again with a TW hypervisor and a 15 SP3
guest.
http://bugzilla.opensuse.org/show_bug.cgi?id=1190632#c19

--- Comment #19 from Luis Chamberlain <luis.chamberlain@suse.com> ---
(In reply to Luis Chamberlain from comment #17)
> (In reply to Dan Čermák from comment #16)
> > (In reply to Luis Chamberlain from comment #14)
> > > (In reply to Dan Čermák from comment #12)
> > > > (In reply to Luis Chamberlain from comment #5)
> > > > > (In reply to Dan Čermák from comment #3)
> > > > > > What is the actual issue at hand here?
> > > > >
> > > > > After a number of reboots using vagrant-libvirt on openSUSE
> > > > > Tumbleweed, ssh access times out (over 600 seconds, the ansible
> > > > > default).
> > > > >
> > > > > > I am honestly puzzled as to what the problem actually is: do
> > > > > > your VMs time out on reboot? Do you reboot them via vagrant?
> > > > >
> > > > > No, ansible is used to reboot, as noted in the first comment. I
> > > > > also looked at how ansible does this, and it is:
> > > > > /sbin/shutdown -r 0
> > > > >
> > > > > I also noted the exact algorithm used, in comment #1, about how
> > > > > it determines whether the reboot was successful or not.
> > > > >
> > > > > > Can you reproduce this consistently?
> > > > >
> > > > > Yes. Absolutely. At what reboot number? Well, it varies:
> > > > > sometimes within 10, sometimes within 80, sometimes after 600.
> > > > >
> > > > > On Debian testing I have no issues and got up to 1,000,000
> > > > > easily.
> > > >
> > > > Could you export the environment variable `VAGRANT_LOG=DEBUG`
> > > > during your tests, capture stdout & stderr and upload the last
> > > > logfile when the box times out?
> > >
> > > There are no vagrant commands run for the reboot test. The
> > > provisioning all works fine. So do I still set this up when I run the
> > > ansible command to run the reboot-limit playbook?
> >
> > Ah, ok. So vagrant just performs the `vagrant up` and that's it?
>
> Right.
>
> > Then I honestly don't think that vagrant itself is the culprit here.
> > Also, given James' tests, I would expect that neither is libvirt, which
> > leaves Ansible.
>
> It was not clear to me whether vagrant uses a different dnsmasq or not,
> but sure, we can try this approach.
>
> > Could you retry your tests with Ansible installed from pip (preferably
> > the same version that you are using on Debian)?
>
> This required:
>
> zypper remove ansible
> pip install ansible==2.10.7
>
> Tests are on their way.
So far, 100 reboots have worked with this version of ansible, 2.10.7, and
with the post_reboot_delay workaround removed. Going to kick off a
10,000-reboot test now.
http://bugzilla.opensuse.org/show_bug.cgi?id=1190632#c20

--- Comment #20 from Luis Chamberlain <luis.chamberlain@suse.com> ---
(In reply to Luis Chamberlain from comment #19)
> So far, 100 reboots have worked with this version of ansible, 2.10.7, and
> with the post_reboot_delay workaround removed. Going to kick off a
> 10,000-reboot test now.
Hit a failure on reboot #19 the second time around, due to the 600-second
timeout:

TASK [reboot-limit : Run the reboot test using the ansible reboot module] ******
fatal: [sles15sp3]: FAILED! => {"changed": false, "elapsed": 7305, "msg":
"Timed out waiting for last boot time check (timeout=600)", "rebooted": true}
Luis Chamberlain <luis.chamberlain@suse.com> changed:
           What    |Removed                              |Added
----------------------------------------------------------------------
             Flags |needinfo?(luis.chamberlain@suse.com) |
http://bugzilla.opensuse.org/show_bug.cgi?id=1190632#c22

--- Comment #22 from James Fehlig <jfehlig@suse.com> ---
Created attachment 852778
  --> http://bugzilla.opensuse.org/attachment.cgi?id=852778&action=edit
reboot test

This is my hack script for a reboot test. I don't think it needs much
explanation, but feel free to check it for correctness :-).

I had to tweak it a bit to work on a TW host. Previously I used ping to
determine when the guest was up, and then waited 10 sec for sshd to come
alive in the guest before trying to ssh and extract
/proc/sys/kernel/random/boot_id. This worked fine on 15 SP3, but on TW I
noticed sshd might not be up within the 10 sec. I adjusted the script to
check for an ssh connection when determining if the guest is up, and
restarted the test. Currently at 180 iterations.
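The adjustment described in that comment (polling the ssh connection itself
rather than ping plus a fixed 10-second grace period) can be sketched like
this. This is my own sketch, not the attached script; "guest" is a
hypothetical domain name and ssh alias, and the virsh/ssh line is shown only
as a comment so the timing helper can be exercised on its own:

```shell
#!/bin/sh
# wait_for CMD...: retry CMD until it succeeds or WAIT_TIMEOUT seconds
# elapse (600 by default, matching the bail-out described above).
wait_for() {
    deadline=$(( $(date +%s) + ${WAIT_TIMEOUT:-600} ))
    until "$@"; do
        [ "$(date +%s)" -ge "$deadline" ] && return 1
        sleep "${WAIT_INTERVAL:-5}"
    done
}

# One iteration of such a reboot loop might then be (not executed here):
#   virsh reboot guest
#   wait_for ssh -o ConnectTimeout=5 guest true || { echo "timed out"; exit 1; }

# Demonstrating the helper with a command that needs no guest:
wait_for true && echo "guest is up"
```

Polling the connection directly avoids guessing how long sshd needs after
the network answers ping, which is exactly the gap that differed between
the 15 SP3 and TW hosts.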
http://bugzilla.opensuse.org/show_bug.cgi?id=1190632#c23

--- Comment #23 from Luis Chamberlain <luis.chamberlain@suse.com> ---
(In reply to James Fehlig from comment #22)
> Created attachment 852778 [details]
> reboot test
>
> This is my hack script for a reboot test. I don't think it needs much
> explanation, but feel free to check it for correctness :-).
>
> I had to tweak it a bit to work on a TW host. Previously I used ping to
> determine when the guest was up, and then waited 10 sec for sshd to come
> alive in the guest before trying to ssh and extract
> /proc/sys/kernel/random/boot_id. This worked fine on 15 SP3, but on TW I
> noticed sshd might not be up within the 10 sec.
Ansible's default timeout is 600 seconds. It is not clear to me exactly how
ansible determines that the timeout lapsed, though. However, it would seem
what you are seeing might be related to the failure that ansible picks up
on. *Why* does it seem that ssh does not come up as quickly on Tumbleweed
vs SLE15-SP3?
http://bugzilla.opensuse.org/show_bug.cgi?id=1190632#c24

James Fehlig <jfehlig@suse.com> changed:
           What    |Removed     |Added
----------------------------------------------------------------------------
                CC|            |jose.ziviani@suse.com

--- Comment #24 from James Fehlig <jfehlig@suse.com> ---
(In reply to Luis Chamberlain from comment #23)
> Ansible's default timeout is 600 seconds. It is not clear to me exactly how ansible determines if the timeout lapsed though. However, it would seem what you are seeing might be related to the failure that ansible picks up on.
Perhaps. Let me word my observations another way: In a SLES15 SP3 guest, the time it takes for sshd to service requests once the network has come to life (responds to ping) is greater on a TW host than a SLES15 SP3 host.
> *Why* does it seem that ssh does not come up as quickly on Tumbleweed vs SLE15-SP3?
Not sure. The primary components in play that are different between the two are qemu and/or the host kernel. Maybe some vcpu scheduling or other small performance issue in TW. That said, I can't confirm 600 seconds was ever exceeded. With the original version of my script that ran on the SLES15 SP3 host, I can only confirm the guest's network was up within 600 seconds. The updated version of the script in comment #22 will bail out if the guest's ssh service does not respond within 600 seconds. BTW, the updated script has now run 2318 iterations on the TW host.
http://bugzilla.opensuse.org/show_bug.cgi?id=1190632#c25

--- Comment #25 from Luis Chamberlain <luis.chamberlain@suse.com> ---
I'm under the impression the issue is ssh command failures. It would seem that by default no retries are done by ansible. Using the following seems to be a bit promising:

diff --git a/ansible.cfg b/ansible.cfg
index b513ff9..4889be0 100644
--- a/ansible.cfg
+++ b/ansible.cfg
@@ -1,2 +1,3 @@
 [defaults]
 display_skipped_hosts = no
+retries = 2000

In the sense that at least on some hypervisors where I was seeing some really odd failures for other tests (blktests), I am not seeing the issues anymore. I'll be testing this with the reboot-limit workflow; if this fixes the issue, it would seem to be a workaround for a smooth ride. On Debian this is not needed, for instance.
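One caveat worth flagging (an observation from Ansible's configuration documentation, not from this thread): the ssh connection retry setting is read from the `[ssh_connection]` ini section (the same knob as the ANSIBLE_SSH_RETRIES environment variable), not `[defaults]`, so the diff above may not have taken effect as intended. Assuming that holds for the 2.9 release in use here, the equivalent would be:

```ini
[defaults]
display_skipped_hosts = no

[ssh_connection]
# Retry ssh connections that fail with "connection refused" or time out
# (equivalent to setting ANSIBLE_SSH_RETRIES in the environment).
retries = 2000
```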
http://bugzilla.opensuse.org/show_bug.cgi?id=1190632#c26

--- Comment #26 from Luis Chamberlain <luis.chamberlain@suse.com> ---
(In reply to James Fehlig from comment #24)
> (In reply to Luis Chamberlain from comment #23)
> > Ansible's default timeout is 600 seconds. It is not clear to me exactly how ansible determines if the timeout lapsed though. However, it would seem what you are seeing might be related to the failure that ansible picks up on.
> Perhaps. Let me word my observations another way: In a SLES15 SP3 guest, the time it takes for sshd to service requests once the network has come to life (responds to ping) is greater on a TW host than a SLES15 SP3 host.
OK it would be nice to know why, but I get it, we have other priorities.
> > *Why* does it seem that ssh does not come up as quickly on Tumbleweed vs SLE15-SP3?
> Not sure. The primary components in play that are different between the two are qemu and/or the host kernel. Maybe some vcpu scheduling or other small performance issue in TW.
> That said, I can't confirm 600 seconds was ever exceeded.
I'm under the impression certain ssh packets get lost somehow, and since it would seem ansible does not retry by default, this is a fatal condition with default ansible settings. I have 4 different systems using TW where this issue is present and sporadic, so it is not clear *why* the packets might be lost, but this is my current suspicion.
> With the original version of my script that ran on SLES15 SP3 host, I can only confirm the guest's network was up within 600 seconds. The updated version of the script in #22 will bail out if the guest's ssh service does not respond within 600 seconds. BTW, the updated script has now run 2318 iterations on the TW host.
Nice, thanks so much for testing this.
http://bugzilla.opensuse.org/show_bug.cgi?id=1190632#c28

--- Comment #28 from James Fehlig <jfehlig@suse.com> ---
I need to use the test machine for other purposes, so I killed my reboot test. It was on iteration 15447. I think it is safe to say reboot of VMs is working fine on both SLE15 SP3 and TW.
http://bugzilla.opensuse.org/show_bug.cgi?id=1190632#c29

--- Comment #29 from Luis Chamberlain <luis.chamberlain@suse.com> ---
(In reply to Shung-Hsi Yu from comment #27)
> Randomly throwing my 2¢
> (In reply to Luis Chamberlain from comment #26)
> [snip]
> > > > *Why* does it seem that ssh does not come up as quickly on Tumbleweed vs SLE15-SP3?
> > > Not sure. The primary components in play that are different between the two are qemu and/or the host kernel. Maybe some vcpu scheduling or other small performance issue in TW.
> > > That said, I can't confirm 600 seconds was ever exceeded.
> > I'm under the impression certain ssh packets get lost somehow, and since it would seem ansible does not retry by default, this is a fatal condition with default ansible settings.
> If ansible does not retry by default, then perhaps tuning sysctl net.ipv4.tcp_syn_retries on the TW host may help, as it allows the host OS to wait a bit longer during the initial TCP connection to the guest's sshd.
Thanks for the tip! A question quite relevant here: should this really be needed for a libvirt virtual network? FWIW, the default seems to be 6 for both TW and SLES15-SP3.
> The default seems to be 6 retries, which times out around 130s[1]; `sudo sysctl -w net.ipv4.tcp_syn_retries=8` should tell the OS to wait until just under 600s. (Note: the setting is reset on reboot.)
I will try this if increasing the retries for ansible fails.
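The ~130s and just-under-600s figures quoted above follow from TCP's exponential SYN backoff; a back-of-the-envelope sketch, assuming the conventional 1-second initial retransmission timeout:

```python
# Time until the kernel gives up on an unanswered SYN: the initial attempt
# plus `syn_retries` retransmissions, each waiting twice as long as the last.
def syn_give_up_seconds(syn_retries, initial_rto=1):
    return sum(initial_rto * 2 ** i for i in range(syn_retries + 1))

print(syn_give_up_seconds(6))  # 127 -- the ~130s quoted for the default of 6
print(syn_give_up_seconds(8))  # 511 -- just under ansible's 600s reboot timeout
```

So with tcp_syn_retries=8, a single connection attempt can keep retrying for roughly 511 seconds before the kernel gives up, which is why it was suggested as a fit for the 600-second Ansible timeout.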
> It may also help to capture the network traffic between the host and guest with something like `sudo tcpdump -i $VIRT_INTERFACE -W 3 -G 600 -w dump`, which only keeps the packets from the last 30 minutes, split across 3 files; interrupt it when the timeout has occurred and upload the files here for analysis.
Thanks, this can be a next step here for sure!
http://bugzilla.opensuse.org/show_bug.cgi?id=1190632#c30

--- Comment #30 from Luis Chamberlain <luis.chamberlain@suse.com> ---
(In reply to Luis Chamberlain from comment #25)
> diff --git a/ansible.cfg b/ansible.cfg
> index b513ff9..4889be0 100644
> --- a/ansible.cfg
> +++ b/ansible.cfg
> @@ -1,2 +1,3 @@
>  [defaults]
>  display_skipped_hosts = no
> +retries = 2000
This setting only got me to 8 reboots... so going to try the sysctl -w net.ipv4.tcp_syn_retries=8 now...
http://bugzilla.opensuse.org/show_bug.cgi?id=1190632#c31

--- Comment #31 from Luis Chamberlain <luis.chamberlain@suse.com> ---
(In reply to Luis Chamberlain from comment #30)
> (In reply to Luis Chamberlain from comment #25)
> > diff --git a/ansible.cfg b/ansible.cfg
> > index b513ff9..4889be0 100644
> > --- a/ansible.cfg
> > +++ b/ansible.cfg
> > @@ -1,2 +1,3 @@
> >  [defaults]
> >  display_skipped_hosts = no
> > +retries = 2000
> This setting only got me to 8 reboots... so going to try the sysctl -w net.ipv4.tcp_syn_retries=8 now...
This only helps slightly; got up to 11 reboots. Will try to capture a tcpdump...
http://bugzilla.opensuse.org/show_bug.cgi?id=1190632#c32

--- Comment #32 from Luis Chamberlain <luis.chamberlain@suse.com> ---
I have the tcpdump ready; it's 13 MiB, compressed with bzip2. You can find it in my home directory on wotan: bsc1190632.dump.bz2

This was captured during a run of:

make reboot-limit-baseline-kernelci

The command used to capture the dump:

sudo tcpdump -i vnet44 -W 3 -G 600 -w dump

I used vnet44 because:

mcgrof@ives /data1/reboot-limit/sles15sp3/sle-kdevops/kdevops (git::master)$ sudo virsh dumpxml vagrant_sles15sp3 | grep vnet
    <target dev='vnet44'/>
    <target dev='vnet45'/>

And the extra network on the guest should be idle.
http://bugzilla.opensuse.org/show_bug.cgi?id=1190632#c33

--- Comment #33 from Luis Chamberlain <luis.chamberlain@suse.com> ---
Finally, I came across an ansible release with which the timeout issues are not happening. I used OpenSUSE Tumbleweed as the base. The version of ansible:

mcgrof@gehrke ~ $ sudo pip show ansible
Name: ansible
Version: 4.6.0
Summary: Radically simple IT automation
Home-page: https://ansible.com/
Author: Ansible, Inc.
Author-email: info@ansible.com
License: GPLv3+
Location: /usr/local/lib/python3.8/site-packages
Requires: ansible-core
Required-by:

mcgrof@gehrke ~ $ sudo pip show ansible-base
Name: ansible-base
Version: 2.10.14
Summary: Radically simple IT automation
Home-page: https://ansible.com/
Author: Ansible, Inc.
Author-email: info@ansible.com
License: GPLv3+
Location: /usr/local/lib/python3.8/site-packages
Requires: PyYAML, cryptography, jinja2, packaging
Required-by:
http://bugzilla.opensuse.org/show_bug.cgi?id=1190632#c34

Dan Čermák <dcermak@suse.com> changed:
           What    |Removed     |Added
----------------------------------------------------------------------------
            Status|NEW         |RESOLVED
        Resolution|---         |INVALID

--- Comment #34 from Dan Čermák <dcermak@suse.com> ---
Closing this as resolved then.
http://bugzilla.opensuse.org/show_bug.cgi?id=1190632#c35

Luis Chamberlain <luis.chamberlain@suse.com> changed:
           What    |Removed     |Added
----------------------------------------------------------------------------
            Status|RESOLVED    |REOPENED
        Resolution|INVALID     |---

--- Comment #35 from Luis Chamberlain <luis.chamberlain@suse.com> ---
(In reply to Dan Čermák from comment #34)
> Closing this as resolved then.
Unfortunately the issue is not resolved yet. The reboot limit workflow does not pass on two of the hosts: ives (failed on reboot 2) and huber (failed on reboot 4).
http://bugzilla.opensuse.org/show_bug.cgi?id=1190632#c36

--- Comment #36 from Luis Chamberlain <luis.chamberlain@suse.com> ---
(In reply to Luis Chamberlain from comment #35)
> (In reply to Dan Čermák from comment #34)
> > Closing this as resolved then.
> Unfortunately the issue is not resolved yet. The reboot limit workflow does not pass on two of the hosts: ives (failed on reboot 2) and huber (failed on reboot 4).
So there are 4 kernel-ci hosts; below is their status on this test:

* gehrke: passed 100 loops (so 10,000 reboots)
* hadar: on reboot loop number 64 (so 6,400 reboots so far)
* ives: failed on test loop 2 (200 reboots)
* huber: failed on test loop 4 (400 reboots)
http://bugzilla.opensuse.org/show_bug.cgi?id=1190632#c37

--- Comment #37 from Luis Chamberlain <luis.chamberlain@suse.com> ---
(In reply to Luis Chamberlain from comment #36)
> * hadar: on reboot loop number 64 (ie, so 6,400 reboots)
And to be clear, this test is ongoing, so we'll see if it reaches loop 100 successfully.
http://bugzilla.opensuse.org/show_bug.cgi?id=1190632#c38

--- Comment #38 from Luis Chamberlain <luis.chamberlain@suse.com> ---
You may be interested in these findings too: https://bugzilla.kernel.org/show_bug.cgi?id=216040