[Bug 1212816] New: salt-minions from some machines do not return "Minion did not return. [No response]" since salt-3006.0-150400.8.34.2
https://bugzilla.suse.com/show_bug.cgi?id=1212816

Bug ID: 1212816
Summary: salt-minions from some machines do not return "Minion did not return. [No response]" since salt-3006.0-150400.8.34.2
Classification: openSUSE
Product: openSUSE Distribution
Version: Leap 15.4
Hardware: Other
OS: Other
Status: NEW
Severity: Major
Priority: P5 - None
Component: Salt
Assignee: salt-maintainers@suse.de
Reporter: okurz@suse.com
QA Contact: qa-bugs@suse.de
Target Milestone: ---
Found By: ---
Blocker: ---

## Observation

We observed that in our salt-managed infrastructure the salt-minions on some machines reproducibly stop returning any result since the upgrade salt-3004-150400.8.25.1 -> salt-3006.0-150400.8.34.2. Salt commands report "Minion did not return. [No response]" for these machines.

In our infrastructure we currently have 31 machines controlled by salt. Of those 31 machines, 6 reproducibly stop responding after some time, regardless of the salt command used. The other 25 machines are not affected; all have the most recent salt package version installed and run an up-to-date Leap 15.4. We run machines with the architectures x86_64, aarch64 and ppc64le and have affected as well as non-affected machines of each architecture, so this cannot be an architecture-specific issue.

The issue is visible when executing any salt command such as `test.ping`, as well as in `salt-run manage.down`, which lists the unresponsive salt minion nodes.

## Steps to reproduce

In our infrastructure we reproduce the problem with:

```
for i in {1..7200}; do echo "### Run $i -- $(date -Is)" && salt --no-color \* test.ping ; df -i / ; salt-run jobs.list_jobs | wc -l && salt --no-color \* saltutil.kill_all_jobs && sleep 60 && rm -rf /var/cache/salt/master/jobs/*; done | tee -a log_salt_test_ping_poo131249_$(date -Is).log
```

This shows the problem on the affected machines after roughly one hour.
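As a minimal sketch for spotting the affected minions in the output of such a loop, the hypothetical helper `list_unresponsive` below (not part of Salt) filters the plain-text output of `salt --no-color \* test.ping`. It assumes the default output layout, where each minion ID appears on an unindented line ending in `:` followed by an indented result line. The minion names in any test data are made up.

```shell
#!/bin/sh
# Hypothetical helper, not part of Salt: read `salt --no-color` output on
# stdin and print the ID of every minion reported as not returning.
# Assumption: an unindented "<minion-id>:" line precedes the indented result.
list_unresponsive() {
    awk '
        # Remember the most recent minion ID (unindented line ending in ":").
        /^[^ ].*:$/ { id = substr($0, 1, length($0) - 1); next }
        # Print that ID when its result is the "did not return" message.
        /Minion did not return/ && id != "" { print id }
    '
}
```

With the function defined in the current shell, a single check could be run as `salt --no-color \* test.ping | list_unresponsive`, which prints one minion ID per line and nothing when every minion answered.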
The command can be simplified by removing the counting, logging and cleanup:

```
while :; do salt \* test.ping; done
```

Note that without a sleep between loop iterations, this eventually exhausted the free inodes on our salt master node.

## Problem

We narrowed the problem down to the salt package upgrade from salt-3004-150400.8.25.1 to salt-3006.0-150400.8.34.2. Full changelog of the package for this upgrade step:

```
* Mon Jun 19 2023 pablo.suarezhernandez@suse.com
- Make master_tops compatible with Salt 3000 and older minions (bsc#1212516) (bsc#1212517)
- Added:
  * make-master_tops-compatible-with-salt-3000-and-older.patch

* Mon May 29 2023 yeray.gutierrez@suse.com
- Avoid failures due transactional_update module not available in Salt 3006.0 (bsc#1211754)
- Added:
  * define-__virtualname__-for-transactional_update-modu.patch

* Wed May 24 2023 pablo.suarezhernandez@suse.com
- Avoid conflicts with Salt dependencies versions (bsc#1211612)
- Added:
  * avoid-conflicts-with-dependencies-versions-bsc-12116.patch

* Fri May 05 2023 alexander.graul@suse.com
- Update to Salt release version 3006.0 (jsc#PED-4360)
  * See release notes: https://docs.saltproject.io/en/latest/topics/releases/3006.0.html
- Add missing patch after rebase to fix collections Mapping issues
- Add python3-looseversion as new dependency for salt
- Add python3-packaging as new dependency for salt
- Allow entrypoint compatibility for "importlib-metadata>=5.0.0" (bsc#1207071)
- Create new salt-tests subpackage containing Salt tests
- Drop conflictive patch dicarded from upstream
- Fix SLS rendering error when Jinja macros are used
- Fix version detection and avoid building and testing failures
- Prevent deadlocks in salt-ssh executions
- Require python3-jmespath runtime dependency (bsc#1209233)
- Added:
  * 3005.1-implement-zypper-removeptf-573.patch
  * control-the-collection-of-lvm-grains-via-config.patch
  * fix-version-detection-and-avoid-building-and-testing.patch
  * make-sure-the-file-client-is-destroyed-upon-used.patch
  * skip-package-names-without-colon-bsc-1208691-578.patch
  * use-rlock-to-avoid-deadlocks-in-salt-ssh.patch
- Modified:
  * activate-all-beacons-sources-config-pillar-grains.patch
  * add-custom-suse-capabilities-as-grains.patch
  * add-environment-variable-to-know-if-yum-is-invoked-f.patch
  * add-migrated-state-and-gpg-key-management-functions-.patch
  * add-publish_batch-to-clearfuncs-exposed-methods.patch
  * add-salt-ssh-support-with-venv-salt-minion-3004-493.patch
  * add-sleep-on-exception-handling-on-minion-connection.patch
  * add-standalone-configuration-file-for-enabling-packa.patch
  * add-support-for-gpgautoimport-539.patch
  * allow-vendor-change-option-with-zypper.patch
  * async-batch-implementation.patch
  * avoid-excessive-syslogging-by-watchdog-cronjob-58.patch
  * bsc-1176024-fix-file-directory-user-and-group-owners.patch
  * change-the-delimeters-to-prevent-possible-tracebacks.patch
  * debian-info_installed-compatibility-50453.patch
  * dnfnotify-pkgset-plugin-implementation-3002.2-450.patch
  * do-not-load-pip-state-if-there-is-no-3rd-party-depen.patch
  * don-t-use-shell-sbin-nologin-in-requisites.patch
  * drop-serial-from-event.unpack-in-cli.batch_async.patch
  * early-feature-support-config.patch
  * enable-passing-a-unix_socket-for-mysql-returners-bsc.patch
  * enhance-openscap-module-add-xccdf_eval-call-386.patch
  * fix-bsc-1065792.patch
  * fix-for-suse-expanded-support-detection.patch
  * fix-issue-2068-test.patch
  * fix-missing-minion-returns-in-batch-mode-360.patch
  * fix-ownership-of-salt-thin-directory-when-using-the-.patch
  * fix-regression-with-depending-client.ssh-on-psutil-b.patch
  * fix-salt-ssh-opts-poisoning-bsc-1197637-3004-501.patch
  * fix-salt.utils.stringutils.to_str-calls-to-make-it-w.patch
  * fix-the-regression-for-yumnotify-plugin-456.patch
  * fix-traceback.print_exc-calls-for-test_pip_state-432.patch
  * fixes-for-python-3.10-502.patch
  * include-aliases-in-the-fqdns-grains.patch
  * info_installed-works-without-status-attr-now.patch
  * let-salt-ssh-use-platform-python-binary-in-rhel8-191.patch
  * make-aptpkg.list_repos-compatible-on-enabled-disable.patch
  * make-setup.py-script-to-not-require-setuptools-9.1.patch
  * pass-the-context-to-pillar-ext-modules.patch
  * prevent-affection-of-ssh.opts-with-lazyloader-bsc-11.patch
  * prevent-pkg-plugins-errors-on-missing-cookie-path-bs.patch
  * prevent-shell-injection-via-pre_flight_script_args-4.patch
  * read-repo-info-without-using-interpolation-bsc-11356.patch
  * restore-default-behaviour-of-pkg-list-return.patch
  * return-the-expected-powerpc-os-arch-bsc-1117995.patch
  * revert-fixing-a-use-case-when-multiple-inotify-beaco.patch
  * run-salt-api-as-user-salt-bsc-1064520.patch
  * run-salt-master-as-dedicated-salt-user.patch
  * save-log-to-logfile-with-docker.build.patch
  * switch-firewalld-state-to-use-change_interface.patch
  * temporary-fix-extend-the-whitelist-of-allowed-comman.patch
  * update-target-fix-for-salt-ssh-to-process-targets-li.patch
  * use-adler32-algorithm-to-compute-string-checksums.patch
  * use-salt-bundle-in-dockermod.patch
  * x509-fixes-111.patch
  * zypperpkg-ignore-retcode-104-for-search-bsc-1176697-.patch
- Removed:
  * 3003.3-do-not-consider-skipped-targets-as-failed-for.patch
  * 3003.3-postgresql-json-support-in-pillar-423.patch
  * add-amazon-ec2-detection-for-virtual-grains-bsc-1195.patch
  * add-missing-ansible-module-functions-to-whitelist-in.patch
  * add-rpm_vercmp-python-library-for-version-comparison.patch
  * add-support-for-name-pkgs-and-diff_attr-parameters-t.patch
  * adds-explicit-type-cast-for-port.patch
  * align-amazon-ec2-nitro-grains-with-upstream-pr-bsc-1.patch
  * backport-syndic-auth-fixes.patch
  * batch.py-avoid-exception-when-minion-does-not-respon.patch
  * check-if-dpkgnotify-is-executable-bsc-1186674-376.patch
  * clarify-pkg.installed-pkg_verify-documentation.patch
  * detect-module.run-syntax.patch
  * do-not-crash-when-unexpected-cmd-output-at-listing-p.patch
  * enhance-logging-when-inotify-beacon-is-missing-pyino.patch
  * fix-62092-catch-zmq.error.zmqerror-to-set-hwm-for-zm.patch
  * fix-crash-when-calling-manage.not_alive-runners.patch
  * fixes-pkg.version_cmp-on-openeuler-systems-and-a-few.patch
  * fix-exception-in-yumpkg.remove-for-not-installed-pac.patch
  * fix-for-cve-2022-22967-bsc-1200566.patch
  * fix-inspector-module-export-function-bsc-1097531-481.patch
  * fix-ip6_interface-grain-to-not-leak-secondary-ipv4-a.patch
  * fix-issues-with-salt-ssh-s-extra-filerefs.patch
  * fix-jinja2-contextfuntion-base-on-version-bsc-119874.patch
  * fix-multiple-security-issues-bsc-1197417.patch
  * fix-salt-call-event.send-call-with-grains-and-pillar.patch
  * fix-salt.states.file.managed-for-follow_symlinks-tru.patch
  * fix-state.apply-in-test-mode-with-file-state-module-.patch
  * fix-test_ipc-unit-tests.patch
  * fix-the-regression-in-schedule-module-releasded-in-3.patch
  * fix-wrong-test_mod_del_repo_multiline_values-test-af.patch
  * fixes-56144-to-enable-hotadd-profile-support.patch
  * fopen-workaround-bad-buffering-for-binary-mode-563.patch
  * force-zyppnotify-to-prefer-packages.db-than-packages.patch
  * ignore-erros-on-reading-license-files-with-dpkg_lowp.patch
  * ignore-extend-declarations-from-excluded-sls-files.patch
  * ignore-non-utf8-characters-while-reading-files-with-.patch
  * implementation-of-held-unheld-functions-for-state-pk.patch
  * implementation-of-suse_ip-execution-module-bsc-10999.patch
  * improvements-on-ansiblegate-module-354.patch
  * include-stdout-in-error-message-for-zypperpkg-559.patch
  * make-pass-renderer-configurable-other-fixes-532.patch
  * make-sure-saltcacheloader-use-correct-fileclient-519.patch
  * mock-ip_addrs-in-utils-minions.py-unit-test-443.patch
  * normalize-package-names-once-with-pkg.installed-remo.patch
  * notify-beacon-for-debian-ubuntu-systems-347.patch
  * refactor-and-improvements-for-transactional-updates-.patch
  * retry-if-rpm-lock-is-temporarily-unavailable-547.patch
  * set-default-target-for-pip-from-venv_pip_target-envi.patch
  * state.apply-don-t-check-for-cached-pillar-errors.patch
  * state.orchestrate_single-does-not-pass-pillar-none-4.patch
  * support-transactional-systems-microos.patch
  * wipe-notify_socket-from-env-in-cmdmod-bsc-1193357-30.patch
```

## Workaround

Restarting the salt-minion systemd service on affected machines with `systemctl restart salt-minion` mitigates the problem for one to several hours, until the minion becomes unresponsive again.

For now we have downgraded the affected machines, except for one that we keep as a purposely broken machine. We can offer access to it, with some limitations, to anyone interested in investigating further.

## Further details

Please find our internal investigation issue at https://progress.opensuse.org/issues/131249

--
You are receiving this mail because:
You are on the CC list for the bug.
https://bugzilla.suse.com/show_bug.cgi?id=1212816#c4

Oliver Kurz <okurz@suse.com> changed:

What   |Removed                    |Added
----------------------------------------------------------------------------
Flags  |needinfo?(okurz@suse.com)  |

--- Comment #4 from Oliver Kurz <okurz@suse.com> ---
We have all those machines covered in monitoring using a Grafana instance (https://monitor.qa.suse.de). We have not observed any OOM condition leading to the problem during the observed time period. Please see https://bugzilla.opensuse.org/attachment.cgi?id=867935 for logs.

As Marius Kittler noted, the problem is also reproducible on Leap 15.5 with the corresponding package version, which points to a clear regression since 150400.8.25.1.
https://bugzilla.suse.com/show_bug.cgi?id=1212816#c5

--- Comment #5 from Oliver Kurz <okurz@suse.com> ---
We have observed that multiple machines running Leap 15.5 with salt-3005 eventually show the same "No response" problem. A forced install of the Leap 15.4 salt-3004 package on Leap 15.5 seems to work fine.
https://bugzilla.suse.com/show_bug.cgi?id=1212816#c9

Oliver Kurz <okurz@suse.com> changed:

What    |Removed   |Added
----------------------------------------------------------------------------
Status  |RESOLVED  |VERIFIED

--- Comment #9 from Oliver Kurz <okurz@suse.com> ---
https://progress.opensuse.org/issues/131249#note-54