Bug ID 1212816
Summary salt-minions on some machines stop responding ("Minion did not return. [No response]") since salt-3006.0-150400.8.34.2
Classification openSUSE
Product openSUSE Distribution
Version Leap 15.4
Hardware Other
OS Other
Status NEW
Severity Major
Priority P5 - None
Component Salt
Assignee salt-maintainers@suse.de
Reporter okurz@suse.com
QA Contact qa-bugs@suse.de
Target Milestone ---
Found By ---
Blocker ---

## Observation

Since the upgrade salt-3004-150400.8.25.1->salt-3006.0-150400.8.34.2 the
salt-minion on some machines in our salt-managed infrastructure reproducibly
stops returning results. The salt master then reports "Minion did not return.
[No response]" for these minions. In our infrastructure we currently have 31
machines controlled by salt. Of those 31 machines, 6 reproducibly stop
responding after some time, regardless of the salt command used. The other 25
machines are not affected; all have the most up-to-date salt package version
installed and an up-to-date Leap 15.4. We run machines with the architectures
x86_64, aarch64 and ppc64le and have affected as well as unaffected machines of
each architecture, so this cannot be an architecture-specific issue. The issue
is visible when executing any salt command, such as `test.ping`, and in the
output of `salt-run manage.down`, which lists the unresponsive salt minion
nodes.

## Steps to reproduce

In our infrastructure we reproduce the problem with:

```
for i in {1..7200}; do
  echo "### Run $i -- $(date -Is)" && salt --no-color \* test.ping
  df -i /
  salt-run jobs.list_jobs | wc -l && salt --no-color \* saltutil.kill_all_jobs &&
    sleep 60 && rm -rf /var/cache/salt/master/jobs/*
done | tee -a log_salt_test_ping_poo131249_$(date -Is).log
```

showing the problem for the affected machines after roughly one hour.
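
To see which minions fail and how often, the resulting log can be summarized.
The following is only a sketch: the function name `count_no_response` is made
up for illustration, and it assumes the usual `salt --no-color` output layout,
in which each minion ID stands on its own line ending in a colon and is
followed by the indented result.

```shell
# Hypothetical helper, not part of salt: count per minion how often the log
# contains "Minion did not return". Reads the given log files, or stdin.
# Assumption: each result is preceded by a line "<minion-id>:" without spaces.
count_no_response() {
  awk '
    # A minion header line such as "worker2:" starts a new result.
    /^[^[:space:]#][^[:space:]]*:$/ { minion = substr($0, 1, length($0) - 1); next }
    # Any other non-indented line (run markers, df output) ends the result.
    /^[^[:space:]]/ { minion = ""; next }
    minion != "" && /Minion did not return/ { missed[minion]++ }
    END { for (m in missed) print missed[m], m }
  ' "$@" | sort -k1,1nr -k2,2
}
```

Called as `count_no_response log_salt_test_ping_poo131249_*.log`, it prints
one line per unresponsive minion, with the number of missed runs first and the
most frequently missing minion at the top.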

The command can be simplified by removing counting, logging, cleanup:

```
while :; do salt \* test.ping; done
```

Note that without sleeping between each loop iteration this eventually
exhausted free inodes on our salt master node.
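
A bounded variant of the simplified loop with a pause between iterations could
look like the sketch below. The function name `repeat_with_pause` and its
parameters are chosen here for illustration only. It does not include the job
cache cleanup from the full command above, so the inode usage on the master
should still be watched.

```shell
# Hypothetical helper: run a command RUNS times and sleep PAUSE seconds after
# each run. Usage: repeat_with_pause RUNS PAUSE COMMAND [ARGS...]
# The body runs in a subshell "( ... )" so its variables do not leak.
repeat_with_pause() (
  runs=$1
  pause=$2
  shift 2
  i=1
  while [ "$i" -le "$runs" ]; do
    echo "### Run $i -- $(date -Is)"
    # Keep looping even if the command fails, but record the exit status.
    "$@" || echo "!!! Run $i: command exited with status $?"
    sleep "$pause"
    i=$((i + 1))
  done
)
```

For example, `repeat_with_pause 7200 60 salt --no-color '*' test.ping` would
ping all minions once per minute for up to five days.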


## Problem

We could narrow down the problem to the salt package upgrade from
salt-3004-150400.8.25.1 to salt-3006.0-150400.8.34.2.
Full changelog of the package for this upgrade step:

```
* Mon Jun 19 2023 pablo.suarezhernandez@suse.com
- Make master_tops compatible with Salt 3000 and older minions (bsc#1212516)
(bsc#1212517)
- Added:
  * make-master_tops-compatible-with-salt-3000-and-older.patch

* Mon May 29 2023 yeray.gutierrez@suse.com
- Avoid failures due transactional_update module not available in Salt 3006.0
(bsc#1211754)
- Added:
  * define-__virtualname__-for-transactional_update-modu.patch

* Wed May 24 2023 pablo.suarezhernandez@suse.com
- Avoid conflicts with Salt dependencies versions (bsc#1211612)
- Added:
  * avoid-conflicts-with-dependencies-versions-bsc-12116.patch

* Fri May 05 2023 alexander.graul@suse.com
- Update to Salt release version 3006.0 (jsc#PED-4360)
  * See release notes:
https://docs.saltproject.io/en/latest/topics/releases/3006.0.html
- Add missing patch after rebase to fix collections Mapping issues
- Add python3-looseversion as new dependency for salt
- Add python3-packaging as new dependency for salt
- Allow entrypoint compatibility for "importlib-metadata>=5.0.0" (bsc#1207071)
- Create new salt-tests subpackage containing Salt tests
- Drop conflictive patch dicarded from upstream
- Fix SLS rendering error when Jinja macros are used
- Fix version detection and avoid building and testing failures
- Prevent deadlocks in salt-ssh executions
- Require python3-jmespath runtime dependency (bsc#1209233)
- Added:
  * 3005.1-implement-zypper-removeptf-573.patch
  * control-the-collection-of-lvm-grains-via-config.patch
  * fix-version-detection-and-avoid-building-and-testing.patch
  * make-sure-the-file-client-is-destroyed-upon-used.patch
  * skip-package-names-without-colon-bsc-1208691-578.patch
  * use-rlock-to-avoid-deadlocks-in-salt-ssh.patch
- Modified:
  * activate-all-beacons-sources-config-pillar-grains.patch
  * add-custom-suse-capabilities-as-grains.patch
  * add-environment-variable-to-know-if-yum-is-invoked-f.patch
  * add-migrated-state-and-gpg-key-management-functions-.patch
  * add-publish_batch-to-clearfuncs-exposed-methods.patch
  * add-salt-ssh-support-with-venv-salt-minion-3004-493.patch
  * add-sleep-on-exception-handling-on-minion-connection.patch
  * add-standalone-configuration-file-for-enabling-packa.patch
  * add-support-for-gpgautoimport-539.patch
  * allow-vendor-change-option-with-zypper.patch
  * async-batch-implementation.patch
  * avoid-excessive-syslogging-by-watchdog-cronjob-58.patch
  * bsc-1176024-fix-file-directory-user-and-group-owners.patch
  * change-the-delimeters-to-prevent-possible-tracebacks.patch
  * debian-info_installed-compatibility-50453.patch
  * dnfnotify-pkgset-plugin-implementation-3002.2-450.patch
  * do-not-load-pip-state-if-there-is-no-3rd-party-depen.patch
  * don-t-use-shell-sbin-nologin-in-requisites.patch
  * drop-serial-from-event.unpack-in-cli.batch_async.patch
  * early-feature-support-config.patch
  * enable-passing-a-unix_socket-for-mysql-returners-bsc.patch
  * enhance-openscap-module-add-xccdf_eval-call-386.patch
  * fix-bsc-1065792.patch
  * fix-for-suse-expanded-support-detection.patch
  * fix-issue-2068-test.patch
  * fix-missing-minion-returns-in-batch-mode-360.patch
  * fix-ownership-of-salt-thin-directory-when-using-the-.patch
  * fix-regression-with-depending-client.ssh-on-psutil-b.patch
  * fix-salt-ssh-opts-poisoning-bsc-1197637-3004-501.patch
  * fix-salt.utils.stringutils.to_str-calls-to-make-it-w.patch
  * fix-the-regression-for-yumnotify-plugin-456.patch
  * fix-traceback.print_exc-calls-for-test_pip_state-432.patch
  * fixes-for-python-3.10-502.patch
  * include-aliases-in-the-fqdns-grains.patch
  * info_installed-works-without-status-attr-now.patch
  * let-salt-ssh-use-platform-python-binary-in-rhel8-191.patch
  * make-aptpkg.list_repos-compatible-on-enabled-disable.patch
  * make-setup.py-script-to-not-require-setuptools-9.1.patch
  * pass-the-context-to-pillar-ext-modules.patch
  * prevent-affection-of-ssh.opts-with-lazyloader-bsc-11.patch
  * prevent-pkg-plugins-errors-on-missing-cookie-path-bs.patch
  * prevent-shell-injection-via-pre_flight_script_args-4.patch
  * read-repo-info-without-using-interpolation-bsc-11356.patch
  * restore-default-behaviour-of-pkg-list-return.patch
  * return-the-expected-powerpc-os-arch-bsc-1117995.patch
  * revert-fixing-a-use-case-when-multiple-inotify-beaco.patch
  * run-salt-api-as-user-salt-bsc-1064520.patch
  * run-salt-master-as-dedicated-salt-user.patch
  * save-log-to-logfile-with-docker.build.patch
  * switch-firewalld-state-to-use-change_interface.patch
  * temporary-fix-extend-the-whitelist-of-allowed-comman.patch
  * update-target-fix-for-salt-ssh-to-process-targets-li.patch
  * use-adler32-algorithm-to-compute-string-checksums.patch
  * use-salt-bundle-in-dockermod.patch
  * x509-fixes-111.patch
  * zypperpkg-ignore-retcode-104-for-search-bsc-1176697-.patch
- Removed:
  * 3003.3-do-not-consider-skipped-targets-as-failed-for.patch
  * 3003.3-postgresql-json-support-in-pillar-423.patch
  * add-amazon-ec2-detection-for-virtual-grains-bsc-1195.patch
  * add-missing-ansible-module-functions-to-whitelist-in.patch
  * add-rpm_vercmp-python-library-for-version-comparison.patch
  * add-support-for-name-pkgs-and-diff_attr-parameters-t.patch
  * adds-explicit-type-cast-for-port.patch
  * align-amazon-ec2-nitro-grains-with-upstream-pr-bsc-1.patch
  * backport-syndic-auth-fixes.patch
  * batch.py-avoid-exception-when-minion-does-not-respon.patch
  * check-if-dpkgnotify-is-executable-bsc-1186674-376.patch
  * clarify-pkg.installed-pkg_verify-documentation.patch
  * detect-module.run-syntax.patch
  * do-not-crash-when-unexpected-cmd-output-at-listing-p.patch
  * enhance-logging-when-inotify-beacon-is-missing-pyino.patch
  * fix-62092-catch-zmq.error.zmqerror-to-set-hwm-for-zm.patch
  * fix-crash-when-calling-manage.not_alive-runners.patch
  * fixes-pkg.version_cmp-on-openeuler-systems-and-a-few.patch
  * fix-exception-in-yumpkg.remove-for-not-installed-pac.patch
  * fix-for-cve-2022-22967-bsc-1200566.patch
  * fix-inspector-module-export-function-bsc-1097531-481.patch
  * fix-ip6_interface-grain-to-not-leak-secondary-ipv4-a.patch
  * fix-issues-with-salt-ssh-s-extra-filerefs.patch
  * fix-jinja2-contextfuntion-base-on-version-bsc-119874.patch
  * fix-multiple-security-issues-bsc-1197417.patch
  * fix-salt-call-event.send-call-with-grains-and-pillar.patch
  * fix-salt.states.file.managed-for-follow_symlinks-tru.patch
  * fix-state.apply-in-test-mode-with-file-state-module-.patch
  * fix-test_ipc-unit-tests.patch
  * fix-the-regression-in-schedule-module-releasded-in-3.patch
  * fix-wrong-test_mod_del_repo_multiline_values-test-af.patch
  * fixes-56144-to-enable-hotadd-profile-support.patch
  * fopen-workaround-bad-buffering-for-binary-mode-563.patch
  * force-zyppnotify-to-prefer-packages.db-than-packages.patch
  * ignore-erros-on-reading-license-files-with-dpkg_lowp.patch
  * ignore-extend-declarations-from-excluded-sls-files.patch
  * ignore-non-utf8-characters-while-reading-files-with-.patch
  * implementation-of-held-unheld-functions-for-state-pk.patch
  * implementation-of-suse_ip-execution-module-bsc-10999.patch
  * improvements-on-ansiblegate-module-354.patch
  * include-stdout-in-error-message-for-zypperpkg-559.patch
  * make-pass-renderer-configurable-other-fixes-532.patch
  * make-sure-saltcacheloader-use-correct-fileclient-519.patch
  * mock-ip_addrs-in-utils-minions.py-unit-test-443.patch
  * normalize-package-names-once-with-pkg.installed-remo.patch
  * notify-beacon-for-debian-ubuntu-systems-347.patch
  * refactor-and-improvements-for-transactional-updates-.patch
  * retry-if-rpm-lock-is-temporarily-unavailable-547.patch
  * set-default-target-for-pip-from-venv_pip_target-envi.patch
  * state.apply-don-t-check-for-cached-pillar-errors.patch
  * state.orchestrate_single-does-not-pass-pillar-none-4.patch
  * support-transactional-systems-microos.patch
  * wipe-notify_socket-from-env-in-cmdmod-bsc-1193357-30.patch
```



## Workaround

Restarting the salt-minion systemd service on affected machines with
`systemctl restart salt-minion` mitigates the problem for one to several hours,
until the minion becomes unresponsive again.
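
Because an unresponsive minion cannot be restarted through salt itself, the
restart has to reach the machine by other means. The sketch below assumes that
the affected machines are reachable over ssh under the given host names with
sufficient privileges. The function name `restart_minions` and the variable
`REMOTE_SH` are hypothetical and only serve this example.

```shell
# Hypothetical helper: restart salt-minion on every host read from stdin,
# one host name per line. REMOTE_SH is the command used to reach a host. It
# defaults to ssh and can be set to "echo" for a dry run that only prints
# what would be executed.
restart_minions() {
  while IFS= read -r host; do
    [ -n "$host" ] || continue
    # Redirect stdin so that ssh does not swallow the remaining host names.
    ${REMOTE_SH:-ssh} "$host" systemctl restart salt-minion </dev/null
  done
}
```

A dry run such as `printf 'worker2\n' | REMOTE_SH=echo restart_minions`, where
`worker2` is a placeholder host name, prints the command instead of executing
it.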

For now we have downgraded the affected machines, except for one that we
deliberately keep in the broken state. We can offer access to this machine,
with some limitations, to anyone interested in investigating further.

## Further details

Please find our internal investigation issue at
https://progress.opensuse.org/issues/131249

