SUSE-RU-2023:4332-1: moderate: Recommended update for slurm
# Recommended update for slurm

Announcement ID: SUSE-RU-2023:4332-1

Rating: moderate

References:

* bsc#1215437

Affected Products:

* HPC Module 15-SP5
* openSUSE Leap 15.5
* SUSE Linux Enterprise Desktop 15 SP5
* SUSE Linux Enterprise High Performance Computing 15 SP5
* SUSE Linux Enterprise Micro 5.5
* SUSE Linux Enterprise Real Time 15 SP5
* SUSE Linux Enterprise Server 15 SP5
* SUSE Linux Enterprise Server for SAP Applications 15 SP5
* SUSE Package Hub 15 15-SP5

An update that has one fix can now be installed.

## Description:

This update for slurm fixes the following issues:

* Updated to version 23.02.5 with the following changes:
  * Bug Fixes:
    * Revert a change in 23.02 where `SLURM_NTASKS` was no longer set in the job's environment when `--ntasks-per-node` was requested. The method by which it is being set, however, is different and should be more accurate in more situations.
    * Change pmi2 plugin to honor the `SrunPortRange` option. This matches the new behavior of the pmix plugin in 23.02.0. Note that neither of these plugins makes use of the `MpiParams=ports=` option, and previously they were limited only by the system's ephemeral port range.
    * Fix regression in 23.02.2 that caused `slurmctld -R` to crash on startup if a node features plugin is configured.
    * Fix and prevent reoccurring reservations from overlapping.
    * `job_container/tmpfs` - Avoid attempts to share BasePath between nodes.
    * With `CR_Cpu_Memory`, fix node selection for jobs that request gres and `--mem-per-cpu`.
    * Fix a regression from 22.05.7 in which some jobs were allocated too few nodes, thus overcommitting cpus to some tasks.
    * Fix a job being stuck in the completing state if the job ends while the primary controller is down or unresponsive and the backup controller has not yet taken over.
    * Fix `slurmctld` segfault when a node registers with a configured `CpuSpecList` while `slurmctld` configuration has the node without `CpuSpecList`.
    * Fix cloud nodes getting stuck in `POWERED_DOWN+NO_RESPOND` state after not registering by `ResumeTimeout`.
    * `slurmstepd` - Avoid cleanup of `config.json`-less containers spooldir getting skipped.
    * Fix `scontrol` segfault when 'completing' command requested repeatedly in interactive mode.
    * Properly handle a race condition between `bind()` and `listen()` calls in the network stack when running with `SrunPortRange` set.
    * Federation - Fix revoked jobs being returned regardless of the `-a`/`--all` option for privileged users.
    * Federation - Fix canceling pending federated jobs from non-origin clusters which could leave federated jobs orphaned from the origin cluster.
    * Fix `sinfo` segfault when printing multiple clusters with `--noheader` option.
    * Federation - Fix clusters not syncing if clusters are added to a federation before they have registered with the dbd.
    * `node_features/helpers` - Fix node selection for jobs requesting changeable features with the `|` operator, which could prevent jobs from running on some valid nodes.
    * `node_features/helpers` - Fix inconsistent handling of `&` and `|`, where an AND'd feature was sometimes AND'd to all sets of features instead of just the current set. E.g. `foo|bar&baz` was interpreted as `{foo,baz}` or `{bar,baz}` instead of how it is documented: `{foo}` or `{bar,baz}`.
    * Fix job accounting so that when a job is requeued its allocated node count is cleared. After the requeue, `sacct` will correctly show that the job has 0 `AllocNodes` while it is pending or if it is canceled before restarting.
    * `sacct` - `AllocCPUS` now correctly shows 0 if a job has not yet received an allocation or if the job was canceled before getting one.
    * Fix Intel OneAPI autodetect: detect the `/dev/dri/renderD[0-9]+` GPUs, and do not detect `/dev/dri/card[0-9]+`.
    * Fix node selection for jobs that request `--gpus` and a number of tasks fewer than GPUs, which resulted in incorrectly rejecting these jobs.
    * Remove `MYSQL_OPT_RECONNECT` completely.
    * Fix cloud nodes in `POWERING_UP` state disappearing (getting set to `FUTURE`) when an `scontrol reconfigure` happens.
    * `openapi/dbv0.0.39` - Avoid assert / segfault on missing coordinators list.
    * `slurmrestd` - Correct memory leak while parsing OpenAPI specification templates with server overrides.
    * Fix overwriting user node reason with system message.
    * Prevent deadlock when `rpc_queue` is enabled.
    * `slurmrestd` - Correct OpenAPI specification generation bug where fields with overlapping parent paths would not get generated.
    * Fix memory leak as a result of a partition info query.
    * Fix memory leak as a result of a job info query.
    * For step allocations, fix `--gres=none` sometimes not ignoring gres from the job.
    * Fix `--exclusive` jobs incorrectly gang-scheduling where they shouldn't.
    * Fix allocations with `CR_SOCKET`, gres not assigned to a specific socket, and block core distribution potentially allocating more sockets than required.
    * Revert a change in 23.02.3 where Slurm would kill a script's process group as soon as the script ended instead of waiting as long as any process in that process group held the stdout/stderr file descriptors open. That change broke some scripts that relied on the previous behavior. Setting time limits for scripts (such as `PrologEpilogTimeout`) is strongly encouraged to avoid Slurm waiting indefinitely for scripts to finish.
    * Fix `slurmdbd -R` not returning an error under certain conditions.
    * `slurmdbd` - Avoid potential NULL pointer dereference in the mysql plugin.
    * Fix regression in 23.02.3 which broke X11 forwarding for hosts when MUNGE sends a localhost address in the encode host field. This is caused when the node hostname is mapped to 127.0.0.1 (or similar) in `/etc/hosts`.
    * `openapi/[db]v0.0.39` - Fix memory leak on parsing error.
    * `data_parser/v0.0.39` - Fix updating qos for associations.
    * `openapi/dbv0.0.39` - Fix updating values for associations with null users.
    * Fix minor memory leak with `--tres-per-task` and licenses.
    * Fix cyclic socket cpu distribution for tasks in a step where `--cpus-per-task` < usable threads per core.
    * `slurmrestd` - For `GET /slurm/v0.0.39/node[s]`, change format of node's energy field `current_watts` to a dictionary to account for unset value instead of dumping 4294967294.
    * `slurmrestd` - For `GET /slurm/v0.0.39/qos`, change format of QOS's field `priority` to a dictionary to account for unset value instead of dumping 4294967294.
    * `slurmrestd` - For `GET /slurm/v0.0.39/job[s]`, the 'return code' field in `v0.0.39_job_exit_code` will be set to -127 instead of being left unset when the job does not have a relevant return code.
  * Other Changes:
    * Remove `--uid` / `--gid` options from `salloc` and `srun` commands. These options did not work correctly since the CVE-2022-29500 fix in combination with some changes made in 23.02.0.
    * Add the `JobId` to `debug()` messages indicating when `cpus_per_task/mem_per_cpu` or `pn_min_cpus` are being automatically adjusted.
    * Change the log message warning for rate limited users from verbose to info.
    * `slurmstepd` - Cleanup per task generated environment for containers in spooldir.
    * Format batch, extern, interactive, and pending step ids into strings that are human readable.
    * `slurmrestd` - Reduce memory usage when printing out job CPU frequency.
    * `data_parser/v0.0.39` - Add `required/memory_per_cpu` and `required/memory_per_node` to `sacct --json` and `sacct --yaml` and `GET /slurmdb/v0.0.39/jobs` from slurmrestd.
    * `gpu/oneapi` - Store cores correctly so CPU affinity is tracked.
    * Allow `slurmdbd -R` to work if the root assoc id is not 1.
    * Limit periodic node registrations to 50 instead of the full `TreeWidth`. Since unresolvable `cloud/dynamic` nodes must disable fanout by setting `TreeWidth` to a large number, this would cause all nodes to register at once.
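
The `node_features/helpers` fix above hinges on operator precedence: `&` binds tighter than `|`, so `foo|bar&baz` means `{foo}` or `{bar,baz}`. The following is a minimal Python sketch of that documented reading only. It is not Slurm's parser; the function names `parse_feature_sets` and `node_matches` are hypothetical, and parentheses, counts and other constraint syntax are deliberately ignored.

```python
def parse_feature_sets(expr: str) -> list[frozenset[str]]:
    """Split a feature expression into alternative feature sets.

    Splitting on '|' first and on '&' second makes '&' bind tighter,
    so an AND'd feature only joins the set it was written next to.
    """
    sets = []
    for alternative in expr.split("|"):
        features = frozenset(
            name.strip() for name in alternative.split("&") if name.strip()
        )
        if features:
            sets.append(features)
    return sets


def node_matches(expr: str, node_features: set[str]) -> bool:
    """Return True if the node provides every feature of at least one set."""
    return any(wanted <= node_features for wanted in parse_feature_sets(expr))


if __name__ == "__main__":
    for features in parse_feature_sets("foo|bar&baz"):
        print(sorted(features))
```

Under this reading a node that offers only `foo` qualifies, while a node that offers only `bar` does not, because `baz` is required alongside `bar` but not alongside `foo`.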
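
Clients of `slurmrestd` that read `current_watts` or the QOS `priority` may see either the old integer form, where 4294967294 stood for "unset", or the new dictionary form described above. The sketch below shows one way a client could normalize both. The dictionary keys `set`, `infinite` and `number` are an assumption about the v0.0.39 layout and should be checked against the OpenAPI specification served by your `slurmrestd`; the helper name `read_optional_number` is hypothetical.

```python
import math

# 4294967294 is the value the announcement says was previously dumped
# for an unset field.
UNSET_SENTINEL = 0xFFFFFFFE


def read_optional_number(value):
    """Return a number, math.inf, or None for an unset field.

    Accepts the old plain-integer form and the newer dictionary form.
    The keys "set", "infinite" and "number" are assumed, not confirmed
    by the announcement.
    """
    if isinstance(value, dict):
        if value.get("infinite", False):
            return math.inf
        if not value.get("set", False):
            return None
        return value.get("number")
    if value == UNSET_SENTINEL:
        return None
    return value


if __name__ == "__main__":
    print(read_optional_number(4294967294))
    print(read_optional_number({"set": True, "infinite": False, "number": 250}))
```

Mapping the unset case to `None` keeps the sentinel from leaking into arithmetic such as summing the power draw across nodes.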

## Patch Instructions:

To install this SUSE update use the SUSE recommended installation methods like YaST online_update or "zypper patch". Alternatively you can run the command listed for your product:

* openSUSE Leap 15.5
  `zypper in -t patch SUSE-2023-4332=1 openSUSE-SLE-15.5-2023-4332=1`
* HPC Module 15-SP5
  `zypper in -t patch SUSE-SLE-Module-HPC-15-SP5-2023-4332=1`
* SUSE Package Hub 15 15-SP5
  `zypper in -t patch SUSE-SLE-Module-Packagehub-Subpackages-15-SP5-2023-4332=1`

## Package List:

* openSUSE Leap 15.5 (aarch64 ppc64le s390x x86_64)
  * slurm-pam_slurm-debuginfo-23.02.5-150500.5.9.2
  * libpmi0-23.02.5-150500.5.9.2
  * slurm-hdf5-debuginfo-23.02.5-150500.5.9.2
  * slurm-hdf5-23.02.5-150500.5.9.2
  * slurm-munge-debuginfo-23.02.5-150500.5.9.2
  * slurm-cray-23.02.5-150500.5.9.2
  * slurm-sview-debuginfo-23.02.5-150500.5.9.2
  * slurm-plugin-ext-sensors-rrd-debuginfo-23.02.5-150500.5.9.2
  * slurm-torque-debuginfo-23.02.5-150500.5.9.2
  * slurm-rest-debuginfo-23.02.5-150500.5.9.2
  * slurm-lua-23.02.5-150500.5.9.2
  * slurm-slurmdbd-23.02.5-150500.5.9.2
  * slurm-lua-debuginfo-23.02.5-150500.5.9.2
  * slurm-auth-none-debuginfo-23.02.5-150500.5.9.2
  * slurm-auth-none-23.02.5-150500.5.9.2
  * slurm-slurmdbd-debuginfo-23.02.5-150500.5.9.2
  * slurm-node-debuginfo-23.02.5-150500.5.9.2
  * slurm-pam_slurm-23.02.5-150500.5.9.2
  * slurm-devel-23.02.5-150500.5.9.2
  * perl-slurm-debuginfo-23.02.5-150500.5.9.2
  * slurm-plugins-debuginfo-23.02.5-150500.5.9.2
  * slurm-sql-debuginfo-23.02.5-150500.5.9.2
  * libpmi0-debuginfo-23.02.5-150500.5.9.2
  * slurm-munge-23.02.5-150500.5.9.2
  * slurm-node-23.02.5-150500.5.9.2
  * perl-slurm-23.02.5-150500.5.9.2
  * slurm-plugin-ext-sensors-rrd-23.02.5-150500.5.9.2
  * slurm-plugins-23.02.5-150500.5.9.2
  * libnss_slurm2-23.02.5-150500.5.9.2
  * slurm-torque-23.02.5-150500.5.9.2
  * libslurm39-debuginfo-23.02.5-150500.5.9.2
  * slurm-sql-23.02.5-150500.5.9.2
  * libnss_slurm2-debuginfo-23.02.5-150500.5.9.2
  * slurm-23.02.5-150500.5.9.2
  * slurm-sview-23.02.5-150500.5.9.2
  * slurm-rest-23.02.5-150500.5.9.2
  * slurm-cray-debuginfo-23.02.5-150500.5.9.2
  * slurm-debugsource-23.02.5-150500.5.9.2
  * slurm-testsuite-23.02.5-150500.5.9.2
  * slurm-debuginfo-23.02.5-150500.5.9.2
  * libslurm39-23.02.5-150500.5.9.2
* openSUSE Leap 15.5 (noarch)
  * slurm-webdoc-23.02.5-150500.5.9.2
  * slurm-config-23.02.5-150500.5.9.2
  * slurm-seff-23.02.5-150500.5.9.2
  * slurm-doc-23.02.5-150500.5.9.2
  * slurm-sjstat-23.02.5-150500.5.9.2
  * slurm-openlava-23.02.5-150500.5.9.2
  * slurm-config-man-23.02.5-150500.5.9.2
* HPC Module 15-SP5 (aarch64 x86_64)
  * slurm-pam_slurm-debuginfo-23.02.5-150500.5.9.2
  * libpmi0-23.02.5-150500.5.9.2
  * slurm-munge-debuginfo-23.02.5-150500.5.9.2
  * slurm-cray-23.02.5-150500.5.9.2
  * slurm-sview-debuginfo-23.02.5-150500.5.9.2
  * slurm-plugin-ext-sensors-rrd-debuginfo-23.02.5-150500.5.9.2
  * slurm-torque-debuginfo-23.02.5-150500.5.9.2
  * slurm-rest-debuginfo-23.02.5-150500.5.9.2
  * slurm-lua-23.02.5-150500.5.9.2
  * slurm-slurmdbd-23.02.5-150500.5.9.2
  * slurm-lua-debuginfo-23.02.5-150500.5.9.2
  * slurm-auth-none-debuginfo-23.02.5-150500.5.9.2
  * slurm-auth-none-23.02.5-150500.5.9.2
  * slurm-slurmdbd-debuginfo-23.02.5-150500.5.9.2
  * slurm-node-debuginfo-23.02.5-150500.5.9.2
  * slurm-pam_slurm-23.02.5-150500.5.9.2
  * slurm-devel-23.02.5-150500.5.9.2
  * perl-slurm-debuginfo-23.02.5-150500.5.9.2
  * slurm-plugins-debuginfo-23.02.5-150500.5.9.2
  * slurm-sql-debuginfo-23.02.5-150500.5.9.2
  * libpmi0-debuginfo-23.02.5-150500.5.9.2
  * slurm-munge-23.02.5-150500.5.9.2
  * slurm-node-23.02.5-150500.5.9.2
  * perl-slurm-23.02.5-150500.5.9.2
  * slurm-plugin-ext-sensors-rrd-23.02.5-150500.5.9.2
  * slurm-plugins-23.02.5-150500.5.9.2
  * libnss_slurm2-23.02.5-150500.5.9.2
  * slurm-torque-23.02.5-150500.5.9.2
  * libslurm39-debuginfo-23.02.5-150500.5.9.2
  * slurm-sql-23.02.5-150500.5.9.2
  * libnss_slurm2-debuginfo-23.02.5-150500.5.9.2
  * slurm-23.02.5-150500.5.9.2
  * slurm-sview-23.02.5-150500.5.9.2
  * slurm-rest-23.02.5-150500.5.9.2
  * slurm-cray-debuginfo-23.02.5-150500.5.9.2
  * slurm-debugsource-23.02.5-150500.5.9.2
  * slurm-debuginfo-23.02.5-150500.5.9.2
  * libslurm39-23.02.5-150500.5.9.2
* HPC Module 15-SP5 (noarch)
  * slurm-webdoc-23.02.5-150500.5.9.2
  * slurm-config-man-23.02.5-150500.5.9.2
  * slurm-config-23.02.5-150500.5.9.2
  * slurm-doc-23.02.5-150500.5.9.2
* SUSE Package Hub 15 15-SP5 (ppc64le s390x)
  * slurm-pam_slurm-debuginfo-23.02.5-150500.5.9.2
  * libpmi0-23.02.5-150500.5.9.2
  * slurm-hdf5-debuginfo-23.02.5-150500.5.9.2
  * slurm-hdf5-23.02.5-150500.5.9.2
  * slurm-munge-debuginfo-23.02.5-150500.5.9.2
  * slurm-cray-23.02.5-150500.5.9.2
  * slurm-sview-debuginfo-23.02.5-150500.5.9.2
  * slurm-torque-debuginfo-23.02.5-150500.5.9.2
  * slurm-rest-debuginfo-23.02.5-150500.5.9.2
  * slurm-lua-23.02.5-150500.5.9.2
  * slurm-slurmdbd-23.02.5-150500.5.9.2
  * slurm-lua-debuginfo-23.02.5-150500.5.9.2
  * slurm-auth-none-debuginfo-23.02.5-150500.5.9.2
  * slurm-auth-none-23.02.5-150500.5.9.2
  * slurm-slurmdbd-debuginfo-23.02.5-150500.5.9.2
  * slurm-node-debuginfo-23.02.5-150500.5.9.2
  * slurm-pam_slurm-23.02.5-150500.5.9.2
  * slurm-devel-23.02.5-150500.5.9.2
  * perl-slurm-debuginfo-23.02.5-150500.5.9.2
  * slurm-plugins-debuginfo-23.02.5-150500.5.9.2
  * slurm-sql-debuginfo-23.02.5-150500.5.9.2
  * libpmi0-debuginfo-23.02.5-150500.5.9.2
  * slurm-munge-23.02.5-150500.5.9.2
  * slurm-node-23.02.5-150500.5.9.2
  * perl-slurm-23.02.5-150500.5.9.2
  * slurm-plugins-23.02.5-150500.5.9.2
  * libnss_slurm2-23.02.5-150500.5.9.2
  * slurm-torque-23.02.5-150500.5.9.2
  * slurm-sql-23.02.5-150500.5.9.2
  * libnss_slurm2-debuginfo-23.02.5-150500.5.9.2
  * slurm-23.02.5-150500.5.9.2
  * slurm-sview-23.02.5-150500.5.9.2
  * slurm-rest-23.02.5-150500.5.9.2
  * slurm-cray-debuginfo-23.02.5-150500.5.9.2
  * slurm-debugsource-23.02.5-150500.5.9.2
  * slurm-debuginfo-23.02.5-150500.5.9.2
* SUSE Package Hub 15 15-SP5 (noarch)
  * slurm-webdoc-23.02.5-150500.5.9.2
  * slurm-config-23.02.5-150500.5.9.2
  * slurm-seff-23.02.5-150500.5.9.2
  * slurm-doc-23.02.5-150500.5.9.2
  * slurm-sjstat-23.02.5-150500.5.9.2
  * slurm-openlava-23.02.5-150500.5.9.2
  * slurm-config-man-23.02.5-150500.5.9.2

## References:

* https://bugzilla.suse.com/show_bug.cgi?id=1215437