| Announcement ID: | SUSE-RU-2023:4332-1 |
|---|---|
| Rating: | moderate |
| References: | |
| Affected Products: | |
An update that has one fix can now be installed.
This update for slurm fixes the following issues:
Updated to version 23.02.5 with the following changes:
Bug Fixes:
* SLURM_NTASKS was no longer set in the job's environment when
  --ntasks-per-node was requested. The method by which it is set, however,
  is different and should be more accurate in more situations.
* SrunPortRange option. This matches the new behavior of the pmix plugin in
  23.02.0. Note that neither of these plugins makes use of the
  MpiParams=ports= option, and previously they were only limited by the
  system's ephemeral port range.
* job_container/tmpfs - Avoid attempts to share BasePath between nodes.
* CR_Cpu_Memory, fix node selection for jobs that request gres and
  --mem-per-cpu.
* slurmctld segfault when a node registers with a configured CpuSpecList
  while slurmctld configuration has the node without CpuSpecList.
* POWERED_DOWN+NO_RESPOND state after not registering by ResumeTimeout.
* slurmstepd - Avoid cleanup of config.json-less containers spooldir getting
  skipped.
* bind() and listen() calls in the network stack when running with
  SrunPortRange set.
* -a/--all option for privileged users.
* --noheader option.
* node_features/helpers - Fix node selection for jobs requesting changeable
  features with the | operator, which could prevent jobs from running on
  some valid nodes.
* node_features/helpers - Fix inconsistent handling of & and |, where an
  AND'd feature was sometimes AND'd to all sets of features instead of just
  the current set. E.g. foo|bar&baz was interpreted as {foo,baz} or
  {bar,baz} instead of how it is documented: {foo} or {bar,baz}.
* AllocNodes while it is pending or if it is canceled before restarting.
* sacct - AllocCPUS now correctly shows 0 if a job has not yet received an
  allocation or if the job was canceled before getting one.
* /dev/dri/renderD[0-9]+ GPUs, and do not detect /dev/dri/card[0-9]+.
* --gpus and a number of tasks fewer than GPUs, which resulted in
  incorrectly rejecting these jobs.
* MYSQL_OPT_RECONNECT completely.
* POWERING_UP state disappearing (getting set to FUTURE) when an scontrol
  reconfigure happens.
* openapi/dbv0.0.39 - Avoid assert / segfault on missing coordinators list.
* slurmrestd - Correct memory leak while parsing OpenAPI specification
  templates with server overrides.
* rpc_queue is enabled.
* slurmrestd - Correct OpenAPI specification generation bug where fields
  with overlapping parent paths would not get generated.
* --gres=none sometimes not ignoring gres from the job.
* --exclusive jobs incorrectly gang-scheduling where they shouldn't.
* CR_SOCKET, gres not assigned to a specific socket, and block core
  distribution potentially allocating more sockets than required.
* PrologEpilogTimeout) is strongly encouraged to avoid Slurm waiting
  indefinitely for scripts to finish.
* slurmdbd -R not returning an error under certain conditions.
* slurmdbd - Avoid potential NULL pointer dereference in the mysql plugin.
* /etc/hosts.
* openapi/[db]v0.0.39 - Fix memory leak on parsing error.
* data_parser/v0.0.39 - Fix updating qos for associations.
* openapi/dbv0.0.39 - Fix updating values for associations with null users.
* --tres-per-task and licenses.
* --cpus-per-task < usable threads per core.
* slurmrestd - For GET /slurm/v0.0.39/node[s], change format of node's
  energy field current_watts to a dictionary to account for unset value
  instead of dumping 4294967294.
* slurmrestd - For GET /slurm/v0.0.39/qos, change format of QOS's field
  "priority" to a dictionary to account for unset value instead of dumping
  4294967294.
* GET /slurm/v0.0.39/job[s], the 'return code' code field in
  v0.0.39_job_exit_code will be set to -127 instead of being left unset
  where job does not have a relevant return code.
Other Changes:
* JobId to debug() messages indicating when cpus_per_task/mem_per_cpu or
  pn_min_cpus are being automatically adjusted.
* slurmstepd - Cleanup per task generated environment for containers in
  spooldir.
* slurmrestd - Reduce memory usage when printing out job CPU frequency.
* data_parser/v0.0.39 - Add required/memory_per_cpu and
  required/memory_per_node to sacct --json and sacct --yaml and
  GET /slurmdb/v0.0.39/jobs from slurmrestd.
* gpu/oneapi - Store cores correctly so CPU affinity is tracked.
* slurmdbd -R to work if the root assoc id is not 1.
* TreeWidth. Since unresolvable cloud/dynamic nodes must disable fanout by
  setting TreeWidth to a large number, this would cause all nodes to
  register at once.
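The documented reading of the feature expression in the node_features/helpers fix above (foo|bar&baz means {foo} or {bar,baz}) can be illustrated with a small sketch. This is not Slurm's parser: the function names parse_features and node_matches are hypothetical, and the sketch assumes an expression contains only feature names, & and |, with no parentheses or counts.

```python
def parse_features(expr: str) -> list[set[str]]:
    """Return the alternative feature sets of an expression.

    Splitting on '|' first and on '&' second makes '&' bind tighter
    than '|', so an AND'd feature only joins its own alternative.
    """
    alternatives = []
    for alternative in expr.split("|"):
        names = {name.strip() for name in alternative.split("&") if name.strip()}
        alternatives.append(names)
    return alternatives


def node_matches(expr: str, node_features: set[str]) -> bool:
    """A node qualifies if it provides every feature of at least one alternative."""
    return any(required <= node_features for required in parse_features(expr))


if __name__ == "__main__":
    print(parse_features("foo|bar&baz"))
    print(node_matches("foo|bar&baz", {"foo"}))
    print(node_matches("foo|bar&baz", {"bar"}))
```

Under this reading a node that only has foo qualifies, while a node that only has bar does not, because bar must be combined with baz.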
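The slurmrestd items above replace a raw 4294967294 with a dictionary so that clients can tell an unset value from a real number. The sketch below shows the general idea only. The helper name wrap_unset and the keys "set" and "number" are assumptions made for illustration and may not match the actual v0.0.39 schema; consult the generated OpenAPI specification for the real field names.

```python
# 4294967294 is 2**32 - 2, the raw value the announcement says was dumped
# for unset fields.
UNSET_UINT32 = 0xFFFFFFFE


def wrap_unset(raw: int) -> dict:
    """Wrap a raw 32-bit value in a dictionary that makes 'unset' explicit.

    The key names here are hypothetical, not taken from the Slurm schema.
    """
    if raw == UNSET_UINT32:
        return {"set": False, "number": 0}
    return {"set": True, "number": raw}


if __name__ == "__main__":
    print(wrap_unset(4294967294))
    print(wrap_unset(250))
```

A client reading such a field would check the flag first and only then use the number, instead of comparing against the magic value itself.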
To install this SUSE update use the SUSE recommended
installation methods like YaST online_update or "zypper patch".
Alternatively you can run the command listed for your product:
zypper in -t patch SUSE-2023-4332=1 openSUSE-SLE-15.5-2023-4332=1
zypper in -t patch SUSE-SLE-Module-HPC-15-SP5-2023-4332=1
zypper in -t patch SUSE-SLE-Module-Packagehub-Subpackages-15-SP5-2023-4332=1