NOTE: before erasing lines from a configuration file, comment them out instead of deleting them outright. This makes it easy to recover a line later if you need it back. Another option is to make a copy of the configuration file before making any change to it.
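For example, assuming the configuration file is /etc/slurm/slurm.conf (the exact path may differ on your installation), make a backup before editing:
[root@HPCHN ~]# cp /etc/slurm/slurm.conf /etc/slurm/slurm.conf.bak
and, inside the file, turn a line you want to remove into a comment by prefixing it with # instead of deleting it.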
A. Check that the slurmd daemons are up and running on all cluster nodes
[root@HPCHN ~]# systemctl status slurmd
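If pdsh is installed on the head node (an assumption; it is not always part of the base install), the same check can be run on all compute nodes at once. The node names node[02-04] are the ones used later in this document:
[root@HPCHN ~]# pdsh -w node[02-04] systemctl is-active slurmd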
B. The slurmctld (controller) daemon must be running only on the head node
[root@HPCHN ~]# systemctl status slurmctld
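You can also ask the controller directly whether it is responding; scontrol ping reports the state of the primary (and, if configured, backup) slurmctld:
[root@HPCHN ~]# scontrol ping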
C. If slurmctld is running on a compute node, stop it, and kill the process if it does not stop
[root@node02 ~]# systemctl stop slurmctld
[root@node02 ~]# pidof slurmctld
[root@node02 ~]# kill -9 <PID number>
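If slurmctld keeps coming back on the compute node after a reboot, it is probably enabled there; disabling the unit prevents it from starting at boot:
[root@node02 ~]# systemctl disable slurmctld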
D. After making any change to the slurm.conf configuration file, you have two options:
- Manually copy the file to the nodes (a scripted version of this loop is shown after this step)
[root@HPCHN ~]# wwsh file sync
And on the nodes:
[root@HPCHN ~]# ssh 10.0.2.2
[root@node02 ~]# /warewulf/bin/wwgetfiles
[root@node02 ~]# exit
logout
Connection to 10.0.2.2 closed.
[root@HPCHN ~]# ssh 10.0.2.3
[root@node03 ~]# /warewulf/bin/wwgetfiles
[root@node03 ~]# exit
logout
Connection to 10.0.2.3 closed.
[root@HPCHN ~]# ssh 10.0.2.4
[root@node04 ~]# /warewulf/bin/wwgetfiles
- Or execute:
[root@HPCHN ~]# wwsh provision set node[02-04] --bootstrap=`uname -r` --vnfs=centos7.3 --files=dynamic_hosts,passwd,group,shadow,slurm.conf,munge.key,network
[root@HPCHN ~]# wwsh file sync
Then reboot the nodes.
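The per-node wwgetfiles calls from the first option can also be run as a single loop from the head node; the IP addresses are the same ones used in the manual example and should be adjusted to your cluster:
[root@HPCHN ~]# for ip in 10.0.2.2 10.0.2.3 10.0.2.4; do ssh $ip /warewulf/bin/wwgetfiles; done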
E. On the head node, try to start slurmctld manually in the foreground by executing:
[root@HPCHN ~]# /usr/sbin/slurmctld -D
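Adding verbosity flags makes startup problems easier to spot; each -v increases the logging detail:
[root@HPCHN ~]# /usr/sbin/slurmctld -D -vvv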
F. Sometimes it is necessary to erase the slurmctld log file (slurmctld.log).
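The log location is set by SlurmctldLogFile in slurm.conf; /var/log/slurmctld.log below is only a common default and may differ on your system. A minimal sketch of clearing it:
[root@HPCHN ~]# systemctl stop slurmctld
[root@HPCHN ~]# mv /var/log/slurmctld.log /var/log/slurmctld.log.old
[root@HPCHN ~]# systemctl start slurmctld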