AWS ParallelCluster Slurm Constraints
In previous posts we discussed adding multiple instance types to the same Slurm queue and enabling Fast Failover. That's great when you're flexible about which instance type runs your job, but what if you want to choose the instance type at job submission time?
In this blog post we look at how to use the Slurm --constraint flag to pick a specific instance type at job submission time.
Setup
For setup, refer to the Fast Failover Setup section. We'll assume you have a cluster with a single queue that contains multiple compute resources.
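For reference, the relevant part of the cluster configuration looks roughly like the following. This is only a sketch: the subnet ID is a placeholder and the HeadNode and Image sections are omitted, but the queue and compute resource names match the node names you'll see below.
Scheduling:
  Scheduler: slurm
  SlurmQueues:
    - Name: queue0
      Networking:
        SubnetIds:
          - subnet-0123456789abcdef0   # placeholder
      ComputeResources:
        - Name: queue0-c6i32xlarge
          InstanceType: c6i.32xlarge
          MinCount: 0
          MaxCount: 100
        - Name: queue0-m6i32xlarge
          InstanceType: m6i.32xlarge
          MinCount: 0
          MaxCount: 100
        - Name: queue0-r6i32xlarge
          InstanceType: r6i.32xlarge
          MinCount: 0
          MaxCount: 100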
You should see something like the following when you run sinfo:
$ sinfo
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
queue0* up infinite 300 idle~ queue0-dy-queue0-c6i32xlarge-[1-100],queue0-dy-queue0-m6i32xlarge-[1-100],queue0-dy-queue0-r6i32xlarge-[1-100]
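ParallelCluster advertises each node's instance type as a Slurm node feature, which is what --constraint matches against. You can check this with sinfo's format options; the exact feature list depends on your ParallelCluster version (newer versions may also list the compute resource name), but the instance type should be among them. Output below is abbreviated and illustrative:
$ sinfo -N -o "%N %f" | head -4
NODELIST AVAIL_FEATURES
queue0-dy-queue0-c6i32xlarge-1 c6i.32xlarge
queue0-dy-queue0-c6i32xlarge-2 c6i.32xlarge
queue0-dy-queue0-c6i32xlarge-3 c6i.32xlarge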
Job Submission
Now when we submit a job, we can select the instance type by simply specifying it with the --constraint flag:
salloc --constraint "m6i.32xlarge"
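Because the constraint is an ordinary Slurm feature expression, you're not limited to a single type. For example, an OR lets Slurm pick either type, and the bracketed "matching OR" form asks for all nodes of a multi-node job to come from a single matching type. These are standard Slurm expressions, but it's worth testing how they interact with your own cluster's scaling before relying on them:
# accept either instance type (hypothetical example)
salloc --constraint "c6i.32xlarge|m6i.32xlarge"
# multi-node job where all nodes must share one of the listed types
salloc -N 4 --constraint "[c6i.32xlarge|m6i.32xlarge]"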
This can also be done in an sbatch script like so:
#!/bin/bash
#SBATCH --constraint m6i.32xlarge
# rest of job script ...
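If you want the job itself to confirm which instance type it landed on, one option is to query the EC2 instance metadata service from inside the job. A minimal sketch (the script contents are illustrative; the IMDSv2 token step can be dropped if your AMI still allows IMDSv1):
#!/bin/bash
#SBATCH --constraint m6i.32xlarge
#SBATCH --nodes 1

# The batch script runs on the allocated compute node, so we can ask the
# EC2 instance metadata service (IMDSv2) which instance type we got.
TOKEN=$(curl -sX PUT "http://169.254.169.254/latest/api/token" \
  -H "X-aws-ec2-metadata-token-ttl-seconds: 300")
curl -s -H "X-aws-ec2-metadata-token: $TOKEN" \
  "http://169.254.169.254/latest/meta-data/instance-type"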
Now when the job is submitted, you'll see that instance type get spun up:
$ sinfo
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
queue0* up infinite 1 mix~ queue0-dy-queue0-m6i32xlarge-1
queue0* up infinite 299 idle~ queue0-dy-queue0-c6i32xlarge-[1-100],queue0-dy-queue0-m6i32xlarge-[2-100],queue0-dy-queue0-r6i32xlarge-[1-100]
$ squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
9 queue0 interact ec2-user CF 0:09 1 queue0-dy-queue0-m6i32xlarge-1
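You can also confirm which feature a pending or running job requested by adding the features field (%f) to squeue's output format; column widths below are abbreviated for readability:
$ squeue -o "%.6i %.10P %.12j %.2t %f"
 JOBID  PARTITION         NAME ST FEATURES
     9     queue0     interact CF m6i.32xlarge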
What happens when that instance type can’t be launched? I did an experiment to find out:
$ salloc --constraint "m6i.32xlarge"
salloc: error: Node failure on queue0-dy-queue0-m6i32xlarge-1
$ sinfo
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
queue0* up infinite 1 down# queue0-dy-queue0-m6i32xlarge-1
queue0* up infinite 200 idle~ queue0-dy-queue0-c6i32xlarge-[1-100],queue0-dy-queue0-r6i32xlarge-[1-100]
queue0* up infinite 99 down~ queue0-dy-queue0-m6i32xlarge-[2-100]
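To see why the nodes were marked down, you can ask Slurm for the recorded reason, either for all down/drained nodes or for a single node. The exact reason text is written by ParallelCluster's node-management daemon and varies by version:
$ sinfo -R
$ scontrol show node queue0-dy-queue0-m6i32xlarge-1 | grep Reason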
ParallelCluster sets that node to down# and puts the rest of the nodes in that compute resource into down~ for 10 minutes when it detects one of the following error responses from the EC2 API:
InsufficientInstanceCapacity
InsufficientHostCapacity
InsufficientReservedInstanceCapacity
MaxSpotInstanceCountExceeded
SpotMaxPriceTooLow
Unsupported
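If capacity comes back (or you switch to a different constraint) and you don't want to wait out the timeout, the affected nodes can be reset manually, along these lines; check the fail-over documentation linked below for the exact recommended procedure:
$ scontrol update nodename=queue0-dy-queue0-m6i32xlarge-[1-100] state=power_down reason="manual reset"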
You can read more in the Slurm cluster fast insufficient capacity fail-over documentation.