Enable All-or-Nothing Scaling with AWS ParallelCluster 🖥

Posted on Jun 21, 2022

tl;dr: Launch N instances or none at all

Update: This has been turned into an official AWS Blogpost: Minimize HPC compute costs with all-or-nothing instance launching


All-or-nothing scaling is useful when you run MPI jobs that can’t start until all N instances have joined the cluster.

By default, ParallelCluster's Slurm integration launches instances on a best-effort basis: if you request 10 instances but only 9 are available, it provisions 9 and keeps retrying for the last one. For a job that needs all 10 instances before it can start, you pay for those 9 instances while they sit idle.

For example, if you submit a job like:

sbatch -N 10 ...

It can’t start until all 10 instances join the cluster.

However, if you were to run 10 jobs that each require a single instance, like so:

sbatch -N 1 --exclusive --array=0-9 ...

Each job would get kicked off as capacity gets added, with jobs finishing and potentially returning capacity for later jobs.

Setup

The simplest way to set this up is to run the following commands on the HeadNode:

sudo su -
echo "all_or_nothing_batch = True" >> /etc/parallelcluster/slurm_plugin/parallelcluster_slurm_resume.conf
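Because `>>` appends on every run, running the command twice leaves a duplicate line in the file. A minimal sketch of a variant that appends only when the line is missing; `enable_all_or_nothing` is a hypothetical helper name, not part of ParallelCluster:

```shell
#!/bin/bash
# Sketch: append the setting only if the exact line is not already in the file.
# enable_all_or_nothing is a hypothetical helper name used for illustration.
enable_all_or_nothing() {
    # $1: path to the resume config file
    # grep flags: -q quiet, -x match the whole line, -F fixed string (no regex)
    if ! grep -qxF "all_or_nothing_batch = True" "$1" 2>/dev/null; then
        echo "all_or_nothing_batch = True" >> "$1"
    fi
}
```

As root on the HeadNode, call it with the path from this post: `enable_all_or_nothing /etc/parallelcluster/slurm_plugin/parallelcluster_slurm_resume.conf`.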

If you’d like to automate this process, you can create a CustomAction that runs it on each new cluster.

Create a file called all-or-nothing.sh with the following content:

#!/bin/bash

echo "all_or_nothing_batch = True" >> /etc/parallelcluster/slurm_plugin/parallelcluster_slurm_resume.conf

Upload it to S3:

aws s3 cp all-or-nothing.sh s3://bucket

Modify the cluster config to specify the all-or-nothing.sh script in the HeadNode > CustomActions section.

Note: I used the multi-runner.sh script here, which lets you pass multiple scripts, each as an argument. If you only need this one script, you can specify it directly under Script instead.

HeadNode:
  InstanceType: c5a.xlarge
  Ssh:
    KeyName: keypair
  Networking:
    SubnetId: subnet-12345678
  Iam:
    AdditionalIamPolicies:
      - Policy: arn:aws:iam::aws:policy/AmazonSSMManagedInstanceCore
      - Policy: arn:aws:iam::aws:policy/AmazonS3ReadOnlyAccess
  Dcv:
    Enabled: true
  CustomActions:
    OnNodeConfigured:
      Script: >-
        https://swsmith.cc/scripts/multi-runner.sh
      Args:
        - s3://bucket/all-or-nothing.sh
Scheduling:
  Scheduler: slurm
  SlurmQueues:
    - Name: queue1
      ComputeResources:
        - Name: queue1-c6i32xlarge
          MinCount: 0
          MaxCount: 200
          InstanceType: c6i.32xlarge
          Efa:
            Enabled: true
            GdrSupport: true
      Networking:
        SubnetIds:
          - subnet-12345678
        PlacementGroup:
          Enabled: true
Region: us-east-2
Image:
  Os: alinux2

Testing

Before

Before setting all_or_nothing_batch = True, I purposely chose an instance type with low capacity and submitted a large job:

$ sbatch -N 200 job.sh

The job went into a pending state, and by monitoring sinfo I could see that only 106 of the 200 instances launched:

$ watch sinfo

Every 2.0s: sinfo                                                                                                                                                  Tue Jun 21 23:20:26 2022

PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
queue0       up   infinite     94  idle% queue1-dy-queue1-c6i32xlarge-[107-200]
queue0       up   infinite    106  idle# queue1-dy-queue1-c6i32xlarge-[1-106]

After

Next I set all_or_nothing_batch = True and tried the same thing again (after waiting for all 106 allocated nodes to get terminated):

$ sbatch -N 200 job.sh

I see that the instances go into idle! state after they fail to launch:

$ watch sinfo


PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
compute*     up   infinite     64  idle~ compute-dy-hpc6a-[1-64]
queue1       up   infinite    200  idle! queue1-dy-queue1-c6i32xlarge-[1-200]

If I look in /var/log/parallelcluster/slurm_resume.log, I can confirm 0 instances were launched:

2022-06-21 23:37:55,845 - [slurm_plugin.instance_manager:_launch_ec2_instances] - ERROR - Failed RunInstances request: b816b2b4-fe15-4037-a116-678bfa187a44
2022-06-21 23:37:55,845 - [slurm_plugin.instance_manager:add_instances_for_nodes] - ERROR - Encountered exception when launching instances for nodes (x200) [ ... ]: An error occurred (InsufficientInstanceCapacity) when calling the RunInstances operation (reached max retries: 1): We currently do not have sufficient c6i.32xlarge capacity in the Availability Zone you requested (us-east-2b). Our system will be working on provisioning additional capacity. You can currently get c6i.32xlarge capacity by not specifying an Availability Zone in your request or choosing us-east-2a, us-east-2c.
2022-06-21 23:37:55,846 - [slurm_plugin.resume:_resume] - INFO - Successfully launched nodes (x0) []
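To check this without reading the log by eye, the count can be pulled out of the most recent "Successfully launched nodes" line. A minimal sketch, assuming the log line format shown above; `launched_count` is a hypothetical helper name, and other ParallelCluster versions may format this line differently:

```shell
#!/bin/bash
# Sketch: print N from the most recent "Successfully launched nodes (xN)" log line.
# launched_count is a hypothetical helper; the line format is assumed from the sample above.
launched_count() {
    # $1: path to the resume log
    grep -F "Successfully launched nodes" "$1" \
        | tail -n 1 \
        | sed -E 's/.*Successfully launched nodes \(x([0-9]+)\).*/\1/'
}
```

On the HeadNode this would be called as `launched_count /var/log/parallelcluster/slurm_resume.log`. If the log has no matching line, it prints nothing.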

Now I can change my job size or instance type, or use another Availability Zone, to accommodate this job.

--no-requeue flag

If you add the --no-requeue flag and the job doesn’t get capacity, it automatically drops off the queue. For example:

$ sbatch -p queue1 -N 4 --exclusive --no-requeue submit.sh
$ watch squeue
Every 2.0s: squeue                                                                                                                 Fri Mar 24 15:21:23 2023

             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
                 2    queue1 submit.s ec2-user PD       0:00      4 (Nodes required for job are DOWN, DRAINED or reserved for jobs in higher priority partitions)

After about 5 minutes, the job drops off the queue. If you want the job to keep pending instead, use the --requeue flag (this is the default).
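To see from a script whether the job is still waiting, one option is to ask squeue for that job ID restricted to the pending state. A minimal sketch; `job_is_pending` is a hypothetical helper name, and the job ID 2 in the usage line is taken from the sample output above:

```shell
#!/bin/bash
# Sketch: succeed only while the given job is still pending in the queue.
# job_is_pending is a hypothetical helper name used for illustration.
job_is_pending() {
    # $1: Slurm job ID
    # squeue flags: -h omits the header, -j selects the job, -t filters by state.
    # An unknown or non-pending job yields no output, so the test fails.
    [ -n "$(squeue -h -j "$1" -t PENDING 2>/dev/null)" ]
}
```

For example: `job_is_pending 2 && echo "still pending" || echo "no longer pending"`.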