HPC7g instances in AWS ParallelCluster
HPC7g instances are the first Arm-based HPC instances in AWS. They combine excellent per-core pricing, deep capacity pools, and 200 Gbps of EFA networking, making them a great fit for large-scale, cost-effective simulations. There are three different sizes:
| Instance Size | Cores | Memory (GiB) | EFA Network Bandwidth | On-Demand Price (us-east-1, per hour) |
|---|---|---|---|---|
| hpc7g.4xlarge | 16 | 128 | 200 Gbps | $1.683 |
| hpc7g.8xlarge | 32 | 128 | 200 Gbps | $1.683 |
| hpc7g.16xlarge | 64 | 128 | 200 Gbps | $1.683 |
The first thing you'll notice is that the t-shirt sizes (i.e. 4xlarge or 8xlarge) don't differ in memory or price; they differ only in the number of cores. Think of the smaller sizes as pre-restricting cores to get more memory bandwidth and more memory per core (e.g. 8 GiB per core on the 4xlarge versus 2 GiB per core on the 16xlarge).
In the next section we'll show how to set up these instances with AWS ParallelCluster.
Setup
To deploy hpc7g instances we'll need to create an Arm-specific cluster, because AWS ParallelCluster requires every cluster to use a single architecture, i.e. arm64 or x86_64. See Multi-Cluster Mode for an example of how to link the two clusters.
The next important caveat is that hpc7g instances can only be deployed in a private subnet in a single Availability Zone; at launch that's use1-az6 in N. Virginia (us-east-1).
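Availability Zone names (like us-east-1a) map to different zone IDs in each account, so it's worth confirming which of your zone names corresponds to use1-az6. A quick check, assuming the AWS CLI is installed and configured:

```bash
# List zone name to zone ID mappings for us-east-1; look for the name mapped to use1-az6.
aws ec2 describe-availability-zones --region us-east-1 \
  --query "AvailabilityZones[].{Name:ZoneName,Id:ZoneId}" --output table
```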
- Create a VPC and subnet in N. Virginia. For simplicity, I've provided a template that creates a private subnet in each Availability Zone. You'll need to use the private subnet created in use1-az6 to get capacity.
- Download the following example template: hpc7g.yaml. You'll need to substitute your SSH key name, the subnet IDs for both a public subnet (to connect) and a private subnet (for the compute nodes), and anything else specific to your account. A sketch of the relevant fields follows this list.
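To make those substitutions concrete, here's a minimal sketch of the kind of fields the template exposes. This is illustrative only, not the actual contents of hpc7g.yaml: every ID, name, and count below is a placeholder.

```yaml
Region: us-east-1
Image:
  Os: alinux2
HeadNode:
  InstanceType: c7g.xlarge
  Networking:
    SubnetId: subnet-0123456789abcdef0    # public subnet, used to connect
  Ssh:
    KeyName: my-keypair                   # your EC2 SSH key name
Scheduling:
  Scheduler: slurm
  SlurmQueues:
    - Name: hpc7g
      ComputeResources:
        - Name: hpc7g-16xlarge            # the real template may also define the 4xlarge and 8xlarge sizes
          InstanceType: hpc7g.16xlarge
          MinCount: 0
          MaxCount: 16
          Efa:
            Enabled: true
      Networking:
        SubnetIds:
          - subnet-0fedcba9876543210      # private subnet in use1-az6
        PlacementGroup:
          Enabled: true
SharedStorage:
  - Name: fsx
    StorageType: FsxLustre
    MountDir: /shared
    FsxLustreSettings:
      StorageCapacity: 1200
```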
The template creates the following resources:
| Field | Value | Description |
|---|---|---|
| Head Node | c7g.xlarge | This instance runs the Slurm scheduler and lets users log in. It's the smallest possible size slurmctld can use (4 cores and 8 GiB RAM), and it has the same Neoverse V1 core architecture as the hpc7g.16xlarge instances, so if you compile code that does micro-architecture detection it will set the proper flags. |
| Compute Node | hpc7g.16xlarge | These are the compute nodes; make sure they're in the subnet created in use1-az6. |
| Filesystem | FSx Lustre | We recommend FSx Lustre as the shared filesystem. |

Once the template is modified, you can create the cluster like so:
```bash
pcluster create-cluster -n arm64 -c hpc7g.yaml
```
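Cluster creation takes a while; you can check on its progress from the CLI (using the cluster name from the command above) until the status reaches CREATE_COMPLETE:

```bash
# Re-run until "clusterStatus" in the output shows CREATE_COMPLETE.
pcluster describe-cluster -n arm64
```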
Install Arm Performance Libraries
Once the cluster is CREATE_COMPLETE, we can use Spack to install Arm Performance Libraries and a compiler with support for the Neoverse V1 core.
- First, install Spack on the FSx Lustre filesystem following the instructions here.
- Next, install the Arm Compiler for Linux (acfl), which has support for the Neoverse V1 core, and register it with Spack:

  ```bash
  spack install acfl
  spack load acfl
  spack compiler find
  spack unload acfl
  ```

- Then install Arm Performance Libraries (armpl), a set of performance libraries including BLAS, LAPACK, FFT, etc.:

  ```bash
  spack install armpl-gcc
  ```

- You'll now see an arm compiler when you list out compilers (a short usage sketch follows this list):

  ```bash
  spack compilers
  ```
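With the compiler registered, you can point Spack builds at it and at the performance libraries. A short, illustrative example; the package choice here is arbitrary:

```bash
# Make Arm Performance Libraries available to the environment for builds that need BLAS/LAPACK/FFT.
spack load armpl-gcc
# Build a package with the newly registered Arm compiler (acfl shows up as "arm" in Spack).
spack install openmpi %arm
```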
Restricting Cores
If we look back at the t-shirt sizes, we'll see that they already come with restricted core counts.
| Instance Size | Cores | Memory (GiB) | EFA Network Bandwidth | On-Demand Price (us-east-1, per hour) |
|---|---|---|---|---|
| hpc7g.4xlarge | 16 | 128 | 200 Gbps | $1.683 |
| hpc7g.8xlarge | 32 | 128 | 200 Gbps | $1.683 |
| hpc7g.16xlarge | 64 | 128 | 200 Gbps | $1.683 |
To submit jobs that use a specific instance type, pass the constraint flag with the instance type name:
```bash
salloc --constraint "hpc7g.4xlarge"
# wait for the instance to come up
```
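While you wait, you can watch the node's state from the head node. This assumes the queue is named hpc7g, as the node names below suggest:

```bash
# Node-oriented view of the hpc7g partition; the node state changes as the instance launches and joins Slurm.
sinfo -p hpc7g -N -l
```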
Once the instance is running, you can ssh in and see that it does indeed have fewer cores but equivalent memory.
```bash
ssh hpc7g-dy-hpc7g-4xlarge-1
```

```
$ lscpu
Architecture: aarch64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 16
On-line CPU(s) list: 0-15
Thread(s) per core: 1
Core(s) per socket: 16
Socket(s): 1
NUMA node(s): 1
Vendor ID: ARM
Model: 1
Stepping: r1p1
BogoMIPS: 2100.00
L1d cache: 64K
L1i cache: 64K
L2 cache: 1024K
L3 cache: 32768K
NUMA node0 CPU(s): 0-15
Flags: fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm jscvt fcma lrcpc dcpop sha3 sm3 sm4 asimddp sha512 sve asimdfhm dit uscat ilrcpc flagm ssbs paca pacg dcpodp svei8mm svebf16 i8mm bf16 dgh rng
```
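The same constraint works for batch submissions. A minimal sketch, where job.sh is a placeholder for your batch script:

```bash
# Request two whole 4xlarge nodes: 16 cores each, but still the full 128 GiB of memory per node.
sbatch --constraint "hpc7g.4xlarge" --nodes 2 --exclusive job.sh
```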
Slurm Multi-Cluster Mode
Slurm supports a feature called multi-cluster mode, which allows you to submit jobs across multiple clusters, making it possible to submit from the x86 cluster to the Arm cluster. For example, I could submit a job to the hpc7g partition of the aarch64 cluster like so:
```bash
sbatch --clusters aarch64 ...
```
The setup for this is outside the scope of this blog post; please see Configure Slurm Multi-Cluster Mode for instructions.
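Once multi-cluster mode is configured, the usual Slurm client commands all accept the cluster name. For example, to submit to the Arm cluster's hpc7g partition and then check its queue (job.sh is a placeholder):

```bash
sbatch --clusters aarch64 --partition hpc7g job.sh
squeue --clusters aarch64
```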