Setup Licensing with AWS ParallelCluster and Slurm

Posted on Dec 3, 2021
tl;dr: Check licenses in Slurm before starting compute instances.

Slurm can track licenses. For example, if you have 100 LS-Dyna licenses available, jobs that would exceed that count stay pending until some of the licenses free up. Slurm has two ways of doing this:

  • Local Licenses - these are defined in slurm.conf and are local to one cluster. Use this if you have only one cluster.
  • Remote Licenses - despite the name, Slurm isn’t actually checking the license server. All this means is that the count is tracked in the Slurm accounting database (slurmdbd) instead of locally on the cluster.

In this guide we’ll assume you have AWS ParallelCluster set up and running with Slurm accounting enabled.

Static License Checking

First we’re going to add a static number of licenses to Slurm, which lets it increment and decrement a counter as jobs start and finish. This approach is enough if a single cluster is the only consumer of these licenses. If you have multiple clusters, or other users consuming licenses outside Slurm, you’ll also need the license checking described under Dynamic License Updates.
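The counter behaviour described above can be pictured with a toy model. This is only a sketch of the bookkeeping idea, not Slurm’s implementation; the LicensePool class and its method names are hypothetical.

```python
class LicensePool:
    """Toy model of a static license counter (hypothetical, not Slurm code)."""

    def __init__(self, total):
        self.total = total
        self.used = 0

    @property
    def free(self):
        return self.total - self.used

    def try_start(self, requested):
        """Reserve licenses for a job; return False if it must stay pending."""
        if requested > self.free:
            return False
        self.used += requested
        return True

    def finish(self, requested):
        """Return a finished job's licenses to the pool."""
        self.used = max(0, self.used - requested)


pool = LicensePool(total=100)
started_a = pool.try_start(64)        # fits: 64 of 100 now in use
started_b = pool.try_start(48)        # would need 112 in total, so it waits
pool.finish(64)                       # first job ends, licenses are released
started_b_retry = pool.try_start(48)  # now there is room
```

The limitation is visible in the model: the pool only knows about jobs it started itself, so licenses checked out elsewhere are invisible to it.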

  1. First append the licenses to /opt/slurm/etc/slurm.conf (note the >>; a single > would overwrite the whole file):
cat <<EOF >> /opt/slurm/etc/slurm.conf
# Licenses
Licenses=lsdyna:100
EOF
  2. Restart slurmctld:
systemctl restart slurmctld
  3. Now check scontrol to see the new licenses:
$ scontrol show lic
LicenseName=lsdyna
    Total=100 Used=0 Free=100 Remote=no
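If you want to read these counters from a script, the output can be parsed as key=value pairs. The following is a minimal sketch that assumes the output layout shown above (other Slurm versions may print additional fields); parse_licenses is a hypothetical helper name, not part of Slurm.

```python
import re


def parse_licenses(text):
    """Parse `scontrol show lic` text into {name: {field: value}}."""
    licenses = {}
    current = None
    for line in text.splitlines():
        line = line.strip()
        if line.startswith("LicenseName="):
            current = line.split("=", 1)[1]
            licenses[current] = {}
        elif current is not None:
            # Remaining lines hold space-separated key=value pairs
            for key, value in re.findall(r"(\w+)=(\S+)", line):
                licenses[current][key] = int(value) if value.isdigit() else value
    return licenses


# Sample text copied from the scontrol output above
sample = """LicenseName=lsdyna
    Total=100 Used=0 Free=100 Remote=no
"""
parsed = parse_licenses(sample)
print(parsed["lsdyna"]["Free"])
```

On a running cluster the text would come from the scontrol show lic command itself, for example captured with subprocess.run.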

Dynamic License Updates

  1. First grab the cluster name by running:
sacctmgr show clusters format=cluster,controlhost
  2. Next add the license as a remote resource, substituting the cluster name from the previous step for parallelcluster:
sacctmgr add resource name=lsdyna type=license count=100 server=flex_host servertype=flexlm cluster=parallelcluster
  3. Now we’re going to write a script, /shared/scripts/license-update.py, that queries the license server for the number of licenses currently in use and then updates the license count in Slurm:
#!/usr/bin/env python3

import subprocess

# Hard code the total number of licenses available to on-prem and AWS
total_lic = 100
print('Total licenses: %s' % total_lic)

# Query LSTC to get the current license usage
out = subprocess.run(
    ['/fsx/sw/LSTC_LicenseManager/lstc_qrun', '-s', '10.0.0.10'],
    stdout=subprocess.PIPE, universal_newlines=True).stdout
print(out)

# Sum the license count (columns 72-76) of every row.
# Rows for user 'centos' are skipped (assumed here to be this cluster's
# own jobs, which Slurm already tracks itself).
n_lic = 0
for line in out.split('\n'):
    user = line[:10].strip(' ')
    if user == 'centos':
        continue
    try:
        n_lic += int(line[72:76])
    except ValueError:
        # Header, separator, or blank line: no number in those columns
        continue
    print(user)
print('Licenses in use: %s' % n_lic)

avail = total_lic - n_lic
print(avail)

# Update the Slurm resource with the new value.
# name= and server= must match the resource added with sacctmgr above.
subprocess.run(['/opt/slurm/bin/sacctmgr', '-i', 'modify', 'resource',
                'name=lsdyna', 'server=flex_host', 'set', 'count=%d' % avail])
  4. Make the script executable, then create a crontab entry to run it every minute:
chmod +x /shared/scripts/license-update.py
crontab -e
* * * * * /shared/scripts/license-update.py
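The counting logic in the script is easier to check when it is pulled into functions that can be run without a license server. Below is a minimal sketch under stated assumptions: the only thing known about the lstc_qrun layout is what the script relies on (user name in the first 10 columns, license count in columns 72-76), so the sample rows are synthetic, and the function names and the clamp to zero are illustrative additions.

```python
def external_licenses_in_use(qrun_output, own_user="centos"):
    """Sum the count in columns 72-76 of every row not owned by own_user."""
    in_use = 0
    for line in qrun_output.splitlines():
        if line[:10].strip() == own_user:
            continue
        try:
            in_use += int(line[72:76])
        except ValueError:
            continue  # header, separator, or blank line
    return in_use


def count_for_slurm(total, in_use):
    """Licenses Slurm may hand out; clamped so it is never negative."""
    return max(0, total - in_use)


def fake_row(user, count):
    # Synthetic row: user padded to column 72, count right-aligned in 72-75
    return user.ljust(72) + str(count).rjust(4)


sample = "\n".join([
    "User       (synthetic header, not real lstc_qrun output)",
    fake_row("alice", 16),
    fake_row("centos", 32),
    fake_row("bob", 8),
])
in_use = external_licenses_in_use(sample)
avail = count_for_slurm(100, in_use)
```

Before relying on the column offsets, compare them against real output from your own license server, since the layout may differ between versions.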

Using License Constraints

  1. In your sbatch file add the following line. The number should equal the number of cores requested by the job:
#SBATCH -L lsdyna:100
If you want the job to draw from the remote license added under Dynamic License Updates, request it by its name and server instead, as in lsdyna@flex_host:100.
  2. Then submit the job. If there are insufficient licenses available, the job will stay in the PD (pending) state until licenses free up:
sbatch submit.sh
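Because the license count has to track the core count, it can help to generate both from one number rather than editing them separately. This is a small sketch, assuming one core per task so that -n equals the number of cores; sbatch_command is a hypothetical helper and the license name defaults to the lsdyna example used above.

```python
import shlex


def sbatch_command(script, ntasks, license_name="lsdyna"):
    """Build an sbatch command that requests one license per task."""
    return ["sbatch", "-n", str(ntasks),
            "-L", "%s:%d" % (license_name, ntasks), script]


cmd = sbatch_command("submit.sh", 64)
# Quote each argument so the printed line can be pasted into a shell
print(" ".join(shlex.quote(part) for part in cmd))
```

Options given on the sbatch command line take precedence over #SBATCH lines in the script, so this can be used with an existing submit.sh.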