GPU Monitoring with Grafana 📊

Posted on Oct 26, 2023
tl;dr: Using Grafana you can set up custom dashboards to monitor your Slurm cluster, including stats like the number of jobs running, GPU utilization, memory consumption, EFA traffic, etc.

Grafana Screenshot

Grafana is an open-source tool that allows us to create dashboards and monitor our cluster. In the following guide we’ll show you how to set up Grafana, Prometheus, the Slurm exporter, and the DCGM exporter to monitor a cluster. This will help you track things like:

  • Number of jobs/instances running
  • CPU utilization
  • GPU utilization
  • Memory usage
  • EFA (network) traffic
  • Disk IOPS

We’ll set up the following exporters, but don’t limit yourself to just these. There are thousands of useful Prometheus exporters that can be plugged into this same architecture.

Prometheus Exporter         Description
Slurm Prometheus Exporter   Slurm scheduler metrics such as the number of jobs, instances in a DOWN state, number of users, etc.
DCGM Exporter               GPU metrics
EFA Exporter                EFA traffic metrics such as packets sent and received.
Node Exporter               General instance information such as CPU utilization, memory utilization, etc.

In the following sections we set this up for AWS ParallelCluster, however the same steps apply to any Slurm based cluster.

Setup Cluster

The first step is to set up a cluster with AWS ParallelCluster. To aid in this process, you can use the following template:

Template 🚀

If you’re unfamiliar with AWS ParallelCluster and want more context, see my workshop.

If you don’t want to use the linked template, make sure you include the policy arn:aws:iam::aws:policy/AmazonPrometheusFullAccess in the AdditionalIamPolicies section of the HeadNode and compute nodes, like so:

    - arn:aws:iam::aws:policy/AmazonPrometheusFullAccess
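If it helps, here’s a sketch of where that policy sits in a ParallelCluster 3 config; the queue name is a placeholder:

```yaml
HeadNode:
  Iam:
    AdditionalIamPolicies:
      - Policy: arn:aws:iam::aws:policy/AmazonPrometheusFullAccess
Scheduling:
  SlurmQueues:
    - Name: compute  # placeholder queue name
      Iam:
        AdditionalIamPolicies:
          - Policy: arn:aws:iam::aws:policy/AmazonPrometheusFullAccess
```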

Setup Grafana

In this step we’ll set up Amazon Managed Grafana, a hosted version of Grafana that will plot metrics collected from Amazon Managed Service for Prometheus and CloudWatch.

  1. First navigate to the Grafana Console > click Create.

  2. Next give it a name like aws-parallelcluster

    Grafana Setup

  3. On the next screen select the following options:

• Select IAM Identity Center as the authentication method
    • Click Create User

    Grafana Setup

  4. Next enter a valid email as well as First name and Last name

    Grafana Setup

  5. Enable the following two data sources

    • Select Amazon Managed Service for Prometheus
    • Select Amazon CloudWatch

    Grafana Setup

  6. On the next screen click Create Workspace

  7. After the workspace creates, click on the login link to sign in. You should have received an email with a password; enter that along with your email to connect:

    Grafana home

Congrats! You just created a managed Grafana workspace. In the next section we’ll set up Prometheus, a time-series database that’ll act as the data store for everything we want to graph in Grafana.

Setup Prometheus

In this step we’ll set up Prometheus using Amazon Managed Service for Prometheus (AMP), a fully managed, serverless Prometheus.

  1. Create a Prometheus workspace using the AWS CLI:

    export AWS_DEFAULT_REGION=us-east-1
    WORKSPACE_ID=$(aws amp create-workspace --region $AWS_DEFAULT_REGION --alias aws-parallelcluster --query workspaceId --output text)
    echo $WORKSPACE_ID
    echo "export WORKSPACE_ID=$WORKSPACE_ID" >> ~/.bashrc
    echo "export AWS_DEFAULT_REGION=$AWS_DEFAULT_REGION" >> ~/.bashrc
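For reference, the workspace ID feeds into a fixed AMP remote-write URL scheme, which we’ll need in the Prometheus config below. A minimal sketch of how the endpoint is composed (the workspace ID here is made up for illustration):

```shell
# Compose the AMP remote-write endpoint from a region and workspace ID.
amp_remote_write_url() {
  region="$1"
  workspace_id="$2"
  echo "https://aps-workspaces.${region}.amazonaws.com/workspaces/${workspace_id}/api/v1/remote_write"
}

# Example with a hypothetical workspace ID
amp_remote_write_url us-east-1 ws-example-1234
```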


  2. Download and install the Prometheus server binaries (the version below is an example; grab the latest release from the Prometheus downloads page):

    wget https://github.com/prometheus/prometheus/releases/download/v2.43.0/prometheus-2.43.0.linux-amd64.tar.gz
    tar xvfz prometheus-*.tar.gz
    cd prometheus-*
    sudo mv prometheus /usr/bin/
    sudo mv promtool /usr/bin/

  3. Create a Prometheus config file. Make sure AWS_DEFAULT_REGION and WORKSPACE_ID are set in your shell, since the heredoc below expands them:

cat > prometheus.yml << EOF
global:
  scrape_interval: 15s
  evaluation_interval: 15s
  scrape_timeout: 15s

scrape_configs:
  - job_name: 'slurm_exporter'
    scrape_interval: 30s
    scrape_timeout: 30s
    static_configs:
      - targets: ['localhost:8080']

  - job_name: 'ec2_instances'
    scrape_interval: 5s
    ec2_sd_configs:
      - port: 9100
        refresh_interval: 10s
      - port: 9400
        refresh_interval: 10s
        filters:
          - name: instance-state-name
            values:
              - running
          - name: tag:Name
            values:
              - Compute
          - name: instance-type
            values:
              - p2.xlarge
              - p2.8xlarge
              - p2.16xlarge
              - p3.2xlarge
              - p3.8xlarge
              - p3.16xlarge
              - p3dn.24xlarge
              - p4d.24xlarge
              - g3s.xlarge
              - g3.4xlarge
              - g3.8xlarge
              - g3.16xlarge
              - g4dn.xlarge
              - g4dn.2xlarge
              - g4dn.4xlarge
              - g4dn.8xlarge
              - g4dn.16xlarge
              - g4dn.12xlarge
              - g4dn.metal
    relabel_configs:
      - source_labels: [__meta_ec2_tag_Name]
        target_label: instance_name
      - source_labels: [__meta_ec2_tag_Application]
        target_label: instance_grafana
      - source_labels: [__meta_ec2_instance_id]
        target_label: instance_id
      - source_labels: [__meta_ec2_availability_zone]
        target_label: instance_az
      - source_labels: [__meta_ec2_instance_state]
        target_label: instance_state
      - source_labels: [__meta_ec2_instance_type]
        target_label: instance_type
      - source_labels: [__meta_ec2_vpc_id]
        target_label: instance_vpc

remote_write:
  - url: https://aps-workspaces.${AWS_DEFAULT_REGION}.amazonaws.com/workspaces/${WORKSPACE_ID}/api/v1/remote_write
    queue_config:
      max_samples_per_send: 1000
      max_shards: 200
      capacity: 2500
    sigv4:
      region: ${AWS_DEFAULT_REGION}
EOF

sudo mkdir -p /etc/prometheus
sudo mv prometheus.yml /etc/prometheus/prometheus.yml
  4. You should now be able to test the Prometheus install by running:

    prometheus --config.file /etc/prometheus/prometheus.yml


If this works, it’ll start a process listening on http://localhost:9090; you can go ahead and Ctrl-C out of it. (You can also validate the config first with promtool check config /etc/prometheus/prometheus.yml.) Next we’ll set up a systemd service to run this process automatically in the background.

  5. Create a systemd service file like so:

    sudo su
    cat > /etc/systemd/system/prometheus.service << EOF
    [Unit]
    Description=Prometheus Exporter
    [Service]
    ExecStart=/usr/bin/prometheus --config.file=/etc/prometheus/prometheus.yml
    Restart=always
    [Install]
    WantedBy=multi-user.target
    EOF
    exit


  6. Enable the Prometheus service; the status should show as running:

    sudo systemctl daemon-reload
    sudo systemctl enable --now prometheus
    sudo systemctl status prometheus


  7. Test by querying for current metrics:

    curl http://localhost:9090/metrics


Congrats! We just set up a managed Prometheus server. In the next section we’ll add useful data to Prometheus, starting with the Slurm exporter.

Setup Exporters

Now that we have the base infrastructure in place, we can start setting up exporters; these collect information and expose it for Prometheus to scrape. We’ll start with the Slurm exporter.

Setup Slurm Prometheus Exporter

  1. Install Go and compile the Slurm exporter on the HeadNode:

    sudo yum install -y golang
    git clone -b 0.20 https://github.com/vpenso/prometheus-slurm-exporter.git
    cd prometheus-slurm-exporter
    make && sudo cp bin/prometheus-slurm-exporter /usr/bin/


  2. Create and start a systemd service on the HeadNode:

    sudo su
    cat > /etc/systemd/system/prometheus-slurm-exporter.service << EOF
    [Unit]
    Description=Prometheus SLURM Exporter
    [Service]
    ExecStart=/usr/bin/prometheus-slurm-exporter
    Restart=always
    [Install]
    WantedBy=multi-user.target
    EOF
    exit
    sudo systemctl daemon-reload
    sudo systemctl enable --now prometheus-slurm-exporter
    sudo systemctl status prometheus-slurm-exporter


  3. Test by querying for current metrics:

    curl http://localhost:8080/metrics
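The exporter returns plain-text Prometheus exposition format, which you can filter with standard tools. A rough sketch using a made-up sample (metric names and values here are illustrative; in practice pipe `curl -s http://localhost:8080/metrics` instead of the heredoc):

```shell
# Illustrative sample of Slurm exporter output
cat << 'EOF' > /tmp/sample_slurm_metrics.txt
# HELP slurm_queue_running Running jobs in the queue
# TYPE slurm_queue_running gauge
slurm_queue_running 12
# HELP slurm_queue_pending Pending jobs in the queue
# TYPE slurm_queue_pending gauge
slurm_queue_pending 3
EOF

# Print the value of one metric, skipping HELP/TYPE comment lines
awk '$1 == "slurm_queue_running" { print $2 }' /tmp/sample_slurm_metrics.txt
```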


Setup Node Exporter

  1. Similar to the Slurm exporter, we’ll also set up Node Exporter, a tool that publishes stats about each instance. To get started, download and install it (the version below is an example; grab the latest release from the Prometheus downloads page):

    wget https://github.com/prometheus/node_exporter/releases/download/v1.6.1/node_exporter-1.6.1.linux-amd64.tar.gz
    tar xvfz node_exporter-*.*-amd64.tar.gz
    cd node_exporter-*.*-amd64
    sudo mv node_exporter /usr/bin


  2. Next, set up a systemd service to run it automatically. Make sure the status shows running.

    sudo su
    cat > /etc/systemd/system/node-exporter.service << EOF
    [Unit]
    Description=Prometheus Node Exporter
    [Service]
    ExecStart=/usr/bin/node_exporter
    Restart=always
    [Install]
    WantedBy=multi-user.target
    EOF
    exit
    sudo systemctl daemon-reload
    sudo systemctl enable --now node-exporter
    sudo systemctl status node-exporter


  3. Test by querying for current metrics:

    curl http://localhost:9100/metrics
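Node Exporter reports raw gauges and counters (e.g. node_memory_MemTotal_bytes); derived values like percent memory used are computed at query time in Grafana. A small sketch of that arithmetic with made-up byte counts:

```shell
# Sample gauge values as node_exporter might report them (illustrative)
MEM_TOTAL_BYTES=8000000000
MEM_AVAILABLE_BYTES=2000000000

# Integer form of the PromQL expression
# (1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) * 100
MEM_USED_PCT=$(( (MEM_TOTAL_BYTES - MEM_AVAILABLE_BYTES) * 100 / MEM_TOTAL_BYTES ))
echo "memory used: ${MEM_USED_PCT}%"
```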


Setup DCGM Exporter

DCGM Exporter is a tool for exporting GPU metrics from NVIDIA GPUs. Here’s an example of the stats you can monitor:

DCGM Exporter

You’ll need to have DCGM installed on the AMI; it’s pre-installed on the Deep Learning AMI, so I’ll assume you already have it. To check for it, run:

    systemctl status nvidia-dcgm

  1. Now we can clone and build dcgm-exporter:

    git clone https://github.com/NVIDIA/dcgm-exporter.git
    cd dcgm-exporter/
    make binary
    sudo make install


  2. Create a systemd service and enable it:

    sudo su
    cat > /etc/systemd/system/dcgm-exporter.service << EOF
    [Unit]
    Description=dcgm Exporter
    [Service]
    ExecStart=/usr/bin/dcgm-exporter
    Restart=always
    [Install]
    WantedBy=multi-user.target
    EOF
    exit
    sudo systemctl daemon-reload
    sudo systemctl enable --now dcgm-exporter
    sudo systemctl status dcgm-exporter

  3. Test by querying for current metrics:

    curl http://localhost:9400/metrics


Import Dashboards

Next we can add dashboards to Grafana to display all this information!
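When building panels, you’ll write PromQL queries against the metrics these exporters publish. A few starting points (the Slurm metric name is an assumption; check your exporter’s /metrics output for exact names):

```
# Average GPU utilization across all instances (DCGM exporter)
avg(DCGM_FI_DEV_GPU_UTIL)

# GPU utilization per instance (instance_id comes from our relabel_configs)
avg by (instance_id) (DCGM_FI_DEV_GPU_UTIL)

# Percent memory used per instance (Node Exporter)
(1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) * 100

# Running jobs (Slurm exporter; metric name may vary by version)
slurm_queue_running
```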
