Deploying an HPC SLURM cluster

Introduction

In this tutorial you will learn how to deploy a high performance computing (HPC) cluster on the Catalyst Cloud using ElastiCluster and SLURM.

ElastiCluster is an open source tool to create and manage compute clusters on cloud infrastructures. The project was originally created by the Grid Computing Competence Center from the University of Zurich.

SLURM is a highly scalable cluster management and job scheduling system, used by many of the world’s supercomputers and computer clusters (it is the workload manager on about 60% of the TOP500 supercomputers).

The following video outlines what you will learn in this tutorial. It shows a SLURM HPC cluster being deployed automatically by ElastiCluster on the Catalyst Cloud, a data set being uploaded, the cluster being scaled on demand from 2 to 10 nodes, the execution of an embarrassingly parallel job, the results being downloaded, and finally, the cluster being destroyed.

Warning

This tutorial assumes you are starting with a blank project and that your VPC is used only by ElastiCluster. If you are doing this in a shared VPC, you may need to adjust some of the steps (for example, create a dedicated elasticluster security group, as sketched below).
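
If you do create a dedicated security group, the commands below are only a sketch using the same nova CLI used later in this tutorial; the group name elasticluster and the rules shown are examples, and the exact rules your cluster needs may differ:

nova secgroup-create elasticluster "ElastiCluster cluster nodes"
nova secgroup-add-rule elasticluster tcp 22 22 0.0.0.0/0
nova secgroup-add-group-rule elasticluster elasticluster tcp 1 65535

In that case, point the security_group setting in the [cluster/slurm] section of your ElastiCluster configuration at this group instead of default.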

Prerequisites

Install the Python development tools and virtualenv:

sudo apt-get install python-dev python-virtualenv

Create a virtual environment to install the software:

cd ~
virtualenv elasticluster
source elasticluster/bin/activate

Install Elasticluster on the virtual environment:

pip install elasticluster pyopenssl ndg-httpsclient pyasn1 ecdsa
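
As a quick sanity check that ElastiCluster was installed into the virtual environment, you can print its help output:

elasticluster --help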

Install the Catalyst Cloud OpenStack client tools:

pip install python-keystoneclient python-novaclient python-cinderclient python-glanceclient python-ceilometerclient python-heatclient python-neutronclient python-swiftclient

Configuring ElastiCluster

Create template configuration files for ElastiCluster:

elasticluster list-templates 1> /dev/null 2>&1
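
The command above produces no visible output (it is redirected to /dev/null); it is run only so that ElastiCluster writes its template configuration. You can confirm the configuration directory now exists:

ls ~/.elasticluster/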

Edit the ElastiCluster configuration file (~/.elasticluster/config). A sample configuration file compatible with the Catalyst Cloud is provided below:

[cloud/catalyst]
provider=openstack
auth_url=https://api.cloud.catalyst.net.nz:5000/v2.0
username=username
password=password
project_name=projectname
region_name=nz-por-1
request_floating_ip=True

[login/ubuntu]
image_user=ubuntu
image_user_sudo=root
image_sudo=True
user_key_name=elasticluster
user_key_private=~/elasticluster/id_rsa
user_key_public=~/elasticluster/id_rsa.pub

[setup/ansible-slurm]
provider=ansible
frontend_groups=slurm_master
compute_groups=slurm_clients

[cluster/slurm]
cloud=catalyst
login=ubuntu
setup_provider=ansible-slurm
security_group=default
# Ubuntu image
image_id=fe2a52bd-1881-45a6-8c16-d0a1005a1a4e
flavor=c1.c1r1
frontend_nodes=1
compute_nodes=2
ssh_to=frontend
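
The image_id and flavor shown above are only examples and will change over time. After sourcing your openrc file (as described in the next section), you can list the images and flavours currently available to your project with the OpenStack client tools installed earlier, and substitute the values you want to use:

nova image-list
nova flavor-list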

Configuring the cloud

Create SSH keys for ElastiCluster (no passphrase):

ssh-keygen -t rsa -b 4096 -f ~/elasticluster/id_rsa

Source your openrc file, as explained in Command line interface (CLI).
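
For example, assuming you saved your openrc file as ~/openrc.sh (the path and file name are only placeholders; use wherever you saved yours):

source ~/openrc.sh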

Allow ElastiCluster to connect to instances over SSH:

nova secgroup-add-group-rule default default tcp 22 22
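
If you want to double-check that the rule is in place, you can list the rules on the default security group:

nova secgroup-list-rules default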

Using ElastiCluster

The following commands are provided as examples of how to use ElastiCluster to create and interact with a simple SLURM cluster. For more information on ElastiCluster, please refer to https://elasticluster.readthedocs.org/.

Deploy a SLURM cluster on the cloud using the configuration provided:

elasticluster start slurm -n cluster

List information about the cluster:

elasticluster list-nodes cluster

Connect to the front-end node of the SLURM cluster over SSH:

elasticluster ssh cluster

Connect to the front-end node of the SLURM cluster over SFTP, to upload (put file-name) or download (get file-name) data sets:

elasticluster sftp cluster
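
Within the SFTP session the usual put and get commands apply; the file names below are placeholders for your own data set and result files:

sftp> put dataset.tar.gz
sftp> get results.txt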

Grow the cluster to 10 nodes (add another 8 nodes):

elasticluster resize cluster -a 8:compute

Terminate (destroy) the cluster:

elasticluster stop cluster

Using SLURM

Connect to the front-end node of the SLURM cluster over SSH, as described in the previous section.

The following example demonstrates how to create a simple, embarrassingly parallel job that runs four tasks and writes its output to results.txt. Save the script on the front-end node as job.sh:

#!/bin/bash
#
#SBATCH --job-name=test
#SBATCH --output=results.txt
#
#SBATCH --ntasks=4
#SBATCH --time=10:00
#SBATCH --mem-per-cpu=100

# Print the hostname of the node running each task
srun hostname
# Print the SLURM process ID of each task
srun printenv SLURM_PROCID
# Simulate some work
srun sleep 15

Submit a job:

sbatch job.sh
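
On success, sbatch replies with the identifier assigned to the job (the number will differ on your cluster):

Submitted batch job 1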

List the jobs in the queue:

squeue
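
squeue only shows jobs that are still pending or running, so once your job no longer appears in the list it has completed. Its output is written to results.txt (the file set by --output in the job script), which you can then inspect on the front-end node:

cat results.txt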