Compute Cluster

Last modified: 2022-02-10

This is a short introduction on how to use our compute cluster. If you have suggestions for improving it, or if you think something is missing, please let us know. We will try to incorporate your feedback so that future students (and colleagues) can benefit from it.

Introduction

The general idea of a computational grid, or compute cluster, is to provide a uniform interface to several, possibly heterogeneous, machines. From the software point of view, the system looks like one large machine with multiple cores/CPUs, all providing the same technical infrastructure with respect to environment settings, e.g., paths to libraries.

The Purpose of a Batch-Queuing System

The purpose of a batch-queuing system like Sun Grid Engine (SGE) is to provide a uniform user interface for efficiently handling a collection of (heterogeneous) hardware. On the one hand, this means that the user needs only a few commands to perform all tasks. On the other hand, some restrictions have to be accepted. These restrictions are, however, minimal.

IMPORTANT

Things you should always remember:

  • Do not use more memory than you asked for. Using more memory than you asked for leads to swapping, which will affect the runtime of your jobs and the jobs of other users. See qsub options (mem_free, s_vmem, h_vmem).
  • Do not write to and read from /home1 extensively. Doing so leads to high I/O load, which will affect the runtime of your jobs and the jobs of other users. See $TMPDIR for a solution.

Clusternode Hardware


Nodes  Hostnames  Hostgroup  Cores/Node  SMT  Slots/Node  RAM/Node  $TMPDIR/Node  Description/Node
14     bs01-bs14  bc1        8           OFF  8           24GB      38GB          2× Intel Xeon E5540, 2.53 GHz Quad Core
1      bs15       bc2        12          OFF  12          72GB      105GB         2× Intel Xeon E5649, 2.53 GHz 6-core
2      bs16-bs17  bc2        12          OFF  12          60GB      105GB         2× Intel Xeon E5649, 2.53 GHz 6-core
2      bs31-bs32  bc3        12          OFF  12          48GB      80GB          2× Intel Xeon E5-2630 v2, 2.60GHz 6-core
16     b401-b416  bc4        20          OFF  20          160GB     188GB         2× Intel Xeon E5-2640 v4, 2.40GHz 10-core
3      b501-b503  bc5        48          ON   96          1024GB    1TB           2× AMD EPYC 7402, 2.80GHz 24-core

Clusternode Overview

To get an overview of the clusternodes and their load you can use the command:

qhost

Running Jobs

This section provides a short introduction on how to submit one or several jobs to the SGE. This is easy as long as you understand a little bit of shell scripting. For submitting there are two different kinds of jobs, simple jobs and array jobs:

Simple Jobs

Always keep in mind that a simple job should be single-threaded, i.e., it should use one available slot and therefore no more than 100% CPU time.
A simple job is submitted by first creating a shell script and then typing one command on your command line. The shell script looks something like the following:

#!/bin/bash

your_program your_parameters

Let us assume you saved your script into the file script.sh. Then you simply type

qsub -N JobName -l h_vmem=2G -l variables_list -r y -e /dev/null -o /dev/null path-to-script/script.sh parameters 

on your console, with the following meanings:

  • -N JobName -- assigns a name to your job
  • -l h_vmem=2G -- request a memory limit of 2GB for the job
  • -l variables_list -- see below at Variables to be Requested
  • -r y -- ensures that a job is rescheduled (restarted) when the execution host the job was currently running on crashes
  • -e /dev/null -- writes the job's standard error to the given path (here it is discarded via /dev/null)
  • -o /dev/null -- writes the job's standard output to the given path (here it is discarded via /dev/null)
  • path-to-script/script.sh parameters -- calls your script with the given command line parameters
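
The same options can also be kept inside the script itself, because the grid engine reads lines beginning with #$ as embedded qsub options. The following is a minimal sketch, not a recommended template: it assumes that -S /bin/bash is accepted on this cluster, and run_payload is a hypothetical stand-in for your_program. The remaining options are the ones explained above.

```shell
#!/bin/bash
#$ -N JobName
#$ -S /bin/bash
#$ -l h_vmem=2G
#$ -r y
#$ -e /dev/null
#$ -o /dev/null

# Stop at the first failing command or unset variable.
set -eu

# Hypothetical stand-in for "your_program your_parameters";
# replace the body with the real call.
run_payload() {
    echo "would run: your_program $*"
}

run_payload "$@"
```

With these lines in place, qsub path-to-script/script.sh parameters is enough. Options given on the command line normally take precedence over embedded ones.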

Multithreaded Jobs

Submitting a job that uses multiple threads is possible.
For example, if you submit a job "job01.sh" that uses 4 threads, request 4 slots when submitting, as in the following:

qsub -pe pthreads 4 -l h_vmem=2G job01.sh

The grid engine does not check how many threads are actually started. The number of slots is only a value used by the scheduler, so that the CPU load does not rise above 100%!
The requested memory is multiplied by the number of requested slots: if you start a job requesting 4 slots/threads and 2GB of memory, the scheduler will reserve 8GB of memory for this job.
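
Because the memory request is multiplied by the number of slots, a job with a known total memory need has to divide that total by the slot count before submitting. The helper below is a small sketch of this arithmetic. The function name per_slot_mb and the 8 GB total are made up for illustration, and the qsub line is only printed, not executed.

```shell
#!/bin/bash

# per_slot_mb TOTAL_MB SLOTS
# Prints the per-slot request in MB, rounded up, so that
# SLOTS * result is at least TOTAL_MB.
per_slot_mb() {
    echo $(( ($1 + $2 - 1) / $2 ))
}

slots=4
total_mb=8192    # assumption: the job as a whole needs about 8 GB

# Dry run: print the command instead of submitting it.
echo "qsub -pe pthreads $slots -l h_vmem=$(per_slot_mb "$total_mb" "$slots")M job01.sh"
```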

Array Jobs

Array jobs are useful when performing hundreds of runs with the same parameter setting. An array job is just a shortcut for iteratively submitting simple jobs. The job itself is split into tasks numbered according to the parameters given (see below). You again need your script.sh as specified above. An array job is then submitted using the following syntax:

qsub -N JobName -l h_vmem=2G -l variables_list -r y -e /dev/null -o /dev/null -t 1-10:1 path-to-script/script.sh parameters

with the same meanings as above. The only thing that changes is:

  • -t 1-10:1 -- specifies that your job is split into tasks numbered 1 to 10 using step size 1; for example, 5-20:5 would generate tasks with the ids 5, 10, 15 and 20
The variable $SGE_TASK_ID is set in script.sh for each task, i.e., the current task id can be accessed through this variable. This might be useful for writing log files etc.
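
A common use of $SGE_TASK_ID is to let each task pick its own input. The sketch below assumes a hypothetical list of instance names and selects the entry whose position matches the task id. Outside the grid, where $SGE_TASK_ID is not set, it falls back to task 1 so that the script can be tried locally.

```shell
#!/bin/bash

# Hypothetical instance names, one per task (task ids start at 1).
instances="inst_005_01.graph inst_005_02.graph inst_005_03.graph"

# pick_instance TASK_ID
# Prints the TASK_ID-th word of $instances, or fails if out of range.
pick_instance() {
    n=0
    for name in $instances; do
        n=$((n + 1))
        if [ "$n" -eq "$1" ]; then
            echo "$name"
            return 0
        fi
    done
    echo "no instance for task $1" >&2
    return 1
}

task_id=${SGE_TASK_ID:-1}    # set by the grid engine for array jobs
pick_instance "$task_id"
```

Submitted with -t 1-3:1, each of the three tasks would select a different instance name.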

 

Example Job

Assume you want to test two different algorithms on 5 instances of graphs of 4 different sizes.
First a script script.sh looking something like

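#!/bin/bash

# $1 is the first parameter given after the script path on the qsub line.
# $SGE_TASK_ID is set by the grid engine for each task of an array job.
logFile=$(printf "%s_r%02d.log" "$1" "$SGE_TASK_ID")
your_program -log "$logFile"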

is defined (of course, your parameter for logging might be named differently). Then a second script callingLoop.sh is specified, looking like

#!/bin/bash

# Submit one array job (10 tasks each) per algorithm, size and instance.
for alg in alg1 alg2
do
  for size in 005 010 020 040
  do
    for (( i=1; i <= 5; ++i ))
    do
      instName=$(printf "inst_%s_%02d.graph" "$size" "$i")
      logName=$(printf "log_%s_%s_%02d" "$alg" "$size" "$i")

      qsub -N jobName -l h_vmem=2G -l variables_list -t 1-10:1 -r y -e /dev/null -o /dev/null path-to-script/script.sh "$logName" "$instName" "$alg"
    done
  done
done

Then you need to type

./callingLoop.sh

into your console.
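
Before submitting, it can help to check how many jobs such a loop will create. The sketch below repeats the three nested loops but only counts the qsub calls instead of issuing them. The variable names njobs and tasks_per_job are illustrative.

```shell
#!/bin/bash

njobs=0
tasks_per_job=10    # from -t 1-10:1

for alg in alg1 alg2; do
    for size in 005 010 020 040; do
        for i in 1 2 3 4 5; do
            njobs=$((njobs + 1))
        done
    done
done

echo "array jobs: $njobs, tasks in total: $((njobs * tasks_per_job))"
```

For the example above this gives 2 × 4 × 5 = 40 array jobs and 400 tasks.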

$TMPDIR

If you want to use local disk space on the compute nodes, you can use the variable $TMPDIR as a directory name in your job script. The grid engine creates a temporary directory on the node for each job when the job script is started and automatically deletes it when the job script finishes. Please do not use /tmp directly!
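
A typical pattern is to copy the input to $TMPDIR, work there, and copy only the results back at the end. The following is a minimal sketch under several assumptions. The file names are hypothetical and the tr call merely stands in for your_program. The mktemp fallback exists only so that the script can be tried outside the grid, where $TMPDIR may not be set.

```shell
#!/bin/bash
set -eu

# Inside a job the grid engine sets TMPDIR; the fallback is for local tests only.
workdir=${TMPDIR:-$(mktemp -d)}

# Hypothetical locations of input and result (normally on the shared file system).
input=${1:-$workdir/example_input.txt}
result=${2:-$workdir/example_result.txt}

# Demo only: create a small input file if none exists.
[ -f "$input" ] || echo "hello cluster" > "$input"

# 1. Copy the input to the node-local disk.
cp "$input" "$workdir/in.txt"

# 2. Work on the local copy; tr stands in for your_program.
tr 'a-z' 'A-Z' < "$workdir/in.txt" > "$workdir/out.txt"

# 3. Copy only the final result back.
cp "$workdir/out.txt" "$result"
```

This way the heavy reading and writing happens on the local disk, and /home1 is touched only once at the start and once at the end.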

Job Status

To gather information on the status of your jobs simply type

qstat

on the console.

man qstat

might be useful if you are interested in more details.

Deleting Jobs

Sometimes you might want to delete jobs that were (possibly) submitted by mistake. This can easily be done by typing

qdel <job_id>

The parameter <job_id> is the id of the job, as displayed either during submission or by the qstat command.

Variables to be Requested

As indicated above, it is possible to request variables, also called attributes. This is useful to ensure that the requested features are provided by the executing host, i.e., the machine used to compute the submitted job. In addition to the standard variables implemented by every SGE, the following variables can be requested.

  • noX -- requests that jobs are executed on members of the noX cluster, i.e. no1-5 = 5x (1× Intel Core 2 Quad, 2.83GHz; 8GB RAM)
  • bc1 -- requests that jobs are executed on members of the blade center 1 cluster, i.e. bs01-14 = 14x (2× Intel Xeon E5540, 2.53 GHz Quad Core; 24GB RAM)
  • bc2 -- requests that jobs are executed on members of the blade center 2 cluster, i.e. bs15 = (2× Intel Xeon E5649, 2.53 GHz 6-core; 72GB RAM), bs16-bs17 = 2x (2× Intel Xeon E5649, 2.53 GHz 6-core; 60GB RAM)
  • bc3 -- requests that jobs are executed on members of the blade center 3 cluster, i.e. bs31-bs32 = 2x (2× Intel Xeon E5-2630 v2, 2.60GHz 6-core; 48GB RAM)
  • bc4 -- requests that jobs are executed on members of the blade center 4 cluster, i.e. b401-b416 = 16x (2× Intel Xeon E5-2640 v4, 2.40GHz 10-core; 160GB RAM)
  • bladeX -- requests that jobs are executed on members of the blade center 1, 2, 3 and 4 clusters.
  • longrun -- if you expect your job to take more than 10 hours to complete, please submit it with the option -l longrun=1. This ensures that no more than 250 long-running jobs run at the same time, so that nodes remain free for normal jobs.
  • mem_free (consumable=yes, default=1.9G) -- requests that jobs are only submitted to nodes having at least the specified amount of memory free. The specified amount is subtracted from the available memory of the node, but it is not checked whether the job actually uses more memory! Jobs that exceed their requested memory amount might have a negative influence on other jobs! If you are unsure about the memory consumption of your jobs, it is strongly recommended to use h_vmem and/or s_vmem.
  • s_vmem (consumable=no, default=0) -- requests a soft memory limit, i.e., if a job exceeds the specified memory limit it receives the signal SIGXCPU, which can be used to terminate gracefully (and write some last logging information). It should be set slightly lower than h_vmem to take effect. (default=0 means no limit)
  • h_vmem (consumable=no, default=0) -- requests a hard memory limit, i.e., if a job exceeds the specified memory limit it is aborted via a SIGKILL signal. (default=0 means no limit)

Variables are requested as in the following example:

qsub -l noX -l bladeX -l mem_free=1.9G -l s_vmem=1.8G -l h_vmem=1.9G script.sh

By the way, if you use exactly this statement, your jobs will never be processed, since no host is (or ever will be) a member of both the noX and the bladeX cluster!
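
To benefit from the soft limit s_vmem, the job script has to catch SIGXCPU itself. The sketch below installs a handler that records the event and writes a last log line. The handler name, the flag variable and the log file name are hypothetical. Whether a real program sees the signal in time depends on how it is started from the script, so treat this as an illustration of the mechanism rather than a complete solution.

```shell
#!/bin/bash

# Hypothetical log file; in a real job this would be a path you choose.
logfile=${TMPDIR:-.}/last_words.log
soft_limit_hit=no

# Called when the shell receives SIGXCPU (soft limit exceeded).
# A real handler would typically clean up and then exit.
on_soft_limit() {
    soft_limit_hit=yes
    echo "soft memory limit reached, shutting down" >> "$logfile"
}
trap on_soft_limit XCPU

# ... your_program would run here ...

# Demo only: send the signal to this shell to show that the handler fires.
kill -s XCPU $$
```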

Troubleshooting

In some situations a program runs perfectly on your machine but fails when submitted to the grid. In this case it is often helpful to know on which machine the (failing) job was actually executed. For this purpose you can add the following line to your calling script (script.sh in the above examples):

echo "running this job on $HOSTNAME"

The effect of this line is that on standard output a line similar to

running this job on eowyn.ac.tuwien.ac.at

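will appear, which indicates the machine used to execute your job. Note that with -o /dev/null, as used in the examples above, this output is discarded, so point -o to a file while debugging. This information might be requested by your advisor (or our technician) when trying to find the error(s).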
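
A few more lines of the same kind can make such reports easier to interpret. The sketch below prints the host together with the job and task ids. $JOB_ID and $SGE_TASK_ID are set by the grid engine inside a job. The function name print_job_info and the "n/a" fallbacks are assumptions added so that the function also runs on your own machine.

```shell
#!/bin/bash

# Print where and as what this job is running.
print_job_info() {
    echo "running this job on ${HOSTNAME:-$(uname -n)}"
    echo "job id: ${JOB_ID:-n/a}, task id: ${SGE_TASK_ID:-n/a}"
    echo "temporary directory: ${TMPDIR:-n/a}"
    echo "started at: $(date)"
}

print_job_info
```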

Possible Mistakes

Make sure that the option "-r y" is provided for your submitted jobs (this can be changed using qalter - even for running jobs). If this option is not set then your jobs will not be (automatically) rescheduled (restarted) if the execution host the job was running on crashes.

Problems and Questions

If you have any questions, please contact your supervisor, who will try to assist you as far as possible. For technical issues you can also contact Andreas Müller.