VLF Group compute cluster – Job submission

Job submission

All computation on the cluster should be done through the job submission queue. Do not run any compute-intensive jobs on the cluster machines (cluster0XX) except through the job submission queue. If you have short interactive jobs to run or general data viewing, do it on the head node, nansen. Running compute-intensive jobs outside the job submission engine will impact other users and your process will likely be killed.

The scheduler we are using is called PBS Torque. To submit a job on the command line, first create a shell script; this sample is named myjob.pbs:


  #!/bin/sh

  # If your MPI installation was built with Torque support, mpirun
  # picks up the allocated nodes automatically, so no -np or host
  # list is needed here.
  mpirun /shared/users/username/executable

Generally use the full path to the executable. Change into the directory containing this script and run:


  qsub -j oe -o log.txt -l nodes=2:ppn=8 -l walltime=01:00:00 myjob.pbs

In the above example, we have requested 2 compute nodes and 8 cores per node, for a total of 16 cores. The -j oe option merges the job's standard error into its standard output, and -o log.txt writes that combined stream to log.txt. The walltime limit specifies that the job may run no longer than 1 hour, after which it will be killed. The command returns a job ID:


  1279.nansen
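
Instead of passing every option on the qsub command line, Torque also reads "#PBS" directive lines at the top of the script itself. The sketch below shows the same resource requests as the example above embedded as directives; the echo line stands in for the mpirun call so the script is self-contained, and the PBS_O_WORKDIR handling reflects the fact that Torque starts jobs in your home directory by default:

```shell
#!/bin/sh
# Same requests as the qsub command-line example, as #PBS directives.
#PBS -j oe
#PBS -o log.txt
#PBS -l nodes=2:ppn=8
#PBS -l walltime=01:00:00

# Torque sets PBS_O_WORKDIR to the directory qsub was run from;
# fall back to "." so the script also runs outside the scheduler.
cd "${PBS_O_WORKDIR:-.}"
echo "running on $(hostname)"
```

With the directives in the script, submission shortens to qsub myjob.pbs; options given on the qsub command line override the ones in the script.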

Job stats

To view all jobs in the queue, type in qstat:


  Job id                    Name             User            Time Use S Queue
  ------------------------- ---------------- --------------- -------- - -----
  1279.nansen                myjob.pbs        username        00:00:00 C batch
  1280.nansen                myjob.pbs        username               0 R batch

This shows 2 jobs in the queue: job 1279.nansen, which has already completed (C in the S, or state, column), and job 1280.nansen, which is currently executing (R in the same column). If you wish to know which cluster machines a job is executing on, type in qstat -n 1280.nansen, which will return something like this:


  1280.nansen         username  batch    myjob.pbs         16327     2  16    --  01:00 C 00:00
     cluster002/7+cluster002/6+cluster002/5+cluster002/4+cluster002/3
     +cluster002/2+cluster002/1+cluster002/0+cluster001/7+cluster001/6
     +cluster001/5+cluster001/4+cluster001/3+cluster001/2+cluster001/1
     +cluster001/0

This shows that the job is running on cluster001
and cluster002, 8 processes per node.
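
For large jobs the "+"-joined node list gets unwieldy. As a quick sanity check, standard text tools can tally how many slots each node holds; the nodelist string below is simply the example output above pasted into a variable:

```shell
# Tally slots per node in a qstat -n node list, which is a string of
# node/core entries joined by "+". The value is the example above.
nodelist="cluster002/7+cluster002/6+cluster002/5+cluster002/4+cluster002/3+cluster002/2+cluster002/1+cluster002/0+cluster001/7+cluster001/6+cluster001/5+cluster001/4+cluster001/3+cluster001/2+cluster001/1+cluster001/0"

# Split on "+", keep only the node name, and count occurrences.
echo "$nodelist" | tr '+' '\n' | cut -d/ -f1 | sort | uniq -c
```

This prints a count of 8 for each of cluster001 and cluster002, matching the "8 processes per node" reading above.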

Interactive jobs

You can also request an interactive job, which allocates a number of nodes and CPUs for your exclusive, interactive use.


  qsub -I -X -l nodes=1:ppn=1 

This command gives you an interactive shell on one of the compute nodes, from which you can start programs like Matlab. The -X option forwards your X11 connection (graphical interface). Use this feature only if you intend to run compute-heavy jobs interactively; otherwise you will be tying up cores that could be available to other users.
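
A typical interactive session might look like the following sketch; the ppn and walltime values and the matlab invocation are illustrative choices, not requirements:

```shell
# Request one full node for two hours of interactive use, with X11
# forwarding for graphical programs. Adjust ppn/walltime as needed.
qsub -I -X -l nodes=1:ppn=8 -l walltime=02:00:00

# qsub prints "waiting for job ... to start" and then "ready", at
# which point you have a shell on the compute node and can launch
# interactive programs, e.g.:
#   matlab
# Exiting that shell ends the job and releases the node.
```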

Performance tips

Use the -l nice=-20 option to qsub to run your jobs with the highest possible priority. This can provide a significant speedup for MPI jobs.

Keeping MPI processes local to a node is usually faster than spreading them out: for example, request nodes=1:ppn=8 rather than nodes=8:ppn=1.

Cancelling jobs

To cancel a job, type in qdel followed by the job ID. In the above example: qdel 1280.nansen.

Getting more help

There is much more to Torque than what is listed here. Visit http://www.clusterresources.com/products/torque-resource-manager.php for more information.