Job submission
All computation on the cluster should go through the job submission queue. Do not run compute-intensive jobs on the cluster machines (cluster0XX) except through the queue. For short interactive tasks or general data viewing, use the head node, nansen. Compute-intensive jobs run outside the job submission engine will impact other users, and your process will likely be killed.
The scheduler we are using is called PBS Torque. To submit a job on the command line, first create a shell script. This sample is named myjob.pbs:

#!/bin/sh
mpirun /shared/users/username/executable
In general, use the full path to the executable. Change into the directory containing this script and run:
qsub -j oe -o log.txt -l nodes=2:ppn=8 -l walltime=01:00:00 myjob.pbs
In the above example, we have requested 2 compute nodes and 8 cores per node, for a total of 16 cores. The -j oe option merges the job's standard output and standard error into one stream, and -o log.txt writes that combined log to log.txt. The walltime limit tells the scheduler that the job should run no longer than 1 hour, after which it will be killed. The command returns a job number:
1279.nansen
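These qsub options can also be embedded at the top of the script itself as #PBS directives, so the command line stays short; a minimal sketch, assuming the same placeholder executable path as above:

#!/bin/sh
#PBS -j oe
#PBS -o log.txt
#PBS -l nodes=2:ppn=8
#PBS -l walltime=01:00:00
mpirun /shared/users/username/executable

With the directives in the script, the job can be submitted with just qsub myjob.pbs; any options given on the command line override those in the script.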
Job stats
To view all jobs in the queue, type in qstat:
Job id                    Name             User            Time Use S Queue
------------------------- ---------------- --------------- -------- - -----
1279.nansen               myjob.pbs        username        00:00:00 C batch
1280.nansen               myjob.pbs        username        0        R batch
This shows 2 jobs in the queue: 1279.nansen, which has already completed (the C in the S column), and 1280.nansen, which is currently executing (the R in the same column). If you wish to know which cluster machines the job is executing on, type in qstat -n 1280.nansen, which will return something like this:
1280.nansen      username  batch    myjob.pbs    16327   2  16  --  01:00 C 00:00
   cluster002/7+cluster002/6+cluster002/5+cluster002/4+cluster002/3
   +cluster002/2+cluster002/1+cluster002/0+cluster001/7+cluster001/6
   +cluster001/5+cluster001/4+cluster001/3+cluster001/2+cluster001/1
   +cluster001/0
This shows that the job is running on cluster001 and cluster002, with 8 processes per node.
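If the queue is busy, two other qstat forms are often handy; a sketch, with username standing in for your own login:

qstat -u username
qstat -f 1280.nansen

The first lists only your own jobs; the second prints the full record for a single job, including its resource usage and output file locations.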
Interactive jobs
You can also request an interactive job, which allocates a number of nodes and CPUs for your exclusive, interactive use.
qsub -I -X -l nodes=1:ppn=1
This command gives you an interactive shell on one of the nodes, from which you can start jobs like Matlab. The -X option forwards your X11 connection (graphical interface). Use this feature only if you intend to run compute-heavy jobs interactively; otherwise you will be tying up cycles that could be available to other users.
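As a sketch, a short interactive Matlab session might look like the following; the ppn count and walltime are illustrative, and matlab is assumed to be on your path:

qsub -I -X -l nodes=1:ppn=2 -l walltime=02:00:00
matlab
exit

Exiting the shell ends the interactive job and returns the node to the queue.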
Performance tips
Use the -l nice=-20 option to qsub to run your jobs with the highest possible priority. This can provide a significant speedup for MPI jobs.
Keeping MPI processes local to a node is usually faster than spreading them out, e.g., do not request nodes=8:ppn=1; rather, request nodes=1:ppn=8, as in the combined example below.
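Putting both tips together with the earlier options, a submission might look like this (the script name, log file, and walltime are the ones used above):

qsub -j oe -o log.txt -l nice=-20 -l nodes=1:ppn=8 -l walltime=01:00:00 myjob.pbs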
Cancelling jobs
To cancel a job, type in qdel followed by the job ID returned by qsub. In the above example, the job ID is 1280.nansen.
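For the running job above:

qdel 1280.nansen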
Getting more help
There is much more to Torque than what is listed here. Visit http://www.clusterresources.com/products/torque-resource-manager.php for more information.