Running jobs on the HPC 

Introduction to Job Submission



The HPC frontend machine acts as the interface between you and the HPC proper (the cluster of execution nodes). Jobs are not run directly on the frontend; instead, they must be submitted to the SGE (Sun Grid Engine) job scheduling system. Batch jobs are submitted with the qsub command, and interactive jobs with qlogin or qrsh. Computationally intensive or large-memory jobs must not be run directly on the frontend machine. If you need to test such jobs before submitting them, please use an interactive session.


Batch jobs

A batch job is one which does not require any input from the keyboard and does not send any output to the user's screen. Typically a file of input instructions will be read by the program being run and it will create one or more output files in the user's filestore.

Batch jobs are run on the HPC by creating a batch job control script (or command file) and "submitting" it to the system using the command qsub, e.g.:

qsub my_program.com

Assuming that there is at least one processor slot free, the system will select an execution node on which to run your job. This ensures that the combined load of all users' jobs is spread evenly over the entire HPC array. If no suitable slot is available at the time, the job will wait in a "pending" queue until one becomes free. To see how busy the HPC is, use the qslots command, which reports the number of available job slots along with the number of currently waiting jobs.

At present, the system uses a Fair Share scheduling strategy: users may submit any number of jobs, but priority is given to those who are currently running fewer jobs. Please check the message of the day for changes to scheduling.

The majority of execution nodes on the HPC have 8 gigabytes of memory and can run 4 jobs simultaneously. If your job requires more than 1/2 gigabyte of memory, you must submit it with a memory resource request, so that the scheduler can assign jobs to execution nodes without the risk of memory oversubscription. See the advanced jobs section for details.


Example of a batch job control script

#!/bin/bash

#$ -o $HOME/my_program_directory/my_program.stdout
#$ -e $HOME/my_program_directory/my_program.stderr
#$ -S /bin/bash

. /etc/profile
module add dot
cd my_program_directory
time my_program < my_program.input

Explanation

Batch job scripts are simply standard shell scripts with extra lines (beginning with "#$") containing instructions for the scheduler.
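
Because every directive line starts with "#", the shell itself treats it as a comment, and only the scheduler acts on it. As a minimal sketch, the following writes a cut-down job script to a scratch directory and uses grep to list just the directive lines. The scratch directory and the variable names are illustrative assumptions, not part of the HPC setup.

```shell
# Scratch location for the demonstration; illustrative only.
scratch=$(mktemp -d)

# A cut-down job script: three SGE directives and one ordinary command.
cat > "$scratch/my_program.com" <<'EOF'
#!/bin/bash
#$ -o $HOME/my_program_directory/my_program.stdout
#$ -e $HOME/my_program_directory/my_program.stderr
#$ -S /bin/bash
echo "hello from the job"
EOF

# Print only the lines beginning with "#$", then count them.
grep '^#\$' "$scratch/my_program.com"
directive_count=$(grep -c '^#\$' "$scratch/my_program.com")
echo "directive lines: $directive_count"
```

The "#!/bin/bash" line is not matched, since it begins with "#!" rather than "#$".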

The first line:

#!/bin/bash

specifies the shell which is to interpret this script. Leave this line exactly as shown unless you need a different shell to interpret your job.

The next two lines:

#$ -o $HOME/my_program_directory/my_program.stdout
#$ -e $HOME/my_program_directory/my_program.stderr

are SGE directives used to specify the destination for your job's standard output and standard error respectively. (You don't have to specify these files. If you don't, default standard output and standard error files will be created in your HPC home directory, with names based upon the job id and job name.)
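
As a rough guide, a stock SGE installation names the default files after the pattern "job name, then .o or .e, then job id". This pattern is an assumption about standard SGE behaviour rather than something stated on this page, so check it against the files your own jobs produce. The sketch below builds the two names for the example job; the job ID 5702 is only a sample value.

```shell
job_name="my_program.com"   # by default, the name of the submitted script
job_id=5702                 # sample value; the real ID is reported by qsub

# Assumed default naming on a stock SGE installation.
stdout_name="${job_name}.o${job_id}"
stderr_name="${job_name}.e${job_id}"

echo "standard output: $stdout_name"
echo "standard error:  $stderr_name"
```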

This line:

#$ -S /bin/bash

instructs SGE to execute the commands in this script using the bash shell. This is the recommended shell for all batch jobs.

The next line:

. /etc/profile

sets up your environment to ensure that programs can find the relevant libraries, and that the modules system is available. This line should always be included in your job scripts.

The next line:

cd my_program_directory

specifies the current working directory for your job. Note that when your batch job starts, its current working directory will be your HPC home directory and not the current directory of the interactive session from which you submit the batch job.
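
If the directory is missing or misspelled, a plain cd fails but the script carries on, and the program then runs in your home directory instead. One possible safeguard, sketched here as a suggestion rather than a site requirement, is to stop the job as soon as cd fails. A temporary directory stands in for my_program_directory so that the sketch runs anywhere.

```shell
# Stand-in for $HOME/my_program_directory; illustrative only.
work_dir=$(mktemp -d)

# Stop at once if the directory cannot be entered, rather than
# running the program in the wrong place.
cd "$work_dir" || { echo "cannot cd to $work_dir" >&2; exit 1; }

current_dir=$(pwd -P)
echo "running in $current_dir"
```

In a real job script the same pattern would read: cd "$HOME/my_program_directory" || exit 1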

The last line:

time my_program < my_program.input

is the command to run your program. This is normally the same as the command you would type if you were running it interactively. In this example the command to run the program (my_program < my_program.input) is prefixed by the system command time. This causes a timing summary to be printed to the standard error file when the job finishes. The time command is not necessary for job scripts; it simply provides a useful summary of the length of time your program took to run.

Note that any standard input to the program (what you would type at the keyboard if you were running it interactively) must be put into a file, my_program.input in this case. The redirection operator, <, then makes the program read this file for its input.
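
The redirection can be tried out with any program that reads standard input. In this small sketch the standard tr utility stands in for my_program, and the input file is created in a scratch directory; both choices are assumptions made only so that the example is self-contained.

```shell
# Scratch directory and input file; illustrative only.
scratch=$(mktemp -d)
printf 'hello from the input file\n' > "$scratch/my_program.input"

# tr stands in for my_program: it reads the file on standard input,
# exactly as if the text had been typed at the keyboard.
result=$(tr 'a-z' 'A-Z' < "$scratch/my_program.input")
echo "$result"
```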


Job Submission

A batch job script is submitted for running by the qsub command. The script above could be run by typing:

qsub my_program.com

A response like the following should be displayed on the screen:

Your job 5702 ("my_program.com") has been submitted

The number given is the job number - a unique ID to allow you to identify your job among the hundreds running on the cluster. The progress of your job(s) can be monitored with the qstat command. For more details see the Job Monitoring page.

If for any reason you wish to cancel a job, perhaps because it is giving the wrong output or because you submitted it by mistake, you can do so with the command qdel. It takes as its argument the job ID reported when you submitted the job (which is also displayed by qstat). So to kill the job submitted in the above example, with job ID 5702, you would enter:

qdel 5702
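
When jobs are submitted from a script, it can be handy to capture the job ID rather than copy it by hand. The following is a minimal sketch that assumes the submission response has the format shown above, with the ID as the third word. It parses a sample string instead of calling qsub, so it can be tried on any machine; the variable names are illustrative.

```shell
# Sample response in the format shown above. On the HPC this line
# would instead be: response=$(qsub my_program.com)
response='Your job 5702 ("my_program.com") has been submitted'

# The job ID is the third whitespace-separated word of the response.
job_id=$(printf '%s\n' "$response" | awk '{print $3}')
echo "job id: $job_id"

# The captured ID could later be passed to qdel, for example:
#   qdel "$job_id"
```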


Interactive jobs

User programs and packages must not be run interactively on the front-end machine (the one you log into). You can, however, request an interactive session on one of the other machines in the HPC by using the command qrsh or qlogin.

If the qlogin request can be satisfied you will receive a response like this:

Your job 894 ("QLOGIN") has been submitted
waiting for interactive job to be scheduled ...
Your interactive job 894 has been successfully scheduled.
Establishing /usr/local/packages/sge/qlogin_wrapper session to host
sun108.hpc.lancs.ac.uk ...

The session will use X11 forwarding to provide a consistent forwarding chain to your desktop.

Don't forget to log out from your interactive session when you have finished your tasks. Your job slot is not available to anyone else until you do so.

If all job slots are currently full, the interactive request will report that no slots are available and will terminate. If you would rather wait for a session than have the request terminate, add the arguments -now n (for example, qlogin -now n). Your request will then be queued like a batch job and will wait until a slot becomes available.



©Lancaster University   Computer User Agreement   Privacy Statement  

