The HEC login node is the interface between you and the HEC proper (the cluster of compute nodes). Computationally intensive and/or large-memory applications must not be run directly on the login node; instead, submit them as jobs to the LSF (Load Sharing Facility) job scheduling system. The two basic types of job are batch and interactive, both submitted with the bsub command. If you need to test such jobs before submitting them, please use an interactive session.
Batch jobs
A batch job is one which does not require any input from the keyboard and does not send any output to the user's screen. Typically a batch job will send its output to one or more files in the user's directory.
Batch jobs are run on the HEC by creating a batch job control script (or command file) and "submitting" it to the system using the command bsub, e.g.:
bsub < my_program.bsub
Assuming that there is at least one processor-slot free, the system will select a compute node on which to run your job. This ensures that the combined load of all users' jobs is spread evenly over the entire cluster. If no suitable slot is available at the time then the job will wait in a "pending" queue until one becomes free. To see how busy the HEC is, use the qslots command, which reports the number of available job slots.
At present, the system uses a Fair Share scheduling strategy: users may submit any number of jobs, but jobs over a certain number will be held waiting, and priority will be given to users who are currently running fewer jobs. Please check the message of the day for changes to scheduling.
The majority of compute nodes on the HEC have 24 gigabytes of memory, and can run 8 jobs simultaneously. However, if your job requires more than 1/2 gigabyte of memory, you are required to submit your jobs with a memory resource request to allow the scheduler to assign jobs to compute nodes without the risk of memory oversubscription - see the advanced jobs section.
Example of a batch job control script
#BSUB -L /bin/bash
#BSUB -J myjobname
#BSUB -oo myjob.stdout
#BSUB -eo myjob.stderr
. /etc/profile
echo Job running on compute node `uname -n`
Explanation
Batch job scripts are simply standard shell scripts with extra lines (beginning with "#BSUB") containing instructions for the scheduler. The first line:
#BSUB -L /bin/bash
Instructs the scheduler to run the job in a fresh (bash) shell. This is strongly recommended as best practice - without this line the scheduler will inherit the current environment from your login shell, which may not contain the correct settings to run your job.
The next line:
#BSUB -J myjobname
Sets a name for your job, so that you can identify it while it's running.
The next two lines:
#BSUB -oo myjob.stdout
#BSUB -eo myjob.stderr
Allow you to specify the files to which standard output and standard error are directed. Most applications send output to standard output, and use standard error only to report errors or warnings. Unless specified as a full path, the output files will be placed in the directory from which you call bsub. Output files will overwrite any existing files of the same name.
Note: The job output will be placed into these files once the job has completed. To see the partial output, use the bpeek command along with the job's ID.
The final job setup line reads:
. /etc/profile
This will set up the shell environment of the job so that it matches the functionality you see on the login node.
Once the batch job environment has been specified, subsequent lines should contain the commands needed to run your job. The job will effectively run as a shell script, and will process any of the usual commands permitted by the specified shell. The example command line above:
echo Job running on compute node `uname -n`
simply prints a short message to say which compute node the job was run on. See the Software section of these web pages for templates of job scripts for popular packages.
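The setup lines described above can be checked locally before a script is submitted. The following is a minimal sketch only: check_job_script is a hypothetical helper written for this page, not part of LSF or the HEC tools, and it assumes your script uses the same setup lines as the example.

```shell
# Hypothetical helper (not an LSF or HEC command): reports which of the
# setup lines used in this page's example are missing from a job script.
check_job_script() {
    local script="$1"
    local missing=0
    local pattern
    # One pattern per setup line described in the Explanation section.
    for pattern in '^#BSUB -L ' '^#BSUB -J ' '^#BSUB -oo ' '^#BSUB -eo ' '^\. /etc/profile'; do
        if ! grep -q -- "$pattern" "$script"; then
            echo "missing line matching: $pattern"
            missing=1
        fi
    done
    # Returns 0 if every line was found, 1 otherwise.
    return "$missing"
}
```

For example, check_job_script my_program.bsub prints nothing and returns success when all five lines are present.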
Job Submission
A batch job script is submitted for running by the bsub command. The script above could be run by typing:
bsub < my_program.bsub
NOTE: The use of the input redirect ("<") is vital. Without it, the scheduler will accept the job without processing any of the #BSUB commands contained in the file. Once the job is submitted, you should see a response like this displayed on the screen:
Job <4873> is submitted to default queue <normal>.
The number given is the job number - a unique ID to allow you to identify your job among the hundreds running on the cluster. The progress of your job(s) can be monitored with the bjobs command. For more details see the Job Monitoring page.
If for any reason you wish to cancel a job, perhaps because it is giving the wrong output or because you submitted it by mistake, you can do so with the command bkill. It takes as its argument the job ID reported when you first submit the job (which is also displayed by bjobs). So to kill the job submitted in the above example, with job ID 4873, you would enter:
bkill 4873
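If you submit jobs from a script of your own, the job ID can be pulled out of the response line rather than copied by hand. This is a sketch only: extract_job_id is a hypothetical helper, and it assumes the response has exactly the form shown above.

```shell
# Hypothetical helper: reads bsub's response on standard input and prints
# the numeric job ID, e.g. from
#   Job <4873> is submitted to default queue <normal>.
extract_job_id() {
    sed -n 's/^Job <\([0-9][0-9]*\)> is submitted.*/\1/p'
}

# Example using the response line shown above:
echo 'Job <4873> is submitted to default queue <normal>.' | extract_job_id

# On the login node this might be combined with bsub and bkill:
#   jobid=$(bsub < my_program.bsub | extract_job_id)
#   bkill "$jobid"
```

The helper prints nothing if the input does not look like a submission response, so a script can check for an empty result before calling bkill.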
Interactive jobs
While batch jobs are the most efficient type of job to submit, some applications require regular user input, making them unsuitable for batch job submission. In such cases, jobs can be submitted interactively, giving you a command line shell on a compute node with sufficient free resources to run your application. You can submit an interactive job with one of the following commands:
bsub -Is tcsh -l
for a tcsh shell (the default user login shell) or
bsub -Is bash -l
for a bash login shell.
If the interactive job request can be satisfied you will receive a response like this:
elysium> bsub -Is tcsh -l
Job <6676> is submitted to default queue <interactive>.
<<Waiting for dispatch ...>>
<<Starting on comp000>>
comp000>
To invoke X11 forwarding, add the argument -XF to bsub.
Don't forget to log out of your interactive session when you have finished your tasks - your job slot is not available to anyone else until you do so.
The test queue
The test queue exists to allow quick-turnaround testing of jobs during normal business hours when the cluster is otherwise busy by dedicating a single compute node for this purpose. It can be frustrating to wait a few hours for a job to launch on a busy cluster only to have it fail immediately on launch due to a typo in the submission script. If you want to do a quick sanity check of a new or altered job submission script, or if you want to try out some small jobs to get the hang of the job submission system, then the test queue is recommended.
To use this queue, simply add -q test to your bsub job submission command to divert the job to the test queue. This queue is usually lightly loaded, and should give very fast turnaround. NB: To ensure fast turnaround, jobs submitted to the test queue are limited to a maximum of 5 minutes run time. Jobs running for more than 5 minutes in this queue will be automatically terminated.
The night queue
Outside of normal business hours, the compute node dedicated to the test queue also offers a further queue. The night queue has been set up to offer reasonable turnaround for short-duration pilot or test jobs. Jobs of up to 30 minutes run time can be submitted to this queue. To prevent undue delays to users of the test queue, the night queue only activates between 18:00 and 08:00. Jobs submitted to the night queue outside of these hours will wait until the activation window.
To submit jobs to the night queue, simply add -q night to your bsub job submission command.
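The choice between the two queues for a short trial run can be scripted from the time of day. The sketch below is illustrative only: pick_short_queue is a hypothetical helper, it assumes the 18:00 to 08:00 window is measured on the login node's clock, and it treats 08:00 itself as outside the window. It prints the bsub command rather than running it.

```shell
# Hypothetical helper: prints "night" if the given hour (0-23) falls in the
# 18:00-08:00 window described above, otherwise "test".
pick_short_queue() {
    local hour="$1"
    # Strip one leading zero so that "08" and "09" compare as plain numbers.
    hour=${hour#0}
    hour=${hour:-0}
    if [ "$hour" -ge 18 ] || [ "$hour" -lt 8 ]; then
        echo night
    else
        echo test
    fi
}

# Show the submission command for the current hour without running it.
queue=$(pick_short_queue "$(date +%H)")
echo "bsub -q $queue < my_program.bsub"
```

Remember that the two queues have different run time limits (5 minutes for test, 30 minutes for night), so a job that fits the night queue may still be terminated in the test queue.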
©Lancaster University Computer User Agreement Privacy Statement