Monitoring jobs on the HPC
- Monitoring job status with qstat
- Monitoring memory and CPU usage with qtop
- Email notification of job completion
Monitoring job status with qstat
The simplest way to monitor your HPC jobs is with the qstat command. Used on its own, this will output a list of all jobs currently running or waiting to run on the cluster:
job-ID prior name user state submit/start at queue slots ja-task-ID
-----------------------------------------------------------------------------------------------------------------
520 0.55500 mol.43.com grace r 08/10/2006 17:39:56 serial.q@sun001.hpc.lancs.ac.u 1
519 0.55500 mol.42.com grace r 08/10/2006 17:39:56 serial.q@sun011.hpc.lancs.ac.u 1
506 0.55500 mol.29.com grace r 08/10/2006 17:39:55 serial.q@sun013.hpc.lancs.ac.u 1
505 0.55500 mol.28.com grace r 08/10/2006 17:39:55 serial.q@sun023.hpc.lancs.ac.u 1
410 0.55500 a20.com sirichan r 08/10/2006 10:36:25 serial.q@sun025.hpc.lancs.ac.u 1
411 0.55500 a21.com sirichan r 08/10/2006 10:38:40 serial.q@sun058.hpc.lancs.ac.u 1
483 0.55500 mol.6.com grace r 08/10/2006 17:39:40 serial.q@sun072.hpc.lancs.ac.u 1
409 0.55500 a22.com sirichan r 08/10/2006 10:34:41 serial.q@sun089.hpc.lancs.ac.u 1
406 0.55500 a17.com sirichan r 08/10/2006 10:25:55 serial.q@sun092.hpc.lancs.ac.u 1
499 0.55500 mol.22.com grace r 08/10/2006 17:39:55 serial.q@sun094.hpc.lancs.ac.u 1
408 0.55500 a19.com sirichan r 08/10/2006 10:31:25 serial.q@sun102.hpc.lancs.ac.u 1
503 0.55500 mol.26.com grace r 08/10/2006 17:39:55 serial.q@sun112.hpc.lancs.ac.u 1
The output columns give the following information:
| job-ID | A number used to uniquely identify your job within the SGE system. Use this number when you want to halt a job via the qdel command. |
| prior | The job's fixed priority within the cluster. Job priorities are automatically calculated to give users with the fewest running jobs a higher priority. |
| name | The name of the job submission script |
| user | The username under which the job was submitted |
| state | The job's current state. See the job lifecycle section below for details |
| submit/start at | The time at which the job was submitted (for waiting jobs), or when it was launched (for running jobs) |
| queue | The queue to which a running job is assigned. The queue name is composed of two components, separated by an @ symbol. The first is the cluster queue (either serial or parallel); the second is the name of the execution host the job is running on. |
| slots | The number of slots the job occupies. For serial jobs, this value will be one. |
| ja-task-ID | For task arrays, this will specify the job's task ID |
The default action for qstat is to output basic information on all jobs. If you wish to view only your own, you can specify an individual username with:
qstat -u username
Job lifecycle
A job's lifecycle can be tracked via the state field in the qstat output. All jobs start with a state of "qw" (queued and waiting). If the cluster is busy, or the job has requested a resource which is currently fully utilised, then a job may spend some time in this state. Once an appropriate job slot is available, the job's state changes briefly to "t" (transfer), followed by "r" (running). When a job no longer appears in the qstat output, it has finished or has been deleted.
Note: A job state of "Eqw" is an error state. See the troubleshooting page for more details.
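When many jobs are queued, a per-state count can be easier to read than the full listing. The following is a minimal sketch, not part of the HPC tool set: count_job_states is a hypothetical helper name, and it assumes the default qstat column layout shown earlier, with two header lines and the state in the fifth column.

```shell
# Count jobs per state from qstat output read on standard input.
# Assumption: default qstat layout, i.e. two header lines followed by
# one job per line with the state ("qw", "t", "r", "Eqw", ...) in column 5.
count_job_states() {
    awk 'NR > 2 { count[$5]++ }
         END { for (s in count) print s, count[s] }' | sort
}

# Typical use on the cluster (not run here):
#   qstat -u "$USER" | count_job_states
```

Each output line gives a state followed by the number of jobs in that state, so any "Eqw" entry stands out immediately.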
Monitoring memory and CPU usage with qtop
The qstat command gives basic information on the status of a job. Sometimes, though, it's useful to have a more detailed look at how a job is running; for example, to see how much memory a program is using, or to check that it hasn't stalled. On a single machine, the top command provides a more in-depth view of program status. On the HPC, you can use the qtop command to collect and display the output from top for all your currently submitted jobs.
Consider the following output, from qstat -u sirichan:
job-ID prior name user state submit/start at queue slots ja-task-ID
-----------------------------------------------------------------------------------------------------------------
410 0.55500 a20.com sirichan r 08/10/2006 10:36:25 serial.q@sun025.hpc.lancs.ac.u 1
411 0.55500 a21.com sirichan r 08/10/2006 10:38:40 serial.q@sun058.hpc.lancs.ac.u 1
409 0.55500 a22.com sirichan r 08/10/2006 10:34:41 serial.q@sun089.hpc.lancs.ac.u 1
We can generate relevant qtop information for these jobs by running qtop -u sirichan:
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
sun025:
27942 sirichan 25 0 1586m 1.5g 3548 R 99.5 19.8 1422:34 siesta.2.0
27911 sirichan 16 0 7876 1312 1020 S 0.0 0.0 0:00.00 410
sun058:
27861 sirichan 25 0 1652m 1.6g 3548 R 99.3 20.3 1420:30 siesta.2.0
27830 sirichan 16 0 7876 1312 1020 S 0.0 0.0 0:00.00 411
sun089:
28103 sirichan 25 0 1586m 1.5g 3548 R 99.9 19.8 1424:34 siesta.2.0
28072 sirichan 15 0 7876 1312 1020 S 0.0 0.0 0:00.00 409
The output fields are identical to those for the standard Linux top command executed in batch mode - see the top man page for an in-depth description of the meaning of each field. This description covers only the more relevant ones.
The first thing to note is that the information provided by qtop is very different from that of qstat. qtop is not integrated into the SGE system, so it will output process information, not job information - a single job will involve executing a number of processes. You'll need to compare qtop and qstat output to work out just what's going on. For example, qtop doesn't always give you the job-ID number, and it often lists two or more processes where qstat lists just one job.
Because jobs are submitted to the cluster in the form of job scripts, the job script itself becomes a process, which is often named after the job ID in qtop's COMMAND field. For most purposes, you'll be interested in the other process listed - the main process that the job script is currently running.
The four most relevant fields are labelled "COMMAND", "VIRT", "RES" and "%CPU". COMMAND gives the name of the process running. This will be the name of the program called from within your job script - not the name of the job script itself (which is given by the qstat output).
The VIRT and RES fields give the total and resident memory size of each process. Smaller process sizes are listed in (k)ilobytes, larger ones in (m)egabytes, or even (g)igabytes. In the above example, all the main processes, named siesta.2.0, have a total size of roughly 1.5 to 1.6 gigabytes. (As jobs which consume more than 0.5 gigabytes are classified as large memory jobs, the user has submitted these jobs to qsub with a valid memory resource requirement, to ensure that the job scheduler does not inadvertently overload any particular execution node with too many large memory jobs.)
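Because the sizes appear in mixed units, comparing them against the 0.5 gigabyte threshold can be awkward by eye. The sketch below converts a single VIRT or RES value to megabytes. to_megabytes is a hypothetical helper, and it assumes the usual top convention that a value with no suffix is in kilobytes and that units scale by 1024.

```shell
# Convert a top-style size (e.g. "7876", "1586m", "1.5g") to megabytes.
# Assumptions: no suffix means kilobytes, "m" megabytes, "g" gigabytes,
# and each unit is 1024 times the previous one.
to_megabytes() {
    awk -v size="$1" 'BEGIN {
        n = size + 0                          # leading numeric part
        unit = substr(size, length(size))     # last character
        if (unit == "g")      n = n * 1024
        else if (unit == "m") n = n
        else                  n = n / 1024    # plain number: kilobytes
        printf "%.1f\n", n
    }'
}

# Example: to_megabytes 1.5g
```

A result above 512 corresponds to the large memory classification described above.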
The other useful field in the qtop output is "%CPU", which describes how much of a single CPU the process is consuming. Typically a running job should be consuming very close to 100% of a CPU's resources. Values considerably lower than this usually indicate a problem: the process might be spending a disproportionate amount of time performing file reads or writes, or, in the case of badly balanced parallel programs, it might be idle while waiting for a communication from another process.
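With many jobs running, under-performing processes can be picked out automatically. The following is a rough sketch under stated assumptions: low_cpu_processes is a hypothetical helper, qtop is assumed to print one process per line in the twelve-column top layout with each host name on a line of its own, and job script processes are assumed to have a purely numeric COMMAND (the job ID), so they are skipped.

```shell
# List processes from qtop output (on standard input) whose %CPU is below
# a threshold (default 90). Prints: host PID COMMAND %CPU.
# Assumptions: %CPU is column 9, COMMAND is column 12, host names appear
# alone on a line such as "sun025:", and job scripts show a numeric COMMAND.
low_cpu_processes() {
    limit="${1:-90}"
    awk -v limit="$limit" '
        NF == 1 { host = $1; sub(/:$/, "", host); next }   # host line
        $1 == "PID" { next }                               # column header
        NF >= 12 && $12 !~ /^[0-9]+$/ && $9 + 0 < limit + 0 { print host, $1, $12, $9 }
    '
}

# Typical use on the cluster (not run here):
#   qtop -u "$USER" | low_cpu_processes 90
```

An empty result means every main process is at or above the threshold; any line printed is worth checking against the qstat output for the corresponding job.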
Email notification of job completion
When a batch job completes, it will no longer be listed in the qstat output. To avoid the chore of repeatedly accessing the HPC to check if your jobs have finished, you can instruct qsub to notify you by email. To do this, simply add the -m e flag to qsub:
qsub -m e myjob.com
When your job ends (or is killed with the qdel command), you will be sent an email like this:
Job 54238 (myjob.com) Complete
User = pacey
Queue = serial.q@sun018.hpc.lancs.ac.uk
Host = sun018.hpc.lancs.ac.uk
Start Time = 11/29/2006 10:23:07
End Time = 11/29/2006 13:23:17
User Time = 03:00:04
System Time = 00:00:00
Wallclock Time = 03:00:10
CPU = 03:00:04
Max vmem = 213.691M
Exit Status = 0
By default, this email will be sent to your normal IUS account, and will be subject to whatever forwarding criteria you have set up there. If you wish the notification to be sent elsewhere, you need to also add the -M switch to the qsub command:
qsub -m e -M username@hostname myjob.com
These two options can be used in conjunction with other qsub switches, as described in the Running Advanced Jobs and the SCore parallel environment pages.
If you use these switches with job arrays you will receive a separate notification for every job in the array. If you use them with the SCore parallel environment you will be notified only of the termination of the master job.
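If you request notification for most of your jobs, the two switches can be wrapped in a small shell function. This is only a sketch: submit_with_notification and the QSUB variable are hypothetical names introduced here for illustration, and the function does nothing beyond calling qsub with the -m e and -M switches described above.

```shell
# Submit a job script with end-of-job email notification.
# Usage: submit_with_notification SCRIPT [ADDRESS]
# Assumption: QSUB is a hypothetical override for the submit command,
# useful for dry runs; it defaults to qsub.
submit_with_notification() {
    script="$1"
    address="${2:-}"
    if [ -n "$address" ]; then
        # Send the notification to an explicit address
        "${QSUB:-qsub}" -m e -M "$address" "$script"
    else
        # Use the default destination for notifications
        "${QSUB:-qsub}" -m e "$script"
    fi
}

# Example (not run here):
#   submit_with_notification myjob.com username@hostname
```

Setting QSUB=echo before calling the function prints the arguments instead of submitting, which is a convenient way to check them first.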
©Lancaster University Computer User Agreement Privacy Statement