Monitoring jobs on the HPC 


Monitoring job status with qstat

The simplest way to monitor your HPC jobs is with the qstat command. Used on its own, this will output a list of all jobs currently running or waiting to run on the cluster:

job-ID  prior   name       user         state submit/start at     queue                         slots ja-task-ID 
-----------------------------------------------------------------------------------------------------------------

    520 0.55500 mol.43.com grace        r     08/10/2006 17:39:56 serial.q@sun001.hpc.lancs.ac.u     1        
    519 0.55500 mol.42.com grace        r     08/10/2006 17:39:56 serial.q@sun011.hpc.lancs.ac.u     1        
    506 0.55500 mol.29.com grace        r     08/10/2006 17:39:55 serial.q@sun013.hpc.lancs.ac.u     1        
    505 0.55500 mol.28.com grace        r     08/10/2006 17:39:55 serial.q@sun023.hpc.lancs.ac.u     1        
    410 0.55500 a20.com    sirichan     r     08/10/2006 10:36:25 serial.q@sun025.hpc.lancs.ac.u     1        
    411 0.55500 a21.com    sirichan     r     08/10/2006 10:38:40 serial.q@sun058.hpc.lancs.ac.u     1        
    483 0.55500 mol.6.com  grace        r     08/10/2006 17:39:40 serial.q@sun072.hpc.lancs.ac.u     1        
    409 0.55500 a22.com    sirichan     r     08/10/2006 10:34:41 serial.q@sun089.hpc.lancs.ac.u     1        
    406 0.55500 a17.com    sirichan     r     08/10/2006 10:25:55 serial.q@sun092.hpc.lancs.ac.u     1        
    499 0.55500 mol.22.com grace        r     08/10/2006 17:39:55 serial.q@sun094.hpc.lancs.ac.u     1        
    408 0.55500 a19.com    sirichan     r     08/10/2006 10:31:25 serial.q@sun102.hpc.lancs.ac.u     1        
    503 0.55500 mol.26.com grace        r     08/10/2006 17:39:55 serial.q@sun112.hpc.lancs.ac.u     1        

The output columns give the following information:

job-ID: A number that uniquely identifies your job within the SGE system. Use this number when you want to halt a job with the qdel command.
prior: The job's fixed priority within the cluster. Job priorities are calculated automatically so that users with the fewest running jobs receive a higher priority.
name: The name of the job submission script.
user: The username under which the job was submitted.
state: The job's current state. See the job lifecycle section below for details.
submit/start at: The time at which the job was submitted (for waiting jobs), or when it was launched (for running jobs).
queue: The queue to which a running job is assigned. The queue name has two components, separated by an @ symbol. The first is the cluster queue (either serial or parallel); the second is the name of the execution host the job is running on.
slots: The number of slots the job occupies. For serial jobs, this value will be one.
ja-task-ID: For task arrays, this specifies the job's task ID.
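
As a small illustration of the two-part queue name, the sketch below splits one on the @ symbol using standard shell parameter expansion. The variable names are hypothetical, and the example value is the full queue name that appears in the notification email further down this page.

```shell
# Split a queue name of the form <cluster queue>@<execution host>.
# The value below is an example; substitute one from your own qstat output.
queue_name="serial.q@sun018.hpc.lancs.ac.uk"

cluster_queue="${queue_name%%@*}"   # everything before the @
exec_host="${queue_name#*@}"        # everything after the @

echo "cluster queue:  $cluster_queue"
echo "execution host: $exec_host"
```

Note that qstat truncates long queue names to fit its column width, so the host part taken from its default output may be incomplete.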

The default action for qstat is to output basic information on all jobs. If you wish to view only your own, you can specify an individual username with:

qstat -u username

Job lifecycle

A job's lifecycle can be tracked via the state field in the qstat output. All jobs start with a state of "qw" (queued and waiting). If the cluster is busy, or the job has requested a resource which is currently fully utilised, then a job may spend some time in this state. Once an appropriate job slot is available, the job's state changes briefly to "t" (transfer), followed by "r" (running). When a job no longer appears in the qstat output, it has finished or has been deleted.

Note: A job state of "Eqw" is an error state. See the troubleshooting page for more details.
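
When you have many jobs, a tally by state can be quicker to read than the full listing. The following is a minimal sketch, not a site-provided tool: count_job_states is a hypothetical helper name, and it assumes the default qstat column layout shown earlier, in which the state is the fifth whitespace-separated field.

```shell
# Tally jobs by state from qstat-style output read on standard input.
# Assumption: default column layout, so the state is field 5.
# Header and separator lines are skipped because their first field
# is not a numeric job-ID.
count_job_states() {
    awk '$1 ~ /^[0-9]+$/ { n[$5]++ }
         END { for (s in n) print s, n[s] }' | sort
}

# Typical use on the cluster (assumes qstat is on your PATH):
#   qstat -u "$USER" | count_job_states
```

Each output line gives a state code followed by the number of jobs in that state; an "Eqw" entry in the tally is a prompt to consult the troubleshooting page.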


Monitoring memory and CPU usage with qtop

The qstat command gives basic information on the status of a job. Sometimes, though, it's useful to take a more detailed look at how a job is running; for example, to see how much memory a program is using, or to check that it hasn't stalled. On a single-platform system, the top command provides a more in-depth view of program status. On the HPC, you can use the qtop command to collect and display the output from top for all your currently submitted jobs.

Consider the following output, from qstat -u sirichan:

job-ID  prior   name       user         state submit/start at     queue                          slots ja-task-ID 
-----------------------------------------------------------------------------------------------------------------

    410 0.55500 a20.com    sirichan     r     08/10/2006 10:36:25 serial.q@sun025.hpc.lancs.ac.u     1        
    411 0.55500 a21.com    sirichan     r     08/10/2006 10:38:40 serial.q@sun058.hpc.lancs.ac.u     1        
    409 0.55500 a22.com    sirichan     r     08/10/2006 10:34:41 serial.q@sun089.hpc.lancs.ac.u     1        

We can generate relevant qtop information for these jobs by running qtop -u sirichan:

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
sun025:
27942 sirichan  25   0 1586m 1.5g 3548 R 99.5 19.8   1422:34 siesta.2.0 
27911 sirichan  16   0  7876 1312 1020 S  0.0  0.0   0:00.00 410        
sun058:
27861 sirichan  25   0 1652m 1.6g 3548 R 99.3 20.3   1420:30 siesta.2.0 
27830 sirichan  16   0  7876 1312 1020 S  0.0  0.0   0:00.00 411        
sun089:
28103 sirichan  25   0 1586m 1.5g 3548 R 99.9 19.8   1424:34 siesta.2.0 
28072 sirichan  15   0  7876 1312 1020 S  0.0  0.0   0:00.00 409

The output fields are identical to those of the standard Linux top command executed in batch mode - see the top man page for an in-depth description of each field. This description covers only the most relevant ones.

The first thing to note is that the information provided by qtop is very different from that of qstat. qtop is not integrated into the SGE system, so it will output process information, not job information - a single job will involve executing a number of processes. You'll need to compare qtop and qstat output to work out just what's going on. For example, qtop doesn't always give you the job-ID number, and it often lists two or more processes where qstat lists just one job.

Because jobs are submitted to the cluster in the form of job scripts, the job script itself becomes a process, which is often named after the job ID in qtop's COMMAND field. For most purposes, you'll be interested in the other process listed - the main process that the job script is currently running.

The four most relevant fields are labelled "COMMAND", "VIRT", "RES" and "%CPU". COMMAND gives the name of the process running. This will be the name of the program called from within your job script - not the name of the job script itself (which is given by the qstat output).

The VIRT and RES fields give the total and resident memory size of each process. Smaller process sizes are listed in (k)ilobytes, larger ones in (m)egabytes, or even (g)igabytes. In the above example, all the main processes - named siesta.2.0 - have a total size of roughly 1.5 to 1.6 gigabytes. (As jobs which consume more than 0.5 gigabytes are classified as large memory jobs, the user has submitted these jobs to qsub with a valid memory resource requirement, in order to ensure that the job scheduler does not inadvertently overload any particular execution node with too many large memory jobs.)
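
Because the memory columns mix units, comparing them against the 0.5 gigabyte large-memory threshold takes a little arithmetic. The sketch below normalises a size to megabytes; to_mb is a hypothetical helper name, and it assumes that a value with no suffix is in kilobytes, as in the listing above.

```shell
# Convert a top-style memory size (e.g. 7876, 1586m, 1.5g) to megabytes.
# Assumption: a bare number, or one ending in k, is in kilobytes.
to_mb() {
    awk -v v="$1" 'BEGIN {
        n = v + 0                          # numeric prefix; the suffix is ignored
        if (v ~ /[gG]$/)       n *= 1024   # gigabytes to megabytes
        else if (v !~ /[mM]$/) n /= 1024   # kilobytes to megabytes
        printf "%.1f\n", n
    }'
}

# Example: sizes taken from the VIRT and RES columns above
to_mb 1586m
to_mb 1.5g
to_mb 7876
```

A result above 512 corresponds to more than 0.5 gigabytes, the point at which a job counts as a large memory job.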

The other useful field in the qtop output is "%CPU", which describes how much of a single CPU the process is consuming. Typically a running job should be consuming very close to 100% of a CPU's resources. Values considerably lower than this usually indicate a problem: the process might be spending a disproportionate amount of time performing file reads or writes or, in the case of badly balanced parallel programs, it might be idle while waiting for a communication from another process.
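
With many jobs running, picking out the under-performing processes by eye becomes tedious. The following is a rough sketch of a filter for qtop output, not part of qtop itself: low_cpu is a hypothetical helper name, the default threshold of 90 is an arbitrary choice, and the field positions assume the column layout shown above. It also assumes the job script process is named after its job-ID, which is often but not always the case.

```shell
# List processes whose %CPU is below a threshold (default 90).
# Reads qtop-style output on standard input and prints:
#   host PID COMMAND %CPU
# Assumptions: %CPU is field 9 and COMMAND is field 12, as in the
# listing above; all-numeric COMMAND names are job scripts and are skipped.
low_cpu() {
    awk -v min="${1:-90}" '
        NF == 1 && $1 ~ /:$/ {             # host header line, e.g. "sun025:"
            host = substr($1, 1, length($1) - 1)
            next
        }
        $1 ~ /^[0-9]+$/ && $12 !~ /^[0-9]+$/ && $9 + 0 < min {
            print host, $1, $12, $9
        }'
}

# Typical use on the cluster (assumes qtop is on your PATH):
#   qtop -u "$USER" | low_cpu 90
```

No output means every main process is at or above the threshold. A process that appears here is worth a closer look, but a brief dip during file input or output is not necessarily a fault.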


Email notification of job completion

When a batch job completes, it will no longer be listed in the qstat output. To avoid the chore of repeatedly accessing the HPC to check if your jobs have finished, you can instruct qsub to notify you by email. To do this, simply add the -m e flag to qsub:

qsub -m e myjob.com

When your job ends (or is killed with the qdel command), you will be sent an email like this:

Job 54238 (myjob.com) Complete
 User             = pacey
 Queue            = serial.q@sun018.hpc.lancs.ac.uk
 Host             = sun018.hpc.lancs.ac.uk
 Start Time       = 11/29/2006 10:23:07
 End Time         = 11/29/2006 13:23:17
 User Time        = 03:00:04
 System Time      = 00:00:00
 Wallclock Time   = 03:00:10
 CPU              = 03:00:04
 Max vmem         = 213.691M
 Exit Status      = 0

By default, this email will be sent to your normal IUS account, and will be subject to whatever forwarding criteria you have set up there. If you wish the notification to be sent elsewhere, you need to also add the -M switch to the qsub command:

qsub -m e -M username@hostname myjob.com

These two options can be used in conjunction with other qsub switches, as described in the Running Advanced Jobs and the SCore parallel environment pages.

If you use these switches with job arrays you will receive a separate notification for every job in the array. If you use them with the SCore parallel environment you will be notified only of the termination of the master job.



©Lancaster University   Computer User Agreement   Privacy Statement  


Lancaster University
Bailrigg
Lancaster LA1 4YW United Kingdom
+44 (0) 1524 65201