New User Warnings

Warning 1) If you have never used a cluster, or are not familiar with this cluster, YOU WILL WANT to read and follow the examples below to become familiar with how to run jobs on HPC. It is common for new users to ignore this manual and simply try to run jobs without understanding what they are doing. Such carelessness can and WILL easily impact the hundreds/thousands of critical jobs and users currently running on the cluster. If your actions compromise the health of the HPC cluster, your account will be LOCKED, so please make sure you run through the examples below before you embark on running jobs.
Warning 2) Do NOT use the login nodes for work. If everyone does this, the login nodes will crash, keeping 700+ HPC users from being able to log in to the cluster.
Warning 3) Do NOT use your home-directory (/data/users/$USER) for any serious work. See http://hpc.oit.uci.edu/data-storage
Warning 4) Never submit a large number of jobs (more than 5) without first running a small test case to make sure everything works as expected. Start small and ramp up once you are familiar with how things work.
Warning 5) We have a SKELETON crew running the UCI HPC cluster, so PLEASE make sure you run the Serial Job example explained below before contacting HPC support.

How to use HPC

Using a High Performance Computing Cluster such as the UCI HPC Cluster requires at a minimum some basic understanding of the Linux Operating System.

It is outside the scope of this manual to explain Linux commands and/or how parallel programs such as MPI work. This manual simply explains how to run jobs on the HPC cluster.

When you login to hpc.oit.uci.edu, you are connected to what is called a login node. The HPC Cluster has several major components:

  • A Head Node

  • Login Nodes ( hpc.oit.uci.edu )

  • Interactive Nodes

  • Compute Nodes

  • I/O Nodes

  • Data Servers

The head node runs all of the cluster's critical services. The traditional cluster "Head Node" is hidden on this cluster. It is hidden from users for two major reasons: to lessen external Internet attacks, and to prevent a user's job gone wrong from accidentally taking the entire cluster down with it :-)

The login node hpc.oit.uci.edu is the node you get when you first log into HPC. The login nodes are meant for light tasks such as submitting jobs, checking on job status, and editing files (emacs, vi).

The interactive nodes are used when you need to compile, test your code, or run one or two interactive sessions.

The compute nodes are the workhorses of the cluster. For computational work, whether Serial or Parallel, in Batch mode or Interactive mode, you will be using the compute nodes.

The I/O nodes (ionodes) are used to transfer data to & from HPC. When you need to transfer a lot of data, you will be using an ionode.

The Data Servers are just that: servers that store data. For a complete list of all Data Servers available on HPC, please see:

Grid Engine Scheduler

The HPC cluster uses Son of Grid Engine (GE for short) to manage all of the resources (the nodes) on the cluster.

GE provides user commands such as qsub, qdel, qstat and qrsh, which are used to submit jobs, delete jobs, check job status, and request interactive nodes on the cluster.

Learning how to use Grid Engine can be a major challenge for those who have never used it before, so we will explain only the basics here, just enough for you to get jobs running on HPC. If you would like to learn more about Grid Engine, there are lots of good examples on the web. A couple of links are:

Here is a list of the most common GE commands you will be using:

Grid Engine Command          What It Does

qstat
    List ALL jobs on the cluster (running, waiting to run, etc.).

qstat -u $USER
    List your jobs ($USER).

qstat -u $USER -s r
    List your jobs ($USER) that are running.

qsub script
    Submit a job.

qdel job_id
    Delete a job.

qalter -q pub64,free64 job_id
    Alter waiting job_id to use both the free64 & pub64 queues. (This only works for jobs waiting to run.)

qrsh
    Request an interactive node with one core (defaults to the interactive queue).

qrsh -q pub64 -pe openmp 64
    Request an interactive compute node on the public queue (pub64) with all 64 cores.

qrsh -q pub64 -pe openmp 2-64
    Request an interactive compute node on the public queue (pub64) with anywhere from 2 to 64 cores.

qhost
    Show all host nodes on the cluster, including load, memory usage, etc.

watch -d "qstat -u $USER"
    Continuously watch your running HPC jobs.
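
For example, a typical sequence of commands might look like the following minimal sketch (myscript.sh and the job ID 1961 are just placeholders for your own script and the ID that qsub reports):

$ qsub myscript.sh              # submit your batch script
$ qstat -u $USER                # check the status of your jobs
$ qalter -q pub64,free64 1961   # optionally let a waiting job use both queues
$ qdel 1961                     # delete the job if it is no longer needed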

Queues

The cluster resources (the nodes) are grouped into Queues. Queues on HPC are either Public or Private. For a full explanation of the queues, please see:

You submit jobs to Queues and the jobs will be either:

  • Batch Jobs (scripts), or

  • Interactive Jobs (you interact with the program / shell)

Batch Jobs

Batch jobs are jobs that contain all of the necessary information and instructions to run inside a script. You create the script with your favorite editor (like emacs) and then submit it to the scheduler to run.

Some jobs can run for days, weeks, or longer so batch is the way to go for such work. Once you submit a job to the scheduler, you can log off and come back at a later time and check on the results.

Batch jobs run in one of two modes.

  • Serial

  • Parallel ( OpenMP, MPI, other )

Batch Job Serial

Serial batch jobs are usually the simplest to use. Serial jobs run on a single core and are also the slowest, since they only use one core per job.

Consider the following serial job script available from the HPC demo account.

  • cat ~demo/serial.sh

#!/bin/bash
#$ -N TEST
#$ -q free64
#$ -m beas

date  > out
Grid Engine Directive        What It Does

#!/bin/bash
    The shell to run the script with (the bash shell).

#$ -N TEST
    Our job name is TEST. If output is produced to standard out, you will see a file named TEST.o<jobid>, and TEST.e<jobid> for errors (if any occurred).

#$ -q free64
    Request the free64 queue.

#$ -m beas
    Send you email on job status: (b)egin, (e)nd, (a)bort, (s)uspend.

The first line #!/bin/bash is the shell to use. Grid Engine (GE) directives start with #$. GE directives are needed in order to tell the scheduler what queue to use, how many cores to use, whether to send email or not, etc.

The last line in our serial.sh script is the program to run. In this example it is the simple date program, writing its output to the file out.

date > out

Now that we have a basic understanding, let's run our first serial batch job on the HPC Cluster. First create a test directory, change to it, copy the demo serial.sh script there, and submit the job.

From your HPC account, do the following:

$ mkdir serial-test
$ cd serial-test
$ cp ~demo/serial.sh .
$ qsub serial.sh
$ qstat -u $USER

After you submit the job (qsub), GE will respond with a job ID:

Your job 1961 ("TEST") has been submitted

and qstat will display something similar to this:

job-ID  prior   name   user     state submit/start  queue       slots

  1961 0.00000  TEST  jfarran   qw    08/16/2012                 1

The state of our job is qw, queue wait (meaning the job is sitting in the queue waiting for a compute node). The core count (slots) shows as 1, the default of one core.

When we run qstat -u $USER again a few seconds later, we see:

job-ID  prior   name   user    state submit/start  queue               slots

  1961 0.50659  TEST  jfarran   r   08/16/2012    free64@compute-7-11   1

The scheduler found compute-7-11 on the free64 queue with 1 core (slot) available and started our job #1961 on it. The job state changed from queue wait qw to running r.

Note Once you submit your job (qsub), things happen rather quickly, so you may need to type qstat repeatedly and quickly to see your job. Or open a new window and run: watch -d "qstat -u $USER"

Once the job completes you will get an email notification and the qstat output will be empty.

Now do an ls and you will see the following files:

out  serial.sh

The serial.sh file is the batch script we submitted, and out contains the output from the date program. To see the output, type:

$ cat out

Congratulations! You just ran your first serial batch job on the HPC cluster.

Note To make life easier, use ~demo/serial.sh as a template for other serial batch jobs and modify it to your needs; that way you will get the syntax correct.
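
As a rough sketch only, a modified copy of serial.sh might look like this, where MYJOB, myprogram and input.dat are placeholders for your own job name, program and data (they are not real files on HPC):

#!/bin/bash
#$ -N MYJOB
#$ -q free64
#$ -m beas

# Replace the date command with your own work; myprogram and input.dat are placeholders
./myprogram input.dat > out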

If interested, there are several template scripts available in the HPC demo account:

$ ls ~demo

Batch Job Parallel

Parallel jobs are much more complex than serial jobs. They are used when you need to speed up a computation and cut down the time it takes to run.

For example, if a program normally takes 64 days to complete using 1-core, you can theoretically cut the time by running the same job in parallel using all 64-cores on a 64-core node and thus cut the wall-clock run time from 64 days down to 1 day, or if using two nodes, cut the wall-clock run time down to 1/2 a day. :-)

Note As of now there is nothing that will take a serial program and magically and transparently convert it to run in parallel, especially over several nodes. This is a common question from new users being introduced to parallel concepts.

Another way to speed up the process is to run lots of single 1-core jobs using job arrays. The idea is that you take a huge problem, cut it into hundreds or thousands of small, manageable tasks, each using one core, and then run them on the cluster. The HPC Cluster has thousands of cores available through the free64 queue, so it can get a lot of computing done.
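
As a rough illustration only (this is not one of the demo scripts), a job-array script might look like the sketch below. Grid Engine's -t directive creates the numbered tasks and sets $SGE_TASK_ID inside each one; the task range and input file naming are hypothetical:

#!/bin/bash
#$ -N ARRAY
#$ -q free64
#$ -t 1-100

# Each of the 100 tasks runs on one core and processes its own input file
./myprogram input.$SGE_TASK_ID > out.$SGE_TASK_ID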

Grid Engine has several Parallel Environments (-pe) you can run with. Two of the major Parallel Environments on HPC are:

  • openmp ( job runs with ONE node max )

  • mpi ( job runs using multiple nodes. Two or more nodes )

OpenMP programs typically use threads and are easier to program than, say, MPI parallel programs.

OpenMP jobs usually run faster than MPI jobs since all communication happens inside the same node (motherboard). The maximum number of cores you can run with OpenMP is limited by the number of physical cores on a node, so on a 64-core node you can run with at most 64 cores and no more. This is the main limitation of OpenMP.

MPI is the most complicated method of running in parallel, but it has the advantage of running over multiple nodes, so you are not limited by the core count on a node. With MPI you can run with 256 cores, 512 cores, or however many cores the cluster allows. MPI uses message passing for its communication over the Infiniband network on HPC.

OpenMP

The following will illustrate how to compile, submit and monitor a parallel OpenMP batch job. The demo account has a simple OpenMP parallel Hello World program which we are going to use for this illustration.

The GE scheduler is very flexible in that you can request a range of queues and cores to allocate. Consider the following GE script available at:

  • $ cat ~demo/hello-openmp.sh

#!/bin/bash
#$ -N TEST
#$ -q free*,pub64
#$ -pe openmp 8-64
#$ -m beas
Grid Engine Directive        What It Does

#!/bin/bash
    The shell to run the script with (the bash shell).

#$ -N TEST
    Our job name is TEST. If output is produced to standard out, you will see a file named TEST.o<jobid>, and TEST.e<jobid> for errors (if any occurred).

#$ -q free*,pub64
    Request cores from all free* queues and from the pub64 queue.

#$ -pe openmp 8-64
    Request the openmp Parallel Environment (-pe) with a minimum of 8 and a maximum of 64 cores.

#$ -m beas
    Send you email on job status: (b)egin, (e)nd, (a)bort, (s)uspend.

Here you can see the power of Grid Engine. The queues being requested are the free* queues (which means any cores from the free32, free48 or free64 queues), and also from the pub64 queue.

The openmp core range 8-64 means to look for any node having 8 to 64 cores available searching for the largest core count first. So if the scheduler finds a node with all 64 cores available, it will pick that node and assign our job to that compute node. If no node is available with 64 cores, it then decreases the search core count from 64 to 63 and repeats the search process all the way down to 8 cores. If no node is found, the job goes into the queue waiting area until a node becomes available.
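
The listing above shows only the Grid Engine directives; the compute portion of ~demo/hello-openmp.sh is not reproduced here. As a rough sketch, the body of such a script might compile and run the program like this (the gcc compiler choice is an assumption; $NSLOTS is set by Grid Engine to the number of cores actually granted):

# Compile the OpenMP hello world and run it on all of the cores Grid Engine granted
gcc -fopenmp hello-openmp.c -o hello
export OMP_NUM_THREADS=$NSLOTS
./hello > out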

OK let’s run our parallel batch job. First create a test directory, change to the test directory, copy the GE hello script to it and then submit the job.

From your HPC account, do the following:

$ mkdir openmp-test
$ cd openmp-test
$ cp ~demo/hello-openmp.sh .
$ qsub hello-openmp.sh

After you submit the job (qsub), GE will respond with a job ID:

Your job 1962 ("TEST") has been submitted

Now run qstat ( qstat -u $USER ) to see the status of the job:

job-ID  prior   name   user     state submit/start at  queue         slots

  1962 0.00000  TEST   jfarran   qw    08/16/2012                      8

The state of our job is qw, queue wait. The core count (slots) shows as 8 because that is the minimum number of cores we are willing to run with. Once the scheduler finds a node, the job core count will be updated.

Running qstat again a few moments later, we see:

job-ID  prior   name   user    state submit/start  queue               slots

  1962 0.50659  TEST   jfarran   r   08/16/2012    free64@compute-2-7   64

The scheduler found compute-2-7 available on the free64 queue with 64 cores and started our job #1962. The job state changed from queue wait qw to running r and updated the number of cores (slots) allocated to 64.

Once the job completes you will get an email notification and the qstat output will be empty.

Now do an ls to list the files in your directory and you will see something similar to this:

hello  hello-openmp.c  hello-openmp.sh  out

The hello program is our OpenMP executable which was compiled by the GE script hello-openmp.sh. The output from the hello program is in the out file. To see the output:

$ cat out

MPI

This is only meant for users who will be using MPI. You can skip this section if you will not be running with MPI.

Running MPI parallel jobs is the most involved of all job types, as they usually run over multiple nodes and the process is complicated. MPI jobs on HPC use the Mellanox Infiniband switches for fast node-to-node communication, which is critical for MPI programs to run as fast as possible.

On HPC you run MPI jobs in one of two modes:

  • Multiple Nodes ( Two or more nodes )

  • Single Node

MPI Using Two Or More Nodes:

Before we start, here are some notes on MPI when using two or more nodes:

Note Grid Engine parallel environment (-pe mpi) is used for running with two or more nodes. For running MPI jobs with one node, please see the next section.
Note When running with mpi (-pe mpi), you have to run with whole-nodes and cannot run with partial nodes.
Note The mpi parallel environment is NOT available on suspendable queues like the free queues.
Note When using the public queues, the more nodes you request, the longer your job may sit in the queue while the scheduler locates all of the cores requested.

There is a hello world MPI program available if you would like to try out an MPI job on HPC.

  • cat ~demo/hello-mpi.sh

#!/bin/bash
#$ -N TEST
#$ -q pub64
#$ -pe mpi 128
#$ -R y
#$ -m beas
Grid Engine Directive        What It Does

#$ -q pub64
    Request cores from the public pub64 queue.

#$ -pe mpi 128
    Request the mpi Parallel Environment (-pe) with 128 cores (two nodes).

#$ -R y
    Job reservation. Needed for MPI jobs.

When requesting cores with mpi, you HAVE to request cores in amounts that equal whole nodes, that is, using all of the cores on each node.

For example, if you run on the pub64 queue, which has 64 cores per node, to run with two nodes you request 128 cores (2 x 64):

#$ -pe mpi 128

To run with three nodes request 192 cores (3 x 64):

#$ -pe mpi 192

To run with four nodes, request 256 cores (4 x 64), and so on:

#$ -pe mpi 256

You may also specify a range as in:

#$ -pe mpi 128-256

The scheduler will then try to allocate as many whole nodes as it can find, up to 4 nodes (256 cores).

Note If you request 96 cores on the pub64 queue, for example, the scheduler will NOT be able to run your job because it cannot allocate partial nodes (96 cores would be 1.5 nodes, 64 + 32 cores).

To see how many Cores Per Node each queue has on HPC, run the "q" command.

$ q

The -R y Job Reservation is important for mpi jobs. If you do not use it, your job will suffer from what is classically known as job starvation: smaller single-core jobs will sneak in and prevent the scheduler from gathering all of the cores your job needs, so if many single-core jobs are also waiting to run, your job can sit in the queue for a very long time.
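
Putting the directives together, a complete multi-node MPI script might look like the following minimal sketch (this is not the actual ~demo script; the program name hello-mpi is a placeholder):

#!/bin/bash
#$ -N TEST
#$ -q pub64
#$ -pe mpi 128
#$ -R y
#$ -m beas

# Launch the MPI program on all granted cores; $NSLOTS is set by Grid Engine.
# An MPI module must be loaded first (see the module commands below).
mpirun -np $NSLOTS ./hello-mpi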

For compiling MPI programs, HPC has several MPI flavors and compilers to choose from. To see a list of all of the OpenMPI flavors, do:

$ module available openmpi

For a complete list, do:

$ module available

And look for the MPI-MESSAGE_PASSING_INTERFACE section.
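
Once a suitable module is loaded, MPI programs are typically compiled with the wrapper compiler the module provides, for example (the module and file names here are illustrative):

$ module load openmpi            # choose an actual flavor from "module available"
$ mpicc hello-mpi.c -o hello-mpi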

MPI Using One Node Only

When running an MPI job on HPC that requires one node or less (partial node), you use the "one-node-mpi" Grid Engine parallel environment instead of the mpi parallel environment.

Note You can run on all HPC queues including the free queues.
Note You can request all of the cores on a node or only a portion of them. For example, you may request 16 cores from the pub64/free64 queues.

Consider the following one node MPI job available at:

  • cat ~demo/hello-mpi-one-node.sh

#!/bin/bash
#$ -N TEST
#$ -q pub64,free64
#$ -pe one-node-mpi 2-64
#$ -R y
#$ -m beas
Grid Engine Directive        What It Does

#$ -q pub64,free64
    Request cores from the public pub64 or free64 queues.

#$ -pe one-node-mpi 2-64
    Request the one-node-mpi Parallel Environment (-pe) with anywhere from 2 to 64 cores.

#$ -R y
    Job reservation. Needed for MPI jobs.

Using the -pe one-node-mpi range of 2-64, Grid Engine will search the specified queues for the largest core count possible for your job on one node.

If you want to run with a specific core count, like 16 cores for example, use:

#$ -pe one-node-mpi 16
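
The body of a one-node MPI script launches the program the same way as in the multi-node case, for example (the program name is again a placeholder):

mpirun -np $NSLOTS ./hello-mpi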

Interactive Job

Just as with batch jobs, interactive jobs can run in one of two modes.

  • Serial

  • Parallel ( OpenMP, MPI, other )

An interactive job, or better said an interactive session, is one in which you run a program that you interact with in real time.

For example, you can run Matlab or Mathematica in interactive mode. When you do this, you will see a window or windows displayed on your screen waiting for your input, and you then use the program accordingly (you interact with the program).

Another reason for an interactive node is compiling or testing your code: you need a shell command line on an interactive/compute node so that you can work.

The most common mistake on HPC is to run programs on the login nodes. Don’t do that please.

Warning Do NOT run interactive programs on the HPC Login nodes.

The login nodes, as explained above, are meant for very light and simple tasks. We have over 700 accounts on HPC, with 100-200 users logged in at any given time. If all 100 users (or even a fraction of them) were to run Matlab on a login node, the login node would quickly grind to a halt and crash, since the login nodes cannot handle that kind of work. If the login nodes crash, nobody will be able to get on HPC, and the HPC staff will be inundated with phone calls and emails, and that is not a good thing.

The HPC cluster has thousands of cores and it can handle a lot of work, but that work CANNOT be done on the login nodes. Heavy work is done on the compute nodes, and the GE scheduler takes care of distributing jobs among the available compute nodes.

Interactive Job Serial

To request an interactive node with 1-core, do:

$ qrsh -q interactive

Or simply run qrsh, since interactive is the default queue on HPC and one core is also the default:

$ qrsh

Now you can run your interactive session.
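
Once the interactive shell starts, you are on an interactive node and can work at the command line as usual; for example (the file names below are just placeholders):

$ qrsh
$ gcc -o mytest mytest.c       # compile and test your code on the interactive node
$ ./mytest
$ exit                         # release the node when you are done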

Interactive Job Parallel

When you need to run an interactive program that will use more than one core (run in parallel), you have to use a (-pe) parallel environment. Compute nodes are the workhorses of the cluster, so you will want to request a compute node for such work.

Although HPC has an mpi parallel environment, we will not go into that with interactive jobs as that can get very complicated.

To request an interactive compute node on the public pub64 queue with the OpenMP (-pe openmp) parallel environment using all 64 cores, do:

$ qrsh -q pub64 -pe openmp 64

The example above will request one node on the pub64 queue.

If a node is NOT currently available on the pub64 queue with all 64 cores, you will get a message similar to this:

Your "qrsh" request could not be scheduled, try again later.

You can try a different queue, or wait until enough cores become available, or you can try requesting fewer cores with a range. For example:

$ qrsh -q pub64 -pe openmp 4-64

Here you are requesting anywhere from 4 to 64 cores, and thus you will have a much better chance of getting a compute node.

You can also try the free64 queue, which has thousands of cores, but realize that your session may be suspended (your interactive session will freeze) since the free64 queue is a suspendable queue.

If you have access to private queues, you should request those first. For example, if you have access to the bio queue, you can request one bio node with all 64-cores with:

$ qrsh -q bio -pe openmp 64

If you have access to multiple queues, you can also request cores from multiple queues separating each queue with a comma. Example:

$ qrsh -q bio,pub64,free64 -pe openmp 64

In the command above, we are requesting one node with all 64 cores via the openmp parallel environment from the bio, pub64 or free64 queues. The order of the queues does not matter; the scheduler has been configured to pick first from private queues, then from the public queues, and lastly from the free64 queue.

Nodes (cores) may or may not be available right now depending on current usage. To see what cores are available to you right now, run the q command on HPC:

$ q

For a full explanation of the q command, please see:

Running MATLAB Jobs on HPC:

If running MATLAB batch jobs on HPC, you HAVE TO READ the following:

Once you compile your code to a native executable, reference the executable AND NOT the wrapper shell script. Otherwise, when Grid Engine suspends your job it will suspend the shell but not necessarily the executing program, and the executing program will continue to run.

If you are running on the free queues on HPC and Grid Engine cannot suspend your jobs, we will be forced to kill your jobs without warning, so make sure your MATLAB jobs can be suspended before running on the free queues on HPC.

Transferring Data To & From HPC:

Q & A:

QUESTION: Why is my job taking so long to run?

  • ANSWER: Did you add Job Reservation?

  • #$ -R y

When requesting multiple cores for a job, you will want to add job reservation. If you do not, single 1-core jobs will sneak in and prevent the scheduler from gathering all of the requested cores on a node, so your job will sit in the queue waiting for cores. This is classically known as Job Starvation.

Note: When using job arrays, please use job reservation with care. Do not use it for small core counts, like 4 cores or fewer; job reservation puts extra load on Grid Engine when there are lots of jobs to run.

QUESTION: What is the difference between -R y and -r y?

  • #$ -R y ← means Job Reservation

  • #$ -r y ← means Restart the Job in case the node crashes

QUESTION: What are Job Arrays?

  • ANSWER: Please see this link (written by Prof. Kevin Thornton):

  • You will want to use job arrays on HPC whenever possible for large jobs (say, over 50 jobs). By using one 2,000-task job array instead of 2,000 separate individual jobs, the Grid Engine scheduler can run a lot more efficiently and be more responsive.

QUESTION: I requested 64 cores but is my job REALLY using all 64 cores?

  • ANSWER: Good question! Just because you requested 64 cores and the scheduler allocated 64 cores to your job does NOT mean that your program is using all 64 cores. The only way to know how your job is performing is to go to the node (ssh) and run htop or top. Ssh-ing to nodes is normally prohibited because you can upset the load on the node; checking on your own job is the only exception, and it is to be done for no more than 10 minutes at a time.
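
For example, assuming qstat shows your job running on compute-2-7 (the node name will differ for your job):

$ qstat -u $USER           # note the node your job is running on, e.g. free64@compute-2-7
$ ssh compute-2-7          # go to that node, only to check on your own job
$ htop                     # confirm the expected number of cores are busy, then quit
$ exit                     # log off the node right away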

QUESTION: Can I ssh to a node?

  • ANSWER: No, unless you are checking on the status of your job (see above) and only for a few minutes (like 10 minutes max). It is a common practice for new users who don’t understand how to run jobs to simply ssh to a node. This is a big NO-NO because you are circumventing the whole purpose of having a scheduler.

QUESTION: What software is installed on the Cluster?

  • ANSWER: Run "module available" ( or "module av" for short ) and a large listing of all currently installed software will be displayed.

QUESTION: How do I know what queues I have access to?

QUESTION: How do I know what cores are available to run with RIGHT NOW?

Thank You:

Thank you for taking the time to read this web page, which will hopefully get you started with running jobs on the UCI HPC campus cluster.

Please remember that HPC is a large cluster serving the entire campus with thousands of jobs and hundreds of users, so please start slowly and carefully.

If you have run through the exercises above and are still having issues, first try asking one of your colleagues who is familiar with HPC for help. If they cannot help you, then contact the HPC support staff at:

Joseph Farran