Class: Linux on the HPC Cluster
Date: 9a-5p+, April 26th, Bren Hall Room 4011 - hjm
(with the vocal stylings of Harry Mangalam)
This full-day course will introduce you to both the Linux Operating System and its implementation on the HPC cluster.
It will cover:
what a cluster is and how it works
do you even need a cluster..?
how to ask a question so we can help you
Logging in with ssh
maintaining terminal sessions (screen/byobu/x2go)
what are (and how to tame) STDIN / STDOUT / STDERR / pipes
text files and editing
Controlling your jobs
Modules and how to find the programs you need.
moving data to and from HPC.
how to archive and de-archive data
simple bash programming
some simple data munging techniques
If you do not want to take the course but wish to consult on your data munging, analysis, movement, storage, archiving, etc, I’ll be available in the afternoon as well. (The active course will run til about 2p with the supervised tutorial going until 5 or later. I’ll be happy to consult thru that time.)
/dfs3 Back to Normal Operation
4-2-2019 1pm Joseph
The issues with /dfs3 have been found and corrected. /dfs3 is now back to normal working operation.
So what happened?
A bioinformatics program called Trinity created over 100 million empty directories, causing the /dfs3 MetaData to run out of room.
The /dfs3 file-system itself had plenty of space at 56% capacity, BUT the MetaData part of BeeGFS was at 100% capacity. MetaData is where a listing of all files is kept.
We have software called StarFish on HPC that monitors and accounts for files on the HPC dfs file-systems.
Unfortunately, StarFish was NOT configured to also count empty directories and that threw us off. We are all very familiar with ZOT files on HPC ( tons of tiny files ) but none of us here ever suspected millions upon millions of empty directories. This very strange edge case was new to ThinkParQ, too. Our collective debugging efforts took us down other, seemingly more-likely paths, first. If you are interested in numbers, there were over 100M empty directories and "only" 60M files on all of DFS3. These empty directories outnumbered real files by almost 2:1.
BeeGFS support, with whom we have a contract, was brought in early in the process. They started running scans of our BeeGFS file-system which, due to its size, took days to complete. BeeGFS support found the reason the MetaData was full and alerted us to it: the empty dirs were taking up all of the MetaData disk space.
Yesterday evening the removal of the empty directories was started, and this morning it completed, bringing /dfs3 MetaData usage back to a normal level of 41% instead of 100%.
This is a first for us, as nobody here was expecting software to create that many empty dirs. We will be taking steps to prevent this from happening again, such as placing a limit on the number of inodes (file / dir count) any one user can consume, as well as monitoring for empty dirs now that we know what to look for.
Thank you for your patience and sorry this took so long but trouble-shooting issues like this takes a while. The scans BeeGFS support ran ultimately led them to finding the root cause (empty dirs), however those scans took days to complete on our 1.3PB /dfs3 file-system.
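If you want to check your own directory trees for this pattern, here is a minimal sketch using standard find. The function names count_empty_dirs and count_files and the example path are hypothetical, not HPC-provided tools. Note that a scan of a very large tree puts load on the metadata servers itself, so point it at one project directory rather than a whole file-system.

```shell
#!/bin/bash
# Count empty directories under a given path.
# count_empty_dirs and count_files are hypothetical helpers, not HPC tools.
count_empty_dirs() {
    local top="${1:-.}"
    # -type d -empty matches directories that contain no entries at all.
    find "$top" -type d -empty 2>/dev/null | wc -l | tr -d ' '
}

# Count regular files under the same path, for comparison.
count_files() {
    local top="${1:-.}"
    find "$top" -type f 2>/dev/null | wc -l | tr -d ' '
}

# Example (the path is an assumption; substitute your own directory):
# echo "empty dirs: $(count_empty_dirs "/dfs3/pub/$USER")"
# echo "files:      $(count_files "/dfs3/pub/$USER")"
```

If the empty-directory count is far larger than the file count, it is worth looking at what the program that wrote them is doing before running it again.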
3-30-2019 10pm Joseph
An update on the status of /dfs3 as of Saturday the 30th at 10pm.
We continue working with BeeGFS support providing them with all requested info. We are running scans on the metadata as per BeeGFS support instructions - this process takes a while.
We have been able to remove millions of unnecessary files from /dfs3/pub which is helping and making /dfs3 usable albeit slow.
Until this issue is resolved, please don’t add to the problem by creating lots of files in /dfs3.
Stay tuned, Joseph
/dfs3 I/O Errors
The /dfs3 file-system is currently having issues. The metadata portion of /dfs3 is at 100% capacity.
We are doing all we can to bring the space down and we are not sure what is causing metadata to go from ~55% full to 100% in a short amount of time.
We are working to resolve this and have filed a ticket with BeeGFS support.
Please stay tuned.
HPC uid / gid Conversion
We are currently converting every HPC user's user ID (uid) and group ID (gid) on the cluster to match the campus LDAP uciCampusID. This is being done to unify HPC with the rest of the campus UID scheme.
Accounts are being converted only when the user is not running jobs or logged into HPC. When accounts are converted, the user will receive an email notifying them of the fact and what was done.
With the HUGE amount of data and users and the complexity of users on multiple HPC groups, there will be hiccups here and there. If you run into quota issues and/or access problems, please report them to firstname.lastname@example.org
Thank you for your understanding and patience during this transition.
/dfs1/bio Move Completed
The scheduled move of /dfs1/bio to /dfs3/bio has been completed. Please check your files and let us know asap if anything is missing. The old data will remain for 2 weeks and then it will be permanently removed.
/dfs1/bio moving-to /dfs3/bio on 2/25/19
We are attempting once again to move /dfs1/bio to /dfs3/bio. Last attempt was canceled at the last minute ( the 11th hour ) due to an emergency.
The move is scheduled for noon on 2/25/19, Monday the 25th.
Please note that after the noon hour on 2/25/19 the /dfs1/bio directory will no longer work. You will need to use /dfs3/bio instead. Yes this means that all running jobs referencing /dfs1/bio will error out after this time.
If this move presents a problem for you, please let us KNOW ASAP at email@example.com
1-31-2019 Joseph 5:00am
The BeeGFS file-system /dfs3 is back UP. We continue working with ThinkParQ to see if there are any additional steps we need to perform on /dfs3 but right now /dfs3 is back up and available.
Running batch jobs using /dfs3 most likely died with I/O errors - you will need to resubmit.
Please check your data on /dfs3 and report any issues to firstname.lastname@example.org
1-30-2019 Joseph 1pm
The /dfs3 BeeGFS file-system is down ( un-available ) due to BeeGFS meta-data issues. We are working with ThinkParQ support to bring the FS back up as soon as possible.
Updates will be posted here.
1-28-2019 Joseph 12:30pm
As planned everything in /dfs1 was moved to /dfs3 EXCEPT for bio. There was an 11th hour emergency and the bio folder was held back until a future date.
The /dfs1/bio folder will remain for a few more weeks - no date has yet been set for the move of /dfs1/bio to /dfs3/bio. When the date is known, it will be posted here.
New Instructions for using GPU nodes
For using HPC GPU nodes, please see:
/dfs1 is Retiring On 1/28/19
The /dfs1 file-server is being retired on 1/28/19 at Noon.
The /dfs1 hardware is very old, fragile and no longer under warranty. A new bigger and faster file server called /dfs3 is replacing it.
All data on /dfs1 will be copied to /dfs3.
Starting on 1/28/19 at noon, /dfs1 will NO LONGER WORK. You will need to use /dfs3 instead. Jobs using /dfs1 after this time will be killed in order to un-mount /dfs1.
This means that all running & pending jobs trying to use /dfs1 after the noon hour on 1/28/19 will fail. Yes you will need to update your scripts after the switch.
List of folders that will be replaced:
/dfs1/som → /dfs3/som
/dfs1/bio → /dfs3/bio
/dfs1/uci-mind → /dfs3/uci-mind
/dfs1/wpoon → /dfs3/wpoon
The following links will NO longer work when dfs1 is gone:
/dfs1/cbcl → /dfs2/cbcl
/dfs1/dabdub → /dfs2/dabdub
/dfs1/drg → /dfs2/drg
/dfs1/edu → /dfs2/edu
/dfs1/elread → /dfs2/elread
/dfs1/tw → /dfs2/tw
We are giving over 2 weeks' notice so that you can plan accordingly. However, if the date presents a problem for you, let us know ASAP at email@example.com.
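To help with updating scripts, here is a minimal sketch that lists files still mentioning /dfs1/ and rewrites the paths according to the two folder lists above. The function names are hypothetical, and the sketch assumes that no other /dfs1 folder names begin with the same strings (for example a folder called /dfs1/biology would also be rewritten). A .bak copy of each changed file is kept so you can compare before trusting the result.

```shell
#!/bin/bash
# List files under a directory that still mention /dfs1/ (skipping .bak copies).
# find_dfs1_refs and update_dfs1_refs are hypothetical helper names.
find_dfs1_refs() {
    local dir="${1:-.}"
    grep -rl --exclude='*.bak' -- '/dfs1/' "$dir" 2>/dev/null
}

# Rewrite /dfs1 paths in those files, keeping a .bak copy of each one.
# Folders that moved go to /dfs3; the old /dfs1 links go to /dfs2.
update_dfs1_refs() {
    local dir="${1:-.}" f
    find_dfs1_refs "$dir" | while IFS= read -r f; do
        sed -i.bak -E \
            -e 's#/dfs1/(som|bio|uci-mind|wpoon)#/dfs3/\1#g' \
            -e 's#/dfs1/(cbcl|dabdub|drg|edu|elread|tw)#/dfs2/\1#g' \
            "$f"
    done
}

# Example (the directory is an assumption; use wherever your job scripts live):
# find_dfs1_refs "$HOME/jobs"      # review the list first
# update_dfs1_refs "$HOME/jobs"    # then rewrite in place
```

Running find_dfs1_refs again afterwards shows any remaining /dfs1 references that did not match either list and need a manual look.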
New Policy Placed on HPC Free Queues
A new policy has been enacted on HPC free suspend-able queues like free*, abio* and asom*.
In order to now use these queues, YOU MUST use either HPC restart or HPC checkpoint. For details please see:
This is being done because jobs that are suspended on these queues do not release the memory they are using. As a result, the node owner cannot use all of the memory on his/her node and may not be able to run.
If you are not sure which one to use, select restart. It’s painless to use.
Also please note that you cannot use restart or checkpoint in interactive mode, only in batch mode.
ANOTHER HPC HEAT Meltdown
12-1-2018 2:45pm Joseph
The OIT data center suffered yet ANOTHER Heat Meltdown that started around 6-7pm on 11/30/18. At 7:45pm I received an alert from the OIT data manager of the heat issue and I powered off all compute nodes 5 minutes after the alert to prevent the nodes from melting further.
On 12-1-18 at 2:26pm, I received the green light from the OIT data-center manager that all was good and that I could bring back the HPC compute nodes that were powered off.
The HPC compute nodes were turned back on. Jobs running on the compute nodes when they were powered-off will need to be restarted unless you have HPC checkpoint or HPC restart enabled in your jobs.
I do NOT know the cause of the heat issue nor the extent of damage caused to the compute nodes ( if any ). I will look at the nodes on Monday when I get back to work.
HPC HEAT Meltdown
The OIT data center where HPC is housed suffered yet another Heat Meltdown starting around 5am on Wednesday the 10th.
The heat Meltdown:
Took out ~5 racks, 109 compute nodes (crashed) on the HPC cluster due to the excessive heat.
Destroyed two (2) $10k compute nodes ( compute-4-25 & compute-4-35 ).
Hundreds of research jobs were lost.
Probably shortened the life-span of the surviving healthy nodes.
Unfortunately this is not the first, nor second, nor 3rd time this has happened, where issues at the data center cause heat to quickly build up to dangerous levels, destroying and melting hardware.
HPC Default gcc and OpenMPI
The default gcc version on HPC has been changed from gcc/4.8.2 to
Also OpenMPI has been removed as a default module load. You can still make it available in your account by editing your .bashrc file and adding it back:
module load openmpi-3.1.2/gcc-6.4.0
But OpenMPI is no longer a default module to load. This was done as not everyone uses OpenMPI and it was causing confusion.
HPC Maintenance completed
9-25-2018 7pm RCIC Team
HPC is UP.
BeeGFS has been upgraded to version 7 on all servers and nodes.
RPMs updated for CentOS 6.9
Mellanox drivers upgraded to Mellanox OFED version 220.127.116.11.0
Re-cabling of Infiniband network.
Ubiquiti switches firmware update and all switches including Mellanox rebooted.
OpenMPI version 3.1.2 compiled against new Mellanox drivers. OpenMPI 3.1.2 is now the default on HPC.
Other flavors of gcc compiled for the new OpenMPI v3.1.2. If you need a particular flavor of gcc and/or OpenMPI re-compiled, please email firstname.lastname@example.org
Older versions of OpenMPI 1.6 and 1.7 removed from modules.
Expect some road-bumps. If you find issues or have problems, please email us at email@example.com
Big Memory Node ( bigmemory queue )
The HPC big memory node is now only accessible by request and it is no longer part of the public queues. Starting today in order to access the bigmemory queue you will need to be in the "bigmemory" HPC group.
If you need access to the big memory queue, please send a request to firstname.lastname@example.org with an explanation of why you need it and how much memory you will be using.
/ssd-scratch and /fast-scratch
The server that serves /ssd-scratch and /fast-scratch file-systems ( nas-7-1 ) is having hardware issues. It’s over 5 years old. We are evaluating whether to replace it or not.
If /ssd-scratch and/or /fast-scratch are important to your research work and you need them back, please let us know by sending an email to email@example.com.
Some R/3.4.1 package updates issued warnings.
When updating large numbers of packages, a few always fail to update due to dependency order or other issues. If you use one of these packages, please contact me <firstname.lastname@example.org> and I’ll try to resolve the issue:
hdf5r R2jags GroupSeq jqr libsoc RcppParallel V8 Rmpfr units sdols bamboo CePa fst rgdal TTAinterfaceTrendAnalysis WGCNA ggforce wellknown sf PMCMRplus gdalUtils randomcoloR BNSP divest macleish quickPlot rnrfa shallot SiMRiv wikilake ggraph velox ALA4R btb easyDes iemisc reproducible Rfast RSDA tidyRSS TropFishR bibliometrix dtwclust eurostat scholar Seurat SpaDES.tools SpaDES.core SpaDES
Why is my job not running?
A new script has been created to hopefully help you understand why a particular job is not running on the HPC cluster after you submit it.
To use it, simply enter "why <job-id>".
The Grid Engine report will then show the possible reason(s) the job is still waiting to run.
HPC Scheduler ( Grid Engine )
The HPC Grid Engine scheduler is having issues in which jobs that are supposed to run are not running and jobs that should be suspended are not being suspended.
I was out for 3 weeks due to a medical emergency and I recently got back. I am working on the problem but please note that the scheduler is complex and it will take me some time to figure out what is causing this.
Thank you for your understanding
new tools to monitor or profile your jobs
To address some of the resource mis-allocation that is preventing efficient use of the HPC cluster, I’ve written some tools that should make it easier for users to figure out what their jobs are actually doing.
myjobs help → to get the usage
myjobs → to see what it does
You can follow that up with jobs_at compute-x-x to inspect your jobs more closely via top. (Shows how much CPU and RAM YOUR jobs are using.)
qjobstatus - queries qstat to list all your jobs on all nodes and then cycles thru those nodes, listing the jobs and their run status. If you have no jobs submitted, don’t bother.
profilemyjobs - much more data, logged, printed, and plotted if you have an X11 screen available. Needs to be executed on the same machine as your job is running.
qbetta - view all Qs, all jobs, all status
Questions to <email@example.com>
Get your jobs on-CPU ASAP.
Unless you have dedicated Qs for your research group, the fastest way to get your jobs assigned a CPU is to use the free and/or pub Qs. Especially if you have short, serial jobs, your jobs will see silicon MUCH faster if you let the scheduler find a slot for you. We recently added several hundred cores to the free Qs so the free Qs should eat your jobs faster than ever. So add the following SGE Q directive to your qsub scripts: (note, no spaces)
#$ -q pub*,free*,otherQs
And don’t forget, we have large, non-64core machines (ie free56i) that will probably run your big jobs faster than the AMD 64-cores. If your code is floating-point-heavy, Intel CPUs will run faster than AMD 64core Bulldozers (but not the new AMD EPYC).
Also, unless you have verified that a 64-way job will run significantly faster than 2x32core jobs, use 2x32 core jobs, or 4x16 core jobs. Very few apps scale linearly past 16 cores.
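For anyone new to qsub scripts, here is a minimal sketch of where the Q directive goes. The job name and the echo line are placeholders, and the queue list leaves out the "otherQs" part, which you would replace with your own group's queues. The suspend-able free queues are also subject to the restart/checkpoint policy announced above, so treat this as a starting point rather than a complete script.

```shell
#!/bin/bash
# Minimal Grid Engine job script sketch. qsub reads the "#$" lines as
# directives; bash treats them as ordinary comments.
# Job name (hypothetical):
#$ -N short_serial
# Queue list, comma-separated with no spaces:
#$ -q pub*,free*
# Run the job from the directory it was submitted from:
#$ -cwd

# NSLOTS and JOB_ID are set by Grid Engine inside a job; the defaults
# below only apply when the script is run by hand for testing.
slots="${NSLOTS:-1}"
echo "Job ${JOB_ID:-interactive} running on $(uname -n) with ${slots} slot(s)"
```

Saved as, say, short_serial.sh, it would be submitted with qsub short_serial.sh.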
/pub switch-over Completed
The switch over of /pub is now complete. All data from /share/pub was rsync’ed to /dfs3/pub
The links were also updated on all compute nodes:
/pub -> /share/pub has been removed
/pub -> /dfs3/pub has been created
This means that you can keep on using /pub/$USER as before and no need to update your job scripts.
The process took longer than expected as several users were still using the old /share/pub at the time of the switch-over which slowed me down.
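The reason job scripts keep working is that /pub is a symbolic link, so only its target changed. The sketch below shows the idea with two hypothetical helper functions: one prints where a path really lives, and one repoints a link the way an admin would during a switch-over. It assumes GNU coreutils (readlink -f, ln -sfn); users only ever need the first one.

```shell
#!/bin/bash
# Print the real location of a path after following all symbolic links.
# resolve_path and repoint_link are hypothetical helper names.
resolve_path() {
    readlink -f "$1"
}

# Point an existing symbolic link at a new target.
# -n stops ln from descending into the old target directory,
# -f replaces the existing link.
repoint_link() {
    local new_target="$1" link="$2"
    ln -sfn "$new_target" "$link"
}

# Example on the cluster (shows where your /pub directory lives now):
# resolve_path "/pub/$USER"
```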
/pub Switch-over at Noon on Monday 6/4/18
Switch over extended to Monday 6/4/18
The public disk file space /pub is being moved from /pub to /dfs3/pub in order to provide more room for the growing needs of all /pub users.
The move will happen at noon on Monday 6/4/18. Everything is going to be copied over from /pub to /dfs3/pub and a new link will be set up so that /pub will now point to /dfs3/pub.
If you are using the original /pub after the noon hour on 6/4/18, your processes will be killed so that we can set up the new links: /pub → /dfs3/pub.
Please note that jobs using /pub after 6/4/18 at noon will crash or be killed, so plan accordingly.
If this time presents a problem for you, email firstname.lastname@example.org asap.
/pub at 94% capacity
The public file server /pub is at 94% capacity. Writes disabled on /pub until it goes down to a safer level of 90% or less.
Please remove or move data to some other location.
A note will be added here when /pub goes below 90% capacity and /pub will go back to write-enable.
UPDATE: 5-7-2018 1:30pm Due to critical jobs using /pub and with /pub a little lower at 92%, it is back in write-mode for the time being. If /pub starts going up close to 100%, it will be disabled again. So please remove all unwanted data to help out.
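To see how full a file-system is and which of your directories are taking the most room, something like the following sketch may help. The function names are hypothetical, and the example paths assume your data sits under /pub/$USER. The largest_entries helper skips hidden files and can take a while on big trees.

```shell
#!/bin/bash
# Print the Use% figure for the file-system holding a path, as a bare number.
# usage_percent and largest_entries are hypothetical helper names.
usage_percent() {
    # -P forces the POSIX one-line-per-file-system output format.
    df -P "$1" | awk 'NR == 2 { sub(/%/, "", $5); print $5 }'
}

# List the N largest entries (size in KB) directly under a directory.
largest_entries() {
    local dir="$1" count="${2:-10}"
    du -sk "$dir"/* 2>/dev/null | sort -rn | head -n "$count"
}

# Example (paths are assumptions; adjust to your own layout):
# echo "/pub is at $(usage_percent /pub)%"
# largest_entries "/pub/$USER" 10
```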
/pub at 100% capacity
The public file server /pub is at 100% capacity. Writes disabled on /pub until it goes down to a safer level of 90% or less.
Please remove or move data to some other location.
We will be moving /pub to our new /dfs3 on a future date to fix the data crunch on /pub.
A note will be added here when /pub goes below 90% capacity.
UPDATE: 3-1-2018 9:30am /pub back down to a safer level of 87%. /pub back to normal read and write. Thank you for the quick action.
If you received emails about being over-quota on /pub, please ignore them.
Update your PUTTY if you use Windows.
Friday, Dec 15th
If you use [MS Windows]…
you have our sympathy.
if you use Windows < 10, please update your Putty to the latest version. We are changing the ciphers on the ssh servers on HPC and the old versions of Putty do not support the more secure ciphers.
If you use Windows 10, you can install OpenSSH natively now, finally. Or update Putty, which should continue to work on Win10.
The Putty page is here: https://www.chiark.greenend.org.uk/~sgtatham/putty/
New LightPath I/O Nodes
4:00pm Friday, 12-15-2017 Farran
Two new I/O nodes have been set up on the LightPath Network ( 192.5.19.x ).
To access the LP nodes, use:
qrsh -q ionode-lp
Each LP node has two 10G interfaces bonded for theoretical speeds of 20Gb/s. You should be able to use the LP ionodes for moving data both inside and outside of UCI.
Disabled for the time being.
HPC BLCR ( Checkpoint )
1:00pm Friday, 11-17-2017 Farran
An issue was discovered with HPC BLCR Checkpoint: it was causing programs to not complete successfully, and some programs to re-start.
The issue was isolated to left over bits from the old 3.10.104 kernel when HPC was upgraded back in August. BLCR was re-compiled from scratch and all nodes updated with correct BLCR bits for current 3.10.107 kernel. Testing shows the issue has been fixed and programs that work with BLCR now complete successfully.
Oct 30, 4:12pm - hjm
The metadata server (dfm-1-1) for /dfs1 locked up at about 4:12pm today for about an hr until it was rebooted. All files being written at that point (on /som, /bio) should be inspected for completeness, altho according to spec, the FS should be able to deal with fairly long timeouts. From the logs, it’s clear that at least some files failed to recover, but it’s also clear that some did recover - it looks like it’s application-specific. It looks like this was a random RAM glitch that managed to bypass the ECC circuitry. We’ll be keeping an eye on it.
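One way to find candidates for inspection is to list files whose last-modified time falls inside the lockup window. This is a minimal sketch; the function name, the example path and the exact window are assumptions, and -newermt requires GNU find. A file showing up in the list only means it was touched in that period, not that it is damaged.

```shell
#!/bin/bash
# List regular files last modified after START and no later than END.
# files_modified_between is a hypothetical helper; -newermt needs GNU find.
files_modified_between() {
    local dir="$1" start="$2" end="$3"
    find "$dir" -type f -newermt "$start" ! -newermt "$end" 2>/dev/null
}

# Example (path and window are assumptions; adjust to your own directory):
# files_modified_between "/dfs1/bio/$USER" '2017-10-30 16:00' '2017-10-30 17:30'
```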
/pub at 87% capacity
12:00am Friday, 10-13-2017 Farran
/pub is now back down to a safe level of 87%. Thank you for quickly acting on this.
The /pub file system is now back to normal in read & write mode.
If you received emails about being over-quota on /pub, please ignore them.
/pub at 99% capacity
11:30pm Thursday, 10-12-2017 Farran
The /pub file-system is now at 99% - No new data will be allowed on /pub until the file-system is below 90%.
This means that running jobs writing to /pub will crash. This was done to prevent /pub from going 100% and causing massive data corruption across the file-system.
When /pub is under 90% a note will be added here and /pub will be set to read & write again.
/pub at 98% capacity
11:45am Thursday, 10-12-2017 Farran
The public file server /pub is at 98% capacity and climbing. Please remove and/or move data OUT of /pub until it reaches a safer level of 90% or less.
If /pub reaches 99% quotas will be set at zero to prevent /pub from going to 100% and thereby causing possible data corruption.
11:45a Friday, 09-29-2017 Farran
The /dfs2 file system reached 99% capacity today and climbing to 100% capacity fast.
When a file-system reaches 100%, horrible, very horrible things can happen, like the entire FS going corrupt, which means massive data loss.
As an emergency, I removed a lot of files under the temp directory at:
To get /dfs2 to a safe capacity level. If you were using /dfs2/temp, please be aware of this and don’t use the temp ( /dfs2/temp ) directory, as it will soon be removed.
Joseph Farran ( while on vacation )
Remaining problems since upgrade.
8:30a Monday, 09-05-2017 hjm
Checkpointing with BLCR still not working
at least 3 Qs were corrupted and have not been restored
Robinhood filescanning scripts and configuration were destroyed. They need to be re-installed.
Selective Backup is not working (altho access to existing data has been restored - see below).
Some Intel 8core nodes are still not mounting /fast-scratch - seems like a network problem.
Please inform us of additional problems.
GPU nodes have new Nvidia drivers.
Monday, 09-05-2017 hjm
The Nvidia GPUs have been updated and (after a module load cuda) are responsive to nvidia-smi. We currently have these GPUs on HPC.
Selective Backup Server restored
Monday 8-30, 2:45p Francisco
Planb2, the Selective Backup server, has been upgraded and the /sbak filesystem is available from compute-1-13, the interactive node, as described in the doc, altho new backup runs are failing.
Upgrade status, Remaining issues
Monday 8-30, Imam/Francisco/hjm
The HPC Cluster is up, but as noted below, there are some remaining issues. This was a significant upgrade (major OS, networking, software changes, as well as re-racking and re-cabling). We will be continuing this on an ongoing basis but with much less interruption. Individual racks may be shut down for several hours as we re-cable them for better power resilience and cooling.
some nodes lost ZFS software, resulting in the disks being orphaned. The data is still on the disks but they aren’t assembled into valid arrays. Being worked on. This includes:
possibly others (let us know if yours is among them)
some NFS mounts aren’t automounting correctly for some reason. This includes:
possibly others (let us know if yours is among them)
checkpointing is broken at least for some users. If you’re successfully using the BLCR checkpointing, let us know.
because NAS-7-1 seems to have a hardware failure, the services that run thru it will be broken until it’s fixed. Those include the Selective Backup, /fast-scratch, /ssd-scratch, and the Robinhood filesystem database, and our Trac ticketing & doc system.
Retrieval from your Selective Backup files will not be possible until the planb2 server is upgraded (it’s not administratively part of the HPC cluster, so it has to be done manually).
there are a few nodes that did not reboot correctly so we have to check them manually. If you’re missing a favorite node, please let us know.
HPC Upgrade (not quite) Completed
Monday 8-28-2017 6pm, Joseph/Imam/Francisco
HPC upgrade completed. The Cluster is UP.
Two new physical racks were installed, and work was done on all of HPC's main components, including the BeeGFS data server, login nodes, head-node and a new switch, plus lots and lots of re-cabling. Imam did a great job of neatly re-wiring the two new racks, replacing our previous spaghetti mess of cables.
All nodes were upgraded from CentOS 6.8 to 6.9, BeeGFS updated to latest 6.14 version, Mellanox Infiniband drivers and software upgraded to 4.1-18.104.22.168, and around 500 rpms updated.
The downtime took longer than estimated as we were hit with several surprises and also due to being short on staff.
For now the following will be unavailable/broken until we get around to fixing them:
Nas-7-1 has hardware issues
OpenMPI bits need updating, some OpenMPI modules may not work correctly.
GPU nodes need new NVIDIA driver installed.
HPC Selective backup.
Intel Software licenses.
BeeGFS not mounted on all nodes.
We did not get around to converting all nodes.
The issues above are mostly a direct result of the minimal staffing levels we have on HPC for the size and complexity of the cluster. We are operating with approx 1/3rd the needed staff when compared to other clusters of similar size and scope.
As with all upgrades please expect road bumps along the way. Please report issues to email@example.com and we will work on them as staff time permits.
Joseph & HPC Team
HPC Downtime Extended
Sunday 8-27-2017 9pm, Joseph
HPC downtime extended until Monday night 8/28/2017. Too much to do and not enough staff. Hopefully we can finish by Monday otherwise we will go into Tuesday.
Ether Switch issues
Monday 7-17-2017, Joseph/Imam
One of seven Nortel GigE switches on HPC was replaced and moved to a different location when it started powering itself down due to heat issues.
The problem started during the weekend and several nodes went off-line when the switch went down. Jobs may have been affected when Grid Engine could not communicate with the nodes, so please check your jobs.
All is back to normal now.
HPC Cluster Down-time August 26th 6am until August 27th at 9pm
Tuesday 6-27-2017, Joseph
Downtime extended until Sunday 9pm
The entire HPC cluster will be down and unavailable starting Saturday August 26th (8/26/17) at 6am until Sunday August 27th (8/27/17) at 9pm.
A maintenance of the Data center electrical generator is scheduled for August the 26th and as a result the entire HPC cluster will need to be taken down.
All jobs will be killed when the nodes are powered off so please plan accordingly.
We are planning on upgrading several parts of HPC during this downtime, such as the OS and kernel, which means that jobs using HPC CheckPoint may not resume when the cluster comes back up. We will make every attempt at resuming jobs, BUT past history has shown that big changes to the cluster can confuse BLCR checkpoint, leaving the jobs unable to continue.
If this downtime presents an absolute problem for you, please send us an email asap to firstname.lastname@example.org
/pub at 99% capacity
Friday 6-16-2017 9:30am, Joseph
The /pub file-system is at 99% capacity and climbing. The /pub file-system was placed in No-More-Write Mode until /pub goes to 94% or lower. This has been done to prevent possible data corruption if /pub reaches 100%.
Current jobs writing to /pub will error and/or abort.
Once /pub is back to a safe level, a message will be posted on the HPC login nodes.
Grid Engine Change
Tuesday 6-13-2017 Joseph
A change has been made to SGE "starter method". This is the script that all jobs on HPC go through when a job is submitted. The change was made to fix one specific issue. If you notice any strange problems with your jobs that were not present before, please email HPC support asap.
Interactive node / compute-1-13 is UP
Monday 6-5-2017 Joseph
The interactive node ( qrsh ) lost its OS disk. New disk installed and node re-imaged. It is back on-line. If you notice any particular software missing on the interactive node, please email hpc-support.
Interactive node / compute-1-13 is dead
Sunday 05-28-2017 hjm
The interactive node (aka compute-1-13) appears to have died; it keeps power-cycling and therefore is useless as a working node so it’s been taken offline. The Interactive Q has also been disabled. Any jobs running on c-1-13 have died along with it. We’ll replace it with other hardware on Tuesday.
nas-7-1 controller errors require a restart
5-10-2017, 10am, hjm
10:00am, May 10: nas-7-1 is being taken offline to correct a critical disk controller problem. As far as we know, this problem has not caused any data loss or corruption, but is just causing controller restarts.
If you have jobs running on /fast-scratch or /ssd-scratch, they will fail and need to be re-started after nas-7-1 is restarted.
To test the fix, nas-7-1 will be restarted, but it may be taken down again to completely replace the disk controller if the problem persists. It will probably take about 2 days to tell whether the problem is still there, so you may not want to use the above-mentioned filesystems during that time.
DNS Problems on HPC
5-5-17 9:00pm Joseph / Harry
The campus automated security process took out HPC when it detected a possible intrusion. It is not yet clear what the detection means, and we are looking into it.
The unfortunate thing is that HPC was taken out from the campus DNS service at around 2pm and NOBODY notified us of that change. So we spent numerous hours breaking our heads trying to figure out what was going on when some users could not get in. The DNS service was restored around 8pm after several email exchanges with the security folks.
From around 2pm to 8pm, while DNS was out for HPC, users logging in to HPC with normal passwords were denied. Anyone already on HPC trying to reach the outside was also denied, as DNS name resolution was being blocked.
As of around 8pm today, things are back to normal except for a major migraine of a headache.
3-16-17 5:40pm Joseph Farran
The /dfs1 file-system is made up of 6 data servers. One of the data servers, dfs-1-6, was experiencing Infiniband connectivity issues.
The dfs-1-6 server was rebooted and corrections were made to the Infiniband configuration. As a result, some jobs that were using dfs-1-6 during that time experienced file I/O errors, as the jobs could not get to the data.
The problems have been fixed and /dfs1 is now back to normal operating mode.
Please report issues on /dfs1 if they happen again AFTER the time stamp of this posting.
/pub available again in read/write mode
3-16-17 12:30pm Joseph Farran
The /pub file system is back down to a safe level of 81%. Thank you all for removing enough data to bring it down to a safe level.
/pub at 100% capacity
3-15-17 6:15pm Joseph Farran
The public /pub file-system reached 100% capacity. The file-system was placed in a no-more-write status until /pub is down to a safe level of 85% capacity or under.
If we allow /pub to continue at 100% capacity then that can easily cause data corruption which nobody wants.
Please start deleting / moving all unwanted files OUT of /pub.
A followup note will be added here when /pub is back to a safe level and converted back to read/write mode.
/dfs1 is back up
UPDATE: 8:12a, March 7, 2017 hjm
Joulien was able to restart the server and from a filesystem POV, things seem to be fairly normal. BUT, as noted above, jobs that were writing to or reading from /dfs1 may have crashed or otherwise misbehaved.
Please check your jobs carefully.
/dfs1 is down
4:15a, March 7, 2017 hjm
/dfs1 (includes /bio, /som) is down this morning. One storage server went offline at about 1:44am this morning.
I have shut down the system until we fix the problem. Running jobs that referenced files on /dfs1 will hang and possibly fail when we re-start it. Please check your jobs carefully when the FS comes back.
HPC Short on Staff
2-23-2017 Joseph Farran
For the next month or longer there may be delays in answering HPC email as we are short on staff for a minimum of one month.
If you don’t hear from us please send us a nice reminder after a couple of days and be patient.
Thank you for your understanding,
Shutdown Chill water supply to the OIT Data Center on Jan 23rd 6am
1-10-2017 Joseph Farran
On Jan 23rd OIT/UCI facilities will be performing the final commissioning of the Data Center Control Systems. In order to complete the commissioning process they have to shut down Chill water supply.
The Chill water supply will be offline from 6am until 4pm on the 23rd (1/23/17). We may have to bring down the HPC cluster at this time if we are not able to maintain the CORE with the backup cooling systems.
Hopefully HPC will NOT need to be shut down on the 23rd, but please plan accordingly.
HPC NOT Available
11-11-2016 Joseph Farran
A huge job array submitted by one user overloaded the /data file server: after several hours its I/O stack overflowed and the server crashed. Because /data is the heart of the cluster, the crash took the entire HPC cluster down with it.
The user has been notified and account locked for now to prevent this from happening again over the holiday weekend.
HPC Status after the upgrade
11-06-2016 Joseph Farran
The HPC upgrade done on Friday went well for the most part with a few wrinkles. The /dfs2 file system took a while to come up as quotas were enabled.
Not all CheckPointed programs resumed correctly when the cluster came up. It looks like the change in kernel was enough to cause issues with resuming jobs that were checkpointed with the older kernel.
Please note that we have approx 1,500 installed software packages on HPC. Unfortunately we do not have enough staff to check them all, so if software that was running before the upgrade has issues after it, let us know at email@example.com
HPC DOWN Friday 11/4/16 from 8am til 6pm
10-31-2016 Joseph Farran
The entire HPC cluster will be down on Friday 11/4/16 from 8am until 6pm in order to correct a serious security issue.
All jobs running with HPC CheckPoint will be check-pointed so that they will resume running when the cluster comes back up. All other running jobs will be killed when the compute nodes are rebooted.
We are hoping the cluster will be up by 6pm on Friday, but please plan accordingly as it may take longer if we run into complications. Sorry for the short notice, but this is an emergency.
Thank you for your understanding.
The HPC Team.
Data Center HVAC Testing Friday 7am
9-28-2016 Joseph Farran
The HVAC systems in the OIT Data Center where HPC is housed will be tested on Friday (9/30/2016) starting at 7am.
Everything is expected to perform without issue, however in the event we do encounter any problems we may need to shut down the HPC cluster to avoid overheating the data center. So please plan accordingly. Again we expect everything to go smoothly and HPC will remain running without issues.
/pub full ( 100 % )
9-21-2016 Joseph Farran
The public /pub file-system is at 100% capacity. The /pub FS has been made into a read-only file-system until I can find the time to trim it down and get disk quotas enabled on it.
I will post an update here when /pub is back available in read/write mode.
5:30pm update: The /pub file-system has been re-mounted with disk quotas. Disk quotas have been set low until the file-system goes below 95% capacity.
PLEASE remove/move all unwanted files from /pub that you no longer need. As soon as /pub goes down to 95% or lower, I will reset disk quotas back to normal. Thank you for your cooperation.
12:00pm update: /pub down to 94% capacity. Resetting quotas back to normal. Thank you for removing enough files to bring /pub under 95%.
New Selective Backup Service Now available
9-13-2016 Joseph Farran
For many years HPC has run without any kind of backup service. That has now changed.
We now have a new service called Selective Backup that lets you specify what you would like to have automatically backed up on the cluster. Every user on HPC receives 1TB of backup space. For complete details, please see:
We highly recommend that you update your SB backup options to request email notifications. Instructions are included in the link above.
If you have any questions please let us know at firstname.lastname@example.org
8-29-2016 11am Farran/Joulien
It looks like the problem with /pub was bad drives and NOT a bad LSI raid card. The issue keeping the raid card from seeing all drives was isolated to two bad drives. Removing the bad drives allowed the raid card to come up all the way and work.
We are not sure why bad drives would hold the raid card captive. That's a bad design, but that was the issue.
Please use /pub with caution, as we don't know whether another drive may go south and again hold the raid card captive, bringing down /pub.
Mon Aug 29 11:49:34 PDT 2016 hjm
It looks like the chassis for /pub is having backplane issues. We are in the process of swapping the disks and controller to a spare chassis to see if we can bring up /pub on different hardware while we send the original back to be checked.
While /pub is down, many nodes will have higher than average apparent loads since they're trying (and retrying) to connect to /pub files. Until we resolve this, that will continue. Don't use the command df unless you give it a specific filesystem. If you type df -h, your terminal will hang until you kill it unless you immediately put it in the background (df -h &). Some other filesystem-related commands will also be slow or hang, esp if they refer to /pub explicitly or implicitly.
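As a rough illustration of the df advice above, the sketch below queries a single filesystem and gives up after a few seconds instead of tying up the terminal. check_fs is a hypothetical helper name, the 10-second limit is arbitrary, and timeout is the GNU coreutils command. It is only a sketch: a process stuck in uninterruptible I/O on a dead mount may not respond to the timeout signal.

```shell
# Query one filesystem instead of every mount, with a time limit.
# check_fs is a hypothetical helper name; 10 seconds is an arbitrary choice.
check_fs() {
    fs_path="$1"
    if timeout 10 df -h "$fs_path"; then
        return 0
    else
        echo "df on $fs_path failed or timed out" >&2
        return 1
    fi
}

# Example: check a filesystem that is expected to be healthy.
check_fs /
```

Naming the path keeps df from touching unrelated mounts, which is the point of the advice: only the filesystem you ask about is queried.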
/pub down again
8-28-2016 1:00pm Farran
Looks like /pub is out again with similar issues as before. The data server will be down until Monday when we can look at it.
/pub back on-line
8-24-2016 6:15pm Farran
The /pub file-system is now back on-line - no data loss as far as we can tell. The xfs_check completed without any errors.
The LSI MegaRaid 9266-4i card on nas-7-2 was replaced. We took extra precautions, checking with the vendor, as previous raid card swaps have not always gone smoothly.
The /pub file-system was mounted without disk quotas because mounting with quotas causes the mount to stall. What this means is that we may still have hardware issues on nas-7-2. So please be aware of this in case nas-7-2 goes out completely in the future. For now /pub responds but hangs at moments and then continues (again probably hardware issues).
/pub file-system recovered but not available yet.
8-24-2016 11am Farran/Joulien [2:45pm]
Joseph has swapped controllers and has recovered the missing filesystems. He's currently running filesystem checks to make sure as much data is recovered as possible. We will post again when /pub and /checkpoint are up again. [hjm]
/pub file-system is still down.
8-24-2016 11am Farran/Joulien [11am]
Still waiting on confirmation of process by LSI technical assistance. Hoping that we can bring /pub back up by the end of today, tho we will have to run some filesystem checks to make sure that the data is intact. Given that /pub is a fairly large FS, this may take until tomorrow. [hjm]
/pub file-system is currently down
8-22-2016 11am Farran/Joulien
The /pub file system started experiencing I/O errors earlier today and became inaccessible. The server (nas-7-2) was reset but has not come up all the way, as the Raid card is currently running checks and we need to wait for them to complete.
Please note that jobs using /pub will either freeze or crash. We will post an update here as soon as the server is available.
8/23/16 12:30pm Update: As for right now /pub remains unavailable.
Advanced HPC (the vendor) has been contacted. We suspect a bad Raid card but are waiting on an Avago/LSI hardware expert before we proceed, as we don't want to do something that will jeopardize the 60TB of data on /pub.
Some OpenMPI apps broken
Some flavors of OpenMPI apps are broken: jobs start but then go into a 100% CPU state, effectively locking up the program.
Something similar happened last year, when new Mellanox drivers caused jobs not to hang but to run very slowly. We are severely short on staff here so this may take longer than usual to track down and fix.
GE commlib errors and freezes
Grid engine has been reporting commlib error: (Connection refused) errors intermittently since yesterday with freezes and also users not being able to login.
We traced the issue to the probable cause being the /data server and/or its connections. When the /data server stops, HPC basically freezes as it is the heart of the cluster.
Today around 2:30pm Ethernet cables were replaced on the /data server. We are also using different ports for /data on the Nortel switch in case it was a bad port. The Nortel switch was also rebooted and other housekeeping things done.
System pings were showing around 12% packet loss originally and now we are getting 0% packet loss after the changes - so this is good news. Time will tell if the problem has been fixed or not.
HPC Grid Engine is Down
HPC Grid Engine ( scheduler ) database is corrupt.
Commands such as qstat and qsub, along with all other GE commands, will not work while GE is down.
We are working to see if we can remove enough bad jobs to allow GE to start but we may need to reset the db.
The Grid Engine database was corrupted enough that it had to be re-initialized. This means that all jobs that were running or waiting to run were lost.
The Grid Engine database was moved to a new location under the ZFS file system so that if this happens again, we may be able to revert to a previous db instead of starting from scratch.
The first Nytro card replacement went smooth as silk, however the second and last Nytro card went out with fireworks.
The raid card technician was on-site and after several hours of near catastrophes, the tech was able to bring back a missing volume on the last Nytro card thus avoiding a total and complete meltdown (data loss) of /dfs1/.
Many thanks to Avago/Broadcom/Advanced HPC for having the tech on-site, as we would NOT have been able to recover from this by ourselves.
/pub not accessible
Due to a glitch after the HPC upgrade, the /pub file system was not accessible on nodes that do not have an Infiniband connection, such as those in the pub8i queue. The problem has been fixed.
/dfs1 ( /som & /bio ) Offline 6/27/16 from 10am til 2pm
The HPC distributed file-system /dfs1 ( /dfs1/som & /dfs1/bio ) will be unavailable Monday 27th from 10am til 2pm so that we can retry the replacement of the problematic Nytro raid cards.
The technician from Avago/Broadcom will be on-site at our data center to help us with the Nytro raid card replacement, as it did not go well during Tuesday's HPC maintenance.
The cluster will remain UP, however /dfs1 will be taken off-line. This means that jobs using /dfs1 will die/crash or the nodes may lock-up. You can help by NOT using /dfs1 on Monday the 27th and stopping all jobs that use /dfs1.
|Users with files in /som & /bio, PLEASE backup all critical data in the event that the Nytro cards decide to go out with a blast and take the data with them.|
HPC Cluster Maintenance Done.
The HPC cluster maintenance is complete, however, we will need to schedule another downtime in the near future - read on please.
Over 200 nodes & data servers were updated to the latest CentOS 6.8 release. We are NOT yet ready for CentOS 7, so CentOS 6 for now.
Each node received over 1,400 updated packages, a new Kernel, the latest Mellanox drivers with speedup bits, BLCR and the latest BeeGFS file-system which now includes disk quotas.
The replacement of the problematic Nytro Raid cards did NOT go very well, however. The first replacement went smoothly, but the 2nd Nytro card replacement went south. The new raid card did not recognize the volumes on one of the 4 distributed file-servers. After several failed attempts the decision was made to leave the old Nytro cards for now.
We are going to get the Avago/LSI/Broadcom expert back on-site to help us replace the remaining 3 Nytro cards, as we don't want to risk data-loss / corruption on a huge file-system. This means that HPC will need to be taken down again in the near future - so this is a heads up.
|As soon as we can set a date with the Avago expert, we will schedule another HPC downtime and announce it here.|
Since we have new Mellanox drivers, we will be re-compiling all OpenMPI flavors against the new Mellanox bits in the days to come.
If you notice issues, please email us at email@example.com or better yet use the new handy-dandy mayday script on HPC.
Thank you for your patience,
The HPC Elves.
HPC Cluster Down Time: Tuesday June 14th 9am - 6pm.
The entire HPC cluster will be down on Tuesday June the 14th, from 9am til 6pm in order to replace the 4 remaining LSI Nytro Raid cards that have caused data corruption in the past.
The /checkpoint file-system is also going to be re-located as the current server for /pub & /checkpoint is over-loaded and cannot keep up.
|Replacing Raid cards is a risky maneuver. Although we will take every possible precaution, there is no guarantee things will go smoothly, so PLEASE BACKUP all of your important data outside of HPC in the event things go bad.|
If this downtime presents a problem for you, please notify the HPC support group ASAP at firstname.lastname@example.org
Module abyss/1.9.0 installed
June 6, 2016 Garr and Joulien
Installed the newest ABySS, version 1.9.0. Load it with:
module load abyss/1.9.0
/dfs1 metadata server crash
June 2, 2016 hjm
The metadata server for /dfs1 (/bio & /som) crashed at about 10:15pm on Thursday, June 2nd. Since it was the brains of the system, everything else stopped as well, so while any data in flight was probably lost, data on disk was probably safe. There's no indication of what happened so far, but please check your data on /bio and /som to see if everything that's supposed to be there is still there. Another reason to calculate checksums and do frequent backups. Let us know if you've suffered catastrophic data loss. We probably won't be able to recover it, but we'll commiserate.
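One way to act on the checksum advice is sketched below, assuming GNU coreutils and findutils (sha256sum, sort -z, xargs -0 -r). make_manifest and verify_manifest are hypothetical helper names, not HPC-provided tools. The manifest must be given as an absolute path and stored outside the directory being checked.

```shell
# Record SHA-256 checksums for every file under a directory, then verify later.
# make_manifest / verify_manifest are hypothetical helper names.
# The manifest path must be absolute and live outside data_dir.
make_manifest() {
    data_dir="$1"
    manifest="$2"
    # Run inside the directory so the manifest stores relative paths.
    ( cd "$data_dir" && find . -type f -print0 | sort -z | xargs -0 -r sha256sum ) > "$manifest"
}

verify_manifest() {
    data_dir="$1"
    manifest="$2"
    # --quiet prints only files that fail; exit status is non-zero on any mismatch.
    ( cd "$data_dir" && sha256sum --quiet -c "$manifest" )
}

# Small demonstration on a throwaway directory.
demo_dir=$(mktemp -d)
demo_manifest=$(mktemp)
echo "sample data" > "$demo_dir/results.txt"
make_manifest "$demo_dir" "$demo_manifest"
verify_manifest "$demo_dir" "$demo_manifest" && echo "all files match"
rm -rf "$demo_dir" "$demo_manifest"
```

Building the manifest while the data is known to be good, and keeping it somewhere other than the filesystem it describes, lets you tell after a crash which files are missing or changed.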
/pub & /checkpoint back on-line
4:00pm 04-27-2016 Farran/Harry
The Avago ( LSI ) tech was at the data center this morning to supervise replacing the problematic LSI Nytro card with a new LSI Raid card ( non-Nytro ).
Good news in that all went well. The new Raid card was able to pick up all drives and see all volumes and both file-systems: /pub & /checkpoint. The downtime took longer than expected due to the /pub file-system having to recheck quotas.
Since the nas-7-2 data server is over-loaded serving both /pub & /checkpoint, we are looking at moving /checkpoint to a different server in the future. We are also going to plan an HPC shutdown in order to replace the remaining 4 Nytro Raid controllers before they cause more havoc.
The HPC shutdown will be announced here once we have a date set.
/pub, /checkpoint failure
11:00am 04-22-2016 hjm
|nas-7-2, which provides /pub and /checkpoint, has crashed. We are trying to figure out what the problem is but it will likely be hours before it comes back on line. If your jobs are running with Checkpointing, they will likely fail.|
A note that we are rebooting nodes on HPC in order to fix the glibc exploit that was recently discovered. All critical nodes have been fixed & rebooted. A few nodes remain that have jobs on them.
If you are a node owner with current running jobs that can be restarted, please consider stopping your jobs so that your nodes can be updated. The process takes 5 minutes max and you can then resume.
HPC major power outage
Posted: 3:40pm, Friday 01/29/2016 - Harry & Edward in DC
We had an unexpected power outage in our data center where 2 PDUs went out. Some services will be interrupted while we figure out the problem.
4:40pm update: We have most parts of the service recovered.
HPC head node temporarily down
Posted: 10:25am, Tuesday 01/05/2016 - Edward
We’re currently experiencing an outage. The HPC head node is down.
11:00am update: The head node is back up.