ANOTHER HPC HEAT Meltdown

12-1-2018 2:45pm Joseph

The OIT data center suffered yet ANOTHER Heat Meltdown that started around 6-7pm on 11/30/18. At 7:45pm I received an alert about the heat issue from the OIT data-center manager, and within 5 minutes I powered off all compute nodes to prevent further heat damage.

On 12-1-18 at 2:26pm, I received the green light from the OIT data-center manager that all was good and that I could bring back the HPC compute nodes that were powered off.

The HPC compute nodes have been turned back on. Jobs that were running on the compute nodes when they were powered off will need to be restarted unless you had HPC checkpoint or HPC restart enabled in your jobs.
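
For reference, a minimal sketch of what requesting checkpoint/restart could look like in a Grid Engine job script. The checkpoint environment name ("blcr") and the application module are assumptions; list the environments actually defined on the cluster with qconf -sckptl or check the HPC docs:

#!/bin/bash
# job name
#$ -N my_job
# run in the free and/or pub queues
#$ -q free*,pub*
# assumed checkpoint environment name; list the real ones with: qconf -sckptl
#$ -ckpt blcr
# mark the job as restartable so Grid Engine can re-run it after a node failure
#$ -r y

module load myapp    # hypothetical application module
./my_program input.dat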

I do NOT know the cause of the heat issue nor the extent of the damage caused to the compute nodes ( if any ). I will look at the nodes on Monday when I get back to work.

Joseph

HPC HEAT Meltdown

10-10-2018 Joseph

The OIT data center where HPC is housed suffered yet another Heat Meltdown starting around 5am on Wednesday the 10th.

The heat Meltdown:

  • Took out ~5 racks on the HPC cluster: 109 compute nodes crashed due to the excessive heat.

  • Destroyed two (2) $10k compute nodes ( compute-4-25 & compute-4-35 ).

  • Hundreds of research jobs were lost.

  • Probably shortened the life-span of the surviving nodes.

Unfortunately this is not the first, second, or even third time this has happened: issues at the data center cause heat to build up quickly to dangerous levels, destroying and melting hardware.

Joseph Farran

HPC Default gcc and OpenMPI

10-1-2018 Joseph

The default gcc version on HPC has been changed from gcc/4.8.2 to

  • gcc/6.4.0

Also OpenMPI has been removed as a default module load. You can still make it available in your account by editing your .bashrc file and adding it back:

  • module load openmpi-3.1.2/gcc-6.4.0

OpenMPI is no longer a default module load. This was done because not everyone uses OpenMPI and the automatic load was causing confusion.
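
For example, the lines below could go near the end of your ~/.bashrc (module names are the ones given in this post; gcc/6.4.0 is already the default, so that load is optional):

# ~/.bashrc additions
module load gcc/6.4.0                  # optional; gcc/6.4.0 is already the default
module load openmpi-3.1.2/gcc-6.4.0    # restore OpenMPI, which is no longer loaded automatically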

HPC Maintenance completed

9-25-2018 7pm RCIC Team

HPC is UP.

Upgrades done:

  • BeeGFS has been upgraded to version 7 on all servers and nodes.

  • RPMs updated for CentOS 6.9

  • Mellanox drivers upgraded to Mellanox OFED version 4.2.1.2.0

  • Re-cabling of Infiniband network.

  • Ubiquiti switch firmware updated, and all switches, including Mellanox, rebooted.

  • OpenMPI version 3.1.2 compiled against new Mellanox drivers. OpenMPI 3.1.2 is now the default on HPC.

  • Other flavors of gcc compiled for the new OpenMPI v3.1.2. If you need a particular flavor of gcc and/or OpenMPI re-compiled, please email hpc-support@uci.edu

  • Older versions of OpenMPI 1.6 and 1.7 removed from modules.

Expect some road-bumps. If you find issues or have problems, please email us at hpc-support@uci.edu

HPC Team

Big Memory Node ( bigmemory queue )

9-12-2018 Joseph

The HPC big-memory node is now accessible only by request and is no longer part of the public queues. Starting today, in order to access the bigmemory queue you will need to be in the "bigmemory" HPC group.

If you need access to the big memory queue, please send a request to hpc-support@uci.edu with an explanation of why you need it and how much memory you will be using.
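
Once you have been added to the group, a minimal sketch of a job script that targets the queue; the queue name is taken from this post, while the memory request and module are illustrative assumptions:

#!/bin/bash
#$ -N bigmem_job
# request the big memory queue (requires membership in the "bigmemory" group)
#$ -q bigmemory
# hypothetical memory request; size it to what you actually need
#$ -l mem_free=512G

module load myapp    # hypothetical application module
./run_large_analysis input.dat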

/ssd-scratch and /fast-scratch

9-12-2018 Joseph

The server that serves /ssd-scratch and /fast-scratch file-systems ( nas-7-1 ) is having hardware issues. It’s over 5 years old. We are evaluating whether to replace it or not.

If /ssd-scratch and/or /fast-scratch are important to your research work and you need them back, please let us know by sending an email to hpc-support@uci.edu.

Some R/3.4.1 package updates issued warnings.

08-08-2018 hjm

When updating large numbers of packages, a few always fail to update due to dependency order or other issues. If you use one of these packages, please contact me <hjmangalam@gmail.com> and I’ll try to resolve the issue:

hdf5r R2jags GroupSeq jqr libsoc RcppParallel V8 Rmpfr units sdols bamboo CePa fst rgdal TTAinterfaceTrendAnalysis WGCNA ggforce wellknown sf PMCMRplus gdalUtils randomcoloR BNSP divest macleish quickPlot rnrfa shallot SiMRiv wikilake ggraph velox ALA4R btb easyDes iemisc reproducible Rfast RSDA tidyRSS TropFishR bibliometrix dtwclust eurostat scholar Seurat SpaDES.tools SpaDES.core SpaDES

Why is my job not running?

7-31-2018 Joseph

A new script has been created to hopefully help you understand why a particular job is not running on the HPC cluster after you submit it.

  • To use it simply enter "why <job-id>"

The Grid Engine report will show the possible reason(s) the job is waiting to run.
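
A quick hedged example of using it (the job ID below is made up):

qstat -u $USER    # list your jobs and their IDs
why 1234567       # ask why a particular waiting job has not started yet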

HPC Scheduler ( Grid Engine )

7-17-2018 Joseph

The HPC Grid Engine scheduler is having issues in which jobs that are supposed to run are not running and jobs that should be suspended are not being suspended.

I was out for 3 weeks due to a medical emergency and I recently got back. I am working on the problem but please note that the scheduler is complex and it will take me some time to figure out what is causing this.

Thank you for your understanding

Joseph

new tools to monitor or profile your jobs

6-24-2018 hjm

To address some of the resource mis-allocation that is preventing efficient use of the HPC cluster, I’ve written some tools that should make it easier for users to figure out what their jobs are actually doing.

  • myjobs
    Type:

    • myjobs help → to get the usage

    • myjobs → to see what it does. You can follow that up with jobs_at compute-x-x to inspect your jobs more closely via top. (Shows how much CPU and RAM YOUR jobs are using; see the example session at the end of this post.)

  • qjobstatus - queries qstat to list all your jobs on all nodes and then cycles thru those nodes, listing the jobs and their run status. If you have no jobs submitted, don’t bother.

  • profilemyjobs - much more data, logged, printed, and plotted if you have an X11 screen available. Needs to be executed on the same machine your job is running on.

  • qbetta - view all Qs, all jobs, all status

Questions to <harry.mangalam@uci.edu>
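
A hedged example session showing how these tools might be combined; the node name is hypothetical and the actual output will look different:

myjobs help              # usage
myjobs                   # CPU and RAM use of YOUR jobs
jobs_at compute-1-14     # closer look (via top) at your jobs on one node; node name is made up
qjobstatus               # run status of all your jobs across all nodes
profilemyjobs            # detailed profiling; run it on the node where your job is executing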

Get your jobs on-CPU ASAP.

6-28-2018 hjm

Unless you have dedicated Qs for your research group, the fastest way to get your jobs assigned a CPU is to use the free and/or pub Qs. Especially if you have short, serial jobs, your jobs will see silicon MUCH faster if you let the scheduler find a slot for you. We recently added several hundred cores to the free Qs so the free Qs should eat your jobs faster than ever. So add the following SGE Q directive to your qsub scripts: (note, no spaces)

#$ -q pub*,free*,otherQs

And don’t forget, we have large, non-64-core machines (i.e. free56i) that will probably run your big jobs faster than the AMD 64-cores. If your code is floating-point-heavy, Intel CPUs will run faster than AMD 64-core Bulldozers (but not the new AMD EPYC).

Also, unless you have verified that a 64-way job will run significantly faster than 2x32core jobs, use 2x32 core jobs, or 4x16 core jobs. Very few apps scale linearly past 16 cores.
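
Putting the advice together, a minimal sketch of a qsub script; the parallel environment name ("openmp") and the module are assumptions and may differ on HPC:

#!/bin/bash
#$ -N my_analysis
# let the scheduler place the job in any pub/free slot (no spaces in the list)
#$ -q pub*,free*
# assumed PE name; request 32 cores rather than a single 64-way job
#$ -pe openmp 32

module load myapp    # hypothetical application module
./my_program --threads 32 input.dat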

/pub switch-over Completed

6-4-2018 Joseph

The switch-over of /pub is now complete. All data from /share/pub was rsync'ed to /dfs3/pub.

The links were also updated on all compute nodes:

/pub -> /share/pub has been removed
/pub -> /dfs3/pub has been created

This means that you can keep on using /pub/$USER as before; there is no need to update your job scripts.

The process took longer than expected because several users were still using the old /share/pub at the time of the switch-over, which slowed things down.
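
For the curious, the switch-over amounts to something like the sketch below (illustrative only; the real process also had to wait for users still holding the old path open):

# copy everything from the old location to the new file-system, preserving ownership and permissions
rsync -a /share/pub/ /dfs3/pub/

# remove the old link and point /pub at the new location
rm /pub
ln -s /dfs3/pub /pub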

/pub Switch-over at Noon on Monday 6/4/18

5-14-2018 Farran

Switch over extended to Monday 6/4/18

The public disk file space /pub is being moved from /pub to /dfs3/pub in order to provide more room for the growing needs of all /pub users.

The move will happen at noon on Monday 6/4/18. Everything is going to be copied over from /pub to /dfs3/pub and a new link will be setup so that /pub will now point to /dfs3/pub.

If you are using the original /pub after the noon hour on 6/4/18, your processes will be killed so that we can setup the new links: /pub → /dfs3/pub.

Please note that jobs using /pub after 6/4/18 at noon will crash or be killed, so plan accordingly.

If this time presents a problem for you, email hpc-support@uci.edu asap.

/pub at 94% capacity

5-7-2018 Farran

The public file server /pub is at 94% capacity. Writes disabled on /pub until it goes down to a safer level of 90% or less.

Please remove or move data to some other location.
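
A few standard commands that may help you see and reduce your own usage (the destination path in the example is hypothetical):

df -h /pub                                   # how full is /pub overall
du -sh /pub/$USER                            # how much of it is yours
du -h --max-depth=1 /pub/$USER | sort -h     # which of your directories are the biggest

# move data you still need elsewhere, then remove the originals
rsync -a /pub/$USER/old_project/ /dfs3/mylab/old_project/    # hypothetical destination
rm -rf /pub/$USER/old_project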

A note will be added here when /pub goes below 90% capacity, and /pub will then be set back to write-enabled.

UPDATE: 5-7-2018 1:30pm Because critical jobs are using /pub and it has come down a little to 92%, /pub is back in write mode for the time being. If /pub starts climbing back toward 100%, writes will be disabled again. So please remove all unwanted data to help out.

/pub at 100% capacity

2-28-2018 Farran

The public file server /pub is at 100% capacity. Writes disabled on /pub until it goes down to a safer level of 90% or less.

Please remove or move data to some other location.

We will be moving /pub to our new /dfs3 on a future date to fix the data crunch on /pub.

A note will be added here when /pub goes below 90% capacity.

UPDATE: 3-1-2018 9:30am /pub back down to a safer level of 87%. /pub back to normal read and write. Thank you for the quick action.

If you received emails about being over-quota on /pub, please ignore them.

Joseph

Update your PUTTY if you use Windows.

Friday, Dec 15th

If you use [MS Windows]…

  • you have our sympathy.

  • if you use Windows < 10, please update your Putty to the latest version. We are changing the ciphers on the ssh servers on HPC and the old versions of Putty do not support the more secure ciphers.

  • If you use Windows 10, you can install OpenSSH natively now, finally. Or update Putty, which should continue to work on Win10.

New LightPath I/O Nodes

4:00pm Friday, 12-15-2017 Farran

Two new I/O nodes have been setup on the LightPath Network ( 192.5.19.x ).

To access the LP nodes, use:

  • qrsh -q ionode-lp

Each LP node has two 10G interfaces bonded for theoretical speeds of 20Gb/s. You should be able to use the LP ionodes for moving data both inside and outside of UCI.
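
A hedged example of a transfer session on an LP ionode (the remote host and paths are made up):

# get an interactive shell on a LightPath I/O node
qrsh -q ionode-lp

# then move data over the bonded 10G interfaces
rsync -av /dfs3/mylab/results/ user@remote.example.edu:/archive/results/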

Update: the LP ionodes are disabled for the time being.

HPC BLCR ( Checkpoint )

1:00pm Friday, 11-17-2017 Farran

An issue was discovered with HPC BLCR Checkpoint: it was causing programs to not complete successfully, forcing some programs to re-start.

The issue was isolated to left over bits from the old 3.10.104 kernel when HPC was upgraded back in August. BLCR was re-compiled from scratch and all nodes updated with correct BLCR bits for current 3.10.107 kernel. Testing shows the issue has been fixed and programs that work with BLCR now complete successfully.

/dfs1 failure

Oct 30, 4:12pm - hjm

The metadata server (dfm-1-1) for /dfs1 locked up at about 4:12pm today for about an hr until it was rebooted. All files being written at that point (on /som, /bio) should be inspected for completeness, altho according to spec, the FS should be able to deal with fairly long timeouts. From the logs, it’s clear that at least some files failed to recover, but it’s also clear that some did recover - it looks like it’s application-specific. It looks like this was a random RAM memory glitch that managed to bypass the ECC circuitry. We’ll be keeping an eye on it.

/pub at 87% capacity

12:00am Friday, 10-13-2017 Farran

/pub is now back down to a safe level of 87%. Thank you for quickly acting on this.

The /pub file system is now back to normal in read & write mode.

If you received emails about being over-quota on /pub, please ignore them.

Joseph

/pub at 99% capacity

11:30pm Thursday, 10-12-2017 Farran

The /pub file-system is now at 99% - No new data will be allowed on /pub until the file-system is below 90%.

This means that running jobs writing to /pub will crash. This was done to prevent /pub from going 100% and causing massive data corruption across the file-system.

When /pub is under 90%, a note will be added here and /pub will be set to read & write again.

Joseph

/pub at 98% capacity

11:45am Thursday, 10-12-2017 Farran

The public file server /pub is at 98% capacity and climbing. Please remove and/or move data OUT of /pub until it reaches a safer level of 90% or less.

If /pub reaches 99%, quotas will be set to zero to prevent /pub from going to 100% and thereby causing possible data corruption.

Joseph

/dfs2/temp

11:45a Friday, 09-29-2017 Farran

The /dfs2 file system reached 99% capacity today and is climbing fast toward 100%.

When a file-system reaches 100%, horrible, very horrible things can happen, like the entire FS going corrupt, which means massive data loss.

As an emergency measure, to get /dfs2 back to a safe capacity level, I removed a lot of files under the temp directory at:

  • /dfs2/temp

If you were using /dfs2/temp, please be aware of this and don’t use the temp directory ( /dfs2/temp ) as it will soon be removed.

Thank you,

Joseph Farran ( while on vacation )

Remaining problems since upgrade.

8:30a Monday, 09-05-2017 hjm

  • Checkpointing with BLCR still not working

  • at least 3 Qs were corrupted and have not been restored

  • Robinhood filescanning scripts and configuration were destroyed and need to be re-installed.

  • Selective Backup is not working (altho access to existing data has been restored - see below).

  • Some Intel 8core nodes are still not mounting /fast-scratch - seems like a network problem.

  • Please inform us of additional problems.

GPU nodes have new Nvidia drivers.

Monday, 09-05-2017 hjm

The Nvidia GPUs have been updated and (after a module load cuda) are responsive to nvidia-smi. We currently have these GPUs on HPC.
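
A quick way to check a GPU node, per this post:

module load cuda
nvidia-smi       # should list the node's GPUs and the driver version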

Selective Backup Server restored

Monday 8-30, 2:45p Francisco

Planb2, the Selective Backup server, has been upgraded and the /sbak filesystem is available from compute-1-13, the interactive node, altho new backup runs are failing, as described in the doc.

Upgrade status, Remaining issues

Monday 8-30, Imam/Francisco/hjm

  • The HPC Cluster is up, but as noted below, there are some remaining issues. This was a significant upgrade (major OS, networking, and software changes, as well as re-racking and re-cabling). We will be continuing this work on an ongoing basis but with much less interruption. Individual racks may be shut down for several hours as we re-cable them for better power resilience and cooling.

  • some nodes lost ZFS software, resulting in the disks being orphaned. The data is still on the disks but they aren’t assembled into valid arrays. Being worked on. This includes:

    • /share/tim2

    • possibly others (let us know if yours is among them)

  • some NFS mounts aren’t automounting correctly for some reason. This includes:

    • /share/chad

    • possibly others (let us know if yours is among them)

  • checkpointing is broken at least for some users. If you’re successfully using the BLCR checkpointing, let us know.

  • because NAS-7-1 seems to have a hardware failure, the services that run thru it will be broken until it’s fixed. Those include Selective Backup, /fast-scratch, /ssd-scratch, the Robinhood filesystem database, and our Trac ticketing & doc system.

  • Retrieval of your Selective Backup files will not be possible until the planb2 server is upgraded (it’s not administratively part of the HPC cluster, so it has to be done manually).

  • there are a few nodes that did not reboot correctly so we have to check them manually. If you’re missing a favorite node, please let us know.

HPC Upgrade (not quite) Completed

Monday 8-28-2017 6pm, Joseph/Imam/Francisco

HPC upgrade completed. The Cluster is UP.

Two new physical racks were installed, and work was done on all of HPC’s main components ( BeeGFS data server, login nodes, head-node, and a new switch ), along with lots and lots of re-cabling. Imam did a great job of nicely re-wiring the two new racks from our previous spaghetti mess of cables.

All nodes were upgraded from CentOS 6.8 to 6.9, BeeGFS updated to latest 6.14 version, Mellanox Infiniband drivers and software upgraded to 4.1-1.0.2.0, and around 500 rpms updated.

The downtime took longer than estimated as we were hit with several surprises and also due to being short on staff.

For now the following will be unavailable/broken until we get around to fixing them:

  • Nas-7-1 has hardware issues

  • /fast-scratch/$USER

  • /ssd-scratch/$USER

  • OpenMPI bits need updating, some OpenMPI modules may not work correctly.

  • GPU nodes need a new NVIDIA driver installed.

  • HPC Selective backup.

  • Intel Software licenses.

  • BeeGFS not mounted on all nodes.

  • We did not get around to converting all nodes.

The issues above are mostly a direct result of the minimal staffing levels we have on HPC for the size and complexity of the cluster. We are operating with approx 1/3rd the needed staff when compared to other clusters of similar size and scope.

As with all upgrades please expect road bumps along the way. Please report issues to hpc-support@uci.edu and we will work on them as staff time permits.

Thank you,

Joseph & HPC Team

HPC Downtime Extended

Sunday 8-27-2017 9pm, Joseph

HPC downtime extended until Monday night 8/28/2017. Too much to do and not enough staff. Hopefully we can finish by Monday; otherwise we will go into Tuesday.

Ether Switch issues

Monday 7-17-2017, Joseph/Imam

One of seven Nortel GigE switches on HPC was replaced and moved to a different location when it started powering itself down due to heat issues.

The problem started during the weekend and several nodes went off-line when the switch went down. Jobs may have been affected when Grid Engine could not communicate with the nodes, so please check your jobs.

All is back to normal now.

HPC Cluster Down-time August 26th 6am until August 27th at 9pm

Tuesday 6-27-2017, Joseph

Important Downtime extended until Sunday 9pm

The entire HPC cluster will be down and unavailable starting Saturday August 26th (8/26/17) at 6am until Sunday August 27th (8/27/17) at 9pm.

Maintenance of the data center electrical generator is scheduled for August 26th, and as a result the entire HPC cluster will need to be taken down.

All jobs will be killed when the nodes are powered off so please plan accordingly.

We are planning to upgrade several parts of HPC during this downtime, such as the OS and kernel, which means that jobs using HPC CheckPoint may not resume when the cluster comes back up. We will make every attempt to resume jobs, BUT past history has shown that big changes to the cluster can confuse BLCR checkpoint, leaving those jobs unable to continue.

If this downtime presents an absolute problem for you, please send us an email asap to hpc-support@uci.edu

/pub at 99% capacity

Friday 6-16-2017 9:30am, Joseph

The /pub file-system is at 99% capacity and climbing. The /pub file-system was placed in No-More-Write mode until /pub goes to 94% or lower. This has been done to prevent possible data corruption if /pub reaches 100%.

Current jobs writing to /pub will error and/or abort.

Once /pub is back to a safe level, a message will be added to the HPC login nodes of the fact.

Grid Engine Change

Tuesday 6-13-2017 Joseph

A change has been made to the SGE "starter method". This is the script that all jobs on HPC go through when they are submitted. The change was made to fix one specific issue. If you notice any strange problems with your jobs that were not present before, please email HPC support asap.

Thank you.

Interactive node / compute-1-13 is UP

Monday 6-5-2017 Joseph

The interactive node ( qrsh ) lost its OS disk. A new disk was installed and the node re-imaged. It is back on-line. If you notice any particular software missing on the interactive node, please email hpc-support.

Interactive node / compute-1-13 is dead

Sunday 05-28-2017 hjm

The interactive node (aka compute-1-13) appears to have died; it keeps power-cycling and therefore is useless as a working node so it’s been taken offline. The Interactive Q has also been disabled. Any jobs running on c-1-13 have died along with it. We’ll replace it with other hardware on Tuesday.

nas-7-1 controller errors requires a restart

5-10-2017, 10am, hjm

10:00am, May 10: nas-7-1 is being taken offline to correct a critical disk controller problem. As far as we know, this problem has not caused any data loss or corruption, but is just causing controller restarts.

If you have jobs running on /fast-scratch or /ssd-scratch, they will fail and need to be re-started after nas-7-1 is restarted.

To test the fix, nas-7-1 will be restarted, but it may be taken down again to completely replace the disk controller if the problem persists. It will probably take about 2 days to tell whether the problem is gone, so you may not want to use the above-mentioned filesystems during that time.

DNS Problems on HPC

5-5-17 9:00pm Joseph / Harry

The campus automated security process took HPC out when it detected a possible intrusion. It is not yet clear what that intrusion actually was, and we are looking into it.

The unfortunate thing is that HPC was removed from the campus DNS service at around 2pm and NOBODY notified us of that change. So we spent numerous hours scratching our heads trying to figure out what was going on when some users could not get in. The DNS service was restored around 8pm after several email exchanges with the security folks.

From around 2pm to 8pm, while DNS was out for HPC, users logging in to HPC with normal passwords were denied. Anyone already on HPC trying to reach the outside was also denied, as DNS name resolution was being blocked.

As of around 8pm today, things are back to normal except for a major migraine of a headache.

/dfs1 Issues

3-16-17 5:40pm Joseph Farran

The /dfs1 file-system is made up of 6 data servers. One of the data servers, dfs-1-6, was experiencing Infiniband connectivity issues.

The dfs-1-6 server was rebooted and corrections were made to the Infiniband configuration. As a result, some jobs that were using dfs-1-6 during that time experienced file I/O errors because they could not get to their data.

The problems have been fixed and /dfs1 is now back to normal operating mode.

Please report issues on /dfs1 if they happen again AFTER the time stamp of this posting.

Thank you,

Joseph

/pub available again in read/write mode

3-16-17 12:30pm Joseph Farran

The /pub file system is back down to a safe level of 81%. Thank you all for removing enough data to bring it down to a safe level.

/pub at 100% capacity

3-15-17 6:15pm Joseph Farran

The public /pub file-system reached 100% capacity. The file-system was placed in a no-more-write status until /pub is down to a safe level of 85% capacity or under.

If we allow /pub to continue at 100% capacity, that can easily cause data corruption, which nobody wants.

Please start deleting / moving all unwanted files OUT of /pub.

A followup note will be added here when /pub is back to a safe level and converted back to read/write mode.

/dfs1 is back up

UPDATE: 8:12a, March 7, 2017 hjm

Joulien was able to restart the server and from a filesystem POV, things seem to be fairly normal. BUT, as noted above, jobs that were writing to or reading from /dfs1 may have crashed or otherwise misbehaved.

Please check your jobs carefully.

/dfs1 is down

4:15a, March 7, 2017 hjm

/dfs1 (includes /bio, /som) is down this morning. One storage server went offline at about 1:44am this morning.

I have shut down the system until we fix the problem. Running jobs that referenced files on /dfs1 will hang and possibly fail when we re-start it. Please check your jobs carefully when the FS comes back.

hjm

HPC Short on Staff

2-23-2017 Joseph Farran

For the next month or longer there may be delays in answering HPC email, as we are short on staff for a minimum of one month.

If you don’t hear from us please send us a nice reminder after a couple of days and be patient.

Thank you for your understanding,

Joseph

Shutdown of chilled water supply to the OIT Data Center on Jan 23rd at 6am

1-10-2017 Joseph Farran

On Jan 23rd OIT/UCI facilities will be performing the final commissioning of the Data Center Control Systems. In order to complete the commissioning process they have to shut down the chilled water supply.

The chilled water supply will be offline from 6am until 4pm on the 23rd (1/23/17). We may have to bring down the HPC cluster at this time if we are not able to maintain the CORE with the backup cooling systems.

Hopefully HPC will NOT need to be shutdown on the 23rd but please plan accordingly.

HPC NOT Available

11-11-2016 Joseph Farran

A huge job array from one user was impacting the /data file server: after several hours the /data server’s I/O stack overflowed, causing the server to crash and taking the entire HPC cluster down with it, as /data is the heart of the cluster.

The user has been notified and the account locked for now to prevent this from happening again over the holiday weekend.

Joseph

HPC Status after the upgrade

11-06-2016 Joseph Farran

The HPC upgrade done on Friday went well for the most part with a few wrinkles. The /dfs2 file system took a while to come up as quotas were enabled.

Not all CheckPointed programs resumed correctly when the cluster came up. It looks like the change in kernel was enough to cause issues with resuming jobs that were checkpointed with the older kernel.

Please note that we have approx 1,500 installed software packages on HPC. We unfortunately do not have enough staff to check them all, so if a particular piece of software that was running before the upgrade is having issues now, let us know at hpc-support@uci.edu

Thank you,

HPC Team

HPC DOWN Friday 11/4/16 from 8am til 6pm

10-31-2016 Joseph Farran

The entire HPC cluster will be down on Friday 11/4/16 from 8am until 6pm in order to correct a serious security issue.

All jobs running with HPC CheckPoint will be check-pointed so that they will resume running when the cluster comes back up. All other running jobs will be killed when the compute nodes are rebooted.

We are hoping the cluster will be up by 6pm on Friday, but please plan accordingly as it may take longer if we run into complications. Sorry for the short notice, but this is an emergency.

Thank you for your understanding.

The HPC Team.

Data Center HVAC Testing Friday 7am

9-28-2016 Joseph Farran

The HVAC systems in the OIT Data Center where HPC is housed will be tested on Friday (9/30/2016) starting at 7am.

Everything is expected to perform without issue; however, in the event we do encounter problems, we may need to shut down the HPC cluster to avoid overheating the data center. So please plan accordingly. Again, we expect everything to go smoothly and HPC should remain running without issues.

/pub full ( 100 % )

9-21-2016 Joseph Farran

The public /pub file-system is at 100% capacity. The /pub FS has been made into a read-only file-system until I can find the time to trim it down and get disk quotas enabled on it.

I will post an update here when /pub is back available in read/write mode.

5:30pm update: The /pub file-system has been re-mounted with disk quotas. Disk quotas have been set low until the file-system goes below 95% capacity.

PLEASE remove/move all unwanted files from /pub that you no longer need. As soon as /pub goes down to 95% or lower, I will reset disk quotas back to normal. Thank you for your cooperation.

12:00pm update: /pub down to 94% capacity. Resetting quotas back to normal. Thank you for removing enough files to bring /pub under 95%.

New Selective Backup Service Now available

9-13-2016 Joseph Farran

For many years HPC has run without any kind of backup service. That has now changed.

We now have a new service called Selective Backup, enabling you to specify what you would like to have automatically backed up on the cluster. Every user on HPC receives 1TB of backup space. For complete details, please see:

We highly recommend that you update your SB backup options to request email notifications. Instructions are included in the link above.

If you have any questions please let us know at hpc-support@uci.edu

/pub UP

8-29-2016 11am Farran/Joulien

It looks like the problem with /pub was bad drives and NOT a bad LSI raid card. The issue keeping the raid card from seeing all drives was isolated to two bad drives. Removing the bad drives allowed the raid card to come up all the way and work.

We are not sure why bad drives would hold the raid card captive; that’s a bad design, but that was the issue.

Please use /pub with caution, as we don’t know whether another drive may go south and again hold the raid card captive, bringing down /pub.

/pub update

Mon Aug 29 11:49:34 PDT 2016 hjm

It looks like the chassis for /pub is having backplane issues. We are in the process of swapping the disks and controller to a spare chassis to see if we can bring up /pub on different hardware while we send the original back to be checked.

While /pub is down, many nodes will have higher than average apparent loads since they’re trying (and retrying) to connect to /pub files. Until we resolve this, that will continue. Don’t use the command df unless you give it a specific filesystem. If you type df -h, your terminal will hang until you kill it unless you immediately put it in the background (df -h &). Also some other filesystem-related commands will be slow or hang, esp if they refer to /pub explicitly or implicitly.
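
To restate the advice as commands (filesystem names taken from elsewhere on this page):

# safe: ask about one specific filesystem only
df -h /dfs2

# risky while /pub is hung: a bare "df -h" touches every mount and will stall;
# if you must run it, background it so your terminal stays usable
df -h &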

/pub down again

8-28-2016 1:00pm Farran

Looks like /pub is out again with similar issues as before. The data server will be down until Monday when we can look at it.

/pub back on-line

8-24-2016 6:15pm Farran

The /pub file-system is now back on-line - no data loss as far as we can tell. The xfs_check completed without any errors.

The LSI MegaRaid 9266-4i card on nas-7-2 was replaced. We took extra precaution checking with the vendor, as previous raid card swaps have not always gone smoothly.

The /pub file-system was mounted without disk quotas because mounting with quotas causes the mount to stall. What this means is that we may still have hardware issues on nas-7-2. So please be aware of this in case nas-7-2 goes out completely in the future. For now /pub responds but hangs at moments and then continues (again, probably hardware issues).

/pub file-system recovered but not available yet.

8-24-2016 11am Farran/Joulien [2:45pm]

Joseph has swapped controllers and has recovered the missing filesystems. He’s currently running the filesystem checks to make sure as much data is recovered as possible. We will post again when /pub and /checkpoint are up again. [hjm]

/pub file-system is still down.

8-24-2016 11am Farran/Joulien [11am]

Still waiting on confirmation of process by LSI technical assistance. Hoping that we can bring /pub back up by the end of today, tho we will have to run some filesystem checks to make sure that the data is intact. Given that /pub is a fairly large FS, this may take until tomorrow. [hjm]

/pub file-system is currently down

8-22-2016 11am Farran/Joulien

The /pub file system started experiencing I/O errors earlier today and /pub was not accessible. The server (nas-7-2) was reset, but it has not come up all the way as the Raid card is currently doing checks and we need to wait for that to complete.

Please note that jobs using /pub will either freeze or crash. We will post an update here as soon as the server is available.

8/23/16 12:30pm Update: As for right now /pub remains unavailable.

Advanced HPC (the vendor) has been contacted. We suspect a bad Raid card but are waiting on an Avago/LSI hardware expert before we proceed, as we don’t want to do something that will jeopardize the 60TB of data on /pub.

Some OpenMPI apps broken

8-18-2016 Farran

Some flavors of OpenMPI apps are broken, causing jobs to start but then go into a 100% CPU state, effectively locking up the program.

Something similar happened last year, when new Mellanox drivers caused jobs not to hang but to run very slowly. We are severely short on staff here, so this may take longer than usual to track down and fix.

Joseph

GE commlib errors and freezes

7-7-2016 Farran

Grid Engine has been intermittently reporting "commlib error: (Connection refused)" errors since yesterday, along with freezes and users not being able to log in.

We traced the issue to the probable cause being the /data server and/or its connections. When the /data server stops, HPC basically freezes, as it is the heart of the cluster.

Today around 2:30pm Ethernet cables were replaced on the /data server. We are also using different ports for /data on the Nortel switch in case it was a bad port. The Nortel switch was also rebooted and other housekeeping things done.

System pings were showing around 12% packet loss originally and now we are getting 0% packet loss after the changes - so this is good news. Time will tell if the problem has been fixed or not.

HPC Grid Engine is Down

7-6-2016 Farran

HPC Grid Engine ( scheduler ) database is corrupt.

Commands such as qstat, qsub, etc and all GE commands will not work while GE is down.

We are working to see if we can remove enough bad jobs to allow GE to start but we may need to reset the db.

Update: 3:30pm

Grid Engine database was corrupted enough that it had to be re-initialized. This means that all jobs that were running and waiting to run jobs were lost.

The Grid Engine database was moved to a new location under the ZFS file system so that if this happens again, we may be able to revert to a previous copy of the db instead of starting from scratch.

/dfs1 Update

6-27-2016 Farran/Harry/Joulien

The first Nytro card replacement went smooth as silk; however, the second and last Nytro card went out with fireworks.

The raid card technician was on-site and after several hours of near catastrophes, the tech was able to bring back a missing volume on the last Nytro card thus avoiding a total and complete meltdown (data loss) of /dfs1/.

Many thanks to Avago/Broadcom/Advanced HPC for having the tech on-site as we would NOT have been able to have recovered from this by ourselves.

/pub not accessible

6-20-2016 Farran

Due to a glitch after the HPC upgrade, the /pub file system was not accessible on nodes that do not have an Infiniband connection, such as those in the pub8i queue. The problem has been fixed.

/dfs1 ( /som & /bio ) Offline 6/27/16 from 10am til 2pm

6-16-2016 Farran/Harry

The HPC distributed file-system /dfs1 ( /dfs1/som & /dfs1/bio ) will be unavailable Monday 27th from 10am til 2pm in order to try again the replacement of the problematic Nytro raid cards.

The technician from Avago/Broadcom will be on-site at our data center to help us with the Nytro raid cards replacement as it did not go well during Tuesday HPC maintenance.

The cluster will remain UP, however /dfs1 will be taken off-line. This means that jobs using /dfs1 will die/crash or the nodes may lock up. You can help by NOT using /dfs1 on Monday the 27th and stopping all of your jobs that use /dfs1.

Warning Users with files in /som & /bio, PLEASE back up all critical data in the event that the Nytro cards decide to go out with a blast and take the data with them.

HPC Cluster Maintenance Done.

6-14-2016 Farran/Harry/Garr/Joulien/Edward

The HPC cluster maintenance is complete, however, we will need to schedule another downtime in the near future - read on please.

Over 200+ nodes & data servers were updated to the latest CentOS 6.8 release. We are NOT yet ready for CentOS 7, so CentOS 6 for now.

Each node received over 1,400 updated packages, a new kernel, the latest Mellanox drivers with speedup bits, BLCR, and the latest BeeGFS file-system, which now includes disk quotas.

The replacement of the problematic Nytro Raid cards did NOT go very well, however. The first replacement went smooth; the 2nd Nytro card replacement went south. The new raid card did not recognize the volumes on one of the 4 distributed file-servers. After several failed attempts, the decision was made to leave the old Nytro cards in place for now.

We are going to get the Avago/LSI/Broadcom expert back on-site to help us replace the remaining 3 Nytro cards, as we don’t want to risk data loss / corruption on such a huge file-system. This means that HPC will need to be taken down again in the near future, so this is a heads-up.

Warning As soon as we can set a date with the Avago expert, we will schedule another HPC downtime and announce it here.

Since we have new Mellanox drivers, we will be re-compiling all OpenMPI flavors against the new Mellanox bits in the days to come.

If you notice issues, please email us at hpc-support@uci.edu or better yet use the new handy-dandy mayday script on HPC.

Thank you for your patience,

The HPC Elves.

HPC Cluster Down Time: Tuesday June 14th 9am - 6pm.

4-28-2016 Farran

The entire HPC cluster will be down on Tuesday June the 14th, from 9am til 6pm in order to replace the 4 remaining LSI Nytro Raid cards that have caused data corruption in the past.

The /checkpoint file-system is also going to be re-located as the current server for /pub & /checkpoint is over-loaded and cannot keep up.

Warning Replacing Raid cards is a risky maneuver. Although we will take every possible precaution, there is no guarantee things will go smoothly, so PLEASE BACK UP all of your important data outside of HPC in case things go bad.

If this downtime presents a problem for you, please notify the HPC support group ASAP at hpc-support@uci.edu

Module abyss/1.9.0 installed

June 6, 2016 Garr and Joulien

Installed the newest ABySS, version 1.9.0. Load it with:

module load abyss/1.9.0

/dfs1 metadata server crash

June 2, 2016 hjm

The metadata server for /dfs1 (/bio & /som) crashed at about 10:15pm on Thursday, June 2nd. Since it was the brains of the system, everything else stopped as well, so while any data in flight was probably lost, data on disk was probably safe. There’s no indication yet of what happened, but please check your data on /bio and /som to see if everything that’s supposed to be there is still there. Another reason to calculate checksums and do frequent backups. Let us know if you’ve suffered catastrophic data loss. We probably won’t be able to recover it, but we’ll commiserate.

/pub & /checkpoint back on-line

4:00pm 04-27-2016 Farran/Harry

The Avago ( LSI ) tech was at the data center this morning to supervise the removal of the problematic LSI Nytro card with the replacement of a new LSI Raid card ( non-Nytro ).

Good news in that all went well. The new Raid card was able to pick up all drives and see all volumes and both file-systems: /pub & /checkpoint. The downtime took longer than expected due to the /pub file-system having to recheck quotas.

Since the nas-7-2 data server is overloaded serving both /pub & /checkpoint, we are looking at moving /checkpoint to a different server in the future. We are also going to plan an HPC shutdown in order to replace the remaining 4 Nytro Raid controllers before they cause more havoc.

The HPC shutdown will be announced here once we have a date set.

/pub, /checkpoint failure

11:00am 04-22-2016 hjm

Warning nas-7-2, which provides /pub and /checkpoint, has crashed. We are trying to figure out what the problem is, but it will likely be hours before it comes back on-line. If your jobs are running with checkpointing, they will likely fail.

Glibc Exploit

12-18-2016 Farran

A note that we are rebooting nodes on HPC in order to fix the glibc exploit that was recently discovered. All critical nodes have been fixed & rebooted. A few nodes remain that have jobs on them.

If you are a node owner with currently running jobs that can be restarted, please consider stopping your jobs so that your nodes can be updated. The process takes 5 minutes max and you can then resume.

Thank you,

Joseph

HPC major power outage

Posted: 3:40pm, Friday 01/29/2016 - Harry & Edward in DC

We had an unexpected power outage in our data center where 2 PDUs went out. Some services will be interrupted while we figure out the problem.

4:40pm update: We have most parts of the service recovered.

HPC head node temporarily down

Posted: 10:25am, Tuesday 01/05/2016 - Edward

We’re currently experiencing an outage. The HPC head node is down.

11:00am update: The head node is back up.

2015 and earlier status reports