Reduce Duplication of Code

Dec-28-2015 - hjm

Please do not duplicate the applications that we provide in the module system unless you really, REALLY need to. Many people are downloading and compiling applications with huge file counts (boost, visit, python, R, matplotlib, RepeatMasker, Trimmomatic, etc.) that are already provided in our module system. Please check what’s available in our module system BEFORE downloading and compiling your own.

Run 'module avail' to see what's available.

If you need an update or a particular version, please check with us first.
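
For example, a quick way to check (the package name and version below are only illustrative; use whatever 'module avail' actually lists):

  module avail               # list everything provided in the module system
  module avail samtools      # narrow the listing to one package (example name)
  module load samtools/1.1   # load a specific version shown by 'module avail'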

Another Public Q opens up: pub8i

Tuesday, Dec 22, 2015 - hjm

 +------------------+-------+-------+--------+---------+-----------+-----------+
 |        Queue     | Wall- | Cores |        |         |           | Available |
 |        Name      | Clock | Per   | Nodes  |  Cores  | Available | WHOLE     |
 | (S)=Suspend-able | Limit | Node  | Total  |  Total  | Cores     | Nodes     |
 +------------------+-------+-------+--------+---------+-----------+-----------+
 |      pub8i       | 72:00 |   8   |   20   |   160   |    160    |    20     |
 +------------------+-------+-------+--------+---------+-----------+-----------+

Thanks to an anonymous donor & Joseph’s device driver wrangling, we have another 160 Intel cores to dedicate to the public queues. This Q is NOT suspendable - your jobs will go to completion. If this Q is used to capacity, we have another 6 racks of similar nodes. These nodes do not have Infiniband, so data IO will be slower than on the other compute nodes, but the computation will be just as fast, or faster.
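
As a rough sketch, a job script targeting the new queue might look like this (the job name, parallel environment name, core count, and program are examples only; adjust them for your own work):

  #!/bin/bash
  #$ -N my_pub8i_job        # example job name
  #$ -q pub8i               # send the job to the new non-suspendable queue
  #$ -pe openmp 8           # hypothetical PE name; request all 8 cores of one node
  ./my_program              # replace with your own executable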

Wanted: Full Time Linux Wizard.

Dec 17, 2015 - hjm

We have been awarded an NSF grant to hire a Full Time Employee to assist with various projects related to Research Computing. The NSF grant proposal is described here: http://moo.nac.uci.edu/~hjm/nsf-cie/NSFCIE-job-recruitment.html

The person we hire would probably not be fulfilling this work solo. Instead, the hire would assist the current HPC staff and relieve time pressure so that the senior staff can take on this work. However, we would expect the hire to contribute to and participate in the work described above and eventually take the lead on one or more of those projects.

You would be a good fit if:

  • you were completing a degree here that required you to use HPC or some other computational resources heavily

  • you needed or wanted to remain at UCI for at least 3 years.

  • you are interested in the computational aspects of your own and other fields, and you want to learn more about them and make them more efficient, especially in the area of storage and filesystems.

  • see the other requirements and preferences described in the link above.

For more info, please contact hpc-support@uci.edu. If you want to apply for the position, the official application is at: https://goo.gl/tDF1Yk

2015 Winter Break

12-17-2015 Farran

A note that most of the HPC staff will be on vacation from 12/23/2015 until the beginning of the new year. If you send email to HPC support during this period, it will most likely not be answered until everyone is back.

For HPC emergencies (cluster down, etc), please report the problem to the OIT help desk at:

  • x42222 (949) 824-2222

Best wishes,

The HPC Elves

UCI Research Cyber-Infrastructure Symposium

Posted: 12/6/15 - by Farran for Allen Schiano

The UCI Research Cyber-Infrastructure Symposium is January 27th in the CalIT2 Auditorium from 9am to 3:30pm.

Everyone is invited. More info to follow…

AMD Roadmap meeting 12/3/15, free lunch [ LOCATION CHANGE ]

Posted: 11/30/15 - Farran

In order to accommodate the roughly 20 RSVPs, we had to move the AMD Roadmap meeting to a new location:

Important NEW LOCATION: AIRB ( Anteater Instruction & Research Building ), Room # 1020, Campus building 653.

Same date and time: Thursday, 3 Dec 2015, 11:30 AM.

UCI authentication server problems

Posted: 10:25am, Saturday 11/28/15 - Garr

The campus kerberos server was on the blink this morning, causing authentication problems for all campus services that depend on it, including email, the campus VPN, and our HPC login.

See the OIT status page for updates:

11:00am edit: UCI’s authentication server is working again.

AMD Roadmap meeting 12/3/15, free lunch

Posted: Monday, November 2nd, 2015, Updated 11/30/15 - Farran

Note LOCATION HAS BEEN MOVED. See 11/30/15 Entry

Vendor Advanced HPC, together with AMD, will be at UCI to give a sneak peek into the AMD Opteron roadmap. Meeting date and location:

Open to all at UCI. Free lunch provided courtesy of Advanced HPC and AMD. If attending, please reserve a spot by sending an email to:

So that we can gather a head count.

HPC web server auto-redirecting to https mode

Thursday, Oct 1, 2015 2:30pm - Garr

Starting today, almost all web access to http://hpc.oit.uci.edu is redirected to "https" secure mode. If this breaks anyone’s HPC web pages, let us know at <hpc-support@uci.edu> and I’ll exempt your page from being automatically redirected.

HPC Up again after TWO Data Center Power interruptions

Friday, Sept 12, 2015 5pm - jf & hjm

The entire HPC cluster was taken down on Friday 9/11/15 at 7pm due to power issues at the data center. We were given only hours of notification about this outage.

On Saturday 9/12/15 around 9am, HPC was brought back up after the data center issues were fixed; however, due to more data center power problems, additional repairs were needed and we were again given only a few hours of notification. So HPC was taken down again on 9/12/15 around 11am.

As of Saturday 9/12/15 at 5pm, HPC is up again ( cross your fingers ).

Note: We have been extra careful with HPC whenever electrical work is done at the data center BECAUSE, more than once, power has been CUT to critical HPC data servers, causing a lot of harm to the hardware and grief for us mortal sys admins. We now opt on the side of caution and take HPC down any time electrical work is being done at the data center.

Update on /dfs1 & Data Center Power interruptions

Friday, Sept 11, 2015 - hjm & jf

Good news:

In a startling turn of events, we’ve managed to bring back the failed dfs-1-2 storage server and re-add it to the BeeGFS distributed filesystem, and /dfs1 seems to be largely intact. That means that, other than the few thousand files (and pieces of files) that were lost in the controller failure, everything seems to be re-mapped and intact, tho we’re still going thru some tests.

The /dfs1 filesystem is still READ-ONLY for now, until we assure ourselves of its contents and stability, but if the current status holds, and the contents are in fact there, we’ll open it as WRITEABLE over the weekend.

Please check your files on /dfs1 (/bio, /som) to see if your files are intact and let us know the result - if previously unreadable files have now become readable.

Bad News:

The power glitch that caused so many headaches on the 9th may come back tonight because of the need to switch back from commercial power to UPS power. The whole point of having backup power is that we should be able to switch back and forth seamlessly, but that certainly was not the case on the 9th.

So, there is a strong probability that we will have to perform a shutdown of the entire cluster tonight and not start up again until Saturday morning or even later. Watch this space for updates. If this space no longer exists, HPC is offline.

Multiple Power Outages at the Data Center.

Sept 09, 2015 - hjm

At about 4:30 this morning, there was a major power outage that took out most of the machines in the OIT Data Center. Just as we were restoring them, at about 8:20a, another power surge hit and rebooted or hung many of the critical storage and login nodes.

We have restored both /dfs1 (still read-only) and /dfs2 with apparently no more loss of files, but please let us know if data has been lost.

Because of the multiple power hits and restarts, you should very carefully evaluate results from jobs that were running. In particular, /dfs2 was subject to multiple stops and starts, and data in flight may well have been lost or corrupted.

/dfs1 crash - update 2

9/08/2015 4:00p -hjm

We have recovered the 2 lost arrays, but in the process our spare controller has failed [&!%#&], with the result that we now have only the previously failed controller (which seems to have rejuvenated itself into temporary operability). We do not trust this controller and are waiting for a replacement from the vendor, which should arrive by Thursday. At that time, we’ll try to bring the whole filesystem back up.

Note that in the filesystem recovery, it looks like about 4000 files or parts of files were lost (the controller inexplicably zeroed the first few sectors of the arrays, but left the rest intact). However, this is preferable to the ~2M files that would have been lost otherwise.

Please stay with us; the light we see at the end of the tunnel is probably not an oncoming train anymore.

/dfs1 crash - update 1

9/04/2015 3:00p -hjm

About a week ago, /dfs1 crashed due to a failed storage server. There is now reason to believe that we can restore the crashed server once the arrays re-verify (in progress). The current /dfs1, in READ-ONLY state, will remain available with many missing files, but once the arrays can be added back to /dfs1, most of those files should re-appear. We are waiting for more information from the BeeGFS people in Germany, but due to the time difference we will probably not hear from them until Monday. However, we are more optimistic about the data recovery than we have been for a week.

/dfs1 crash

08/21/2015 4pm - hjm

/dfs1 (including /bio, /som) has crashed

/dfs1 is made up of 5 storage servers and a metadata server. One of the storage servers had a disk controller hardware malfunction that caused its 2 disk arrays to become inaccessible, and when I replaced the controller, it automatically started to re-initialize the arrays rather than verifying them, causing the data in both arrays to be overwritten. This is not what is supposed to happen; in fact, we verified a hot controller replacement during the last major cluster downtime to assure ourselves that this was possible.

Since /dfs1 (and /dfs2, still intact) are both distributed filesystems, all the file information needs to be available in order to read and write the data correctly. The loss of these 2 arrays has almost certainly caused the loss of all the data on this filesystem (which includes both /som and /bio). There is a small hope that some small files can be recovered, but unless you had created MD5 checksums of your files, you will not be able to tell whether the files are intact. Large files will almost certainly not be recoverable, since they are striped across multiple arrays for performance. We are contacting the vendors of both the filesystem and the disk controller to see if some data can be recovered, but we are not optimistic.
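
As a minimal sketch of the kind of checksumming that makes later verification possible (paths are examples only):

  # create checksums for every file under a project directory
  cd /dfs1/mylab/myproject                     # example path
  find . -type f -exec md5sum {} \; > ~/myproject.md5

  # later, verify the files against the saved list and show only failures
  cd /dfs1/mylab/myproject
  md5sum -c ~/myproject.md5 | grep -v ': OK$'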

We will not be restarting the remainder of /dfs1 until we hear back from the respective vendors, probably next week.

In case anyone thinks we keep secret backups, I’m afraid that we do not.

Harry, for the rest of the HPC team. <harry.mangalam@uci.edu>

short /dfs1 outage

08/03/2015 - hjm

At about 4:23a today, dfs-1-1, one of the 5 storage servers that make up the /dfs1 BeeGFS distributed filesystem, stopped working. The console was blank and there were no warning messages in the system and daemon logs as far as we can tell; just a hard lockup - a rare but not unheard-of crash. We are checking into heat issues as well.

Rebooting the storage server and restarting the server daemon seems to have fixed the failure and it appears that files stored on /dfs1 are intact as far as we can tell right now. The RAID controller reports that the RAID6 arrays hosted on this server are fine.

Files that were being written by batch jobs after 4:23am may not have made it to disk safely, so please check your recent files if you were reading or writing files on /dfs1. /dfs1 stores files only for /bio & /som.

Dirs hosted on /dfs2 were unaffected:

/dfs2/atlas
/dfs2/cbcl
/dfs2/dabdub
/dfs2/drg
/dfs2/edu
/dfs2/elread
/dfs2/malek
/dfs2/tw

The /dfs1 filesystem is back online. Please check your files to see that they are intact (we think they are), but please let this be another warning to BACK UP YOUR FILES ELSEWHERE.

nas-7-2 /pub ( BACKUP YOUR DATA )

07-23-2015 Farran

The raid server for /pub & /checkpoint, called nas-7-2, is having hardware issues. Two drive bays are not being recognized by the hardware, so we are currently running with NO hot-spare drives on this unit.

Although nas-7-2 is not currently in degraded mode, we fear that the hardware may be unstable, so you will want to back up the files you have in /pub asap.

We are going to plan a downtime for nas-7-2 sometime next month, and a note will be added here when that will happen; for now we don’t want to touch nas-7-2 as it may make things worse. Again, start backing up any data you have in /pub.
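
One possible way to pull a copy of your /pub data to another machine you control (the hostname, account, and paths below are examples only):

  # run this from your own desktop or lab server, not from an HPC node
  rsync -av yourUCInetID@hpc.oit.uci.edu:/pub/yourUCInetID/  /local/backup/pub/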

Jupyter installed

07-01-2015 hjm

Some of you may be familiar with IPython notebooks, a specialized Python shell environment that can be used to write and debug code, run existing code (including parallel operations), as well as provide interactive graphics. It is quite similar to the Mathematica Notebook approach and is becoming more popular as a way of distributing research code and data. See:

IPython is available with all of our python modules.

IPython is in the (slow) process of being superseded by Jupyter, which is a Python3-based Notebook that can be run as a webserver system, where the Notebook is a Javascript-based page that submits Notebook processes to a compute server. These Notebooks can be in Python, R, Julia, and a few other languages and can, thru the magic of Javascript, produce interactive graphics from the Python matplotlib and other such packages.

You can try an online version of the Jupyter Notebooks or a local version on the interactive node by doing the following:

From an HPC Login node set up to tunnel X11 to an appropriate client (native X11 client or better, x2go)

  • $ qrsh -q interactive

  • $ firefox http://localhost:8000 &

  • if it’s the first time you do this, you will be prompted to log in with your HPC credentials

  • you will be presented with a directory listing of your HOME dir.

  • on the far right, click New → Notebooks → Python3 to start a new notebook.

  • we only have the Python notebook now. Will add others if demand increases.

commondata deleted

6-10-2015 hjm

I mistakenly deleted the common data in /pub/share/ (symlinked to /data/apps/commondata). Some of it has been restored, but if you need those data sets (or other common data sets), please let me know so I can re-download it. Harry

/data file server

6-9-2015 Farran / Garr

As scheduled, /data ( nas-7-7 ) was taken down at noon and the Mellanox Infiniband card was replaced. The card was also moved to a more appropriate slot.

Time will tell whether this helps with the ib0 issues on /data. We need to rule out hardware issues first.

/data file server

6-3-2015 Farran

The /data file server became mostly unresponsive around 3pm and was rebooted. All appears to be well again. We are keeping a close eye or two on it since this also happened last week.

Update 6-8-15 3:30pm: The /data file server experienced another ib0 hardware issue. /data will be down tomorrow, 6-9-15, starting at noon for about half an hour in order to swap out the Infiniband card.

HPC login nodes problematic

5-28-2015 Farran/Edward/Garr

The connection to one of the data servers is having problems, which is hindering connections to the login nodes. Joseph is working in the Data Center to fix the problem.

Update: 4:38pm HPC back to normal.

Grid Engine extremely slow and sluggish

5-21-2015 Farran

Grid Engine is running extremely slow and sluggish. It is not down, just very slow. We are looking into this.

Note: A number of jobs that were waiting to run with no shell or queue have been removed in the process of trying to isolate the offending problem. So if you don’t see your job, this is why.

Update: 6:20pm GE still running slow. Will continue later - I need a break and some food.

Update: 11:55pm Issue still continues. Tired for now, going to sleep; will continue tomorrow.

Update: 5/22/15 11:30pm Grid Engine Fixed. You should now be able to get your node in a timely fashion with "qrsh" as before.

Intel Parallel Studio XE software suite

5-15-2015 Farran

Due to Prof. Aparna Chandramowlishwaran teaching HPC concepts on the cluster, we were able to request and were approved for the entire Intel Parallel Studio XE software suite on HPC.

To load the software, use:

  • module load intel-parallel-studio-xe/15.0.3

The software includes:

  • Intel C++ Compiler

  • Intel Fortran Compiler

  • Intel Threading Building Blocks (C++ only)

  • Intel Integrated Performance Primitives (C++ only)

  • Intel Math Kernel Library

  • Intel Cilk™ Plus (C++ only)

  • Intel OpenMP

  • Rogue Wave IMSL Library (Fortran only)

  • Intel Advisor XE

  • Intel Inspector XE

  • Intel VTune™ Amplifier XE

  • Intel MPI Library

  • Intel Trace Analyzer and Collector

Please note that there are restrictions on using the product; they are explained when you load the module.
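
As a minimal sketch of using the suite once the module is loaded (the source file names and flags are examples only):

  module load intel-parallel-studio-xe/15.0.3
  icc   -O2 -xHost mycode.c     -o mycode     # Intel C++ compiler
  ifort -O2 -mkl   mysolver.f90 -o mysolver   # Intel Fortran compiler, linked against MKL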

Many thanks to Prof. Aparna Chandramowlishwaran for making this possible!

New BUG discovered with BLCR and BeeGFS

5-7-2015 Farran

A new bug has been discovered when using HPC Checkpoint BLCR while running from a BeeGFS filesystem. The issue is causing jobs to abort/core dump and the aborts are not consistent. The same job will run successfully sometimes while core dumping other times.

If you are running from BeeGFS ( /dfs1, /dfs2, /fast-scratch ), please do not use BLCR. Make sure your script does not include:

  • #$ -ckpt blcr

Note: last year we reported a bug in which BLCR running under BeeGFS cannot save the context checkpoint files. We have an open ticket with BeeGFS support and are still waiting on a fix for this.

This new issue is different: it causes jobs to abort at random times, so it’s critical that you do not use BLCR Checkpoint when running from BeeGFS unless you enjoy headaches.

BLCR Checkpoint is working just fine from NFS filesystems like /pub/$USER, however.
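
If you do want checkpointing, a script along these lines, run from an NFS area rather than BeeGFS, should be safe (the queue, directory, and program names are illustrative only):

  #!/bin/bash
  #$ -q free64              # example queue
  #$ -ckpt blcr             # BLCR checkpointing is fine from NFS filesystems
  cd /pub/$USER/myrun       # example working dir on NFS; NOT /dfs1, /dfs2 or /fast-scratch
  ./my_program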

We will be contacting BeeGFS support soon to report this new bug, and now that we have paid support from BeeGFS, we are going to press them for a fix for both bugs.

Clean up your Trinity runs

4-29-2015 hjm

Trinity Users: In the Chrysalis stage of the Trinity analysis, there are typically hundreds of thousands of ZOTfiles created. Once you have run the butterfly commands on the Chrysalis files, PLEASE delete or tarchive the chrysalis dir. If you don’t know how to do this, ask us. Tarchiving the chrysalis dir reduces those ~100K files to 1 and gives a ~7X reduction in size. You can run this as a qsub job, using this script as a template
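
A rough sketch of such a cleanup job (the queue, paths, and tar options are examples only, not the official template):

  #!/bin/bash
  #$ -N tarchive-chrysalis              # example job name
  #$ -q free64                          # example queue
  cd /dfs1/mylab/trinity_out            # example path to your Trinity output dir
  # roll the huge chrysalis tree into one compressed archive, then remove the original
  tar czf chrysalis.tar.gz chrysalis && rm -rf chrysalis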

MPI re-done again on HPC

4-21-2015 Farran

It was recently discovered that some of the re-compilation of OpenMPI 1.8.3 done on 4/16/15 had some mxm bits left over, causing MPI jobs to run slow and/or stall as before.

OpenMPI 1.8.3 for the following modules was wiped clean and re-compiled to make sure all references to mxm were removed:

  • openmpi-1.8.3/gcc-4.8.2

  • openmpi-1.8.3/gcc-4.8.3

  • openmpi-1.8.3/intel-12.0.5

  • openmpi-1.8.3/pgi-14.10

Please let us know if you experience any issues with the new modules.

kdirstat visual file browser

4-20-2015 Edward & hjm

Edward has managed to coerce kdirstat to compile and it’s now available on both login nodes. You will need an X11 app (Xquartz on OSX or x2go on Windows). cd to your home dir and type:

module load kdirstat; kdirstat &

then choose your starting dir. BE CAREFUL to choose only subdirs of your HOME dir or subsets of your data dirs. kdirstat runs a complete recursive descent of the dir it starts at. You can use it to ID gigantic files or ZOTfiles, and can sort on size, # of files, age of files, etc. Use it (carefully) to clean up your data.

Intel Xeon Eval server available

04-16-2015 hjm

Some good news, finally. We’ve gotten an eval dual-socket server from Intel with their latest high-end Xeon CPU (E5-2699v3 @ 2.30GHz). It has 36 physical CPUs, but represents itself as having 72 with Hyperthreading. While it currently only has 32GB RAM, Intel is going to supply us with at least 256GB for testing. It also has a Phi coprocessor card and a 2TB PCIe SSD, which should make high IOPS programs very fast once it’s configured correctly. We will be installing the Intel compiler suite on it for testing purposes and to allow compilation for the Phi.

Please try it out and let us know how your programs perform relative to the AMD CPUs, especially on floating-point-heavy code. The Q name is pub72phi, which is now open to the public.

/pub & /checkpoint problems

4-16-2015 Garr / Farran

Nas-7-2, which is the /pub & /checkpoint file server, started having issues again on HPC around 3pm. The logs are recording weird Infiniband card errors.

The infiniband card was replaced today at around 4:00pm. Time will tell if this resolves the issue. For now /pub & /checkpoint are back in production. Also, a failed drive was replaced on nas-7-2.

MPI on HPC

4-16-2015 Farran

As mentioned earlier, a problem surfaced after the HPC upgrade in which MPI jobs were running extremely slow with some MPI programs hanging.

Ironically, the issue was traced to the Mellanox mxm speedup function:

We have an open ticket with Mellanox to resolve this issue, but in the meantime the following OpenMPI modules have been re-compiled ( fixed ) on HPC without the mxm bits:

  • openmpi-1.8.3/gcc-4.8.2

  • openmpi-1.8.3/gcc-4.8.3

  • openmpi-1.8.3/intel-12.0.5

  • openmpi-1.8.3/pgi-14.10

If you need another flavor of OpenMPI re-compiled without mxm, please send email to hpc-support@uci.edu

Problems on /pub seem to be resolved

04-15-2015 9:00a hjm

The /pub filesystem was checked and a flaky disk replaced, and the RAID rebuilt. Hopefully this has addressed the timeout problem that happened yesterday. It looks like all mounts re-attached, but check your jobs to make sure they’re still behaving.

More problems with /pub

04-14-2015 3:35p hjm

We are going to have to reboot nas-7-2 to force an unmount of /pub. If your job references /pub, it will hang and possibly have to be re-started. If your jobs are being checkpointed, any checkpoints will hang as well, but they should resume after the reboot.

/pub problems

04-14-2015 3:20p hjm

We have a problem on the /pub filesystem and have unmounted it to run some diagnostics and repairs. While it is offline, please do not try to read or write from it.

Infinite Backup

04-10-2015 hjm

Yes, as infinite as Google can make it. Read more here:

  1. and it’s actually not 10TB, it’s UNLIMITED (quoting Google).

/pub & /checkpoint Data Server

4-8-2015 Farran

A note that nas-7-2, the data server for /pub & /checkpoint, has recorded two errors on its ib0 port ( Infiniband ), which causes some nodes to be unable to reach /pub or /checkpoint.

A reboot of nas-7-2 has fixed the issue, so this is a heads-up note that we may have hardware issues on nas-7-2 or Mellanox driver issues.

MPI NOT Running correctly

4-7-2015 Farran

A note that MPI jobs are running pathetically slow on HPC. The problem appears to have started after the 3-30-15 HPC maintenance.

Several things changed during the maintenance, including the kernel and the Mellanox drivers that MPI jobs use for communication. With our limited staff, it will take some time to isolate the issue.

HPC Linuxbrew is now available

3-31-2015 Edward

HPC Linuxbrew is ready for beta testing.

Linuxbrew is a software package manager for Linux.

There are two ways of trying it:

  1. Simply "module load brew" to load everything in Linuxbrew; then running "brew list" will tell you what packages are available.

  2. Running "module avail brew" will show you a list of packages which have been ported to HPC modules. Use "module load" to load a single package.

Almost everything in brew is the latest stable version, so we don’t bother putting the version number in the modules. If you want to know which version is installed, first load brew/brew, then run "brew info package_name".
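
For example (the package name below is only illustrative):

  module load brew          # make everything in Linuxbrew visible
  brew list                 # show the packages that are available
  brew info wget            # example package; prints the installed version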

The following compilers are available through the "module" command:

  • brew/gcc48

  • brew/gcc49

  • brew/mpich-gcc48

  • brew/mpich-gcc49

  • brew/open-mpi-gcc48

  • brew/open-mpi-gcc49

When using brewed compilers, you can mix brewed modules with other HPC modules. However, using brewed packages with non-brewed compilers may need additional tweaks.

Please be aware that this is in beta testing, but feel free to give it a try.

Any feedback would be appreciated.

HPC Maintenance Completed

3-30-2015 Farran/Harry/Garr/Edward

HPC maintenance completed.

Changes/updates were done cluster-wide on all HPC compute & data servers:

  • Physical move of compute & data servers in order to house new spare chassis.

  • Two new spare chassis set up to act as emergency units in the event that the /dfs1 or /dfs2 file-system experiences a chassis failure.

  • New spare chassis tested by physically swapping all disks from a production-unit to the spare-unit. The raid-set data survived.

  • Update Grid Engine to version 8.1.8.

  • Update kernel to 3.10.72-1.el6.elrepo.x86_64.

  • Update CentOS 6.6 rpms (yum update) as of 3/23/15 HPC mirror.

  • Update Mellanox Drivers and Firmware to version 2.4-1.0.0.

  • Update BeeGFS from version R9 to R14. Lock issue/bug has been fixed.

  • Update all Raid cards firmware.

  • GPU nodes updated with latest cuda version 7.0.28.

    • New cuda module created "module load cuda/7.0"

  • New /dfs2 file-system went on-line and it is now in production mode.

  • The following groups have been moved from /dfs1 to /dfs2 (please update your job scripts accordingly, altho the symlinks should keep them working for now):

    • /dfs2/atlas

    • /dfs2/cbcl[*]

    • /dfs2/dabdub

    • /dfs2/drg

    • /dfs2/edu

    • /dfs2/elread[*]

    • /dfs2/tw

[*] These 2 dirs had a subdir with so many ZOTfiles that they were impossible to move in the time allotted. The offending subdirs have been moved out of the parent dir trees and are still residing on /dfs1 as

  • /dfs1/cbcl/shared → /dfs1/ZOT/cbcl,shared

  • /dfs1/elread/rxn-share → /dfs1/ZOT/elread,rxn-share (mostly resolved as of Monday, March 30, 2015).

Please do not use these dirs until we can resolve these issues.

Lots of changes were made, so expect some road speed bumps. If you find issues, please report them to <hpc-support@uci.edu> ASAP.

We STILL need you to delete and/or archive old data as soon as possible on ALL filesystems.

After the Upgrade Issues found:

  • THERE IS AN ISSUE WITH OLD JOBS USING BLCR CHECKPOINT. Due to the change in kernels, previous BLCR checkpoint jobs are failing and taking nodes down with them. PLEASE REMOVE old jobs if you were running with HPC BLCR Checkpoint. New BLCR jobs are working just fine, however.

HPC Down-Time Reminder

3-17-2015 Farran

Reminder that HPC will be down starting at 9am Wednesday 3/25/15 until 3/26/15 5pm.

Data Science Initiative day courses

3-12-2015 Harry/Farran

Introduction to Linux, R and other short courses that may be of interest to HPC users:

HPC Down for Maintenance 3/25/15 - 3/26/15

2-26-2015 Farran/Harry/Garr/Edward

The HPC cluster will be down for maintenance starting on Wednesday 3/25/15 9am until 3/26/15 5pm. If all goes well the cluster will be up sooner but please do not plan on this.

Two new hardware chassis have been ordered to serve as emergency chassis in the event one of /dfs1 or /dfs2 chassis fails.

The entire cluster will be down in order to update software ( BeeGFS, Grid Engine, Infiniband, etc ) and to test the new-chassis fail over procedure to make sure all works as expected so that if/when the real thing happens we will not be caught off guard.

Checkpoint-able jobs will be checkpointed and should resume running once the cluster is back in operation, all other jobs will terminate when the nodes are rebooted.

If this downtime presents a problem for you, please let us know at hpc-support@uci.edu asap.

NSF Funding opportunity for BIGDATA

2-25-2015 Allen Schiano

To all HPC Partners: we would like to make you aware of the following funding opportunity:

From: Henry Warchall <hwarchal@NSF.GOV>
Subject: Updated NSF funding opportunity: BIGDATA: Critical Techniques and Technologies for Advancing Foundations and Applications of Big Data Science & Engineering

Dear Colleagues,

An NSF funding opportunity update is now available:

Critical Techniques and Technologies for Advancing Foundations and
Applications of Big Data Science & Engineering (BIGDATA)

Full Proposal Deadline Date: May 20, 2015

Please see

for details and links to additional information.

From the program synopsis:

The BIGDATA program seeks novel approaches in computer science, statistics, computational science, and mathematics, along with innovative applications in domain science, including social and behavioral sciences, geosciences, education, biology, the physical sciences, and engineering, that lead towards the further development of the interdisciplinary field of data science. The solicitation invites two types of proposals: "Foundations" (F): those developing or studying fundamental theories, techniques, methodologies, technologies of broad applicability to Big Data problems; and "Innovative Applications" (IA): those developing techniques, methodologies and technologies of key importance to a Big Data problem directly impacting at least one specific application. […] In addition to approaches such as search, query processing, and analysis, visualization techniques will also become critical across many stages of big data use—to obtain an initial assessment of data as well as through subsequent stages of scientific discovery. Research on visualization techniques and models will be necessary for serving not only the experts, who are collecting the data, but also those who are users of the data, including "cross-over" scientists who may be working with big data and analytics for the first time, and those using the data for teaching at the undergraduate and graduate levels. The BIGDATA program seeks novel approaches related to all of these areas of study.

trinity/r20140717 much faster. New trinity/r2015-2.0.3 installed.

2-12-2015 Garr Updegraff

Edward Xia and I installed a patch suggested by a Trinity developer (tip of the hat to Guia Guffanti for finding this) that allows HPC’s trinity/r20140717 module to perform sorts in parallel on all available cores using the new samtools/1.1 module. The improvement in sorting time is enormous: Guia reports that a Trinity run that formerly required days to finish now completes in a day and a half. Karen in EcoEvo reports that a Trinity run that required more than a day now completes in 2 hours. Because the improvement is so enormous, I modified the Trinity modulefile to automatically load samtools/1.1; that means there is no need to load a samtools module before running this version of Trinity.

I also finished installing a new version of Trinity, module trinity/r2015-2.0.3. This version includes the parallel samtools sorting patch, but only for the Trinity Perl program, and not for the sub-programs. Depending on what we hear from users, I may decide to add the parallel sorting patch to the sub-programs, too. This module automatically loads the samtools/1.1 module, so there is no need to load the samtools module in your qsub script.

gcc/4.8.2 Now The Default Compiler on HPC

2-3-2015 Farran

There appear to be issues with gcc/4.8.3. Until we can figure out the problem, gcc/4.8.2 is now the default on HPC.

For OpenMPI, the default is openmpi-1.8.3/gcc-4.8.2

GLIBC Exploit Issue

1-29-2015 Farran

We received feedback that several critical jobs are running on HPC with grant deadlines that are due soon.

In an effort to keep HPC running while addressing the newly discovered exploit, we are NOT going to take HPC down, but rather update nodes one at a time while working around current jobs.

There are certain nodes that we have to update and we will be doing those first. There are also NFS data servers that will be updated today (like /data file-server). This means that /data will be un-responsive for 5-10 minutes when it is rebooted.

Both login nodes and the interactive node HAVE to be updated. All private nodes will be updated around current users' jobs (when the current jobs on the node complete).

HPC Job RESTART

1-27-2015 Farran

A note that JOB RESTART is now available on HPC:

The restart option is great for large job arrays when using the FreeQ ( free64 ) queue system on HPC. You can significantly speed up your workflow using the 5,500+ possible cores on the free64 queue, with no suspended jobs.
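
A minimal sketch of a restartable array job on free64 (the task range and program are examples; -r y is the standard Grid Engine flag that marks a job re-runnable):

  #!/bin/bash
  #$ -q free64                  # free public queue
  #$ -r y                       # re-run the job if its node is reclaimed or fails
  #$ -t 1-100                   # example array of 100 tasks
  ./my_program input.$SGE_TASK_ID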

Help us cite your paper

1-24-2015 Harry

D I D   Y O U R   P U B L I C A T I O N   R E S E A R C H   U S E   H P C ??

             H E L P  U S   C I T E   Y O U R   P A P E R

We would like to demonstrate to the funding orgs, both on campus and elsewhere, that HPC is being used to generate publications. So if you have used HPC in the analysis of data that has resulted in a publication, we’d like to cite that paper as evidence that HPC is a valuable resource on campus and that it deserves perhaps a little more support. Soooo…

If you have published such a paper, please send us an email with the citation info as well as a brief note as to what type of resources were particularly useful on HPC and what you’d like to see improved.

Thanks very much! The HPC support group: <hpc-support@uci.edu>

Citations being added here: http://hpc.oit.uci.edu/publications

Data Corruption in /dfs1

1-21-2015 Farran

A note & reminder that the BeeGFS /dfs1 file-system has data corruption! The corruption is due to the LSI Raid controller failure on the /dfs1 metadata server that happened on 1-8-15.

Important You will WANT to check all of your files!

We are in the process of ordering additional hardware to have as spare backup hardware as we are currently running with the bare minimum.

A spare chassis is being ordered to serve as a backup to the current eight (8) Raid servers on HPC.

We also need a spare chassis for the two metadata servers and are currently trying to get authorization for that.

Once both /dfs1 and /dfs2 go on-line, if a metadata server fails, it will take the entire /dfs1 or /dfs2 down with it, so we are trying to get a spare chassis (backup hardware) for that as well.

We are doing our best to keep the cluster running as smoothly as possible, but there is only so much we can do with our limited ( non-existent ) budget. Thank you for your understanding, and if you have any helpful suggestions, we would like to hear them.

If you have any questions, comments, suggestions please let us know at hpc-support@uci.edu

HPC still data-checking /dfs1, reviving /dfs2

1-9-2015 4pm Farran/Harry

Almost done. We will try to open up HPC to users by the end of the day. However, there is an additional wrinkle: because we did not receive the replacement controller today, we may have to briefly reboot the compute nodes when /dfs2 becomes usable. We are still trying to come up with a way to avoid this.

HPC’s /dfs1 seems to be running /fairly/ normally. Please check back here for updates in case you find things out of place or not working the way you expect. If your case is not mentioned here, please let us know ASAP.

Data on /dfs1 seems to be mostly intact, but the complete filesystem check (fsck) will take about 10 days to complete, and we have run into cases where some compute nodes cannot see certain subdirs, but others can. Alert us to these cases.

We are moving all data except that for /bio and /som to /dfs2 in the coming days, so please watch for updates if you are in these groups:

atlas   cbcl   dabdub   drg   edu   elread   tw

We are also moving the Mortazavi group data to /share/samdata. Please watch for emails about this.

The replacement controller for /dfs2 is slated to arrive today, but it’s unclear when that will happen. Obviously, /dfs2 is unavailable until the controller arrives. Should you see /dfs2 suddenly appear, please stay off it until we announce it’s ready.

/dfs1 Update, HPC down until Friday

1-8-2015 4pm Farran/Harry

The LSI disk controller on the /dfs1 metadata server went bad, taking the entire /dfs1 file-system down with it. Since we have no spare parts, the disk controller from /dfs2 was yanked out to replace the bad one on /dfs1.

The /dfs1 file-system is up now and the data appears to be intact. You will want to run md5 checksums on your data to make sure however.

HPC logins have been disabled and will remain unavailable until 9am Friday (1/9/15) in order to move data OUT of /dfs1 as /dfs1 is almost full.

Jobs that were using /dfs1 most likely crashed when /dfs1 went south, so yes you will need to restart those jobs again. All other jobs not using /dfs1 are still running.

The downtime is longer than expected as we struggle to find the needed hardware resources.

Update: 7:00a, Thu, Jan 08, 2015

No data on /dfs1 is available anywhere, and since it is the central storage system for HPC, very little can be done until it is fixed.

The metadata server seems to be the cause; the storage nodes seem to be fine. The MD server is booting very slowly; it seems to be a problem with the disk controller that runs both the OS and metadata arrays. The OS is on a RAID1 (mirrored) and the metadata is on a RAID10 (striped mirrors), so we’re hoping the data is safe with the appropriate hardware replacement.

We left it running thru its init and filecheck routines last night. There are still a few things to check, but we may have to order a new controller to replace it, which may take a couple of days.

More as we find out.

Update 12:30p, Thu, Jan 08, 2015

It appears the disk controller is bad. We’ve swapped it with the one from /dfs2 and it appears to be working OK, but we’re taking this opportunity to move some data off /dfs1. The previous controller also worked for a while and then began to record errors, so we’re watching the replacement carefully to make sure it was the controller and not another part of the system.

Most of the data on /dfs1 appears to be there, altho the bad controller recorded many bad reads (which may have been recovered by the error correction). Users will have to check their data carefully after we open HPC to users again. We anticipate that HPC will remain closed until tomorrow to make sure that the errors were caused by the controller that was replaced.

No HPC Email

1-5-2015 Farran

The campus MTAs (Email Transport Systems) are blocking HPC from receiving/sending email due to a user job gone wrong on HPC that sent 40,000+ emails, triggering an automatic shutdown by the MTAs. The user’s HPC account has been locked, and email will be blocked until the issue is resolved.

2pm Update: Email working now

Happy New Year!

2014 Winter Break

12-19-2014 Farran

Most of the HPC staff will be on vacation from 12/22/2014 until the beginning of the new year. Please note that if you send email to HPC support, it will most likely not be answered until everyone gets back.

For HPC emergencies (cluster down, etc), please report the problem to the OIT help desk at:

  • x42222 (949) 824-2222

Best wishes,

The HPC Elves

HPC /data sluggish

12-17-2014 Farran

The /data server (nas-7-7) became very sluggish with a high load, and HPC was basically non-responsive.

Harry rushed to the data center to reset the server when it failed to reboot remotely. It is back up now and running normally after the reboot and after the jobs that were impacting /data were killed.

A reminder to please NOT run any jobs from your home directory on /data, as the file server cannot keep up with the load.

For the proper data server to use, please see:

HPC MPI Modules Re-Compiled

12-10-2014 Farran

Now that HPC’s default OpenMPI is openmpi-1.8.3/gcc-4.8.3 and new updated Mellanox OFED drivers have been installed on the compute nodes, we are re-compiling all of HPC’s MPI program modules.

The following list includes what has been updated so far. This list will be updated as other modules are updated at a later date.

Important If you need a particular HPC MPI module updated now, please let us know.

List of what has been updated to date:

  • PETSc/3.3 PETSc/3.4 PETSc/3.4.4 PETSc/3.5.2

  • pflotran/20130408

  • amber/amber12

New OpenMPI & Defaults

12-4-2014 Farran

A new version of OpenMPI, 1.8.3, has been installed on HPC to work with the new Mellanox OFED setup. The following compilers have been configured and are available with OpenMPI 1.8.3:

  module load openmpi-1.8.3/gcc-4.8.3
  module load openmpi-1.8.3/gcc-4.9.2
  module load openmpi-1.8.3/intel-12.0.5
  module load openmpi-1.8.3/pgi-14.10
Important NOTE 1: You will need to re-compile your MPI programs in order to have MPI jobs work correctly on HPC.

Important NOTE 2: The default gcc and OpenMPI are now gcc-4.8.3 & openmpi-1.8.3. You will need to log off and back in again for the new defaults to take effect.

Important NOTE 3: Nodes are being updated on HPC with the latest Mellanox OFED. If you are a node owner, please release jobs from your nodes when convenient so that your nodes can be updated.

Important NOTE 4: The new GNU debugger gdb/7.8 has been installed and compiled against the new gcc-4.8.3 default. NAMD has been removed as a default.

If you encounter HPC MPI modules that need to be re-compiled, or if you need other compiler versions added, please email hpc-support@uci.edu.

OpenMPI 1.8.3 has been configured to work with all of the latest and greatest Mellanox speedup functions:

Please let us know if you notice your MPI jobs running faster or slower than before. Hopefully faster.

A new HPC public test queue has been configured with a 5 (five) minute max run-time limit in order to allow everyone to test jobs on HPC. If you would like to try out the new MPI setup, a sample job is available at:

cat ~demo/hello-mpi.sh
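
For reference, re-building and smoke-testing a small MPI program against the new default might look like this (the file names are examples; real runs should go through the queue):

  module load openmpi-1.8.3/gcc-4.8.3
  mpicc hello-mpi.c -o hello-mpi        # re-compile your MPI code against the new stack
  mpirun -np 4 ./hello-mpi              # quick local sanity check only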

As with all new configurations, expect some road speed-bumps along the way.

Updated Compilers Installed

12-3-2014 Farran

The latest versions of the GCC and PGI Portland Group compilers have been installed and are now available on HPC. The GNU debugger has also been updated to version 7.8, compiled with gcc/4.8.3.

New modules added:

 module load pgi/14.10

 module load gcc/4.9.2

 module load gdb/7.8

OpenMPI

11-20-2014 Farran

With the recent upgrades to new Mellanox drivers, the new 3.10 kernel, and new CentOS 6.6, most flavors of OpenMPI are not working correctly / optimally on HPC.

Various OpenMPI modules need to be re-compiled from source in order to pick up all of the new and exciting bits. This process is very involved and time-consuming. With our limited HPC staff, this will take me some time to complete.

I will work on upgrading our OpenMPI distribution as time permits and will post back here when they are ready.

If you need OpenMPI done sooner rather than later, please send us a note at hpc-support@uci.edu about your urgency.

Thank you for your understanding and patience.

Joseph

File Server for /pub & /checkpoint Crashing

11-9-2014 Farran

The file server called nas-7-2 that serves data for /pub and /checkpoint has been crashing recently - sometimes twice per day.

This NAS server has been rock solid before, and we believe the recent crashing may be due to one of two possibilities:

  • Moving checkpoint files to it ( heavy usage ).

  • Using the old not so good drives from Gluster for /checkpoint.

We are going to upgrade the nas-7-2 kernel from 2.6 to 3.10 to see if this stops the NAS server from crashing, just as the move to the 3.10 kernel has stopped the compute nodes from crashing.

Important The /pub and /checkpoint file systems will be UNAVAILABLE on Monday 11/10/14 from noon to around 3pm in order to upgrade the Kernel.

During this downtime of nas-7-2, several jobs will ERROR out in the ( Eqw ) state. There is nothing you need to do. Once nas-7-2 goes back on-line, the jobs in error state will be reset back to normal.

We are also going to ask OIT for money to replace the old and problematic /checkpoint disks with new ones.

CheckPoint issues with BeeGFS FileSystem

11-06-2014 Farran

HPC CheckPoint is failing if you run jobs using the BeeGFS file system (/dfs1, /fast-scratch).

CheckPoint is working just fine if you run from an NFS file-system like /pub/$USER or /data/users/$USER, however.

This is a new issue now that we are moving nodes over to the 3.x kernel and BLCR 0.8.6. Adam/Harry will soon be filing a report with BeeGFS support. We will report back when we have a fix or more info.

Nodes Crashing Update

11-04-2014 Farran

We believe that we have found the solution to HPC’s node-crashing problem. Using an updated 3.x kernel, BLCR 0.8.6, and not using /dfs1 for the checkpoint files appears to be the key.

A test node was set up with the above combo on Sunday, and the problematic jobs that were previously crashing nodes on HPC have not crashed the test node as of today. So this is good news.

In order to move to the 3.x kernel and combo, we can either bring the entire HPC cluster down or do nodes one at a time. Taking HPC down would require several days of downtime to configure all nodes, as several nodes have file-systems and need to be done manually.

Instead of an extended downtime, we have opted to do nodes one at a time while the cluster remains running. So what this means to you is that you will notice some nodes down for a few hours during the day while they are being configured.

If you are running on the free queues like free64 queue, we will checkpoint your jobs (if you are running with HPC checkpoint) before taking the node down. If you are NOT using checkpoint, we may need to kill your jobs on the free queues in order to upgrade the nodes. If this presents a problem for you, let us know asap.

The public pub64 queue, which comprises compute-7-1 through compute-7-9, will be the first to be converted, and the last nodes will be private compute nodes that have file-systems on them, as we need to take special precautions to make sure the node’s FS is not destroyed. We will be contacting node owners when we are ready to process those nodes, but please make sure you have backups of the data on your private nodes' file-systems, just in case.

If you have any questions or want your nodes converted as soon as possible, email us at hpc-support@uci.edu and tell us what your nodes are.

One last note: Checkpoint files have been moved to:

  • /checkpoint/$USER

Nodes Crashing

10-29-2014 Farran

A note that nodes have recently been crashing, sometimes several nodes per day. The issue happens when the cluster is heavily loaded with work and when using HPC Checkpoint, which is now the norm on HPC.

We are looking into this and appreciate your understanding, as this may take a while to isolate given the complexity and staffing level of HPC.

Thank you for your understanding,

Joseph

ANSYS Software available on HPC

10-21-2014 Farran

A note that the ANSYS software package http://www.ansys.com is available on HPC.

  • module load ansys/15.0.7

The complete ANSYS commercial software product line has been made available to all HPC users doing research work free of charge. Many thanks to the APEP group for securing the license for HPC.

DNS: Domain Name Service

10-15-2014 Farran

A recent yum change broke the way our local DNS was working. The result was that you could not get to nodes that have a file-system on them. The issue has been fixed and you should be able to get to all nodes once again.

Course in Next Generation Data Analysis at UCR

10-13-2014 hjm

NOTABLE: The Institute for Integrative Genome Biology at UC Riverside will offer its next intensive workshop on "Next Generation Data Analysis" on Dec 5-8, 2014. Detailed information about this event and sign-up instructions are available on this site:

Due to the high demand for these events, please sign up as soon as possible. They are not free, but they are VERY good courses.

Free Python training from Enthought.

10-10-2014 hjm

NOTABLE: Free Python training from Enthought. Many of you use Python, intentionally or as part of larger packages and for good reason - it’s one of the best and most widely used languages for productive programming. Enthought, the company that produces the Enthought Python package is offering free online training in Python for academics. https://training.enthought.com/courses

GLuster (/gl) gone

10/7/2014 hjm

The Gluster filesystem (/gl) is now unmounted from all user-available nodes and will be shut down later today to be recycled into /dfs1.

HPC BLCR Checkpoint

10/5/2014 Farran

HPC checkpoint BLCR has been turned off in order to drain BLCR jobs out of the nodes. This is being done in order to re-install BLCR cluster-wide.

BLCR will be back in operation on Monday afternoon 10/6/14. If you are using BLCR, there is nothing you need to do but wait. Your jobs will sit in the queue until BLCR comes back on-line at which point your jobs will continue running automatically.

HPC GPU Nodes

10/2/2014 Farran

NVIDIA drivers and cuda programs updated on HPC nodes compute-1-14, compute-6-1 & compute-6-3 with the latest available version: cuda_6.5.14.

New module created for version 6.5:

  • module load cuda/6.5

HPC Login

10/2/2014 Farran

One of the two HPC login nodes was not accepting logins, so 2 out of 3 attempts to log in to hpc.oit.uci.edu failed.

The issue has been fixed and you should now be able to log in to hpc.oit.uci.edu again.

HPC Maintenance

10/1/2014 Harry/Adam/Joseph

The HPC scheduled downtime was longer than normal due to the electricians pulling the power on the WRONG HPC racks, which included most of HPC's critical Raid servers, setting us back. /dfs1 seems to have come back unscathed, altho some local disks on other filesystems failed. Let us know if you find suspicious behavior.

HPC was updated or changed with the following:

  • the Gluster filesystem is now READ-ONLY, and only from the INTERACTIVE node. If you need to get files from it, let us know. On the 7th, it goes away permanently.

  • Newer kernel 2.6.32-431.29.2.el6.x86_64

  • BLCR compiled against new kernel

  • Mellanox OFED drivers updated from 2.0 to 2.3.1.0.1

  • Patch from BeeGFS that Adam reported to Fhgfs support. We are hoping that the patch and newer OFED will fix the reported bug: https://groups.google.com/forum/#!topic/fhgfs-user/wpPUV4-xLQ0

  • Some checkpoint programs are having problems checkpointing the context file, so we are hoping the above patch and newer OFED drivers will fix this.

  • Nodes compute-3-3, 3-4 & 3-6 were re-imaged to fix their original small root partition. The mdadm ( software raid ) configuration did not survive on startup. The /compute-3-x filesystems were redone.

  • compute-2-12 is down with hardware issues.

If modules are not working for you, log off and back in again to pick up the correct module setup

Gluster is being retired

09/25/2014 hjm

Gluster (/gl) is being retired. /gl will be made read-only on Tuesday, Sept 30th after the shutdown. You will be able to read from it, but not write to it or delete files from it. One week later, Tuesday, Oct 7th, it will be removed permanently. You should all be using /pub or /dfs1. If you need to delete files on /gl/old so they can be synced to /dfs1, PLEASE DO IT NOW.

Update on status of /dfs1

9/22/2014 hjm

We are still about 2 weeks away from the end of the /dfs1 data squeeze.

What needs to happen:

  • We need to finish re-rsyncing some users' files from /gl to /dfs1. It takes a long time to rsync millions of files because /gl is so slow, especially on ZOTfiles (Zillions of Tiny Files).

  • During this time users should be backing up their valuable data. As we’ve seen today, unfortunate things can take place and data can be lost. We do not back up your data; if it’s valuable, it’s up to you to back it up. You should also be checking whether your files have been transferred correctly to /dfs1. How to do this is described here (see the Note: To check for differences between the 2 dirs); a minimal example also appears after this list. If there are only a couple of differences, please resolve them yourself by copying the files. If there are hundreds of differences, please let us know and we’ll resolve it for you.

  • After all data has been copied, we need to take the /gl servers down, replace the faulty disks, and either create a new filesystem (/dfs2) or add the servers to /dfs1. The funding for this is still being worked out and may require more time.

  • we will be adding quotas to /dfs1 to prevent overuse, and scanning the filesystem regularly to identify users who are creating ZOTfiles or keeping files older than X months.

  • Once that’s done, users can start to use /dfs1 normally.
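
As a rough example of spot-checking that the two directory trees match (paths are illustrative; the checksum comparison is slow but thorough):

  # dry run: list files that differ (by checksum) between the old and new locations
  rsync -rvnc /gl/mylab/$USER/  /dfs1/mylab/$USER/

  # or a simpler file-by-file comparison
  diff -qr /gl/mylab/$USER /dfs1/mylab/$USER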

/pub Gone.

9/22/2014 Farran

Sorry folks. All data in /pub has been deleted. Sunday night, while testing NFS version 3 vs 4, /pub was mounted on a node at /tmp, and it looks like a nightly node script that cleans /tmp wiped out /pub as well.

We also had other system data missing at a different file-system location, and the above does not explain how that data was wiped out, however.

The clean-up scripts have been running for years on the nodes, but the right combination of events unfortunately most likely caused /pub to be deleted.

Unfortunately there is no un-delete for XFS file system.

We take all data loss seriously and apologize for the problems this has caused you.

Note: /w1, /w2 & /mathbio are still available as read-only. Please copy your data back again from /w1, /w2 & /mathbio to /pub/$USER.

Grid Engine Sluggish responding

9/20/2014 Farran

Grid Engine is maxing out on the head node and so it’s running slower than normal for commands like qrsh.

Looking into it.

10pm Back to normal. A series of jobs were in a flux state keeping GE busy.

Grid Engine Core Binding Semi-Broke

9/19/2014 Farran

A note that Grid Engine core binding is not fully working on HPC. It broke with either the last round of yum updates that all nodes got, or with the new GE version in combination with those updates.

I will work on it as time permits.

Grid Engine has a Major Tune Up

9/17/2014 Farran

HPC Grid Engine had a major tune-up in order to get around the issue of jobs over-subscribing on compute nodes. The number of parallel environments on GE went from 7 PEs to 156. Each queue now has its own parallel environment, and this appears to help/fix the over-subscription issue.

The PE change is transparent to the user base; however, for MPI jobs, the method of creating MPI scripts has changed. The "exclusive" consumable is no longer used. For full details, please visit the MPI section of the running-jobs-on-HPC web page:

Yes you will need to update your batch scripts if you are running with MPI.

Location changed for personal user web directories on the HPC

9/17/2014 Garr/Farran

The location of personal user web directories is changing. We are in the process of moving everyone’s web files from their old personal HPC location, which was one of the following (depending on one’s group):

/data/users/public-www/$USER/
/bio/public-www/$USER/
/som/public-www/$USER/

to the new location, which for all users is:

/pub/public-www/$USER/

After a user’s files have been transferred to the new location, the old directory will be empty and become read-only, to prevent anything new from being posted there.

More details (including how to enable directory listings) here:

Reminder /w2 & /mathbio going Read-Only on 9/15/2014

9/12/14 Farran

A reminder that on Monday the 15th ( 9/15/14 ), the /w2 and /mathbio file system will go into READ-ONLY mode. This is being done in order to force users to start using the new public data server available at:

  • /pub/$USER

If you are using /w1/$USER, /w2/$USER or /mathbio/$USER, please copy your data to /pub/$USER.

The new public data server is replacing the old and failing /w1, /w2, /mathbio servers.

The old data servers will be retired and turned-off on 10/1/14.

HPC 9-9-14 Maintenance

Harry/Garr/Farran

HPC maintenance done. Please see:

If you encounter problems/issues on HPC, before sending email please read the maintenance details to see if your issue is described and explained in the link above.

If you are still having problems after reading the above, let us know at hpc-support@uci.edu and as always, explain as much as possible as "I am having a problem" does not help us much.

Thank you,

HPC Team

Enthought Python has been updated

9/3/14 - hjm

Enthought Python has been updated and at least some users are reporting previous bugs have been fixed. There were some warnings and some update failures, but they appear to be minor or not in the original module tree. Please let us know if you find any oddities with this update. We will be transitioning to Canopy, the newest version of the Enthought Python line in the next few weeks.

/w2 will go Read-Only on 9/15/14 and then go away on 10/1/14

9/3/14 Farran

The /w2 file server will be set to READ-ONLY on 9/15/14 in order to move all HPC users onto the new public data server:

  • /pub/$USER

The new public /pub data server is replacing the old and failing /w1, /w2 and /mathbio file servers, so please start using /pub/$USER as soon as possible.

After 10/1/14, the old /w2 server will become a big-scratch file server until the unit dies. The server for /w1 & /mathbio has too many hardware problems to keep it running and so it will be turned off for good after 10/1/14.

/w1, /w2 & /mathbio file servers will go off-line on 10/1/14, so make sure you get all data OUT before 10/1/14.

Bug Found in Fraunhofer BeeGFS ( /dfs1 ) File System

9/2/14 Farran

A bug has been brought to our attention with the /dfs1 ( Fraunhofer / BeeGFS ) file system that affects GROMACS and many other applications.

Details of the bug can be found here:

It is our intention to update /dfs1 file system during the HPC downtime next week to correct this.

New Public Data Server /pub NOW Available

8/30/2014 Farran

The new Public Data server that is replacing the old and failing /w1 and /w2 file servers went on-line early ( today ). You can access the new public data server at:

  • /pub/$USER

Please start copying your data from /w1 and /w2 to /pub now. The /w1 & /w2 servers will go away by October 1st ( 10/1/14 ), so please start copying your files to /pub/$USER now in case /w1 or /w2 opts for early retirement.
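A minimal sketch of the copy, assuming you use rsync (which preserves timestamps and can safely be re-run to pick up where a previous attempt left off):

$ qrsh -q ionode                      # optional: do the transfer from an IO node
$ rsync -av /w1/$USER/ /pub/$USER/    # repeat with /w2/$USER/ if you have data there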

/w1 UP

8/30/2014 1pm Farran

The /w1 File server is UP early thanks to the on-site operator who was able to reset it.

/w1 crashed because it went 100% full. The server will crash again if it stays full.

1:30pm Update:

/w1 crashed again due to jobs continuing to write to /w1 while it is at 100% capacity. The /w1 file system was made READ-ONLY to prevent users from crashing it again.

/w1 Down

8/30/2014 Farran/Garr

The super old public data server /w1 crashed. It will remain down until we get back on Tuesday to review the console log and see what happened. This server has been on its last legs for a long time now, so maybe it finally gave out. We will know more next week.

Almost Lost /gl

8/28/2014 10am Farran/Garr

A note that we had two failed drives on one of the /gl storage units and, due to how the notifications were set up, some of us were not aware of the degraded unit status.

The raid unit is currently rebuilding after the two bad drives were replaced this morning.

This is another wake-up call that you need to have backups of any important data you have on HPC. As a reminder, we do not have the resources to back-up the huge file-systems on HPC. It is your responsibility to make sure you maintain your own backups.

Also, a reminder that HPC runs with a skeleton crew, and skeletons also need to take vacations. If you send email to HPC support and do not hear back from us, wait a few days and try us again. HPC email support is only available during normal working hours of M-F 8am-5pm, holidays not included.

For HPC emergencies during off hours such as HPC not responding, please call OIT help desk at x2222 (949)-824-2222 and report the issue.

Thank you for your understanding,

Joseph Farran

HPC Not Reachable

8/27/2014 Farran

The building Cisco switch that serves part of HPC network died sometime around 8pm last night (8/26/14).

The switch was replaced by OIT network team this morning at around 10am ( 8/27/14).

Running HPC batch jobs were not affected; however, network connections to/from HPC were lost.

HPC DownTime Correction.

8/26/2014 Farran

HPC will be DOWN all day Tuesday 9/9/14, NOT on 9/4/14. Sorry for the confusion.

Auto Mounter

8/21/2014 Farran

The auto-mounter was restarted on all nodes to fix an issue where the /w2 file system was not being mounted on some of them.

/w2 Update

8/14/2014 Farran Edward

The /w2 file server is back in operation. Please check your files and let us know if you suspect any missing data.

/w2 Update

8/14/2014 Farran

The data on /w2 looks to be intact! We were able to log in remotely and mount the file-system, and things appear to be good. The /w2 data server is being shipped back to UCI and will arrive on Monday 8/18/14, at which point it will be made available on HPC at the same mount point of /w2.

Many thanks to Advanced HPC for fixing /w2 for free.

Please note that a brand new public Raid server will be setup in the coming weeks. This new raid server will replace the old aging public /w1 & /w2 raid servers. A note will be placed here when the new public Raid server goes on-line on HPC.

/w2 Update

8/12/2014 Farran

Advanced HPC (AHPC) called yesterday to notify us that they received the repaired raid card back from Taiwan for the /w2 server. AHPC will soon let us know if the data on /w2 can be recovered. Stay tuned.

HPC Downtime Tuesday 9/9/14 All Day

8/4/2014 Farran / Harry

The HPC cluster will be down all day Tuesday 9/9/14 starting at 8am until 8pm.

The entire cluster has to be brought down in order to switch from Gluster to the new BeeGFS ( http://www.fhgfs.com ) /dfs1 file system. We are also moving to a new Raid server for /data home directory.

Checkpointable jobs will be checkpointed and will resume running once the cluster is back in operation, all other jobs will be terminated when the nodes reboot.

If this downtime presents a problem for you, please let us know at hpc-support@uci.edu now.

Joseph Farran

HPC Email Support Expectations

7/29/2014 Farran

A note that HPC email support ( hpc-support@uci.edu ) is only available Monday-Friday, 8am-5pm with next business day reply.

Warning

HPC email support is NOT available 24/7 year-round as we do not have the staffing to support this.

This weekend there were dozens of email exchanges on hpc-support and some users were expecting an immediate response. So to make sure expectations are in check, please do not expect an answer from HPC support during off hours.

Thank you,

Joseph Farran

Double disk failure on the Gluster Filesystem

7/17/2014 hjm

This morning, we had a double disk failure on one of the Gluster RAID servers. That is one disk away from an array failure. If the array had failed ~1/8 of the data on the gluster fs would have been lost. This is on top of the failure of the RAID controller on /w2 (data possibly recoverable, but probably not for at least 2 weeks) and increasing warning messages about the /w1 filesystem. If you don’t have a backup system in place, it might be in your interest to investigate Amazon’s Glacier backup service http://aws.amazon.com/glacier/

Grid Engine Queues Full

7/15/2014 Farran

The head node /tmp file system went full and caused Grid Engine to not be able to schedule new jobs ( all queues showed full ). The problem has been corrected and queues are working normally again. Sorry folks for the problem this caused.

/w2 Update

7/9/2014 Farran

Good news. Allen Schiano / Dana Roode have approved the purchase of a new public Raid server for HPC to replace the failing 10+ year old raid servers /w1 & /w2. The EPO was submitted yesterday and it will take around 3-4 weeks for the paperwork and for the unit to be built.

The old broken /w2 raid server is with the vendor AHPC, and they believe the problem resides with the Raid card. The /w2 Raid card is very old and no longer being made, so it is being sent back to the manufacturer in Taiwan for repairs. AHPC is estimating around ~4 weeks of turn-around to fix /w2. We remain hopeful that /w2 will be fixed so that the data can be retrieved, but there are no guarantees. As soon as we hear back from AHPC on the fate of /w2, it will be posted here.

/w2 Is in the repair shop

6/30/2014 6pm Farran / Harry

The /w2 data server is dead for now. Harry tried resuscitating it but found a burnt cable & burnt board connection, which is not good. The unit boots but does not see the /w2 file system, although the Raid card itself appears to be ok. It looks like a chassis component went bad, so for now no data can be found.

To expedite the process, the /w2 Raid server was physically transported this morning by car back to the vendor Advanced HPC (AHPC) in San Diego. We are hoping that AHPC will be able to fix the server. We will know more in the days to come and as soon as we hear back from AHPC, it will be reported here as to the fate of /w2.

Both /w2 and /w1 public data servers are very old hardware. We are looking at replacing both with a new single public data server and we need all the political help you can render in making this happen. We have no budget and the public data servers are heavily used by all users on HPC. If you have the political clout, please contact Allen Schiano schiano@uci.edu and let him know how important this is to your research work.

/w2 I/O errors continues

6/29/2014 8pm Farran

/w2 server was up for a short time and then crashed again. It will stay down until we can check it Monday morning.

/w2 I/O errors.

6/29/2014 Farran

The file server for /w2 experienced I/O errors and /w2 went off-line. Some jobs using /w2 died.

The data server was rebooted and it came back up around 3pm.

/data fileserver OS disk dies

6/6/2014 Farran/Harry/Adam

The /data file server, which is the heart of HPC, suffered a major stroke with a dead OS disk. The server was brought back to life around 11am with a new OS drive and image.

Around 1pm, after /data came back on-line, some nodes (about 40%) had stale NFS file handles and had to be rebooted. Jobs running with HPC CheckPoint should continue from their last checkpoint. All other jobs died when the nodes rebooted. Nodes that did not need to be rebooted were left as-is, so jobs on those nodes should continue to run.

File Descriptor Issue on /data

06/05/2014 Adam Brenner

Folks, the /data server is having some file descriptor issues (allocating and modifying them) over the NFS mounts. You might see programs hang or wait for feedback. We will take care of this issue in the morning.

Input/output error on /data

05/27/2014 Farran

The file-system /data started having I/O errors at around 1pm. The server was rebooted and all is back to normal. Please check your files. Also, this may be a sign that the server or controller is on its way out. As always, make sure you have backups.

KERNEL PANIC interactive node

05/20/2014 Farran

The interactive node crashed and was rebooted around 2pm.

KERNEL PANIC ON /data SERVER

05/18/2014 hjm

At about 3am on Sunday May 18th, nas-1-1 which serves out /data had a kernel panic which locked up /data. This is where all user data is stored as well as cluster config files, so it effectively locked up the cluster.

It appears to have been related to an error with swap, not the /data partition which looks to be intact, but this lockup may well have disrupted running batch jobs. Please check your jobs and files.

AND AGAIN: BACK UP YOUR DATA!! THIS MAY WELL HAVE BEEN …the end.

nas-7-1 UP

5/14/14 Joseph / Harry

Nas-7-1 is up using a spare LSI controller. The spare controller is itself suspect, as it was the controller that gave out at the beginning of the year on one of the Gluster raid servers.

The vendor AHPC has issued an RMA and is working on shipping us a replacement asap. When the RMA replacement arrives, nas-7-1 will need to be taken down again to install the new LSI controller.

Until then /fast-scratch is up and available, and hopefully it will stay up until the new LSI controller arrives.

Note: Any job using checkpoint that tried to jump nodes while /fast-scratch was down very likely died.

nas-7-1 down (Including /fast-scratch)

5/14/14 Adam Brenner

On May 14th at roughly 10pm, nas-7-1 went down. This is the node that is in charge of hosting /fast-scratch and /flash-scratch. In addition, it hosts the Galaxy Web Portal, HPC Trac (hpc-trac.oit.uci.edu) and Accounting Reports.

The cause is a failed RAID card ("Multibit ECC errors detected on the RAID Controller"). This card is no longer in working condition. HPC does not have a spare RAID card (no funds) and a replacement (RMA) is in process.

This process can take a few weeks to finish. In the meantime, the nas-7-1 server will be offline, along with all of its services (THIS INCLUDES APPLICATION CHECKPOINTING).

Harry Mangalam is attempting to replace the failed RAID card with another card loaned to us by LSI for testing purposes (a pre-production model) in order to restore some services for the cluster. However, this is far from an ideal solution / fix.

More updates will be posted here. For questions, comments, concerns, please email hpc-support@uci.edu

gcc 4.9.0

5/14/14 Farran

Gcc version 4.9.0 was recompiled when it was discovered that its thread support was not built correctly.

Core Binding

5/6/14 Farran

Core-binding has been turned on cluster-wide on HPC. This will help with programs that try to consume more cores than have been requested. For details, please see:

Core Binding is new on HPC so please report any issues or problems to support@hpc.uci.edu
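For reference, recent Grid Engine releases also let a job state its own binding request at submit time. A hedged sketch (the PE name is a placeholder, and the exact behavior depends on how core binding has been configured on HPC):

$ qsub -pe openmp 4 -binding linear:4 myjob.sh   # ask for 4 slots bound to 4 consecutive cores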

Login Nodes Rebooted

5/6/14 (11am) Farran

Sorry folks - a mis-configured script caused the login nodes to be rebooted.

Grid Engine Issues UPDATE

04/28/14 Farran

Grid Engine 8.1.6 was re-compiled from source and installed cluster-wide live. As of this moment it appears all is well and GE is behaving - jobs are being scheduled/suspended correctly. Time will tell if GE is now back to normal or not.

Grid Engine Issues

04/28/14 Farran

Grid Engine has been acting up since the update of GE 8.1.6 during the cluster maintenance. Jobs are not being suspended nor scheduled correctly. This was not obvious until days later after the upgrade in how GE was failing to suspend.

Tried reverting back to GE 8.1.4 but now all queues show as full.

Sorry folks but we may need to take the cluster down and start GE from scratch.

Compilers updated & MPI Issues

04/24/14 Farran

Latest version of PGI Portland compiler version 14.3 installed on HPC:

  • module load pgi/14.3

New GNU compiler gcc version 4.9.0 installed.

  • module load gcc/4.9.0

Following gcc dependencies also updated: gmp/6.0.0, mpc/1.0.2 & binutils/2.24.

MPI Issues:

  • When the nodes were updated to CentOS 6.5 during the cluster maintenance, some OFED bits (Mellanox software) broke MPI communication. All nodes OFED rpms will need to be re-installed which means rebooting the nodes. If you need your nodes done now, please let us know at hpc-support@uci.edu

Maintenance

04/21/14 Adam, Harry, Farran

HPC Cluster maintenance completed 1 day early. Updates and changes include:

  • Physical re-location of some compute nodes and raid servers in preparation for more storage Raid servers. Cleaned out some Raid bays of dust.

  • Son of Grid Engine updated to latest version 8.1.6

  • All nodes updated to CentOS 6.5

  • Fraunhofer File System (FhGFS) upgraded.

  • Other maintenance tasks.

Please report any issues and or problems to hpc-support@uci.edu

/fast-scratch

04/17/14 Farran

The /fast-scratch service went south and /fast-scratch was not available cluster-wide around 2pm. Server was rebooted and all appears to be back to normal. Some jobs were in limbo while /fast-scratch was not accessible.

Java 1.7 Updated.

04/17/14 Adam / Farran

Adam updated Java 1.7 to version 1.7.0.55. Older Java 1.7 was found to have an exploit.

Java 1.7 has been temporarily disabled due to a security issue. Once it’s patched it will be available again via modules.

HPC April News / Information / Planned Downtime

04/03/14 Adam Brenner <aebrenne@uci.edu>

Please read the mass email that was sent out today regarding the cluster status and planned downtime. More information here: http://hpc.oit.uci.edu/april-news

/fast-scratch Ready to Use

03/31/14 Adam Brenner <aebrenne@uci.edu>

The /fast-scratch file-system is ready to be used cluster wide and is available immediately to all users under /fast-scratch/$USER

The /fast-scratch file-system is designed for TEMPORARY (the length of your job) storage that is accessible cluster wide. Its purpose is to offer fast read and write speeds in a distributed environment over the InfiniBand network.

If your job has to use Zot Files http://hpc-trac.oit.uci.edu/wiki/HowTo/CleanUpZotFiles and cannot be changed, /fast-scratch is an ideal location to use for the runtime of your job. After your job is finished, compress your files into a single archive and store the data elsewhere (a sketch of this workflow follows below).

/fast-scratch IS NOT DESIGNED FOR LONG-TERM STORAGE (more than a few days) NOR AS A REPLACEMENT FOR PUBLIC / PAID STORAGE. Because space on /fast-scratch is limited, your data MAY BE DELETED WITHOUT NOTICE if we find files older than a few days.

More information on HPC datastorage can be found here: http://hpc.oit.uci.edu/data-storage
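A minimal sketch of that workflow (the queue, program, and final destination are placeholders; adjust them to your own setup):

#!/bin/bash
#$ -q pub64                              # placeholder queue
#$ -cwd

SCRATCH=/fast-scratch/$USER/$JOB_ID      # per-job working directory on fast scratch
mkdir -p $SCRATCH
cd $SCRATCH

/path/to/my_zotfile_heavy_program        # placeholder: a program that writes many small files here

# bundle the results into a single archive, move it to longer-term storage, clean up
tar czf /w1/$USER/results-$JOB_ID.tar.gz .   # placeholder destination; pick storage you actually have
cd /tmp && rm -rf $SCRATCH                   # remove the scratch directory when the job is done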

/w2 FileServer Back On-line

03/29/14 Joseph Farran

A large job array writing to /w2 caused the Raid server to crash due to I/O from too many nodes writing too fast. When the /w2 Raid server crashed, NFS went south, and that kept users from being able to log in to HPC today.

PLEASE DO NOT EMBARK ON LARGE JOBS WITHOUT FIRST DOING SOME DRY RUNS. Please start small and scale up SLOWLY to make sure you do NOT impact the cluster.

/w2 FileServer Offline

03/29/14 Joseph Farran, Harry Mangalam, Adam Brenner

The /w2 fileserver is offline at the moment and the cause is unknown. Harry is going to the data-center to take a look as to why.

Colorful Login Nodes

3/26/14 Harry Mangalam

Unless you have explicitly set your prompt, your login prompt has now changed to a multicolor prompt like this:

Wed Mar 26 12:40:30 [0.21 0.22 0.25] you@node:/where/you/are/working 1066 $

Please include that prompt with all support requests.

(Farran) If you prefer the original and normal default prompt, run

  • setup-default-prompt.sh

Pub64 72 hours max run time

3/25/14 Farran

The public 64 queue (pub64) wallclock max run-time changed to 3 days (72 hours) to allow everyone quicker access to the public queue.

New Users on HPC

3/20/14 Farran

HPC is growing fast with many new users. If you are new to HPC, PLEASE make sure you read how to run jobs on the cluster:

This read is MANDATORY for all new HPC users, as we are getting several users doing things they should not, and as a result turning the HPC system administrators into system janitors cleaning up after the mess. You don’t want your account locked and we don’t want to lock it, so read the link above.

/data Input/Output Error

3/18/14 Farran

The /data file server on HPC experienced an I/O error around 1:15pm today, causing /data to become non-responsive across the entire cluster.

The /data file server was rebooted and is now working again; however, several jobs on HPC were affected and/or aborted, since /data is the heart of the HPC system and without it, bad things happen.

Please check your jobs and resubmit if you were affected.

If you are using /data for your work, please DON’T. The /data file system is only meant for very light data processing. For any serious type of data work, use the public file servers and/or the private data servers if you have access to them.

New Login Nodes

3/13/14 Farran/Adam

Two new login nodes have been added to HPC cluster each with 3 GigE external networks to help with network congestion. When you ssh to hpc.oit.uci.edu, you will now be connected to one of six possible HPC GigE connections in a round-robin fashion to load balance the network.

If you need to move large amount of data to/from HPC, please use an IO node as you will get better performance. To request an IO node, use:

  • qrsh -q ionode

Then start your transfers with sftp, filezilla, rsync, etc.
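For example, a typical session might look like this (the remote host and paths are placeholders):

$ qrsh -q ionode                                            # get a shell on an IO node
$ rsync -av results/ you@your.desktop.example.edu:backup/   # push data off HPC
$ exit                                                      # release the IO node when done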

/w1 & /w2 New Disk quotas

3/4/14 Farran

Starting 3/18/14 (in 2 weeks), the public data servers /w1 and /w2 will have disk quotas enforced, with a 3TB limit per user. This is being done to allow all HPC users fair access to the public data servers.

You can check on your HPC disk usage at:

If you need more disk space, you can purchase space on the HPC distributed file system. Send your request, including the amount of disk space you need, to hpc-support@uci.edu.
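Until you check the usage page above, a quick way to see roughly how close you are to the 3TB limit is du (this just sums what sits under your directories; the enforced quota counts everything you own on the filesystem):

$ du -sh /w1/$USER /w2/$USER    # total size of your files on each public server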

/hpc path

2/25/14 Farran

A note that directories that were under /hpc/$USER are now under /gl/your-group/$USER. An email about this move was sent a while ago, and the removal of the links was done a few days ago.

So if you were using /hpc/$USER, you should find your files in /gl/your-group/$USER.

If you cannot find your files, please let us know at hpc-support@uci.edu

ls glitch on /gl

noon, Monday Feb 24th, 2014 - hjm

In an effort to improve the glacial speed of the ls command on the gluster (/gl) filesystem, I switched on an option that was supposed to optimize its speed and in fact it seemed to do so for several days (anyone notice?). However, this morning at about noon, a sharp-eyed user noticed that all his files were gone from /gl and inquired about them. The files were intact but ls refused to report them anymore. This bug has been reported to the gluster developers and ls will be slower but more reliable from now on.

Unplanned HPC shutdown

5pm, Thu, Feb 20th. - hjm

In testing a script to smoothly shut down the Fraunhofer filesystem, a difference between versions of fuser caused the not-so-smooth shutdown of all processes owned by root (instead of all processes that reference the /ffs filesystem). This caused almost all compute nodes to crash and most had to be manually rebooted. Hundreds of SGE jobs have failed and most will have to be restarted. My shamefaced apologies. hjm

Warning
Tues Feb 11th, /ffs is UNRELIABLE, CHANGED TO READ-ONLY

The Fraunhofer FS (/ffs) had another hardware-related issue on restart on Sunday and we now think that it is UNRELIABLE until we replace the controller hardware.

We have changed it to READ-ONLY, so you can get any data off it, but cannot write any more data to it. Please retrieve all data that you need and can verify; we suspect that much of the data now on it is corrupted, especially large files (multi-MB and larger).

Unless we hear from ppl before this Friday (Feb 14th), we will be taking it down for a full refresh (all data on it will be lost).

*As always, unless you have backed up your data elsewhere, do not expect it to still be there the next time you log in.*

hjm

/w2 at 100% Capacity

2/19/14 Farran

The public /w2 file system is at 100% capacity. Please help by removing all unwanted files from your /w2 workspace.

At great expense I have kept the public /w2 file server free of disk quotas to help everyone. If /w2 continues to stay at 100% capacity, I will be forced to turn on disk quotas on the public /w2 file server.

HPC File System Summary

2/12/14 Farran

So what the heck is going on with HPC file systems?

Glad you asked! The current FraunhoferFS file system (/ffs) is made up of two storage Raid servers and one meta-data server. Back in December of 2012 the Raid controller on one of the two storage servers went bad, which caused major data corruption on /ffs. A new Raid controller was ordered to replace the bad one; the /ffs file system was repaired, but the controller was not replaced due to technical issues. More data corruption occurred a few days ago and /ffs was made read-only to:

  1. Stop further data-corruption.

  2. Allow users to get whatever they can OUT of /ffs so that we can replace the bad controller and wipe-clean and redo /ffs correctly from scratch.

In order to redo /ffs, I am requesting that:

  1. We get a support contract from FraunhoferFS. This is currently being negotiated.

  2. We get new hardware for the FraunhoferFS meta-data server ( old hardware is currently being used for the meta-data ).

So this is why /ffs was made read-only. Please get whatever you need OUT of /ffs before the deadline so that /ffs can be wiped clean and redone.

With respect to the Gluster ( /gl ) file system: our long-term plan is to move away from Gluster and to FraunhoferFS. The reasons for this move are that Gluster:

  1. Is extremely slow in traversing the /gl file-system. This is a major issue given our huge data needs of almost 1 Petabyte of storage.

  2. During heavy I/O, Gluster thinks that some files do not exist when they actually do exist. This is another major problem that is causing Grid Engine to abort jobs since expected files are not present ( not found ).

  3. Does not work well with lots of small/tiny files ( the ZOT issue ).

In order to move away from Gluster, we need to purchase extra data storage servers, since Gluster is currently at 90% capacity.

In summary, we need funding in order to purchase the needed hardware and proper support. We are doing all we can to secure the funding, but we need your help in order to help you: please make your school, department, and anyone else who can make a difference aware of the needs of HPC.

One last note: we get a LOT of questions about whether data on HPC can be made more reliable and with backups, lots of backups.

The answer is a resounding YES, with enough money and staff resources, of which we have neither. So yes, we can make data on HPC as secure and reliable as American Express / Citibank / you_name_it, but the university does not have those deep pockets. We are trying to do our best given our limited resources while trying to provide a high performance cluster for your research work. We do need funding to set up a basic, reliable file-system on HPC without wasting money, but requesting even the most minimal configuration is proving to be a challenge.

Thank you for your understanding and please feel free to send any helpful suggestions to us at hpc-support@uci.edu, or to our Director Allen Schiano at schiano@uci.edu

Joseph Farran

/w1 Update

2/10/14 4:30pm Farran

The xfs_repair worked on /w1 and the file system is now back and available. There may have been some data loss for files that were open when the server went down earlier today.

Yet another reminder to please make sure you have backups of your data.

/w1 File Server

2/10/14 Farran

The public /w1 file server is having hardware issues. This is the oldest Raid server on HPC, and the power outage this weekend did not help.

The file system will not mount, so we are running xfs_repair, which will take about a day, to see if it can be repaired.

P.S. Joseph is back from jury duty

HPC Email Support Response

Feb 4/14 hjm

Harry is back, but Joseph is still on Jury Duty so we’re still playing catch-up. Please send us reminders if we haven’t completed a request.

Current FileSystem Status

2/11/14

  • The Gluster FS (/gl and the links off it: /bio, /som, /cbcl, /hpc, /edu) is stable and no oddities have been reported in the last 2 days. Gluster will still fail to keep up under heavy ZOT-file load and/or Array Job load.

  • /data is fine

  • /w2 is fine, but it’s VERY full and, because of the /ffs failure (which was holding a copy of /w2 while the /w2 storage server was repaired), some of the files that were on it were corrupted. Please be careful with them (see the section about generating md5 checksums below).

  • /w1 is stable, but we recently had a filesystem glitch which required an xfs_repair and may have lost some data.

Check MATLAB license status

To check the usage status of the campus license pool:

module load MATLAB # load the MATLAB module; sets up the env vars

$MATLAB/bin/glnxa64/lmutil lmstat -a -c 1711@seshat.nacs.uci.edu
#                                      (port@license-server)

# Please include the above line in your qsub scripts if you're using
# MATLAB to make sure the license server is online.

# you can check more specifically by then grepping thru the output.
# For example to find the status of the Distributed Computing Toolbox
# licenses:

$MATLAB/bin/glnxa64/lmutil lmstat -a -c 1711@seshat.nacs.uci.edu | grep Distrib_Computing_Toolbox

We still require that you compile long-running MATLAB programs into executable code with mcc as described here. This approach can also be used to debug, as well as to discover a priori whether a library is available for use (or, a posteriori, why it didn’t work).
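As a minimal sketch of the mcc step (the script name is a placeholder; the page linked above has the flags we ask you to use):

$ module load MATLAB
$ mcc -m my_analysis.m -o my_analysis      # compile the script into a standalone executable
# mcc also writes run_my_analysis.sh, a wrapper whose first argument is the
# MATLAB (or MCR) installation root:
$ ./run_my_analysis.sh $MATLAB input.dat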

CANCELLED: POWER OUTAGE this weekend

2/6/14 - hjm

We have been told that the Data Center power will be uninterrupted during this work so we will NOT be halting HPC over the weekend.

That said, please take care to always back up your valuable data.

How to visualize your disk usage

1/10/14 - hjm

I have added the disk space visualizer kdirstat to the list of tools to check your disk space.

The short version (with an active X11 client).

kdirstat
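If you don’t have an X11 client handy, a plain-terminal approximation (assuming GNU du and a sort that understands -h) is:

$ du -h --max-depth=1 /w1/$USER | sort -h    # size of each subdirectory, largest last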

Calculate md5 checksums on your data

1/6/14 hjm

With the problems we’ve been having with the filesystems, it is always good to be able to tell which files have been corrupted and which are intact.

The program md5deep makes it easy. To generate the md5 sums on all files in a dir tree, just type:

# the 'r' option recurses thru all dirs and files
# the 'e' option sends (to STDERR) the estimated time to completion for large files

$ md5deep -re starting_dir > /where/you/want/the/file/of/md5sums

Add this command to the end of your qsub scripts to generate a complete list of files with their md5sums and then copy that file to a safe place (not on HPC).
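Later, to see which files no longer match that saved list, md5deep’s negative-matching mode can be used (a sketch; check 'md5deep -h' on HPC for the exact flags in the installed version):

# -x: print only files whose current hash is NOT in the saved list
$ md5deep -r -x /where/you/want/the/file/of/md5sums starting_dir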

The "som-filezilla" queue has been replaced.

1/6/14 Joseph

For SOM users: The "som-filezilla" queue has been replaced. Please use "ionode" now for moving data to and from HPC.

To grab an ionode, do:

$ qrsh -q ionode

and then start your filezilla, rsync, sftp, etc process.

Update on GlusterFS problem (1/4/14; Midnight)

1/4/14 Harry, Midnight

/gl has been remounted and it appears to have had the desired effect (?). Please let us know if there are still missing files, IO misbehavior, inability to delete dirs, etc.

Update on GlusterFS problem (1/4/14; 9pm)

1/4/14 Harry, 9pm

After much googling, listserving, ircing, and experimentation, it looks like unmounting and remounting the glusterfs fixes many if not all problems we’re seeing.

We’ve been able to remount most of the cluster nodes already and are going to block the remaining ones so that we can un/remount /gl as they go idle.

However, we need everyone off the /gl filesystem AT NOON TOMORROW (SUNDAY)

so we can do a remount on the login node. I’ve already sent email to those ppl who currently have open files on /gl and I’ll do it again tomorrow shortly before noon. If you still have open files on /gl, those processes will be killed, but the rest of the activity on hpc can continue.

HPC Cluster Maintenance Update

12/20/13 hjm, jf, aeb

Starting under a cloud

We started the HPC maintenance break with an emergency Data Center shutdown (Monday December 16th, 2013) where AC was shut off and the temperatures went well over 100F, enough so that some machines actually melted the plastic release knobs on their power supplies. We were able to shut down most of HPC before temps got too high, but we’ve been having a number of unanticipated problems during the maintenance (which actually went pretty smoothly). A number of other campus services went offline as well (email, EEE, sites, etc).

The Good News

The good news is that almost all the compute nodes were brought up to date and synced to CentOS 6.4 with identical kernels, patches, and drivers. The storage nodes were also brought into sync for the system and filesystem software (tho slightly different from the compute nodes). We did find that some problems cleared up with the forced power-off, and we’re going to do full power-off cluster maintenance more regularly in the future, probably on a ~2-4 month cycle.

The Bad News

The bad news is that during the past week, we had 5-6 major storage failures, some of which caused data loss and some were fairly odd.

  • /ffs (Fraunhofer)

    • On bs5 we had twin RAID failures AGAIN that resulted in data loss. We will bring /ffs back online on a few nodes so that people can try to retrieve data, but I estimate that 20-50% of files are corrupted, based on md5 checksums that I had. Smaller files have a better chance of surviving, but unless you have checksums or other mechanisms for verifying them, assume that they are corrupted. Until we replace the controller, we consider /ffs to be unreliable. When we get a new controller, we will zero the filesystem (all data on it will be lost) and rebuild it fresh.

Important USE /ffs AT YOUR OWN RISK FOR NOW UNTIL WE CAN GET THE PROPER AND NEEDED HARDWARE.
  • /gl (gluster)

    • On bs4 the RAID controller had repeated lockups - solved (?) by re-powering and re-seating the controller. /gl became unresponsive, but I think there was little if any data loss since it was inactive at the time.

  • /data

    • All data on the /data filesystem mysteriously vanished, along with our souls - it looks like a software glitch caused this. Most of the data was recovered; people may have lost 1 day’s worth of files.

  • /w1

    • Not to be left out, the /w1 filesystem controller did not come up after the server was moved and the AC was shut off in the datacenter. After repeatedly trying various things, Joseph reseated the controller and it was recognized again - we lost sweat, but no data.

  • /w2

    • Incompletely restored due to the /ffs failure above (we estimate ~20%-50% data loss on /w2). The /w2 filesystem is now running on the original raid server and is no longer under /ffs. The data on /w2 is roughly from the first week of December. You may try to access your old data from /ffs under /ffs/w2. CAUTION: Very, very limited disk space is available on /w2. Please clean up your data or find a new home for it.

  • DABRICK

    • not really an HPC data loss, but another storage server in an HPC rack had a simultaneous disk failure (no data loss).

/gl (Gluster) is usable

The /gl filesystem appears to be usable. We recommend that you use /gl for your short-term data. Plans are being made to move from Gluster to Fraunhofer.

/ffs is NOT RELIABLE

As noted above, we consider the /ffs filesystem unreliable until the controller is replaced. You can use it for scratch space and for very short term faster storage for array jobs, checkpoints etc, but move your useful data off it ASAP.

USE IO Nodes

Use IO nodes to move data to and from HPC. To request an IO node, simply do:

  • qrsh -q ionode

When done, please exit to allow someone else to use it.

Networking changes

The HPC IB networking was largely re-cabled and the large Voltaire switch was replaced with 2 smaller, newer Mellanox switches. The architecture was also modified, so multi-node MPI jobs will have to be run more carefully, tho this should be invisible to you. While the IB network seems to be OK, high traffic may reveal problems, so let us know if you detect problems.

New Services Node and Fast Scratch

HPC was able to get a services node that will improve some services on the cluster. These services include the Galaxy web portal, pacbio, and more reporting features for all HPC users and groups (disk and cluster usage), as well as other services. This node will also offer extremely fast scratch storage space: 2TB of SSD storage in RAID 0 and 12TB of 10K drives in RAID 5 with a 100GB SSD cache for read/write.

The services node was going to be configured and set up in the new year. However, due to the hardware failure on /ffs and the lack of any spare hardware (no budget), the hardware within the services node will be used to bring stability to the cluster.

Closing Remarks

As always, we do not guarantee that your data will be where you left it yesterday, or even in an hour, so PLEASE BACK UP YOUR IMPORTANT DATA ELSEWHERE.

We hope you spend your holiday in a nicer place than we have spent the last week.

Happy Winter Break.

Harry, Joseph, and Adam.