News & Announcements
UPDATE - Filesystem problems on NIH Biowulf (Biowulf)
Date: 17 June 2011 14:06:52
From: steven fellini (sfellini@NIH.GOV)
This is an update to the intermittent file system problems on the NIH Biowulf cluster occurring between May 29 and June 14.

On June 14, a software fix provided by IBM was applied to the GPFS file system code. Since that time the filesystem has been stable and no fileserver panics have occurred.

Thanks for your patience during the period of filesystem instability.

Original message:

> From: steven fellini
> Date: June 9, 2011 9:07:25 AM EDT
> To: biowulf-users@helix.nih.gov
> Subject: Filesystem problems on NIH Biowulf
>
> Biowulf users have probably noticed pauses while trying to
> access their data directories over the past week. In some
> cases, jobs have failed with "Stale NFS file handle" errors.
> We apologize for this disruption to your work.
>
> During the downtime on April 26, a new version of IBM's
> GPFS file system was installed on one of the Biowulf storage
> systems, with the intent of improving reliability and
> availability. Unfortunately, beginning on May 29 something
> in our job mix has uncovered a bug in the code which is
> causing kernel panics in the fileservers. Most pauses
> encountered by users occur during fileserver failovers.
>
> The Biowulf staff is actively pursuing a resolution to the
> problem with the storage vendor and with IBM. We will
> follow up this email with another once the problem has
> been resolved.

