[09] WHAT SHOULD I DO IF A SYSTEM CRASHES OR LOCKS UP?

     With luck this will never happen to you, but if you do experience
     'lock-ups' or 'freezes', please follow these steps to reduce the
     risk of losing your data.

     Keep in mind that you do not have a direct connection to SDF.  You
     are most likely hopping through 10 or more networks to get here,
     and any one of them can add delay.  You can use ping and traceroute
     to measure the lag between your computer and SDF.  The lag you
     experience therefore depends on your own network path and is not
     necessarily a problem on SDF itself.
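
     As a rough sketch of that measurement, the helper below runs both
     tools against one host.  The function name measure_lag and the
     default of 5 probes are assumptions made for illustration; HOST is
     whichever SDF host you normally log in to.  Run it on your own
     computer, not on SDF.

```shell
#!/bin/sh
# Hypothetical helper: measure_lag HOST [COUNT]
# HOST is the machine you connect to; COUNT is the number of ping probes
# (5 is an arbitrary default chosen for this sketch).
measure_lag() {
    host=$1
    count=${2:-5}

    if [ -z "$host" ]; then
        echo "usage: measure_lag HOST [COUNT]" >&2
        return 2
    fi

    # Round-trip time: ping reports min/avg/max in its closing summary.
    ping -c "$count" "$host"

    # Per-hop delay: traceroute lists each network between you and HOST.
    traceroute "$host"
}
```

     If ping shows high round-trip times, the traceroute listing shows
     which hop along the path introduces the delay.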

     Typically a lockup will occur when you are trying to access a 
     file that is resident on the fileserver.  For instance, say you
     are trying to cat a file and instead of seeing the contents you
     get either nothing or a message similar to:

     ol1:/sys: not responding

     Be patient.  The fileserver will recover shortly and your task
     will be completed.  You will probably then see:

     ol1:/sys: is alive again

     which means your request will actually begin to be processed.

     During the hang, you can press ^T (CTRL-T) to display the
     status of your job.  For instance:

     load: 2.04  cmd: tail 12966 [select] 0.00u 0.00s 0% 808k

     [select] is the current state of process ID 12966, which is the
     'tail' program.  If the system is waiting on actual disk I/O,
     you'll probably see [biowait].  During a hang you may see either
     [nfsrcvlk] (NFS receive lock) or [vnlock] (vnode lock).  The system
     will usually recover from both, but if either state persists it
     can indicate a serious resource problem on the NFS client.
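
     To make the reading of that status line concrete, here is a small
     sketch that pulls the bracketed state out of a saved ^T line and
     maps it to the notes above.  The function names wait_state and
     explain_state are hypothetical, and the wording of each message is
     ours; only the state names come from the text above.

```shell
#!/bin/sh
# wait_state LINE: print the text between [ and ] in a ^T status line.
# Prints nothing if the line has no bracketed state.
wait_state() {
    printf '%s\n' "$1" | sed -n 's/.*\[\([^]]*\)].*/\1/p'
}

# explain_state STATE: map a state name to the reading given above.
explain_state() {
    case $1 in
        biowait)  echo "waiting on disk I/O" ;;
        nfsrcvlk) echo "waiting on an NFS lock; usually recovers" ;;
        vnlock)   echo "waiting on a vnode lock; usually recovers" ;;
        *)        echo "not one of the hang states described above" ;;
    esac
}

# The sample status line shown earlier.
line='load: 2.04  cmd: tail 12966 [select] 0.00u 0.00s 0% 808k'
state=$(wait_state "$line")
echo "$state: $(explain_state "$state")"
```

     The same two functions work on any line you copy from your own
     terminal after pressing ^T.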
     
     If the fileserver becomes unavailable, it is important that you
     stay patient: do not interrupt, quit or suspend your jobs (^C, ^\
     or ^Z), but wait them out instead.  Waiting significantly reduces
     your chances of losing data.  The fileserver usually responds
     within a few seconds and rarely takes longer.  When the problem
     lies with the NFS client (vnlock for more than, say, 20 seconds),
     that particular host will most likely need to be reset.
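
     The rule of thumb above can be written down as a short sketch.
     The function name hang_advice is hypothetical, and the 20-second
     limit is only the rough figure quoted above, not a kernel
     constant.  STATE is the bracketed state shown by ^T and SECONDS is
     how long you have watched the job sit in that state.

```shell
#!/bin/sh
# Rough limit taken from the text above; adjust to taste.
VNLOCK_LIMIT=20

# hang_advice STATE SECONDS: suggest what to do about a hung job.
hang_advice() {
    if [ "$1" = "vnlock" ] && [ "$2" -gt "$VNLOCK_LIMIT" ]; then
        echo "likely an NFS client problem; the host may need a reset"
    else
        echo "wait it out; do not interrupt, quit or suspend the job"
    fi
}

hang_advice nfsrcvlk 5
hang_advice vnlock 45
```

     Only a vnlock state lasting longer than the limit yields the reset
     message; every other case comes back to waiting.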

     Some background: SDF is pushing NetBSD to its limits, and we are
     currently (2003-2004) working with the uvm/vfs/vnode code developers
     to investigate how NetBSD can scale to high-usage loads such as
     those we experience on SDF.  Solutions we find will be incorporated
     into the public code.