faq:misc09 [2011/04/30 17:31] (current)
clemens created
 +<​code>​
 +[09] WHAT SHOULD I DO IF A SYSTEM CRASHES OR LOCKS UP?
  
 +     Hopefully this will not happen to you at all, but if you experience
 +     'lock ups' or 'freezes', please follow these steps to help prevent
 +     losing your data.
 +
 +     Also, it is important to note that you do not have a direct connection
 +     to SDF and are most likely hopping through 10 or more networks to
 +     get to SDF.  You can use ping and traceroute to measure lag between
 +     your computer and SDF.  So your experience of lag on SDF is subjective:
 +     it depends on your own network path, and it is very important for
 +     you to understand that.
 +
 +     Typically a lockup will occur when you are trying to access a
 +     file that is resident on the fileserver.  For instance, say you
 +     are trying to cat a file and instead of seeing the contents you
 +     get either nothing or a message similar to:
 +
 +     ol1:/sys: not responding
 +
 +     Be patient; the fileserver will recover shortly and your task
 +     will be completed.  You will probably see:
 +
 +     ol1:/sys: is alive again
 +
 +     which means your request will actually begin to be processed.
 +
 +     During the hang time, you can use ^T (CTRL-T) to display the
 +     status of your job.  For instance:
 +
 +     load: 2.04  cmd: tail 12966 [select] 0.00u 0.00s 0% 808k
 +
 +     [select] is the current state of process ID 12966, which
 +     is the 'tail' program.  If the system is waiting on actual
 +     disk I/O, you'll probably see [biowait].  In the case of a hang
 +     you may see either [nfsrcvlk] (Network File System Receive Lock)
 +     or [vnlock] (Virtual Node Lock).  The system will usually
 +     recover from these, but if the state is prolonged it can indicate
 +     a serious resource problem on the NFS client.
 +     
 +     In the event that the fileserver becomes unavailable, it is
 +     important that you do not become impatient and interrupt, quit
 +     or suspend your jobs (^C, ^\ or ^Z) but rather wait them out.
 +     If you are patient, your chances of losing data will be
 +     significantly reduced.  The fileserver will usually respond
 +     within a few seconds and rarely takes longer.  If the problem
 +     lies with the NFS client (vnlock for more than, say, 20
 +     seconds), that particular host will most likely need to be reset.
 +
 +     More on this.  SDF is pushing NetBSD to its limits, and we are
 +     currently (2003-2004) doing quite a bit of investigation with
 +     the uvm/vfs/vnode code developers to help NetBSD become scalable
 +     in high-usage situations such as the loads we experience on SDF.
 +     Solutions we find will be incorporated into the public code.
 +</​code>​
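
The ^T guidance above can be sketched as a small script. This is only an illustration, assuming the status line always has the layout of the example shown in the FAQ; the names `parse_status` and `classify`, the returned labels, and the `seconds_in_state` parameter are hypothetical and not part of any SDF or NetBSD tool. The 20-second threshold is the rough figure the FAQ gives for vnlock.

```python
import re

# Matches a ^T status line of the form shown in the FAQ, e.g.
#   load: 2.04  cmd: tail 12966 [select] 0.00u 0.00s 0% 808k
# Only the fields the FAQ discusses are captured (assumed layout).
STATUS_RE = re.compile(
    r"load:\s*(?P<load>[\d.]+)\s+cmd:\s*(?P<cmd>\S+)\s+"
    r"(?P<pid>\d+)\s+\[(?P<state>[^\]]+)\]"
)


def parse_status(line):
    """Return load, command, pid and wait state, or None if no match."""
    m = STATUS_RE.search(line)
    if m is None:
        return None
    return {
        "load": float(m["load"]),
        "cmd": m["cmd"],
        "pid": int(m["pid"]),
        "state": m["state"],
    }


def classify(state, seconds_in_state=0.0):
    """Map a wait state to the FAQ's reading of it (labels are made up)."""
    if state == "biowait":
        return "disk-io"          # waiting on actual disk I/O
    if state == "vnlock" and seconds_in_state > 20:
        return "client-problem"   # host will most likely need a reset
    if state in ("nfsrcvlk", "vnlock"):
        return "hang"             # usually recovers; wait it out
    return "other"


info = parse_status("load: 2.04  cmd: tail 12966 [select] 0.00u 0.00s 0% 808k")
print(info["cmd"], info["pid"], classify(info["state"]))
```

Whatever the label, the advice in the FAQ stays the same: do not interrupt, quit or suspend the job while the fileserver is unresponsive.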
 +
 +[[misc|back]]