faq:misc09 [2011/04/30 17:31] (current) – created clemens
 +<code>
 +[09] WHAT SHOULD I DO IF A SYSTEM CRASHES OR LOCKS UP?
  
 +     Hopefully this will not happen at all to you, but if you experience
 +     'lock ups' or 'freezes', please follow these steps to help prevent
 +     your own data loss.
 +
 +     Also, it is important to note that you do not have a direct connection
 +     to SDF and are most likely hopping through 10 or more networks to
 +     get to SDF.  You can use ping and traceroute to measure lag between
 +     your computer and SDF.  Your experience of lag on SDF is therefore
 +     subjective, and it is very important for you to understand that.
 +
 +     Typically a lockup will occur when you are trying to access a 
 +     file that is resident on the fileserver.  For instance, say you
 +     are trying to cat a file and instead of seeing the contents you
 +     get either nothing or a message similar to:
 +
 +     ol1:/sys: not responding
 +
 +     Be patient: the fileserver will recover shortly and your task
 +     will be completed.  You will probably see:
 +
 +     ol1:/sys: is alive again
 +
 +     which means your request will actually begin to be processed.
 +
 +     During the hang time, you can use ^T (CTRL T) to display the
 +     status of your job.  For instance:
 +
 +     load: 2.04  cmd: tail 12966 [select] 0.00u 0.00s 0% 808k
 +
 +     [select] is the current state of process ID 12966, which
 +     is the 'tail' program.  If the system is waiting on actual
 +     disk I/O, you'll probably see [biowait].  In cases of a hang
 +     you may see either [nfsrcvlk] (Network File System Receive Lock)
 +     or [vnlock] (Virtual Node Lock).  The system will usually
 +     recover from these, but if the state is prolonged it can indicate
 +     a serious resource problem on the NFS client.
 +     
 +     In the event that the fileserver becomes unavailable, it is 
 +     important that you do not become impatient and interrupt, quit 
 +     or suspend your jobs (^C, ^\ or ^Z) but rather, wait them out.
 +     If you are patient your chances of losing data will be
 +     significantly reduced.  The fileserver will usually respond
 +     within a few seconds and rarely takes longer.  If the problem
 +     lies with the NFS client (vnlock for more than, say, 20
 +     seconds), that particular host will most likely need to be reset.
 +
 +     More on this.  SDF is pushing NetBSD to its limits and we are
 +     currently (2003-2004) doing quite a bit of investigation with
 +     the uvm/vfs/vnode code developers to help NetBSD become scalable
 +     in high usage situations such as the loads we experience on SDF.
 +     Solutions we find will be incorporated into the public code.
 +</code>
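
The FAQ suggests ping and traceroute for measuring lag between your computer and SDF. Where those tools are not to hand, a rough stand-in is to time a TCP connection. The sketch below is not ping: it measures one TCP handshake, and the function names `connect_latency` and `median_latency` are made up for this illustration. The host and port are left as parameters because the FAQ does not name them.

```python
import socket
import time


def connect_latency(host: str, port: int, timeout: float = 5.0) -> float:
    """Return the seconds taken to open one TCP connection to host:port."""
    start = time.monotonic()
    # create_connection raises OSError if the host is unreachable or
    # the attempt exceeds the timeout.
    with socket.create_connection((host, port), timeout=timeout):
        elapsed = time.monotonic() - start
    return elapsed


def median_latency(host: str, port: int, samples: int = 5,
                   timeout: float = 5.0) -> float:
    """Take several connection timings and return the median, in seconds."""
    times = sorted(connect_latency(host, port, timeout) for _ in range(samples))
    return times[len(times) // 2]
```

Calling `median_latency(host, port)` with the host and port you normally log in to gives a single number to compare from day to day. A median is used so that one slow sample does not dominate the result.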
 +
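
The ^T status line shown in the FAQ can also be picked apart programmatically, for example when saving it to a log. The following is a minimal sketch that assumes the line always has the shape of the FAQ's sample (`load: ... cmd: ... [state] ...`). The names `parse_status`, `describe` and `STATE_NOTES` are hypothetical, and the state descriptions are taken only from the FAQ text.

```python
import re
from typing import NamedTuple, Optional

# Pattern modelled on the single sample line in the FAQ; other systems
# may print a different layout.
STATUS_RE = re.compile(
    r"load:\s*(?P<load>\d+\.\d+)\s+cmd:\s*(?P<cmd>\S+)\s+(?P<pid>\d+)\s+"
    r"\[(?P<state>[^\]]+)\]"
)


class Status(NamedTuple):
    load: float
    cmd: str
    pid: int
    state: str


# Notes paraphrased from the FAQ; states not listed there get no note.
STATE_NOTES = {
    "biowait": "waiting on disk I/O",
    "nfsrcvlk": "NFS lock; the fileserver may be hanging",
    "vnlock": "vnode lock; the fileserver or NFS client may be hanging",
}


def parse_status(line: str) -> Optional[Status]:
    """Parse a ^T status line, or return None if it does not match."""
    m = STATUS_RE.search(line)
    if m is None:
        return None
    return Status(float(m["load"]), m["cmd"], int(m["pid"]), m["state"])


def describe(status: Status) -> str:
    """Build a one-line summary of the process and its wait state."""
    note = STATE_NOTES.get(status.state, "no note in the FAQ for this state")
    return f"{status.cmd} (pid {status.pid}) is in [{status.state}]: {note}"
```

Feeding it the FAQ's sample line yields the command `tail`, process ID 12966 and state `select`.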
 +[[misc|back]]