faq:misc12 [2011/04/30 17:33] (current)
clemens created
Line 1: Line 1:
 +     Typically the SDF Public Access UNIX System is available to its
 +     members and, in some cases, the general public 24 hours a day,
 +     7 days a week, 365 days a year, 10 years a decade, 25 years a
 +     quarter century .. and so on.
 +     That being said, there are unforeseen issues that can cause the
 +     system to become unavailable:
 +        1.  Hard Disk Crash - We have several spare drives, some of
 +            them already plugged in and ready to be used.  In the
 +            best case scenario no maintenance window is required.
 +        2.  Fire - In the case of fire all SDF machines must be shut
 +            down unless the fire is an isolated occurrence.
 +        3.  Natural Disaster - In the Spring (Apr-May) we do get
 +            affected by lightning strikes in our area due to heavy
 +            thunderstorms.  In the best case the UPS systems filter
 +            the spikes and dips, which allows SDF to run uninterrupted.
 +        4.  Software Bug - These do crop up from time to time and are
 +            usually related to system updates.  On SDF we typically
 +            let the public access machines lag behind NetBSD
 +            development in order to test new releases in our lab before
 +            subjecting the userbase to 'new bugs'.
 +        5.  Routine and Scheduled Maintenance - Please read below.
 +        6.  Hardware Component Failure - We have many spare machines,
 +            some completely cabled up and ready to go at the flick of
 +            a remote command.  If an SDF client host becomes completely
 +            unrecoverable, a spare can be put into operation within
 +            minutes.  Keep in mind that while all of your personal files
 +            are hosted on the file server, the /tmp directory is exclusive
 +            to each SDF client host.
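Item 6 implies that anything left in /tmp stays on one client host and will not be there if a spare host takes over. A minimal sketch of picking a storage location accordingly is shown here. The function name `scratch_path` and the `scratch` directory under the home directory are hypothetical, chosen only for illustration; they are not SDF conventions.

```python
import tempfile
from pathlib import Path


def scratch_path(name, must_survive_host_swap, home=None):
    """Return a path for a working file.

    Files that must survive a move to another client host go under the
    home directory (served by the file server, per the FAQ). Throwaway
    files go to the host-local temporary directory.
    """
    if must_survive_host_swap:
        base = Path(home) if home is not None else Path.home()
        base = base / "scratch"  # hypothetical directory name
        base.mkdir(parents=True, exist_ok=True)
    else:
        # Host-local; contents are not shared between client hosts.
        base = Path(tempfile.gettempdir())
    return base / name
```

For example, `scratch_path("draft.txt", True)` would point inside the home directory, while `scratch_path("cache.bin", False)` would point at the local temporary directory.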
 +     There is a weekly maintenance window on Sunday mornings from
 +     02:00 AM until 03:00 AM.  This window is not always used, and when it
 +     is, it is used very briefly.  Five minutes prior to a shutdown or runlevel
 +     transition, all logged-in members will be notified on their terminals.
 +     If you see this message alerting you to system maintenance, you should
 +     save all open files and prepare to log out.
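To see when the next window falls, the schedule described above can be sketched in a few lines. This is only an illustration: it assumes the window is expressed in the server's local time, and the names `next_window` and `earliest_warning` are made up for this example. The warning time is the earliest possible one, assuming a shutdown right at the start of the window.

```python
from datetime import datetime, timedelta

WINDOW_WEEKDAY = 6                    # datetime.weekday(): Monday is 0, Sunday is 6
WINDOW_START_HOUR = 2                 # 02:00, assumed to be server local time
WINDOW_LENGTH = timedelta(hours=1)    # window runs until 03:00
WARNING_LEAD = timedelta(minutes=5)   # terminal notice before a shutdown


def next_window(now):
    """Return (start, end) of the current or next Sunday maintenance window."""
    days_ahead = (WINDOW_WEEKDAY - now.weekday()) % 7
    start = (now + timedelta(days=days_ahead)).replace(
        hour=WINDOW_START_HOUR, minute=0, second=0, microsecond=0)
    if start + WINDOW_LENGTH <= now:
        # This Sunday's window is already over; use next week's.
        start += timedelta(days=7)
    return start, start + WINDOW_LENGTH


def earliest_warning(now):
    """Earliest time a five-minute notice could appear for the next window."""
    start, _ = next_window(now)
    return start - WARNING_LEAD


if __name__ == "__main__":
    start, end = next_window(datetime.now())
    print("next window:", start, "to", end)
```

Since the window is not always used, this only gives the time at which a reboot could happen, not a guarantee that one will.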
 +     Scheduled maintenance is always announced several days in advance on
 +     the bboard in the  board.  If that maintenance window
 +     requires extended time (basically anything over 5 to 10 minutes), the
 +     /etc/motd file (displayed at login) will note the details of the event.
 +     Scheduled maintenance is really only used when hardware upgrades have
 +     to take place.  In most cases, software updates can occur while the
 +     systems are up and available.
 +     Uptime is relative.  What we're after is 'high availability'.  This
 +     means that our goal is to have the servers answering at least 99.9%
 +     of the time.  In its 20+ years of service SDF has been able to meet
 +     this goal.  The most uptime you'll see on any given server will be
 +     about 3 to 4 weeks.  After 3 weeks, performing maintenance becomes
 +     necessary.  This helps with clearing buffers, caches and other
 +     inconsistencies that can occur as the systems run from cold or warm
 +     boot.  Rather than waiting for the system to fail due to a kernel
 +     panic or a hang, a warm boot, which takes roughly 5 minutes or less,
 +     is performed during the weekly maintenance window.  Keep in mind,
 +     this doesn't occur weekly but usually after 3 to 4 weeks of
 +     continuous uptime.
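The 99.9% goal can be turned into a concrete downtime budget with a little arithmetic. The sketch below uses only figures from the text (99.9%, a warm boot of roughly 5 minutes, a reboot about every 3 weeks). The function names are illustrative, and the calculation counts planned reboots only, ignoring unplanned outages.

```python
MINUTES_PER_WEEK = 7 * 24 * 60     # 10080
MINUTES_PER_YEAR = 365 * 24 * 60   # 525600


def downtime_budget(target, period_minutes):
    """Minutes of downtime allowed in a period at a given availability target."""
    return (1.0 - target) * period_minutes


def availability(down_minutes, period_minutes):
    """Fraction of the period during which the system was answering."""
    return 1.0 - down_minutes / period_minutes


# Budget implied by a 99.9% target.
weekly_budget = downtime_budget(0.999, MINUTES_PER_WEEK)   # roughly 10 minutes
yearly_budget = downtime_budget(0.999, MINUTES_PER_YEAR)   # roughly 8.76 hours

# One 5-minute warm boot every 3 weeks, with no other downtime assumed.
reboot_availability = availability(5, 3 * MINUTES_PER_WEEK)

if __name__ == "__main__":
    print(f"weekly budget:  {weekly_budget:.2f} min")
    print(f"yearly budget:  {yearly_budget / 60:.2f} h")
    print(f"availability with one reboot per 3 weeks: {reboot_availability:.4%}")
```

Under these assumptions, a 5-minute reboot every 3 weeks uses well under the roughly 10 minutes per week that a 99.9% target allows, which is why short planned reboots are compatible with the availability goal.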
 +     Why is this necessary? (aka "My box runs for years under my desk").
 +     We too have very low usage non-public NetBSD systems that run for years
 +     without requiring a reboot.  However, SDF is extremely high volume with
 +     sophisticated NFS, NIS and VNODE caching.  While these do not cause
 +     problems with light loads, with 40,000 active users they become an
 +     issue.  Again, our goal is high availability, which doesn't necessarily
 +     have to translate into long uptimes.