[12] WHEN ARE THE SYSTEM MAINTENANCE WINDOWS? WHY THE LOW UPTIME?

     Typically the SDF Public Access UNIX System is available to its
     members and, in some cases, the general public 24 hours a day,
     7 days a week, 365 days a year, 10 years a decade, 25 years a
     quarter century .. and so on.

     That being said, there are unforeseen issues that can cause the
     system to become unavailable:

        1.  Hard Disk Crash - We have several spare drives, some of
            them already plugged in and ready to be used.  In the
            best case scenario no maintenance window is required.

        2.  Fire - In the case of fire all SDF machines must be shut
            down unless the fire is an isolated occurrence.

        3.  Natural Disaster - In the Spring (Apr-May) we do get
            affected by lightning strikes in our area due to heavy
            thunderstorms.  Best case scenario, the UPS systems filter
            the spikes and dips, which allows SDF to run uninterrupted.

        4.  Software Bug - These do crop up from time to time and are
            usually related to system updates.  On SDF we typically
            let the public access machines lag behind NetBSD
            development in order to test new releases in our lab before
            subjecting the userbase to 'new bugs'.

        5.  Routine and Scheduled Maintenance - Please read below.

        6.  Hardware Component Failure - We have many spare machines,
            some completely cabled up and ready to go at the flick of
            a remote command.  If an SDF client host becomes completely
            unrecoverable, a spare can be put into operation within 
            minutes.  Keep in mind that while all of your personal files
            are hosted on the file server, the /tmp directory is exclusive
            to each SDF client host.  

     ROUTINE AND SCHEDULED MAINTENANCE

     There is a weekly maintenance window on Sunday mornings from 02:00
     until 03:00.  This window is not always used, and when it is, it is
     used very briefly.  Five minutes prior to a shutdown or runlevel
     transition, all logged-in members will be notified on their terminals.
     If you see this message alerting you to system maintenance, you should
     save all open files and prepare to log out.
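
     As a rough illustration, a script could check whether a given moment
     falls inside the weekly window before starting a long job.  This is
     only a sketch: the function name is hypothetical, and it assumes the
     window is expressed in the server's local time, which is not stated
     here.

```python
from datetime import datetime, time

# Window taken from the text: Sundays, 02:00 to 03:00.
# Assumption: times are in the server's local time zone.
WINDOW_WEEKDAY = 6          # datetime.weekday(): Monday is 0, Sunday is 6
WINDOW_START = time(2, 0)
WINDOW_END = time(3, 0)


def in_maintenance_window(moment: datetime) -> bool:
    """Return True if `moment` falls inside the weekly window (end exclusive)."""
    return (moment.weekday() == WINDOW_WEEKDAY
            and WINDOW_START <= moment.time() < WINDOW_END)
```

     Calling in_maintenance_window(datetime.now()) before a long-running
     task gives a hint, but the terminal notice and /etc/motd remain the
     actual source of truth.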

     Scheduled maintenance is always announced several days in advance on
     the bboard in the  board.  If that maintenance window
     requires extended time (basically anything over 5 to 10 minutes) the
     /etc/motd file (displayed at login) will note the details of the event.

     Scheduled maintenance is really only used when hardware upgrades have
     to take place.  In most cases, software updates can occur while the
     systems are up and available.

     WHY THE LOW UPTIME?

     Uptime is relative.  What we're after is 'high availability'.  This
     means that our goal is to have the servers answering at least 99.9%
     of the time.  In the 20+ years of service SDF has been able to meet
     this goal.  The most uptime you'll see on any given server will be
     about 3 to 4 weeks.  After 3 weeks, maintenance becomes necessary.
     This helps with clearing buffers, caches and other inconsistencies 
     that can occur as the systems run from cold or warm boot.  Rather
     than waiting for the system to fail due to a kernel panic or a hang,
     we perform a warm boot during the weekly maintenance window, which
     takes 5 minutes or less.  Keep in mind, this doesn't occur
     weekly but usually after 3 to 4 weeks of continuous uptime.
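
     The arithmetic behind this can be sketched as follows.  The figures
     are the ones quoted above (99.9% availability, a 5 minute warm boot
     every 3 weeks); the function names are made up for illustration, and
     the calculation ignores unplanned outages entirely.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600


def downtime_budget_minutes(availability: float) -> float:
    """Minutes per year a host may be down and still meet `availability`."""
    return MINUTES_PER_YEAR * (1.0 - availability)


def planned_reboot_minutes(interval_weeks: float, minutes_per_reboot: float) -> float:
    """Minutes per year spent on warm boots done every `interval_weeks`."""
    reboots_per_year = 52 / interval_weeks
    return reboots_per_year * minutes_per_reboot


budget = downtime_budget_minutes(0.999)   # about 525.6 minutes per year
planned = planned_reboot_minutes(3, 5)    # about 86.7 minutes per year
```

     Under these assumptions the planned warm boots consume roughly a
     sixth of the yearly downtime budget, which is why short uptimes and
     high availability are compatible.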

     Why is this necessary? (aka "My box runs for years under my desk").
     We too have very low-usage, non-public NetBSD systems that run for years
     without requiring a reboot.  However, SDF is extremely high volume with
     sophisticated NFS, NIS and VNODE caching.  While these do not cause
     problems with light loads, with 40,000 active users they become an
     issue.  Again, our goal is high availability, which doesn't necessarily
     have to translate into long uptimes.
