faq:misc12 [2011/04/30 17:33] (current) – created clemens
<code>
[12] WHEN ARE THE SYSTEM MAINTENANCE WINDOWS? WHY THE LOW UPTIME?

     Typically the SDF Public Access UNIX System is available to its
     members and, in some cases, the general public 24 hours a day,
     7 days a week, 365 days a year, 10 years a decade, 25 years a
     quarter century .. and so on.

     That being said, there are unforeseen issues that can cause the
     system to become unavailable:

        1.  Hard Disk Crash - We have several spare drives, some of
            them already plugged in and ready to be used.  In the
            best case scenario no maintenance window is required.

        2.  Fire - In the case of fire, all SDF machines must be shut
            down unless the fire is an isolated occurrence.

        3.  Natural Disaster - In the Spring (Apr-May) we do get
            affected by lightning strikes in our area due to heavy
            thunderstorms.  In the best case scenario the UPS systems
            filter the spikes and dips, which allows SDF to run
            uninterrupted.

        4.  Software Bugs - These do crop up from time to time and are
            usually related to system updates.  On SDF we typically
            let the public access machines lag behind NetBSD
            development in order to test new releases in our lab before
            subjecting the userbase to 'new bugs'.

        5.  Routine and Scheduled Maintenance - Please read below.

        6.  Hardware Component Failure - We have many spare machines,
            some completely cabled up and ready to go at the flick of
            a remote command.  If an SDF client host becomes completely
            unrecoverable, a spare can be put into operation within
            minutes.  Keep in mind that while all of your personal files
            are hosted on the file server, the /tmp directory is exclusive
            to each SDF client host.
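
     To illustrate point 6 above: anything written under /tmp lives
     only on the client host you are logged into, while your home
     directory is served from the shared file server.  A minimal
     sketch (the file name here is only an example):

```shell
# /tmp is local to this client host; your home directory is not.
scratch="/tmp/scratch.$$"              # $$ expands to this shell's PID
echo "host-local data" > "$scratch"    # exists on this host only
cat "$scratch"                         # prints: host-local data
rm -f "$scratch"                       # tidy up the scratch file
```

     Logging into a different SDF client host and looking for the
     same /tmp path would find nothing, while files under $HOME are
     visible from every client host.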

     ROUTINE AND SCHEDULED MAINTENANCE

     There is a weekly maintenance window on Sunday mornings from
     02:00 to 03:00.  This window is not always used, and when it is,
     it is used very briefly.  Five minutes prior to a shutdown or
     runlevel transition, all logged-in members will be notified on
     their terminals.  If you see this message alerting you to system
     maintenance, you should save all open files and prepare to log out.
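
     On NetBSD the warning and the reboot are typically a single
     shutdown(8) invocation; the grace period is what produces the
     terminal broadcast described above.  A sketch of the usual
     mechanism (not necessarily SDF's exact procedure, and it
     requires root privileges):

```shell
# Broadcast the message to every logged-in terminal, wait 5 minutes,
# then reboot; shutdown(8) repeats the warning as the time approaches.
shutdown -r +5 "Weekly maintenance window: warm boot in 5 minutes"
```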

     Scheduled maintenance is always announced several days in advance
     on the bboard.  If that maintenance window requires extended time
     (basically anything over 5 to 10 minutes), the /etc/motd file
     (displayed at login) will note the details of the event.

     Scheduled maintenance is really only used when hardware upgrades
     have to take place.  In most cases, software updates can occur
     while the systems are up and available.
WHY THE LOW UPTIME?

     Uptime is relative.  What we're after is 'high availability'.
     This means that our goal is to have the servers answering at
     least 99.9% of the time.  In the 20+ years of service, SDF has
     been able to meet this goal.  The most uptime you'll see on any
     given server will be about 3 to 4 weeks, after which maintenance
     is necessary.  This helps with clearing buffers, caches and other
     inconsistencies that can accumulate as the systems run.  Rather
     than waiting for the system to fail due to a kernel panic or a
     hang, a warm boot is performed during the weekly maintenance
     window, which takes roughly 5 minutes or less.  Keep in mind,
     this doesn't occur weekly but usually after 3 to 4 weeks of
     linear uptime.

     Why is this necessary? (aka "My box runs for years under my desk.")
     We too have very low usage, non-public NetBSD systems that run for
     years without requiring a reboot.  However, SDF is an extremely
     high volume system with sophisticated NFS, NIS and VNODE caching.
     While these do not cause problems under light loads, with 40,000
     active users they become an issue.  Again, our goal is high
     availability, which doesn't necessarily translate into long uptimes.
</code>

[[misc|back]]