[rescue] web server loadbalancing...

Greg A. Woods rescue at sunhelp.org
Fri Aug 3 09:55:22 CDT 2001


[ On Friday, August 3, 2001 at 10:45:59 (+0100), David Cantrell wrote: ]
> Subject: Re: [rescue] web server loadbalancing...
>
> You are assuming that there is someone on site who could do this.  Lots of
> big sites do not have this.  Sure, there are people on site, but they don't
> have access to the racks for reasons of security.  And in plenty of
> instances, even 5 minutes of downtime is unacceptable.  As far as *we* are
> concerned, no downtime is acceptable.  So we have redundant everything.

Since I know nothing of your systems and their configurations, please
don't take this personally or as an attack, but:

Do you really have "redundant everything", or does it just look that way
on paper?  I.e. have you gone in and randomly removed, turned off,
disconnected, or otherwise simulated failure of, half of everything in
your system?  How many times have you done that?  Have you ensured every
test is truly random and that nobody who knows anything whatsoever about
the system's design can influence the order or type of failures?

I was very impressed when I read an account of a casino IT manager being
certain enough of at least his network infrastructure that he'd invite
anyone on a tour of the facilities to pull up to five patch cords of a
specific colour from his switch array so long as they didn't pull all
such coloured cords from any one switch (these being the ones that
interconnected the switches in a full mesh, thus at least partially
simulating switch failures, though without actually cutting anyone off
the network).

If you don't think you can do that then your redundancy design was a
waste of money, time, and effort.

Now what would be really interesting is to have an actuary come in and
take the MTBF numbers for each component and every user-accessible
cable, switch, plug, etc., work out weightings for the failure of each
device, and then go in and simulate such failures randomly but at the
weighted frequency on *all* equipment, using MTTR numbers to weight
when to "fix" the simulated failures again.  It should then be possible
to compress a reasonable simulation of a full lifetime down into a
"program" you could run to test your design in a reasonable real-time
period.  In an ideal world you'd continually run that test over and
over and over to catch any changes that would affect the redundancy in
the design.
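
(For what it's worth, here's a minimal sketch in Python of the kind of
weighted failure "program" I have in mind.  The component names, the
MTBF/MTTR figures, and the outage criterion are all invented for
illustration -- a real run would plug in your own inventory and the
actuary's numbers, and "failing" a device would mean actually pulling
its cable or power, not just flipping a flag in memory.)

    import random

    # (name, MTBF in hours, MTTR in hours) -- made-up figures, purely
    # for illustration
    COMPONENTS = [
        ("switch-a", 50_000.0, 4.0),
        ("switch-b", 50_000.0, 4.0),
        ("patch-1", 200_000.0, 0.5),
        ("patch-2", 200_000.0, 0.5),
    ]

    def simulate(lifetime_hours=10 * 8760, step=1.0, seed=1):
        """Walk through a simulated lifetime one step at a time,
        failing and repairing each component at random with
        probabilities derived from its MTBF and MTTR."""
        rng = random.Random(seed)
        up = {name: True for name, _, _ in COMPONENTS}
        outage_hours = 0.0
        t = 0.0
        while t < lifetime_hours:
            for name, mtbf, mttr in COMPONENTS:
                if up[name]:
                    # chance the device fails during this step
                    if rng.random() < step / mtbf:
                        up[name] = False
                else:
                    # chance the device gets "fixed" during this step
                    if rng.random() < step / mttr:
                        up[name] = True
            # call it an outage if fewer than half the components are
            # up; a real test would check end-to-end service instead
            if sum(up.values()) < len(COMPONENTS) / 2:
                outage_hours += step
            t += step
        return outage_hours

    if __name__ == "__main__":
        print("simulated outage hours:", simulate())

The point of the exercise isn't the arithmetic, of course -- it's that
the schedule of simulated failures comes out of the MTBF/MTTR numbers
and a random seed, not out of anyone's knowledge of the design.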

I guess if I were asked to do that I might lamely argue that running
such tests would actually decrease the MTBF, but I'd bet any
insurance-type person wouldn't care and would want the tests run
anyway.

> And speaking as an ordinary user, then five minutes of downtime is also
> unacceptable and I will go elsewhere.

If ordinary users already have that perception then the Internet is
already doomed long before it even gets started.

If five minutes of downtime makes that much difference to you in your
real life, and if no other factors influence your choice of "service
provider", then I think you're living in a dream world too!

Why do people in general seem to have such stupid misconceptions when
computers are involved?  Most sane people don't get so upset in real
life.  Computers are real machines with real failure modes -- they're
not shining ideals of perfection that can never fail.

(the same goes for risk analysis, though of course people in general are
already very very bad at doing risk analysis in real life -- they're
just worse at it in computing and networking)

-- 
							Greg A. Woods

+1 416 218-0098      VE3TCP      <gwoods at acm.org>     <woods at robohack.ca>
Planix, Inc. <woods at planix.com>;   Secrets of the Weird <woods at weird.com>


