[geeks] computer room gallery 8-)

Eric Dittman geeks at sunhelp.org
Sat Jan 5 00:57:53 CST 2002


> > I think requiring an NDA while investigating is terrible service.  I've
> > never had to sign an NDA to get a vendor to investigate or debug a problem.
> > I don't think blaming the problems on the environment was any more than a
> > delaying factor.
> 
> I thought so too.  However, I spent a good amount of time analyzing
> this and other problems.  I saw some pretty out-of-spec datacenters.  Like
> the one with an open door to the outside, and the one that was running
> at 65F and 80% humidity...  While it certainly wasn't the cause of the
> problem, a poor datacenter environment really brought problems like this
> and others to light.  (Makes sense, like margin testing.)

I've seen some pretty poor data centers, too.  However, I've seen systems
running in those data centers without any problems.

> > There also appear to have been a couple of revised modules which didn't
> > actually fix the problem as the cache was mirrored but still didn't
> > have ECC.  There was also the fix that Sun produced that impacted
> > performance.
> 
> ECC didn't come along until the UltraSPARC III.  The Mirrored SRAM works
> quite well, as the chance of getting parity hits on the same bit in two
> modules is about the same as that of me voting for any of the George
> Bushes.

After the mirrored caches were installed in eBay's systems, didn't they
still have some crashes related to cache corruption?

> > I hope they got new architects for their CPUs.  The design problems they
> > had with the CPU module were not consistent with their earlier work.
> 
> The CPU isn't really at fault.  The UltraSparc I and II are good chips,
> and in certain modules perform well - the Ultra-1 is a great box, and the
> CPU modules in the U2/30/60/etc. are really solid. I had an E4000 with the
> 1M cache 250's that ran for 3 years w/2 unscheduled downtimes, both were
> disk failures.  The problem was poor planning for the scaling of the
> cache, and Sun suffered (as they should have) as a result.

I can't really agree here.  The problem with memory cell errors due
to radiation was known long enough for the designers to have taken
that into account.
-- 
Eric Dittman
dittman at dittman.net
Check out the DEC Enthusiasts Club at http://www.dittman.net/


