[geeks] computer room gallery 8-)

Peter L. Wargo geeks at sunhelp.org
Fri Jan 4 23:27:52 CST 2002


On Fri, 4 Jan 2002, Eric Dittman wrote:

> I agree that designing cache without ECC was stupid.  I heard about the
> NDA being required for a fix in the early days from several places that
> I trust to have the details.

The NDAs were employed while various early fixes were being tried.  I was
involved, first as a customer, then as an employee of Sun's high-end server
group.  (I recently switched to the RAS labs, and am now part of R&D.)  I
agree that Sun was foolish (I was royally pissed), but I am under the
impression that the intent was not to cover up a problem; rather, they were
floundering for a solution and didn't want to jump the gun.  I've been
using Sun equipment since 1987 or so, and they are generally pretty good
about being honest.  (They did produce some royal crap, like the 386i and
Solaris 2.0...)

> I think requiring an NDA while investigating is terrible service.  I've
> never had to sign an NDA to get a vendor to investigate or debug a problem.
> I don't think blaming the problems on the environment was any more than a
> delaying factor.

I thought so too.  However, I spent a good amount of time analyzing
this and other problems.  I saw some pretty out-of-spec datacenters.  Like
the one with an open door to the outside, and the one that was running
at 65F and 80% humidity...  While it certainly wasn't the cause of the
problem, a poor datacenter environment really brought problems like this
and others to light.  (Makes sense, like margin testing.)

> There also appear to have been a couple of revised modules which didn't
> actually fix the problem as the cache was mirrored but still didn't
> have ECC.  There was also the fix that Sun produced that impacted
> performance.

ECC didn't come along until the UltraSPARC III.  The mirrored SRAM works
quite well, as the chance of getting parity hits on the same bit in two
modules is about the same as that of me voting for any of the George
Bushes.
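
To illustrate why mirroring parity-protected SRAM helps, here is a rough
sketch in Python.  It is only a toy model under my own assumptions, not
Sun's actual hardware logic: the names (parity, MirroredLine, flip_bit)
are hypothetical, and a real cache does this per line in silicon.  Parity
alone can detect a single flipped bit but cannot correct it; with a mirror,
a read that fails parity on one copy can fall back to the other.

```python
def parity(word: int) -> int:
    """Return 1 if the word has an odd number of set bits, else 0."""
    return bin(word).count("1") & 1


class MirroredLine:
    """Toy model: one cache word stored twice, each copy with a parity bit."""

    def __init__(self, word: int) -> None:
        # Each copy is [data, stored_parity]; both start out identical.
        self.copies = [[word, parity(word)], [word, parity(word)]]

    def flip_bit(self, copy: int, bit: int) -> None:
        # Simulate a single-bit upset in the data of one copy.
        self.copies[copy][0] ^= 1 << bit

    def read(self) -> int:
        # Return the first copy whose data still matches its parity bit.
        for word, stored in self.copies:
            if parity(word) == stored:
                return word
        # Only reached when both copies fail parity at the same time.
        raise RuntimeError("parity error in both copies")
```

In this model a read only fails when both copies of the same word are
corrupted at once, which is the unlikely coincidence described above.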

> I hope they got new architects for their CPUs.  The design problems they
> had with the CPU module was not consistent with their earlier work.

The CPU isn't really at fault.  The UltraSPARC I and II are good chips,
and in certain modules perform well - the Ultra-1 is a great box, and the
CPU modules in the U2/30/60/etc. are really solid.  I had an E4000 with the
1M cache 250's that ran for 3 years with 2 unscheduled downtimes, both of
which were disk failures.  The problem was poor planning for the scaling of
the cache, and Sun suffered (as they should have) as a result.

However (and I am not being a corporate shill), Sun learns well from their
mistakes.  I know that the focus is on product design and quality.  I was
really moved by Scott's speech when he told us: "We are no longer in the
mission-critical business, we are in the life-critical business."  (Meaning
911 systems and the like.)  Sun had to adjust their thinking from the way
they always did business and design before.

-Pete


