[geeks] Game GPU clusters for supercomputering

Fri May 23 14:06:47 CDT 2008

On Fri, May 23, 2008 at 03:09:56PM -0400, Shannon Hendrix wrote:

> >The GPU stuff is cool looking, but there are a lot of things I like  
> >better about the Cells.  The Cells seem a bit more straightforward  
> >to program.
> 
> I thought they seemed kind of weird, but then I've only read  
> documentation, not actually done either one.
> 
> I wish I had the opportunity.

Me too.

The Cells are weird, just not as weird as the GPUs.  The short of it is
that an SPE is basically one beefed-up standalone AltiVec unit.  It has
only vector registers.  Instead of AltiVec's 32 vector registers,
though, each SPE has 128 registers.  It mostly uses the same AltiVec
instructions, as I understand it, so C code written using GCC's vector
support or GCC's AltiVec intrinsics will compile directly for the SPE as
well, and it will take advantage of the extra registers, since GCC still
handles register allocation when you use the intrinsics.
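
To give an idea of what I mean by GCC's vector support, here is a rough
sketch using the generic vector_size attribute.  The function name
saxpy_v4 and the v4sf typedef are just names I made up for illustration,
and I have only the generic extension in mind here, not any
Cell-specific header.  The compiler picks the registers; the source
never names one.

```c
#include <assert.h>
#include <stddef.h>

/* Four packed 32-bit floats in one 128-bit value (GCC/Clang extension). */
typedef float v4sf __attribute__((vector_size(16)));

/* y[i] = a * x[i] + y[i], four floats per iteration.
 * nvec counts 128-bit vectors, not individual floats.
 * saxpy_v4 is a hypothetical name used only for this sketch. */
void saxpy_v4(float a, const v4sf *x, v4sf *y, size_t nvec)
{
    assert(x != NULL && y != NULL);

    v4sf va = {a, a, a, a};          /* broadcast the scalar to all lanes */
    for (size_t i = 0; i < nvec; i++)
        y[i] = va * x[i] + y[i];     /* element-wise multiply and add */
}
```

The same source should build for any target where GCC supports 128-bit
vectors, which is the appeal: the register count is the compiler's
problem, not yours.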

Where the SPE gets weird is that it doesn't have a bank of regular
32-bit registers.  As I understand it, non-vectored stuff, like a loop
counter, ends up taking up an entire 128-bit register with the upper
units being masked out.
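
A toy model of that, again in plain GCC vector-extension C rather than
real SPU code: the loop counter below lives in one lane of a 128-bit
value and the other three lanes just ride along unused.  Using lane 0 is
an arbitrary choice for the sketch, and sum_below is a made-up name.

```c
#include <assert.h>

/* Four 32-bit ints in one 128-bit value (GCC/Clang extension). */
typedef int v4si __attribute__((vector_size(16)));

/* Sum the integers 0 .. n-1 while keeping the loop counter in a single
 * lane of a 128-bit value.  Lanes 1..3 are never read; they only
 * illustrate the space a scalar costs when all registers are vectors. */
int sum_below(int n)
{
    assert(n >= 0);

    v4si counter = {0, 0, 0, 0};
    int total = 0;
    while (counter[0] < n) {
        total += counter[0];
        counter[0] += 1;
    }
    return total;
}
```

So a 4-byte counter occupies a 16-byte slot.  With 128 registers to
burn that is probably not a big deal, but it is a different way of
thinking.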

The other odd part is that SPEs don't have direct access to main
memory.  They can only use their own local memory.  On the PS3 chips,
that is 256K of local memory per SPE.  Communicating with the outside
world is accomplished by triggering DMA transfers between the SPE memory
and outside memory.  SPEs can't take general interrupts directly.
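
The programming pattern that falls out of this is "pull a chunk in,
work on it, push it back".  Here is a sketch that simulates it in
portable C.  To be clear about what is assumed: dma_get and dma_put are
hypothetical stand-ins that just call memcpy, the chunk size is
arbitrary, and none of this is the real Cell SDK interface.  Only the
256K figure comes from above.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define LOCAL_STORE_BYTES  (256 * 1024)  /* per-SPE local memory */
#define LOCAL_STORE_FLOATS (LOCAL_STORE_BYTES / sizeof(float))
#define CHUNK_FLOATS       4096          /* arbitrary 16K chunk */

/* Stand-in for one SPE's local memory. */
static float local_store[LOCAL_STORE_FLOATS];

/* Hypothetical DMA "get": main memory -> local memory (memcpy here). */
static void dma_get(size_t ls_index, const float *main_mem, size_t count)
{
    assert(ls_index + count <= LOCAL_STORE_FLOATS);
    memcpy(&local_store[ls_index], main_mem, count * sizeof(float));
}

/* Hypothetical DMA "put": local memory -> main memory (memcpy here). */
static void dma_put(float *main_mem, size_t ls_index, size_t count)
{
    assert(ls_index + count <= LOCAL_STORE_FLOATS);
    memcpy(main_mem, &local_store[ls_index], count * sizeof(float));
}

/* Scale an array that lives in "main memory".  The compute loop only
 * ever touches local_store; main memory is reached through the two
 * transfer helpers. */
void scale_in_chunks(float *main_mem, size_t n, float factor)
{
    for (size_t done = 0; done < n; done += CHUNK_FLOATS) {
        size_t left = n - done;
        size_t count = left < CHUNK_FLOATS ? left : CHUNK_FLOATS;

        dma_get(0, main_mem + done, count);
        for (size_t i = 0; i < count; i++)
            local_store[i] *= factor;
        dma_put(main_mem + done, 0, count);
    }
}
```

On real hardware the transfers are asynchronous, so you would normally
use two buffers and overlap the next transfer with the current compute.
I left that out to keep the sketch short.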

As you can imagine, between the 128 x 128-bit register file (2K) and the
256K of SRAM, a context switch on an SPE is horrendous.  This means that
in a multitasking environment SPEs can't really be shared between
applications, so if you have a video player that uses 2 SPEs, it grabs
those SPEs and doesn't release them until you stop it.

On a side note, I just discovered that Toshiba has something called the
SpursEngine.  It is a chip with 4 SPEs and no PPE (the PPE is the plain
PPC core at the center of the Cell).

> >The Cells can take a lot more memory (first generation is 2 gigs per  
> >chip instead of 1.5 gigs, but the new model that just started  
> >shipping can take 32gigs per chip).
> 
> Yes, but the Cell doesn't actually address that memory directly.  The  
> way I understand it, they only address about 256K, and there are  
> instructions that treat the rest of the memory as an incoming stream  
> or something like that.

The Cell is the entire chip, and it can address all of the memory
directly.  The PPE (PPC core) accesses all memory, and it DMAs into and
out of the SPEs' 256K scratch areas.

> I have an article on how to program them efficiently, and it seems  
> anything but straight-forward to me.

Have you looked in depth into programming CUDA chips though?

> >The Cells don't require a PC next to them.
> 
> I thought a front-end was a requirement for most applications though,  
> since you don't want to write a lot of the I/O handling part of your  
> app on the Cell.

Remember, the Cell is the entire chip, including the large
general-purpose PPC core.

> >With the Cell you have a lot of documentation and you have GCC,  
> >which means you have C, C++, Objective C, Fortran, and Ada.  I'm not  
> >sure that C++, Objective C or Ada are practical on the SPUs, but  
> >Fortran has got to be a big deal for a lot of people.  Oh, and we  
> >have assembly.  And having assembly means that we can get a lot more  
> >languages for it.
> 
> Heh... I'm not sure lacking GNU C support is a bad thing.
> 
> GNU compilers are only good in so far as they are widely available.  I  
> personally would love to see better compilers without the GPL baggage.

LLVM is coming along, and LLVM also has SPE support for the Cell.  LLVM
does a lot of cool stuff.  Apple is using it for purposes ranging from
shader code to speeding up the front-end processing of distcc.  I can't
really say how good LLVM's vectorizer is, though.  Not that I'm at all
impressed with the vectorizer in GCC.