[SunRescue] (off-topic) LIW etc: explanation

Martin Frost martin at dsres.com
Fri Sep 29 06:24:08 CDT 2000


Mike Hebel wrote:

[Apologies for the topic drift, and the length of this post...]

> VLIW? ILW?  i860? Itanic? (Titanic?)
> *head in hands* I'm so confused!!?!!

LIW = Long Instruction Word
VLIW = Very Long Instruction Word

There are effectively two ways to make a CPU go faster: make it
do more cycles per second, or make it do more work per cycle.

At the moment, semiconductor technology is advancing rapidly,
so we are getting faster and faster clock speeds without the
chip designs really changing very much.

There have been a number of points in history where semiconductor
technology was lagging behind, and so CPU designers had to squeeze
out more work per cycle to keep the chips getting faster.

This has to be done using some form of parallelism. Pipelining
is one form: it overlaps all the various stages that go into
executing a single instruction. If done properly, this can
bring the speed up to one instruction per cycle (compared with
typically four or more cycles for a single instruction on older
CPUs).
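
As a rough illustration of why that matters, here is a
back-of-the-envelope sketch in C (my own made-up numbers, purely
the cycle arithmetic for a hypothetical four-stage pipeline):

	/* Back-of-the-envelope sketch, hypothetical numbers: with S
	 * pipeline stages and N instructions, an unpipelined CPU takes
	 * about N*S cycles, while a pipeline that starts one new
	 * instruction every cycle takes S + (N - 1) cycles. */
	#include <stdio.h>

	int main(void)
	{
	        const long n_insns = 1000000;   /* instructions to run */
	        const long stages  = 4;         /* pipeline depth      */

	        long unpipelined = n_insns * stages;
	        long pipelined   = stages + (n_insns - 1);

	        printf("unpipelined: %ld cycles\n", unpipelined);
	        printf("pipelined:   %ld cycles (%.2f insns/cycle)\n",
	               pipelined, (double)n_insns / pipelined);
	        return 0;
	}

Once the pipeline is full, that works out at very nearly one
instruction per cycle.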

To go faster than one instruction per cycle, it is necessary
to execute more than one in parallel. (V)LIW is one technique for
doing this: code is not simply a linear stream of instructions to
be executed in order:

	insn0
	insn1
	insn2
	...

but instead has multiple instructions packed into each instruction
word:

	insn0a	insn0b
	insn1a	insn1b
	...

For example, imagine that each instruction in the first example is
32 bits long (a typical value). Every cycle, the CPU fetches 32 bits,
decodes the instruction, and executes it.

In the second example, every cycle the CPU will fetch 64 bits,
decode both instructions in parallel, and execute both in parallel.
This doubles the peak speed of the CPU, and you don't have to stop
at two. Experimental machines were built with up to (iirc) 16
instructions fetched and executed at once.
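
To make that concrete, here is a sketch in C of how a 2-wide
machine might lay out and dispatch its instruction stream. The
encoding and the issue_int/issue_float stubs are purely my own
illustration; no real ISA (the i860 included) looks exactly like
this:

	#include <stddef.h>
	#include <stdint.h>

	/* Two independent 32-bit operations packed into one 64-bit
	 * instruction word, fetched and dispatched together. */
	typedef struct {
	        uint32_t slot_a;        /* say, an integer operation       */
	        uint32_t slot_b;        /* say, a floating-point operation */
	} long_insn_word;

	/* Hypothetical per-unit dispatch stubs, just to show the
	 * shape of the fetch/decode loop. */
	static void issue_int(uint32_t insn)   { (void)insn; }
	static void issue_float(uint32_t insn) { (void)insn; }

	static void run(const long_insn_word *code, size_t n_words)
	{
	        for (size_t cycle = 0; cycle < n_words; cycle++) {
	                /* one 64-bit fetch per cycle; both slots are
	                 * decoded and executed in parallel */
	                issue_int(code[cycle].slot_a);
	                issue_float(code[cycle].slot_b);
	        }
	}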

Unfortunately, not all code is parallelisable. For example, think
of the instructions to execute the simple assignment:

	tmp = a + b + c

which probably becomes:

	tmp = a + b
	tmp = tmp + c

These two instructions now have to be executed in order, because
the second relies on the result of the first. This sort of thing
is a serious limitation on how much parallelism can be exploited.
The second slot on each long word must now be filled with a
no-op instruction.
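
In bundle form that looks something like the sketch below (again
my own made-up encoding, not any real instruction set): the second
slot of each long word carries a NOP because there is nothing
independent to put in it.

	#include <stdint.h>

	enum op { OP_NOP, OP_ADD };

	typedef struct { enum op op; uint8_t dst, src1, src2; } insn;
	typedef struct { insn slot_a, slot_b; } long_word;

	/* hypothetical register numbering: r0=tmp, r1=a, r2=b, r3=c */
	static const long_word program[] = {
	        /* word 0: tmp = a + b    | nothing independent, so NOP */
	        { { OP_ADD, 0, 1, 2 }, { OP_NOP, 0, 0, 0 } },
	        /* word 1: tmp = tmp + c  | again only a NOP fits       */
	        { { OP_ADD, 0, 0, 3 }, { OP_NOP, 0, 0, 0 } },
	};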

A good compiler can sometimes manage to separate these two
instructions and fill the gaps with useful instructions, but
this doesn't happen very often on typical code, and it depends
heavily on how good the compiler is. What's more, even when the
gaps can't be filled with useful work, the no-ops still take up
memory.
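
As a rough picture of what such a scheduler looks for
(hypothetical C, purely for illustration): if an unrelated
computation happens to be nearby, its instructions can be paired
with the dependent chain to fill the otherwise wasted slots.

	/* The tmp chain is serial, but the "other" chain is completely
	 * independent, so a scheduler could pair them up slot for slot:
	 *
	 *      word 0:  tmp = a + b     | other = x + y
	 *      word 1:  tmp = tmp + c   | other = other + z
	 */
	int scheduled(int a, int b, int c, int x, int y, int z)
	{
	        int tmp   = a + b;
	        int other = x + y;

	        tmp   = tmp + c;
	        other = other + z;

	        return tmp + other;
	}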

> Seriously I have not a cluon as to what you guys are talking about.  Is
> there someone/thing to explain this?  (I'm kind guessing that the i860
> is a separate processor card being used for compiling and/or application
> running in the PC environment?  Maybe?  *gulp* Help? Anyone?)

The i860 was an Intel CPU from the late 1980s. It used LIW and
several other techniques to get very high peak performance,
especially on floating point. Unfortunately, compiled code
typically achieved only around 20-30% of that potential, because
of the problems mentioned above. Code had to be written in
assembler to even come close to maximum performance.

There were some other problems with the design as well, which
meant that context switches between processes in a multitasking
environment were so slow that the chip was effectively unusable
for that kind of work.

Because the chip was no use for running a normal OS or normal
compiled software, its applications were fairly specialised:
mostly signal processing and accelerated graphics, where people
were prepared to write software in assembler to get maximum
performance, and where there weren't multiple processes running.

Hope this helps.

--m




