This is a little story about computer performance from my First Industrial Period in Silicon Valley, 1971-1983.
As the Wikipedia article on the Intel iAPX 432 processor explains, this new architecture, intended to replace the 80286 line altogether, was generally regarded as a dismal failure. The Intel PCs we have today are still lineal descendants of the 80x86 architecture, and in my opinion as a longtime student of instruction sets, it's still a pretty horrible architecture.
Why did it fail? Let me start at the beginning, around 1970 at Hewlett-Packard when the design of the venerable HP 3000 processor line was just getting underway.
One of the spiritual forebears of the HP 3000 was the Burroughs B5000, a 1961 design. Unlike just about every other architecture out there, the B5000 had a stack machine architecture. Several of the HP 3000 team, notably Tom Blease, came out of that shop and were strong proponents of the ease of design that came with a stack architecture.
Most architectures used a flat address space and moved data into and out of a set of fast registers to perform arithmetic and other operations on them. The Burroughs machine performed operations on items at the top of a stack, which the Wikipedia article on the B5000 explains nicely. This way of using memory fits very nicely with Algol-type languages, and in fact most of the system software for the B5000 was written in a dialect very close to Algol 60.
I'll leave to another time the story of how the Omega begat the Alpha which became the HP 3000. It suffices here to say that it was a nice general-purpose 16-bit architecture that lasted in HP's product line for decades.
At the assembly language level, the HP 3000 had no registers except one called X used for indexing arrays. Consider a source statement like C = A + B. The assembly language went something like this:
LOAD a
LOAD b
ADD
STOR c
Very straightforward, standard procedure for a stack machine. Push the two operands. The ADD pops the two words off the top of the stack, adds them, and pushes the result back onto the stack. The STOR instruction pops the top word off the stack and stores it in c.
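The mechanics are simple enough to sketch in a few lines. This is my own toy model of a generic stack machine, not HP 3000 microcode; the memory-cell names and helper functions are invented for illustration:

```python
# A toy stack machine evaluating C = A + B.
# "memory" stands in for named data cells; "stack" is the evaluation stack.
memory = {"A": 2, "B": 3, "C": 0}
stack = []

def load(name):
    """Push the contents of a memory cell onto the stack."""
    stack.append(memory[name])

def add():
    """Pop the two top words, add them, push the result back."""
    b = stack.pop()
    a = stack.pop()
    stack.append(a + b)

def stor(name):
    """Pop the top word off the stack into a memory cell."""
    memory[name] = stack.pop()

# The four-instruction sequence from the example above:
load("A"); load("B"); add(); stor("C")
print(memory["C"])  # → 5
```

Note that the sequence needs no register names at all; the operands' positions on the stack are implicit in the order of the instructions, which is exactly what made code generation for such machines so easy.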
Internally, there was a set of four fast registers that shadowed the top of the memory stack, and were actually connected to the logic that did operations like ADD. At the hardware level, most of the stack was in real memory, but as many as four words of it would actually reside in these registers at any given point.
So, in the above example, let's assume that the shadow registers started out empty. In that case those four instructions would run very quickly, because empty registers were available for the loads.
However, let's look at the case where all four shadow registers were full. The first LOAD would delay things while one of the shadow registers was pushed down to real memory to make room for the value of A. Then the second LOAD would delay again while another shadow register was pushed to memory.
Similarly, if the code started popping lots of values off the stack, that would work quickly until all the shadow registers were popped. At that point the next pop operation would delay while the word was brought up from memory.
If I recall correctly, the operations of emptying and filling shadow registers to and from memory were called QDOWN and QUP.
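The spill-and-refill behavior can be modeled in a few lines. This is my own sketch of the idea, not HP's implementation; the class name and counters are invented, and I'm assuming the simplest policy (spill the oldest register on overflow, refill one word on underflow):

```python
from collections import deque

class ShadowStack:
    """Toy model: a memory stack whose top 4 words live in fast registers."""
    def __init__(self):
        self.regs = deque()        # at most 4 fast shadow registers
        self.mem = []              # the rest of the stack, in real memory
        self.qdown = self.qup = 0  # counts of spills and refills

    def push(self, value):
        if len(self.regs) == 4:    # no free register: QDOWN spills the
            self.mem.append(self.regs.popleft())  # oldest word to memory
            self.qdown += 1
        self.regs.append(value)

    def pop(self):
        if not self.regs:          # registers empty: QUP refills one
            self.regs.append(self.mem.pop())  # word from memory
            self.qup += 1
        return self.regs.pop()

s = ShadowStack()
for v in range(6):                 # six pushes overflow the 4 registers
    s.push(v)
print(s.qdown)                     # → 2: two words spilled to memory
for _ in range(6):                 # popping everything back
    s.pop()
print(s.qup)                       # → 2: both spilled words refilled
```

Run a deep enough sequence of pushes and pops through this model and the counters climb quickly, which is precisely the bottleneck the lab's measurements uncovered.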
It was some time after machines started shipping to customers that the lab found time to do serious performance analysis. When they did, they found that an immense slice of processor time was being spent doing QDOWN and QUP operations, and that it was a real bottleneck.
Most of the designers of the Tandem architecture came straight out of the HP 3000 lab right at the end of 1974. I straggled along four months later but got to watch most of the fun.
One of my bedrock tenets of software architecture, one which I preach in my NM Tech class about the Cleanroom software development methodology, is that when you are designing nontrivial software systems, many small, well-defined modules are better than a few big ones. A general rule of thumb is to be suspicious of any module that doesn't fit on one page.
Consequently, it's important that the machine instructions for calling a function be fairly efficient.
One of the defects of the HP 3000 that the Tandem designers sought to remedy was the cost of the PCAL instruction. It took 50 microseconds on the first-generation HP 3000, and that's a lot when you can do an ADD in a handful of microseconds.
On the first Tandem system, a PCAL took only 5 microseconds. This came at the cost of a bit more code to set up the new procedure's environment, but with suitable compiler optimization, this took less time because it set up only what was needed, while the HP set up a lot of things that in most cases were not used.
To avoid the QUP/QDOWN syndrome of the HP 3000, the Tandem architecture had a classic memory stack, but it had eight real registers, and arithmetic and logic operations were performed there. These registers were also organized as a stack, but they operated much faster than main memory.
One of the standard design problems is called the make/buy decision. Is it cheaper to build something ourselves or to contract it out?
Once the first Tandem system was carving out nice big chunks of its target markets (high-uptime, non-stop applications), it was time to build the second machine. To inform the make/buy decision, we invited a team from Intel in to present their shiny new architecture, the iAPX 432, which was going to replace that nasty 4004-8008-8086-8088-80186-80286 dog's breakfast that came before.
Intel's slide show was most impressive. There were buses everywhere. There were interprocessor buses, and memory buses, and I/O buses, and every place that the buses crossed or connected to anything there was another Intel part number doing the connecting.
I think it was the aforementioned Tom Blease who stood up and interrupted all this glitz with one simple question.
“How long does it take to execute a procedure call?”
The presenter looked it up. “Two hundred and fifty microseconds.”
Tom immediately walked out, followed by the majority of the Tandem software department. The presenter was poleaxed. “What did I say?”
Needless to say, Tandem did not use that chip set for its next processor. The Wikipedia article on the iAPX 432 mentions performance problems. When I read that, I remembered Tom and his heroic question, which saved us all the precious time we would otherwise have spent listening to the Intel marketroid and watching his shiny slide show.